LinuxLists.cc - [RFC PATCH] sched/eevdf: Return leftmost entity in pick

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

Hi Yu

On Mon, Feb 26, 2024 at 4:26 PM Chen Yu <[email protected]> wrote:
>
> There is occasional report from lkp that the kernel hits the NULL pointer
> exception:
>
> [ 512.079810][ T8305] BUG: kernel NULL pointer dereference, address: 0000002c
> [ 512.080897][ T8305] #PF: supervisor read access in kernel mode
> [ 512.081636][ T8305] #PF: error_code(0x0000) - not-present page
> [ 512.082337][ T8305] *pde = 00000000
> [ 512.082829][ T8305] Oops: 0000 [#1] PREEMPT SMP
> [ 512.083407][ T8305] CPU: 1 PID: 8305 Comm: watchdog Tainted: G W
> [ 512.086203][ T8305] EIP: set_next_entity (fair.c:?)
>
> This is caused by NULL candidate returned by pick_eevdf() as Abel analyzed.
> After
> commit 2227a957e1d5 ("sched/eevdf: Sort the rbtree by virtual deadline")
> the NULL candidate would trigger the NULL pointer exception. While before
> this commit, there would be warning.
>
> This NULL entity issue was always there before above commit. With debug
> patch to print the cfs_rq and all the entities in the tree, we have the
> information when the issue was reproduced:
>
> [ 514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 min_vruntime:763383370431
> [ 514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
> min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
> [ 514.536772][ T8390] Traverse rb-tree from left to right
> [ 514.537138][ T8390] se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible
> [ 514.537835][ T8390] se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> [ 514.538539][ T8390] Traverse rb-tree from topdown
> [ 514.538877][ T8390] middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible
> [ 514.539605][ T8390] middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> [ 514.540340][ T8390] Found best:0x0
> [ 514.540613][ T8390] BUG: kernel NULL pointer dereference, address: 00000074
>
> We can see that non of the entities in the tree are eligible, neither is
> the current entity on this cfs_rq. As a result, curr is set to NULL:
> if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> curr = NULL;
>
> and the best is set to NULL, which caused the problem:
> if (!best || (curr && entity_before(curr, best)))
> best = curr;
>
> The cause is that, the curr is eligible, but vruntime_eligible()
> returns false. And the false negative is due to the following
> code in vruntime_eligible():
>
> return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
>
> According to the log, vruntime is 18435852013561943404, the
> cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> thus:
> s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> delta * 2677 = 7733399554989275921
> that is to say, the multiply result overflow the s64, which turns the
> negative value into a positive value, thus eligible check fails.
>
> So where is this insane huge vruntime 18435852013561943404 coming from?
> My guess is that, it is because the initial value of cfs_rq->min_vruntime
> is set to (unsigned long)(-(1LL << 20)). If the task(watchdog in this case)
> seldom scheduled in, its vruntime might not move forward too much and
> remain its original value by previous place_entity().

Could you please test the follow patch:

https://lore.kernel.org/all/[email protected]/

Because the se->vruntime would be changed in place_entity() and
reweight_eevdf(),
however, now the vlag may exceed eevdf's limit in some scenarios, it may cause
the vrunitme's error.

>
> The proper fix should deal with the overflow of entity_key() * load, but
> I don't have much clue on that, so propose this conservative method to
> restore the previous behavior before the mentioned commit.
>
> Fixes: 2227a957e1d5 ("sched/eevdf: Sort the rbtree by virtual deadline")
> Reported-by: kernel test robot <[email protected]>
> Closes: https://lore.kernel.org/lkml/[email protected]/
> Signed-off-by: Chen Yu <[email protected]>
> ---
> kernel/sched/fair.c | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 533547e3c90a..fb9202f464e2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -880,7 +880,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
> struct sched_entity *se = __pick_first_entity(cfs_rq);
> struct sched_entity *curr = cfs_rq->curr;
> - struct sched_entity *best = NULL;
> + struct sched_entity *best = NULL, *leftmost;
>
> /*
> * We can safely skip eligibility check if there is only one entity
> @@ -905,6 +905,8 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> goto found;
> }
>
> + leftmost = se;
> +

Maybe judging which of the deadlines of curr and se is smaller is better?

> /* Heap search for the EEVD entity */
> while (node) {
> struct rb_node *left = node->rb_left;
> @@ -937,6 +939,15 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> if (!best || (curr && entity_before(curr, best)))
> best = curr;
>
> + /*
> + * entity_eligible() could bring false negative due to
> + * multiply overflow, which reports no eligible entity.
> + * Return leftmost entity as a backup(it is guaranteed
> + * the tree is not NULL.
> + */
> + if (!best)
> + best = leftmost;
> +
> return best;
> }
>
> --
> 2.25.1
>
>

BR
--
xuewen

2024-02-28 15:52:21

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

Hi Xuewen,

On 2024-02-28 at 17:04:49 +0800, Xuewen Yan wrote:
> Hi Yu
>
> On Mon, Feb 26, 2024 at 4:26 PM Chen Yu <[email protected]> wrote:
> >
> > There is occasional report from lkp that the kernel hits the NULL pointer
> > exception:
> >
> > [ 512.079810][ T8305] BUG: kernel NULL pointer dereference, address: 0000002c
> > [ 512.080897][ T8305] #PF: supervisor read access in kernel mode
> > [ 512.081636][ T8305] #PF: error_code(0x0000) - not-present page
> > [ 512.082337][ T8305] *pde = 00000000
> > [ 512.082829][ T8305] Oops: 0000 [#1] PREEMPT SMP
> > [ 512.083407][ T8305] CPU: 1 PID: 8305 Comm: watchdog Tainted: G W
> > [ 512.086203][ T8305] EIP: set_next_entity (fair.c:?)
> >
> > This is caused by NULL candidate returned by pick_eevdf() as Abel analyzed.
> > After
> > commit 2227a957e1d5 ("sched/eevdf: Sort the rbtree by virtual deadline")
> > the NULL candidate would trigger the NULL pointer exception. While before
> > this commit, there would be warning.
> >
> > This NULL entity issue was always there before above commit. With debug
> > patch to print the cfs_rq and all the entities in the tree, we have the
> > information when the issue was reproduced:
> >
> > [ 514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 min_vruntime:763383370431
> > [ 514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
> > min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
> > [ 514.536772][ T8390] Traverse rb-tree from left to right
> > [ 514.537138][ T8390] se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible
> > [ 514.537835][ T8390] se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > [ 514.538539][ T8390] Traverse rb-tree from topdown
> > [ 514.538877][ T8390] middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible
> > [ 514.539605][ T8390] middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > [ 514.540340][ T8390] Found best:0x0
> > [ 514.540613][ T8390] BUG: kernel NULL pointer dereference, address: 00000074
> >
> > We can see that non of the entities in the tree are eligible, neither is
> > the current entity on this cfs_rq. As a result, curr is set to NULL:
> > if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> > curr = NULL;
> >
> > and the best is set to NULL, which caused the problem:
> > if (!best || (curr && entity_before(curr, best)))
> > best = curr;
> >
> > The cause is that, the curr is eligible, but vruntime_eligible()
> > returns false. And the false negative is due to the following
> > code in vruntime_eligible():
> >
> > return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
> >
> > According to the log, vruntime is 18435852013561943404, the
> > cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> > thus:
> > s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> > delta * 2677 = 7733399554989275921
> > that is to say, the multiply result overflow the s64, which turns the
> > negative value into a positive value, thus eligible check fails.
> >
> > So where is this insane huge vruntime 18435852013561943404 coming from?
> > My guess is that, it is because the initial value of cfs_rq->min_vruntime
> > is set to (unsigned long)(-(1LL << 20)). If the task(watchdog in this case)
> > seldom scheduled in, its vruntime might not move forward too much and
> > remain its original value by previous place_entity().
>
> Could you please test the follow patch:
>
> https://lore.kernel.org/all/[email protected]/
>
> Because the se->vruntime would be changed in place_entity() and
> reweight_eevdf(),
> however, now the vlag may exceed eevdf's limit in some scenarios, it may cause
> the vrunitme's error.
>

Thanks for providing this fix. I took a quick look at your patch, and it looks
reasonable to me. I'm not quite sure if the vlag issue is the root cause of
this NULL candidate problem, say, if the abnormal vlag happens, it should satisfy:
curr->vlag == curr->deadline ?

According to the data,
curr->deadline =
cfs_rq->min_vruntime = 763383370431
cfs_rq->avg_vruntime = 386638640128
entity_key = -10892823530978643
avg = -6851585614346926319
load = 2677
so avg_vruntime() = 18444185410266494357
curr->vlag = 8333396704550953
which does not equal to curr->deadline = 18435852013562231446

Anyway, if we want to root cause why vruntime is behind cfs_rq->min_vruntime so
much, we can add more debug code to further investigation. And we will also leverage
lkp to test your vlag fix patch set.

> >
> > The proper fix should deal with the overflow of entity_key() * load, but
> > I don't have much clue on that, so propose this conservative method to
> > restore the previous behavior before the mentioned commit.
> >
> > Fixes: 2227a957e1d5 ("sched/eevdf: Sort the rbtree by virtual deadline")
> > Reported-by: kernel test robot <[email protected]>
> > Closes: https://lore.kernel.org/lkml/[email protected]/
> > Signed-off-by: Chen Yu <[email protected]>
> > ---
> > kernel/sched/fair.c | 13 ++++++++++++-
> > 1 file changed, 12 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 533547e3c90a..fb9202f464e2 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -880,7 +880,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> > struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
> > struct sched_entity *se = __pick_first_entity(cfs_rq);
> > struct sched_entity *curr = cfs_rq->curr;
> > - struct sched_entity *best = NULL;
> > + struct sched_entity *best = NULL, *leftmost;
> >
> > /*
> > * We can safely skip eligibility check if there is only one entity
> > @@ -905,6 +905,8 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> > goto found;
> > }
> >
> > + leftmost = se;
> > +
>
> Maybe judging which of the deadlines of curr and se is smaller is better?
>

Yes it is donable. When the NULL pointer exception happens, the deadline of
current is very large(nearly u64 overflow), we might always choose se
if we compare them.

thanks,
Chenyu

2024-02-29 09:04:56

by Abel Wu

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

Hi Chen, thanks for detailed analysis.

The title of this patch sounds a little weird to me, since any
non-empty cfs_rq should have at least one eligible entity. Besides,
choosing the leftmost entity which could be non-eligible can be
sub-optimal, anyway this is only a workaround..

On 2/26/24 4:23 PM, Chen Yu Wrote:
> There is occasional report from lkp that the kernel hits the NULL pointer
> exception:
>
> [ 512.079810][ T8305] BUG: kernel NULL pointer dereference, address: 0000002c
> [ 512.080897][ T8305] #PF: supervisor read access in kernel mode
> [ 512.081636][ T8305] #PF: error_code(0x0000) - not-present page
> [ 512.082337][ T8305] *pde = 00000000
> [ 512.082829][ T8305] Oops: 0000 [#1] PREEMPT SMP
> [ 512.083407][ T8305] CPU: 1 PID: 8305 Comm: watchdog Tainted: G W
> [ 512.086203][ T8305] EIP: set_next_entity (fair.c:?)
>
> This is caused by NULL candidate returned by pick_eevdf() as Abel analyzed.
> After
> commit 2227a957e1d5 ("sched/eevdf: Sort the rbtree by virtual deadline")
> the NULL candidate would trigger the NULL pointer exception. While before
> this commit, there would be warning.
>
> This NULL entity issue was always there before above commit. With debug
> patch to print the cfs_rq and all the entities in the tree, we have the
> information when the issue was reproduced:
>
> [ 514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 min_vruntime:763383370431
> [ 514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
> min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
> [ 514.536772][ T8390] Traverse rb-tree from left to right
> [ 514.537138][ T8390] se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible
> [ 514.537835][ T8390] se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> [ 514.538539][ T8390] Traverse rb-tree from topdown
> [ 514.538877][ T8390] middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible
> [ 514.539605][ T8390] middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> [ 514.540340][ T8390] Found best:0x0
> [ 514.540613][ T8390] BUG: kernel NULL pointer dereference, address: 00000074
>
> We can see that non of the entities in the tree are eligible, neither is
> the current entity on this cfs_rq. As a result, curr is set to NULL:
> if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> curr = NULL;
>
> and the best is set to NULL, which caused the problem:
> if (!best || (curr && entity_before(curr, best)))
> best = curr;
>
> The cause is that, the curr is eligible, but vruntime_eligible()
> returns false. And the false negative is due to the following
> code in vruntime_eligible():
>
> return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
>
> According to the log, vruntime is 18435852013561943404, the
> cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> thus:
> s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> delta * 2677 = 7733399554989275921
> that is to say, the multiply result overflow the s64, which turns the
> negative value into a positive value, thus eligible check fails.

Indeed.

>
> So where is this insane huge vruntime 18435852013561943404 coming from?
> My guess is that, it is because the initial value of cfs_rq->min_vruntime
> is set to (unsigned long)(-(1LL << 20)). If the task(watchdog in this case)
> seldom scheduled in, its vruntime might not move forward too much and
> remain its original value by previous place_entity().

So why not just initialize to 0? The (unsigned long)(-(1LL << 20))
thing is dangerous as it can easily blow up lots of calculations in
lag, key, avg_vruntime and so on.

Say during this pre-life, which is about 1ms for 1024-weight entity,
there is only one entity running in this cfs_rq. Now another entity
with funny lag joins in, being placed somewhere at 0+ vruntime, so
cfs_rq->min_vruntime needs to be adjusted accordingly which leads to
the breakage of cfs_rq->curr's key as you showed above.

>
> The proper fix should deal with the overflow of entity_key() * load, but
> I don't have much clue on that, so propose this conservative method to
> restore the previous behavior before the mentioned commit.

Inspired by Xuewen's proposal, will it work if limit the key?

Thanks & BR,
Abel

>
> Fixes: 2227a957e1d5 ("sched/eevdf: Sort the rbtree by virtual deadline")
> Reported-by: kernel test robot <[email protected]>
> Closes: https://lore.kernel.org/lkml/[email protected]/
> Signed-off-by: Chen Yu <[email protected]>
> ---
> kernel/sched/fair.c | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 533547e3c90a..fb9202f464e2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -880,7 +880,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
> struct sched_entity *se = __pick_first_entity(cfs_rq);
> struct sched_entity *curr = cfs_rq->curr;
> - struct sched_entity *best = NULL;
> + struct sched_entity *best = NULL, *leftmost;
>
> /*
> * We can safely skip eligibility check if there is only one entity
> @@ -905,6 +905,8 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> goto found;
> }
>
> + leftmost = se;
> +
> /* Heap search for the EEVD entity */
> while (node) {
> struct rb_node *left = node->rb_left;
> @@ -937,6 +939,15 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> if (!best || (curr && entity_before(curr, best)))
> best = curr;
>
> + /*
> + * entity_eligible() could bring false negative due to
> + * multiply overflow, which reports no eligible entity.
> + * Return leftmost entity as a backup(it is guaranteed
> + * the tree is not NULL.
> + */
> + if (!best)
> + best = leftmost;
> +
> return best;
> }
>

2024-02-29 12:10:58

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

Hi Yu

On Wed, Feb 28, 2024 at 11:25 PM Chen Yu <[email protected]> wrote:
>
> Hi Xuewen,
>
> On 2024-02-28 at 17:04:49 +0800, Xuewen Yan wrote:
> > Hi Yu
> >
> > On Mon, Feb 26, 2024 at 4:26 PM Chen Yu <[email protected]> wrote:
> > >
> > > There is occasional report from lkp that the kernel hits the NULL pointer
> > > exception:
> > >
> > > [ 512.079810][ T8305] BUG: kernel NULL pointer dereference, address: 0000002c
> > > [ 512.080897][ T8305] #PF: supervisor read access in kernel mode
> > > [ 512.081636][ T8305] #PF: error_code(0x0000) - not-present page
> > > [ 512.082337][ T8305] *pde = 00000000
> > > [ 512.082829][ T8305] Oops: 0000 [#1] PREEMPT SMP
> > > [ 512.083407][ T8305] CPU: 1 PID: 8305 Comm: watchdog Tainted: G W
> > > [ 512.086203][ T8305] EIP: set_next_entity (fair.c:?)
> > >
> > > This is caused by NULL candidate returned by pick_eevdf() as Abel analyzed.
> > > After
> > > commit 2227a957e1d5 ("sched/eevdf: Sort the rbtree by virtual deadline")
> > > the NULL candidate would trigger the NULL pointer exception. While before
> > > this commit, there would be warning.
> > >
> > > This NULL entity issue was always there before above commit. With debug
> > > patch to print the cfs_rq and all the entities in the tree, we have the
> > > information when the issue was reproduced:
> > >
> > > [ 514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 min_vruntime:763383370431
> > > [ 514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
> > > min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
> > > [ 514.536772][ T8390] Traverse rb-tree from left to right
> > > [ 514.537138][ T8390] se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible
> > > [ 514.537835][ T8390] se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > > [ 514.538539][ T8390] Traverse rb-tree from topdown
> > > [ 514.538877][ T8390] middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible
> > > [ 514.539605][ T8390] middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > > [ 514.540340][ T8390] Found best:0x0
> > > [ 514.540613][ T8390] BUG: kernel NULL pointer dereference, address: 00000074
> > >
> > > We can see that non of the entities in the tree are eligible, neither is
> > > the current entity on this cfs_rq. As a result, curr is set to NULL:
> > > if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> > > curr = NULL;
> > >
> > > and the best is set to NULL, which caused the problem:
> > > if (!best || (curr && entity_before(curr, best)))
> > > best = curr;
> > >
> > > The cause is that, the curr is eligible, but vruntime_eligible()
> > > returns false. And the false negative is due to the following
> > > code in vruntime_eligible():
> > >
> > > return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
> > >
> > > According to the log, vruntime is 18435852013561943404, the
> > > cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> > > thus:
> > > s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> > > delta * 2677 = 7733399554989275921
> > > that is to say, the multiply result overflow the s64, which turns the
> > > negative value into a positive value, thus eligible check fails.
> > >
> > > So where is this insane huge vruntime 18435852013561943404 coming from?
> > > My guess is that, it is because the initial value of cfs_rq->min_vruntime
> > > is set to (unsigned long)(-(1LL << 20)). If the task(watchdog in this case)
> > > seldom scheduled in, its vruntime might not move forward too much and
> > > remain its original value by previous place_entity().
> >
> > Could you please test the follow patch:
> >
> > https://lore.kernel.org/all/[email protected]/
> >
> > Because the se->vruntime would be changed in place_entity() and
> > reweight_eevdf(),
> > however, now the vlag may exceed eevdf's limit in some scenarios, it may cause
> > the vrunitme's error.
> >
>
> Thanks for providing this fix. I took a quick look at your patch, and it looks
> reasonable to me. I'm not quite sure if the vlag issue is the root cause of
> this NULL candidate problem, say, if the abnormal vlag happens, it should satisfy:
> curr->vlag == curr->deadline ?

Actually, this is just a situation, because in set_next_entity(), it
would set the valg be equal to deadline, and if using the vlag to
update the vruntime, the vruntime is most likely wrong.
However, even if vlag is normal, after the calc: ((W + w_i)*vl'_i / W)
or (w/w'), it may also be an out-of-bounds value.
For example: if change the nice from -20(weight:88761) to
19(weight:15), its vlag would be 88761/15=5917*vlag.

>
> According to the data,
> curr->deadline =
> cfs_rq->min_vruntime = 763383370431
> cfs_rq->avg_vruntime = 386638640128
> entity_key = -10892823530978643
> avg = -6851585614346926319
> load = 2677
> so avg_vruntime() = 18444185410266494357
> curr->vlag = 8333396704550953
> which does not equal to curr->deadline = 18435852013562231446
>
> Anyway, if we want to root cause why vruntime is behind cfs_rq->min_vruntime so
> much, we can add more debug code to further investigation. And we will also leverage
> lkp to test your vlag fix patch set.

Yes, maybe you can see more information by adding log in place_entity().

BR

>
> > >
> > > The proper fix should deal with the overflow of entity_key() * load, but
> > > I don't have much clue on that, so propose this conservative method to
> > > restore the previous behavior before the mentioned commit.
> > >
> > > Fixes: 2227a957e1d5 ("sched/eevdf: Sort the rbtree by virtual deadline")
> > > Reported-by: kernel test robot <[email protected]>
> > > Closes: https://lore.kernel.org/lkml/[email protected]/
> > > Signed-off-by: Chen Yu <[email protected]>
> > > ---
> > > kernel/sched/fair.c | 13 ++++++++++++-
> > > 1 file changed, 12 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 533547e3c90a..fb9202f464e2 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -880,7 +880,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> > > struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
> > > struct sched_entity *se = __pick_first_entity(cfs_rq);
> > > struct sched_entity *curr = cfs_rq->curr;
> > > - struct sched_entity *best = NULL;
> > > + struct sched_entity *best = NULL, *leftmost;
> > >
> > > /*
> > > * We can safely skip eligibility check if there is only one entity
> > > @@ -905,6 +905,8 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> > > goto found;
> > > }
> > >
> > > + leftmost = se;
> > > +
> >
> > Maybe judging which of the deadlines of curr and se is smaller is better?
> >
>
> Yes it is donable. When the NULL pointer exception happens, the deadline of
> current is very large(nearly u64 overflow), we might always choose se
> if we compare them.
>
> thanks,
> Chenyu

--
xuewen

2024-03-01 06:47:11

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On 2024-02-29 at 20:10:32 +0800, Xuewen Yan wrote:
> Hi Yu
>
> On Wed, Feb 28, 2024 at 11:25 PM Chen Yu <[email protected]> wrote:
> >
> > Hi Xuewen,
> >
> > On 2024-02-28 at 17:04:49 +0800, Xuewen Yan wrote:
> > > Hi Yu
> > >
> > > On Mon, Feb 26, 2024 at 4:26 PM Chen Yu <[email protected]> wrote:
> > > >
> > > > There is occasional report from lkp that the kernel hits the NULL pointer
> > > > exception:
> > > >
> > > > [ 512.079810][ T8305] BUG: kernel NULL pointer dereference, address: 0000002c
> > > > [ 512.080897][ T8305] #PF: supervisor read access in kernel mode
> > > > [ 512.081636][ T8305] #PF: error_code(0x0000) - not-present page
> > > > [ 512.082337][ T8305] *pde = 00000000
> > > > [ 512.082829][ T8305] Oops: 0000 [#1] PREEMPT SMP
> > > > [ 512.083407][ T8305] CPU: 1 PID: 8305 Comm: watchdog Tainted: G W
> > > > [ 512.086203][ T8305] EIP: set_next_entity (fair.c:?)
> > > >
> > > > This is caused by NULL candidate returned by pick_eevdf() as Abel analyzed.
> > > > After
> > > > commit 2227a957e1d5 ("sched/eevdf: Sort the rbtree by virtual deadline")
> > > > the NULL candidate would trigger the NULL pointer exception. While before
> > > > this commit, there would be warning.
> > > >
> > > > This NULL entity issue was always there before above commit. With debug
> > > > patch to print the cfs_rq and all the entities in the tree, we have the
> > > > information when the issue was reproduced:
> > > >
> > > > [ 514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 min_vruntime:763383370431
> > > > [ 514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
> > > > min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
> > > > [ 514.536772][ T8390] Traverse rb-tree from left to right
> > > > [ 514.537138][ T8390] se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible
> > > > [ 514.537835][ T8390] se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > > > [ 514.538539][ T8390] Traverse rb-tree from topdown
> > > > [ 514.538877][ T8390] middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible
> > > > [ 514.539605][ T8390] middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > > > [ 514.540340][ T8390] Found best:0x0
> > > > [ 514.540613][ T8390] BUG: kernel NULL pointer dereference, address: 00000074
> > > >
> > > > We can see that non of the entities in the tree are eligible, neither is
> > > > the current entity on this cfs_rq. As a result, curr is set to NULL:
> > > > if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> > > > curr = NULL;
> > > >
> > > > and the best is set to NULL, which caused the problem:
> > > > if (!best || (curr && entity_before(curr, best)))
> > > > best = curr;
> > > >
> > > > The cause is that, the curr is eligible, but vruntime_eligible()
> > > > returns false. And the false negative is due to the following
> > > > code in vruntime_eligible():
> > > >
> > > > return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
> > > >
> > > > According to the log, vruntime is 18435852013561943404, the
> > > > cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> > > > thus:
> > > > s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> > > > delta * 2677 = 7733399554989275921
> > > > that is to say, the multiply result overflow the s64, which turns the
> > > > negative value into a positive value, thus eligible check fails.
> > > >
> > > > So where is this insane huge vruntime 18435852013561943404 coming from?
> > > > My guess is that, it is because the initial value of cfs_rq->min_vruntime
> > > > is set to (unsigned long)(-(1LL << 20)). If the task(watchdog in this case)
> > > > seldom scheduled in, its vruntime might not move forward too much and
> > > > remain its original value by previous place_entity().
> > >
> > > Could you please test the follow patch:
> > >
> > > https://lore.kernel.org/all/[email protected]/
> > >
> > > Because the se->vruntime would be changed in place_entity() and
> > > reweight_eevdf(),
> > > however, now the vlag may exceed eevdf's limit in some scenarios, it may cause
> > > the vrunitme's error.
> > >
> >
> > Thanks for providing this fix. I took a quick look at your patch, and it looks
> > reasonable to me. I'm not quite sure if the vlag issue is the root cause of
> > this NULL candidate problem, say, if the abnormal vlag happens, it should satisfy:
> > curr->vlag == curr->deadline ?
>
> Actually, this is just a situation, because in set_next_entity(), it
> would set the valg be equal to deadline, and if using the vlag to
> update the vruntime, the vruntime is most likely wrong.

Yes, the child with wrong vlag might bring potential risk after child been place_entity().

> However, even if vlag is normal, after the calc: ((W + w_i)*vl'_i / W)
> or (w/w'), it may also be an out-of-bounds value.
> For example: if change the nice from -20(weight:88761) to
> 19(weight:15), its vlag would be 88761/15=5917*vlag.
>

Yes, this might exceed the limit range of vlag,

> >
> > According to the data,
> > curr->deadline =
> > cfs_rq->min_vruntime = 763383370431
> > cfs_rq->avg_vruntime = 386638640128
> > entity_key = -10892823530978643
> > avg = -6851585614346926319
> > load = 2677
> > so avg_vruntime() = 18444185410266494357
> > curr->vlag = 8333396704550953
> > which does not equal to curr->deadline = 18435852013562231446
> >
> > Anyway, if we want to root cause why vruntime is behind cfs_rq->min_vruntime so
> > much, we can add more debug code to further investigation. And we will also leverage
> > lkp to test your vlag fix patch set.
>
> Yes, maybe you can see more information by adding log in place_entity().
>

Sure, let me think about that.

thanks,
Chenyu

2024-03-01 07:08:08

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On 2024-02-29 at 17:00:18 +0800, Abel Wu wrote:
> Hi Chen, thanks for detailed analysis.
>
> The title of this patch sounds a little weird to me, since any
> non-empty cfs_rq should have at least one eligible entity.

My understanding is that, current + cfs_rq's tree has at least one
eligible entity, as the current running entity is not on the tree.
In vruntime_eligible() check, we consider all the entities on the
tree, plus the current running one. So, it could be possible that
all the entities on the tree are not eligible, while the current is
eligible, no?

> Besides, choosing the leftmost entity which could be non-eligible can be
> sub-optimal, anyway this is only a workaround..
>

Yes, it is a workaround to show the possible scenario to bring NULL exception,
and it was not root caused yet.

> On 2/26/24 4:23 PM, Chen Yu Wrote:
> > There is occasional report from lkp that the kernel hits the NULL pointer
> > exception:
> >
> > [ 512.079810][ T8305] BUG: kernel NULL pointer dereference, address: 0000002c
> > [ 512.080897][ T8305] #PF: supervisor read access in kernel mode
> > [ 512.081636][ T8305] #PF: error_code(0x0000) - not-present page
> > [ 512.082337][ T8305] *pde = 00000000
> > [ 512.082829][ T8305] Oops: 0000 [#1] PREEMPT SMP
> > [ 512.083407][ T8305] CPU: 1 PID: 8305 Comm: watchdog Tainted: G W
> > [ 512.086203][ T8305] EIP: set_next_entity (fair.c:?)
> >
> > This is caused by NULL candidate returned by pick_eevdf() as Abel analyzed.
> > After
> > commit 2227a957e1d5 ("sched/eevdf: Sort the rbtree by virtual deadline")
> > the NULL candidate would trigger the NULL pointer exception. While before
> > this commit, there would be warning.
> >
> > This NULL entity issue was always there before above commit. With debug
> > patch to print the cfs_rq and all the entities in the tree, we have the
> > information when the issue was reproduced:
> >
> > [ 514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 min_vruntime:763383370431
> > [ 514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
> > min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
> > [ 514.536772][ T8390] Traverse rb-tree from left to right
> > [ 514.537138][ T8390] se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible
> > [ 514.537835][ T8390] se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > [ 514.538539][ T8390] Traverse rb-tree from topdown
> > [ 514.538877][ T8390] middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible
> > [ 514.539605][ T8390] middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > [ 514.540340][ T8390] Found best:0x0
> > [ 514.540613][ T8390] BUG: kernel NULL pointer dereference, address: 00000074
> >
> > We can see that non of the entities in the tree are eligible, neither is
> > the current entity on this cfs_rq. As a result, curr is set to NULL:
> > if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> > curr = NULL;
> >
> > and the best is set to NULL, which caused the problem:
> > if (!best || (curr && entity_before(curr, best)))
> > best = curr;
> >
> > The cause is that, the curr is eligible, but vruntime_eligible()
> > returns false. And the false negative is due to the following
> > code in vruntime_eligible():
> >
> > return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
> >
> > According to the log, vruntime is 18435852013561943404, the
> > cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> > thus:
> > s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> > delta * 2677 = 7733399554989275921
> > that is to say, the multiply result overflow the s64, which turns the
> > negative value into a positive value, thus eligible check fails.
>
> Indeed.
>
> >
> > So where is this insane huge vruntime 18435852013561943404 coming from?
> > My guess is that, it is because the initial value of cfs_rq->min_vruntime
> > is set to (unsigned long)(-(1LL << 20)). If the task(watchdog in this case)
> > seldom scheduled in, its vruntime might not move forward too much and
> > remain its original value by previous place_entity().
>
> So why not just initialize to 0? The (unsigned long)(-(1LL << 20))
> thing is dangerous as it can easily blow up lots of calculations in
> lag, key, avg_vruntime and so on.
>

Usually the min variable is initialized to a very large value, so later
the real min can overwrite it easily. The 1LL << 20 was introduced decade ago
in 67e9fb2a39a1, and I don't know why it is this value.

> Say during this pre-life, which is about 1ms for 1024-weight entity,
> there is only one entity running in this cfs_rq. Now another entity
> with funny lag joins in, being placed somewhere at 0+ vruntime, so
> cfs_rq->min_vruntime needs to be adjusted accordingly which leads to
> the breakage of cfs_rq->curr's key as you showed above.
>

Yes, this is possible.

> >
> > The proper fix should deal with the overflow of entity_key() * load, but
> > I don't have much clue on that, so propose this conservative method to
> > restore the previous behavior before the mentioned commit.
>
> Inspired by Xuewen's proposal, will it work if limit the key?
>

Yes we can, let me think about some debug code to demonstrate the scenario
you described above.

thanks,
Chenyu

2024-03-01 08:55:22

by Abel Wu

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On 3/1/24 3:07 PM, Chen Yu Wrote:
> On 2024-02-29 at 17:00:18 +0800, Abel Wu wrote:
>> Hi Chen, thanks for detailed analysis.
>>
>> The title of this patch sounds a little weird to me, since any
>> non-empty cfs_rq should have at least one eligible entity.
>
> My understanding is that, current + cfs_rq's tree has at least one
> eligible entity, as the current running entity is not on the tree.
> In vruntime_eligible() check, we consider all the entities on the
> tree, plus the current running one. So, it could be possible that
> all the entities on the tree are not eligible, while the current is
> eligible, no?

Yes, and I didn't mean only rb-tree when I said "non-empty cfs_rq" :)

>
>> Besides, choosing the leftmost entity which could be non-eligible can be
>> sub-optimal, anyway this is only a workaround..
>>
>
> Yes, it is a workaround to show the possible scenario to bring NULL exception,
> and it was not root caused yet.
>
>> On 2/26/24 4:23 PM, Chen Yu Wrote:
>>> There is occasional report from lkp that the kernel hits the NULL pointer
>>> exception:
>>>
>>> [ 512.079810][ T8305] BUG: kernel NULL pointer dereference, address: 0000002c
>>> [ 512.080897][ T8305] #PF: supervisor read access in kernel mode
>>> [ 512.081636][ T8305] #PF: error_code(0x0000) - not-present page
>>> [ 512.082337][ T8305] *pde = 00000000
>>> [ 512.082829][ T8305] Oops: 0000 [#1] PREEMPT SMP
>>> [ 512.083407][ T8305] CPU: 1 PID: 8305 Comm: watchdog Tainted: G W
>>> [ 512.086203][ T8305] EIP: set_next_entity (fair.c:?)
>>>
>>> This is caused by NULL candidate returned by pick_eevdf() as Abel analyzed.
>>> After
>>> commit 2227a957e1d5 ("sched/eevdf: Sort the rbtree by virtual deadline")
>>> the NULL candidate would trigger the NULL pointer exception. While before
>>> this commit, there would be warning.
>>>
>>> This NULL entity issue was always there before above commit. With debug
>>> patch to print the cfs_rq and all the entities in the tree, we have the
>>> information when the issue was reproduced:
>>>
>>> [ 514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 min_vruntime:763383370431
>>> [ 514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
>>> min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
>>> [ 514.536772][ T8390] Traverse rb-tree from left to right
>>> [ 514.537138][ T8390] se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible
>>> [ 514.537835][ T8390] se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
>>> [ 514.538539][ T8390] Traverse rb-tree from topdown
>>> [ 514.538877][ T8390] middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible
>>> [ 514.539605][ T8390] middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
>>> [ 514.540340][ T8390] Found best:0x0
>>> [ 514.540613][ T8390] BUG: kernel NULL pointer dereference, address: 00000074
>>>
>>> We can see that non of the entities in the tree are eligible, neither is
>>> the current entity on this cfs_rq. As a result, curr is set to NULL:
>>> if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>>> curr = NULL;
>>>
>>> and the best is set to NULL, which caused the problem:
>>> if (!best || (curr && entity_before(curr, best)))
>>> best = curr;
>>>
>>> The cause is that, the curr is eligible, but vruntime_eligible()
>>> returns false. And the false negative is due to the following
>>> code in vruntime_eligible():
>>>
>>> return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
>>>
>>> According to the log, vruntime is 18435852013561943404, the
>>> cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
>>> thus:
>>> s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
>>> delta * 2677 = 7733399554989275921
>>> that is to say, the multiply result overflow the s64, which turns the
>>> negative value into a positive value, thus eligible check fails.
>>
>> Indeed.
>>
>>>
>>> So where is this insane huge vruntime 18435852013561943404 coming from?
>>> My guess is that, it is because the initial value of cfs_rq->min_vruntime
>>> is set to (unsigned long)(-(1LL << 20)). If the task(watchdog in this case)
>>> seldom scheduled in, its vruntime might not move forward too much and
>>> remain its original value by previous place_entity().
>>
>> So why not just initialize to 0? The (unsigned long)(-(1LL << 20))
>> thing is dangerous as it can easily blow up lots of calculations in
>> lag, key, avg_vruntime and so on.
>>
>
> Usually the min variable is initialized to a very large value, so later
> the real min can overwrite it easily. The 1LL << 20 was introduced decade ago
> in 67e9fb2a39a1, and I don't know why it is this value.

Yes, as the min_vruntime keeps moving forward given the way
update_min_vruntime() does, I don't think it would be a problem.

>
>
>> Say during this pre-life, which is about 1ms for 1024-weight entity,
>> there is only one entity running in this cfs_rq. Now another entity
>> with funny lag joins in, being placed somewhere at 0+ vruntime, so
>> cfs_rq->min_vruntime needs to be adjusted accordingly which leads to
>> the breakage of cfs_rq->curr's key as you showed above.
>>
>
> Yes, this is possible.
>
>>>
>>> The proper fix should deal with the overflow of entity_key() * load, but
>>> I don't have much clue on that, so propose this conservative method to
>>> restore the previous behavior before the mentioned commit.
>>
>> Inspired by Xuewen's proposal, will it work if limit the key?

After a second thought, this doesn't solve the insane huge vruntime
itself. Since the immediate problem is that 'huge' number actually is
a small one, is it possible to make vruntime s64 at definition?

2024-04-08 11:58:56

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On Thu, Feb 29, 2024 at 05:00:18PM +0800, Abel Wu wrote:

> > According to the log, vruntime is 18435852013561943404, the
> > cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> > thus:
> > s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> > delta * 2677 = 7733399554989275921
> > that is to say, the multiply result overflow the s64, which turns the
> > negative value into a positive value, thus eligible check fails.
>
> Indeed.

From the data presented it looks like min_vruntime is wrong and needs
update. If you can readily reproduce this, dump the vruntime of all
tasks on the runqueue and see if min_vruntime is indeed correct.

> > So where is this insane huge vruntime 18435852013561943404 coming from?
> > My guess is that, it is because the initial value of cfs_rq->min_vruntime
> > is set to (unsigned long)(-(1LL << 20)). If the task(watchdog in this case)
> > seldom scheduled in, its vruntime might not move forward too much and
> > remain its original value by previous place_entity().
>
> So why not just initialize to 0? The (unsigned long)(-(1LL << 20))
> thing is dangerous as it can easily blow up lots of calculations in
> lag, key, avg_vruntime and so on.

The reason is to ensure the wrap-around logic works -- which it must,
because with the weighting thing, the vruntime can wrap quite quickly,
something like one day IIRC (20 bit for precision etc.)

Better to have the wrap around happen quickly after boot and have
everybody suffer, rather than have it be special and hard to reproduce.

2024-04-08 12:01:20

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On Fri, Mar 01, 2024 at 04:42:43PM +0800, Abel Wu wrote:

> After a second thought, this doesn't solve the insane huge vruntime
> itself. Since the immediate problem is that 'huge' number actually is
> a small one, is it possible to make vruntime s64 at definition?

In part that was never done because wrapping of signed types in C is UB
and UBSAN used to be buggy. But also, given how it all works, it mostly
doesn't matter.

2024-04-08 13:16:57

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On 2024-04-08 at 13:58:33 +0200, Peter Zijlstra wrote:
> On Thu, Feb 29, 2024 at 05:00:18PM +0800, Abel Wu wrote:
>
> > > According to the log, vruntime is 18435852013561943404, the
> > > cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> > > thus:
> > > s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> > > delta * 2677 = 7733399554989275921
> > > that is to say, the multiply result overflow the s64, which turns the
> > > negative value into a positive value, thus eligible check fails.
> >
> > Indeed.
>
> From the data presented it looks like min_vruntime is wrong and needs
> update. If you can readily reproduce this, dump the vruntime of all
> tasks on the runqueue and see if min_vruntime is indeed correct.
>

This was the dump of all the entities on the tree, from left to right,
and also from top down in middle order traverse, when this issue happens:

[ 514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 cfs_rq->min_vruntime:763383370431
[ 514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629

[ 514.536772][ T8390] Traverse rb-tree from left to right
[ 514.537138][ T8390] se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- leftmost se
[ 514.537835][ T8390] se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible

[ 514.538539][ T8390] Traverse rb-tree from topdown
[ 514.538877][ T8390] middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- root se
[ 514.539605][ T8390] middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible

The tree looks like:

se (0xec1234e0)
|
|
----> se (0xec4fcf20)

The root se 0xec1234e0 is also the leftmost se, its min_vruntime and vruntime are both 763383370431,
which is aligned with cfs_rq->min_vruntime. It seems that the cfs_rq's min_vruntime gets updated correctly,
because it is monotonic increasing.

My guess is that, for some reason, one newly forked se in a newly created task group, in the rb-tree has not
been picked for a long time(maybe not eligible). Its vruntime stopped at the negative value(near (unsigned long)(-(1LL << 20))
for a long time, its vruntime is long behind the cfs_rq->vruntime, thus the overflow happens.

thanks,
Chenyu

> > > So where is this insane huge vruntime 18435852013561943404 coming from?
> > > My guess is that, it is because the initial value of cfs_rq->min_vruntime
> > > is set to (unsigned long)(-(1LL << 20)). If the task(watchdog in this case)
> > > seldom scheduled in, its vruntime might not move forward too much and
> > > remain its original value by previous place_entity().
> >
> > So why not just initialize to 0? The (unsigned long)(-(1LL << 20))
> > thing is dangerous as it can easily blow up lots of calculations in
> > lag, key, avg_vruntime and so on.
>
> The reason is to ensure the wrap-around logic works -- which it must,
> because with the weighting thing, the vruntime can wrap quite quickly,
> something like one day IIRC (20 bit for precision etc.)
>
> Better to have the wrap around happen quickly after boot and have
> everybody suffer, rather than have it be special and hard to reproduce.

2024-04-09 09:23:20

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On Mon, Apr 08, 2024 at 09:11:39PM +0800, Chen Yu wrote:
> On 2024-04-08 at 13:58:33 +0200, Peter Zijlstra wrote:
> > On Thu, Feb 29, 2024 at 05:00:18PM +0800, Abel Wu wrote:
> >
> > > > According to the log, vruntime is 18435852013561943404, the
> > > > cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> > > > thus:
> > > > s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> > > > delta * 2677 = 7733399554989275921
> > > > that is to say, the multiply result overflow the s64, which turns the
> > > > negative value into a positive value, thus eligible check fails.
> > >
> > > Indeed.
> >
> > From the data presented it looks like min_vruntime is wrong and needs
> > update. If you can readily reproduce this, dump the vruntime of all
> > tasks on the runqueue and see if min_vruntime is indeed correct.
> >
>
> This was the dump of all the entities on the tree, from left to right,

Oh, my bad, I thought it was the pick path.

> and also from top down in middle order traverse, when this issue happens:
>
> [ 514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 cfs_rq->min_vruntime:763383370431
> [ 514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
> min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
>
>
> [ 514.536772][ T8390] Traverse rb-tree from left to right
> [ 514.537138][ T8390] se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- leftmost se
> [ 514.537835][ T8390] se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
>
> [ 514.538539][ T8390] Traverse rb-tree from topdown
> [ 514.538877][ T8390] middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- root se
> [ 514.539605][ T8390] middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
>
> The tree looks like:
>
> se (0xec1234e0)
> |
> |
> ----> se (0xec4fcf20)
>
>
> The root se 0xec1234e0 is also the leftmost se, its min_vruntime and
> vruntime are both 763383370431, which is aligned with
> cfs_rq->min_vruntime. It seems that the cfs_rq's min_vruntime gets
> updated correctly, because it is monotonic increasing.

Right.

> My guess is that, for some reason, one newly forked se in a newly
> created task group, in the rb-tree has not been picked for a long
> time(maybe not eligible). Its vruntime stopped at the negative
> value(near (unsigned long)(-(1LL << 20)) for a long time, its vruntime
> is long behind the cfs_rq->vruntime, thus the overflow happens.

I'll have to do the math again, but that's something in the order of not
picking a task in about a day, that would be 'bad' :-)

Is there any sane way to reproduce this, and how often does it happen?

2024-04-15 07:23:48

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On Tue, Apr 09, 2024 at 11:21:04AM +0200, Peter Zijlstra wrote:

> Is there any sane way to reproduce this, and how often does it happen?

This, how do you all make it go bang?

2024-04-15 08:03:35

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On 2024-04-15 at 09:22:51 +0200, Peter Zijlstra wrote:
> On Tue, Apr 09, 2024 at 11:21:04AM +0200, Peter Zijlstra wrote:
>
> > Is there any sane way to reproduce this, and how often does it happen?
>
> This, how do you all make it go bang?

It was reproduced in lkp's environment, and originally reported here:
https://lore.kernel.org/lkml/[email protected]/

It is a trinity test on a vm guest, and seems like be triggered after some futex
test. And it was reproduced at a rate of 23/999 according to that report.
Previously I could not reproduce it locally, so lkp has helped test my debug
patch in their environment and got the clue that it was broken by s64 overflow.

Breno told me that he has reproduce this issue with KASAN on and using:
'stress-ng -a 20', but I can not reproduce it locally neither.

I'm thinking of creating a debug patch to trace all the changes related to
cfs_rq->avg_vruntime in avg_vruntime_add()/sub().
To see how cfs_rq->avg_vruntime gets far behind the cfs_rq->min_vruntime, which
caused the overflow. My understanding is that, the se's vruntime is got from
cfs_rq->avg_vruntime in place_entity(), if the se's vruntime gets an extrem
smaller value than cfs_rq->min_vruntime, then it might indicate that there could
be something wrong with the update of cfs_rq->avg_vruntime.
Then lkp could help us to further debug.

Or do you have any suggestion/suspect that how to narrow down this, I could
try as you suggest.

thanks,
Chenyu

2024-04-17 18:35:20

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On 2024-04-09 at 11:21:04 +0200, Peter Zijlstra wrote:
> On Mon, Apr 08, 2024 at 09:11:39PM +0800, Chen Yu wrote:
> > On 2024-04-08 at 13:58:33 +0200, Peter Zijlstra wrote:
> > > On Thu, Feb 29, 2024 at 05:00:18PM +0800, Abel Wu wrote:
> > >
> > > > > According to the log, vruntime is 18435852013561943404, the
> > > > > cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> > > > > thus:
> > > > > s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> > > > > delta * 2677 = 7733399554989275921
> > > > > that is to say, the multiply result overflow the s64, which turns the
> > > > > negative value into a positive value, thus eligible check fails.
> > > >
> > > > Indeed.
> > >
> > > From the data presented it looks like min_vruntime is wrong and needs
> > > update. If you can readily reproduce this, dump the vruntime of all
> > > tasks on the runqueue and see if min_vruntime is indeed correct.
> > >
> >
> > This was the dump of all the entities on the tree, from left to right,
>
> Oh, my bad, I thought it was the pick path.
>
> > and also from top down in middle order traverse, when this issue happens:
> >
> > [ 514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 cfs_rq->min_vruntime:763383370431
> > [ 514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
> > min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
> >
> >
> > [ 514.536772][ T8390] Traverse rb-tree from left to right
> > [ 514.537138][ T8390] se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- leftmost se
> > [ 514.537835][ T8390] se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> >
> > [ 514.538539][ T8390] Traverse rb-tree from topdown
> > [ 514.538877][ T8390] middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- root se
> > [ 514.539605][ T8390] middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> >
> > The tree looks like:
> >
> > se (0xec1234e0)
> > |
> > |
> > ----> se (0xec4fcf20)
> >
> >
> > The root se 0xec1234e0 is also the leftmost se, its min_vruntime and
> > vruntime are both 763383370431, which is aligned with
> > cfs_rq->min_vruntime. It seems that the cfs_rq's min_vruntime gets
> > updated correctly, because it is monotonic increasing.
>
> Right.
>
> > My guess is that, for some reason, one newly forked se in a newly
> > created task group, in the rb-tree has not been picked for a long
> > time(maybe not eligible). Its vruntime stopped at the negative
> > value(near (unsigned long)(-(1LL << 20)) for a long time, its vruntime
> > is long behind the cfs_rq->vruntime, thus the overflow happens.
>
> I'll have to do the math again, but that's something in the order of not
> picking a task in about a day, that would be 'bad' :-)
>
> Is there any sane way to reproduce this, and how often does it happen?

After adding some ftrace in place_entity() and pick_eevdf(), with the
help from Yujie in lkp, the issue was reproduced today. The reason why se's vruntime
is very small seems to be related to task group's reweight_entity():

vlag = (s64)(avruntime - se->vruntime);
vlag = div_s64(vlag * old_weight, weight);
se->vruntime = avruntime - vlag;

The vlag above is not limited by neither 2*se->slice nor TICK_NSEC,
if the new weight is very small, which is very likely, then the vlag
could be very large, results in a very small vruntime.

The followings are the details why I think above could bring problems:

Here is the debug log printed by place_entity():

[ 397.597268]cfs_rq:0xe75f7100
cfs_rq.avg_vruntime:-1111846207333767
cfs_rq.min_vruntime:810640668779
avg_vruntime():686982466017
curr(0xc59f4f20 rb_producer weight:15 vruntime:1447773196654 sum_exec_ns:187707021870 ctx(0 73)
leftmost(0xeacb6e00 vruntime:332464705486 sum_exec_ns:78776125437 load:677)
..

[ 397.877251]cfs_rq:0xe75f7100
cfs_rq.avg_vruntime:-759390883821798
cfs_rq.min_vruntime:810640668779
avg_vruntime(): 689577229374
curr(0xc59f4f20 rb_producer weight:15 vruntime:1453640907998 sum_ns:187792974673 ctx(0 73)
leftmost(0xeacb6e00 vruntime:-59752941080010 sum_ns:78776125437 load:4)

The leftmost se is a task group, its vruntime reduces from 332464705486 to
-59752941080010, because its load reduced from 677 to 4 due to update_cfs_group()
on the tree entities.

Back to reweight_entity():
vlag = avruntime - se->vruntime = 689577229374 - 332464705486 = 357112523888;
vlag = vlag * old_weight / weight = 357112523888 * 677 / 4 = 60441294668044;
se->vruntime = avruntime - vlag = -59751717438670;

the new se vruntime -59751717438670 is close to what we printed -59752941080010,
consider that the avg_vruntime() vary.

Then later this leftmost se has changed its load back and forth, and when the load is 2,
the vuntime has reached a dangerous threshold to trigger the s64 overflow in
eligible check:

[ 398.011991]cfs_rq:0xe75f7100
cfs_rq.avg_vruntime:-11875977385353427
cfs_rq.min_vruntime:810640668779
cfs_rq.avg_load:96985
leftmost(0xeacb6e00 vruntime:18446623907344963655 load:2)

vruntime_eligible()
{

key = se.vruntime - cfs_rq.min_vruntime = -120977005256740;
key * avg_load overflow s64...
}

As a result the leftmost one can not be picked, and NULL is returned.

One workaround patch I'm thinking of, if this analysis is in the
right direction, maybe I can have a test later:

thanks,
Chenyu

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e0968fb9ba8..7ab26cdc3487 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3965,8 +3965,13 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
* = V - vl'
*/
if (avruntime != se->vruntime) {
+ s64 limit;
+
vlag = (s64)(avruntime - se->vruntime);
vlag = div_s64(vlag * old_weight, weight);
+ /* TBD: using old weight or new weight? */
+ limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
+ vlag = clamp(lag, -limit, limit);
se->vruntime = avruntime - vlag;
}

--
2.25.1

2024-04-18 02:57:44

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

Hi Yu

On Thu, Apr 18, 2024 at 2:35 AM Chen Yu <[email protected]> wrote:
>
> On 2024-04-09 at 11:21:04 +0200, Peter Zijlstra wrote:
> > On Mon, Apr 08, 2024 at 09:11:39PM +0800, Chen Yu wrote:
> > > On 2024-04-08 at 13:58:33 +0200, Peter Zijlstra wrote:
> > > > On Thu, Feb 29, 2024 at 05:00:18PM +0800, Abel Wu wrote:
> > > >
> > > > > > According to the log, vruntime is 18435852013561943404, the
> > > > > > cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> > > > > > thus:
> > > > > > s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> > > > > > delta * 2677 = 7733399554989275921
> > > > > > that is to say, the multiply result overflow the s64, which turns the
> > > > > > negative value into a positive value, thus eligible check fails.
> > > > >
> > > > > Indeed.
> > > >
> > > > From the data presented it looks like min_vruntime is wrong and needs
> > > > update. If you can readily reproduce this, dump the vruntime of all
> > > > tasks on the runqueue and see if min_vruntime is indeed correct.
> > > >
> > >
> > > This was the dump of all the entities on the tree, from left to right,
> >
> > Oh, my bad, I thought it was the pick path.
> >
> > > and also from top down in middle order traverse, when this issue happens:
> > >
> > > [ 514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 cfs_rq->min_vruntime:763383370431
> > > [ 514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
> > > min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
> > >
> > >
> > > [ 514.536772][ T8390] Traverse rb-tree from left to right
> > > [ 514.537138][ T8390] se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- leftmost se
> > > [ 514.537835][ T8390] se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > >
> > > [ 514.538539][ T8390] Traverse rb-tree from topdown
> > > [ 514.538877][ T8390] middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- root se
> > > [ 514.539605][ T8390] middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > >
> > > The tree looks like:
> > >
> > > se (0xec1234e0)
> > > |
> > > |
> > > ----> se (0xec4fcf20)
> > >
> > >
> > > The root se 0xec1234e0 is also the leftmost se, its min_vruntime and
> > > vruntime are both 763383370431, which is aligned with
> > > cfs_rq->min_vruntime. It seems that the cfs_rq's min_vruntime gets
> > > updated correctly, because it is monotonic increasing.
> >
> > Right.
> >
> > > My guess is that, for some reason, one newly forked se in a newly
> > > created task group, in the rb-tree has not been picked for a long
> > > time(maybe not eligible). Its vruntime stopped at the negative
> > > value(near (unsigned long)(-(1LL << 20)) for a long time, its vruntime
> > > is long behind the cfs_rq->vruntime, thus the overflow happens.
> >
> > I'll have to do the math again, but that's something in the order of not
> > picking a task in about a day, that would be 'bad' :-)
> >
> > Is there any sane way to reproduce this, and how often does it happen?
>
> After adding some ftrace in place_entity() and pick_eevdf(), with the
> help from Yujie in lkp, the issue was reproduced today. The reason why se's vruntime
> is very small seems to be related to task group's reweight_entity():
>
> vlag = (s64)(avruntime - se->vruntime);
> vlag = div_s64(vlag * old_weight, weight);
> se->vruntime = avruntime - vlag;
>
> The vlag above is not limited by neither 2*se->slice nor TICK_NSEC,
> if the new weight is very small, which is very likely, then the vlag
> could be very large, results in a very small vruntime.
>
>
> The followings are the details why I think above could bring problems:
>
> Here is the debug log printed by place_entity():
>
>
> [ 397.597268]cfs_rq:0xe75f7100
> cfs_rq.avg_vruntime:-1111846207333767
> cfs_rq.min_vruntime:810640668779
> avg_vruntime():686982466017
> curr(0xc59f4f20 rb_producer weight:15 vruntime:1447773196654 sum_exec_ns:187707021870 ctx(0 73)
> leftmost(0xeacb6e00 vruntime:332464705486 sum_exec_ns:78776125437 load:677)
> ..
>
> [ 397.877251]cfs_rq:0xe75f7100
> cfs_rq.avg_vruntime:-759390883821798
> cfs_rq.min_vruntime:810640668779
> avg_vruntime(): 689577229374
> curr(0xc59f4f20 rb_producer weight:15 vruntime:1453640907998 sum_ns:187792974673 ctx(0 73)
> leftmost(0xeacb6e00 vruntime:-59752941080010 sum_ns:78776125437 load:4)
>
>
> The leftmost se is a task group, its vruntime reduces from 332464705486 to
> -59752941080010, because its load reduced from 677 to 4 due to update_cfs_group()
> on the tree entities.
>
> Back to reweight_entity():
> vlag = avruntime - se->vruntime = 689577229374 - 332464705486 = 357112523888;
> vlag = vlag * old_weight / weight = 357112523888 * 677 / 4 = 60441294668044;
> se->vruntime = avruntime - vlag = -59751717438670;
>
> the new se vruntime -59751717438670 is close to what we printed -59752941080010,
> consider that the avg_vruntime() vary.
>
> Then later this leftmost se has changed its load back and forth, and when the load is 2,
> the vuntime has reached a dangerous threshold to trigger the s64 overflow in
> eligible check:
>
> [ 398.011991]cfs_rq:0xe75f7100
> cfs_rq.avg_vruntime:-11875977385353427
> cfs_rq.min_vruntime:810640668779
> cfs_rq.avg_load:96985
> leftmost(0xeacb6e00 vruntime:18446623907344963655 load:2)
>
> vruntime_eligible()
> {
>
> key = se.vruntime - cfs_rq.min_vruntime = -120977005256740;
> key * avg_load overflow s64...
> }
>
> As a result the leftmost one can not be picked, and NULL is returned.
>
> One workaround patch I'm thinking of, if this analysis is in the
> right direction, maybe I can have a test later:
>
> thanks,
> Chenyu
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e0968fb9ba8..7ab26cdc3487 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3965,8 +3965,13 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
> * = V - vl'
> */
> if (avruntime != se->vruntime) {
> + s64 limit;
> +
> vlag = (s64)(avruntime - se->vruntime);
> vlag = div_s64(vlag * old_weight, weight);
> + /* TBD: using old weight or new weight? */
> + limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> + vlag = clamp(lag, -limit, limit);
> se->vruntime = avruntime - vlag;
> }
>

According to previous discussion:
https://lore.kernel.org/all/CAB8ipk9N9verfQp6U9s8+TQgNbA5J0DWkOB1dShf20n0xbx94w@mail.gmail.com/

Could this patch avoid this problem?

BR

> --
> 2.25.1
>
>
>

2024-04-18 03:09:31

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On 2024-04-18 at 10:57:22 +0800, Xuewen Yan wrote:
> Hi Yu
>
> On Thu, Apr 18, 2024 at 2:35 AM Chen Yu <[email protected]> wrote:
> >
> > On 2024-04-09 at 11:21:04 +0200, Peter Zijlstra wrote:
> > > On Mon, Apr 08, 2024 at 09:11:39PM +0800, Chen Yu wrote:
> > > > On 2024-04-08 at 13:58:33 +0200, Peter Zijlstra wrote:
> > > > > On Thu, Feb 29, 2024 at 05:00:18PM +0800, Abel Wu wrote:
> > > > >
> > > > > > > According to the log, vruntime is 18435852013561943404, the
> > > > > > > cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> > > > > > > thus:
> > > > > > > s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> > > > > > > delta * 2677 = 7733399554989275921
> > > > > > > that is to say, the multiply result overflow the s64, which turns the
> > > > > > > negative value into a positive value, thus eligible check fails.
> > > > > >
> > > > > > Indeed.
> > > > >
> > > > > From the data presented it looks like min_vruntime is wrong and needs
> > > > > update. If you can readily reproduce this, dump the vruntime of all
> > > > > tasks on the runqueue and see if min_vruntime is indeed correct.
> > > > >
> > > >
> > > > This was the dump of all the entities on the tree, from left to right,
> > >
> > > Oh, my bad, I thought it was the pick path.
> > >
> > > > and also from top down in middle order traverse, when this issue happens:
> > > >
> > > > [ 514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 cfs_rq->min_vruntime:763383370431
> > > > [ 514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
> > > > min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
> > > >
> > > >
> > > > [ 514.536772][ T8390] Traverse rb-tree from left to right
> > > > [ 514.537138][ T8390] se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- leftmost se
> > > > [ 514.537835][ T8390] se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > > >
> > > > [ 514.538539][ T8390] Traverse rb-tree from topdown
> > > > [ 514.538877][ T8390] middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- root se
> > > > [ 514.539605][ T8390] middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > > >
> > > > The tree looks like:
> > > >
> > > > se (0xec1234e0)
> > > > |
> > > > |
> > > > ----> se (0xec4fcf20)
> > > >
> > > >
> > > > The root se 0xec1234e0 is also the leftmost se, its min_vruntime and
> > > > vruntime are both 763383370431, which is aligned with
> > > > cfs_rq->min_vruntime. It seems that the cfs_rq's min_vruntime gets
> > > > updated correctly, because it is monotonic increasing.
> > >
> > > Right.
> > >
> > > > My guess is that, for some reason, one newly forked se in a newly
> > > > created task group, in the rb-tree has not been picked for a long
> > > > time(maybe not eligible). Its vruntime stopped at the negative
> > > > value(near (unsigned long)(-(1LL << 20)) for a long time, its vruntime
> > > > is long behind the cfs_rq->vruntime, thus the overflow happens.
> > >
> > > I'll have to do the math again, but that's something in the order of not
> > > picking a task in about a day, that would be 'bad' :-)
> > >
> > > Is there any sane way to reproduce this, and how often does it happen?
> >
> > After adding some ftrace in place_entity() and pick_eevdf(), with the
> > help from Yujie in lkp, the issue was reproduced today. The reason why se's vruntime
> > is very small seems to be related to task group's reweight_entity():
> >
> > vlag = (s64)(avruntime - se->vruntime);
> > vlag = div_s64(vlag * old_weight, weight);
> > se->vruntime = avruntime - vlag;
> >
> > The vlag above is not limited by neither 2*se->slice nor TICK_NSEC,
> > if the new weight is very small, which is very likely, then the vlag
> > could be very large, results in a very small vruntime.
> >
> >
> > The followings are the details why I think above could bring problems:
> >
> > Here is the debug log printed by place_entity():
> >
> >
> > [ 397.597268]cfs_rq:0xe75f7100
> > cfs_rq.avg_vruntime:-1111846207333767
> > cfs_rq.min_vruntime:810640668779
> > avg_vruntime():686982466017
> > curr(0xc59f4f20 rb_producer weight:15 vruntime:1447773196654 sum_exec_ns:187707021870 ctx(0 73)
> > leftmost(0xeacb6e00 vruntime:332464705486 sum_exec_ns:78776125437 load:677)
> > ..
> >
> > [ 397.877251]cfs_rq:0xe75f7100
> > cfs_rq.avg_vruntime:-759390883821798
> > cfs_rq.min_vruntime:810640668779
> > avg_vruntime(): 689577229374
> > curr(0xc59f4f20 rb_producer weight:15 vruntime:1453640907998 sum_ns:187792974673 ctx(0 73)
> > leftmost(0xeacb6e00 vruntime:-59752941080010 sum_ns:78776125437 load:4)
> >
> >
> > The leftmost se is a task group, its vruntime reduces from 332464705486 to
> > -59752941080010, because its load reduced from 677 to 4 due to update_cfs_group()
> > on the tree entities.
> >
> > Back to reweight_entity():
> > vlag = avruntime - se->vruntime = 689577229374 - 332464705486 = 357112523888;
> > vlag = vlag * old_weight / weight = 357112523888 * 677 / 4 = 60441294668044;
> > se->vruntime = avruntime - vlag = -59751717438670;
> >
> > the new se vruntime -59751717438670 is close to what we printed -59752941080010,
> > consider that the avg_vruntime() vary.
> >
> > Then later this leftmost se has changed its load back and forth, and when the load is 2,
> > the vuntime has reached a dangerous threshold to trigger the s64 overflow in
> > eligible check:
> >
> > [ 398.011991]cfs_rq:0xe75f7100
> > cfs_rq.avg_vruntime:-11875977385353427
> > cfs_rq.min_vruntime:810640668779
> > cfs_rq.avg_load:96985
> > leftmost(0xeacb6e00 vruntime:18446623907344963655 load:2)
> >
> > vruntime_eligible()
> > {
> >
> > key = se.vruntime - cfs_rq.min_vruntime = -120977005256740;
> > key * avg_load overflow s64...
> > }
> >
> > As a result the leftmost one can not be picked, and NULL is returned.
> >
> > One workaround patch I'm thinking of, if this analysis is in the
> > right direction, maybe I can have a test later:
> >
> > thanks,
> > Chenyu
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6e0968fb9ba8..7ab26cdc3487 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3965,8 +3965,13 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
> > * = V - vl'
> > */
> > if (avruntime != se->vruntime) {
> > + s64 limit;
> > +
> > vlag = (s64)(avruntime - se->vruntime);
> > vlag = div_s64(vlag * old_weight, weight);
> > + /* TBD: using old weight or new weight? */
> > + limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> > + vlag = clamp(lag, -limit, limit);
> > se->vruntime = avruntime - vlag;
> > }
> >
>
> According to previous discussion:
> https://lore.kernel.org/all/CAB8ipk9N9verfQp6U9s8+TQgNbA5J0DWkOB1dShf20n0xbx94w@mail.gmail.com/
>
> Could this patch avoid this problem?
>

Ah, right, I did not notice your previous patch has also addressed the overflow in
reweight_eevdf(), as I was trying hard to reproduce the issue. Let me check your
patch in detail.

thanks,
Chenyu

> BR
>
> > --
> > 2.25.1
> >
> >
> >

2024-04-18 03:37:47

by Tianchen Ding

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

Hi, we've found a bug about reweight_eevdf(). Would you also please try this
patchset?

https://lore.kernel.org/all/[email protected]/

Thanks.

2024-04-18 05:53:11

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On 2024-04-18 at 11:37:31 +0800, Tianchen Ding wrote:
> Hi, we've found a bug about reweight_eevdf(). Would you also please try this
> patchset?
>
> https://lore.kernel.org/all/[email protected]/
>

Sure, I'll have a try after I test Xuewen's patch. I assume your fix is mainly in
2/2 which adjusts the se->vruntime in reweight_eevdf() by using the correct
avg_vruntime(), although it might not be directly related to the s64 overflow issue
here, it is worthy a fix.

thanks,
Chenyu

2024-04-18 06:18:00

by Tianchen Ding

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On 2024/4/18 13:52, Chen Yu wrote:
> On 2024-04-18 at 11:37:31 +0800, Tianchen Ding wrote:
>> Hi, we've found a bug about reweight_eevdf(). Would you also please try this
>> patchset?
>>
>> https://lore.kernel.org/all/[email protected]/
>>
>
> Sure, I'll have a try after I test Xuewen's patch. I assume your fix is mainly in
> 2/2 which adjusts the se->vruntime in reweight_eevdf() by using the correct
> avg_vruntime(), although it might not be directly related to the s64 overflow issue
> here, it is worthy a fix.
>

Yes. I'm hoping to reduce "noises" about the unexpected avg_vruntime() varying
during your tests.

Thanks.

2024-04-18 13:08:33

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

Hi Xuewen,

On 2024-04-18 at 10:57:22 +0800, Xuewen Yan wrote:
> Hi Yu
>
> On Thu, Apr 18, 2024 at 2:35 AM Chen Yu <[email protected]> wrote:
> >
> > On 2024-04-09 at 11:21:04 +0200, Peter Zijlstra wrote:
> > > On Mon, Apr 08, 2024 at 09:11:39PM +0800, Chen Yu wrote:
> > > > On 2024-04-08 at 13:58:33 +0200, Peter Zijlstra wrote:
> > > > > On Thu, Feb 29, 2024 at 05:00:18PM +0800, Abel Wu wrote:
> > > > >
> > > > > > > According to the log, vruntime is 18435852013561943404, the
> > > > > > > cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> > > > > > > thus:
> > > > > > > s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> > > > > > > delta * 2677 = 7733399554989275921
> > > > > > > that is to say, the multiply result overflow the s64, which turns the
> > > > > > > negative value into a positive value, thus eligible check fails.
> > > > > >
> > > > > > Indeed.
> > > > >
> > > > > From the data presented it looks like min_vruntime is wrong and needs
> > > > > update. If you can readily reproduce this, dump the vruntime of all
> > > > > tasks on the runqueue and see if min_vruntime is indeed correct.
> > > > >
> > > >
> > > > This was the dump of all the entities on the tree, from left to right,
> > >
> > > Oh, my bad, I thought it was the pick path.
> > >
> > > > and also from top down in middle order traverse, when this issue happens:
> > > >
> > > > [ 514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 cfs_rq->min_vruntime:763383370431
> > > > [ 514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
> > > > min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
> > > >
> > > >
> > > > [ 514.536772][ T8390] Traverse rb-tree from left to right
> > > > [ 514.537138][ T8390] se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- leftmost se
> > > > [ 514.537835][ T8390] se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > > >
> > > > [ 514.538539][ T8390] Traverse rb-tree from topdown
> > > > [ 514.538877][ T8390] middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- root se
> > > > [ 514.539605][ T8390] middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > > >
> > > > The tree looks like:
> > > >
> > > > se (0xec1234e0)
> > > > |
> > > > |
> > > > ----> se (0xec4fcf20)
> > > >
> > > >
> > > > The root se 0xec1234e0 is also the leftmost se, its min_vruntime and
> > > > vruntime are both 763383370431, which is aligned with
> > > > cfs_rq->min_vruntime. It seems that the cfs_rq's min_vruntime gets
> > > > updated correctly, because it is monotonic increasing.
> > >
> > > Right.
> > >
> > > > My guess is that, for some reason, one newly forked se in a newly
> > > > created task group, in the rb-tree has not been picked for a long
> > > > time(maybe not eligible). Its vruntime stopped at the negative
> > > > value(near (unsigned long)(-(1LL << 20)) for a long time, its vruntime
> > > > is long behind the cfs_rq->vruntime, thus the overflow happens.
> > >
> > > I'll have to do the math again, but that's something in the order of not
> > > picking a task in about a day, that would be 'bad' :-)
> > >
> > > Is there any sane way to reproduce this, and how often does it happen?
> >
> > After adding some ftrace in place_entity() and pick_eevdf(), with the
> > help from Yujie in lkp, the issue was reproduced today. The reason why se's vruntime
> > is very small seems to be related to task group's reweight_entity():
> >
> > vlag = (s64)(avruntime - se->vruntime);
> > vlag = div_s64(vlag * old_weight, weight);
> > se->vruntime = avruntime - vlag;
> >
> > The vlag above is not limited by neither 2*se->slice nor TICK_NSEC,
> > if the new weight is very small, which is very likely, then the vlag
> > could be very large, results in a very small vruntime.
> >
> >
> > The followings are the details why I think above could bring problems:
> >
> > Here is the debug log printed by place_entity():
> >
> >
> > [ 397.597268]cfs_rq:0xe75f7100
> > cfs_rq.avg_vruntime:-1111846207333767
> > cfs_rq.min_vruntime:810640668779
> > avg_vruntime():686982466017
> > curr(0xc59f4f20 rb_producer weight:15 vruntime:1447773196654 sum_exec_ns:187707021870 ctx(0 73)
> > leftmost(0xeacb6e00 vruntime:332464705486 sum_exec_ns:78776125437 load:677)
> > ..
> >
> > [ 397.877251]cfs_rq:0xe75f7100
> > cfs_rq.avg_vruntime:-759390883821798
> > cfs_rq.min_vruntime:810640668779
> > avg_vruntime(): 689577229374
> > curr(0xc59f4f20 rb_producer weight:15 vruntime:1453640907998 sum_ns:187792974673 ctx(0 73)
> > leftmost(0xeacb6e00 vruntime:-59752941080010 sum_ns:78776125437 load:4)
> >
> >
> > The leftmost se is a task group, its vruntime reduces from 332464705486 to
> > -59752941080010, because its load reduced from 677 to 4 due to update_cfs_group()
> > on the tree entities.
> >
> > Back to reweight_entity():
> > vlag = avruntime - se->vruntime = 689577229374 - 332464705486 = 357112523888;
> > vlag = vlag * old_weight / weight = 357112523888 * 677 / 4 = 60441294668044;
> > se->vruntime = avruntime - vlag = -59751717438670;
> >
> > the new se vruntime -59751717438670 is close to what we printed -59752941080010,
> > consider that the avg_vruntime() vary.
> >
> > Then later this leftmost se has changed its load back and forth, and when the load is 2,
> > the vuntime has reached a dangerous threshold to trigger the s64 overflow in
> > eligible check:
> >
> > [ 398.011991]cfs_rq:0xe75f7100
> > cfs_rq.avg_vruntime:-11875977385353427
> > cfs_rq.min_vruntime:810640668779
> > cfs_rq.avg_load:96985
> > leftmost(0xeacb6e00 vruntime:18446623907344963655 load:2)
> >
> > vruntime_eligible()
> > {
> >
> > key = se.vruntime - cfs_rq.min_vruntime = -120977005256740;
> > key * avg_load overflow s64...
> > }
> >
> > As a result the leftmost one can not be picked, and NULL is returned.
> >
> > One workaround patch I'm thinking of, if this analysis is in the
> > right direction, maybe I can have a test later:
> >
> > thanks,
> > Chenyu
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6e0968fb9ba8..7ab26cdc3487 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3965,8 +3965,13 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
> > * = V - vl'
> > */
> > if (avruntime != se->vruntime) {
> > + s64 limit;
> > +
> > vlag = (s64)(avruntime - se->vruntime);
> > vlag = div_s64(vlag * old_weight, weight);
> > + /* TBD: using old weight or new weight? */
> > + limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> > + vlag = clamp(lag, -limit, limit);
> > se->vruntime = avruntime - vlag;
> > }
> >
>
> According to previous discussion:
> https://lore.kernel.org/all/CAB8ipk9N9verfQp6U9s8+TQgNbA5J0DWkOB1dShf20n0xbx94w@mail.gmail.com/
>
> Could this patch avoid this problem?
>

100 cycles has passed with the following change, with the fork() cased removed.
There was no NULL pointer issue observed so far. Meanwhile, I've launched a test
with Tianchen's patch set on top of this one to see if everything goes well.

Can you take a look if the following change make sense, and if yes, could you
send v2 and CC the reporters, so they could also have a try if this helps.

Reported-by: Sergei Trofimovich <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Reported-by: Igor Raits <[email protected]>
Closes: https://lore.kernel.org/all/CA+9S74ih+45M_2TPUY_mPPVDhNvyYfy1J1ftSix+KjiTVxg8nw@mail.gmail.com/
Reported-by: Breno Leitao <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]/
Reported-by: Yujie Liu <[email protected]>

And in v2, it would be helpful to include the reason why we did this change, how
about something like this:

"
kernel encounters the following error when running workload:

BUG: kernel NULL pointer dereference, address: 0000002c
EIP: set_next_entity (fair.c:?)

which was caused by NULL pointer returned by pick_eevdf().

Further investigation has shown that, the entity_eligible() has an
false-negative issue when the entity's vruntime is far behind the
cfs_rq.min_vruntime that, the (vruntime - cfs_rq->min_vruntime) * load
caused a s64 overflow, thus every entity on the rb-tree is not
eligible, which results in a NULL candidate.

The reason why entity's vruntime is far behind the cfs_rq.min_vruntime
is because during a on_rq task group's update_cfs_group()->reweight_eevdf(),
there is no limit on the new entity's vlag. If the new weight is much
smaller than the old one,

vlag = div_s64(vlag * old_weight, weight)

generates a huge vlag, and results in very small(negative) vruntime.

Thus limit the range of vlag accordingly.

"

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 31bca05c3612..9f203012e8f5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -696,15 +696,23 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
*
* XXX could add max_slice to the augmented data to track this.
*/
+
+static s64 limit_entity_lag(struct sched_entity *se, s64 lag)
+{
+ s64 limit;
+
+ limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
+ return clamp(lag, -limit, limit);
+}
+
static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- s64 lag, limit;
+ s64 lag;

SCHED_WARN_ON(!se->on_rq);
lag = avg_vruntime(cfs_rq) - se->vruntime;

- limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
- se->vlag = clamp(lag, -limit, limit);
+ se->vlag = limit_entity_lag(se, lag);
}

/*
@@ -3721,6 +3729,7 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
if (avruntime != se->vruntime) {
vlag = (s64)(avruntime - se->vruntime);
vlag = div_s64(vlag * old_weight, weight);
+ vlag = limit_entity_lag(se, vlag);
se->vruntime = avruntime - vlag;
}

@@ -3768,6 +3777,9 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,

update_load_set(&se->load, weight);

+ if (!se->on_rq)
+ se->vlag = limit_entity_lag(se, se->vlag);
+
#ifdef CONFIG_SMP
do {
u32 divider = get_pelt_divider(&se->avg);
--
2.25.1

thanks,
Chenyu

2024-04-18 23:45:45

by Tim Chen

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On Thu, 2024-04-18 at 21:03 +0800, Chen Yu wrote:
> Hi Xuewen,
>
> On 2024-04-18 at 10:57:22 +0800, Xuewen Yan wrote:
> > Hi Yu
> >
> > On Thu, Apr 18, 2024 at 2:35 AM Chen Yu <[email protected]> wrote:
> > >
> > > On 2024-04-09 at 11:21:04 +0200, Peter Zijlstra wrote:
> > > > On Mon, Apr 08, 2024 at 09:11:39PM +0800, Chen Yu wrote:
> > > > > On 2024-04-08 at 13:58:33 +0200, Peter Zijlstra wrote:
> > > > > > On Thu, Feb 29, 2024 at 05:00:18PM +0800, Abel Wu wrote:
> > > > > >
> > > > > > > > According to the log, vruntime is 18435852013561943404, the
> > > > > > > > cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> > > > > > > > thus:
> > > > > > > > s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> > > > > > > > delta * 2677 = 7733399554989275921
> > > > > > > > that is to say, the multiply result overflow the s64, which turns the
> > > > > > > > negative value into a positive value, thus eligible check fails.
> > > > > > >
> > > > > > > Indeed.
> > > > > >
> > > > > > From the data presented it looks like min_vruntime is wrong and needs
> > > > > > update. If you can readily reproduce this, dump the vruntime of all
> > > > > > tasks on the runqueue and see if min_vruntime is indeed correct.
> > > > > >
> > > > >
> > > > > This was the dump of all the entities on the tree, from left to right,
> > > >
> > > > Oh, my bad, I thought it was the pick path.
> > > >
> > > > > and also from top down in middle order traverse, when this issue happens:
> > > > >
> > > > > [ 514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 cfs_rq->min_vruntime:763383370431
> > > > > [ 514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
> > > > > min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
> > > > >
> > > > >
> > > > > [ 514.536772][ T8390] Traverse rb-tree from left to right
> > > > > [ 514.537138][ T8390] se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- leftmost se
> > > > > [ 514.537835][ T8390] se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > > > >
> > > > > [ 514.538539][ T8390] Traverse rb-tree from topdown
> > > > > [ 514.538877][ T8390] middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible <-- root se
> > > > > [ 514.539605][ T8390] middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > > > >
> > > > > The tree looks like:
> > > > >
> > > > > se (0xec1234e0)
> > > > > |
> > > > > |
> > > > > ----> se (0xec4fcf20)
> > > > >
> > > > >
> > > > > The root se 0xec1234e0 is also the leftmost se, its min_vruntime and
> > > > > vruntime are both 763383370431, which is aligned with
> > > > > cfs_rq->min_vruntime. It seems that the cfs_rq's min_vruntime gets
> > > > > updated correctly, because it is monotonic increasing.
> > > >
> > > > Right.
> > > >
> > > > > My guess is that, for some reason, one newly forked se in a newly
> > > > > created task group, in the rb-tree has not been picked for a long
> > > > > time(maybe not eligible). Its vruntime stopped at the negative
> > > > > value(near (unsigned long)(-(1LL << 20)) for a long time, its vruntime
> > > > > is long behind the cfs_rq->vruntime, thus the overflow happens.
> > > >
> > > > I'll have to do the math again, but that's something in the order of not
> > > > picking a task in about a day, that would be 'bad' :-)
> > > >
> > > > Is there any sane way to reproduce this, and how often does it happen?
> > >
> > > After adding some ftrace in place_entity() and pick_eevdf(), with the
> > > help from Yujie in lkp, the issue was reproduced today. The reason why se's vruntime
> > > is very small seems to be related to task group's reweight_entity():
> > >
> > > vlag = (s64)(avruntime - se->vruntime);
> > > vlag = div_s64(vlag * old_weight, weight);
> > > se->vruntime = avruntime - vlag;
> > >
> > > The vlag above is not limited by neither 2*se->slice nor TICK_NSEC,
> > > if the new weight is very small, which is very likely, then the vlag
> > > could be very large, results in a very small vruntime.
> > >
> > >
> > > The followings are the details why I think above could bring problems:
> > >
> > > Here is the debug log printed by place_entity():
> > >
> > >
> > > [ 397.597268]cfs_rq:0xe75f7100
> > > cfs_rq.avg_vruntime:-1111846207333767
> > > cfs_rq.min_vruntime:810640668779
> > > avg_vruntime():686982466017
> > > curr(0xc59f4f20 rb_producer weight:15 vruntime:1447773196654 sum_exec_ns:187707021870 ctx(0 73)
> > > leftmost(0xeacb6e00 vruntime:332464705486 sum_exec_ns:78776125437 load:677)
> > > ..
> > >
> > > [ 397.877251]cfs_rq:0xe75f7100
> > > cfs_rq.avg_vruntime:-759390883821798
> > > cfs_rq.min_vruntime:810640668779
> > > avg_vruntime(): 689577229374
> > > curr(0xc59f4f20 rb_producer weight:15 vruntime:1453640907998 sum_ns:187792974673 ctx(0 73)
> > > leftmost(0xeacb6e00 vruntime:-59752941080010 sum_ns:78776125437 load:4)
> > >
> > >
> > > The leftmost se is a task group, its vruntime reduces from 332464705486 to
> > > -59752941080010, because its load reduced from 677 to 4 due to update_cfs_group()
> > > on the tree entities.
> > >
> > > Back to reweight_entity():
> > > vlag = avruntime - se->vruntime = 689577229374 - 332464705486 = 357112523888;
> > > vlag = vlag * old_weight / weight = 357112523888 * 677 / 4 = 60441294668044;
> > > se->vruntime = avruntime - vlag = -59751717438670;
> > >
> > > the new se vruntime -59751717438670 is close to what we printed -59752941080010,
> > > consider that the avg_vruntime() vary.
> > >
> > > Then later this leftmost se has changed its load back and forth, and when the load is 2,
> > > the vuntime has reached a dangerous threshold to trigger the s64 overflow in
> > > eligible check:
> > >
> > > [ 398.011991]cfs_rq:0xe75f7100
> > > cfs_rq.avg_vruntime:-11875977385353427
> > > cfs_rq.min_vruntime:810640668779
> > > cfs_rq.avg_load:96985
> > > leftmost(0xeacb6e00 vruntime:18446623907344963655 load:2)
> > >
> > > vruntime_eligible()
> > > {
> > >
> > > key = se.vruntime - cfs_rq.min_vruntime = -120977005256740;
> > > key * avg_load overflow s64...
> > > }
> > >
> > > As a result the leftmost one can not be picked, and NULL is returned.
> > >
> > > One workaround patch I'm thinking of, if this analysis is in the
> > > right direction, maybe I can have a test later:
> > >
> > > thanks,
> > > Chenyu
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 6e0968fb9ba8..7ab26cdc3487 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -3965,8 +3965,13 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
> > > * = V - vl'
> > > */
> > > if (avruntime != se->vruntime) {
> > > + s64 limit;
> > > +
> > > vlag = (s64)(avruntime - se->vruntime);
> > > vlag = div_s64(vlag * old_weight, weight);
> > > + /* TBD: using old weight or new weight? */
> > > + limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> > > + vlag = clamp(lag, -limit, limit);
> > > se->vruntime = avruntime - vlag;
> > > }
> > >
> >
> > According to previous discussion:
> > https://lore.kernel.org/all/CAB8ipk9N9verfQp6U9s8+TQgNbA5J0DWkOB1dShf20n0xbx94w@mail.gmail.com/
> >
> > Could this patch avoid this problem?
> >
>
> 100 cycles has passed with the following change, with the fork() cased removed.
> There was no NULL pointer issue observed so far. Meanwhile, I've launched a test
> with Tianchen's patch set on top of this one to see if everything goes well.
>
> Can you take a look if the following change make sense, and if yes, could you
> send v2 and CC the reporters, so they could also have a try if this helps.
>
> Reported-by: Sergei Trofimovich <[email protected]>
> Closes: https://lore.kernel.org/all/[email protected]/
> Reported-by: Igor Raits <[email protected]>
> Closes: https://lore.kernel.org/all/CA+9S74ih+45M_2TPUY_mPPVDhNvyYfy1J1ftSix+KjiTVxg8nw@mail.gmail.com/
> Reported-by: Breno Leitao <[email protected]>
> Reported-by: kernel test robot <[email protected]>
> Closes: https://lore.kernel.org/lkml/[email protected]/
> Reported-by: Yujie Liu <[email protected]>
>
>
> And in v2, it would be helpful to include the reason why we did this change, how
> about something like this:
>
> "
> kernel encounters the following error when running workload:
>
> BUG: kernel NULL pointer dereference, address: 0000002c
> EIP: set_next_entity (fair.c:?)
>
> which was caused by NULL pointer returned by pick_eevdf().
>
> Further investigation has shown that, the entity_eligible() has an
> false-negative issue when the entity's vruntime is far behind the
> cfs_rq.min_vruntime that, the (vruntime - cfs_rq->min_vruntime) * load
> caused a s64 overflow, thus every entity on the rb-tree is not
> eligible, which results in a NULL candidate.
>
> The reason why entity's vruntime is far behind the cfs_rq.min_vruntime
> is because during a on_rq task group's update_cfs_group()->reweight_eevdf(),
> there is no limit on the new entity's vlag. If the new weight is much
> smaller than the old one,

Probably worth mentioning in commit log that the vlag could be many times larger than the scheduling
entity's time slice leading to vruntime overflow on going to a really small weight. We're limiting
the lag update on reweight_eevdf() to a max of twice the sched entity's time slice to prevent lag overflow, just
as we are doing on update_sched_entity().

This will help to clarify the reason for the change when looking at this patch for the first time.

Tim

>
> vlag = div_s64(vlag * old_weight, weight)
>
> generates a huge vlag, and results in very small(negative) vruntime.
>
> Thus limit the range of vlag accordingly.
>
> "
>
>

2024-04-19 08:25:05

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On Thu, Apr 18, 2024 at 09:03:36PM +0800, Chen Yu wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 31bca05c3612..9f203012e8f5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -696,15 +696,23 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
> *
> * XXX could add max_slice to the augmented data to track this.
> */
> +
> +static s64 limit_entity_lag(struct sched_entity *se, s64 lag)
> +{
> + s64 limit;
> +
> + limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> + return clamp(lag, -limit, limit);
> +}

Right, helper makes sense.

> @@ -3721,6 +3729,7 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
> if (avruntime != se->vruntime) {
> vlag = (s64)(avruntime - se->vruntime);
> vlag = div_s64(vlag * old_weight, weight);
> + vlag = limit_entity_lag(se, vlag);
> se->vruntime = avruntime - vlag;

So the !on_rq case has clamping in update_entity_lag() which is before
scaling. And that makes more sense to me, because putting a limit on
vlag before the multiplication *should* ensure the multiplication itself
doesn't overflow.

But now you allow it to compute garbage and then clip the garbage.

> }
>
> @@ -3768,6 +3777,9 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>
> update_load_set(&se->load, weight);
>
> + if (!se->on_rq)
> + se->vlag = limit_entity_lag(se, se->vlag);
> +

Except you now add clamping after scaling too, but in a really weird
place. Should this not go right after the div_s64() that scales?

Unlike the reweight_eevdf() case, there might be an argument for doing
it after scaling in this case. Namely, you can have multiple reweights
stacking their scale ops.

Also, could you put a few words in on how often these clips are hit? I
suspect it's fairly rare (but crucial when it does).

2024-04-19 08:45:42

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On Fri, Apr 19, 2024 at 10:24:40AM +0200, Peter Zijlstra wrote:
> On Thu, Apr 18, 2024 at 09:03:36PM +0800, Chen Yu wrote:
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 31bca05c3612..9f203012e8f5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -696,15 +696,23 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
> > *
> > * XXX could add max_slice to the augmented data to track this.
> > */
> > +
> > +static s64 limit_entity_lag(struct sched_entity *se, s64 lag)
> > +{
> > + s64 limit;
> > +
> > + limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> > + return clamp(lag, -limit, limit);
> > +}
>
> Right, helper makes sense.

A possible other helper might be the below. But that depends if we want
to clip after scaling too. No super strong preference.

Please post your patch with a proper changelog and we'll get it sorted.

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -696,15 +696,19 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
*
* XXX could add max_slice to the augmented data to track this.
*/
-static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static s64 entity_lag(u64 avruntime, struct sched_entity *se)
{
- s64 lag, limit;
-
- SCHED_WARN_ON(!se->on_rq);
- lag = avg_vruntime(cfs_rq) - se->vruntime;
+ s64 vlag, limit;

+ vlag = avruntime - se->vruntime;
limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
- se->vlag = clamp(lag, -limit, limit);
+ return clamp(vlag, -limit, limit);
+}
+
+static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ SCHED_WARN_ON(!se->on_rq);
+ se->vlag = entity_lag(avg_vruntime(cfs_rq), se);
}

/*
@@ -3760,7 +3764,7 @@ static void reweight_eevdf(struct sched_
* = V - vl'
*/
if (avruntime != se->vruntime) {
- vlag = (s64)(avruntime - se->vruntime);
+ vlag = entity_lag(avruntime, se);
vlag = div_s64(vlag * old_weight, weight);
se->vruntime = avruntime - vlag;
}

2024-04-19 09:17:50

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On Fri, Apr 19, 2024 at 4:24 PM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Apr 18, 2024 at 09:03:36PM +0800, Chen Yu wrote:
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 31bca05c3612..9f203012e8f5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -696,15 +696,23 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
> > *
> > * XXX could add max_slice to the augmented data to track this.
> > */
> > +
> > +static s64 limit_entity_lag(struct sched_entity *se, s64 lag)
> > +{
> > + s64 limit;
> > +
> > + limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> > + return clamp(lag, -limit, limit);
> > +}
>
> Right, helper makes sense.
>
> > @@ -3721,6 +3729,7 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
> > if (avruntime != se->vruntime) {
> > vlag = (s64)(avruntime - se->vruntime);
> > vlag = div_s64(vlag * old_weight, weight);
> > + vlag = limit_entity_lag(se, vlag);
> > se->vruntime = avruntime - vlag;
>
> So the !on_rq case has clamping in update_entity_lag() which is before
> scaling. And that makes more sense to me, because putting a limit on
> vlag before the multiplication *should* ensure the multiplication itself
> doesn't overflow.
>
> But now you allow it to compute garbage and then clip the garbage.
>
> > }
> >
> > @@ -3768,6 +3777,9 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> >
> > update_load_set(&se->load, weight);
> >
> > + if (!se->on_rq)
> > + se->vlag = limit_entity_lag(se, se->vlag);
> > +
>
> Except you now add clamping after scaling too, but in a really weird
> place. Should this not go right after the div_s64() that scales?

Yes, it's better to put it after the div_s64().

>
> Unlike the reweight_eevdf() case, there might be an argument for doing
> it after scaling in this case. Namely, you can have multiple reweights
> stacking their scale ops.
>
>
> Also, could you put a few words in on how often these clips are hit? I
> suspect it's fairly rare (but crucial when it does).

I think this probability may be not small, because the user space
often modifies the task's nice value while the task is sleeping.
I also would add logs to check it.

BR

2024-04-19 09:21:03

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On Fri, Apr 19, 2024 at 4:45 PM Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Apr 19, 2024 at 10:24:40AM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 18, 2024 at 09:03:36PM +0800, Chen Yu wrote:
> >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 31bca05c3612..9f203012e8f5 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -696,15 +696,23 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
> > > *
> > > * XXX could add max_slice to the augmented data to track this.
> > > */
> > > +
> > > +static s64 limit_entity_lag(struct sched_entity *se, s64 lag)
> > > +{
> > > + s64 limit;
> > > +
> > > + limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> > > + return clamp(lag, -limit, limit);
> > > +}
> >
> > Right, helper makes sense.
>
> A possible other helper might be the below. But that depends if we want
> to clip after scaling too. No super strong preference.
>
> Please post your patch with a proper changelog and we'll get it sorted.

Thanks Peter, I will send the v2 later after my local testing:)

>
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -696,15 +696,19 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
> *
> * XXX could add max_slice to the augmented data to track this.
> */
> -static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +static s64 entity_lag(u64 avruntime, struct sched_entity *se)
> {
> - s64 lag, limit;
> -
> - SCHED_WARN_ON(!se->on_rq);
> - lag = avg_vruntime(cfs_rq) - se->vruntime;
> + s64 vlag, limit;
>
> + vlag = avruntime - se->vruntime;
> limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> - se->vlag = clamp(lag, -limit, limit);
> + return clamp(vlag, -limit, limit);
> +}
> +
> +static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> + SCHED_WARN_ON(!se->on_rq);
> + se->vlag = entity_lag(avg_vruntime(cfs_rq), se);
> }
>
> /*
> @@ -3760,7 +3764,7 @@ static void reweight_eevdf(struct sched_
> * = V - vl'
> */
> if (avruntime != se->vruntime) {
> - vlag = (s64)(avruntime - se->vruntime);
> + vlag = entity_lag(avruntime, se);
> vlag = div_s64(vlag * old_weight, weight);
> se->vruntime = avruntime - vlag;
> }

BR
---
xuewen

2024-04-19 10:05:02

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On 2024-04-19 at 10:24:40 +0200, Peter Zijlstra wrote:
> On Thu, Apr 18, 2024 at 09:03:36PM +0800, Chen Yu wrote:
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 31bca05c3612..9f203012e8f5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -696,15 +696,23 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
> > *
> > * XXX could add max_slice to the augmented data to track this.
> > */
> > +
> > +static s64 limit_entity_lag(struct sched_entity *se, s64 lag)
> > +{
> > + s64 limit;
> > +
> > + limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> > + return clamp(lag, -limit, limit);
> > +}
>
> Right, helper makes sense.
>
> > @@ -3721,6 +3729,7 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
> > if (avruntime != se->vruntime) {
> > vlag = (s64)(avruntime - se->vruntime);
> > vlag = div_s64(vlag * old_weight, weight);
> > + vlag = limit_entity_lag(se, vlag);
> > se->vruntime = avruntime - vlag;
>
> So the !on_rq case has clamping in update_entity_lag() which is before
> scaling. And that makes more sense to me, because putting a limit on
> vlag before the multiplication *should* ensure the multiplication itself
> doesn't overflow.
>
> But now you allow it to compute garbage and then clip the garbage.
>

Yes, there is possibility to get multiplication overflow. Clamp first should
be better.
(BTW for !on_rq case in this patch, it is also scaled before the clamp).

> > }
> >
> > @@ -3768,6 +3777,9 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> >
> > update_load_set(&se->load, weight);
> >
> > + if (!se->on_rq)
> > + se->vlag = limit_entity_lag(se, se->vlag);
> > +
>
> Except you now add clamping after scaling too, but in a really weird
> place. Should this not go right after the div_s64() that scales?
>

The reason to put this after update_load_set(&se->load, weight) is because
we want to clamp the vlag based on the latest load, although for reweight_eevdf(),
it uses the old load to clamp it, unless we add new parameter to the calc_delta_fair() to
use the new load rather than the current se->load.

> Unlike the reweight_eevdf() case, there might be an argument for doing
> it after scaling in this case. Namely, you can have multiple reweights
> stacking their scale ops.
>
>

Yes, I saw that your patch which clamp the vlag before scaling, I'll have a try
on that patch. Xuewen should post that v2 if everything works well.

> Also, could you put a few words in on how often these clips are hit? I
> suspect it's fairly rare (but crucial when it does).

Everytime it happens a subsequent NULL pointer exception usually happens.
The trace data was posted here:

https://lore.kernel.org/lkml/ZiAWTU5xb%2FJMn%2FHs@chenyu5-mobl2/

"
Here is the debug log printed by place_entity():

[ 397.597268]cfs_rq:0xe75f7100
cfs_rq.avg_vruntime:-1111846207333767
cfs_rq.min_vruntime:810640668779
avg_vruntime():686982466017
curr(0xc59f4f20 rb_producer weight:15 vruntime:1447773196654 sum_exec_ns:187707021870 ctx(0 73)
leftmost(0xeacb6e00 vruntime:332464705486 sum_exec_ns:78776125437 load:677)
..

[ 397.877251]cfs_rq:0xe75f7100
cfs_rq.avg_vruntime:-759390883821798
cfs_rq.min_vruntime:810640668779
avg_vruntime(): 689577229374
curr(0xc59f4f20 rb_producer weight:15 vruntime:1453640907998 sum_ns:187792974673 ctx(0 73)
leftmost(0xeacb6e00 vruntime:-59752941080010 sum_ns:78776125437 load:4)

The leftmost se is a task group, its vruntime reduces from 332464705486 to
-59752941080010, because its load reduced from 677 to 4 due to update_cfs_group()
on the tree entities.

Back to reweight_entity():
vlag = avruntime - se->vruntime = 689577229374 - 332464705486 = 357112523888;
vlag = vlag * old_weight / weight = 357112523888 * 677 / 4 = 60441294668044; <-------- ouch!
se->vruntime = avruntime - vlag = -59751717438670;

the new se vruntime -59751717438670 is close to what we printed -59752941080010,
consider that the avg_vruntime() vary.

Then later this leftmost se has changed its load back and forth, and when the load is 2,
the vuntime has reached a dangerous threshold to trigger the s64 overflow in
eligible check:

[ 398.011991]cfs_rq:0xe75f7100
cfs_rq.avg_vruntime:-11875977385353427
cfs_rq.min_vruntime:810640668779
cfs_rq.avg_load:96985
leftmost(0xeacb6e00 vruntime:18446623907344963655 load:2)

vruntime_eligible()
{
key = se.vruntime - cfs_rq.min_vruntime = -120977005256740;
key * avg_load <--------------------- OVERFLOW s64
}
"

2024-04-19 16:24:58

[permalink] [raw]

Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

On Fri, Apr 19, 2024 at 06:04:25PM +0800, Chen Yu wrote:

> > Also, could you put a few words in on how often these clips are hit? I
> > suspect it's fairly rare (but crucial when it does).
>
> Everytime it happens a subsequent NULL pointer exception usually happens.
> The trace data was posted here:

That was not the question. Suppose I put a printk whenever the clamp()
actually changes the value, do I get a stormflood of output, or do I get
a sporadic blip.

2024-04-19 17:33:36