2016-04-01 19:43:03

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH] nohz_full: Make sched_should_stop_tick() more conservative

On arm64, when calling enqueue_task_fair() from migration_cpu_stop(),
we find the nr_running value updated by add_nr_running(), but the
cfs.nr_running value has not always yet been updated. Accordingly,
the sched_can_stop_tick() false returns true when we are migrating a
second task onto a core.

Correct this by using rq->nr_running instead of rq->cfs.nr_running.
This should always be more conservative, and reverts the test to the
form it had before commit 76d92ac305f2 ("sched: Migrate sched to use
new tick dependency mask model").

Signed-off-by: Chris Metcalf <[email protected]>
---
I found this bug because I had a program running in nohz_full
on a core, and from a different core I called sched_setaffinity()
to force that task onto the nohz_full core, but I did not end up with
a kick to the nohz_full core, so tick-based scheduling did not start.
This is probably bad enough that we should fix it for 4.6.

Strangely, for some reason, the existing code worked correctly for me
for tilegx, but not for arm64. I see that the enqueue_task_fair()
code calls enqueue_entity(), which calls account_entity_enqueue() to
adjust cfs.nr_running. That seemed to happen on tilegx, but not arm64.
Perhaps there is some difference in how the sched_entity stuff is done,
but frankly that took me a little deeper into the CFS stuff than I was
willing to dive in this moment.

I could also argue that sched/core.c shouldn't have a lot of CFS
stuff in it anyway, and if we view the FIFO/RR stuff as handling the
real special cases in sched_can_stop_tick() anyway, then just checking
the core nr_running feels like the right thing to do regardless.

kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 00649f7ad567..1737d63c65fa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -599,7 +599,7 @@ bool sched_can_stop_tick(struct rq *rq)
}

/* Normal multitasking need periodic preemption checks */
- if (rq->cfs.nr_running > 1)
+ if (rq->nr_running > 1)
return false;

return true;
--
2.7.2


2016-04-04 19:12:28

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] nohz_full: Make sched_should_stop_tick() more conservative

On Fri, 2016-04-01 at 15:42 -0400, Chris Metcalf wrote:
> On arm64, when calling enqueue_task_fair() from migration_cpu_stop(),
> we find the nr_running value updated by add_nr_running(), but the
> cfs.nr_running value has not always yet been updated.  Accordingly,
> the sched_can_stop_tick() false returns true when we are migrating a
> second task onto a core.

I don't get it.

Looking at the enqueue_task_fair(), I see this:

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                cfs_rq->h_nr_running++;
...
}

        if (!se)
                add_nr_running(rq, 1);

What is the difference between cfs_rq->h_nr_running,
and rq->cfs.nr_running?

Why do we have two?

Are we simply testing against the wrong one in
sched_can_stop_tick?

> Correct this by using rq->nr_running instead of rq->cfs.nr_running.
> This should always be more conservative, and reverts the test to the
> form it had before commit 76d92ac305f2 ("sched: Migrate sched to use
> new tick dependency mask model").

That would cause us to run the timer tick while running
a single SCHED_RR real time task, with a single
SCHED_OTHER task sitting in the background (which will
not get run until the SCHED_RR task is done).

I don't think that is the quite behaviour we want.

> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> I found this bug because I had a program running in nohz_full
> on a core, and from a different core I called sched_setaffinity()
> to force that task onto the nohz_full core, but I did not end up with
> a kick to the nohz_full core, so tick-based scheduling did not start.
> This is probably bad enough that we should fix it for 4.6.
>
> Strangely, for some reason, the existing code worked correctly for me
> for tilegx, but not for arm64.  I see that the enqueue_task_fair()
> code calls enqueue_entity(), which calls account_entity_enqueue() to
> adjust cfs.nr_running.  That seemed to happen on tilegx, but not
> arm64.
> Perhaps there is some difference in how the sched_entity stuff is
> done,
> but frankly that took me a little deeper into the CFS stuff than I
> was
> willing to dive in this moment.
>
> I could also argue that sched/core.c shouldn't have a lot of CFS
> stuff in it anyway, and if we view the FIFO/RR stuff as handling the
> real special cases in sched_can_stop_tick() anyway, then just
> checking
> the core nr_running feels like the right thing to do regardless.
>
>  kernel/sched/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 00649f7ad567..1737d63c65fa 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -599,7 +599,7 @@ bool sched_can_stop_tick(struct rq *rq)
>   }
>  
>   /* Normal multitasking need periodic preemption checks */
> - if (rq->cfs.nr_running > 1)
> + if (rq->nr_running > 1)
>   return false;
>  
>   return true;
--
All Rights Reversed.


Attachments:
signature.asc (473.00 B)
This is a digitally signed message part

2016-04-04 19:23:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] nohz_full: Make sched_should_stop_tick() more conservative



On 4 April 2016 21:12:23 CEST, Rik van Riel <[email protected]> wrote:

>What is the difference between cfs_rq->h_nr_running,
>and rq->cfs.nr_running?
>
>Why do we have two?


H is for hierarchy. That counts the total of runnable tasks in the entire child hierarchy. Nr_running is the number of se entities in the current tree.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2016-04-04 19:32:12

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH] nohz_full: Make sched_should_stop_tick() more conservative

On 4/4/2016 3:12 PM, Rik van Riel wrote:
> On Fri, 2016-04-01 at 15:42 -0400, Chris Metcalf wrote:
>> On arm64, when calling enqueue_task_fair() from migration_cpu_stop(),
>> we find the nr_running value updated by add_nr_running(), but the
>> cfs.nr_running value has not always yet been updated. Accordingly,
>> the sched_can_stop_tick() false returns true when we are migrating a
>> second task onto a core.
> I don't get it.
>
> Looking at the enqueue_task_fair(), I see this:
>
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);
> cfs_rq->h_nr_running++;
> ...
> }
>
> if (!se)
> add_nr_running(rq, 1);
>
> What is the difference between cfs_rq->h_nr_running,
> and rq->cfs.nr_running?
>
> Why do we have two?
> Are we simply testing against the wrong one in
> sched_can_stop_tick?

It seems that using the non-CFS one is what we want. I don't know whether
using a different CFS count instead might be more correct.

Since I'm not sure what causes the difference I see between tile (correct)
and arm64 (incorrect) it's hard for me to speculate.

>> Correct this by using rq->nr_running instead of rq->cfs.nr_running.
>> This should always be more conservative, and reverts the test to the
>> form it had before commit 76d92ac305f2 ("sched: Migrate sched to use
>> new tick dependency mask model").
> That would cause us to run the timer tick while running
> a single SCHED_RR real time task, with a single
> SCHED_OTHER task sitting in the background (which will
> not get run until the SCHED_RR task is done).

No, because in sched_can_stop_tick(), we first handle the special
cases of RR or FIFO tasks present. For example, RR:

if (rq->rt.rr_nr_running) {
if (rq->rt.rr_nr_running == 1)
return true;
else
return false;
}

Once we see there's any RR tasks running, the return value
ignores any possible SCHED_OTHER tasks. Only after the code
concludes there are no RR/FIFO tasks do we even look at
the over nr_running value.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

2016-04-04 19:36:16

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] nohz_full: Make sched_should_stop_tick() more conservative

On Mon, 2016-04-04 at 15:31 -0400, Chris Metcalf wrote:
> On 4/4/2016 3:12 PM, Rik van Riel wrote:
> >
> > On Fri, 2016-04-01 at 15:42 -0400, Chris Metcalf wrote:
> > >
> > > On arm64, when calling enqueue_task_fair() from
> > > migration_cpu_stop(),
> > > we find the nr_running value updated by add_nr_running(), but the
> > > cfs.nr_running value has not always yet been
> > > updated.  Accordingly,
> > > the sched_can_stop_tick() false returns true when we are
> > > migrating a
> > > second task onto a core.
> > I don't get it.
> >
> > Looking at the enqueue_task_fair(), I see this:
> >
> >          for_each_sched_entity(se) {
> >                  cfs_rq = cfs_rq_of(se);
> >                  cfs_rq->h_nr_running++;
> > ...
> > }
> >
> >          if (!se)
> >                  add_nr_running(rq, 1);
> >
> > What is the difference between cfs_rq->h_nr_running,
> > and rq->cfs.nr_running?
> >
> > Why do we have two?
> > Are we simply testing against the wrong one in
> > sched_can_stop_tick?
> It seems that using the non-CFS one is what we want.  I don't know
> whether
> using a different CFS count instead might be more correct.
>
> Since I'm not sure what causes the difference I see between tile
> (correct)
> and arm64 (incorrect) it's hard for me to speculate.
>
> >
> > >
> > > Correct this by using rq->nr_running instead of rq-
> > > >cfs.nr_running.
> > > This should always be more conservative, and reverts the test to
> > > the
> > > form it had before commit 76d92ac305f2 ("sched: Migrate sched to
> > > use
> > > new tick dependency mask model").
> > That would cause us to run the timer tick while running
> > a single SCHED_RR real time task, with a single
> > SCHED_OTHER task sitting in the background (which will
> > not get run until the SCHED_RR task is done).
> No, because in sched_can_stop_tick(), we first handle the special
> cases of RR or FIFO tasks present.  For example, RR:
>
>          if (rq->rt.rr_nr_running) {
>                  if (rq->rt.rr_nr_running == 1)
>                          return true;
>                  else
>                          return false;
>          }
>
> Once we see there's any RR tasks running, the return value
> ignores any possible SCHED_OTHER tasks.  Only after the code
> concludes there are no RR/FIFO tasks do we even look at
> the over nr_running value.

OK, fair enough. I guess both of the RT cases are
covered already.

Patch gets my:

Acked-by: Rik van Riel <[email protected]>

--
All Rights Reversed.


Attachments:
signature.asc (473.00 B)
This is a digitally signed message part

2016-04-05 00:27:49

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH] nohz_full: Make sched_should_stop_tick() more conservative

On 4/4/2016 3:36 PM, Rik van Riel wrote:
>>> On Fri, 2016-04-01 at 15:42 -0400, Chris Metcalf wrote:
>>>> On arm64, when calling enqueue_task_fair() from
>>>> migration_cpu_stop(),
>>>> we find the nr_running value updated by add_nr_running(), but the
>>>> cfs.nr_running value has not always yet been
>>>> updated. Accordingly,
>>>> the sched_can_stop_tick() false returns true when we are
>>>> migrating a
>>>> second task onto a core.
>>>> Correct this by using rq->nr_running instead of rq-
>>>>> cfs.nr_running.
>>>> This should always be more conservative, and reverts the test to
>>>> the
>>>> form it had before commit 76d92ac305f2 ("sched: Migrate sched to
>>>> use
>>>> new tick dependency mask model").
>>>>
>>>>
> [...]
>
> Patch gets my:
>
> Acked-by: Rik van Riel <[email protected]>

Thanks! Whose tree should this go through: Frederic, PeterZ, Ingo?
Do any of you have any concerns with it?

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

2016-04-18 02:00:44

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH] nohz_full: Make sched_should_stop_tick() more conservative

Hi Peterz,
2016-04-05 3:23 GMT+08:00 Peter Zijlstra <[email protected]>:
>
>
> On 4 April 2016 21:12:23 CEST, Rik van Riel <[email protected]> wrote:
>
>>What is the difference between cfs_rq->h_nr_running,
>>and rq->cfs.nr_running?
>>
>>Why do we have two?
>
>
> H is for hierarchy. That counts the total of runnable tasks in the entire child hierarchy. Nr_running is the number of se entities in the current tree.

So I think we should at least change cfs_rq->nr_running to
cfs->h_nr_running, I can send a formal patch if you think it makes
sense. :-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1159423..79197df 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -616,7 +616,7 @@ bool sched_can_stop_tick(struct rq *rq)
}

/* Normal multitasking need periodic preemption checks */
- if (rq->cfs.nr_running > 1)
+ if (rq->cfs.h_nr_running > 1)
return false;

return true;

Regards,
Wanpeng Li

2016-04-21 14:42:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] nohz_full: Make sched_should_stop_tick() more conservative

On Mon, Apr 18, 2016 at 10:00:42AM +0800, Wanpeng Li wrote:
> > H is for hierarchy. That counts the total of runnable tasks in the
> > entire child hierarchy. Nr_running is the number of se entities in
> > the current tree.
>
> So I think we should at least change cfs_rq->nr_running to
> cfs->h_nr_running, I can send a formal patch if you think it makes
> sense. :-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1159423..79197df 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -616,7 +616,7 @@ bool sched_can_stop_tick(struct rq *rq)
> }
>
> /* Normal multitasking need periodic preemption checks */
> - if (rq->cfs.nr_running > 1)
> + if (rq->cfs.h_nr_running > 1)
> return false;
>
> return true;

So I think that is indeed the right thing here. But looking at this
function I think there's more problems with it.

It seems to assume that if there's FIFO tasks, those will run. This is
incorrect. The FIFO task can have a lower prio than an RR task, in which
case the RR task will run.

So the whole fifo_nr_running test seems misplaced, it should go after
the rr_nr_running tests. That is, only if !rr_nr_running, can we use
fifo_nr_running like this.

2016-04-21 16:03:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] nohz_full: Make sched_should_stop_tick() more conservative

On Thu, Apr 21, 2016 at 04:42:13PM +0200, Peter Zijlstra wrote:

> So I think that is indeed the right thing here. But looking at this
> function I think there's more problems with it.
>
> It seems to assume that if there's FIFO tasks, those will run. This is
> incorrect. The FIFO task can have a lower prio than an RR task, in which
> case the RR task will run.
>
> So the whole fifo_nr_running test seems misplaced, it should go after
> the rr_nr_running tests. That is, only if !rr_nr_running, can we use
> fifo_nr_running like this.

A little something like so perhaps; can anybody test?

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ffec7d9e7763..4240686f6857 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -596,17 +596,8 @@ bool sched_can_stop_tick(struct rq *rq)
return false;

/*
- * FIFO realtime policy runs the highest priority task (after DEADLINE).
- * Other runnable tasks are of a lower priority. The scheduler tick
- * isn't needed.
- */
- fifo_nr_running = rq->rt.rt_nr_running - rq->rt.rr_nr_running;
- if (fifo_nr_running)
- return true;
-
- /*
- * Round-robin realtime tasks time slice with other tasks at the same
- * realtime priority.
+ * If there are more than one RR tasks, we need the tick to effect the
+ * actual RR behaviour.
*/
if (rq->rt.rr_nr_running) {
if (rq->rt.rr_nr_running == 1)
@@ -615,8 +606,20 @@ bool sched_can_stop_tick(struct rq *rq)
return false;
}

- /* Normal multitasking need periodic preemption checks */
- if (rq->cfs.nr_running > 1)
+ /*
+ * If there's no RR tasks, but FIFO tasks, we can skip the tick, no
+ * forced preemption between FIFO tasks.
+ */
+ fifo_nr_running = rq->rt.rt_nr_running - rq->rt.rr_nr_running;
+ if (fifo_nr_running)
+ return true;
+
+ /*
+ * If there are no DL,RR/FIFO tasks, there must only be CFS tasks left;
+ * if there's more than one we need the tick for involuntary
+ * preemption.
+ */
+ if (rq->nr_running > 1)
return false;

return true;

2016-04-25 21:30:51

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH] nohz_full: Make sched_should_stop_tick() more conservative

On 4/21/2016 12:03 PM, Peter Zijlstra wrote:
> On Thu, Apr 21, 2016 at 04:42:13PM +0200, Peter Zijlstra wrote:
>
>> So I think that is indeed the right thing here. But looking at this
>> function I think there's more problems with it.
>>
>> It seems to assume that if there's FIFO tasks, those will run. This is
>> incorrect. The FIFO task can have a lower prio than an RR task, in which
>> case the RR task will run.
>>
>> So the whole fifo_nr_running test seems misplaced, it should go after
>> the rr_nr_running tests. That is, only if !rr_nr_running, can we use
>> fifo_nr_running like this.
> A little something like so perhaps; can anybody test?

Tested-by: Chris Metcalf <[email protected]>

To be clear, I only tested that it fixed my original bug, where we weren't
kicking a remote cpu when we should have been; I have not tested that
it works properly in the presence of RR or FIFO scheduled tasks.

But this or something like it should definitely go into 4.6 before it's done.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

Subject: [tip:sched/urgent] nohz/full, sched/rt: Fix missed tick-reenabling bug in sched_can_stop_tick()

Commit-ID: 2548d546d40c0014efdde88a53bf7896e917dcce
Gitweb: http://git.kernel.org/tip/2548d546d40c0014efdde88a53bf7896e917dcce
Author: Peter Zijlstra <[email protected]>
AuthorDate: Thu, 21 Apr 2016 18:03:15 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 28 Apr 2016 10:28:55 +0200

nohz/full, sched/rt: Fix missed tick-reenabling bug in sched_can_stop_tick()

Chris Metcalf reported a that sched_can_stop_tick() sometimes fails to
re-enable the tick.

His observed problem is that rq->cfs.nr_running can be 1 even though
there are multiple runnable CFS tasks. This happens in the cgroup
case, in which case cfs.nr_running is the number of runnable entities
for that level.

If there is a single runnable cgroup (which can have an arbitrary
number of runnable child entries itself) rq->cfs.nr_running will be 1.

However, looking at that function I think there's more problems with it.

It seems to assume that if there's FIFO tasks, those will run. This is
incorrect. The FIFO task can have a lower prio than an RR task, in which
case the RR task will run.

So the whole fifo_nr_running test seems misplaced, it should go after
the rr_nr_running tests. That is, only if !rr_nr_running, can we use
fifo_nr_running like this.

Reported-by: Chris Metcalf <[email protected]>
Tested-by: Chris Metcalf <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Luiz Capitulino <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Wanpeng Li <[email protected]>
Fixes: 76d92ac305f2 ("sched: Migrate sched to use new tick dependency mask model")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 29 ++++++++++++++++-------------
1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b489fc..d1f7149 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -596,17 +596,8 @@ bool sched_can_stop_tick(struct rq *rq)
return false;

/*
- * FIFO realtime policy runs the highest priority task (after DEADLINE).
- * Other runnable tasks are of a lower priority. The scheduler tick
- * isn't needed.
- */
- fifo_nr_running = rq->rt.rt_nr_running - rq->rt.rr_nr_running;
- if (fifo_nr_running)
- return true;
-
- /*
- * Round-robin realtime tasks time slice with other tasks at the same
- * realtime priority.
+ * If there are more than one RR tasks, we need the tick to effect the
+ * actual RR behaviour.
*/
if (rq->rt.rr_nr_running) {
if (rq->rt.rr_nr_running == 1)
@@ -615,8 +606,20 @@ bool sched_can_stop_tick(struct rq *rq)
return false;
}

- /* Normal multitasking need periodic preemption checks */
- if (rq->cfs.nr_running > 1)
+ /*
+ * If there's no RR tasks, but FIFO tasks, we can skip the tick, no
+ * forced preemption between FIFO tasks.
+ */
+ fifo_nr_running = rq->rt.rt_nr_running - rq->rt.rr_nr_running;
+ if (fifo_nr_running)
+ return true;
+
+ /*
+ * If there are no DL,RR/FIFO tasks, there must only be CFS tasks left;
+ * if there's more than one we need the tick for involuntary
+ * preemption.
+ */
+ if (rq->nr_running > 1)
return false;

return true;

2016-04-28 13:30:23

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [tip:sched/urgent] nohz/full, sched/rt: Fix missed tick-reenabling bug in sched_can_stop_tick()

On Thu, Apr 28, 2016 at 03:24:43AM -0700, tip-bot for Peter Zijlstra wrote:
> Commit-ID: 2548d546d40c0014efdde88a53bf7896e917dcce
> Gitweb: http://git.kernel.org/tip/2548d546d40c0014efdde88a53bf7896e917dcce
> Author: Peter Zijlstra <[email protected]>
> AuthorDate: Thu, 21 Apr 2016 18:03:15 +0200
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Thu, 28 Apr 2016 10:28:55 +0200
>
> nohz/full, sched/rt: Fix missed tick-reenabling bug in sched_can_stop_tick()
>
> Chris Metcalf reported a that sched_can_stop_tick() sometimes fails to
> re-enable the tick.
>
> His observed problem is that rq->cfs.nr_running can be 1 even though
> there are multiple runnable CFS tasks. This happens in the cgroup
> case, in which case cfs.nr_running is the number of runnable entities
> for that level.
>
> If there is a single runnable cgroup (which can have an arbitrary
> number of runnable child entries itself) rq->cfs.nr_running will be 1.
>
> However, looking at that function I think there's more problems with it.
>
> It seems to assume that if there's FIFO tasks, those will run. This is
> incorrect. The FIFO task can have a lower prio than an RR task, in which
> case the RR task will run.
>
> So the whole fifo_nr_running test seems misplaced, it should go after
> the rr_nr_running tests. That is, only if !rr_nr_running, can we use
> fifo_nr_running like this.

Thanks for this patch. I indeed made confusions around SCHED_RR and SCHED_FIFO priorities.

Too late for me to ACK but I would have. Thanks!