2023-06-30 14:27:53

by Phil Auld

Subject: [PATCH v4] Sched/fair: Block nohz tick_stop when cfs bandwidth in use

CFS bandwidth limits and NOHZ full don't play well together. Tasks
can easily run well past their quotas before a remote tick does
accounting. This leads to long, multi-period stalls before such
tasks can run again. Currently, when presented with these conflicting
requirements, the scheduler favors nohz_full and lets the tick be
stopped. However, nohz tick stopping is already best-effort; there
are a number of conditions that can prevent it, whereas cfs runtime
bandwidth is expected to be enforced.

Make the scheduler favor bandwidth over stopping the tick by setting
TICK_DEP_BIT_SCHED when the only running task is a cfs task with
runtime limit enabled.

Add sched_feat HZ_BW (off by default) to control this behavior.

Signed-off-by: Phil Auld <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
---
v4: Made checks for runtime_enabled hierarchical.

v3: Moved sched_cfs_bandwidth_active() prototype to sched.h outside of
guards to silence -Wmissing-prototypes.

v2: Ben pointed out that the bit could get cleared in the dequeue path
if we migrate a newly enqueued task without preempting curr. Added a
check for that edge case to sched_can_stop_tick. Removed the call to
sched_can_stop_tick from sched_fair_update_stop_tick since it was
redundant.

kernel/sched/core.c | 10 ++++++++
kernel/sched/fair.c | 53 +++++++++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 2 ++
kernel/sched/sched.h | 1 +
4 files changed, 66 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a68d1276bab0..2685373e12f1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1229,6 +1229,16 @@ bool sched_can_stop_tick(struct rq *rq)
if (rq->nr_running > 1)
return false;

+ /*
+ * If there is one task and it has CFS runtime bandwidth constraints
+ * and it's on the cpu now we don't want to stop the tick.
+ */
+ if (sched_feat(HZ_BW) && rq->nr_running == 1 && rq->curr
+ && rq->curr->sched_class == &fair_sched_class && task_on_rq_queued(rq->curr)) {
+ if (sched_cfs_bandwidth_active(rq->curr))
+ return false;
+ }
+
return true;
}
#endif /* CONFIG_NO_HZ_FULL */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f55884..125b1ec4476f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6139,6 +6139,50 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
rcu_read_unlock();
}

+#ifdef CONFIG_NO_HZ_FULL
+static inline bool cfs_se_bandwidth_enabled(struct sched_entity *se)
+{
+ int ret = 0;
+
+ for_each_sched_entity(se)
+ ret += cfs_rq_of(se)->runtime_enabled;
+
+ return ret != 0;
+}
+
+bool sched_cfs_bandwidth_active(struct task_struct *p)
+{
+ if (cfs_bandwidth_used() && cfs_se_bandwidth_enabled(&p->se))
+ return true;
+
+ return false;
+}
+
+/* called from pick_next_task_fair() */
+static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
+{
+ int cpu = cpu_of(rq);
+
+ if (!sched_feat(HZ_BW) || !cfs_bandwidth_used())
+ return;
+
+ if (!tick_nohz_full_cpu(cpu))
+ return;
+
+ if (rq->nr_running != 1)
+ return;
+
+ /*
+ * We know there is only one task runnable and we've just picked it. The
+ * normal enqueue path will have cleared TICK_DEP_BIT_SCHED if we will
+ * be otherwise able to stop the tick. Just need to check if we are using
+ * bandwidth control.
+ */
+ if (cfs_se_bandwidth_enabled(&p->se))
+ tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
+}
+#endif
+
#else /* CONFIG_CFS_BANDWIDTH */

static inline bool cfs_bandwidth_used(void)
@@ -6181,9 +6225,17 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
static inline void update_runtime_enabled(struct rq *rq) {}
static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}
+bool sched_cfs_bandwidth_active(struct task_struct *p)
+{
+ return false;
+}

#endif /* CONFIG_CFS_BANDWIDTH */

+#if !defined(CONFIG_CFS_BANDWIDTH) || !defined(CONFIG_NO_HZ_FULL)
+static inline void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p) {}
+#endif
+
/**************************************************
* CFS operations on tasks:
*/
@@ -8097,6 +8149,7 @@ done: __maybe_unused;
hrtick_start_fair(rq, p);

update_misfit_status(p, rq);
+ sched_fair_update_stop_tick(rq, p);

return p;

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..6fdf1fdf6b17 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -101,3 +101,5 @@ SCHED_FEAT(LATENCY_WARN, false)

SCHED_FEAT(ALT_PERIOD, true)
SCHED_FEAT(BASE_SLICE, true)
+
+SCHED_FEAT(HZ_BW, false)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec7b3e0a2b20..41f80dd5538d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2789,6 +2789,7 @@ extern void init_dl_rq(struct dl_rq *dl_rq);

extern void cfs_bandwidth_usage_inc(void);
extern void cfs_bandwidth_usage_dec(void);
+extern bool sched_cfs_bandwidth_active(struct task_struct *p);

#ifdef CONFIG_NO_HZ_COMMON
#define NOHZ_BALANCE_KICK_BIT 0
--
2.31.1



2023-06-30 15:26:48

by Peter Zijlstra

Subject: Re: [PATCH v4] Sched/fair: Block nohz tick_stop when cfs bandwidth in use

On Fri, Jun 30, 2023 at 09:57:14AM -0400, Phil Auld wrote:

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a68d1276bab0..2685373e12f1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1229,6 +1229,16 @@ bool sched_can_stop_tick(struct rq *rq)
> if (rq->nr_running > 1)
> return false;
>
> + /*
> + * If there is one task and it has CFS runtime bandwidth constraints
> + * and it's on the cpu now we don't want to stop the tick.
> + */
> + if (sched_feat(HZ_BW) && rq->nr_running == 1 && rq->curr
> + && rq->curr->sched_class == &fair_sched_class && task_on_rq_queued(rq->curr)) {

&& goes at the end of the previous line

rq->curr is never NULL

But surely you can find a saner way to write this?
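Something like this, perhaps (untested, just flattening the condition):

	if (sched_feat(HZ_BW) && rq->nr_running == 1 &&
	    rq->curr->sched_class == &fair_sched_class &&
	    task_on_rq_queued(rq->curr) &&
	    sched_cfs_bandwidth_active(rq->curr))
		return false;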

> + if (sched_cfs_bandwidth_active(rq->curr))
> + return false;
> + }
> +
> return true;
> }
> #endif /* CONFIG_NO_HZ_FULL */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 373ff5f55884..125b1ec4476f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6139,6 +6139,50 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
> rcu_read_unlock();
> }
>
> +#ifdef CONFIG_NO_HZ_FULL
> +static inline bool cfs_se_bandwidth_enabled(struct sched_entity *se)
> +{
> + int ret = 0;
> +
> + for_each_sched_entity(se)
> + ret += cfs_rq_of(se)->runtime_enabled;
> +
> + return ret != 0;
> +}
> +
> +bool sched_cfs_bandwidth_active(struct task_struct *p)
> +{
> + if (cfs_bandwidth_used() && cfs_se_bandwidth_enabled(&p->se))
> + return true;
> +
> + return false;
> +}
> +
> +/* called from pick_next_task_fair() */
> +static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
> +{
> + int cpu = cpu_of(rq);
> +
> + if (!sched_feat(HZ_BW) || !cfs_bandwidth_used())
> + return;
> +
> + if (!tick_nohz_full_cpu(cpu))
> + return;
> +
> + if (rq->nr_running != 1)
> + return;
> +
> + /*
> + * We know there is only one task runnable and we've just picked it. The
> + * normal enqueue path will have cleared TICK_DEP_BIT_SCHED if we will
> + * be otherwise able to stop the tick. Just need to check if we are using
> + * bandwidth control.
> + */
> + if (cfs_se_bandwidth_enabled(&p->se))
> + tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
> +}

Yeah, I think not; pick_next_task_fair() just walked the cgroup
hierarchy and now you do it again.

Also, why does this code exist at all? Both enqueue/dequeue already end
up in sched_update_tick_dependency() and should be able to handle the
nr_running==1 with bandwidth crap, no?
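For reference, the existing flow is roughly:

	enqueue_task() / dequeue_task()
	  -> add_nr_running() / sub_nr_running()
	    -> sched_update_tick_dependency()
	      -> sched_can_stop_tick() ? clear : set TICK_DEP_BIT_SCHED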

2023-06-30 15:49:22

by Phil Auld

Subject: Re: [PATCH v4] Sched/fair: Block nohz tick_stop when cfs bandwidth in use

Hi Peter,

On Fri, Jun 30, 2023 at 05:06:41PM +0200 Peter Zijlstra wrote:
> On Fri, Jun 30, 2023 at 09:57:14AM -0400, Phil Auld wrote:
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index a68d1276bab0..2685373e12f1 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1229,6 +1229,16 @@ bool sched_can_stop_tick(struct rq *rq)
> > if (rq->nr_running > 1)
> > return false;
> >
> > + /*
> > + * If there is one task and it has CFS runtime bandwidth constraints
> > + * and it's on the cpu now we don't want to stop the tick.
> > + */
> > + if (sched_feat(HZ_BW) && rq->nr_running == 1 && rq->curr
> > + && rq->curr->sched_class == &fair_sched_class && task_on_rq_queued(rq->curr)) {
>
> && goes at the end of the previous line
>
> rq->curr is never NULL
>
> But surely you can find a saner way to write this?
>

Okay, I'll try to clean that up.


> > + if (sched_cfs_bandwidth_active(rq->curr))
> > + return false;
> > + }
> > +
> > return true;
> > }
> > #endif /* CONFIG_NO_HZ_FULL */
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 373ff5f55884..125b1ec4476f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6139,6 +6139,50 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
> > rcu_read_unlock();
> > }
> >
> > +#ifdef CONFIG_NO_HZ_FULL
> > +static inline bool cfs_se_bandwidth_enabled(struct sched_entity *se)
> > +{
> > + int ret = 0;
> > +
> > + for_each_sched_entity(se)
> > + ret += cfs_rq_of(se)->runtime_enabled;
> > +
> > + return ret != 0;
> > +}
> > +
> > +bool sched_cfs_bandwidth_active(struct task_struct *p)
> > +{
> > + if (cfs_bandwidth_used() && cfs_se_bandwidth_enabled(&p->se))
> > + return true;
> > +
> > + return false;
> > +}
> > +
> > +/* called from pick_next_task_fair() */
> > +static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
> > +{
> > + int cpu = cpu_of(rq);
> > +
> > + if (!sched_feat(HZ_BW) || !cfs_bandwidth_used())
> > + return;
> > +
> > + if (!tick_nohz_full_cpu(cpu))
> > + return;
> > +
> > + if (rq->nr_running != 1)
> > + return;
> > +
> > + /*
> > + * We know there is only one task runnable and we've just picked it. The
> > + * normal enqueue path will have cleared TICK_DEP_BIT_SCHED if we will
> > + * be otherwise able to stop the tick. Just need to check if we are using
> > + * bandwidth control.
> > + */
> > + if (cfs_se_bandwidth_enabled(&p->se))
> > + tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
> > +}
>
> Yeah, I think not; pick_next_task_fair() just walked the cgroup
> hierarchy and now you do it again.
>
> Also, why does this code exist at all? Both enqueue/dequeue already end
> up in sched_update_tick_dependency() and should be able to handle the
> nr_running==1 with bandwidth crap, no?
>

No. Or at least not without plumbing the enqueued/dequeued task all the way
through. I can do it that way if you prefer, but that seemed a lot more
intrusive. When we are in sched_can_stop_tick() we don't have access to the
cfs task which will end up running. Curr is idle in that case. We'd have to
essentially run pick_next_task_fair() to find the task to check, which seemed
wrong. Maybe there is a better way?

The code in sched_can_stop_tick() was added to catch the edge case
of waking a second task and having it migrated before it runs, so that
the dequeue of the transient waking task does not clear the tick
dependency of the running bandwidth-enabled task.
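Roughly, the sequence that bites us:

	nohz_full cpu: bandwidth-limited task A running, TICK_DEP_BIT_SCHED set
	task B wakes on this cpu           -> nr_running 1 -> 2
	B is migrated away before it runs  -> nr_running 2 -> 1
	dequeue sees nr_running == 1       -> TICK_DEP_BIT_SCHED cleared
	tick stops while A, still bandwidth-limited, keeps running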


Thanks for taking a look. This is better than "OMG" :)


Cheers,
Phil

--


2023-06-30 16:30:20

by Peter Zijlstra

Subject: Re: [PATCH v4] Sched/fair: Block nohz tick_stop when cfs bandwidth in use

On Fri, Jun 30, 2023 at 11:28:24AM -0400, Phil Auld wrote:

> No. Or at least not without plumbing the enqueued/dequeued task all the way
> through. I can do it that way if you prefer, but that seemed a lot more
> intrusive. When we are in sched_can_stop_tick() we don't have access to the
> cfs task which will end up running. Curr is idle in that case. We'd have to
> essentially run pick_next_task_fair() to find the task to check, which seemed
> wrong. Maybe there is a better way?

Ah, you worry about the case where we have two runnable tasks, one
bandwidth constrained and the other not. One task goes away; how can we
tell what the remaining task is?

This is never a concern for add_nr_running(); the only case there is
0->1, and then only the hierarchy you just walked for enqueue is
relevant.

But if you remove the unconstrained task, sub_nr_running() can't tell
what the remaining task is.

Unless, of course, you have enqueue() set a bit somewhere in
task_struct::sched_bw_constrained:1.

Then pick and your should_stop thing can look at that, no?
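That is, roughly (a sketch, not even compiled):

	/* in task_struct */
	unsigned			sched_bw_constrained:1;

	/* set wherever the hierarchy is walked anyway; the should_stop
	 * check then becomes: */
	if (sched_feat(HZ_BW) && rq->nr_running == 1 &&
	    rq->curr->sched_bw_constrained)
		return false;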

2023-06-30 16:34:13

by Phil Auld

Subject: Re: [PATCH v4] Sched/fair: Block nohz tick_stop when cfs bandwidth in use

On Fri, Jun 30, 2023 at 06:05:34PM +0200 Peter Zijlstra wrote:
> On Fri, Jun 30, 2023 at 11:28:24AM -0400, Phil Auld wrote:
>
> > No. Or at least not without plumbing the enqueued/dequeued task all the way
> > through. I can do it that way if you prefer, but that seemed a lot more
> > intrusive. When we are in sched_can_stop_tick() we don't have access to the
> > cfs task which will end up running. Curr is idle in that case. We'd have to
> > essentially run pick_next_task_fair() to find the task to check, which seemed
> > wrong. Maybe there is a better way?
>
> Ah, you worry about the case where we have two runnable tasks, one
> bandwidth constrained and the other not. One task goes away; how can we
> tell what the remaining task is?

It didn't even have to be an unconstrained task before I added the
check in sched_can_stop_tick(). But yes, that check catches the case
where the one (still) running task is bw constrained.

>
> This is never a concern for add_nr_running(); the only case there is
> 0->1, and then only the hierarchy you just walked for enqueue is
> relevant.
>

Right, but we don't (currently) have the task here (hence the pick_next code).

> But if you remove the unconstrained task, sub_nr_running() can't tell
> what the remaining task is.
>
> Unless, of course, you have enqueue() set a bit somewhere in
> task_struct::sched_bw_constrained:1.
>
> Then pick and your should_stop thing can look at that, no?
>

Yes.

I think you are agreeing that I need the pick next code but need to remove
the hierarchy walks, right?

If I set the bit in task_struct it would do that. I'll go poke at that.


Thanks,
Phil

--


2023-07-03 12:17:00

by Peter Zijlstra

Subject: Re: [PATCH v4] Sched/fair: Block nohz tick_stop when cfs bandwidth in use

On Fri, Jun 30, 2023 at 12:29:10PM -0400, Phil Auld wrote:

> I think you are agreeing that I need the pick next code but need to remove
> the hierarchy walks, right?

Yeah, the dequeue case means we have to care about pick, not sure we
then also need to care about sched_update_tick_dependency() though.
There is indeed a window where these two will 'race', but afaict it is
benign.


2023-07-03 14:32:08

by Phil Auld

Subject: Re: [PATCH v4] Sched/fair: Block nohz tick_stop when cfs bandwidth in use

On Mon, Jul 03, 2023 at 10:10:56AM -0400 Phil Auld wrote:
> On Mon, Jul 03, 2023 at 02:10:09PM +0200 Peter Zijlstra wrote:
> > On Fri, Jun 30, 2023 at 12:29:10PM -0400, Phil Auld wrote:
> >
> > > I think you are agreeing that I need the pick next code but need to remove
> > > the hierarchy walks, right?
> >
> > Yeah, the dequeue case means we have to care about pick, not sure we
> > then also need to care about sched_update_tick_dependency() though.
> > There is indeed a window where these two will 'race', but afaict it is
> > benign.
> >
>
> Hm, that's confusing.
>
> As I see it, it's the enqueue case (0->1 mostly) where we need the check
> in pick. At that point in enqueue we only have a handle on ->curr which
> is the idle thread.
>
> For the dequeue case (2->1) we need the check in the
> sched_update_tick_dependency() path because if the 1 is the task on the
> cpu (and is staying there) then we'd otherwise clear the bit when we
> shouldn't (since we aren't going to go back through pick).
>

And, for clarity, the "bit" in that case is TICK_DEP_BIT_SCHED, not the new
task::bw_constrained field. I.e. we'll end up allowing the tick to stop
for a bw-limited task and thus not solve the problem.


> I'm thinking that I'll try to set the bit in pick since we only care about
> it when it's the task on the cpu. That, I think, will simplify the
> code needed to update the bit when the quota is changed (to or from
> RUNTIME_INF).
>
> Setting the bit in enqueue/dequeue means updating it on all the queued
> tasks if it changes. Although I may clear it in dequeue just to not leave
> it around stale.
>
>
> Cheers,
> Phil
> --
>

--


2023-07-03 14:41:38

by Phil Auld

Subject: Re: [PATCH v4] Sched/fair: Block nohz tick_stop when cfs bandwidth in use

On Mon, Jul 03, 2023 at 02:10:09PM +0200 Peter Zijlstra wrote:
> On Fri, Jun 30, 2023 at 12:29:10PM -0400, Phil Auld wrote:
>
> > I think you are agreeing that I need the pick next code but need to remove
> > the hierarchy walks, right?
>
> Yeah, the dequeue case means we have to care about pick, not sure we
> then also need to care about sched_update_tick_dependency() though.
> There is indeed a window where these two will 'race', but afaict it is
> benign.
>

Hm, that's confusing.

As I see it, it's the enqueue case (0->1 mostly) where we need the check
in pick. At that point in enqueue we only have a handle on ->curr which
is the idle thread.

For the dequeue case (2->1) we need the check in the
sched_update_tick_dependency() path because if the 1 is the task on the
cpu (and is staying there) then we'd otherwise clear the bit when we
shouldn't (since we aren't going to go back through pick).

I'm thinking that I'll try to set the bit in pick since we only care about
it when it's the task on the cpu. That, I think, will simplify the
code needed to update the bit when the quota is changed (to or from
RUNTIME_INF).

Setting the bit in enqueue/dequeue means updating it on all the queued
tasks if it changes. Although I may clear it in dequeue just to not leave
it around stale.
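Rough sketch of what I have in mind (not tested, flag name illustrative):

	/* pick_next_task_fair(): tag the task landing on the cpu */
	p->bw_constrained = cfs_bandwidth_used() &&
			    cfs_se_bandwidth_enabled(&p->se);

	/* and in the dequeue path, so it doesn't go stale: */
	p->bw_constrained = 0;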


Cheers,
Phil
--


2023-07-03 14:43:30

by Peter Zijlstra

Subject: Re: [PATCH v4] Sched/fair: Block nohz tick_stop when cfs bandwidth in use

On Mon, Jul 03, 2023 at 10:10:56AM -0400, Phil Auld wrote:
> On Mon, Jul 03, 2023 at 02:10:09PM +0200 Peter Zijlstra wrote:
> > On Fri, Jun 30, 2023 at 12:29:10PM -0400, Phil Auld wrote:
> >
> > > I think you are agreeing that I need the pick next code but need to remove
> > > the hierarchy walks, right?
> >
> > Yeah, the dequeue case means we have to care about pick, not sure we
> > then also need to care about sched_update_tick_dependency() though.
> > There is indeed a window where these two will 'race', but afaict it is
> > benign.
> >
>
> Hm, that's confusing.
>
> As I see it, it's the enqueue case (0->1 mostly) where we need the check
> in pick. At that point in enqueue we only have a handle on ->curr which
> is the idle thread.

Well, the 0->1 case is trivial: we'll run the task that's enqueued, and
as such everything can DTRT and be simple.

> For the dequeue case (2->1) we need the check in the
> sched_update_tick_dependency() path because if the 1 is the task on the
> cpu (and is staying there) then we'd otherwise clear the bit when we
> shouldn't (since we aren't going to go back through pick).

The 2->1 case OTOH is tricky, because then we'll end up running a task
we've not recently seen. sub_nr_running() will hit the ==1 case and
clear TICK_DEP_BIT_SCHED.

But then pick will come and set it again, no harm done, right?

.oO Ah! You're worried about the case where a task is already running,
a second task comes in (1->2) and then quickly leaves again (2->1)
without passing through schedule(). And you don't want to disable the
tick if that running task needs it.

Mooo :-(

> I'm thinking that I'll try to set the bit in pick since we only care about
> it when it's the task on the cpu. That, I think, will simplify the
> code needed to update the bit when the quota is changed (to or from
> RUNTIME_INF).
>
> Setting the bit in enqueue/dequeue means updating it on all the queued
> task if it changes. Although I may clear it in dequeue just to not leave
> it around stale.

Hmm, no, you have to set it on enqueue (1->2), otherwise the running task
doesn't get preempted when it runs out of slice.

And I don't suppose you want to delay clearing to the first tick after,
because NOHZ_FULL doesn't want spurious ticks :/

What a mess.

Please document all these stupid cases in a comment, otherwise we'll go
bananas trying to make sense of the code later on.

2023-07-03 16:26:18

by Phil Auld

Subject: Re: [PATCH v4] Sched/fair: Block nohz tick_stop when cfs bandwidth in use

On Mon, Jul 03, 2023 at 04:32:57PM +0200 Peter Zijlstra wrote:
> On Mon, Jul 03, 2023 at 10:10:56AM -0400, Phil Auld wrote:
> > On Mon, Jul 03, 2023 at 02:10:09PM +0200 Peter Zijlstra wrote:
> > > On Fri, Jun 30, 2023 at 12:29:10PM -0400, Phil Auld wrote:
> > >
> > > > I think you are agreeing that I need the pick next code but need to remove
> > > > the hierarchy walks, right?
> > >
> > > Yeah, the dequeue case means we have to care about pick, not sure we
> > > then also need to care about sched_update_tick_dependency() though.
> > > There is indeed a window where these two will 'race', but afaict it is
> > > benign.
> > >
> >
> > Hm, that's confusing.
> >
> > As I see it, it's the enqueue case (0->1 mostly) where we need the check
> > in pick. At that point in enqueue we only have a handle on ->curr which
> > is the idle thread.
>
> Well, the 0->1 case is trivial: we'll run the task that's enqueued, and
> as such everything can DTRT and be simple.
>

Right, but we have to do it (check the bw_constrained flag and set the TICK_DEP bit)
in pick because we don't have a handle on the task that's enqueued in
sched_update_tick_dependency(). Simple :)

> > For the dequeue case (2->1) we need the check in the
> > sched_update_tick_dependency() path because if the 1 is the task on the
> > cpu (and is staying there) then we'd otherwise clear the bit when we
> > shouldn't (since we aren't going to go back through pick).
>
> The 2->1 case OTOH is tricky, because then we'll end up running a task
> we've not recently seen. sub_nr_running() will hit the ==1 case and
> clear TICK_DEP_BIT_SCHED.
>
> But then pick will come and set it again, no harm done, right?
>
> .oO Ah! You're worried about the case where a task is already running,
> a second task comes in (1->2) and then quickly leaves again (2->1)
> without passing through schedule(). And you don't want to disable the
> tick if that running task needs it.
>
> Mooo :-(
>

Yeah, Ben pointed that out, and then I was able to put together an rt-app
workload that hits it reliably.

> > I'm thinking that I'll try to set the bit in pick since we only care about
> > it when it's the task on the cpu. That, I think, will simplify the
> > code needed to update the bit when the quota is changed (to or from
> > RUNTIME_INF).
> >
> > Setting the bit in enqueue/dequeue means updating it on all the queued
> > tasks if it changes. Although I may clear it in dequeue just to not leave
> > it around stale.
>
> Hmm, no, you have to set it on enqueue (1->2), otherwise the running task
> doesn't get preempted when it runs out of slice.

Sorry, I'm not sure I'm following. I meant the bw_constrained bit in the
task, not the actual TICK_DEP bit.

So in this case we don't go through pick because we may be preempting
from, say, a wakeup? If we stay at 2, none of this matters because
the existing tick_dependency stuff will work (nr_running > 1).

That's why I wanted to clarify which bit I was talking about where.

Ah... If we go from 1->2 via a wakeup and preempt rather than pick_next,
then the task would not get the bw_constrained bit set if we
then drop from 2->1. Right, okay. Will need to set it in enqueue
and update all queued tasks if bandwidth changes. Or also update it
in pick, maybe. I.e. make sure task::bw_constrained is still right when
we actually land on the cpu, because the only place we really care about
it is when we are ->curr.
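Something like this, I think (sketch, not tested):

	/* enqueue_task_fair(): we are walking the hierarchy anyway */
	for_each_sched_entity(se) {
		...
		if (cfs_rq_of(se)->runtime_enabled)
			p->bw_constrained = 1;
	}

	/* and tg_set_cfs_bandwidth() would then refresh the flag on the
	 * group's queued tasks when quota changes to/from RUNTIME_INF */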

>
> And I don't suppose you want to delay clearing to the first tick after,
> because NOHZ_FULL doesn't want spurious ticks :/

Here you mean clearing the TICK_DEP bit, yes?

>
> What a mess.
>
> Please document all these stupid cases in a comment, otherwise we'll go
> bananas trying to make sense of the code later on.
>

Will do.

Thanks for your input.


Cheers,
Phil


--