2023-06-22 13:39:19

by Phil Auld

Subject: [PATCH] Sched/fair: Block nohz tick_stop when cfs bandwidth in use

CFS bandwidth limits and NOHZ full don't play well together. Tasks
can easily run well past their quotas before a remote tick does
accounting. This leads to long, multi-period stalls before such
tasks can run again. Currently, when presented with these conflicting
requirements the scheduler is favoring nohz_full and letting the tick
be stopped. However, nohz tick stopping is already best-effort; there
are a number of conditions that can prevent it, whereas cfs runtime
bandwidth is expected to be enforced.

Make the scheduler favor bandwidth over stopping the tick by setting
TICK_DEP_BIT_SCHED when the only running task is a cfs task with
runtime limit enabled.

Add sched_feat HZ_BW (off by default) to control this behavior.

Signed-off-by: Phil Auld <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Ben Segall <[email protected]>
---
kernel/sched/fair.c | 33 ++++++++++++++++++++++++++++++++-
kernel/sched/features.h | 2 ++
2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f55884..880eadfac330 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6139,6 +6139,33 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
rcu_read_unlock();
}

+#ifdef CONFIG_NO_HZ_FULL
+/* called from pick_next_task_fair() */
+static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
+{
+ struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ int cpu = cpu_of(rq);
+
+ if (!sched_feat(HZ_BW) || !cfs_bandwidth_used())
+ return;
+
+ if (!tick_nohz_full_cpu(cpu))
+ return;
+
+ if (rq->nr_running != 1 || !sched_can_stop_tick(rq))
+ return;
+
+ /*
+ * We know there is only one task runnable and we've just picked it. The
+ * normal enqueue path will have cleared TICK_DEP_BIT_SCHED if we will
+ * be otherwise able to stop the tick. Just need to check if we are using
+ * bandwidth control.
+ */
+ if (cfs_rq->runtime_enabled)
+ tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
+}
+#endif
+
#else /* CONFIG_CFS_BANDWIDTH */

static inline bool cfs_bandwidth_used(void)
@@ -6181,9 +6208,12 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
static inline void update_runtime_enabled(struct rq *rq) {}
static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}
-
#endif /* CONFIG_CFS_BANDWIDTH */

+#if !defined(CONFIG_CFS_BANDWIDTH) || !defined(CONFIG_NO_HZ_FULL)
+static inline void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p) {}
+#endif
+
/**************************************************
* CFS operations on tasks:
*/
@@ -8097,6 +8127,7 @@ done: __maybe_unused;
hrtick_start_fair(rq, p);

update_misfit_status(p, rq);
+ sched_fair_update_stop_tick(rq, p);

return p;

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..6fdf1fdf6b17 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -101,3 +101,5 @@ SCHED_FEAT(LATENCY_WARN, false)

SCHED_FEAT(ALT_PERIOD, true)
SCHED_FEAT(BASE_SLICE, true)
+
+SCHED_FEAT(HZ_BW, false)
--
2.31.1



2023-06-22 13:55:56

by Phil Auld

Subject: Re: [PATCH] Sched/fair: Block nohz tick_stop when cfs bandwidth in use

On Thu, Jun 22, 2023 at 09:27:51AM -0400 Phil Auld wrote:
> CFS bandwidth limits and NOHZ full don't play well together. Tasks
> can easily run well past their quotas before a remote tick does
> accounting. This leads to long, multi-period stalls before such
> tasks can run again. Currently, when presented with these conflicting
> requirements the scheduler is favoring nohz_full and letting the tick
> be stopped. However, nohz tick stopping is already best-effort, there
> are a number of conditions that can prevent it, whereas cfs runtime
> bandwidth is expected to be enforced.
>
> Make the scheduler favor bandwidth over stopping the tick by setting
> TICK_DEP_BIT_SCHED when the only running task is a cfs task with
> runtime limit enabled.
>
> Add sched_feat HZ_BW (off by default) to control this behavior.

This replaces the previous HRTICK version. The problem addressed
is causing significant issues for containerized telco systems, so I'm
trying a different approach. Maybe it will get more traction.

This leaves the sched tick running, but won't require a full
pass through schedule(). As Ben pointed out, the HRTICK version
would basically fire every 5ms, so depending on your HZ value it
might not have bought much uninterrupted runtime anyway.


Thanks for taking a look.


Cheers,
Phil

>
> Signed-off-by: Phil Auld <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Ben Segall <[email protected]>
> ---
> kernel/sched/fair.c | 33 ++++++++++++++++++++++++++++++++-
> kernel/sched/features.h | 2 ++
> 2 files changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 373ff5f55884..880eadfac330 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6139,6 +6139,33 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
> rcu_read_unlock();
> }
>
> +#ifdef CONFIG_NO_HZ_FULL
> +/* called from pick_next_task_fair() */
> +static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
> +{
> + struct cfs_rq *cfs_rq = task_cfs_rq(p);
> + int cpu = cpu_of(rq);
> +
> + if (!sched_feat(HZ_BW) || !cfs_bandwidth_used())
> + return;
> +
> + if (!tick_nohz_full_cpu(cpu))
> + return;
> +
> + if (rq->nr_running != 1 || !sched_can_stop_tick(rq))
> + return;
> +
> + /*
> + * We know there is only one task runnable and we've just picked it. The
> + * normal enqueue path will have cleared TICK_DEP_BIT_SCHED if we will
> + * be otherwise able to stop the tick. Just need to check if we are using
> + * bandwidth control.
> + */
> + if (cfs_rq->runtime_enabled)
> + tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
> +}
> +#endif
> +
> #else /* CONFIG_CFS_BANDWIDTH */
>
> static inline bool cfs_bandwidth_used(void)
> @@ -6181,9 +6208,12 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
> static inline void update_runtime_enabled(struct rq *rq) {}
> static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}
> -
> #endif /* CONFIG_CFS_BANDWIDTH */
>
> +#if !defined(CONFIG_CFS_BANDWIDTH) || !defined(CONFIG_NO_HZ_FULL)
> +static inline void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p) {}
> +#endif
> +
> /**************************************************
> * CFS operations on tasks:
> */
> @@ -8097,6 +8127,7 @@ done: __maybe_unused;
> hrtick_start_fair(rq, p);
>
> update_misfit_status(p, rq);
> + sched_fair_update_stop_tick(rq, p);
>
> return p;
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c76bd3..6fdf1fdf6b17 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -101,3 +101,5 @@ SCHED_FEAT(LATENCY_WARN, false)
>
> SCHED_FEAT(ALT_PERIOD, true)
> SCHED_FEAT(BASE_SLICE, true)
> +
> +SCHED_FEAT(HZ_BW, false)
> --
> 2.31.1
>

--


2023-06-22 14:40:27

by Steven Rostedt

Subject: Re: [PATCH] Sched/fair: Block nohz tick_stop when cfs bandwidth in use

On Thu, 22 Jun 2023 09:27:51 -0400
Phil Auld <[email protected]> wrote:

> CFS bandwidth limits and NOHZ full don't play well together. Tasks
> can easily run well past their quotas before a remote tick does
> accounting. This leads to long, multi-period stalls before such
> tasks can run again. Currently, when presented with these conflicting
> requirements the scheduler is favoring nohz_full and letting the tick
> be stopped. However, nohz tick stopping is already best-effort, there
> are a number of conditions that can prevent it, whereas cfs runtime
> bandwidth is expected to be enforced.
>
> Make the scheduler favor bandwidth over stopping the tick by setting
> TICK_DEP_BIT_SCHED when the only running task is a cfs task with
> runtime limit enabled.
>
> Add sched_feat HZ_BW (off by default) to control this behavior.

So the tl;dr is: "If the current task has a bandwidth limit, do not disable
the tick" ?

-- Steve

2023-06-22 16:10:25

by Phil Auld

Subject: Re: [PATCH] Sched/fair: Block nohz tick_stop when cfs bandwidth in use

On Thu, Jun 22, 2023 at 10:22:16AM -0400 Steven Rostedt wrote:
> On Thu, 22 Jun 2023 09:27:51 -0400
> Phil Auld <[email protected]> wrote:
>
> > CFS bandwidth limits and NOHZ full don't play well together. Tasks
> > can easily run well past their quotas before a remote tick does
> > accounting. This leads to long, multi-period stalls before such
> > tasks can run again. Currently, when presented with these conflicting
> > requirements the scheduler is favoring nohz_full and letting the tick
> > be stopped. However, nohz tick stopping is already best-effort, there
> > are a number of conditions that can prevent it, whereas cfs runtime
> > bandwidth is expected to be enforced.
> >
> > Make the scheduler favor bandwidth over stopping the tick by setting
> > TICK_DEP_BIT_SCHED when the only running task is a cfs task with
> > runtime limit enabled.
> >
> > Add sched_feat HZ_BW (off by default) to control this behavior.
>
> So the tl;dr; is: "If the current task has a bandwidth limit, do not disable
> the tick" ?
>

Yes. W/o the tick we can't reliably support/enforce the bandwidth limit.


Cheers,
Phil

> -- Steve
>

--


2023-06-22 21:15:58

by Benjamin Segall

Subject: Re: [PATCH] Sched/fair: Block nohz tick_stop when cfs bandwidth in use

Phil Auld <[email protected]> writes:

> CFS bandwidth limits and NOHZ full don't play well together. Tasks
> can easily run well past their quotas before a remote tick does
> accounting. This leads to long, multi-period stalls before such
> tasks can run again. Currently, when presented with these conflicting
> requirements the scheduler is favoring nohz_full and letting the tick
> be stopped. However, nohz tick stopping is already best-effort, there
> are a number of conditions that can prevent it, whereas cfs runtime
> bandwidth is expected to be enforced.
>
> Make the scheduler favor bandwidth over stopping the tick by setting
> TICK_DEP_BIT_SCHED when the only running task is a cfs task with
> runtime limit enabled.
>
> Add sched_feat HZ_BW (off by default) to control this behavior.
>
> Signed-off-by: Phil Auld <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Ben Segall <[email protected]>
> ---
> kernel/sched/fair.c | 33 ++++++++++++++++++++++++++++++++-
> kernel/sched/features.h | 2 ++
> 2 files changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 373ff5f55884..880eadfac330 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6139,6 +6139,33 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
> rcu_read_unlock();
> }
>
> +#ifdef CONFIG_NO_HZ_FULL
> +/* called from pick_next_task_fair() */
> +static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
> +{
> + struct cfs_rq *cfs_rq = task_cfs_rq(p);
> + int cpu = cpu_of(rq);
> +
> + if (!sched_feat(HZ_BW) || !cfs_bandwidth_used())
> + return;
> +
> + if (!tick_nohz_full_cpu(cpu))
> + return;
> +
> + if (rq->nr_running != 1 || !sched_can_stop_tick(rq))
> + return;
> +
> + /*
> + * We know there is only one task runnable and we've just picked it. The
> + * normal enqueue path will have cleared TICK_DEP_BIT_SCHED if we will
> + * be otherwise able to stop the tick. Just need to check if we are using
> + * bandwidth control.
> + */
> + if (cfs_rq->runtime_enabled)
> + tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
> +}
> +#endif

So from a CFS_BANDWIDTH pov runtime_enabled && nr_running == 1 seems
fine. But working around sched_can_stop_tick instead of with it seems
sketchy in general, and in an edge case like "migrate a task onto the
cpu and then off again" you'd get sched_update_tick_dependency resetting
the TICK_DEP_BIT and then PNT never being called (i.e. a task wakes up
onto this cpu without preempting, and then another cpu goes idle and
pulls it, causing this cpu to go into nohz_full).

2023-06-22 21:56:21

by Phil Auld

Subject: Re: [PATCH] Sched/fair: Block nohz tick_stop when cfs bandwidth in use

On Thu, Jun 22, 2023 at 01:49:52PM -0700 Benjamin Segall wrote:
> Phil Auld <[email protected]> writes:
>
> > CFS bandwidth limits and NOHZ full don't play well together. Tasks
> > can easily run well past their quotas before a remote tick does
> > accounting. This leads to long, multi-period stalls before such
> > tasks can run again. Currently, when presented with these conflicting
> > requirements the scheduler is favoring nohz_full and letting the tick
> > be stopped. However, nohz tick stopping is already best-effort, there
> > are a number of conditions that can prevent it, whereas cfs runtime
> > bandwidth is expected to be enforced.
> >
> > Make the scheduler favor bandwidth over stopping the tick by setting
> > TICK_DEP_BIT_SCHED when the only running task is a cfs task with
> > runtime limit enabled.
> >
> > Add sched_feat HZ_BW (off by default) to control this behavior.
> >
> > Signed-off-by: Phil Auld <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Vincent Guittot <[email protected]>
> > Cc: Juri Lelli <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: Ben Segall <[email protected]>
> > ---
> > kernel/sched/fair.c | 33 ++++++++++++++++++++++++++++++++-
> > kernel/sched/features.h | 2 ++
> > 2 files changed, 34 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 373ff5f55884..880eadfac330 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6139,6 +6139,33 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
> > rcu_read_unlock();
> > }
> >
> > +#ifdef CONFIG_NO_HZ_FULL
> > +/* called from pick_next_task_fair() */
> > +static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
> > +{
> > + struct cfs_rq *cfs_rq = task_cfs_rq(p);
> > + int cpu = cpu_of(rq);
> > +
> > + if (!sched_feat(HZ_BW) || !cfs_bandwidth_used())
> > + return;
> > +
> > + if (!tick_nohz_full_cpu(cpu))
> > + return;
> > +
> > + if (rq->nr_running != 1 || !sched_can_stop_tick(rq))
> > + return;
> > +
> > + /*
> > + * We know there is only one task runnable and we've just picked it. The
> > + * normal enqueue path will have cleared TICK_DEP_BIT_SCHED if we will
> > + * be otherwise able to stop the tick. Just need to check if we are using
> > + * bandwidth control.
> > + */
> > + if (cfs_rq->runtime_enabled)
> > + tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
> > +}
> > +#endif
>
> So from a CFS_BANDWIDTH pov runtime_enabled && nr_running == 1 seems
> fine. But working around sched_can_stop_tick instead of with it seems
> sketchy in general, and in an edge case like "migrate a task onto the
> cpu and then off again" you'd get sched_update_tick_dependency resetting
> the TICK_DEP_BIT and then not call PNT (ie a task wakes up onto this cpu
> without preempting, and then another cpu goes idle and pulls it, causing
> this cpu to go into nohz_full).
>

The information to make these tests is not available in sched_can_stop_tick.
I did start there. When that is called, and we are likely to go nohz_full,
curr is null so it's hard to find the right cfs_rq to make that
runtime_enabled test against. We could, maybe, plumb the task being enqueued
in but it would not be valid for the dequeue path and would be a bit messy.

But yes, I suppose you could end up in a state that is just as bad as today.

Maybe I could add a redundant check in sched_can_stop_tick for when
nr_running == 1 and curr is not null and make sure the bit does not get
cleared. I'll look into that.


Thanks,
Phil

--


2023-06-23 13:17:31

by Phil Auld

Subject: Re: [PATCH] Sched/fair: Block nohz tick_stop when cfs bandwidth in use

On Thu, Jun 22, 2023 at 05:37:30PM -0400 Phil Auld wrote:
> On Thu, Jun 22, 2023 at 01:49:52PM -0700 Benjamin Segall wrote:
> > Phil Auld <[email protected]> writes:
> >
> > > CFS bandwidth limits and NOHZ full don't play well together. Tasks
> > > can easily run well past their quotas before a remote tick does
> > > accounting. This leads to long, multi-period stalls before such
> > > tasks can run again. Currently, when presented with these conflicting
> > > requirements the scheduler is favoring nohz_full and letting the tick
> > > be stopped. However, nohz tick stopping is already best-effort, there
> > > are a number of conditions that can prevent it, whereas cfs runtime
> > > bandwidth is expected to be enforced.
> > >
> > > Make the scheduler favor bandwidth over stopping the tick by setting
> > > TICK_DEP_BIT_SCHED when the only running task is a cfs task with
> > > runtime limit enabled.
> > >
> > > Add sched_feat HZ_BW (off by default) to control this behavior.
> > >
> > > Signed-off-by: Phil Auld <[email protected]>
> > > Cc: Ingo Molnar <[email protected]>
> > > Cc: Peter Zijlstra <[email protected]>
> > > Cc: Vincent Guittot <[email protected]>
> > > Cc: Juri Lelli <[email protected]>
> > > Cc: Dietmar Eggemann <[email protected]>
> > > Cc: Valentin Schneider <[email protected]>
> > > Cc: Ben Segall <[email protected]>
> > > ---
> > > kernel/sched/fair.c | 33 ++++++++++++++++++++++++++++++++-
> > > kernel/sched/features.h | 2 ++
> > > 2 files changed, 34 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 373ff5f55884..880eadfac330 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6139,6 +6139,33 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
> > > rcu_read_unlock();
> > > }
> > >
> > > +#ifdef CONFIG_NO_HZ_FULL
> > > +/* called from pick_next_task_fair() */
> > > +static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
> > > +{
> > > + struct cfs_rq *cfs_rq = task_cfs_rq(p);
> > > + int cpu = cpu_of(rq);
> > > +
> > > + if (!sched_feat(HZ_BW) || !cfs_bandwidth_used())
> > > + return;
> > > +
> > > + if (!tick_nohz_full_cpu(cpu))
> > > + return;
> > > +
> > > + if (rq->nr_running != 1 || !sched_can_stop_tick(rq))
> > > + return;
> > > +
> > > + /*
> > > + * We know there is only one task runnable and we've just picked it. The
> > > + * normal enqueue path will have cleared TICK_DEP_BIT_SCHED if we will
> > > + * be otherwise able to stop the tick. Just need to check if we are using
> > > + * bandwidth control.
> > > + */
> > > + if (cfs_rq->runtime_enabled)
> > > + tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
> > > +}
> > > +#endif
> >
> > So from a CFS_BANDWIDTH pov runtime_enabled && nr_running == 1 seems
> > fine. But working around sched_can_stop_tick instead of with it seems
> > sketchy in general, and in an edge case like "migrate a task onto the
> > cpu and then off again" you'd get sched_update_tick_dependency resetting
> > the TICK_DEP_BIT and then not call PNT (ie a task wakes up onto this cpu
> > without preempting, and then another cpu goes idle and pulls it, causing
> > this cpu to go into nohz_full).
> >
>
> The information to make these tests is not available in sched_can_stop_tick.
> I did start there. When that is called, and we are likely to go nohz_full,
> curr is null so it's hard to find the right cfs_rq to make that
> runtime_enabled test against. We could, maybe, plumb the task being enqueued
> in but it would not be valid for the dequeue path and would be a bit messy.
>

Sorry, misspoke... rq->curr == rq->idle, not NULL. But still we don't have
access to the task and its cfs_rq, which will have runtime_enabled set.

> But yes, I suppose you could end up in a state that is just as bad as today.
>
> Maybe I could add a redundant check in sched_can_stop_tick for when
> nr_running == 1 and curr is not null and make sure the bit does not get
> cleared. I'll look into that.
>
>
> Thanks,
> Phil
>
> --
>

--


2023-06-23 19:56:14

by Benjamin Segall

Subject: Re: [PATCH] Sched/fair: Block nohz tick_stop when cfs bandwidth in use

Phil Auld <[email protected]> writes:

> On Thu, Jun 22, 2023 at 05:37:30PM -0400 Phil Auld wrote:
>> On Thu, Jun 22, 2023 at 01:49:52PM -0700 Benjamin Segall wrote:
>> > Phil Auld <[email protected]> writes:
>> >
>> > > CFS bandwidth limits and NOHZ full don't play well together. Tasks
>> > > can easily run well past their quotas before a remote tick does
>> > > accounting. This leads to long, multi-period stalls before such
>> > > tasks can run again. Currently, when presented with these conflicting
>> > > requirements the scheduler is favoring nohz_full and letting the tick
>> > > be stopped. However, nohz tick stopping is already best-effort, there
>> > > are a number of conditions that can prevent it, whereas cfs runtime
>> > > bandwidth is expected to be enforced.
>> > >
>> > > Make the scheduler favor bandwidth over stopping the tick by setting
>> > > TICK_DEP_BIT_SCHED when the only running task is a cfs task with
>> > > runtime limit enabled.
>> > >
>> > > Add sched_feat HZ_BW (off by default) to control this behavior.
>> > >
>> > > Signed-off-by: Phil Auld <[email protected]>
>> > > Cc: Ingo Molnar <[email protected]>
>> > > Cc: Peter Zijlstra <[email protected]>
>> > > Cc: Vincent Guittot <[email protected]>
>> > > Cc: Juri Lelli <[email protected]>
>> > > Cc: Dietmar Eggemann <[email protected]>
>> > > Cc: Valentin Schneider <[email protected]>
>> > > Cc: Ben Segall <[email protected]>
>> > > ---
>> > > kernel/sched/fair.c | 33 ++++++++++++++++++++++++++++++++-
>> > > kernel/sched/features.h | 2 ++
>> > > 2 files changed, 34 insertions(+), 1 deletion(-)
>> > >
>> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > > index 373ff5f55884..880eadfac330 100644
>> > > --- a/kernel/sched/fair.c
>> > > +++ b/kernel/sched/fair.c
>> > > @@ -6139,6 +6139,33 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
>> > > rcu_read_unlock();
>> > > }
>> > >
>> > > +#ifdef CONFIG_NO_HZ_FULL
>> > > +/* called from pick_next_task_fair() */
>> > > +static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
>> > > +{
>> > > + struct cfs_rq *cfs_rq = task_cfs_rq(p);
>> > > + int cpu = cpu_of(rq);
>> > > +
>> > > + if (!sched_feat(HZ_BW) || !cfs_bandwidth_used())
>> > > + return;
>> > > +
>> > > + if (!tick_nohz_full_cpu(cpu))
>> > > + return;
>> > > +
>> > > + if (rq->nr_running != 1 || !sched_can_stop_tick(rq))
>> > > + return;
>> > > +
>> > > + /*
>> > > + * We know there is only one task runnable and we've just picked it. The
>> > > + * normal enqueue path will have cleared TICK_DEP_BIT_SCHED if we will
>> > > + * be otherwise able to stop the tick. Just need to check if we are using
>> > > + * bandwidth control.
>> > > + */
>> > > + if (cfs_rq->runtime_enabled)
>> > > + tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
>> > > +}
>> > > +#endif
>> >
>> > So from a CFS_BANDWIDTH pov runtime_enabled && nr_running == 1 seems
>> > fine. But working around sched_can_stop_tick instead of with it seems
>> > sketchy in general, and in an edge case like "migrate a task onto the
>> > cpu and then off again" you'd get sched_update_tick_dependency resetting
>> > the TICK_DEP_BIT and then not call PNT (ie a task wakes up onto this cpu
>> > without preempting, and then another cpu goes idle and pulls it, causing
>> > this cpu to go into nohz_full).
>> >
>>
>> The information to make these tests is not available in sched_can_stop_tick.
>> I did start there. When that is called, and we are likely to go nohz_full,
>> curr is null so it's hard to find the right cfs_rq to make that
>> runtime_enabled test against. We could, maybe, plumb the task being enqueued
>> in but it would not be valid for the dequeue path and would be a bit messy.
>>
>
> Sorry, misspoke... rq->curr == rq->idle, not NULL. But still we don't have
> access to the task and its cfs_rq, which will have runtime_enabled set.
>

That is unfortunate. I suppose then you'd wind up needing both this
extra bit in PNT to handle the switch into nr_running == 1 territory,
and a "HZ_BW && nr_running == 1 && curr is fair && curr->on_rq &&
curr->cfs_rq->runtime_enabled" check in sched_can_stop_tick to catch
edge cases. (I think that would be sufficient, if an annoyingly long set
of conditionals)

2023-06-23 20:16:12

by Phil Auld

Subject: Re: [PATCH] Sched/fair: Block nohz tick_stop when cfs bandwidth in use

On Fri, Jun 23, 2023 at 11:59:09AM -0700 Benjamin Segall wrote:
> Phil Auld <[email protected]> writes:
>
> > On Thu, Jun 22, 2023 at 05:37:30PM -0400 Phil Auld wrote:
> >> On Thu, Jun 22, 2023 at 01:49:52PM -0700 Benjamin Segall wrote:
> >> > Phil Auld <[email protected]> writes:
> >> >
> >> > > CFS bandwidth limits and NOHZ full don't play well together. Tasks
> >> > > can easily run well past their quotas before a remote tick does
> >> > > accounting. This leads to long, multi-period stalls before such
> >> > > tasks can run again. Currently, when presented with these conflicting
> >> > > requirements the scheduler is favoring nohz_full and letting the tick
> >> > > be stopped. However, nohz tick stopping is already best-effort, there
> >> > > are a number of conditions that can prevent it, whereas cfs runtime
> >> > > bandwidth is expected to be enforced.
> >> > >
> >> > > Make the scheduler favor bandwidth over stopping the tick by setting
> >> > > TICK_DEP_BIT_SCHED when the only running task is a cfs task with
> >> > > runtime limit enabled.
> >> > >
> >> > > Add sched_feat HZ_BW (off by default) to control this behavior.
> >> > >
> >> > > Signed-off-by: Phil Auld <[email protected]>
> >> > > Cc: Ingo Molnar <[email protected]>
> >> > > Cc: Peter Zijlstra <[email protected]>
> >> > > Cc: Vincent Guittot <[email protected]>
> >> > > Cc: Juri Lelli <[email protected]>
> >> > > Cc: Dietmar Eggemann <[email protected]>
> >> > > Cc: Valentin Schneider <[email protected]>
> >> > > Cc: Ben Segall <[email protected]>
> >> > > ---
> >> > > kernel/sched/fair.c | 33 ++++++++++++++++++++++++++++++++-
> >> > > kernel/sched/features.h | 2 ++
> >> > > 2 files changed, 34 insertions(+), 1 deletion(-)
> >> > >
> >> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> > > index 373ff5f55884..880eadfac330 100644
> >> > > --- a/kernel/sched/fair.c
> >> > > +++ b/kernel/sched/fair.c
> >> > > @@ -6139,6 +6139,33 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
> >> > > rcu_read_unlock();
> >> > > }
> >> > >
> >> > > +#ifdef CONFIG_NO_HZ_FULL
> >> > > +/* called from pick_next_task_fair() */
> >> > > +static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
> >> > > +{
> >> > > + struct cfs_rq *cfs_rq = task_cfs_rq(p);
> >> > > + int cpu = cpu_of(rq);
> >> > > +
> >> > > + if (!sched_feat(HZ_BW) || !cfs_bandwidth_used())
> >> > > + return;
> >> > > +
> >> > > + if (!tick_nohz_full_cpu(cpu))
> >> > > + return;
> >> > > +
> >> > > + if (rq->nr_running != 1 || !sched_can_stop_tick(rq))
> >> > > + return;
> >> > > +
> >> > > + /*
> >> > > + * We know there is only one task runnable and we've just picked it. The
> >> > > + * normal enqueue path will have cleared TICK_DEP_BIT_SCHED if we will
> >> > > + * be otherwise able to stop the tick. Just need to check if we are using
> >> > > + * bandwidth control.
> >> > > + */
> >> > > + if (cfs_rq->runtime_enabled)
> >> > > + tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
> >> > > +}
> >> > > +#endif
> >> >
> >> > So from a CFS_BANDWIDTH pov runtime_enabled && nr_running == 1 seems
> >> > fine. But working around sched_can_stop_tick instead of with it seems
> >> > sketchy in general, and in an edge case like "migrate a task onto the
> >> > cpu and then off again" you'd get sched_update_tick_dependency resetting
> >> > the TICK_DEP_BIT and then not call PNT (ie a task wakes up onto this cpu
> >> > without preempting, and then another cpu goes idle and pulls it, causing
> >> > this cpu to go into nohz_full).
> >> >
> >>
> >> The information to make these tests is not available in sched_can_stop_tick.
> >> I did start there. When that is called, and we are likely to go nohz_full,
> >> curr is null so it's hard to find the right cfs_rq to make that
> >> runtime_enabled test against. We could, maybe, plumb the task being enqueued
> >> in but it would not be valid for the dequeue path and would be a bit messy.
> >>
> >
> > Sorry, misspoke... rq->curr == rq->idle, not NULL. But still we don't have
> > access to the task and its cfs_rq, which will have runtime_enabled set.
> >
>
> That is unfortunate. I suppose then you'd wind up needing both this
> extra bit in PNT to handle the switch into nr_running == 1 territory,
> and a "HZ_BW && nr_running == 1 && curr is fair && curr->on_rq &&
> curr->cfs_rq->runtime_enabled" check in sched_can_stop_tick to catch
> edge cases. (I think that would be sufficient, if an annoyingly long set
> of conditionals)
>

Right. That's more or less what the version I'm testing now does.

Thanks again.


Cheers,
Phil

--