Vincent's load balance rework [1] got me thinking about how and where we
use rq.nr_running vs rq.cfs.h_nr_running checks, and this led me to
stare intently at the active load balancer.
I haven't seen it happen (yet), but from reading the code it really looks
like we can have some scenarios where the cpu_stopper ends up preempting
a > CFS class task, since we never actually look at what the remote rq's
running task is.
This series shuffles things around the CFS active load balancer to prevent
this from happening.
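For reference, the kick currently looks roughly like this in
load_balance() (paraphrased from memory and heavily trimmed, so don't
take the details as gospel); note that the only per-task state inspected
is busiest->curr->cpus_ptr, never its sched class:

	if (need_active_balance(&env)) {
		unsigned long flags;

		raw_spin_lock_irqsave(&busiest->lock, flags);

		/* ->cpus_ptr gets checked around here, ->sched_class never is */
		if (!busiest->active_balance) {
			busiest->active_balance = 1;
			busiest->push_cpu = this_cpu;
			active_balance = 1;
		}
		raw_spin_unlock_irqrestore(&busiest->lock, flags);

		if (active_balance)
			stop_one_cpu_nowait(cpu_of(busiest),
					    active_load_balance_cpu_stop,
					    busiest,
					    &busiest->active_balance_work);
	}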
- Patch 1 is a freebie cleanup
- Patch 2 is a preparatory code move
- Patch 3 adds h_nr_running checks
- Patch 4 adds a sched class check + detach_one_task() to the active balance
This is based on top of today's tip/sched/core:
a46d14eca7b7 ("sched/fair: Use rq_lock/unlock in online_fair_sched_group")
v1 -> v2:
- (new patch) Added need_active_balance() cleanup
- Tweaked active balance code move to respect existing
sd->nr_balance_failed modifications
- Added explicit checks of active_load_balance()'s return value
- Added an h_nr_running < 1 check before kicking the cpu_stopper
- Added a detach_one_task() call in active_load_balance() when the remote
rq's running task is > CFS
[1]: https://lore.kernel.org/lkml/[email protected]/
Valentin Schneider (4):
sched/fair: Make need_active_balance() return bools
sched/fair: Move active balance logic to its own function
sched/fair: Check for CFS tasks before detach_one_task()
sched/fair: Prevent active LB from preempting higher sched classes
kernel/sched/fair.c | 151 ++++++++++++++++++++++++++++----------------
1 file changed, 95 insertions(+), 56 deletions(-)
--
2.22.0
That function has no need to return ints; make it return bools.
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1054d2cf6aaa..22be1ca8d117 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8734,12 +8734,12 @@ voluntary_active_balance(struct lb_env *env)
return 0;
}
-static int need_active_balance(struct lb_env *env)
+static bool need_active_balance(struct lb_env *env)
{
struct sched_domain *sd = env->sd;
if (voluntary_active_balance(env))
- return 1;
+ return true;
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
}
--
2.22.0
The CFS load balancer can cause the cpu_stopper to run a function to
try and steal a remote rq's running task. However, it so happens
that while only CFS tasks will ever be migrated by that function, we
can end up preempting higher sched class tasks, since it is executed
by the cpu_stopper.
This can potentially occur whenever a rq's running task is > CFS but
the rq has runnable CFS tasks.
Check the sched class of the remote rq's running task after we've
grabbed its lock. If it's CFS, carry on, otherwise run
detach_one_task() locally since we don't need the cpu_stopper (that
!CFS task is doing the exact same thing).
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/sched/fair.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f5f6cad5008..bf4f7f43252f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8759,6 +8759,7 @@ static inline enum alb_status active_load_balance(struct lb_env *env)
enum alb_status status = cancelled;
struct sched_domain *sd = env->sd;
struct rq *busiest = env->src_rq;
+ struct task_struct *p = NULL;
unsigned long flags;
schedstat_inc(sd->lb_failed[env->idle]);
@@ -8780,6 +8781,16 @@ static inline enum alb_status active_load_balance(struct lb_env *env)
if (busiest->cfs.h_nr_running < 1)
goto unlock;
+ /*
+ * If busiest->curr isn't CFS, then there's no point in running the
+ * cpu_stopper. We can try to nab one CFS task though, since they're
+ * all ready to be plucked.
+ */
+ if (busiest->curr->sched_class != &fair_sched_class) {
+ p = detach_one_task(env);
+ goto unlock;
+ }
+
/*
* Don't kick the active_load_balance_cpu_stop, if the curr task on
* busiest CPU can't be moved to dst_cpu:
@@ -8803,7 +8814,9 @@ static inline enum alb_status active_load_balance(struct lb_env *env)
unlock:
raw_spin_unlock_irqrestore(&busiest->lock, flags);
- if (status == started)
+ if (p)
+ attach_one_task(env->dst_rq, p);
+ else if (status == started)
stop_one_cpu_nowait(cpu_of(busiest),
active_load_balance_cpu_stop, busiest,
&busiest->active_balance_work);
--
2.22.0
On Thu, 15 Aug 2019 at 16:52, Valentin Schneider
<[email protected]> wrote:
>
> The CFS load balancer can cause the cpu_stopper to run a function to
> try and steal a remote rq's running task. However, it so happens
> that while only CFS tasks will ever be migrated by that function, we
> can end up preempting higher sched class tasks, since it is executed
> by the cpu_stopper.
>
> This can potentially occur whenever a rq's running task is > CFS but
> the rq has runnable CFS tasks.
>
> Check the sched class of the remote rq's running task after we've
> grabbed its lock. If it's CFS, carry on, otherwise run
> detach_one_task() locally since we don't need the cpu_stopper (that
> !CFS task is doing the exact same thing).
AFAICT, this doesn't prevent preempting a !CFS task but only reduces
the window.
As soon as you unlock, a !CFS task can preempt CFS before you start the
stop thread.
Testing busiest->cfs.h_nr_running < 1 and/or
busiest->curr->sched_class != &fair_sched_class
in need_active_balance() will do almost the same and is much simpler
than this patchset IMO.
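Something like the below (untested, on top of your patch 1, purely to
illustrate the idea - the reads are of course lockless):

 static bool need_active_balance(struct lb_env *env)
 {
 	struct sched_domain *sd = env->sd;

+	/* Nothing to pull, or curr isn't CFS: no point kicking the stopper */
+	if (env->src_rq->cfs.h_nr_running < 1 ||
+	    env->src_rq->curr->sched_class != &fair_sched_class)
+		return false;
+
 	if (voluntary_active_balance(env))
 		return true;

 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
 }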
>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> kernel/sched/fair.c | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8f5f6cad5008..bf4f7f43252f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8759,6 +8759,7 @@ static inline enum alb_status active_load_balance(struct lb_env *env)
> enum alb_status status = cancelled;
> struct sched_domain *sd = env->sd;
> struct rq *busiest = env->src_rq;
> + struct task_struct *p = NULL;
> unsigned long flags;
>
> schedstat_inc(sd->lb_failed[env->idle]);
> @@ -8780,6 +8781,16 @@ static inline enum alb_status active_load_balance(struct lb_env *env)
> if (busiest->cfs.h_nr_running < 1)
> goto unlock;
>
> + /*
> + * If busiest->curr isn't CFS, then there's no point in running the
> + * cpu_stopper. We can try to nab one CFS task though, since they're
> + * all ready to be plucked.
> + */
> + if (busiest->curr->sched_class != &fair_sched_class) {
> + p = detach_one_task(env);
> + goto unlock;
> + }
> +
> /*
> * Don't kick the active_load_balance_cpu_stop, if the curr task on
> * busiest CPU can't be moved to dst_cpu:
> @@ -8803,7 +8814,9 @@ static inline enum alb_status active_load_balance(struct lb_env *env)
> unlock:
> raw_spin_unlock_irqrestore(&busiest->lock, flags);
>
> - if (status == started)
> + if (p)
> + attach_one_task(env->dst_rq, p);
> + else if (status == started)
> stop_one_cpu_nowait(cpu_of(busiest),
> active_load_balance_cpu_stop, busiest,
> &busiest->active_balance_work);
> --
> 2.22.0
>
On 27/08/2019 13:28, Vincent Guittot wrote:
> On Thu, 15 Aug 2019 at 16:52, Valentin Schneider
> <[email protected]> wrote:
>>
>> The CFS load balancer can cause the cpu_stopper to run a function to
>> try and steal a remote rq's running task. However, it so happens
>> that while only CFS tasks will ever be migrated by that function, we
>> can end up preempting higher sched class tasks, since it is executed
>> by the cpu_stopper.
>>
>> This can potentially occur whenever a rq's running task is > CFS but
>> the rq has runnable CFS tasks.
>>
>> Check the sched class of the remote rq's running task after we've
>> grabbed its lock. If it's CFS, carry on, otherwise run
>> detach_one_task() locally since we don't need the cpu_stopper (that
>> !CFS task is doing the exact same thing).
>
> AFAICT, this doesn't prevent preempting a !CFS task but only reduces
> the window.
> As soon as you unlock, a !CFS task can preempt CFS before you start the
> stop thread.
>
Right, if we end up kicking the cpu_stopper this can still happen (since
we drop the lock). Thing is, you can't detect it on the cpu_stopper side,
since the currently running task is obviously not going to be CFS (and it's
too late anyway, we already preempted whatever was running there). Though
I should probably change the name of the patch to reflect that it's not a
100% cure.
I tweaked the nr_running check of the cpu_stop callback in patch 3/4 to try
to bail out early, but AFAICT that's the best we can do without big changes
elsewhere.
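(For the record, by "tweaked" I mean having the callback also peek at
cfs.h_nr_running - roughly the below, going from memory rather than
quoting the actual hunk:

	/* In active_load_balance_cpu_stop(), before trying to detach anything */
	if (busiest_rq->cfs.h_nr_running < 1)
		goto out_unlock;

so a busiest rq left with no runnable CFS tasks at least makes the
stopper bail out early.)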
If we wanted to prevent those preemptions at all cost, I suppose we'd want
the equivalent of a sched class sitting between CFS and RT: have the
callback only run when there's no runnable > CFS tasks. But then by the
time we execute it we may no longer need to balance anything...
At the very least, what I'm proposing here alleviates *some* of the
preemption cases without swinging the wrecking ball too hard (and without
delaying the balancing either).
> testing busiest->cfs.h_nr_running < 1 and/or
> busiest->curr->sched_class != &fair_sched_class
> in need_active_balance() will do almost the same and is much simpler
> than this patchset IMO.
>
I had this initially but convinced myself out of it: since we hold no
lock in need_active_balance(), the information we get on the current task
(and, arguably, on the h_nr_running) is too volatile to be of any use.
I do believe those checks have their place in active_load_balance()'s
critical section, as that's the most accurate we're going to get. On the
plus side, if we *do* detect the remote rq's current task isn't CFS, we
can run detach_one_task() locally, which is an improvement IMO.
On Wed, 28 Aug 2019 at 11:46, Valentin Schneider
<[email protected]> wrote:
>
> On 27/08/2019 13:28, Vincent Guittot wrote:
> > On Thu, 15 Aug 2019 at 16:52, Valentin Schneider
> > <[email protected]> wrote:
> >>
> >> The CFS load balancer can cause the cpu_stopper to run a function to
> >> try and steal a remote rq's running task. However, it so happens
> >> that while only CFS tasks will ever be migrated by that function, we
> >> can end up preempting higher sched class tasks, since it is executed
> >> by the cpu_stopper.
> >>
> >> This can potentially occur whenever a rq's running task is > CFS but
> >> the rq has runnable CFS tasks.
> >>
> >> Check the sched class of the remote rq's running task after we've
> >> grabbed its lock. If it's CFS, carry on, otherwise run
> >> detach_one_task() locally since we don't need the cpu_stopper (that
> >> !CFS task is doing the exact same thing).
> >
> > AFAICT, this doesn't prevent preempting a !CFS task but only reduces
> > the window.
> > As soon as you unlock, a !CFS task can preempt CFS before you start the
> > stop thread.
> >
>
> Right, if we end up kicking the cpu_stopper this can still happen (since
> we drop the lock). Thing is, you can't detect it on the cpu_stopper side,
> since the currently running task is obviously not going to be CFS (and it's
> too late anyway, we already preempted whatever was running there). Though
> I should probably change the name of the patch to reflect that it's not a
> 100% cure.
>
> I tweaked the nr_running check of the cpu_stop callback in patch 3/4 to try
> to bail out early, but AFAICT that's the best we can do without big changes
> elsewhere.
>
> If we wanted to prevent those preemptions at all cost, I suppose we'd want
I'm not sure that it's worth the effort and the complexity
> the equivalent of a sched class sitting between CFS and RT: have the
> callback only run when there's no runnable > CFS tasks. But then by the
> time we execute it we may no longer need to balance anything...
>
> At the very least, what I'm proposing here alleviates *some* of the
> preemption cases without swinging the wrecking ball too hard (and without
> delaying the balancing either).
>
> > testing busiest->cfs.h_nr_running < 1 and/or
> > busiest->curr->sched_class != &fair_sched_class
> > in need_active_balance() will do almost the same and is much simpler
> > than this patchset IMO.
> >
>
> I had this initially but convinced myself out of it: since we hold no
> lock in need_active_balance(), the information we get on the current task
> (and, arguably, on the h_nr_running) is too volatile to be of any use.
But since the lock is released anyway, everything will always be too
volatile in this case.
>
> I do believe those checks have their place in active_load_balance()'s
> critical section, as that's the most accurate we're going to get. On the
> plus side, if we *do* detect the remote rq's current task isn't CFS, we
> can run detach_one_task() locally, which is an improvement IMO.
This adds complexity to the code by adding another path to detach/attach task(s).
We could simply bail out and wait for the next load balance (which is
already the case sometimes), or if you really want to detach a task, jump
back to more_balance
On 29/08/2019 15:19, Vincent Guittot wrote:
[...]
>> Right, if we end up kicking the cpu_stopper this can still happen (since
>> we drop the lock). Thing is, you can't detect it on the cpu_stopper side,
>> since the currently running task is obviously not going to be CFS (and it's
>> too late anyway, we already preempted whatever was running there). Though
>> I should probably change the name of the patch to reflect that it's not a
>> 100% cure.
>>
>> I tweaked the nr_running check of the cpu_stop callback in patch 3/4 to try
>> to bail out early, but AFAICT that's the best we can do without big changes
>> elsewhere.
>>
>> If we wanted to prevent those preemptions at all cost, I suppose we'd want
>
> I'm not sure that it's worth the effort and the complexity
>
My point exactly :)
[...]
>> I had this initially but convinced myself out of it: since we hold no
>> lock in need_active_balance(), the information we get on the current task
>> (and, arguably, on the h_nr_running) is too volatile to be of any use.
>
> But since the lock is released anyway, everything will always be too
> volatile in this case.
We do release the lock if we go kick the cpu_stopper, but can nevertheless
make a decision with the most up-to-date information. I'd say it's for
similar reasons that we check busiest->curr->cpus_ptr right before
kicking the cpu_stopper rather than in need_active_balance().
The majority of the checks in need_active_balance() (all but one) depend
on env/sd stats which aren't volatile.
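To make that concrete, the affinity check in question ends up looking
something like this inside the locked region (shape approximated from
patch 4's hunk context, not a verbatim quote):

	raw_spin_lock_irqsave(&busiest->lock, flags);

	/*
	 * Don't kick the active_load_balance_cpu_stop if the curr task on
	 * the busiest CPU can't be moved to dst_cpu - read under
	 * busiest->lock, right before the kick.
	 */
	if (!cpumask_test_cpu(env->dst_cpu, busiest->curr->cpus_ptr))
		goto unlock;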
>>
>> I do believe those checks have their place in active_load_balance()'s
>> critical section, as that's the most accurate we're going to get. On the
>> plus side, if we *do* detect the remote rq's current task isn't CFS, we
>> can run detach_one_task() locally, which is an improvement IMO.
>
> This adds complexity to the code by adding another path to detach/attach task(s).
Note that it's not a new detach/attach per se; rather, it's about doing it
in active_load_balance() rather than in active_load_balance_cpu_stop() in
some cases.
> We could simply bail out and wait for the next load balance (which is
> already the case sometimes), or if you really want to detach a task, jump
> back to more_balance
>
A simple bail-out is what I had in v1, but following Qais' comments I
figured I could add the detach_one_task().
Jumping back to more_balance is quite different from doing a detach_one_task().
(expanded the Cc list)
RT/DL folks, any thought on the thing?
On 15/08/2019 15:51, Valentin Schneider wrote:
> Vincent's load balance rework [1] got me thinking about how and where we
> use rq.nr_running vs rq.cfs.h_nr_running checks, and this led me to
> stare intently at the active load balancer.
>
> I haven't seen it happen (yet), but from reading the code it really looks
> like we can have some scenarios where the cpu_stopper ends up preempting
> a > CFS class task, since we never actually look at what the remote rq's
> running task is.
>
> This series shuffles things around the CFS active load balancer to prevent
> this from happening.
>
> - Patch 1 is a freebie cleanup
> - Patch 2 is a preparatory code move
> - Patch 3 adds h_nr_running checks
> - Patch 4 adds a sched class check + detach_one_task() to the active balance
>
> This is based on top of today's tip/sched/core:
> a46d14eca7b7 ("sched/fair: Use rq_lock/unlock in online_fair_sched_group")
>
> v1 -> v2:
> - (new patch) Added need_active_balance() cleanup
>
> - Tweaked active balance code move to respect existing
> sd->nr_balance_failed modifications
> - Added explicit checks of active_load_balance()'s return value
>
> - Added an h_nr_running < 1 check before kicking the cpu_stopper
>
> - Added a detach_one_task() call in active_load_balance() when the remote
> rq's running task is > CFS
>
> [1]: https://lore.kernel.org/lkml/[email protected]/
>
> Valentin Schneider (4):
> sched/fair: Make need_active_balance() return bools
> sched/fair: Move active balance logic to its own function
> sched/fair: Check for CFS tasks before detach_one_task()
> sched/fair: Prevent active LB from preempting higher sched classes
>
> kernel/sched/fair.c | 151 ++++++++++++++++++++++++++++----------------
> 1 file changed, 95 insertions(+), 56 deletions(-)
>
> --
> 2.22.0
>
Hi Valentin,
On 01/10/19 11:29, Valentin Schneider wrote:
> (expanded the Cc list)
> RT/DL folks, any thought on the thing?
Even if I like your idea and it looks theoretically the right thing to
do, I'm not sure we want it in practice if it adds complexity to CFS.
I personally never noticed this kind of interference from CFS, but, at
the same time, for RT we usually prefer to be safe rather than sorry.
However, since this doesn't seem to be bullet-proof (as you also say), I
guess it all boils down again to complexity vs. practical benefits.
Best,
Juri
Hi Juri,
On 01/10/2019 14:31, Juri Lelli wrote:
> Hi Valentin,
>
> On 01/10/19 11:29, Valentin Schneider wrote:
>> (expanded the Cc list)
>> RT/DL folks, any thought on the thing?
>
> Even if I like your idea and it looks theoretically the right thing to
> do, I'm not sure we want it in practice if it adds complexity to CFS.
>
> I personally never noticed this kind of interference from CFS, but, at
> the same time, for RT we usually prefer to be safe rather than sorry.
> However, since this doesn't seem to be bullet-proof (as you also say), I
> guess it all boils down again to complexity vs. practical benefits.
>
Thanks for having a look.
IMO the worst part is the local detach_one_task() thing; I added that after
v1 following Qais' comments, but perhaps it doesn't gain us much.
I'll try to cook something up with rt-app and see if I can get sensible
numbers.
> Best,
>
> Juri
>