While doing newidle load balancing, it is possible for new tasks to
arrive, such as with pending wakeups. newidle_balance() already accounts
for this by exiting the sched_domain load_balance() iteration if it
detects these cases. This is very important for minimizing wakeup
latency.
However, if we are already in load_balance(), we may stay there for a
while before returning back to newidle_balance(). This is most
exacerbated if we enter a 'goto redo' loop in the LBF_ALL_PINNED case. A
very straightforward workaround to this is to adjust should_we_balance()
to bail out if we're doing a CPU_NEWLY_IDLE balance and new tasks are
detected.
This was tested with the following reproduction:
- two threads that take turns sleeping and waking each other up are
affined to two cores
- a large number of threads with 100% utilization are pinned to all
other cores
Without this patch, wakeup latency was ~120us for the pair of threads,
almost entirely spent in load_balance(). With this patch, wakeup latency
is ~6us.
Signed-off-by: Josh Don <[email protected]>
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c5b74f66bd3..5abf30117824 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9834,9 +9834,15 @@ static int should_we_balance(struct lb_env *env)
/*
* In the newly idle case, we will allow all the CPUs
* to do the newly idle load balance.
+ *
+ * However, we bail out if we already have tasks or a wakeup pending,
+ * to optimize wakeup latency.
*/
- if (env->idle == CPU_NEWLY_IDLE)
+ if (env->idle == CPU_NEWLY_IDLE) {
+ if (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending)
+ return 0;
return 1;
+ }
/* Try to find first idle CPU */
for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
--
2.36.1.476.g0c4daa206d-goog
On Thu, 9 Jun 2022 at 04:55, Josh Don <[email protected]> wrote:
>
> While doing newidle load balancing, it is possible for new tasks to
> arrive, such as with pending wakeups. newidle_balance() already accounts
> for this by exiting the sched_domain load_balance() iteration if it
> detects these cases. This is very important for minimizing wakeup
> latency.
>
> However, if we are already in load_balance(), we may stay there for a
> while before returning back to newidle_balance(). This is most
> exacerbated if we enter a 'goto redo' loop in the LBF_ALL_PINNED case. A
> very straightforward workaround to this is to adjust should_we_balance()
> to bail out if we're doing a CPU_NEWLY_IDLE balance and new tasks are
> detected.
This one is close to the other tests and I wonder if it should be
better placed before taking the busiest rq lock and detaching some
tasks.
Beside your use case where all other threads can't move in local cpu
and load_balance() loops and clears other cpus, most of the time is
probably spent in fbg() and fbq() so there are more chance that a task
woke in this meantime and I imagine that it becomes useless to take
lock and move tasks from another cpu if the local cpu is no more newly
idle.
Have you tried other places in load_balance() and does this one
provide the lowest wakeup latency ?
That being said, the current patch makes sense.
>
> This was tested with the following reproduction:
> - two threads that take turns sleeping and waking each other up are
> affined to two cores
> - a large number of threads with 100% utilization are pinned to all
> other cores
>
> Without this patch, wakeup latency was ~120us for the pair of threads,
> almost entirely spent in load_balance(). With this patch, wakeup latency
> is ~6us.
>
> Signed-off-by: Josh Don <[email protected]>
> ---
> kernel/sched/fair.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8c5b74f66bd3..5abf30117824 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9834,9 +9834,15 @@ static int should_we_balance(struct lb_env *env)
> /*
> * In the newly idle case, we will allow all the CPUs
> * to do the newly idle load balance.
> + *
> + * However, we bail out if we already have tasks or a wakeup pending,
> + * to optimize wakeup latency.
> */
> - if (env->idle == CPU_NEWLY_IDLE)
> + if (env->idle == CPU_NEWLY_IDLE) {
> + if (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending)
> + return 0;
> return 1;
> + }
>
> /* Try to find first idle CPU */
> for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
> --
> 2.36.1.476.g0c4daa206d-goog
>
Thanks Vincent,
On Thu, Jun 9, 2022 at 6:42 AM Vincent Guittot
<[email protected]> wrote:
>
> On Thu, 9 Jun 2022 at 04:55, Josh Don <[email protected]> wrote:
> >
> > While doing newidle load balancing, it is possible for new tasks to
> > arrive, such as with pending wakeups. newidle_balance() already accounts
> > for this by exiting the sched_domain load_balance() iteration if it
> > detects these cases. This is very important for minimizing wakeup
> > latency.
> >
> > However, if we are already in load_balance(), we may stay there for a
> > while before returning back to newidle_balance(). This is most
> > exacerbated if we enter a 'goto redo' loop in the LBF_ALL_PINNED case. A
> > very straightforward workaround to this is to adjust should_we_balance()
> > to bail out if we're doing a CPU_NEWLY_IDLE balance and new tasks are
> > detected.
>
> This one is close to the other tests and I wonder if it should be
> better placed before taking the busiest rq lock and detaching some
> tasks.
>
> Beside your use case where all other threads can't move in local cpu
> and load_balance() loops and clears other cpus, most of the time is
> probably spent in fbg() and fbq() so there are more chance that a task
> woke in this meantime and I imagine that it becomes useless to take
> lock and move tasks from another cpu if the local cpu is no more newly
> idle.
>
> Have you tried other places in load_balance() and does this one
> provide the lowest wakeup latency ?
>
> That being said, the current patch makes sense.
I tested with another check after fbg/fbq and there wasn't any
noticeable improvement to observed wakeup latency (not totally
unexpected, since it only helps for wakeups that come during fbg/fbq).
However, I don't think there's any harm in having that extra check in
the CPU_NEWLY_IDLE case; might as well avoid bouncing the rq lock if
we can. fbq+fbg are together taking ~3-4us per iteration in my repro.
If there are no objections I can send a v2 with the added delta:
@@ -9906,6 +9906,16 @@ static int load_balance(int this_cpu, struct rq *this_rq,
goto out_balanced;
}
+ /*
+ * fbg/fbq can take a while. In the newly idle case, recheck whether
+ * we should continue with balancing, since it is possible that a
+ * task woke up in the interim.
+ */
+ if (env.idle == CPU_NEWLY_IDLE && !should_we_balance(&env)) {
+ *continue_balancing = 0;
+ goto out_balanced;
+ }
+
BUG_ON(busiest == env.dst_rq);
schedstat_add(sd->lb_imbalance[idle], env.imbalance);
On Thu, 9 Jun 2022 at 21:40, Josh Don <[email protected]> wrote:
>
> Thanks Vincent,
>
> On Thu, Jun 9, 2022 at 6:42 AM Vincent Guittot
> <[email protected]> wrote:
> >
> > On Thu, 9 Jun 2022 at 04:55, Josh Don <[email protected]> wrote:
> > >
> > > While doing newidle load balancing, it is possible for new tasks to
> > > arrive, such as with pending wakeups. newidle_balance() already accounts
> > > for this by exiting the sched_domain load_balance() iteration if it
> > > detects these cases. This is very important for minimizing wakeup
> > > latency.
> > >
> > > However, if we are already in load_balance(), we may stay there for a
> > > while before returning back to newidle_balance(). This is most
> > > exacerbated if we enter a 'goto redo' loop in the LBF_ALL_PINNED case. A
> > > very straightforward workaround to this is to adjust should_we_balance()
> > > to bail out if we're doing a CPU_NEWLY_IDLE balance and new tasks are
> > > detected.
> >
> > This one is close to the other tests and I wonder if it should be
> > better placed before taking the busiest rq lock and detaching some
> > tasks.
> >
> > Beside your use case where all other threads can't move in local cpu
> > and load_balance() loops and clears other cpus, most of the time is
> > probably spent in fbg() and fbq() so there are more chance that a task
> > woke in this meantime and I imagine that it becomes useless to take
> > lock and move tasks from another cpu if the local cpu is no more newly
> > idle.
> >
> > Have you tried other places in load_balance() and does this one
> > provide the lowest wakeup latency ?
> >
> > That being said, the current patch makes sense.
>
> I tested with another check after fbg/fbq and there wasn't any
> noticeable improvement to observed wakeup latency (not totally
> unexpected, since it only helps for wakeups that come during fbg/fbq).
ok. so IIUC the wakeup has already happened when we start
load_balance() in your case so the additional test is useless in your
case
> However, I don't think there's any harm in having that extra check in
> the CPU_NEWLY_IDLE case; might as well avoid bouncing the rq lock if
> we can. fbq+fbg are together taking ~3-4us per iteration in my repro.
>
> If there are no objections I can send a v2 with the added delta:
Would be good to get figures that show some benefits of this
additional check for some benchmarks
So I think that we can stay with your current proposal for now
>
> @@ -9906,6 +9906,16 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> goto out_balanced;
> }
>
> + /*
> + * fbg/fbq can take a while. In the newly idle case, recheck whether
> + * we should continue with balancing, since it is possible that a
> + * task woke up in the interim.
> + */
> + if (env.idle == CPU_NEWLY_IDLE && !should_we_balance(&env)) {
> + *continue_balancing = 0;
> + goto out_balanced;
> + }
> +
> BUG_ON(busiest == env.dst_rq);
>
> schedstat_add(sd->lb_imbalance[idle], env.imbalance);
On Fri, Jun 10, 2022 at 1:10 AM Vincent Guittot
<[email protected]> wrote:
>
> On Thu, 9 Jun 2022 at 21:40, Josh Don <[email protected]> wrote:
> >
> > Thanks Vincent,
> >
> > On Thu, Jun 9, 2022 at 6:42 AM Vincent Guittot
> > <[email protected]> wrote:
> > >
> > > On Thu, 9 Jun 2022 at 04:55, Josh Don <[email protected]> wrote:
> > > >
> > > > While doing newidle load balancing, it is possible for new tasks to
> > > > arrive, such as with pending wakeups. newidle_balance() already accounts
> > > > for this by exiting the sched_domain load_balance() iteration if it
> > > > detects these cases. This is very important for minimizing wakeup
> > > > latency.
> > > >
> > > > However, if we are already in load_balance(), we may stay there for a
> > > > while before returning back to newidle_balance(). This is most
> > > > exacerbated if we enter a 'goto redo' loop in the LBF_ALL_PINNED case. A
> > > > very straightforward workaround to this is to adjust should_we_balance()
> > > > to bail out if we're doing a CPU_NEWLY_IDLE balance and new tasks are
> > > > detected.
> > >
> > > This one is close to the other tests and I wonder if it should be
> > > better placed before taking the busiest rq lock and detaching some
> > > tasks.
> > >
> > > Beside your use case where all other threads can't move in local cpu
> > > and load_balance() loops and clears other cpus, most of the time is
> > > probably spent in fbg() and fbq() so there are more chance that a task
> > > woke in this meantime and I imagine that it becomes useless to take
> > > lock and move tasks from another cpu if the local cpu is no more newly
> > > idle.
> > >
> > > Have you tried other places in load_balance() and does this one
> > > provide the lowest wakeup latency ?
> > >
> > > That being said, the current patch makes sense.
> >
> > I tested with another check after fbg/fbq and there wasn't any
> > noticeable improvement to observed wakeup latency (not totally
> > unexpected, since it only helps for wakeups that come during fbg/fbq).
>
> ok. so IIUC the wakeup has already happened when we start
> load_balance() in your case so the additional test is useless in your
> case
Not necessarily; the wakeup could also happen while we're in the
ALL_PINNED redo loop (this lasts ~100us), but the added check doesn't
meaningfully affect latency for my specific repro.
> > However, I don't think there's any harm in having that extra check in
> > the CPU_NEWLY_IDLE case; might as well avoid bouncing the rq lock if
> > we can. fbq+fbg are together taking ~3-4us per iteration in my repro.
> >
> > If there are no objections I can send a v2 with the added delta:
>
> Would be good to get figures that show some benefits of this
> additional check for some benchmarks
>
> So I think that we can stay with your current proposal for now
Sounds good, thanks for taking a look!
> >
> > @@ -9906,6 +9906,16 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> > goto out_balanced;
> > }
> >
> > + /*
> > + * fbg/fbq can take a while. In the newly idle case, recheck whether
> > + * we should continue with balancing, since it is possible that a
> > + * task woke up in the interim.
> > + */
> > + if (env.idle == CPU_NEWLY_IDLE && !should_we_balance(&env)) {
> > + *continue_balancing = 0;
> > + goto out_balanced;
> > + }
> > +
> > BUG_ON(busiest == env.dst_rq);
> >
> > schedstat_add(sd->lb_imbalance[idle], env.imbalance);
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 792b9f65a568f48c50b3175536db9cde5a1edcc0
Gitweb: https://git.kernel.org/tip/792b9f65a568f48c50b3175536db9cde5a1edcc0
Author: Josh Don <[email protected]>
AuthorDate: Wed, 08 Jun 2022 19:55:15 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 13 Jun 2022 10:30:01 +02:00
sched: Allow newidle balancing to bail out of load_balance
While doing newidle load balancing, it is possible for new tasks to
arrive, such as with pending wakeups. newidle_balance() already accounts
for this by exiting the sched_domain load_balance() iteration if it
detects these cases. This is very important for minimizing wakeup
latency.
However, if we are already in load_balance(), we may stay there for a
while before returning back to newidle_balance(). This is most
exacerbated if we enter a 'goto redo' loop in the LBF_ALL_PINNED case. A
very straightforward workaround to this is to adjust should_we_balance()
to bail out if we're doing a CPU_NEWLY_IDLE balance and new tasks are
detected.
This was tested with the following reproduction:
- two threads that take turns sleeping and waking each other up are
affined to two cores
- a large number of threads with 100% utilization are pinned to all
other cores
Without this patch, wakeup latency was ~120us for the pair of threads,
almost entirely spent in load_balance(). With this patch, wakeup latency
is ~6us.
Signed-off-by: Josh Don <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7d8ef01..8bed757 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9824,9 +9824,15 @@ static int should_we_balance(struct lb_env *env)
/*
* In the newly idle case, we will allow all the CPUs
* to do the newly idle load balance.
+ *
+ * However, we bail out if we already have tasks or a wakeup pending,
+ * to optimize wakeup latency.
*/
- if (env->idle == CPU_NEWLY_IDLE)
+ if (env->idle == CPU_NEWLY_IDLE) {
+ if (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending)
+ return 0;
return 1;
+ }
/* Try to find first idle CPU */
for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {