2024-05-31 22:57:46

by Tim Chen

Subject: [PATCH] sched/balance: Skip unnecessary updates to idle load balancer's flags

We observed that the overhead of trigger_load_balance(), now renamed
sched_balance_trigger(), rises with a system's core count.

For an OLTP workload running a 6.8 kernel on a 2-socket x86 system
with 96 cores/socket, we saw that 0.7% of cpu cycles are spent in
trigger_load_balance(). On older systems with fewer cores/socket, this
function's overhead was less than 0.1%.

The cause of this overhead is that multiple cpus call kick_ilb(flags),
updating the balancing work needed on a common idle load balancer cpu.
The ilb_cpu's flags field gets updated unconditionally with
atomic_fetch_or(). The atomic reads and writes of ilb_cpu's flags
cause much cache bouncing and cpu cycle overhead. This is seen in the
annotated profile below.

kick_ilb():
if (ilb_cpu < 0)
test %r14d,%r14d
↑ js 6c
flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
mov $0x2d600,%rdi
movslq %r14d,%r8
mov %rdi,%rdx
add -0x7dd0c3e0(,%r8,8),%rdx
arch_atomic_read():
0.01 mov 0x64(%rdx),%esi
35.58 add $0x64,%rdx
arch_atomic_fetch_or():

static __always_inline int arch_atomic_fetch_or(int i, atomic_t *v)
{
int val = arch_atomic_read(v);

do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
0.03 157: mov %r12d,%ecx
arch_atomic_try_cmpxchg():
return arch_try_cmpxchg(&v->counter, old, new);
0.00 mov %esi,%eax
arch_atomic_fetch_or():
do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
or %esi,%ecx
arch_atomic_try_cmpxchg():
return arch_try_cmpxchg(&v->counter, old, new);
0.01 lock cmpxchg %ecx,(%rdx)
42.96 ↓ jne 2d2
kick_ilb():

With instrumentation, we found that 81% of the updates do not result in
any change in the ilb_cpu's flags. That is, multiple cpus are asking
the ilb_cpu to do the same things over and over again, before the ilb_cpu
has a chance to run NOHZ load balance.

Skip updates to ilb_cpu's flags if no new work needs to be done.
Such updates do not change ilb_cpu's NOHZ flags. This requires an extra
atomic read but it is less expensive than frequent unnecessary atomic
updates that generate cache bounces.

We saw that on the OLTP workload, cpu cycles from trigger_load_balance()
(or sched_balance_trigger()) got reduced from 0.7% to 0.2%.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/fair.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a5b1ae0aa55..9ab6dff6d8ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11891,6 +11891,13 @@ static void kick_ilb(unsigned int flags)
if (ilb_cpu < 0)
return;

+ /*
+ * Don't bother if no new NOHZ balance work items for ilb_cpu,
+ * i.e. all bits in flags are already set in ilb_cpu.
+ */
+ if ((atomic_read(nohz_flags(ilb_cpu)) & flags) == flags)
+ return;
+
/*
* Access to rq::nohz_csd is serialized by NOHZ_KICK_MASK; he who sets
* the first flag owns it; cleared by nohz_csd_func().
--
2.32.0



2024-06-02 16:40:46

by Chen Yu

Subject: Re: [PATCH] sched/balance: Skip unnecessary updates to idle load balancer's flags

On 2024-05-31 at 13:54:52 -0700, Tim Chen wrote:
> We observed that the overhead of trigger_load_balance(), now renamed
> sched_balance_trigger(), rises with a system's core count.
>
> For an OLTP workload running a 6.8 kernel on a 2-socket x86 system
> with 96 cores/socket, we saw that 0.7% of cpu cycles are spent in
> trigger_load_balance(). On older systems with fewer cores/socket, this
> function's overhead was less than 0.1%.
>
> The cause of this overhead is that multiple cpus call kick_ilb(flags),
> updating the balancing work needed on a common idle load balancer cpu.
> The ilb_cpu's flags field gets updated unconditionally with
> atomic_fetch_or(). The atomic reads and writes of ilb_cpu's flags
> cause much cache bouncing and cpu cycle overhead. This is seen in the
> annotated profile below.
>
> kick_ilb():
> if (ilb_cpu < 0)
> test %r14d,%r14d
> ↑ js 6c
> flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
> mov $0x2d600,%rdi
> movslq %r14d,%r8
> mov %rdi,%rdx
> add -0x7dd0c3e0(,%r8,8),%rdx
> arch_atomic_read():
> 0.01 mov 0x64(%rdx),%esi
> 35.58 add $0x64,%rdx
> arch_atomic_fetch_or():
>
> static __always_inline int arch_atomic_fetch_or(int i, atomic_t *v)
> {
> int val = arch_atomic_read(v);
>
> do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
> 0.03 157: mov %r12d,%ecx
> arch_atomic_try_cmpxchg():
> return arch_try_cmpxchg(&v->counter, old, new);
> 0.00 mov %esi,%eax
> arch_atomic_fetch_or():
> do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
> or %esi,%ecx
> arch_atomic_try_cmpxchg():
> return arch_try_cmpxchg(&v->counter, old, new);
> 0.01 lock cmpxchg %ecx,(%rdx)
> 42.96 ↓ jne 2d2
> kick_ilb():
>
> With instrumentation, we found that 81% of the updates do not result in
> any change in the ilb_cpu's flags. That is, multiple cpus are asking
> the ilb_cpu to do the same things over and over again, before the ilb_cpu
> has a chance to run NOHZ load balance.
>
> Skip updates to ilb_cpu's flags if no new work needs to be done.
> Such updates do not change ilb_cpu's NOHZ flags. This requires an extra
> atomic read but it is less expensive than frequent unnecessary atomic
> updates that generate cache bounces.

One source of the contention is that many CPUs choose the same ilb_cpu and ask it to
trigger the nohz idle balance. This is because find_new_ilb() always finds the first
nohz idle CPU. I wonder if we could change the
for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask)
into something like
for_each_cpu_wrap(ilb_cpu, cpumask_and(nohz.idle_cpus_mask, hk_mask), this_cpu+1)
so that different ilb_cpus might be found by different CPUs.
Then the extra atomic read could bring fewer cache bounces.
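
Something like the sketch below, just to illustrate the idea (written from my
memory of find_new_ilb(); the helper name find_new_ilb_wrapped() is made up, and
here the hk_mask is checked per-cpu instead of building a temporary cpumask):

static inline int find_new_ilb_wrapped(void)
{
	const struct cpumask *hk_mask = housekeeping_cpumask(HK_TYPE_MISC);
	int ilb_cpu;

	/*
	 * Start the scan after the current cpu so that different callers
	 * tend to spread their kicks over different nohz idle cpus.
	 */
	for_each_cpu_wrap(ilb_cpu, nohz.idle_cpus_mask, smp_processor_id() + 1) {
		if (!cpumask_test_cpu(ilb_cpu, hk_mask))
			continue;

		if (ilb_cpu == smp_processor_id())
			continue;

		if (idle_cpu(ilb_cpu))
			return ilb_cpu;
	}

	return -1;
}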

>
> We saw that on the OLTP workload, cpu cycles from trigger_load_balance()
> (or sched_balance_trigger()) got reduced from 0.7% to 0.2%.
>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> kernel/sched/fair.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a5b1ae0aa55..9ab6dff6d8ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11891,6 +11891,13 @@ static void kick_ilb(unsigned int flags)
> if (ilb_cpu < 0)
> return;
>
> + /*
> + * Don't bother if no new NOHZ balance work items for ilb_cpu,
> + * i.e. all bits in flags are already set in ilb_cpu.
> + */
> + if ((atomic_read(nohz_flags(ilb_cpu)) & flags) == flags)

Maybe also mention in the comment that when the above statement is true, the
current ilb_cpu's flags are not 0 and are within NOHZ_KICK_MASK, so returning
directly here is safe (anyway, just my 2 cents)

Reviewed-by: Chen Yu <[email protected]>

thanks,
Chenyu

> + return;
> +
> /*
> * Access to rq::nohz_csd is serialized by NOHZ_KICK_MASK; he who sets
> * the first flag owns it; cleared by nohz_csd_func().
> --
> 2.32.0
>

2024-06-03 16:16:17

by Tim Chen

Subject: Re: [PATCH] sched/balance: Skip unnecessary updates to idle load balancer's flags

On Mon, 2024-06-03 at 00:40 +0800, Chen Yu wrote:
> >
> > With instrumentation, we found that 81% of the updates do not result in
> > any change in the ilb_cpu's flags. That is, multiple cpus are asking
> > the ilb_cpu to do the same things over and over again, before the ilb_cpu
> > has a chance to run NOHZ load balance.
> >
> > Skip updates to ilb_cpu's flags if no new work needs to be done.
> > Such updates do not change ilb_cpu's NOHZ flags. This requires an extra
> > atomic read but it is less expensive than frequent unnecessary atomic
> > updates that generate cache bounces.
>
> One source of the contention is that many CPUs choose the same ilb_cpu and ask it to
> trigger the nohz idle balance. This is because find_new_ilb() always finds the first
> nohz idle CPU. I wonder if we could change the
> for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask)
> into something like
> for_each_cpu_wrap(ilb_cpu, cpumask_and(nohz.idle_cpus_mask, hk_mask), this_cpu+1)
> so that different ilb_cpus might be found by different CPUs.
> Then the extra atomic read could bring fewer cache bounces.
>

Your proposal improves scaling. However, it could result in many idle
CPUs getting kicked. I assume that the current approach of delegating
to a common idle CPU will disturb fewer CPUs, let them stay in deeper
idle states, and keep the power benefits of the NOHZ scheme.

> >
> > We saw that on the OLTP workload, cpu cycles from trigger_load_balance()
> > (or sched_balance_trigger()) got reduced from 0.7% to 0.2%.
> >
> > Signed-off-by: Tim Chen <[email protected]>
> > ---
> > kernel/sched/fair.c | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a5b1ae0aa55..9ab6dff6d8ac 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11891,6 +11891,13 @@ static void kick_ilb(unsigned int flags)
> > if (ilb_cpu < 0)
> > return;
> >
> > + /*
> > + * Don't bother if no new NOHZ balance work items for ilb_cpu,
> > + * i.e. all bits in flags are already set in ilb_cpu.
> > + */
> > + if ((atomic_read(nohz_flags(ilb_cpu)) & flags) == flags)
>
> Maybe also mention in the comment that when the above statement is true, the
> current ilb_cpu's flags are not 0 and are within NOHZ_KICK_MASK, so returning
> directly here is safe (anyway, just my 2 cents)

Not sure I follow your comments about the return being safe. Let me explain
in detail.

We will return directly if and only if the bits set in flags are also set
in nohz_flags(ilb_cpu).

The comment's intention is to say that if the above statement is true, then
the later operation of

atomic_fetch_or(flags, nohz_flags(ilb_cpu))

will be useless and not result in any change to nohz_flags(ilb_cpu), since all the set bits
in flags are already set in nohz_flags(ilb_cpu).

So will it be clearer if I say

/*
* Don't bother if no new NOHZ balance work items for ilb_cpu,
* i.e. all bits in flags are already set in ilb_cpu.
* Later OR of flags to nohz_flags(ilb_cpu)
* will not change nohz_flags(ilb_cpu).
*/
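
i.e. the hunk would then read roughly as below -- the check is unchanged from
the posted patch, only the comment is expanded:

	/*
	 * Don't bother if no new NOHZ balance work items for ilb_cpu,
	 * i.e. all bits in flags are already set in ilb_cpu.
	 * Later OR of flags to nohz_flags(ilb_cpu)
	 * will not change nohz_flags(ilb_cpu).
	 */
	if ((atomic_read(nohz_flags(ilb_cpu)) & flags) == flags)
		return;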

Thanks.


Tim

> Reviewed-by: Chen Yu <[email protected]>
>
> thanks,
> Chenyu
>
> > + return;
> > +
> > /*
> > * Access to rq::nohz_csd is serialized by NOHZ_KICK_MASK; he who sets
> > * the first flag owns it; cleared by nohz_csd_func().
> > --
> > 2.32.0
> >


2024-06-04 02:10:56

by Chen Yu

Subject: Re: [PATCH] sched/balance: Skip unnecessary updates to idle load balancer's flags

On 2024-06-03 at 09:13:47 -0700, Tim Chen wrote:
> On Mon, 2024-06-03 at 00:40 +0800, Chen Yu wrote:
> > >
> > > With instrumentation, we found that 81% of the updates do not result in
> > > any change in the ilb_cpu's flags. That is, multiple cpus are asking
> > > the ilb_cpu to do the same things over and over again, before the ilb_cpu
> > > has a chance to run NOHZ load balance.
> > >
> > > Skip updates to ilb_cpu's flags if no new work needs to be done.
> > > Such updates do not change ilb_cpu's NOHZ flags. This requires an extra
> > > atomic read but it is less expensive than frequent unnecessary atomic
> > > updates that generate cache bounces.
> >
> > One source of the contention is that many CPUs choose the same ilb_cpu and ask it to
> > trigger the nohz idle balance. This is because find_new_ilb() always finds the first
> > nohz idle CPU. I wonder if we could change the
> > for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask)
> > into something like
> > for_each_cpu_wrap(ilb_cpu, cpumask_and(nohz.idle_cpus_mask, hk_mask), this_cpu+1)
> > so that different ilb_cpus might be found by different CPUs.
> > Then the extra atomic read could bring fewer cache bounces.
> >
>
> Your proposal improves scaling. However, it could result in many idle
> CPUs getting kicked. I assume that the current approach of delegating
> to a common idle CPU will disturb fewer CPUs, let them stay in deeper
> idle states, and keep the power benefits of the NOHZ scheme.
>

I see, from the power point of view, the current solution is better.

> > >
> > > We saw that on the OLTP workload, cpu cycles from trigger_load_balance()
> > > (or sched_balance_trigger()) got reduced from 0.7% to 0.2%.
> > >
> > > Signed-off-by: Tim Chen <[email protected]>
> > > ---
> > > kernel/sched/fair.c | 7 +++++++
> > > 1 file changed, 7 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 8a5b1ae0aa55..9ab6dff6d8ac 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -11891,6 +11891,13 @@ static void kick_ilb(unsigned int flags)
> > > if (ilb_cpu < 0)
> > > return;
> > >
> > > + /*
> > > + * Don't bother if no new NOHZ balance work items for ilb_cpu,
> > > + * i.e. all bits in flags are already set in ilb_cpu.
> > > + */
> > > + if ((atomic_read(nohz_flags(ilb_cpu)) & flags) == flags)
> >
> > Maybe also mention in the comment that when the above statement is true, the
> > current ilb_cpu's flags are not 0 and are within NOHZ_KICK_MASK, so returning
> > directly here is safe (anyway, just my 2 cents)
>
> Not sure I follow your comments about return being safe. Let me explain
> in details.
>
> We will return directly if and only if the bits set in flags are also set
> in nohz_flags(ilb_cpu).
>
> The comment's intention is to say that if the above statement is true, then
> the later operation of
>
> atomic_fetch_or(flags, nohz_flags(ilb_cpu))
>
> will be useless and not result in any change to nohz_flags(ilb_cpu), since all the set bits
> in flags are already set in nohz_flags(ilb_cpu).

Understood. My previous thought was: what if the current nohz_flags(ilb_cpu) is 0 or
NOHZ_NEWILB_KICK? If so, returning directly might miss one IPI to the ilb_cpu (because
the current code checks flags & NOHZ_KICK_MASK before returning directly). But from the
current logic, when we reach kick_ilb() the flags are not 0, and the flags passed by
nohz_balancer_kick() satisfy (flags & NOHZ_KICK_MASK), so returning here is ok.
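
(For reference, my recollection of the related flag definitions in
kernel/sched/sched.h -- quoting from memory, so please double check:

#define NOHZ_BALANCE_KICK	BIT(NOHZ_BALANCE_KICK_BIT)
#define NOHZ_STATS_KICK		BIT(NOHZ_STATS_KICK_BIT)
#define NOHZ_NEWILB_KICK	BIT(NOHZ_NEWILB_KICK_BIT)
#define NOHZ_NEXT_KICK		BIT(NOHZ_NEXT_KICK_BIT)

/* NOHZ_NEWILB_KICK is not part of the kick mask. */
#define NOHZ_KICK_MASK	(NOHZ_BALANCE_KICK | NOHZ_STATS_KICK | NOHZ_NEXT_KICK)

Any flags value that nohz_balancer_kick() passes to kick_ilb() is built only
from bits inside NOHZ_KICK_MASK, so the new early return cannot hide the one
update that is supposed to trigger the IPI.)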

>
> So will it be clearer if I say
>
> /*
> * Don't bother if no new NOHZ balance work items for ilb_cpu,
> * i.e. all bits in flags are already set in ilb_cpu.
> * Later OR of flags to nohz_flags(ilb_cpu)
> * will not change nohz_flags(ilb_cpu).
> */
>

Yes, this is ok.


thanks,
Chenyu

> Thanks.
>
>
> Tim
>
> > Reviewed-by: Chen Yu <[email protected]>
> >
> > thanks,
> > Chenyu
> >
> > > + return;
> > > +
> > > /*
> > > * Access to rq::nohz_csd is serialized by NOHZ_KICK_MASK; he who sets
> > > * the first flag owns it; cleared by nohz_csd_func().
> > > --
> > > 2.32.0
> > >
>

2024-06-04 14:39:09

by Vincent Guittot

Subject: Re: [PATCH] sched/balance: Skip unnecessary updates to idle load balancer's flags

On Fri, 31 May 2024 at 22:52, Tim Chen <[email protected]> wrote:
>
> We observed that the overhead of trigger_load_balance(), now renamed
> sched_balance_trigger(), rises with a system's core count.
>
> For an OLTP workload running a 6.8 kernel on a 2-socket x86 system
> with 96 cores/socket, we saw that 0.7% of cpu cycles are spent in
> trigger_load_balance(). On older systems with fewer cores/socket, this
> function's overhead was less than 0.1%.
>
> The cause of this overhead is that multiple cpus call kick_ilb(flags),
> updating the balancing work needed on a common idle load balancer cpu.
> The ilb_cpu's flags field gets updated unconditionally with
> atomic_fetch_or(). The atomic reads and writes of ilb_cpu's flags
> cause much cache bouncing and cpu cycle overhead. This is seen in the
> annotated profile below.
>
> kick_ilb():
> if (ilb_cpu < 0)
> test %r14d,%r14d
> ↑ js 6c
> flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
> mov $0x2d600,%rdi
> movslq %r14d,%r8
> mov %rdi,%rdx
> add -0x7dd0c3e0(,%r8,8),%rdx
> arch_atomic_read():
> 0.01 mov 0x64(%rdx),%esi
> 35.58 add $0x64,%rdx
> arch_atomic_fetch_or():
>
> static __always_inline int arch_atomic_fetch_or(int i, atomic_t *v)
> {
> int val = arch_atomic_read(v);
>
> do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
> 0.03 157: mov %r12d,%ecx
> arch_atomic_try_cmpxchg():
> return arch_try_cmpxchg(&v->counter, old, new);
> 0.00 mov %esi,%eax
> arch_atomic_fetch_or():
> do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
> or %esi,%ecx
> arch_atomic_try_cmpxchg():
> return arch_try_cmpxchg(&v->counter, old, new);
> 0.01 lock cmpxchg %ecx,(%rdx)
> 42.96 ↓ jne 2d2
> kick_ilb():
>
> With instrumentation, we found that 81% of the updates do not result in
> any change in the ilb_cpu's flags. That is, multiple cpus are asking
> the ilb_cpu to do the same things over and over again, before the ilb_cpu
> has a chance to run NOHZ load balance.
>
> Skip updates to ilb_cpu's flags if no new work needs to be done.
> Such updates do not change ilb_cpu's NOHZ flags. This requires an extra
> atomic read but it is less expensive than frequent unnecessary atomic
> updates that generate cache bounces.
>
> We saw that on the OLTP workload, cpu cycles from trigger_load_balance()
> (or sched_balance_trigger()) got reduced from 0.7% to 0.2%.

Makes sense, we have seen other variables become a bottleneck in the
scheduler, like task_group's load_avg or the root domain's overload.

Reviewed-by: Vincent Guittot <[email protected]>

>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> kernel/sched/fair.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a5b1ae0aa55..9ab6dff6d8ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11891,6 +11891,13 @@ static void kick_ilb(unsigned int flags)
> if (ilb_cpu < 0)
> return;
>
> + /*
> + * Don't bother if no new NOHZ balance work items for ilb_cpu,
> + * i.e. all bits in flags are already set in ilb_cpu.
> + */
> + if ((atomic_read(nohz_flags(ilb_cpu)) & flags) == flags)
> + return;
> +
> /*
> * Access to rq::nohz_csd is serialized by NOHZ_KICK_MASK; he who sets
> * the first flag owns it; cleared by nohz_csd_func().
> --
> 2.32.0
>

2024-06-04 22:32:32

by Tim Chen

Subject: Re: [PATCH] sched/balance: Skip unnecessary updates to idle load balancer's flags

On Tue, 2024-06-04 at 10:10 +0800, Chen Yu wrote:
>
> >
> > Not sure I follow your comments about the return being safe. Let me explain
> > in detail.
> >
> > We will return directly if and only if the bits set in flags are also set
> > in nohz_flags(ilb_cpu).
> >
> > The comment's intention is to say that if the above statement is true, then
> > the later operation of
> >
> > atomic_fetch_or(flags, nohz_flags(ilb_cpu))
> >
> > will be useless and not result in any change to nohz_flags(ilb_cpu), since all the set bits
> > in flags are already set in nohz_flags(ilb_cpu).
>
> Understood. My previous thought was: what if the current nohz_flags(ilb_cpu) is 0 or
> NOHZ_NEWILB_KICK? If so, returning directly might miss one IPI to the ilb_cpu (because
> the current code checks flags & NOHZ_KICK_MASK before returning directly). But from the
> current logic, when we reach kick_ilb() the flags are not 0, and the flags passed by
> nohz_balancer_kick() satisfy (flags & NOHZ_KICK_MASK), so returning here is ok.

There should be no issue of missing an IPI. kick_ilb(flags) is always called with a
non-zero flags value, and the first cpu that successfully sets bits within
NOHZ_KICK_MASK in nohz_flags(ilb_cpu) is the one that sends the IPI.
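
For anyone following along, the rest of kick_ilb() after the new early return
looks roughly like this (paraphrased from memory of fair.c, treat it as a sketch):

	/*
	 * Access to rq::nohz_csd is serialized by NOHZ_KICK_MASK; he who sets
	 * the first flag owns it; cleared by nohz_csd_func().
	 */
	flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
	if (flags & NOHZ_KICK_MASK)
		return;

	/*
	 * This way we generate an IPI on the target CPU which is idle,
	 * and the softirq performing NOHZ idle load balancing will be
	 * run before returning from the IPI.
	 */
	smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);

atomic_fetch_or() returns the old value, so only the cpu that turns on the first
NOHZ_KICK_MASK bit sees those bits clear and goes on to send the IPI; everyone
after that returns early.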

Tim




by tip-bot2 for Tim Chen

Subject: [tip: sched/core] sched/balance: Skip unnecessary updates to idle load balancer's flags

The following commit has been merged into the sched/core branch of tip:

Commit-ID: f90cc919f9e5cbfcd0b952290c57ef1317f4e91e
Gitweb: https://git.kernel.org/tip/f90cc919f9e5cbfcd0b952290c57ef1317f4e91e
Author: Tim Chen <[email protected]>
AuthorDate: Fri, 31 May 2024 13:54:52 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 05 Jun 2024 15:52:36 +02:00

sched/balance: Skip unnecessary updates to idle load balancer's flags

We observed that the overhead of trigger_load_balance(), now renamed
sched_balance_trigger(), rises with a system's core count.

For an OLTP workload running a 6.8 kernel on a 2-socket x86 system
with 96 cores/socket, we saw that 0.7% of cpu cycles are spent in
trigger_load_balance(). On older systems with fewer cores/socket, this
function's overhead was less than 0.1%.

The cause of this overhead is that multiple cpus call kick_ilb(flags),
updating the balancing work needed on a common idle load balancer cpu.
The ilb_cpu's flags field gets updated unconditionally with
atomic_fetch_or(). The atomic reads and writes of ilb_cpu's flags
cause much cache bouncing and cpu cycle overhead. This is seen in the
annotated profile below.

kick_ilb():
if (ilb_cpu < 0)
test %r14d,%r14d
↑ js 6c
flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
mov $0x2d600,%rdi
movslq %r14d,%r8
mov %rdi,%rdx
add -0x7dd0c3e0(,%r8,8),%rdx
arch_atomic_read():
0.01 mov 0x64(%rdx),%esi
35.58 add $0x64,%rdx
arch_atomic_fetch_or():

static __always_inline int arch_atomic_fetch_or(int i, atomic_t *v)
{
int val = arch_atomic_read(v);

do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
0.03 157: mov %r12d,%ecx
arch_atomic_try_cmpxchg():
return arch_try_cmpxchg(&v->counter, old, new);
0.00 mov %esi,%eax
arch_atomic_fetch_or():
do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
or %esi,%ecx
arch_atomic_try_cmpxchg():
return arch_try_cmpxchg(&v->counter, old, new);
0.01 lock cmpxchg %ecx,(%rdx)
42.96 ↓ jne 2d2
kick_ilb():

With instrumentation, we found that 81% of the updates do not result in
any change in the ilb_cpu's flags. That is, multiple cpus are asking
the ilb_cpu to do the same things over and over again, before the ilb_cpu
has a chance to run NOHZ load balance.

Skip updates to ilb_cpu's flags if no new work needs to be done.
Such updates do not change ilb_cpu's NOHZ flags. This requires an extra
atomic read but it is less expensive than frequent unnecessary atomic
updates that generate cache bounces.

We saw that on the OLTP workload, cpu cycles from trigger_load_balance()
(or sched_balance_trigger()) got reduced from 0.7% to 0.2%.

Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Chen Yu <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 63113dc..41b5838 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11892,6 +11892,13 @@ static void kick_ilb(unsigned int flags)
return;

/*
+ * Don't bother if no new NOHZ balance work items for ilb_cpu,
+ * i.e. all bits in flags are already set in ilb_cpu.
+ */
+ if ((atomic_read(nohz_flags(ilb_cpu)) & flags) == flags)
+ return;
+
+ /*
* Access to rq::nohz_csd is serialized by NOHZ_KICK_MASK; he who sets
* the first flag owns it; cleared by nohz_csd_func().
*/

2024-06-05 18:46:38

by Tim Chen

Subject: Re: [tip: sched/core] sched/balance: Skip unnecessary updates to idle load balancer's flags

On Wed, 2024-06-05 at 14:54 +0000, tip-bot2 for Tim Chen wrote:
>
> Signed-off-by: Tim Chen <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Reviewed-by: Chen Yu <[email protected]>
> Reviewed-by: Vincent Guittot <[email protected]>
> Link: https://lore.kernel.org/r/[email protected]

Peter,

Thanks for putting the patch in tip.

Can you also add

Tested-by: Mohini Narkhede <[email protected]>

Thanks.

Tim

> ---
> kernel/sched/fair.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 63113dc..41b5838 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11892,6 +11892,13 @@ static void kick_ilb(unsigned int flags)
> return;
>
> /*
> + * Don't bother if no new NOHZ balance work items for ilb_cpu,
> + * i.e. all bits in flags are already set in ilb_cpu.
> + */
> + if ((atomic_read(nohz_flags(ilb_cpu)) & flags) == flags)
> + return;
> +
> + /*
> * Access to rq::nohz_csd is serialized by NOHZ_KICK_MASK; he who sets
> * the first flag owns it; cleared by nohz_csd_func().
> */