2024-01-19 08:43:22

by K Prateek Nayak

Subject: [PATCH] sched/fair: Skip newidle_balance() when an idle CPU is woken up to process an IPI

When measuring IPI throughput using a modified version of Anton
Blanchard's ipistorm benchmark [1], configured to measure the time taken
to perform a fixed number of smp_call_function_single() calls (with wait
set to 1), an increase in benchmark time was observed between v5.7 and
v5.8. Bisection pointed to commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()") as the reason behind this increase in
runtime.
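
For reference, the measured loop is conceptually equivalent to the
following (illustrative sketch only; the actual module differs):

	/* sketch of what the modified ipistorm module measures */
	static void do_nothing(void *unused) { }

	for (i = 0; i < numipi; i++)
		smp_call_function_single(target_cpu, do_nothing,
					 NULL, 1 /* wait */);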

Since a clean revert is not possible with the latest kernel, to restore
the older behavior, the call_function_single_prep_ipi() check in
send_call_function_single_ipi() was skipped, leading to
send_call_function_single_ipi() always calling
arch_send_call_function_single_ipi(). The
flush_smp_call_function_queue() introduced in do_idle() by the same
commit was removed as well, since the above change would unconditionally
send an IPI to the idle CPU in TIF_POLLING mode.
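
Conceptually, the revert amounts to the following (illustrative sketch,
not the literal diff):

	/* kernel/smp.c: always send a real IPI, as before v5.8 */
	void send_call_function_single_ipi(int cpu)
	{
		/* call_function_single_prep_ipi() check skipped */
		arch_send_call_function_single_ipi(cpu);
	}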

With the above changes (which will be referred to as "revert"
henceforth), the ipistorm benchmark time improves. Following are
normalized results from test runs (average runtime from 10 runs of
ipistorm) on a dual-socket 4th Generation EPYC system (2 x 128C/256T):

cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1

==================================================================
Test : ipistorm (modified)
Units : Normalized runtime
Interpretation: Lower is better
Statistic : AMean
==================================================================
kernel: time [pct imp]
v6.7-rc5 1.00 [0.00]
v6.7-rc5 + revert 0.66 [34.41]

Although the revert improves ipistorm performance, it also regresses
tbench and netperf when applied on top of the latest tip:sched/core,
supporting the validity of the optimization.

tip:sched/core was at tag "sched-core-2024-01-08" during the test. C2
was disabled, but boost remained on, coupled with the performance
governor, for all tests.

==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) revert[pct imp](CV)
1 1.00 [ 0.00]( 0.24) 0.91 [ -8.96]( 0.30)
2 1.00 [ 0.00]( 0.25) 0.92 [ -8.20]( 0.97)
4 1.00 [ 0.00]( 0.23) 0.91 [ -9.20]( 1.75)
8 1.00 [ 0.00]( 0.69) 0.91 [ -9.48]( 1.56)
16 1.00 [ 0.00]( 0.66) 0.92 [ -8.49]( 2.43)
32 1.00 [ 0.00]( 0.96) 0.89 [-11.13]( 0.96)
64 1.00 [ 0.00]( 1.06) 0.90 [ -9.72]( 2.49)
128 1.00 [ 0.00]( 0.70) 0.92 [ -8.36]( 1.26)
256 1.00 [ 0.00]( 0.72) 0.97 [ -3.30]( 1.10)
512 1.00 [ 0.00]( 0.42) 0.98 [ -1.73]( 0.37)
1024 1.00 [ 0.00]( 0.28) 0.99 [ -1.39]( 0.43)

==================================================================
Test : netperf
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) revert[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.50) 0.89 [-10.51]( 0.20)
2-clients 1.00 [ 0.00]( 1.16) 0.89 [-11.10]( 0.59)
4-clients 1.00 [ 0.00]( 1.03) 0.89 [-10.68]( 0.38)
8-clients 1.00 [ 0.00]( 0.99) 0.89 [-10.54]( 0.50)
16-clients 1.00 [ 0.00]( 0.87) 0.89 [-10.92]( 0.95)
32-clients 1.00 [ 0.00]( 1.24) 0.89 [-10.85]( 0.63)
64-clients 1.00 [ 0.00]( 1.58) 0.90 [-10.11]( 1.18)
128-clients 1.00 [ 0.00]( 0.87) 0.89 [-10.94]( 1.11)
256-clients 1.00 [ 0.00]( 4.77) 1.00 [ -0.16]( 3.45)
512-clients 1.00 [ 0.00](56.16) 1.02 [ 2.10](56.05)

Looking further into the difference in the code path taken, it was
noticed that to pull a TIF_POLLING thread out of idle to process an IPI,
the sender sets the TIF_NEED_RESCHED bit in the idle task's thread info.
As a result, the scheduler expects a task to be enqueued when exiting
the idle path. This is not the case with non-polling idle states, where
the idle CPU exits the idle state to process the interrupt and, since
need_resched() returns false, soon goes back to idle again.
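
The sender-side logic is roughly as follows (simplified sketch of
set_nr_if_polling() from kernel/sched/core.c; details vary across
kernel versions):

	static bool set_nr_if_polling(struct task_struct *p)
	{
		struct thread_info *ti = task_thread_info(p);
		typeof(ti->flags) val = READ_ONCE(ti->flags);

		do {
			/* Not polling: a real IPI is needed to wake the CPU */
			if (!(val & _TIF_POLLING_NRFLAG))
				return false;
			/* Reschedule already pending: nothing more to do */
			if (val & _TIF_NEED_RESCHED)
				return true;
		} while (!try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED));

		return true;
	}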

When the TIF_NEED_RESCHED flag is set, do_idle() will call
schedule_idle(), a large part of which runs with local IRQs disabled. In
the case of ipistorm, when measuring IPI throughput, this large
IRQ-disabled section delays the processing of IPIs. Further auditing
revealed that in the absence of any runnable tasks, pick_next_task_fair(),
which is called from the pick_next_task() fast path, will always call
newidle_balance() in this scenario, further increasing the time spent in
the IRQ-disabled section.
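
For reference, the tail of do_idle() looks roughly like the following
(simplified):

	while (!need_resched()) {
		...
		cpuidle_idle_call();	/* polls with TIF_POLLING_NRFLAG set */
	}
	flush_smp_call_function_queue();	/* IPI payload is processed here */
	schedule_idle();	/* __schedule() with a long IRQ-disabled section */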

In cases where an idle CPU in TIF_POLLING mode wakes up to process only
an interrupt, with no new tasks enqueued, skip newidle_balance(). This
also makes the behavior consistent with scenarios where an idle CPU in
non-polling mode is woken up only to process an interrupt, where
need_resched() returns false, leading to an immediate idle re-entry.

Following are the ipistorm results on the same test system with this
patch:

==================================================================
Test : ipistorm (modified)
Units : Normalized runtime
Interpretation: Lower is better
Statistic : AMean
==================================================================
kernel: time [pct imp]
v6.7-rc5 1.00 [0.00]
v6.7-rc5 + revert 0.66 [34.41]
v6.7-rc5 + patch 0.55 [44.81]

netperf and tbench results with the patch match the results on tip.
Additionally, hackbench, stream, and schbench were also tested, with
results from the patched kernel matching those of tip.

Link: https://github.com/antonblanchard/ipistorm [1]
Suggested-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: K Prateek Nayak <[email protected]>
---
This patch is based on tip:sched/core at tag "sched-core-2024-01-08".
---
kernel/sched/fair.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b803030c3a03..1fedc7e29c98 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8499,6 +8499,16 @@ done: __maybe_unused;
if (!rf)
return NULL;

+ /*
+ * An idle CPU in TIF_POLLING mode might end up here after processing
+ * an IPI when the sender sets the TIF_NEED_RESCHED bit and avoids
+ * sending an actual IPI. In such cases, where an idle CPU was woken
+ * up only to process an interrupt, without necessarily queuing a task
+ * on it, skip newidle_balance() to facilitate faster idle re-entry.
+ */
+ if (prev == rq->idle)
+ return NULL;
+
new_tasks = newidle_balance(rq, rf);

/*
--
2.25.1



2024-01-23 00:26:34

by Tim Chen

Subject: Re: [PATCH] sched/fair: Skip newidle_balance() when an idle CPU is woken up to process an IPI

On Fri, 2024-01-19 at 14:15 +0530, K Prateek Nayak wrote:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b803030c3a03..1fedc7e29c98 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8499,6 +8499,16 @@ done: __maybe_unused;
> if (!rf)
> return NULL;
>
> + /*
> + * An idle CPU in TIF_POLLING mode might end up here after processing
> + * an IPI when the sender sets the TIF_NEED_RESCHED bit and avoids
> + * sending an actual IPI. In such cases, where an idle CPU was woken
> + * up only to process an interrupt, without necessarily queuing a task
> + * on it, skip newidle_balance() to facilitate faster idle re-entry.
> + */
> + if (prev == rq->idle)
> + return NULL;
> +

Should we check the call function queue directly to detect that there is
an IPI waiting to be processed? Something like:

if (!llist_empty(&per_cpu(call_single_queue, rq->cpu)))
return NULL;

Could there be cases where we want to do idle balance in this code path?
Say a CPU is idle and a scheduling tick comes in; we may try
to look for something to run on the idle CPU. It seems like after
your change above, that would be skipped.

Tim


> new_tasks = newidle_balance(rq, rf);
>
> /*


2024-01-23 04:58:54

by K Prateek Nayak

Subject: Re: [PATCH] sched/fair: Skip newidle_balance() when an idle CPU is woken up to process an IPI

Hello Tim,

On 1/23/2024 3:29 AM, Tim Chen wrote:
> On Fri, 2024-01-19 at 14:15 +0530, K Prateek Nayak wrote:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b803030c3a03..1fedc7e29c98 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8499,6 +8499,16 @@ done: __maybe_unused;
>> if (!rf)
>> return NULL;
>>
>> + /*
>> + * An idle CPU in TIF_POLLING mode might end up here after processing
>> + * an IPI when the sender sets the TIF_NEED_RESCHED bit and avoids
>> + * sending an actual IPI. In such cases, where an idle CPU was woken
>> + * up only to process an interrupt, without necessarily queuing a task
>> + * on it, skip newidle_balance() to facilitate faster idle re-entry.
>> + */
>> + if (prev == rq->idle)
>> + return NULL;
>> +
>
> Should we check the call function queue directly to detect that there is
> an IPI waiting to be processed? something like
>
> if (!llist_empty(&per_cpu(call_single_queue, rq->cpu)))
> return NULL;

That could be a valid check too. However, if an IPI is queued right
after this check, its processing is still delayed, since
newidle_balance() only bails out when a wakeup is trying to queue a new
task on the CPU running newidle_balance().
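
The bail-out I'm referring to is roughly the following (simplified from
should_we_balance(), as changed by commit 792b9f65a568):

	if (env->idle == CPU_NEWLY_IDLE) {
		/* Bail if a task was enqueued or a wakeup is pending */
		if (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending)
			return 0;
		return 1;
	}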

>
> Could there be cases where we want to do idle balance in this code path?
> Say a cpu is idle and a scheduling tick came in, we may try
> to look for something to run on the idle cpu. Seems like after
> your change above, that would be skipped.

Wouldn't scheduler_tick() do load balancing when the time comes? In my
testing, I did not see a case where the workloads I tested were
sensitive to the aspect of newidle_balance() being invoked at scheduler
tick. Have you come across a workload which might be sensitive to this
aspect that I can quickly test and verify? Meanwhile, I'll run the
workloads mentioned in the commit log on an Intel system to see if I
can spot any sensitivity to this change.

Adding David to the thread too since HHVM seems to be one of those
workloads that is very sensitive to a successful newidle_balance().

>
> Tim
>
>
>> new_tasks = newidle_balance(rq, rf);
>>
>> /*
>

--
Thanks and Regards,
Prateek

2024-01-23 08:06:42

by Vincent Guittot

Subject: Re: [PATCH] sched/fair: Skip newidle_balance() when an idle CPU is woken up to process an IPI

On Tue, 23 Jan 2024 at 05:58, K Prateek Nayak <[email protected]> wrote:
>
> Hello Tim,
>
> On 1/23/2024 3:29 AM, Tim Chen wrote:
> > On Fri, 2024-01-19 at 14:15 +0530, K Prateek Nayak wrote:
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index b803030c3a03..1fedc7e29c98 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -8499,6 +8499,16 @@ done: __maybe_unused;
> >> if (!rf)
> >> return NULL;
> >>
> >> + /*
> >> + * An idle CPU in TIF_POLLING mode might end up here after processing
> >> + * an IPI when the sender sets the TIF_NEED_RESCHED bit and avoids
> >> + * sending an actual IPI. In such cases, where an idle CPU was woken
> >> + * up only to process an interrupt, without necessarily queuing a task
> >> + * on it, skip newidle_balance() to facilitate faster idle re-entry.
> >> + */
> >> + if (prev == rq->idle)
> >> + return NULL;
> >> +
> >
> > Should we check the call function queue directly to detect that there is
> > an IPI waiting to be processed? something like
> >
> > if (!llist_empty(&per_cpu(call_single_queue, rq->cpu)))
> > return NULL;
>
> That could be a valid check too. However, if an IPI is queued right
> after this check, the processing is still delayed since
> newidle_balance() only bails out for scenarios when a wakeup is trying
> to queue a new task on the CPU running the newidle_balance().
>
> >
> > Could there be cases where we want to do idle balance in this code path?
> > Say a cpu is idle and a scheduling tick came in, we may try
> > to look for something to run on the idle cpu. Seems like after
> > your change above, that would be skipped.
>
> Wouldn't scheduler_tick() do load balancing when the time comes? In my
> testing, I did not see a case where the workloads I tested were
> sensitive to the aspect of newidle_balance() being invoked at scheduler
> tick. Have you come across a workload which might be sensitive to this
> aspect that I can quickly test and verify? Meanwhile, I'll run the
> workloads mentioned in the commit log on an Intel system to see if I
> can spot any sensitivity to this change.

Instead of trying to fix the spurious need_resched in the scheduler,
can't we find a way to prevent it from happening?

Because TIF_NEED_RESCHED is set when TIF_POLLING is set, idle load
balances are already skipped in favor of a less aggressive newly idle
load balance:
https://lore.kernel.org/all/CAKfTPtC9Px_W84YRJqnFNkL8oofO15D-P=VTCMUUu7NJr+xwBA@mail.gmail.com/

The root of the problem is that we keep TIF_NEED_RESCHED set.

>
> Adding David to the thread too since HHVM seems to be one of those
> workloads that is very sensitive to a successful newidle_balance().
>
> >
> > Tim
> >
> >
> >> new_tasks = newidle_balance(rq, rf);
> >>
> >> /*
> >
>
> --
> Thanks and Regards,
> Prateek

2024-01-23 10:01:20

by K Prateek Nayak

Subject: Re: [PATCH] sched/fair: Skip newidle_balance() when an idle CPU is woken up to process an IPI

Hello Vincent,

On 1/23/2024 1:36 PM, Vincent Guittot wrote:
> On Tue, 23 Jan 2024 at 05:58, K Prateek Nayak <[email protected]> wrote:
>>
>> Hello Tim,
>>
>> On 1/23/2024 3:29 AM, Tim Chen wrote:
>>> On Fri, 2024-01-19 at 14:15 +0530, K Prateek Nayak wrote:
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index b803030c3a03..1fedc7e29c98 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -8499,6 +8499,16 @@ done: __maybe_unused;
>>>> if (!rf)
>>>> return NULL;
>>>>
>>>> + /*
>>>> + * An idle CPU in TIF_POLLING mode might end up here after processing
>>>> + * an IPI when the sender sets the TIF_NEED_RESCHED bit and avoids
>>>> + * sending an actual IPI. In such cases, where an idle CPU was woken
>>>> + * up only to process an interrupt, without necessarily queuing a task
>>>> + * on it, skip newidle_balance() to facilitate faster idle re-entry.
>>>> + */
>>>> + if (prev == rq->idle)
>>>> + return NULL;
>>>> +
>>>
>>> Should we check the call function queue directly to detect that there is
>>> an IPI waiting to be processed? something like
>>>
>>> if (!llist_empty(&per_cpu(call_single_queue, rq->cpu)))
>>> return NULL;
>>
>> That could be a valid check too. However, if an IPI is queued right
>> after this check, the processing is still delayed since
>> newidle_balance() only bails out for scenarios when a wakeup is trying
>> to queue a new task on the CPU running the newidle_balance().
>>
>>>
>>> Could there be cases where we want to do idle balance in this code path?
>>> Say a cpu is idle and a scheduling tick came in, we may try
>>> to look for something to run on the idle cpu. Seems like after
>>> your change above, that would be skipped.
>>
>> Wouldn't scheduler_tick() do load balancing when the time comes? In my
>> testing, I did not see a case where the workloads I tested were
>> sensitive to the aspect of newidle_balance() being invoked at scheduler
>> tick. Have you come across a workload which might be sensitive to this
>> aspect that I can quickly test and verify? Meanwhile, I'll run the
>> workloads mentioned in the commit log on an Intel system to see if I
>> can spot any sensitivity to this change.
>
> Instead of trying to fix spurious need_resched in the scheduler,
> can't we find a way to prevent it from happening ?

The need_resched is not spurious. It is an effect of the optimization
introduced by commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()") where, to pull a CPU in TIF_POLLING
mode out of idle (and this happens for C0 (POLL) and C1 (MWAIT) on the
test machine), instead of sending an IPI for
smp_call_function_single(), the sender sets the TIF_NEED_RESCHED flag in
the idle task's thread info, and on the path to schedule_idle(), the
call to flush_smp_call_function_queue() processes the function call.

But since "TIF_NEED_RESCHED" was set to pull the CPU out of idle, the
scheduler now believes a new task exists which leads to the following
call stack:

do_idle()
schedule_idle()
__schedule(SM_NONE)
/* local_irq_disable() */
pick_next_task()
__pick_next_task()
pick_next_task_fair()
newidle_balance()
... /* Still running with IRQs disabled */

Since IRQs are disabled, the processing of IPIs is delayed, leading to
an issue similar to the one outlined in commit 792b9f65a568 ("sched:
Allow newidle balancing to bail out of load_balance") when benchmarking
ipistorm.

>
> Because of TIF_NEED_RESCHED being set when TIF_POLLING is set, idle
> load balances are already skipped for a less aggressive newly idle
> load balanced:
> https://lore.kernel.org/all/CAKfTPtC9Px_W84YRJqnFNkL8oofO15D-P=VTCMUUu7NJr+xwBA@mail.gmail.com/

Are you referring to the "need_resched()" condition check in
"nohz_csd_func()"? Please correct me if I'm wrong.

When I ran sched-scoreboard
(https://github.com/AMDESE/sched-scoreboard/) with the patch on an idle
system for 60s, I saw the idle "load_balance count" go up in schedstat.

Following are the data for idle balance on SMT domain for each kernel:

o tip:sched/core

< ---------------------------------------- Category: idle ----------- >
load_balance count on cpu idle : 2678
load_balance found balanced on cpu idle : 2678
->load_balance failed to find busier queue on cpu idle : 0
->load_balance failed to find busier group on cpu idle : 2678
load_balance move task failed on cpu idle : 0
*load_balance success count on cpu idle : 0
imbalance sum on cpu idle : 0
pull_task count on cpu idle : 0
*avg task pulled per successfull lb attempt (cpu idle) : 0
->pull_task when target task was cache-hot on cpu idle : 0
-------------------------------------------------------------------------

o tip:sched/core + patch

< ---------------------------------------- Category: idle ----------- >
load_balance count on cpu idle : 1895
load_balance found balanced on cpu idle : 1895
->load_balance failed to find busier queue on cpu idle : 0
->load_balance failed to find busier group on cpu idle : 1895
load_balance move task failed on cpu idle : 0
*load_balance success count on cpu idle : 0
imbalance sum on cpu idle : 0
pull_task count on cpu idle : 0
*avg task pulled per successfull lb attempt (cpu idle) : 0
->pull_task when target task was cache-hot on cpu idle : 0
-------------------------------------------------------------------------

Am I missing something? Since "load_balance count" is only updated when
"load_balance()" is called.

>
> The root of the problem is that we keep TIF_NEED_RESCHED set

We had prototyped a TIF_NEED_IPI flag to skip calls to schedule_idle()
on CPUs in TIF_POLLING mode when the idle CPU only has to process an
IPI. Although it solves the problem described in the commit log, it also
requires enabling and testing on multiple architectures.

$ grep -r "_TIF_POLLING" arch/ | cut -d '/' -f2 | uniq
csky
x86
powerpc
parisc
openrisc
sparc
nios2
microblaze
sh
alpha

This scheduler-side optimization was the simpler of the two approaches
to achieve the same result in the case of ipistorm.
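
Roughly, the idea behind the prototype is the following (illustrative
sketch only; the actual prototype and flag handling may differ):

	/* in do_idle(), with a hypothetical TIF_NEED_IPI flag: */
	while (!need_resched()) {
		...
		if (test_thread_flag(TIF_NEED_IPI)) {
			clear_thread_flag(TIF_NEED_IPI);
			/* process the IPI payload without schedule_idle() */
			flush_smp_call_function_queue();
		}
	}
	schedule_idle();	/* only on a genuine need_resched() */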

>
>>
>> [..snip..]

--
Thanks and Regards,
Prateek

2024-01-23 13:47:14

by Vincent Guittot

Subject: Re: [PATCH] sched/fair: Skip newidle_balance() when an idle CPU is woken up to process an IPI

On Tue, 23 Jan 2024 at 11:01, K Prateek Nayak <[email protected]> wrote:
>
> Hello Vincent,
>
> On 1/23/2024 1:36 PM, Vincent Guittot wrote:
> > On Tue, 23 Jan 2024 at 05:58, K Prateek Nayak <[email protected]> wrote:
> >>
> >> Hello Tim,
> >>
> >> On 1/23/2024 3:29 AM, Tim Chen wrote:
> >>> On Fri, 2024-01-19 at 14:15 +0530, K Prateek Nayak wrote:
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index b803030c3a03..1fedc7e29c98 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -8499,6 +8499,16 @@ done: __maybe_unused;
> >>>> if (!rf)
> >>>> return NULL;
> >>>>
> >>>> + /*
> >>>> + * An idle CPU in TIF_POLLING mode might end up here after processing
> >>>> + * an IPI when the sender sets the TIF_NEED_RESCHED bit and avoids
> >>>> + * sending an actual IPI. In such cases, where an idle CPU was woken
> >>>> + * up only to process an interrupt, without necessarily queuing a task
> >>>> + * on it, skip newidle_balance() to facilitate faster idle re-entry.
> >>>> + */
> >>>> + if (prev == rq->idle)
> >>>> + return NULL;
> >>>> +
> >>>
> >>> Should we check the call function queue directly to detect that there is
> >>> an IPI waiting to be processed? something like
> >>>
> >>> if (!llist_empty(&per_cpu(call_single_queue, rq->cpu)))
> >>> return NULL;
> >>
> >> That could be a valid check too. However, if an IPI is queued right
> >> after this check, the processing is still delayed since
> >> newidle_balance() only bails out for scenarios when a wakeup is trying
> >> to queue a new task on the CPU running the newidle_balance().
> >>
> >>>
> >>> Could there be cases where we want to do idle balance in this code path?
> >>> Say a cpu is idle and a scheduling tick came in, we may try
> >>> to look for something to run on the idle cpu. Seems like after
> >>> your change above, that would be skipped.
> >>
> >> Wouldn't scheduler_tick() do load balancing when the time comes? In my
> >> testing, I did not see a case where the workloads I tested were
> >> sensitive to the aspect of newidle_balance() being invoked at scheduler
> >> tick. Have you come across a workload which might be sensitive to this
> >> aspect that I can quickly test and verify? Meanwhile, I'll run the
> >> workloads mentioned in the commit log on an Intel system to see if I
> >> can spot any sensitivity to this change.
> >
> > Instead of trying to fix spurious need_resched in the scheduler,
> > can't we find a way to prevent it from happening ?
>
> The need_resched is not spurious. It is an effect of the optimization
> introduced by commit b2a02fc43a1f ("smp: Optimize
> send_call_function_single_ipi()") where, to pull a CPU out of
> TIF_POLLING out of idle (and this happens for C0 (POLL) and C1 (MWAIT)
> on the test machine), instead of sending an IPI for
> smp_call_function_single(), the sender sets the TIF_NEED_RESCHED flag in
> the idle task's thread info and in the path to "schedule_idle()", the
> call to "flush_smp_call_function_queue()" processes the function call.

I mean it's spurious in the sense that we don't need to resched, but we
need to pull the CPU out of the polling loop. At that time, we don't
know if there is a need to resched.

>
> But since "TIF_NEED_RESCHED" was set to pull the CPU out of idle, the
> scheduler now believes a new task exists which leads to the following
> call stack:

Exactly, TIF_NEED_RESCHED has been set, so the scheduler now believes it
needs to look for a task. The solution is to not set TIF_NEED_RESCHED
if you don't want the scheduler to look for a task, including pulling
one from another CPU.

>
> do_idle()
> schedule_idle()
> __schedule(SM_NONE)
> /* local_irq_disable() */
> pick_next_task()
> __pick_next_task()
> pick_next_task_fair()
> newidle_balance()
> ... /* Still running with IRQs disabled */
>
> Since IRQs are disabled, the processing of IPIs are delayed leading
> issue similar to the one outlined in commit 792b9f65a568 ("sched:
> Allow newidle balancing to bail out of load_balance") when benchmarking
> ipistorm.

IMO it's not the same, because commit 792b9f65a568 wants to abort early
if something new happened.

>
> >
> > Because of TIF_NEED_RESCHED being set when TIF_POLLING is set, idle
> > load balances are already skipped for a less aggressive newly idle
> > load balanced:
> > https://lore.kernel.org/all/CAKfTPtC9Px_W84YRJqnFNkL8oofO15D-P=VTCMUUu7NJr+xwBA@mail.gmail.com/
>
> Are you referring to the "need_resched()" condition check in
> "nohz_csd_func()"? Please correct me if I'm wrong.

yes

>
> When I ran with sched-scoreboard
> (https://github.com/AMDESE/sched-scoreboard/)with the patch on an idle
> system for 60s I see the idle "load_balance count" go up in sched-stat

If TIF_POLLING is not set, you will use a normal IPI; but otherwise, the
wakeup for an idle load balance is skipped because need_resched is set,
and we have a newly idle load balance which you now want to skip too.

>
> Following are the data for idle balance on SMT domain for each kernel:
>
> o tip:sched/core
>
> < ---------------------------------------- Category: idle ----------- >
> load_balance count on cpu idle : 2678
> load_balance found balanced on cpu idle : 2678
> ->load_balance failed to find busier queue on cpu idle : 0
> ->load_balance failed to find busier group on cpu idle : 2678
> load_balance move task failed on cpu idle : 0
> *load_balance success count on cpu idle : 0
> imbalance sum on cpu idle : 0
> pull_task count on cpu idle : 0
> *avg task pulled per successfull lb attempt (cpu idle) : 0
> ->pull_task when target task was cache-hot on cpu idle : 0
> -------------------------------------------------------------------------
>
> o tip:sched/core + patch
>
> < ---------------------------------------- Category: idle ----------- >
> load_balance count on cpu idle : 1895
> load_balance found balanced on cpu idle : 1895
> ->load_balance failed to find busier queue on cpu idle : 0
> ->load_balance failed to find busier group on cpu idle : 1895
> load_balance move task failed on cpu idle : 0
> *load_balance success count on cpu idle : 0
> imbalance sum on cpu idle : 0
> pull_task count on cpu idle : 0
> *avg task pulled per successfull lb attempt (cpu idle) : 0
> ->pull_task when target task was cache-hot on cpu idle : 0
> -------------------------------------------------------------------------
>
> Am I missing something? Since "load_balance count" is only updated when
> "load_balance()" is called.
>
> >
> > The root of the problem is that we keep TIF_NEED_RESCHED set
>
> We had prototyped a TIF_NEED_IPI flag to skip calls to schedule_idle()
> on CPUs in TIF_POLLING when the idle CPU has to only process an IPI.
> Although it solves the problem described in the commit log, it also
> required enabling and testing it on multiple architectures.

Yes, but that's the right solution IMO, and it will prevent us from then
trying to catch the needless TIF_NEED_RESCHED.

>
> $ grep -r "_TIF_POLLING" arch/ | cut -d '/' -f2 | uniq
> csky
> x86
> powerpc
> parisc
> openrisc
> sparc
> nios2
> microblaze
> sh
> alpha
>
> This optimization in the scheduler was the simpler of the two to achieve
> the same result in case of ipistorm.
>
> >
> >>
> >> [..snip..]
>
> --
> Thanks and Regards,
> Prateek

2024-01-23 21:18:11

by David Vernet

Subject: Re: [PATCH] sched/fair: Skip newidle_balance() when an idle CPU is woken up to process an IPI

On Tue, Jan 23, 2024 at 10:28:31AM +0530, K Prateek Nayak wrote:
> Hello Tim,
>
> On 1/23/2024 3:29 AM, Tim Chen wrote:
> > On Fri, 2024-01-19 at 14:15 +0530, K Prateek Nayak wrote:
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index b803030c3a03..1fedc7e29c98 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -8499,6 +8499,16 @@ done: __maybe_unused;
> >> if (!rf)
> >> return NULL;
> >>
> >> + /*
> >> + * An idle CPU in TIF_POLLING mode might end up here after processing
> >> + * an IPI when the sender sets the TIF_NEED_RESCHED bit and avoids
> >> + * sending an actual IPI. In such cases, where an idle CPU was woken
> >> + * up only to process an interrupt, without necessarily queuing a task
> >> + * on it, skip newidle_balance() to facilitate faster idle re-entry.
> >> + */
> >> + if (prev == rq->idle)
> >> + return NULL;
> >> +
> >
> > Should we check the call function queue directly to detect that there is
> > an IPI waiting to be processed? something like
> >
> > if (!llist_empty(&per_cpu(call_single_queue, rq->cpu)))
> > return NULL;
>
> That could be a valid check too. However, if an IPI is queued right
> after this check, the processing is still delayed since
> newidle_balance() only bails out for scenarios when a wakeup is trying
> to queue a new task on the CPU running the newidle_balance().
>
> >
> > Could there be cases where we want to do idle balance in this code path?
> > Say a cpu is idle and a scheduling tick came in, we may try
> > to look for something to run on the idle cpu. Seems like after
> > your change above, that would be skipped.
>
> Wouldn't scheduler_tick() do load balancing when the time comes? In my
> testing, I did not see a case where the workloads I tested were
> sensitive to the aspect of newidle_balance() being invoked at scheduler
> tick. Have you come across a workload which might be sensitive to this
> aspect that I can quickly test and verify? Meanwhile, I'll run the
> workloads mentioned in the commit log on an Intel system to see if I
> can spot any sensitivity to this change.
>
> Adding David to the thread too since HHVM seems to be one of those
> workloads that is very sensitive to a successful newidle_balance().

Thanks for the cc. FWIW, I think a lot of things are very sensitive to
timing in newidle_balance(), but it goes both ways. For example, we had
to revert commit e60b56e46b38 ("sched/fair: Wait before decaying
max_newidle_lb_cost") [0] on our internal kernel because it regressed
some workloads by causing us to load_balance() too frequently. I think
the fix is correct in that there's no reason we shouldn't apply the ~1%
decay / second to newidle lb cost in newidle_balance(), but by causing
us to (correctly) decay newidle lb cost in newidle_balance(), it also
increased CPU util rather significantly and had us spending too much
time in load_balance().

[0]: https://lore.kernel.org/all/[email protected]/
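
(For reference, the decay in question looks roughly like the following,
simplified from update_newidle_cost(); details may differ across
versions:)

	static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
	{
		if (cost > sd->max_newidle_lb_cost) {
			/* Track the new maximum, reset the decay window */
			sd->max_newidle_lb_cost = cost;
			sd->last_decay_max_lb_cost = jiffies;
		} else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
			/* Decay by ~1% per second: 253/256 ~= 0.988 */
			sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 253) / 256;
			sd->last_decay_max_lb_cost = jiffies;
			return true;
		}

		return false;
	}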

On the other hand, on other hosts, we use SHARED_RUNQ to load balance as
aggressively as possible, and those hosts would have benefited from that
change if SHARED_RUNQ wasn't an option.

My 2 cents is that I think it's impossible to make everyone happy, and I
think the change here makes sense. If there's imbalance, it's something
we would uncover when load_balance() is kicked off on the tick path
anyways. I also agree with Vincent [1] that your idea / prototype of
adding a TIF_NEED_IPI flag is an overall better solution, but this does
seem fine to me as well in the interim.

[1]: https://lore.kernel.org/all/CAKfTPtC446Lo9CATPp7PExdkLhHQFoBuY-JMGC7agOHY4hs-Pw@mail.gmail.com/

Thanks,
David



2024-01-24 08:26:53

by K Prateek Nayak

Subject: Re: [PATCH] sched/fair: Skip newidle_balance() when an idle CPU is woken up to process an IPI

Hello Vincent,

On 1/23/2024 7:09 PM, Vincent Guittot wrote:
> On Tue, 23 Jan 2024 at 11:01, K Prateek Nayak <[email protected]> wrote:
>>
>> Hello Vincent,
>>
>> On 1/23/2024 1:36 PM, Vincent Guittot wrote:
>>> On Tue, 23 Jan 2024 at 05:58, K Prateek Nayak <[email protected]> wrote:
>>>>
>>>> Hello Tim,
>>>>
>>>> On 1/23/2024 3:29 AM, Tim Chen wrote:
>>>>> On Fri, 2024-01-19 at 14:15 +0530, K Prateek Nayak wrote:
>>>>>>
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index b803030c3a03..1fedc7e29c98 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -8499,6 +8499,16 @@ done: __maybe_unused;
>>>>>> if (!rf)
>>>>>> return NULL;
>>>>>>
>>>>>> + /*
>>>>>> + * An idle CPU in TIF_POLLING mode might end up here after processing
>>>>>> + * an IPI when the sender sets the TIF_NEED_RESCHED bit and avoids
>>>>>> + * sending an actual IPI. In such cases, where an idle CPU was woken
>>>>>> + * up only to process an interrupt, without necessarily queuing a task
>>>>>> + * on it, skip newidle_balance() to facilitate faster idle re-entry.
>>>>>> + */
>>>>>> + if (prev == rq->idle)
>>>>>> + return NULL;
>>>>>> +
>>>>>
>>>>> Should we check the call function queue directly to detect that there is
>>>>> an IPI waiting to be processed? something like
>>>>>
>>>>> if (!llist_empty(&per_cpu(call_single_queue, rq->cpu)))
>>>>> return NULL;
>>>>
>>>> That could be a valid check too. However, if an IPI is queued right
>>>> after this check, the processing is still delayed since
>>>> newidle_balance() only bails out for scenarios when a wakeup is trying
>>>> to queue a new task on the CPU running the newidle_balance().
>>>>
>>>>>
>>>>> Could there be cases where we want to do idle balance in this code path?
>>>>> Say a cpu is idle and a scheduling tick came in, we may try
>>>>> to look for something to run on the idle cpu. Seems like after
>>>>> your change above, that would be skipped.
>>>>
>>>> Wouldn't scheduler_tick() do load balancing when the time comes? In my
>>>> testing, I did not see a case where the workloads I tested were
>>>> sensitive to the aspect of newidle_balance() being invoked at scheduler
>>>> tick. Have you come across a workload which might be sensitive to this
>>>> aspect that I can quickly test and verify? Meanwhile, I'll run the
>>>> workloads mentioned in the commit log on an Intel system to see if I
>>>> can spot any sensitivity to this change.
>>>
>>> Instead of trying to fix spurious need_resched in the scheduler,
>>> can't we find a way to prevent it from happening ?
>>
>> The need_resched is not spurious. It is an effect of the optimization
>> introduced by commit b2a02fc43a1f ("smp: Optimize
>> send_call_function_single_ipi()") where, to pull a CPU out of
>> TIF_POLLING out of idle (and this happens for C0 (POLL) and C1 (MWAIT)
>> on the test machine), instead of sending an IPI for
>> smp_call_function_single(), the sender sets the TIF_NEED_RESCHED flag in
>> the idle task's thread info and in the path to "schedule_idle()", the
>> call to "flush_smp_call_function_queue()" processes the function call.
>
> I mean it's spurious in the sense that we don't need to resched but we
> need to pull the CPU out of the polling loop. At that time we don't
> know if there is a need to resched

Agreed.

>
>>
>> But since "TIF_NEED_RESCHED" was set to pull the CPU out of idle, the
>> scheduler now believes a new task exists which leads to the following
>> call stack:
>
> Exactly, TIF_NEED_RESCHED has been set so scheduler now believes it
> needs to look for a task. The solution is to not set TIF_NEED_RESCHED
> if you don't want the scheduler to look for a task including pulling
> it from another cpu
>
>>
>> do_idle()
>> schedule_idle()
>> __schedule(SM_NONE)
>> /* local_irq_disable() */
>> pick_next_task()
>> __pick_next_task()
>> pick_next_task_fair()
>> newidle_balance()
>> ... /* Still running with IRQs disabled */
>>
>> Since IRQs are disabled, the processing of IPIs are delayed leading
>> issue similar to the one outlined in commit 792b9f65a568 ("sched:
>> Allow newidle balancing to bail out of load_balance") when benchmarking
>> ipistorm.
>
> IMO it's not the same because commit 792b9f65a568 wants to abort early
> if something new happened

In the case of commit 792b9f65a568, newidle_balance() is cut short
since a pending IPI wants to queue a task, which otherwise would have
led to a spike in tail latency. For ipistorm, there is a pending IPI
which is not queuing a task, but since smp_call_function_single() was
invoked with wait=1, the sender will wait until the target enables
interrupts again and processes the call function, which manifests as a
spike in tail latency.
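
With wait=1, the sender ends up spinning on the csd lock until the
target has run the function, roughly like this (simplified sketch;
exact details vary with debug config):

	/* smp_call_function_single(cpu, func, info, 1) eventually does: */
	static void csd_lock_wait(struct __call_single_data *csd)
	{
		/* spin until the target runs func and drops CSD_FLAG_LOCK */
		smp_cond_load_acquire(&csd->node.u_flags, !(VAL & CSD_FLAG_LOCK));
	}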

>
>>
>>>
>>> Because of TIF_NEED_RESCHED being set when TIF_POLLING is set, idle
>>> load balances are already skipped for a less aggressive newly idle
>>> load balanced:
>>> https://lore.kernel.org/all/CAKfTPtC9Px_W84YRJqnFNkL8oofO15D-P=VTCMUUu7NJr+xwBA@mail.gmail.com/
>>
>> Are you referring to the "need_resched()" condition check in
>> "nohz_csd_func()"? Please correct me if I'm wrong.
>
> yes
>
>>
>> When I ran with sched-scoreboard
>> (https://github.com/AMDESE/sched-scoreboard/)with the patch on an idle
>> system for 60s I see the idle "load_balance count" go up in sched-stat
>
> If TIF_POLLING is not set, you will use normal IPI but otherwise, the
> wakeup for an idle load balance is skipped because need_resched is set
> and we have an newly idle load balance which you now want to skip too

I see! You are right.

>
>>
>> Following are the data for idle balance on SMT domain for each kernel:
>>
>> o tip:sched/core
>>
>> < ---------------------------------------- Category: idle ----------- >
>> load_balance count on cpu idle : 2678
>> load_balance found balanced on cpu idle : 2678
>> ->load_balance failed to find busier queue on cpu idle : 0
>> ->load_balance failed to find busier group on cpu idle : 2678
>> load_balance move task failed on cpu idle : 0
>> *load_balance success count on cpu idle : 0
>> imbalance sum on cpu idle : 0
>> pull_task count on cpu idle : 0
>> *avg task pulled per successfull lb attempt (cpu idle) : 0
>> ->pull_task when target task was cache-hot on cpu idle : 0
>> -------------------------------------------------------------------------
>>
>> o tip:sched/core + patch
>>
>> < ---------------------------------------- Category: idle ----------- >
>> load_balance count on cpu idle : 1895
>> load_balance found balanced on cpu idle : 1895
>> ->load_balance failed to find busier queue on cpu idle : 0
>> ->load_balance failed to find busier group on cpu idle : 1895
>> load_balance move task failed on cpu idle : 0
>> *load_balance success count on cpu idle : 0
>> imbalance sum on cpu idle : 0
>> pull_task count on cpu idle : 0
>> *avg task pulled per successfull lb attempt (cpu idle) : 0
>> ->pull_task when target task was cache-hot on cpu idle : 0
>> -------------------------------------------------------------------------
>>
>> Am I missing something? Since "load_balance count" is only updated when
>> "load_balance()" is called.
>>
>>>
>>> The root of the problem is that we keep TIF_NEED_RESCHED set
>>
>> We had prototyped a TIF_NEED_IPI flag to skip calls to schedule_idle()
>> on CPUs in TIF_POLLING when the idle CPU has to only process an IPI.
>> Although it solves the problem described in the commit log, it also
>> required enabling and testing it on multiple architectures.
>
> Yes, but that's the right solution IMO and it will prevent us to then
> try to catch the needless TIF_NEED_RESCHED

I'll discuss with Gautham on cleaning up the prototype and testing it
some more (we had only looked at ipistorm case) before sending out an
RFC.

> [..snip..]

--
Thanks and Regards,
Prateek

2024-01-24 08:35:48

by K Prateek Nayak

Subject: Re: [PATCH] sched/fair: Skip newidle_balance() when an idle CPU is woken up to process an IPI

Hello David,

Thank you for taking a look at the patch.

On 1/24/2024 2:47 AM, David Vernet wrote:
> On Tue, Jan 23, 2024 at 10:28:31AM +0530, K Prateek Nayak wrote:
>> Hello Tim,
>>
>> On 1/23/2024 3:29 AM, Tim Chen wrote:
>>> On Fri, 2024-01-19 at 14:15 +0530, K Prateek Nayak wrote:
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index b803030c3a03..1fedc7e29c98 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -8499,6 +8499,16 @@ done: __maybe_unused;
>>>> if (!rf)
>>>> return NULL;
>>>>
>>>> + /*
>>>> + * An idle CPU in TIF_POLLING mode might end up here after processing
>>>> + * an IPI when the sender sets the TIF_NEED_RESCHED bit and avoids
>>>> + * sending an actual IPI. In such cases, where an idle CPU was woken
>>>> + * up only to process an interrupt, without necessarily queuing a task
>>>> + * on it, skip newidle_balance() to facilitate faster idle re-entry.
>>>> + */
>>>> + if (prev == rq->idle)
>>>> + return NULL;
>>>> +
>>>
>>> Should we check the call function queue directly to detect that there is
>>> an IPI waiting to be processed? something like
>>>
>>> if (!llist_empty(&per_cpu(call_single_queue, rq->cpu)))
>>> return NULL;
>>
>> That could be a valid check too. However, if an IPI is queued right
>> after this check, the processing is still delayed since
>> newidle_balance() only bails out for scenarios when a wakeup is trying
>> to queue a new task on the CPU running the newidle_balance().
>>
>>>
>>> Could there be cases where we want to do idle balance in this code path?
>>> Say a cpu is idle and a scheduling tick came in, we may try
>>> to look for something to run on the idle cpu. Seems like after
>>> your change above, that would be skipped.
>>
>> Wouldn't scheduler_tick() do load balancing when the time comes? In my
>> testing, I did not see a case where the workloads I tested were
>> sensitive to the aspect of newidle_balance() being invoked at scheduler
>> tick. Have you come across a workload which might be sensitive to this
>> aspect that I can quickly test and verify? Meanwhile, I'll run the
>> workloads mentioned in the commit log on an Intel system to see if I
>> can spot any sensitivity to this change.
>>
>> Adding David to the thread too since HHVM seems to be one of those
>> workloads that is very sensitive to a successful newidle_balance().
>
> Thanks for the cc. FWIW, I think a lot of things are very sensitive to
> timing in newidle_balance(), but it goes both ways. For example, we had
> to revert commit e60b56e46b38 ("sched/fair: Wait before decaying
> max_newidle_lb_cost") [0] on our internal kernel because it regressed
> some workloads by causing us to load_balance() too frequently. I think
> the fix is correct in that there's no reason we shouldn't apply the ~1%
> decay / second to newidle lb cost in newidle_balance(), but by causing
> us to (correctly) decay newidle lb cost in newidle_balance(), it also
> increased CPU util rather significantly and had us spending too much
> time in load_balance().
>
> [0]: https://lore.kernel.org/all/[email protected]/
>
> On the other hand, on other hosts, we use SHARED_RUNQ to load balance as
> aggressively as possible, and those hosts would have benefited from that
> change if SHARED_RUNQ wasn't an option.
>
> My 2 cents is that I think it's impossible to make everyone happy, and I
> think the change here makes sense. If there's imbalance, it's something
> we would uncover when load_balance() is kicked off on the tick path
> anyways. I also agree with Vincent [1] that your idea / prototype of
> adding a TIF_NEED_IPI flag is an overall better solution, but this does
> seem fine to me as well in the interim.

Thank you for detailing your explorations and giving more insights
around newidle_balance(). I'll clean up and test the TIF_NEED_IPI
prototype some more before sending out an RFC.

>
> [1]: https://lore.kernel.org/all/CAKfTPtC446Lo9CATPp7PExdkLhHQFoBuY-JMGC7agOHY4hs-Pw@mail.gmail.com/
>
> Thanks,
> David

--
Thanks and Regards,
Prateek