The main idea of the wakelist is to avoid cache bouncing. However,
commit 518cd6234178 ("sched: Only queue remote wakeups when
crossing cache boundaries") disabled queuing tasks on the wakelist when
the cpus share llc. This is because, at that time, the scheduler had to
send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist also
supports TIF_POLLING, so this is no longer a problem when the wakee cpu
is in idle polling.
Benefits:
Queuing the task on the idle cpu can help improve performance on the
waker cpu and utilization on the wakee cpu, and further improve locality
because the wakee cpu can handle its own rq. This patch helps improve rt
on our real java workloads where wakeups happen frequently.
Consider the normal condition (CPU0 and CPU1 share the same llc):

Before this patch:

         CPU0                                       CPU1

    select_task_rq()                                idle
    rq_lock(CPU1->rq)
    enqueue_task(CPU1->rq)
    notify CPU1 (by sending IPI or CPU1 polling)

                                                    resched()

After this patch:

         CPU0                                       CPU1

    select_task_rq()                                idle
    add to wakelist of CPU1
    notify CPU1 (by sending IPI or CPU1 polling)

                                                    rq_lock(CPU1->rq)
                                                    enqueue_task(CPU1->rq)
                                                    resched()
We can see that CPU0 finishes its work earlier: it only needs to put the
task on the wakelist and return. CPU1, being idle, handles its own
runqueue data itself.
This patch makes no difference to the number of IPIs sent.
It only takes effect when the wakee cpu is:
1) idle polling
2) idle not polling
For 1), there will be no IPI with or without this patch.
For 2), there will always be exactly one IPI, with or without this patch.
Before this patch: the waker cpu enqueues the task and checks preemption.
Since "idle" is sure to be preempted, the waker cpu must send a resched
IPI.
After this patch: the waker cpu puts the task on the wakelist of the
wakee cpu, and sends an IPI.
Benchmark:
We've tested schbench, unixbench, and hackbench on both x86 and arm64.
On x86 (Intel Xeon Platinum 8269CY):
schbench -m 2 -t 8
  Latency percentiles (usec)          before        after
      50.0000th:                           8            6
      75.0000th:                          10            7
      90.0000th:                          11            8
      95.0000th:                          12            8
     *99.0000th:                          15           10
      99.5000th:                          16           11
      99.9000th:                          20           14

Unixbench with full threads (104)
                                            before       after
  Dhrystone 2 using register variables  3004614211  3004725417   0.00%
  Double-Precision Whetstone              616764.3    617355.9   0.10%
  Execl Throughput                         26449.2     26468.6   0.07%
  File Copy 1024 bufsize 2000 maxblocks   832763.3    824099.4  -1.04%
  File Copy 256 bufsize 500 maxblocks     210718.7    211775.1   0.50%
  File Copy 4096 bufsize 8000 maxblocks  2393528.2   2398755.4   0.22%
  Pipe Throughput                      144559102.7 144605068.8   0.03%
  Pipe-based Context Switching           3192192.9   3571238.1  11.87%
  Process Creation                         95270.5     98865.4   3.77%
  Shell Scripts (1 concurrent)            113780.6    113924.7   0.13%
  Shell Scripts (8 concurrent)             15557.2     15508.9  -0.31%
  System Call Overhead                   5359984.1   5356711.4  -0.06%

hackbench -g 1 -l 100000
          before       after
  Time     3.246       2.251
On arm64 (Ampere Altra):
schbench -m 2 -t 8
  Latency percentiles (usec)          before        after
      50.0000th:                          14           10
      75.0000th:                          19           14
      90.0000th:                          22           16
      95.0000th:                          23           16
     *99.0000th:                          24           17
      99.5000th:                          24           17
      99.9000th:                          31           25

Unixbench with full threads (80)
                                            before       after
  Dhrystone 2 using register variables  3536787968  3536476016  -0.01%
  Double-Precision Whetstone              629370.6    629333.3  -0.01%
  Execl Throughput                         66615.9     66288.8  -0.49%
  File Copy 1024 bufsize 2000 maxblocks  1038402.1   1050181.2   1.13%
  File Copy 256 bufsize 500 maxblocks     311054.2    310317.2  -0.24%
  File Copy 4096 bufsize 8000 maxblocks  2276795.6     2297703   0.92%
  Pipe Throughput                      130409359.9 130390848.7  -0.01%
  Pipe-based Context Switching           3148440.7   3383705.1   7.47%
  Process Creation                        111574.3    119728.6   7.31%
  Shell Scripts (1 concurrent)            122980.7    122657.4  -0.26%
  Shell Scripts (8 concurrent)             17482.8     17476.8  -0.03%
  System Call Overhead                   4424103.4   4430062.6   0.13%

hackbench -g 1 -l 100000
          before       after
  Time     4.217       2.916
Our patch improves schbench, hackbench, and the Pipe-based Context
Switching test of unixbench when there are idle cpus, with no obvious
regression on the other unixbench tests. This can help improve rt in
scenarios where wakeups happen frequently.
Signed-off-by: Tianchen Ding <[email protected]>
---
v2:
Modify commit log to describe key point in detail.
Add more benchmark results on more archs.
v1: https://lore.kernel.org/all/[email protected]/
---
kernel/sched/core.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bfa7452ca92e..8764ad152f6e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3817,6 +3817,9 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
if (!cpu_active(cpu))
return false;
+ if (cpu == smp_processor_id())
+ return false;
+
/*
* If the CPU does not share cache, then queue the task on the
* remote rqs wakelist to avoid accessing remote data.
@@ -3824,6 +3827,12 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
if (!cpus_share_cache(smp_processor_id(), cpu))
return true;
+ /*
+ * If the CPU is idle, let itself do activation to improve utilization.
+ */
+ if (available_idle_cpu(cpu))
+ return true;
+
/*
* If the task is descheduling and the only running task on the
* CPU then use the wakelist to offload the task activation to
@@ -3839,9 +3848,6 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
{
if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
- if (WARN_ON_ONCE(cpu == smp_processor_id()))
- return false;
-
sched_clock_cpu(cpu); /* Sync clocks across CPUs */
__ttwu_queue_wakelist(p, cpu, wake_flags);
return true;
--
2.27.0
On 27/05/22 17:05, Tianchen Ding wrote:
> The main idea of wakelist is to avoid cache bouncing. However,
> commit 518cd6234178 ("sched: Only queue remote wakeups when
> crossing cache boundaries") disabled queuing tasks on wakelist when
> the cpus share llc. This is because, at that time, the scheduler must
> send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist also
> supports TIF_POLLING, so this is not a problem now when the wakee cpu is
> in idle polling.
[...]
> Our patch has improvement on schbench, hackbench
> and Pipe-based Context Switching of unixbench
> when there exists idle cpus,
> and no obvious regression on other tests of unixbench.
> This can help improve rt in scenes where wakeup happens frequently.
>
> Signed-off-by: Tianchen Ding <[email protected]>
This feels a bit like a generalization of
2ebb17717550 ("sched/core: Offload wakee task activation if it the wakee is descheduling")
Given rq->curr is updated before prev->on_cpu is cleared, the waker
executing ttwu_queue_cond() can observe:
p->on_rq=0
p->on_cpu=1
rq->curr=swapper/x (aka idle task)
So your addition of available_idle_cpu() in ttwu_queue_cond() (sort of)
matches that when invoked via:
if (smp_load_acquire(&p->on_cpu) &&
ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
goto unlock;
but it also affects
ttwu_queue(p, cpu, wake_flags);
at the tail end of try_to_wake_up().
With all that in mind, I'm curious whether your patch is functionally
close to the below.
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 66c4e5922fe1..ffd43264722a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3836,7 +3836,7 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
* the soon-to-be-idle CPU as the current CPU is likely busy.
* nr_running is checked to avoid unnecessary task stacking.
*/
- if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
+ if (cpu_rq(cpu)->nr_running <= 1)
return true;
return false;
On 2022/5/31 00:24, Valentin Schneider wrote:
> On 27/05/22 17:05, Tianchen Ding wrote:
>> The main idea of wakelist is to avoid cache bouncing. However,
>> commit 518cd6234178 ("sched: Only queue remote wakeups when
>> crossing cache boundaries") disabled queuing tasks on wakelist when
>> the cpus share llc. This is because, at that time, the scheduler must
>> send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist also
>> supports TIF_POLLING, so this is not a problem now when the wakee cpu is
>> in idle polling.
>
> [...]
>
>> Our patch has improvement on schbench, hackbench
>> and Pipe-based Context Switching of unixbench
>> when there exists idle cpus,
>> and no obvious regression on other tests of unixbench.
>> This can help improve rt in scenes where wakeup happens frequently.
>>
>> Signed-off-by: Tianchen Ding <[email protected]>
>
> This feels a bit like a generalization of
>
> 2ebb17717550 ("sched/core: Offload wakee task activation if it the wakee is descheduling")
>
> Given rq->curr is updated before prev->on_cpu is cleared, the waker
> executing ttwu_queue_cond() can observe:
>
> p->on_rq=0
> p->on_cpu=1
> rq->curr=swapper/x (aka idle task)
>
> So your addition of available_idle_cpu() in ttwu_queue_cond() (sort of)
> matches that when invoked via:
>
> if (smp_load_acquire(&p->on_cpu) &&
> ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
> goto unlock;
>
> but it also affects
>
> ttwu_queue(p, cpu, wake_flags);
>
> at the tail end of try_to_wake_up().
Yes. This part is what we mainly want to affect. The WF_ON_CPU path
above is not our main concern.
>
> With all that in mind, I'm curious whether your patch is functionaly close
> to the below.
>
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 66c4e5922fe1..ffd43264722a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3836,7 +3836,7 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
> * the soon-to-be-idle CPU as the current CPU is likely busy.
> * nr_running is checked to avoid unnecessary task stacking.
> */
> - if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
> + if (cpu_rq(cpu)->nr_running <= 1)
> return true;
>
> return false;
It's a little different. This may bring extra IPIs when nr_running == 1
and the current task on the wakee cpu is not the target wakeup task
(i.e., rq->curr == another_task && rq->curr != p). Then this
another_task may be disturbed by an IPI it does not expect. So IMO the
guarantee provided by WF_ON_CPU is necessary.
On 31/05/22 14:55, Mel Gorman wrote:
> On Tue, May 31, 2022 at 12:50:49PM +0100, Valentin Schneider wrote:
>> >> With all that in mind, I'm curious whether your patch is functionaly close
>> >> to the below.
>> >>
>> >> ---
>> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> >> index 66c4e5922fe1..ffd43264722a 100644
>> >> --- a/kernel/sched/core.c
>> >> +++ b/kernel/sched/core.c
>> >> @@ -3836,7 +3836,7 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
>> >> * the soon-to-be-idle CPU as the current CPU is likely busy.
>> >> * nr_running is checked to avoid unnecessary task stacking.
>> >> */
>> >> - if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
>> >> + if (cpu_rq(cpu)->nr_running <= 1)
>> >> return true;
>> >>
>> >> return false;
>> >
>> > It's a little different. This may bring extra IPIs when nr_running == 1
>> > and the current task on wakee cpu is not the target wakeup task (i.e.,
>> > rq->curr == another_task && rq->curr != p). Then this another_task may
>> > be disturbed by IPI which is not expected. So IMO the promise by
>> > WF_ON_CPU is necessary.
>>
>> You're right, actually taking a second look at that WF_ON_CPU path,
>> shouldn't the existing condition be:
>>
>> if ((wake_flags & WF_ON_CPU) && !cpu_rq(cpu)->nr_running)
>>
>> ? Per the p->on_rq and p->on_cpu ordering, if we have WF_ON_CPU here then
>> we must have !p->on_rq, so the deactivate has happened, thus the task
>> being alone on the rq implies nr_running==0.
>>
>> @Mel, do you remember why you went for <=1 here? I couldn't find any clues
>> on the original posting.
>>
>
> I don't recall exactly why I went with <= 1 there but I may not have
> considered the memory ordering of on_rq and nr_running and the comment
> above it is literally what I was thinking at the time. I think you're
> right and that check can be !cpu_rq(cpu)->nr_running.
>
Thanks!
So I'm thinking we could first make that into
if ((wake_flags & WF_ON_CPU) && !cpu_rq(cpu)->nr_running)
Then building on this, we can generalize using the wakelist to any remote
idle CPU (which on paper isn't as clear a win as just WF_ON_CPU,
depending on how deeply idle the CPU is...)
We need the cpu != this_cpu check, as that's currently served by the
WF_ON_CPU check (AFAIU we can only observe p->on_cpu in there for remote
tasks).
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 66c4e5922fe1..60038743f2f1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3830,13 +3830,20 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
if (!cpus_share_cache(smp_processor_id(), cpu))
return true;
+ if (cpu == smp_processor_id())
+ return false;
+
/*
* If the task is descheduling and the only running task on the
* CPU then use the wakelist to offload the task activation to
* the soon-to-be-idle CPU as the current CPU is likely busy.
* nr_running is checked to avoid unnecessary task stacking.
+ *
+ * Note that we can only get here with (wakee) p->on_rq=0,
+ * p->on_cpu can be whatever, we've done the dequeue, so
+ * the wakee has been accounted out of ->nr_running
*/
- if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
+ if (!cpu_rq(cpu)->nr_running)
return true;
return false;
On 2022/6/1 18:58, Valentin Schneider wrote:
> On 01/06/22 13:54, Tianchen Ding wrote:
>> On 2022/5/31 23:56, Valentin Schneider wrote:
>>
>>> Thanks!
>>>
>>> So I'm thinking we could first make that into
>>>
>>> if ((wake_flags & WF_ON_CPU) && !cpu_rq(cpu)->nr_running)
>>>
>>> Then building on this, we can generalize using the wakelist to any remote
>>> idle CPU (which on paper isn't as much as a clear win as just WF_ON_CPU,
>>> depending on how deeply idle the CPU is...)
>>>
>>> We need the cpu != this_cpu check, as that's currently served by the
>>> WF_ON_CPU check (AFAIU we can only observe p->on_cpu in there for remote
>>> tasks).
>>>
>>> ---
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 66c4e5922fe1..60038743f2f1 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -3830,13 +3830,20 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
>>> if (!cpus_share_cache(smp_processor_id(), cpu))
>>> return true;
>>>
>>> + if (cpu == smp_processor_id())
>>> + return false;
>>> +
>>> /*
>>> * If the task is descheduling and the only running task on the
>>> * CPU then use the wakelist to offload the task activation to
>>> * the soon-to-be-idle CPU as the current CPU is likely busy.
>>> * nr_running is checked to avoid unnecessary task stacking.
>>> + *
>>> + * Note that we can only get here with (wakee) p->on_rq=0,
>>> + * p->on_cpu can be whatever, we've done the dequeue, so
>>> + * the wakee has been accounted out of ->nr_running
>>> */
>>> - if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
>>> + if (!cpu_rq(cpu)->nr_running)
>>> return true;
>>>
>>> return false;
>>
>> Hi Valentin. I've done a simple unixbench test (Pipe-based Context
>> Switching) on my x86 machine with full threads (104).
>>
>> old patch1 patch1+patch2
>> score 7825.4 7500(more)-8000 9061.6
>>
>> patch1: use !cpu_rq(cpu)->nr_running instead of cpu_rq(cpu)->nr_running <= 1
>> patch2: ignore WF_ON_CPU check
>>
>> The score of patch1 is not stable. I've tested for many times and the
>> score is floating between about 7500-8000 (more at 7500).
>>
>> patch1 means more strict limit on using wakelist. But it may cause
>> performance regression.
>>
>> It seems that, using wakelist properly can help improve wakeup
>> performance, but using it too much may cause more IPIs. It's a trade-off
>> about how strict the ttwu_queue_cond() is.
>>
>> Anyhow, I think patch2 should be a pure improvement. What's your idea?
>
> Thanks for separately testing these two.
>
> I take it the results for patch1 are noticeably more swingy than the
> baseline? (FWIW boxplots are usually a nice way to summarize those sort of
> results).
>
Hmm... I'm not familiar with this...
I want to say that I'm not sure about the performance impact of patch1,
although from a logic point of view patch1 should be correct.
> WF_ON_CPU && nr_running == 1 means the wakee is scheduling out *and* there
> is another task queued, I'm guessing that's relatively common in your
> unixbench scenario...
>
> Either way, I think we want to keep the two changes separate for the sake
> of testing and bisecting.
Yes. I'll split the patch into 2 parts: one for the logic fix and
another for the performance improvement.
On 01/06/22 13:54, Tianchen Ding wrote:
> On 2022/5/31 23:56, Valentin Schneider wrote:
>
>> Thanks!
>>
>> So I'm thinking we could first make that into
>>
>> if ((wake_flags & WF_ON_CPU) && !cpu_rq(cpu)->nr_running)
>>
>> Then building on this, we can generalize using the wakelist to any remote
>> idle CPU (which on paper isn't as much as a clear win as just WF_ON_CPU,
>> depending on how deeply idle the CPU is...)
>>
>> We need the cpu != this_cpu check, as that's currently served by the
>> WF_ON_CPU check (AFAIU we can only observe p->on_cpu in there for remote
>> tasks).
>>
>> ---
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 66c4e5922fe1..60038743f2f1 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3830,13 +3830,20 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
>> if (!cpus_share_cache(smp_processor_id(), cpu))
>> return true;
>>
>> + if (cpu == smp_processor_id())
>> + return false;
>> +
>> /*
>> * If the task is descheduling and the only running task on the
>> * CPU then use the wakelist to offload the task activation to
>> * the soon-to-be-idle CPU as the current CPU is likely busy.
>> * nr_running is checked to avoid unnecessary task stacking.
>> + *
>> + * Note that we can only get here with (wakee) p->on_rq=0,
>> + * p->on_cpu can be whatever, we've done the dequeue, so
>> + * the wakee has been accounted out of ->nr_running
>> */
>> - if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
>> + if (!cpu_rq(cpu)->nr_running)
>> return true;
>>
>> return false;
>
> Hi Valentin. I've done a simple unixbench test (Pipe-based Context
> Switching) on my x86 machine with full threads (104).
>
> old patch1 patch1+patch2
> score 7825.4 7500(more)-8000 9061.6
>
> patch1: use !cpu_rq(cpu)->nr_running instead of cpu_rq(cpu)->nr_running <= 1
> patch2: ignore WF_ON_CPU check
>
> The score of patch1 is not stable. I've tested for many times and the
> score is floating between about 7500-8000 (more at 7500).
>
> patch1 means more strict limit on using wakelist. But it may cause
> performance regression.
>
> It seems that, using wakelist properly can help improve wakeup
> performance, but using it too much may cause more IPIs. It's a trade-off
> about how strict the ttwu_queue_cond() is.
>
> Anyhow, I think patch2 should be a pure improvement. What's your idea?
Thanks for separately testing these two.
I take it the results for patch1 are noticeably more swingy than the
baseline? (FWIW boxplots are usually a nice way to summarize those sorts
of results.)
WF_ON_CPU && nr_running == 1 means the wakee is scheduling out *and* there
is another task queued, I'm guessing that's relatively common in your
unixbench scenario...
Either way, I think we want to keep the two changes separate for the sake
of testing and bisecting.
On 31/05/22 15:20, Tianchen Ding wrote:
> On 2022/5/31 00:24, Valentin Schneider wrote:
>>
>> This feels a bit like a generalization of
>>
>> 2ebb17717550 ("sched/core: Offload wakee task activation if it the wakee is descheduling")
>>
>> Given rq->curr is updated before prev->on_cpu is cleared, the waker
>> executing ttwu_queue_cond() can observe:
>>
>> p->on_rq=0
>> p->on_cpu=1
>> rq->curr=swapper/x (aka idle task)
>>
>> So your addition of available_idle_cpu() in ttwu_queue_cond() (sort of)
>> matches that when invoked via:
>>
>> if (smp_load_acquire(&p->on_cpu) &&
>> ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
>> goto unlock;
>>
>> but it also affects
>>
>> ttwu_queue(p, cpu, wake_flags);
>>
>> at the tail end of try_to_wake_up().
>
> Yes. This part is what we mainly want to affect. The above WF_ON_CPU is
> not our point.
>
Right.
>>
>> With all that in mind, I'm curious whether your patch is functionaly close
>> to the below.
>>
>> ---
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 66c4e5922fe1..ffd43264722a 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3836,7 +3836,7 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
>> * the soon-to-be-idle CPU as the current CPU is likely busy.
>> * nr_running is checked to avoid unnecessary task stacking.
>> */
>> - if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
>> + if (cpu_rq(cpu)->nr_running <= 1)
>> return true;
>>
>> return false;
>
> It's a little different. This may bring extra IPIs when nr_running == 1
> and the current task on wakee cpu is not the target wakeup task (i.e.,
> rq->curr == another_task && rq->curr != p). Then this another_task may
> be disturbed by IPI which is not expected. So IMO the promise by
> WF_ON_CPU is necessary.
You're right, actually taking a second look at that WF_ON_CPU path,
shouldn't the existing condition be:
if ((wake_flags & WF_ON_CPU) && !cpu_rq(cpu)->nr_running)
? Per the p->on_rq and p->on_cpu ordering, if we have WF_ON_CPU here then
we must have !p->on_rq, so the deactivate has happened, thus the task
being alone on the rq implies nr_running==0.
@Mel, do you remember why you went for <=1 here? I couldn't find any clues
on the original posting.
On Tue, May 31, 2022 at 12:50:49PM +0100, Valentin Schneider wrote:
> >> With all that in mind, I'm curious whether your patch is functionaly close
> >> to the below.
> >>
> >> ---
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index 66c4e5922fe1..ffd43264722a 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -3836,7 +3836,7 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
> >> * the soon-to-be-idle CPU as the current CPU is likely busy.
> >> * nr_running is checked to avoid unnecessary task stacking.
> >> */
> >> - if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
> >> + if (cpu_rq(cpu)->nr_running <= 1)
> >> return true;
> >>
> >> return false;
> >
> > It's a little different. This may bring extra IPIs when nr_running == 1
> > and the current task on wakee cpu is not the target wakeup task (i.e.,
> > rq->curr == another_task && rq->curr != p). Then this another_task may
> > be disturbed by IPI which is not expected. So IMO the promise by
> > WF_ON_CPU is necessary.
>
> You're right, actually taking a second look at that WF_ON_CPU path,
> shouldn't the existing condition be:
>
> if ((wake_flags & WF_ON_CPU) && !cpu_rq(cpu)->nr_running)
>
> ? Per the p->on_rq and p->on_cpu ordering, if we have WF_ON_CPU here then
> we must have !p->on_rq, so the deactivate has happened, thus the task
> being alone on the rq implies nr_running==0.
>
> @Mel, do you remember why you went for <=1 here? I couldn't find any clues
> on the original posting.
>
I don't recall exactly why I went with <= 1 there but I may not have
considered the memory ordering of on_rq and nr_running and the comment
above it is literally what I was thinking at the time. I think you're
right and that check can be !cpu_rq(cpu)->nr_running.
--
Mel Gorman
SUSE Labs
On 2022/5/31 21:55, Mel Gorman wrote:
> On Tue, May 31, 2022 at 12:50:49PM +0100, Valentin Schneider wrote:
>>>> With all that in mind, I'm curious whether your patch is functionaly close
>>>> to the below.
>>>>
>>>> ---
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 66c4e5922fe1..ffd43264722a 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -3836,7 +3836,7 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
>>>> * the soon-to-be-idle CPU as the current CPU is likely busy.
>>>> * nr_running is checked to avoid unnecessary task stacking.
>>>> */
>>>> - if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
>>>> + if (cpu_rq(cpu)->nr_running <= 1)
>>>> return true;
>>>>
>>>> return false;
>>>
>>> It's a little different. This may bring extra IPIs when nr_running == 1
>>> and the current task on wakee cpu is not the target wakeup task (i.e.,
>>> rq->curr == another_task && rq->curr != p). Then this another_task may
>>> be disturbed by IPI which is not expected. So IMO the promise by
>>> WF_ON_CPU is necessary.
>>
>> You're right, actually taking a second look at that WF_ON_CPU path,
>> shouldn't the existing condition be:
>>
>> if ((wake_flags & WF_ON_CPU) && !cpu_rq(cpu)->nr_running)
>>
>> ? Per the p->on_rq and p->on_cpu ordering, if we have WF_ON_CPU here then
>> we must have !p->on_rq, so the deactivate has happened, thus the task
>> being alone on the rq implies nr_running==0.
>>
>> @Mel, do you remember why you went for <=1 here? I couldn't find any clues
>> on the original posting.
>>
>
> I don't recall exactly why I went with <= 1 there but I may not have
> considered the memory ordering of on_rq and nr_running and the comment
> above it is literally what I was thinking at the time. I think you're
> right and that check can be !cpu_rq(cpu)->nr_running.
>
If the check becomes !cpu_rq(cpu)->nr_running, my patch would change,
too. Shall we remove the WF_ON_CPU check completely?
On 2022/5/31 23:56, Valentin Schneider wrote:
> Thanks!
>
> So I'm thinking we could first make that into
>
> if ((wake_flags & WF_ON_CPU) && !cpu_rq(cpu)->nr_running)
>
> Then building on this, we can generalize using the wakelist to any remote
> idle CPU (which on paper isn't as much as a clear win as just WF_ON_CPU,
> depending on how deeply idle the CPU is...)
>
> We need the cpu != this_cpu check, as that's currently served by the
> WF_ON_CPU check (AFAIU we can only observe p->on_cpu in there for remote
> tasks).
>
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 66c4e5922fe1..60038743f2f1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3830,13 +3830,20 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
> if (!cpus_share_cache(smp_processor_id(), cpu))
> return true;
>
> + if (cpu == smp_processor_id())
> + return false;
> +
> /*
> * If the task is descheduling and the only running task on the
> * CPU then use the wakelist to offload the task activation to
> * the soon-to-be-idle CPU as the current CPU is likely busy.
> * nr_running is checked to avoid unnecessary task stacking.
> + *
> + * Note that we can only get here with (wakee) p->on_rq=0,
> + * p->on_cpu can be whatever, we've done the dequeue, so
> + * the wakee has been accounted out of ->nr_running
> */
> - if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
> + if (!cpu_rq(cpu)->nr_running)
> return true;
>
> return false;
Hi Valentin. I've done a simple unixbench test (Pipe-based Context
Switching) on my x86 machine with full threads (104).

          old       patch1            patch1+patch2
score     7825.4    7500(more)-8000   9061.6

patch1: use !cpu_rq(cpu)->nr_running instead of cpu_rq(cpu)->nr_running <= 1
patch2: ignore the WF_ON_CPU check

The score of patch1 is not stable. I've tested many times and the score
floats between about 7500 and 8000 (mostly at 7500).

patch1 means a stricter limit on using the wakelist, but it may cause a
performance regression.

It seems that using the wakelist properly can help improve wakeup
performance, but using it too much may cause more IPIs. It's a trade-off
in how strict ttwu_queue_cond() is.

Anyhow, I think patch2 should be a pure improvement. What's your idea?