2023-09-12 14:34:57

by K Prateek Nayak

Subject: Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

Hello Chenyu,

On 9/12/2023 6:02 PM, Chen Yu wrote:
> [..snip..]
>
>>> If I understand correctly, WF_SYNC is to let the wakee be woken up
>>> on the waker's CPU, rather than on the wakee's previous CPU, because
>>> the waker goes to sleep after the wakeup. SIS_CACHE mainly cares about
>>> the wakee's previous CPU. We can only prevent other wakees from
>>> occupying the previous CPU, but we do not increase the chance that
>>> wake_affine_idle() chooses the previous CPU.
>>
>> Correct me if I'm wrong here,
>>
>> Say a short sleeper is always woken up with the WF_SYNC flag. When the
>> task is dequeued, we mark the previous CPU where it ran as "cache-hot"
>> and restrict other wakeups from targeting it until the
>> "cache_hot_timeout" is crossed. Let us assume a perfect world where the
>> task wakes up before the "cache_hot_timeout" expires. Logically this
>> CPU was reserved all this while for the short sleeper, but since the
>> wakeup bears the WF_SYNC flag, the whole reservation is ignored and the
>> waker's LLC is explored instead.
>>
>
> Ah, I see your point. Do you mean that because the waker has WF_SYNC,
> wake_affine_idle() forces the short-sleeping wakee to be woken up on
> the waker's CPU rather than the wakee's previous CPU, so the wakee's
> previous CPU has been marked as cache-hot for nothing?

Precisely :)
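
For reference, here is roughly what wake_affine_idle() does today
(paraphrased and simplified, not the exact kernel code):

	static int wake_affine_idle(int this_cpu, int prev_cpu, int sync)
	{
		/* An idle waker CPU sharing cache with prev_cpu: pick
		 * whichever of the two is idle, preferring prev_cpu. */
		if (available_idle_cpu(this_cpu) &&
		    cpus_share_cache(this_cpu, prev_cpu))
			return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

		/* WF_SYNC: the waker is about to sleep, so its CPU is
		 * considered a good target even though it is not idle. */
		if (sync && cpu_rq(this_cpu)->nr_running == 1)
			return this_cpu;

		if (available_idle_cpu(prev_cpu))
			return prev_cpu;

		return nr_cpumask_bits;	/* no affine decision */
	}

It is the WF_SYNC branch above that bypasses the reservation.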

>
>> Should the timeout be cleared if the wakeup decides not to target the
>> previous CPU? (The default "sysctl_sched_migration_cost" is probably
>> small enough to curb any side effect that could show up here, but if a
>> genuine use-case warrants setting "sysctl_sched_migration_cost" to a
>> larger value, the wakeup path might be affected: a lot of idle targets
>> would be overlooked since the CPUs are marked cache-hot for a longer
>> duration.)
>>
>> Let me know what you think.
>>
>
> This makes sense. In theory the above logic can be added in
> select_idle_sibling(): if the target CPU is chosen rather than
> the previous CPU, the previous CPU's cache-hot flag should be
> cleared.
>
> But this might bring overhead, because we would need to grab the rq
> lock and write to another CPU's rq, which could be costly. It seems
> to be a trade-off of the current implementation.

I agree, it will not be pretty. Maybe the other way is to keep a
history of the type of wakeup the task experiences (similar to
wakee_flips, but for sync and non-sync wakeups) and only reserve
the CPU if the task is woken more often via non-sync wakeups?
Thinking out loud here.
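
A rough sketch of that idea (the fields and the helper below are
hypothetical, named only for illustration):

	/* in try_to_wake_up(), once the wake flags are known */
	if (wake_flags & WF_SYNC)
		p->sync_wakeups++;		/* hypothetical counter */
	else
		p->nonsync_wakeups++;		/* hypothetical counter */

	/* at dequeue, reserve the previous CPU only for tasks that
	 * are mostly woken without WF_SYNC */
	if (p->nonsync_wakeups > p->sync_wakeups)
		mark_prev_cpu_cache_hot(rq, p);	/* hypothetical helper */

A decaying average like wakee_flips would likely age better than raw
counters, but the gist is the same.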

> On the other
> hand, if the user sets the sysctl_sched_migration_cost to quite a
> large value:
> 1. Without SIS_CACHE, there is no task migration.

But that is in the load balancing path. I think the wakeup path will
still migrate the task. However, I believe there would be very few
cases where all CPUs are marked cache-hot and SIS_UTIL does not bail
out straight away as a result of high utilization. Probably a rare
scenario.

> 2. With SIS_CACHE enabled, all idle CPUs are cache-hot and are skipped
> in select_idle_cpu(), so the wakee is woken up locally.
> It seems to have the same effect, so there is not much impact
> on wakeup behavior, I suppose.
>
> [..snip..]
>

--
Thanks and Regards,
Prateek


2023-09-13 08:12:13

by Chen Yu

Subject: Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

On 2023-09-12 at 19:56:37 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 9/12/2023 6:02 PM, Chen Yu wrote:
> > [..snip..]
> >
> >>> If I understand correctly, WF_SYNC is to let the wakee be woken up
> >>> on the waker's CPU, rather than on the wakee's previous CPU, because
> >>> the waker goes to sleep after the wakeup. SIS_CACHE mainly cares about
> >>> the wakee's previous CPU. We can only prevent other wakees from
> >>> occupying the previous CPU, but we do not increase the chance that
> >>> wake_affine_idle() chooses the previous CPU.
> >>
> >> Correct me if I'm wrong here,
> >>
> >> Say a short sleeper is always woken up with the WF_SYNC flag. When the
> >> task is dequeued, we mark the previous CPU where it ran as "cache-hot"
> >> and restrict other wakeups from targeting it until the
> >> "cache_hot_timeout" is crossed. Let us assume a perfect world where the
> >> task wakes up before the "cache_hot_timeout" expires. Logically this
> >> CPU was reserved all this while for the short sleeper, but since the
> >> wakeup bears the WF_SYNC flag, the whole reservation is ignored and the
> >> waker's LLC is explored instead.
> >>
> >
> > Ah, I see your point. Do you mean that because the waker has WF_SYNC,
> > wake_affine_idle() forces the short-sleeping wakee to be woken up on
> > the waker's CPU rather than the wakee's previous CPU, so the wakee's
> > previous CPU has been marked as cache-hot for nothing?
>
> Precisely :)
>
> >
> >> Should the timeout be cleared if the wakeup decides not to target the
> >> previous CPU? (The default "sysctl_sched_migration_cost" is probably
> >> small enough to curb any side effect that could show up here, but if a
> >> genuine use-case warrants setting "sysctl_sched_migration_cost" to a
> >> larger value, the wakeup path might be affected: a lot of idle targets
> >> would be overlooked since the CPUs are marked cache-hot for a longer
> >> duration.)
> >>
> >> Let me know what you think.
> >>
> >
> > This makes sense. In theory the above logic can be added in
> > select_idle_sibling(): if the target CPU is chosen rather than
> > the previous CPU, the previous CPU's cache-hot flag should be
> > cleared.
> >
> > But this might bring overhead, because we would need to grab the rq
> > lock and write to another CPU's rq, which could be costly. It seems
> > to be a trade-off of the current implementation.
>
> I agree, it will not be pretty. Maybe the other way is to keep a
> history of the type of wakeup the task experiences (similar to
> wakee_flips, but for sync and non-sync wakeups) and only reserve
> the CPU if the task is woken more often via non-sync wakeups?
> Thinking out loud here.
>

This looks good, as it considers the task's attribute. Or maybe
something like this:

	new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
	if (new_cpu != prev_cpu)
		p->burst_sleep_avg >>= 1;

So the duration of the reservation could be shrunk.

> > On the other
> > hand, if the user sets the sysctl_sched_migration_cost to quite a
> > large value:
> > 1. Without SIS_CACHE, there is no task migration.
>
> But that is in the load balancing path. I think the wakeup path will
> still migrate the task.

OK, I see.

> However, I believe there would be very few
> cases where all CPUs are marked cache-hot and SIS_UTIL does not bail
> out straight away as a result of high utilization. Probably a rare
> scenario.
>

Agree.

thanks,
Chenyu

2023-09-14 08:11:56

by K Prateek Nayak

Subject: Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

Hello Chenyu,

On 9/13/2023 8:27 AM, Chen Yu wrote:
> On 2023-09-12 at 19:56:37 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> On 9/12/2023 6:02 PM, Chen Yu wrote:
>>> [..snip..]
>>>
>>>>> If I understand correctly, WF_SYNC is to let the wakee be woken up
>>>>> on the waker's CPU, rather than on the wakee's previous CPU, because
>>>>> the waker goes to sleep after the wakeup. SIS_CACHE mainly cares about
>>>>> the wakee's previous CPU. We can only prevent other wakees from
>>>>> occupying the previous CPU, but we do not increase the chance that
>>>>> wake_affine_idle() chooses the previous CPU.
>>>>
>>>> Correct me if I'm wrong here,
>>>>
>>>> Say a short sleeper is always woken up with the WF_SYNC flag. When the
>>>> task is dequeued, we mark the previous CPU where it ran as "cache-hot"
>>>> and restrict other wakeups from targeting it until the
>>>> "cache_hot_timeout" is crossed. Let us assume a perfect world where the
>>>> task wakes up before the "cache_hot_timeout" expires. Logically this
>>>> CPU was reserved all this while for the short sleeper, but since the
>>>> wakeup bears the WF_SYNC flag, the whole reservation is ignored and the
>>>> waker's LLC is explored instead.
>>>>
>>>
>>> Ah, I see your point. Do you mean that because the waker has WF_SYNC,
>>> wake_affine_idle() forces the short-sleeping wakee to be woken up on
>>> the waker's CPU rather than the wakee's previous CPU, so the wakee's
>>> previous CPU has been marked as cache-hot for nothing?
>>
>> Precisely :)
>>
>>>
>>>> Should the timeout be cleared if the wakeup decides not to target the
>>>> previous CPU? (The default "sysctl_sched_migration_cost" is probably
>>>> small enough to curb any side effect that could show up here, but if a
>>>> genuine use-case warrants setting "sysctl_sched_migration_cost" to a
>>>> larger value, the wakeup path might be affected: a lot of idle targets
>>>> would be overlooked since the CPUs are marked cache-hot for a longer
>>>> duration.)
>>>>
>>>> Let me know what you think.
>>>>
>>>
>>> This makes sense. In theory the above logic can be added in
>>> select_idle_sibling(): if the target CPU is chosen rather than
>>> the previous CPU, the previous CPU's cache-hot flag should be
>>> cleared.
>>>
>>> But this might bring overhead, because we would need to grab the rq
>>> lock and write to another CPU's rq, which could be costly. It seems
>>> to be a trade-off of the current implementation.
>>
>> I agree, it will not be pretty. Maybe the other way is to keep a
>> history of the type of wakeup the task experiences (similar to
>> wakee_flips, but for sync and non-sync wakeups) and only reserve
>> the CPU if the task is woken more often via non-sync wakeups?
>> Thinking out loud here.
>>
>
> This looks good, as it considers the task's attribute. Or maybe
> something like this:
>
> 	new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> 	if (new_cpu != prev_cpu)
> 		p->burst_sleep_avg >>= 1;
>
> So the duration of the reservation could be shrunk.

That seems like a good approach.

Meanwhile, here are the results for the current series without any
modifications:

tl;dr

- There seems to be a noticeable increase in hackbench runtime with a
single group, but big gains beyond that. The regression could possibly
be because of the added searching, but let me do some digging to
confirm that.

- Small regressions (~2%) noticed in ycsb-mongodb (medium utilization)
and DeathStarBench (high utilization)

- Other benchmarks are more or less perf neutral with the changes.

More information below:

o System information

- Dual socket 3rd Generation EPYC System (2 x 64C/128T)
- NPS1 mode (each socket is a NUMA node)
- Boost Enabled
- C2 disabled (MWAIT-based C1 is still enabled)


o Kernel information

base : tip:sched/core at commit b41bbb33cf75 ("Merge branch
'sched/eevdf' into sched/core")
+ cherry-pick commit 63304558ba5d ("sched/eevdf: Curb
wakeup-preemption")

SIS_CACHE : base
+ this series as is


o Benchmark results

==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: base[pct imp](CV) SIS_CACHE[pct imp](CV)
1-groups 1.00 [ -0.00]( 1.89) 1.10 [-10.28]( 2.03)
2-groups 1.00 [ -0.00]( 2.04) 0.98 [ 1.57]( 2.04)
4-groups 1.00 [ -0.00]( 2.38) 0.95 [ 4.70]( 0.88)
8-groups 1.00 [ -0.00]( 1.52) 0.93 [ 7.18]( 0.76)
16-groups 1.00 [ -0.00]( 3.44) 0.90 [ 9.76]( 1.04)


==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: base[pct imp](CV) SIS_CACHE[pct imp](CV)
1 1.00 [ 0.00]( 0.18) 0.98 [ -1.61]( 0.27)
2 1.00 [ 0.00]( 0.63) 0.98 [ -1.58]( 0.09)
4 1.00 [ 0.00]( 0.86) 0.99 [ -0.52]( 0.42)
8 1.00 [ 0.00]( 0.22) 0.98 [ -1.77]( 0.65)
16 1.00 [ 0.00]( 1.99) 1.00 [ -0.10]( 1.55)
32 1.00 [ 0.00]( 4.29) 0.98 [ -1.73]( 1.55)
64 1.00 [ 0.00]( 1.71) 0.97 [ -2.77]( 3.74)
128 1.00 [ 0.00]( 0.65) 1.00 [ -0.14]( 0.88)
256 1.00 [ 0.00]( 0.19) 0.97 [ -2.65]( 0.49)
512 1.00 [ 0.00]( 0.20) 0.99 [ -1.10]( 0.33)
1024 1.00 [ 0.00]( 0.29) 0.99 [ -0.70]( 0.16)


==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: base[pct imp](CV) SIS_CACHE[pct imp](CV)
Copy 1.00 [ 0.00]( 4.32) 0.90 [ -9.82](10.72)
Scale 1.00 [ 0.00]( 5.21) 1.01 [ 0.59]( 1.83)
Add 1.00 [ 0.00]( 6.25) 0.99 [ -0.91]( 4.49)
Triad 1.00 [ 0.00](10.74) 1.02 [ 2.28]( 6.07)


==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: base[pct imp](CV) SIS_CACHE[pct imp](CV)
Copy 1.00 [ 0.00]( 0.70) 0.98 [ -1.79]( 2.26)
Scale 1.00 [ 0.00]( 6.55) 1.03 [ 2.80]( 0.74)
Add 1.00 [ 0.00]( 6.53) 1.02 [ 2.05]( 1.82)
Triad 1.00 [ 0.00]( 6.66) 1.04 [ 3.54]( 1.04)


==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: base[pct imp](CV) SIS_CACHE[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.46) 0.99 [ -0.55]( 0.49)
2-clients 1.00 [ 0.00]( 0.38) 0.99 [ -1.23]( 1.19)
4-clients 1.00 [ 0.00]( 0.72) 0.98 [ -1.91]( 1.21)
8-clients 1.00 [ 0.00]( 0.98) 0.98 [ -1.61]( 1.08)
16-clients 1.00 [ 0.00]( 0.70) 0.98 [ -1.80]( 1.04)
32-clients 1.00 [ 0.00]( 0.74) 0.98 [ -1.55]( 1.20)
64-clients 1.00 [ 0.00]( 2.24) 1.00 [ -0.04]( 2.77)
128-clients 1.00 [ 0.00]( 1.72) 1.03 [ 3.22]( 1.99)
256-clients 1.00 [ 0.00]( 4.44) 0.99 [ -1.33]( 4.71)
512-clients 1.00 [ 0.00](52.42) 0.98 [ -1.61](52.72)


==================================================================
Test : schbench (old)
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: base[pct imp](CV) SIS_CACHE[pct imp](CV)
1 1.00 [ -0.00]( 2.28) 0.96 [ 4.00](15.68)
2 1.00 [ -0.00]( 6.42) 1.00 [ -0.00](10.96)
4 1.00 [ -0.00]( 3.77) 0.97 [ 3.33]( 7.61)
8 1.00 [ -0.00](13.83) 1.08 [ -7.89]( 2.86)
16 1.00 [ -0.00]( 4.37) 1.00 [ -0.00]( 2.13)
32 1.00 [ -0.00]( 8.69) 0.95 [ 4.94]( 2.73)
64 1.00 [ -0.00]( 2.30) 1.05 [ -5.13]( 1.26)
128 1.00 [ -0.00](12.12) 1.03 [ -3.41]( 5.08)
256 1.00 [ -0.00](26.04) 0.91 [ 8.88]( 2.59)
512 1.00 [ -0.00]( 5.62) 0.97 [ 3.32]( 0.37)


==================================================================
Test : Unixbench
Units : Various, Throughput
Interpretation: Higher is better
Statistic : AMean, Hmean (Specified)
==================================================================
Metric variant base SIS_CACHE
Hmean unixbench-dhry2reg-1 41248390.97 ( 0.00%) 41485503.82 ( 0.57%)
Hmean unixbench-dhry2reg-512 6239969914.15 ( 0.00%) 6233919689.40 ( -0.10%)
Amean unixbench-syscall-1 2968518.27 ( 0.00%) 2841236.43 * 4.29%*
Amean unixbench-syscall-512 7790656.20 ( 0.00%) 7631558.00 * 2.04%*
Hmean unixbench-pipe-1 2535689.01 ( 0.00%) 2598208.16 * 2.47%*
Hmean unixbench-pipe-512 361385055.25 ( 0.00%) 368566373.76 * 1.99%*
Hmean unixbench-spawn-1 4506.26 ( 0.00%) 4551.67 ( 1.01%)
Hmean unixbench-spawn-512 69380.09 ( 0.00%) 69264.30 ( -0.17%)
Hmean unixbench-execl-1 3824.57 ( 0.00%) 3822.67 ( -0.05%)
Hmean unixbench-execl-512 12288.64 ( 0.00%) 11728.12 ( -4.56%)


==================================================================
Test : ycsb-mongodb
Units : Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
base : 309589.33 (var: 1.41%)
SIS_CACHE : 304931.33 (var: 1.29%) [diff: -1.50%]


==================================================================
Test : DeathStarBench
Units : Normalized Throughput, relative to base
Interpretation: Higher is better
Statistic : AMean
==================================================================
Pinning base SIS_CACHE
1 CCD 100% 99.18% [%diff: -0.82%]
2 CCD 100% 97.46% [%diff: -2.54%]
4 CCD 100% 97.22% [%diff: -2.78%]
8 CCD 100% 99.01% [%diff: -0.99%]

--

The regression observed could either be because of the larger search time
to find a non-cache-hot idle CPU, or perhaps just the larger search time
in general adding to utilization and curbing the SIS_UTIL scan limits
further.
I'll go gather some stats to back my suspicion (particularly for
hackbench).
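
For context, the interaction I suspect sits in the select_idle_cpu()
scan loop; roughly (simplified, with cache_hot() standing in for the
check this series adds):

	for_each_cpu_wrap(cpu, cpus, target + 1) {
		if (!--nr)			/* SIS_UTIL scan-depth budget */
			return -1;
		if (cache_hot(cpu))		/* skipped under SIS_CACHE */
			continue;
		if (available_idle_cpu(cpu))
			return cpu;
	}

If the skipped cache-hot CPUs still consume the SIS_UTIL budget, fewer
genuine candidates are examined before the search gives up.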

>
> [..snip..]

--
Thanks and Regards,
Prateek

2023-09-14 16:11:26

by Chen Yu

Subject: Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

Hi Prateek,

Thanks for the test,

On 2023-09-14 at 09:43:52 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 9/13/2023 8:27 AM, Chen Yu wrote:
> > On 2023-09-12 at 19:56:37 +0530, K Prateek Nayak wrote:
> >> Hello Chenyu,
> >>
> >> On 9/12/2023 6:02 PM, Chen Yu wrote:
> >>> [..snip..]
> >>>
> >>>>> If I understand correctly, WF_SYNC is to let the wakee be woken up
> >>>>> on the waker's CPU, rather than on the wakee's previous CPU, because
> >>>>> the waker goes to sleep after the wakeup. SIS_CACHE mainly cares about
> >>>>> the wakee's previous CPU. We can only prevent other wakees from
> >>>>> occupying the previous CPU, but we do not increase the chance that
> >>>>> wake_affine_idle() chooses the previous CPU.
> >>>>
> >>>> Correct me if I'm wrong here,
> >>>>
> >>>> Say a short sleeper is always woken up with the WF_SYNC flag. When the
> >>>> task is dequeued, we mark the previous CPU where it ran as "cache-hot"
> >>>> and restrict other wakeups from targeting it until the
> >>>> "cache_hot_timeout" is crossed. Let us assume a perfect world where the
> >>>> task wakes up before the "cache_hot_timeout" expires. Logically this
> >>>> CPU was reserved all this while for the short sleeper, but since the
> >>>> wakeup bears the WF_SYNC flag, the whole reservation is ignored and the
> >>>> waker's LLC is explored instead.
> >>>>
> >>>
> >>> Ah, I see your point. Do you mean that because the waker has WF_SYNC,
> >>> wake_affine_idle() forces the short-sleeping wakee to be woken up on
> >>> the waker's CPU rather than the wakee's previous CPU, so the wakee's
> >>> previous CPU has been marked as cache-hot for nothing?
> >>
> >> Precisely :)
> >>
> >>>
> >>>> Should the timeout be cleared if the wakeup decides not to target the
> >>>> previous CPU? (The default "sysctl_sched_migration_cost" is probably
> >>>> small enough to curb any side effect that could show up here, but if a
> >>>> genuine use-case warrants setting "sysctl_sched_migration_cost" to a
> >>>> larger value, the wakeup path might be affected: a lot of idle targets
> >>>> would be overlooked since the CPUs are marked cache-hot for a longer
> >>>> duration.)
> >>>>
> >>>> Let me know what you think.
> >>>>
> >>>
> >>> This makes sense. In theory the above logic can be added in
> >>> select_idle_sibling(): if the target CPU is chosen rather than
> >>> the previous CPU, the previous CPU's cache-hot flag should be
> >>> cleared.
> >>>
> >>> But this might bring overhead, because we would need to grab the rq
> >>> lock and write to another CPU's rq, which could be costly. It seems
> >>> to be a trade-off of the current implementation.
> >>
> >> I agree, it will not be pretty. Maybe the other way is to keep a
> >> history of the type of wakeup the task experiences (similar to
> >> wakee_flips, but for sync and non-sync wakeups) and only reserve
> >> the CPU if the task is woken more often via non-sync wakeups?
> >> Thinking out loud here.
> >>
> >
> > This looks good, as it considers the task's attribute. Or maybe
> > something like this:
> >
> > 	new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> > 	if (new_cpu != prev_cpu)
> > 		p->burst_sleep_avg >>= 1;
> >
> > So the duration of the reservation could be shrunk.
>
> That seems like a good approach.
>
> Meanwhile, here are the results for the current series without any
> modifications:
>
> tl;dr
>
> - There seems to be a noticeable increase in hackbench runtime with a
> single group, but big gains beyond that. The regression could possibly
> be because of the added searching, but let me do some digging to
> confirm that.

Ah, OK. May I have the command used to run the 1-group hackbench?

>
> - Small regressions (~2%) noticed in ycsb-mongodb (medium utilization)
> and DeathStarBench (high utilization)
>
> - Other benchmarks are more or less perf neutral with the changes.
>
> More information below:
>
> o System information
>
> - Dual socket 3rd Generation EPYC System (2 x 64C/128T)
> - NPS1 mode (each socket is a NUMA node)
> - Boost Enabled
> - C2 disabled (MWAIT-based C1 is still enabled)
>
>
> o Kernel information
>
> base : tip:sched/core at commit b41bbb33cf75 ("Merge branch
> 'sched/eevdf' into sched/core")
> + cherry-pick commit 63304558ba5d ("sched/eevdf: Curb
> wakeup-preemption")
>
> SIS_CACHE : base
> + this series as is
>
>
> o Benchmark results
>
> ==================================================================
> Test : hackbench
> Units : Normalized time in seconds
> Interpretation: Lower is better
> Statistic : AMean
> ==================================================================
> Case: base[pct imp](CV) SIS_CACHE[pct imp](CV)
> 1-groups 1.00 [ -0.00]( 1.89) 1.10 [-10.28]( 2.03)
> 2-groups 1.00 [ -0.00]( 2.04) 0.98 [ 1.57]( 2.04)
> 4-groups 1.00 [ -0.00]( 2.38) 0.95 [ 4.70]( 0.88)
> 8-groups 1.00 [ -0.00]( 1.52) 0.93 [ 7.18]( 0.76)
> 16-groups 1.00 [ -0.00]( 3.44) 0.90 [ 9.76]( 1.04)
>
>
> ==================================================================
> Test : tbench
> Units : Normalized throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: base[pct imp](CV) SIS_CACHE[pct imp](CV)
> 1 1.00 [ 0.00]( 0.18) 0.98 [ -1.61]( 0.27)
> 2 1.00 [ 0.00]( 0.63) 0.98 [ -1.58]( 0.09)
> 4 1.00 [ 0.00]( 0.86) 0.99 [ -0.52]( 0.42)
> 8 1.00 [ 0.00]( 0.22) 0.98 [ -1.77]( 0.65)
> 16 1.00 [ 0.00]( 1.99) 1.00 [ -0.10]( 1.55)
> 32 1.00 [ 0.00]( 4.29) 0.98 [ -1.73]( 1.55)
> 64 1.00 [ 0.00]( 1.71) 0.97 [ -2.77]( 3.74)
> 128 1.00 [ 0.00]( 0.65) 1.00 [ -0.14]( 0.88)
> 256 1.00 [ 0.00]( 0.19) 0.97 [ -2.65]( 0.49)
> 512 1.00 [ 0.00]( 0.20) 0.99 [ -1.10]( 0.33)
> 1024 1.00 [ 0.00]( 0.29) 0.99 [ -0.70]( 0.16)
>
>
> ==================================================================
> Test : stream-10
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: base[pct imp](CV) SIS_CACHE[pct imp](CV)
> Copy 1.00 [ 0.00]( 4.32) 0.90 [ -9.82](10.72)
> Scale 1.00 [ 0.00]( 5.21) 1.01 [ 0.59]( 1.83)
> Add 1.00 [ 0.00]( 6.25) 0.99 [ -0.91]( 4.49)
> Triad 1.00 [ 0.00](10.74) 1.02 [ 2.28]( 6.07)
>
>
> ==================================================================
> Test : stream-100
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: base[pct imp](CV) SIS_CACHE[pct imp](CV)
> Copy 1.00 [ 0.00]( 0.70) 0.98 [ -1.79]( 2.26)
> Scale 1.00 [ 0.00]( 6.55) 1.03 [ 2.80]( 0.74)
> Add 1.00 [ 0.00]( 6.53) 1.02 [ 2.05]( 1.82)
> Triad 1.00 [ 0.00]( 6.66) 1.04 [ 3.54]( 1.04)
>
>
> ==================================================================
> Test : netperf
> Units : Normalized Throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: base[pct imp](CV) SIS_CACHE[pct imp](CV)
> 1-clients 1.00 [ 0.00]( 0.46) 0.99 [ -0.55]( 0.49)
> 2-clients 1.00 [ 0.00]( 0.38) 0.99 [ -1.23]( 1.19)
> 4-clients 1.00 [ 0.00]( 0.72) 0.98 [ -1.91]( 1.21)
> 8-clients 1.00 [ 0.00]( 0.98) 0.98 [ -1.61]( 1.08)
> 16-clients 1.00 [ 0.00]( 0.70) 0.98 [ -1.80]( 1.04)
> 32-clients 1.00 [ 0.00]( 0.74) 0.98 [ -1.55]( 1.20)
> 64-clients 1.00 [ 0.00]( 2.24) 1.00 [ -0.04]( 2.77)
> 128-clients 1.00 [ 0.00]( 1.72) 1.03 [ 3.22]( 1.99)
> 256-clients 1.00 [ 0.00]( 4.44) 0.99 [ -1.33]( 4.71)
> 512-clients 1.00 [ 0.00](52.42) 0.98 [ -1.61](52.72)
>
>
> ==================================================================
> Test : schbench (old)
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: base[pct imp](CV) SIS_CACHE[pct imp](CV)
> 1 1.00 [ -0.00]( 2.28) 0.96 [ 4.00](15.68)
> 2 1.00 [ -0.00]( 6.42) 1.00 [ -0.00](10.96)
> 4 1.00 [ -0.00]( 3.77) 0.97 [ 3.33]( 7.61)
> 8 1.00 [ -0.00](13.83) 1.08 [ -7.89]( 2.86)
> 16 1.00 [ -0.00]( 4.37) 1.00 [ -0.00]( 2.13)
> 32 1.00 [ -0.00]( 8.69) 0.95 [ 4.94]( 2.73)
> 64 1.00 [ -0.00]( 2.30) 1.05 [ -5.13]( 1.26)
> 128 1.00 [ -0.00](12.12) 1.03 [ -3.41]( 5.08)
> 256 1.00 [ -0.00](26.04) 0.91 [ 8.88]( 2.59)
> 512 1.00 [ -0.00]( 5.62) 0.97 [ 3.32]( 0.37)
>
>
> ==================================================================
> Test : Unixbench
> Units : Various, Throughput
> Interpretation: Higher is better
> Statistic : AMean, Hmean (Specified)
> ==================================================================
> Metric variant base SIS_CACHE
> Hmean unixbench-dhry2reg-1 41248390.97 ( 0.00%) 41485503.82 ( 0.57%)
> Hmean unixbench-dhry2reg-512 6239969914.15 ( 0.00%) 6233919689.40 ( -0.10%)
> Amean unixbench-syscall-1 2968518.27 ( 0.00%) 2841236.43 * 4.29%*
> Amean unixbench-syscall-512 7790656.20 ( 0.00%) 7631558.00 * 2.04%*
> Hmean unixbench-pipe-1 2535689.01 ( 0.00%) 2598208.16 * 2.47%*
> Hmean unixbench-pipe-512 361385055.25 ( 0.00%) 368566373.76 * 1.99%*
> Hmean unixbench-spawn-1 4506.26 ( 0.00%) 4551.67 ( 1.01%)
> Hmean unixbench-spawn-512 69380.09 ( 0.00%) 69264.30 ( -0.17%)
> Hmean unixbench-execl-1 3824.57 ( 0.00%) 3822.67 ( -0.05%)
> Hmean unixbench-execl-512 12288.64 ( 0.00%) 11728.12 ( -4.56%)
>
>
> ==================================================================
> Test : ycsb-mongodb
> Units : Throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> base : 309589.33 (var: 1.41%)
> SIS_CACHE : 304931.33 (var: 1.29%) [diff: -1.50%]
>
>
> ==================================================================
> Test : DeathStarBench
> Units : Normalized Throughput, relative to base
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Pinning base SIS_CACHE
> 1 CCD 100% 99.18% [%diff: -0.82%]
> 2 CCD 100% 97.46% [%diff: -2.54%]
> 4 CCD 100% 97.22% [%diff: -2.78%]
> 8 CCD 100% 99.01% [%diff: -0.99%]
>
> --
>
> The regression observed could either be because of the larger search time
> to find a non-cache-hot idle CPU, or perhaps just the larger search time
> in general adding to utilization and curbing the SIS_UTIL scan limits
> further.

Yeah, that is possible. And you also mentioned that we should consider
a cache-hot idle CPU if we cannot find any cache-cold idle CPU; that
might be a better choice than forcibly putting the wakee on the current
CPU, which brings task stacking.
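
Perhaps something along these lines in the scan loop (again just a
sketch, with cache_hot() standing in for the check this series adds):

	int fallback = -1;

	for_each_cpu_wrap(cpu, cpus, target + 1) {
		if (!available_idle_cpu(cpu))
			continue;
		if (!cache_hot(cpu))
			return cpu;	/* idle and cache-cold: best case */
		if (fallback < 0)
			fallback = cpu;	/* remember an idle but cache-hot CPU */
	}

	return fallback;	/* still better than stacking on the target */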

> I'll go gather some stats to back my suspicion (particularly for
> hackbench).
>

Thanks!
Chenyu

2023-09-15 03:24:36

by K Prateek Nayak

Subject: Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

Hello Chenyu,

Thank you for taking a look at the report.

On 9/14/2023 4:31 PM, Chen Yu wrote:
> Hi Prateek,
>
> Thanks for the test,
>
> On 2023-09-14 at 09:43:52 +0530, K Prateek Nayak wrote:
>> [..snip..]
>>
>> Meanwhile, here are the results for the current series without any
>> modifications:
>>
>> tl;dr
>>
>> - There seems to be a noticeable increase in hackbench runtime with a
>> single group, but big gains beyond that. The regression could possibly
>> be because of the added searching, but let me do some digging to
>> confirm that.
>
> Ah, OK. May I have the command used to run the 1-group hackbench?

This is actually perf bench sched messaging. The cmdline from the runner
is:

$ perf bench sched messaging -t -p -l 100000 -g <# of groups>
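
(For reference: -t uses threads instead of processes, -p uses pipes
instead of socketpairs, -l is the number of message loops, and -g is
the number of groups. If I remember right, each group consists of 20
sender and 20 receiver tasks by default, so even "1 group" keeps 40
tasks communicating.)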

> [..snip..]

--
Thanks and Regards,
Prateek