2023-09-26 08:19:55

by Chen Yu

Subject: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

RFC -> v1:
- drop RFC
- Only record the short sleeping time for each task, to better honor the
burst sleeping tasks. (Mathieu Desnoyers)
- Keep the forward movement monotonic for runqueue's cache-hot timeout value.
(Mathieu Desnoyers, Aaron Lu)
- Introduce a new helper function cache_hot_cpu() that considers
rq->cache_hot_timeout. (Aaron Lu)
- Add analysis of why inhibiting task migration could bring better throughput
for some benchmarks. (Gautham R. Shenoy)
- Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
(K Prateek Nayak)

Thanks for your comments and review!

----------------------------------------------------------------------

When task p is woken up, the scheduler leverages select_idle_sibling()
to find an idle CPU for it. p's previous CPU is usually a preference
because it can improve cache locality. However in many cases, the
previous CPU has already been taken by other wakees, thus p has to
find another idle CPU.

Inhibiting task migration while keeping the work conservation of the
scheduler could benefit many workloads. Inspired by Mathieu's
proposal to limit the task migration ratio[1], this patch considers
the task's average sleep duration. If the task is a short-sleeping one,
its previous CPU is tagged as cache-hot for a short while. During this
reservation period, other wakees are not allowed to pick this idle CPU
until a timeout. Later, if the task is woken up again, it can find its
previous CPU still idle and choose it in select_idle_sibling().
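
In simplified form (condensed from the two patches below; see the diffs for
the exact code), the mechanism is:

  /* patch 1/2, enqueue at wakeup: track the task's average short sleep time */
  if (sleep_time < sysctl_sched_migration_cost)
          update_avg(&p->se.sis_rsv_avg, sleep_time);
  else
          p->se.sis_rsv_avg >>= 1;

  /* patch 2/2, dequeue when going to sleep: reserve the now-idle rq */
  if (task_sleep && !rq->nr_running && p->se.sis_rsv_avg)
          rq->cache_hot_timeout = max(rq->cache_hot_timeout,
                                      now + p->se.sis_rsv_avg);

  /* patch 2/2, select_idle_cpu(): a CPU with sched_clock_cpu(cpu) below
   * rq->cache_hot_timeout is treated as cache-hot; wakees prefer cache-cold
   * idle CPUs and fall back to the first cache-hot idle CPU only if no
   * cache-cold one exists. */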

This test is based on tip/sched/core, on top of
commit afc1996859a2
("sched/fair: Ratelimit update to tg->load_avg").

Commit afc1996859a2 has already significantly reduced the cost of task
migration, and SIS_CACHE further reduces that cost. SIS_CACHE shows a
noticeable throughput improvement for netperf/tbench at around 100% load.

[patch 1/2] records the task's average short sleeping time in
its sched_entity structure.
[patch 2/2] introduces SIS_CACHE to skip cache-hot
idle CPUs during wakeup.

Link: https://lore.kernel.org/lkml/[email protected]/ #1

Chen Yu (2):
sched/fair: Record the short sleeping time of a task
sched/fair: skip the cache hot CPU in select_idle_cpu()

include/linux/sched.h | 3 ++
kernel/sched/fair.c | 86 +++++++++++++++++++++++++++++++++++++++--
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 1 +
4 files changed, 87 insertions(+), 4 deletions(-)

--
2.25.1


2023-09-26 08:42:54

by Chen Yu

Subject: [PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

Problem statement:
When task p is woken up, the scheduler leverages select_idle_sibling()
to find an idle CPU for it. p's previous CPU is usually a preference
because it can improve cache locality. However in many cases, the
previous CPU has already been taken by other wakees, thus p has to
find another idle CPU.

Proposal:
Inspired by Mathieu's idea[1], introduce SIS_CACHE. It considers
the sleep time of the task for better task placement. Based on the
task's short sleeping history, keep p's previous CPU idle for a short
while. Later, when p is woken up, it can choose its previous CPU in
select_idle_sibling(). While p's previous CPU is reserved, other wakees
are not allowed to choose this CPU in select_idle_cpu(). The reservation
period is set to the task's average short sleep time, i.e. se->sis_rsv_avg.

This does not break the work conservation of the scheduler, because a
wakee will still try its best to find an idle CPU. The difference is that
different idle CPUs might have different priorities.

Prateek pointed out that, with SIS_CACHE enabled, if all idle CPUs are
cache-hot, select_idle_cpu() might have to choose a non-idle target CPU,
which brings task stacking. Mitigate this by returning the first cache-hot
idle CPU if no cache-cold CPU is found.

Benchmarks:
Tested on Intel Xeon Sapphire Rapids, 56C/112T x 2 sockets. The CPU frequency
governor is set to performance, and C-states deeper than C1 are disabled.
There is a noticeable improvement in netperf/tbench at around 100% load.
A 0.7% improvement is observed on an OLTP (on-line transaction processing)
benchmark.

hackbench throughput
=========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 4.34) +8.43 ( 10.67)
process-pipe 2-groups 1.00 ( 7.02) +0.48 ( 6.66)
process-pipe 4-groups 1.00 ( 1.69) +3.06 ( 5.26)
process-sockets 1-groups 1.00 ( 3.35) +0.51 ( 0.99)
process-sockets 2-groups 1.00 ( 3.56) +1.56 ( 0.85)
process-sockets 4-groups 1.00 ( 0.10) -0.10 ( 0.07)
threads-pipe 1-groups 1.00 ( 9.86) +8.21 ( 0.46)
threads-pipe 2-groups 1.00 ( 12.26) +1.95 ( 7.06)
threads-pipe 4-groups 1.00 ( 8.33) -5.81 ( 7.16)
threads-sockets 1-groups 1.00 ( 1.77) +0.41 ( 2.29)
threads-sockets 2-groups 1.00 ( 5.26) -7.86 ( 9.41)
threads-sockets 4-groups 1.00 ( 0.04) +0.02 ( 0.08)

netperf throughput
=======
case load baseline(std%) compare%( std%)
TCP_RR 56-threads 1.00 ( 1.57) +0.08 ( 1.23)
TCP_RR 112-threads 1.00 ( 1.01) -0.41 ( 0.28)
TCP_RR 168-threads 1.00 ( 0.95) -1.00 ( 0.70)
TCP_RR 224-threads 1.00 ( 0.70) +119.95 ( 34.26)
TCP_RR 280-threads 1.00 ( 16.77) -0.37 ( 17.40)
TCP_RR 336-threads 1.00 ( 20.37) -0.33 ( 22.35)
TCP_RR 392-threads 1.00 ( 28.30) -0.48 ( 31.66)
TCP_RR 448-threads 1.00 ( 39.48) -0.13 ( 38.72)
UDP_RR 56-threads 1.00 ( 1.47) -0.09 ( 1.57)
UDP_RR 112-threads 1.00 ( 7.88) -0.05 ( 9.98)
UDP_RR 168-threads 1.00 ( 10.48) -1.27 ( 16.89)
UDP_RR 224-threads 1.00 ( 15.08) -2.13 ( 23.49)
UDP_RR 280-threads 1.00 ( 21.16) +0.17 ( 23.05)
UDP_RR 336-threads 1.00 ( 24.41) +0.46 ( 22.11)
UDP_RR 392-threads 1.00 ( 27.58) +0.01 ( 29.47)
UDP_RR 448-threads 1.00 ( 37.33) +0.08 ( 56.11)

tbench throughput
======
case load baseline(std%) compare%( std%)
loopback 56-threads 1.00 ( 1.87) -0.45 ( 2.10)
loopback 112-threads 1.00 ( 0.21) -0.86 ( 0.12)
loopback 168-threads 1.00 ( 0.08) -2.47 ( 0.21)
loopback 224-threads 1.00 ( 2.57) +13.39 ( 0.79)
loopback 280-threads 1.00 ( 0.20) -0.07 ( 0.39)
loopback 336-threads 1.00 ( 0.14) +0.14 ( 0.22)
loopback 392-threads 1.00 ( 0.32) +0.25 ( 0.23)
loopback 448-threads 1.00 ( 0.87) +1.45 ( 0.17)

schbench 99.0th latency
========
case load baseline(std%) compare%( std%)
normal 1-mthreads 1.00 ( 0.36) +1.02 ( 0.63)
normal 2-mthreads 1.00 ( 2.44) -3.74 ( 3.92)
normal 4-mthreads 1.00 ( 3.14) +4.17 ( 3.87)

Analysis:
Gautham is interested in the reason why waking up the task on its previous
CPU brings benefits. The following digs into that.

Take the netperf 224-thread case as an example: netperf improves a lot
because less task migration brings better cache locality, as described
below.

bpftrace script to track the number of task migrations in each 10-second window:
kretfunc:select_task_rq_fair
{
    $p = (struct task_struct *)args->p;
    if ($p->comm == "netperf") {
        if ($p->thread_info.cpu != retval) {
            @wakeup_migrate_netperf = count();
        } else {
            @wakeup_prev_netperf = count();
        }
    }
}

interval:s:10
{
    time("\n%H:%M:%S Wakeup statistics: \n");
    print(@wakeup_migrate_netperf);
    clear(@wakeup_migrate_netperf);
    print(@wakeup_prev_netperf);
    clear(@wakeup_prev_netperf);
}
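
For reference, the script above can be saved to a file (the name wakeup.bt is
just an example) and run as root; kretfunc probes rely on kernel BTF, so a
kernel built with CONFIG_DEBUG_INFO_BTF is assumed:

bpftrace wakeup.bt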

NO_SIS_CACHE:
@wakeup_migrate_netperf: 48002669
@wakeup_prev_netperf: 14853103

SIS_CACHE:
@wakeup_migrate_netperf: 571
@wakeup_prev_netperf: 136620030

The task migration ratio has been reduced a lot with SIS_CACHE.

Use perf topdown to track the PMU events by running the following commands:
perf stat -M TopdownL1 -- sleep 10
perf stat -M tma_backend_bound_group -- sleep 10
perf stat -M tma_core_bound_group -- sleep 10
perf stat -M tma_ports_utilization_group -- sleep 10
perf stat -M tma_memory_bound_group -- sleep 10

NO_SIS_CACHE:
19.1 % tma_backend_bound <--------
7.1 % tma_core_bound
0.2 % tma_divider
35.0 % tma_ports_utilization
13.9 % tma_ports_utilized_0
15.0 % tma_ports_utilized_1
16.5 % tma_ports_utilized_3m
8.3 % tma_memory_bound
4.8 % tma_dram_bound
2.1 % tma_l2_bound
14.4 % tma_bad_speculation
43.0 % tma_frontend_bound
23.4 % tma_retiring

SIS_CACHE:
5.4 % tma_backend_bound <--------
7.5 % tma_core_bound
0.2 % tma_divider
39.8 % tma_ports_utilization
13.9 % tma_ports_utilized_0
15.3 % tma_ports_utilized_1
16.5 % tma_ports_utilized_3m
6.7 % tma_memory_bound
4.5 % tma_dram_bound
2.2 % tma_l2_bound
16.6 % tma_bad_speculation
44.7 % tma_frontend_bound
33.3 % tma_retiring

The tma_backend_bound ratio has decreased a lot. Within the backend-bound
metrics, tma_ports_utilized_x is the fraction of cycles in which the CPU
executed x uops per cycle across all execution ports (on this platform
these are Logical Processor cycles); it corresponds to the RESOURCE_STALLS
cycles on this Logical Processor. This indicates that waking the task up on
its previous logical CPU reduces the cycles spent on uops execution.
Meanwhile, the l2_bound is also reduced due to better L2 cache locality.

The reason why netperf shows more improvement on TCP_RR than on UDP_RR
might be that TCP_RR has a shorter sleep time than UDP_RR (per the
bpftrace result), which is easier for SIS_CACHE to detect and take care
of. TCP has to maintain a connection between the sender and receiver, so
the CPU has to do more work before falling asleep. UDP finishes sending
a packet much faster and enters sleep more frequently.

Limitations:
As pointed out by Gautham, other tasks could still be woken up on that
reserved CPU due to:

1) wake-affine choosing that CPU
2) newidle-balance pulling the other task on that CPU
3) !wake-affine && that CPU was also the other task's previous CPU

Currently we do not deal with these cases and only propose a simple
method; incremental optimizations could follow later.

Link: https://lore.kernel.org/lkml/[email protected]/ #1
Suggested-by: Tim Chen <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
kernel/sched/fair.c | 65 ++++++++++++++++++++++++++++++++++++++---
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 1 +
3 files changed, 63 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 297b9470829c..12a8b4594dff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6632,6 +6632,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
now = sched_clock_cpu(cpu_of(rq));
p->se.prev_dequeue_time = task_sleep ? now : 0;
+#ifdef CONFIG_SMP
+ /*
+ * If this rq will become idle, and if the dequeued
+ * task is a short sleeping one, check if we can reserve
+ * this idle CPU for that task for a short while.
+ * During this reservation period, other wakees will
+ * skip this 'idle' CPU in select_idle_cpu(), and this
+ * short sleeping task can pick its previous CPU in
+ * select_idle_sibling(), which brings better cache
+ * locality.
+ */
+ if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
+ p->se.sis_rsv_avg)
+ rq->cache_hot_timeout = max(rq->cache_hot_timeout, now + p->se.sis_rsv_avg);
+#endif
}

#ifdef CONFIG_SMP
@@ -6982,6 +6997,25 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
return new_cpu;
}

+/*
+ * If a short sleeping task has once run on this idle CPU
+ * not long ago, return 1 to indicate that the CPU is still
+ * cache-hot for that task, and no one except for that task
+ * should pick this cache-hot CPU during wakeup.
+ *
+ * The CPU is expected to be idle when invoking this function.
+ */
+static int cache_hot_cpu(int cpu)
+{
+ if (!sched_feat(SIS_CACHE))
+ return 0;
+
+ if (sched_clock_cpu(cpu) >= cpu_rq(cpu)->cache_hot_timeout)
+ return 0;
+
+ return 1;
+}
+
static inline int __select_idle_cpu(int cpu, struct task_struct *p)
{
if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
@@ -7125,7 +7159,7 @@ static inline int select_idle_smt(struct task_struct *p, int target)
static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
- int i, cpu, idle_cpu = -1, nr = INT_MAX;
+ int i, cpu, idle_cpu = -1, nr = INT_MAX, first_hot_cpu = -1;
struct sched_domain_shared *sd_share;
struct rq *this_rq = this_rq();
int this = smp_processor_id();
@@ -7180,18 +7214,41 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
for_each_cpu_wrap(cpu, cpus, target + 1) {
if (has_idle_core) {
i = select_idle_core(p, cpu, cpus, &idle_cpu);
- if ((unsigned int)i < nr_cpumask_bits)
+ /*
+ * Only the cache-cold idle CPU is returned. If no
+ * cache-cold CPU is found, choose the first idle cpu
+ * stored in idle_cpu.
+ */
+ if ((unsigned int)i < nr_cpumask_bits && !cache_hot_cpu(i))
return i;

} else {
if (!--nr)
return -1;
idle_cpu = __select_idle_cpu(cpu, p);
- if ((unsigned int)idle_cpu < nr_cpumask_bits)
- break;
+ if ((unsigned int)idle_cpu < nr_cpumask_bits) {
+ if (!cache_hot_cpu(idle_cpu)) {
+ first_hot_cpu = -1;
+ break;
+ }
+
+ /*
+ * Record the first cache-hot idle CPU
+ * as the last resort. This is to deal
+ * with the case that, every CPU is
+ * cache-hot, and we want to choose an
+ * idle CPU over a non-idle target CPU.
+ */
+ if (first_hot_cpu == -1)
+ first_hot_cpu = idle_cpu;
+ }
}
}

+ /* all idle CPUs are cache-hot, choose the first cache-hot one */
+ if (first_hot_cpu != -1)
+ idle_cpu = first_hot_cpu;
+
if (has_idle_core)
set_idle_cores(target, false);

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index f770168230ae..04ed9fcf67f8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -51,6 +51,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_CACHE, true)

/*
* Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 887468c48ff6..7aa728289395 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1079,6 +1079,7 @@ struct rq {
#endif
u64 idle_stamp;
u64 avg_idle;
+ u64 cache_hot_timeout;

unsigned long wake_stamp;
u64 wake_avg_idle;
--
2.25.1

2023-09-27 05:18:20

by Chen Yu

Subject: [PATCH 1/2] sched/fair: Record the short sleeping time of a task

During task wakeup, the wakee first checks whether its previous
running CPU is idle. If yes, that CPU is its first
choice. However, in most cases, the wakee's previous CPU
could already have been chosen by someone else, which breaks the
cache locality.

Propose a mechanism to reserve the task's previous
CPU for a short while. During this reservation period, other
tasks are not allowed to pick that CPU until a timeout.
The reservation period is defined as the average short
sleep time of the task, that is, the time delta
between this task being dequeued and enqueued.
Only sleep times shorter than sysctl_sched_migration_cost
are recorded. If the sleep time is longer than
sysctl_sched_migration_cost, the reservation period is
penalized by shrinking it to half. In this way, the 'burst'
sleeping time of the task is honored; meanwhile, if the
task becomes a long sleeper, its reservation time
shrinks to reduce the impact on task wakeup.
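
For reference, update_avg() used in the diff below follows the scheduler's
usual 1/8-weight moving-average form. A minimal user-space sketch of how
sis_rsv_avg would evolve is shown here; the sample values and the 500000 ns
stand-in for sysctl_sched_migration_cost are illustrative only:

#include <stdio.h>
#include <stdint.h>

/* Same shape as the scheduler's update_avg(): EWMA with 1/8 weight. */
static void update_avg(uint64_t *avg, uint64_t sample)
{
    int64_t diff = (int64_t)(sample - *avg);
    *avg += diff / 8;
}

int main(void)
{
    uint64_t migration_cost = 500000;   /* default sysctl value, in ns */
    uint64_t sleeps[] = { 20000, 30000, 25000, 2000000, 22000 };
    uint64_t sis_rsv_avg = 0;

    for (unsigned int i = 0; i < sizeof(sleeps) / sizeof(sleeps[0]); i++) {
        if (sleeps[i] < migration_cost)
            update_avg(&sis_rsv_avg, sleeps[i]);  /* short sleep: average it */
        else
            sis_rsv_avg >>= 1;                    /* long sleep: shrink reservation */
        printf("sleep=%llu ns -> sis_rsv_avg=%llu ns\n",
               (unsigned long long)sleeps[i],
               (unsigned long long)sis_rsv_avg);
    }
    return 0;
}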

Suggested-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/sched/fair.c | 21 +++++++++++++++++++++
2 files changed, 24 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dc37ae787e33..4a0ac0276384 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -561,6 +561,9 @@ struct sched_entity {
u64 vruntime;
s64 vlag;
u64 slice;
+ u64 prev_dequeue_time;
+ /* the reservation period of this task during wakeup */
+ u64 sis_rsv_avg;

u64 nr_migrations;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0877878bcdb..297b9470829c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6456,6 +6456,24 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct sched_entity *se = &p->se;
int idle_h_nr_running = task_has_idle_policy(p);
int task_new = !(flags & ENQUEUE_WAKEUP);
+ u64 last_dequeue = p->se.prev_dequeue_time;
+ u64 now = sched_clock_cpu(task_cpu(p));
+
+ /*
+ * If the task is a short-sleeping task, there is no need
+ * to migrate it to other CPUs. Estimate the average short sleeping
+ * time of the wakee. This sleep time is used as a hint to reserve
+ * the dequeued task's previous CPU for a short while. During this
+ * reservation period, select_idle_cpu() prevents other wakees from
+ * choosing this CPU. This could bring a better cache locality.
+ */
+ if ((flags & ENQUEUE_WAKEUP) && last_dequeue && cpu_online(task_cpu(p)) &&
+ now > last_dequeue) {
+ if (now - last_dequeue < sysctl_sched_migration_cost)
+ update_avg(&p->se.sis_rsv_avg, now - last_dequeue);
+ else
+ p->se.sis_rsv_avg >>= 1;
+ }

/*
* The code below (indirectly) updates schedutil which looks at
@@ -6550,6 +6568,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
int task_sleep = flags & DEQUEUE_SLEEP;
int idle_h_nr_running = task_has_idle_policy(p);
bool was_sched_idle = sched_idle_rq(rq);
+ u64 now;

util_est_dequeue(&rq->cfs, p);

@@ -6611,6 +6630,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
dequeue_throttle:
util_est_update(&rq->cfs, p, task_sleep);
hrtick_update(rq);
+ now = sched_clock_cpu(cpu_of(rq));
+ p->se.prev_dequeue_time = task_sleep ? now : 0;
}

#ifdef CONFIG_SMP
--
2.25.1

2023-09-27 08:55:04

by Aaron Lu

Subject: Re: [PATCH 1/2] sched/fair: Record the short sleeping time of a task

On Tue, Sep 26, 2023 at 01:11:02PM +0800, Chen Yu wrote:
> During task wakeup, the wakee firstly checks if its previous
> running CPU is idle. If yes, choose that CPU as its first
> choice. However, in most cases, the wakee's previous CPU
> could be chosen by someone else, which breaks the cache
> locality.
>
> Proposes a mechanism to reserve the task's previous
> CPU for a short while. In this reservation period, other
> tasks are not allowed to pick that CPU until a timeout.
> The reservation period is defined as the average short
> sleep time of the task. To be more specific, it is the
> time delta between this task being dequeued and enqueued.
> Only the sleep time shorter than sysctl_sched_migration_cost
> will be recorded. If the sleep time is longer than
> sysctl_sched_migration_cost, give the reservation period
> a penalty by shrinking it to half. In this way, the 'burst'
> sleeping time of the task is honored, meanwhile, if that
> task becomes a long-sleeper, the reservation time of that
> task is shrunk to reduce the impact on task wakeup.
>
> Suggested-by: Mathieu Desnoyers <[email protected]>
> Signed-off-by: Chen Yu <[email protected]>
> ---
> include/linux/sched.h | 3 +++
> kernel/sched/fair.c | 21 +++++++++++++++++++++
> 2 files changed, 24 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index dc37ae787e33..4a0ac0276384 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -561,6 +561,9 @@ struct sched_entity {
> u64 vruntime;
> s64 vlag;
> u64 slice;
> + u64 prev_dequeue_time;
> + /* the reservation period of this task during wakeup */
> + u64 sis_rsv_avg;

Nit: this info is only relevant for tasks, not groups, so it might be better
to move it to task_struct to save a little memory?

>
> u64 nr_migrations;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d0877878bcdb..297b9470829c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6456,6 +6456,24 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> struct sched_entity *se = &p->se;
> int idle_h_nr_running = task_has_idle_policy(p);
> int task_new = !(flags & ENQUEUE_WAKEUP);
> + u64 last_dequeue = p->se.prev_dequeue_time;
> + u64 now = sched_clock_cpu(task_cpu(p));

I think cpu_of(rq) is more clear than task_cpu(p). Using task_cpu(p)
seems to suggest task_cpu(p) can be different from cpu_of(rq).

> +
> + /*
> + * If the task is a short-sleepting task, there is no need
> + * to migrate it to other CPUs. Estimate the average short sleeping
> + * time of the wakee. This sleep time is used as a hint to reserve
> + * the dequeued task's previous CPU for a short while. During this
> + * reservation period, select_idle_cpu() prevents other wakees from
> + * choosing this CPU. This could bring a better cache locality.
> + */
> + if ((flags & ENQUEUE_WAKEUP) && last_dequeue && cpu_online(task_cpu(p)) &&

Hmm...the cpu_online() check looks weird. If the cpu is offlined, no task
will be enqueued there, right?

Thanks,
Aaron

> + now > last_dequeue) {
> + if (now - last_dequeue < sysctl_sched_migration_cost)
> + update_avg(&p->se.sis_rsv_avg, now - last_dequeue);
> + else
> + p->se.sis_rsv_avg >>= 1;
> + }
>
> /*
> * The code below (indirectly) updates schedutil which looks at
> @@ -6550,6 +6568,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> int task_sleep = flags & DEQUEUE_SLEEP;
> int idle_h_nr_running = task_has_idle_policy(p);
> bool was_sched_idle = sched_idle_rq(rq);
> + u64 now;
>
> util_est_dequeue(&rq->cfs, p);
>
> @@ -6611,6 +6630,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> dequeue_throttle:
> util_est_update(&rq->cfs, p, task_sleep);
> hrtick_update(rq);
> + now = sched_clock_cpu(cpu_of(rq));
> + p->se.prev_dequeue_time = task_sleep ? now : 0;
> }
>
> #ifdef CONFIG_SMP
> --
> 2.25.1
>

2023-09-27 09:25:47

by Ingo Molnar

Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup


* Chen Yu <[email protected]> wrote:

> When task p is woken up, the scheduler leverages select_idle_sibling()
> to find an idle CPU for it. p's previous CPU is usually a preference
> because it can improve cache locality. However in many cases, the
> previous CPU has already been taken by other wakees, thus p has to
> find another idle CPU.
>
> Inhibit the task migration while keeping the work conservation of
> scheduler could benefit many workloads. Inspired by Mathieu's
> proposal to limit the task migration ratio[1], this patch considers
> the task average sleep duration. If the task is a short sleeping one,
> then tag its previous CPU as cache hot for a short while. During this
> reservation period, other wakees are not allowed to pick this idle CPU
> until a timeout. Later if the task is woken up again, it can find its
> previous CPU still idle, and choose it in select_idle_sibling().

Yeah, so I'm not convinced about this at this stage.

By allowing a task to basically hog a CPU after it has gone idle already,
however briefly, we reduce resource utilization efficiency for the sake
of singular benchmark workloads.

In a mixed environment the cost of leaving CPUs idle longer than necessary
will show up - and none of these benchmarks show that kind of side effect
and indirect overhead.

This feature would be a lot more convincing if it tried to measure overhead
in the pathological case, not the case it's been written for.

Thanks,

Ingo

2023-09-27 21:42:27

by Tim Chen

Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

On Wed, 2023-09-27 at 10:00 +0200, Ingo Molnar wrote:
> * Chen Yu <[email protected]> wrote:
>
> > When task p is woken up, the scheduler leverages select_idle_sibling()
> > to find an idle CPU for it. p's previous CPU is usually a preference
> > because it can improve cache locality. However in many cases, the
> > previous CPU has already been taken by other wakees, thus p has to
> > find another idle CPU.
> >
> > Inhibit the task migration while keeping the work conservation of
> > scheduler could benefit many workloads. Inspired by Mathieu's
> > proposal to limit the task migration ratio[1], this patch considers
> > the task average sleep duration. If the task is a short sleeping one,
> > then tag its previous CPU as cache hot for a short while. During this
> > reservation period, other wakees are not allowed to pick this idle CPU
> > until a timeout. Later if the task is woken up again, it can find its
> > previous CPU still idle, and choose it in select_idle_sibling().
>
> Yeah, so I'm not convinced about this at this stage.
>
> By allowing a task to basically hog a CPU after it has gone idle already,
> however briefly, we reduce resource utilization efficiency for the sake
> of singular benchmark workloads.
>
> In a mixed environment the cost of leaving CPUs idle longer than necessary
> will show up - and none of these benchmarks show that kind of side effect
> and indirect overhead.
>
> This feature would be a lot more convincing if it tried to measure overhead
> in the pathological case, not the case it's been written for.
>

Ingo,

Mathieu's patches, which detect overly high task migration rates and then
rate-limit migration, are a way to detect that tasks are playing
CPU musical chairs and are in a pathological state.

Would the migration rate be a reasonable indicator that we need to
do something like the SIS_CACHE proposal to reduce pathological migrations,
so the tasks don't get jerked all over?
Or do you have some other, better indicators in mind?

We did some experiments with the OLTP workload on a 112 core, 2 socket
SPR machine. The OLTP workload has a mixture of threads
handling database updates on disks and handling transaction
queries over the network.

For Mathieu's original task migration rate limit patches
we saw a 1.2% improvement, and for Chen Yu's SIS_CACHE proposal we
saw a 0.7% improvement. The system is running at
~94% busy, so it is under high utilization. The variation of this workload
is less than 0.2%. There are improvements for such a mixed workload,
though not as much as for the microbenchmarks. These
data are preliminary and we are still doing more experiments.

For the OLTP experiments, each socket's 64 cores are divided
into sub-NUMA clusters of 4 nodes with 16 cores each, so the scheduling
overhead in the idle CPU search is much less than if SNC were off.

Thanks.

Tim

2023-09-27 22:49:30

by Mathieu Desnoyers

Subject: Re: [PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

On 9/26/23 06:11, Chen Yu wrote:
> Problem statement:
> When task p is woken up, the scheduler leverages select_idle_sibling()
> to find an idle CPU for it. p's previous CPU is usually a preference
> because it can improve cache locality. However in many cases, the
> previous CPU has already been taken by other wakees, thus p has to
> find another idle CPU.
>
> Proposal:
> Inspired by Mathieu's idea[1], introduce the SIS_CACHE. It considers
> the sleep time of the task for better task placement. Based on the
> task's short sleeping history, keep p's previous CPU idle for a short
> while. Later when p is woken up, it can choose its previous CPU in
> select_idle_sibling(). When p's previous CPU is reserved, another wakee
> is not allowed to choose this CPU in select_idle_idle(). The reservation
> period is set to the task's average short sleep time, AKA, se->sis_rsv_avg.
>
> This does not break the work conservation of the scheduler, because
> wakee will still try its best to find an idle CPU. The difference is that
> different idle CPUs might have different priorities.
>
> Prateek pointed out that, with SIS_CACHE enabled, if all idle CPUs are
> cache-hot, select_idle_cpu() might have to choose a non-idle target CPU,
> which brings task stacking. Mitigate this by returning the first cache-hot
> idle CPU if no cache-cold CPU is found.

I've tried your patches on my reference hackbench workload:

./hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

Unfortunately they don't appear to help for that specific load.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-09-28 10:02:49

by Chen Yu

Subject: Re: [PATCH 1/2] sched/fair: Record the short sleeping time of a task

Hi Aaron,

On 2023-09-27 at 15:53:33 +0800, Aaron Lu wrote:
> On Tue, Sep 26, 2023 at 01:11:02PM +0800, Chen Yu wrote:
> > During task wakeup, the wakee firstly checks if its previous
> > running CPU is idle. If yes, choose that CPU as its first
> > choice. However, in most cases, the wakee's previous CPU
> > could be chosen by someone else, which breaks the cache
> > locality.
> >
> > Proposes a mechanism to reserve the task's previous
> > CPU for a short while. In this reservation period, other
> > tasks are not allowed to pick that CPU until a timeout.
> > The reservation period is defined as the average short
> > sleep time of the task. To be more specific, it is the
> > time delta between this task being dequeued and enqueued.
> > Only the sleep time shorter than sysctl_sched_migration_cost
> > will be recorded. If the sleep time is longer than
> > sysctl_sched_migration_cost, give the reservation period
> > a penalty by shrinking it to half. In this way, the 'burst'
> > sleeping time of the task is honored, meanwhile, if that
> > task becomes a long-sleeper, the reservation time of that
> > task is shrunk to reduce the impact on task wakeup.
> >
> > Suggested-by: Mathieu Desnoyers <[email protected]>
> > Signed-off-by: Chen Yu <[email protected]>
> > ---
> > include/linux/sched.h | 3 +++
> > kernel/sched/fair.c | 21 +++++++++++++++++++++
> > 2 files changed, 24 insertions(+)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index dc37ae787e33..4a0ac0276384 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -561,6 +561,9 @@ struct sched_entity {
> > u64 vruntime;
> > s64 vlag;
> > u64 slice;
> > + u64 prev_dequeue_time;
> > + /* the reservation period of this task during wakeup */
> > + u64 sis_rsv_avg;
>
> Nit: these info are only relavant for task, not group so might be better
> to move them to task_struct to save a little memory?
>

Yes, I'll try to do this.

> >
> > u64 nr_migrations;
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d0877878bcdb..297b9470829c 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6456,6 +6456,24 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > struct sched_entity *se = &p->se;
> > int idle_h_nr_running = task_has_idle_policy(p);
> > int task_new = !(flags & ENQUEUE_WAKEUP);
> > + u64 last_dequeue = p->se.prev_dequeue_time;
> > + u64 now = sched_clock_cpu(task_cpu(p));
>
> I think cpu_of(rq) is more clear than task_cpu(p). Using task_cpu(p)
> seems to suggest task_cpu(p) can be different from cpu_of(rq).
>

You are right. My original thought was to use the task's previous CPU rather than
the current rq, but at this stage the task's CPU has already been updated
to be the same as the rq's. I'll think more about how to deal with this properly.

> > +
> > + /*
> > + * If the task is a short-sleepting task, there is no need
> > + * to migrate it to other CPUs. Estimate the average short sleeping
> > + * time of the wakee. This sleep time is used as a hint to reserve
> > + * the dequeued task's previous CPU for a short while. During this
> > + * reservation period, select_idle_cpu() prevents other wakees from
> > + * choosing this CPU. This could bring a better cache locality.
> > + */
> > + if ((flags & ENQUEUE_WAKEUP) && last_dequeue && cpu_online(task_cpu(p)) &&
>
> Hmm...the cpu_online() check looks weird. If the cpu is offlined, no task
> will be enqueued there, right?
>

Right. If rq and the task's CPU are the same, there is no need to check cpu_online.

thanks,
Chenyu

2023-09-28 15:09:48

by Chen Yu

Subject: Re: [PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

Hi Mathieu,

On 2023-09-27 at 12:11:33 -0400, Mathieu Desnoyers wrote:
> On 9/26/23 06:11, Chen Yu wrote:
> > Problem statement:
> > When task p is woken up, the scheduler leverages select_idle_sibling()
> > to find an idle CPU for it. p's previous CPU is usually a preference
> > because it can improve cache locality. However in many cases, the
> > previous CPU has already been taken by other wakees, thus p has to
> > find another idle CPU.
> >
> > Proposal:
> > Inspired by Mathieu's idea[1], introduce the SIS_CACHE. It considers
> > the sleep time of the task for better task placement. Based on the
> > task's short sleeping history, keep p's previous CPU idle for a short
> > while. Later when p is woken up, it can choose its previous CPU in
> > select_idle_sibling(). When p's previous CPU is reserved, another wakee
> > is not allowed to choose this CPU in select_idle_idle(). The reservation
> > period is set to the task's average short sleep time, AKA, se->sis_rsv_avg.
> >
> > This does not break the work conservation of the scheduler, because
> > wakee will still try its best to find an idle CPU. The difference is that
> > different idle CPUs might have different priorities.
> >
> > Prateek pointed out that, with SIS_CACHE enabled, if all idle CPUs are
> > cache-hot, select_idle_cpu() might have to choose a non-idle target CPU,
> > which brings task stacking. Mitigate this by returning the first cache-hot
> > idle CPU if no cache-cold CPU is found.
>
> I've tried your patches on my reference hackbench workload:
>
> ./hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
>
> Unfortunately they don't appear to help for that specific load.
>

I just ran the same test on a 224 CPU system and it seems that
there is not much difference with/without SIS_CACHE. To figure out
the reason, I used bpftrace to track how often hackbench is woken
up on its previous CPU:

kretfunc:select_task_rq_fair
{
    $p = (struct task_struct *)args->p;
    if ($p->comm == "sender") {
        if ($p->thread_info.cpu != retval) {
            @wakeup_migrate_sender = count();
        } else {
            @wakeup_prev_sender = count();
        }
    }

    if ($p->comm == "receiver") {
        if ($p->thread_info.cpu != retval) {
            @wakeup_migrate_receiver = count();
        } else {
            @wakeup_prev_receiver = count();
        }
    }
}

and print the data every 10 seconds:

NO_SIS_CACHE:
23:50:24 Wakeup statistics:
@wakeup_migrate_sender: 9043961
@wakeup_prev_sender: 20073128
@wakeup_migrate_receiver: 12071462
@wakeup_prev_receiver: 19587895

sender: migration/previous = 45.06%
receiver: migration/previous = 61.612%


SIS_CACHE:
23:49:21 Wakeup statistics:
@wakeup_migrate_sender: 6716902
@wakeup_prev_sender: 22727513
@wakeup_migrate_receiver: 11547623
@wakeup_prev_receiver: 24615810

sender: migration/previous = 29.55%
receiver: migration/previous = 46.91%

Both the sender and receiver in hackbench have a higher chance
of being woken up on their previous CPU, but not as much as netperf.
Why is there not much score difference? I checked the bottleneck
via perf topdown.

perf stat -M TopdownL1 -- sleep 10
perf stat -M tma_frontend_bound_group -- sleep 10
perf stat -M tma_fetch_latency_group -- sleep 10

NO_SIS_CACHE:
15.2 % tma_backend_bound
14.9 % tma_bad_speculation
43.9 % tma_frontend_bound
30.3 % tma_fetch_latency
9.7 % tma_ms_switches
14.0 % tma_fetch_bandwidth
26.1 % tma_retiring


SIS_CACHE:
14.5 % tma_backend_bound
15.3 % tma_bad_speculation
44.5 % tma_frontend_bound
31.5 % tma_fetch_latency
10.6 % tma_ms_switches
13.0 % tma_fetch_bandwidth
25.8 % tma_retiring

The ratios do not change much with/without SIS_CACHE enabled.
This is because SIS_CACHE might bring benefit if tasks have a large
cache footprint (backend bound, like netperf), but it seems that hackbench
pipe mode is frontend bound and its bottleneck is the complexity of the
instructions being executed (tma_ms_switches: the MS decodes complex
instructions, and an increase in the MS switch counter usually means
the workload is running some complex instructions). That is to say, the
pipe_read/write code path could be the bottleneck. Your original
rate limit on task migration might be more aggressive in reducing
the tma_backend_bound ratio, and that might bring the score benefit,
as in the netperf case. Let me apply your original patch to confirm
whether this is the case.

thanks,
Chenyu

2023-09-29 00:26:43

by Chen Yu

Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

Hi Ingo,

On 2023-09-27 at 10:00:11 +0200, Ingo Molnar wrote:
>
> * Chen Yu <[email protected]> wrote:
>
> > When task p is woken up, the scheduler leverages select_idle_sibling()
> > to find an idle CPU for it. p's previous CPU is usually a preference
> > because it can improve cache locality. However in many cases, the
> > previous CPU has already been taken by other wakees, thus p has to
> > find another idle CPU.
> >
> > Inhibit the task migration while keeping the work conservation of
> > scheduler could benefit many workloads. Inspired by Mathieu's
> > proposal to limit the task migration ratio[1], this patch considers
> > the task average sleep duration. If the task is a short sleeping one,
> > then tag its previous CPU as cache hot for a short while. During this
> > reservation period, other wakees are not allowed to pick this idle CPU
> > until a timeout. Later if the task is woken up again, it can find its
> > previous CPU still idle, and choose it in select_idle_sibling().
>
> Yeah, so I'm not convinced about this at this stage.
>
> By allowing a task to basically hog a CPU after it has gone idle already,
> however briefly, we reduce resource utilization efficiency for the sake
> of singular benchmark workloads.
>

Currently the code does not really reserve the idle CPU or force it
to stay idle. We just give other wakees a suggested search order to find
an idle CPU. If all idle CPUs are in the reserved state, the first reserved
idle CPU will be picked rather than left idle, so the idle CPU resource is
still fully utilized. The main impact is on wakeup latency, if I understand
correctly. Let me run the latest schbench and monitor these latency statistics
in detail.

> In a mixed environment the cost of leaving CPUs idle longer than necessary
> will show up - and none of these benchmarks show that kind of side effect
> and indirect overhead.
>
> This feature would be a lot more convincing if it tried to measure overhead
> in the pathological case, not the case it's been written for.
>

Thanks for the suggestion, Ingo. Yes, we should launch more tests to evaluate this
proposal. As Tim mentioned, we have previously tested it using an OLTP benchmark
as described in PATCH [2/2]. I'm thinking of running more benchmarks to get
a wider understanding of how this change would impact them, both the positive and
the negative parts.

thanks,
Chenyu

2023-10-05 16:42:26

by K Prateek Nayak

Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

Hello Chenyu,

On 9/26/2023 10:40 AM, Chen Yu wrote:
> RFC -> v1:
> - drop RFC
> - Only record the short sleeping time for each task, to better honor the
> burst sleeping tasks. (Mathieu Desnoyers)
> - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
> (Mathieu Desnoyers, Aaron Lu)
> - Introduce a new helper function cache_hot_cpu() that considers
> rq->cache_hot_timeout. (Aaron Lu)
> - Add analysis of why inhibiting task migration could bring better throughput
> for some benchmarks. (Gautham R. Shenoy)
> - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
> select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
> (K Prateek Nayak)
>
> Thanks for your comments and review!

Sorry for the delay! I'll leave the test results from a 3rd Generation
EPYC system below.

tl;dr

- Small regression in tbench and netperf, possibly due to more searching
for an idle CPU.

- Small regression in schbench (old) at 256 workers, albeit with large
run-to-run variance.

- Other benchmarks are more or less the same.

I'll leave the full results below

o System details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- Boost enabled, C2 Disabled (POLL and MWAIT based C1 remained enabled)


o Kernel Details

- tip: tip:sched/core at commit 5fe7765997b1 (sched/deadline: Make
dl_rq->pushable_dl_tasks update drive dl_rq->overloaded)

- SIS_CACHE: tip + this series


o Benchmark results

==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
1-groups 1.00 [ -0.00]( 2.36) 1.01 [ -1.47]( 3.02)
2-groups 1.00 [ -0.00]( 2.35) 0.99 [ 0.92]( 1.01)
4-groups 1.00 [ -0.00]( 1.79) 0.98 [ 2.34]( 0.63)
8-groups 1.00 [ -0.00]( 0.84) 0.98 [ 1.73]( 1.02)
16-groups 1.00 [ -0.00]( 2.39) 0.97 [ 2.76]( 2.33)


==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
1 1.00 [ 0.00]( 0.86) 0.97 [ -2.68]( 0.74)
2 1.00 [ 0.00]( 0.99) 0.98 [ -2.18]( 0.17)
4 1.00 [ 0.00]( 0.49) 0.98 [ -2.47]( 1.15)
8 1.00 [ 0.00]( 0.96) 0.96 [ -3.81]( 0.24)
16 1.00 [ 0.00]( 1.38) 0.96 [ -4.33]( 1.31)
32 1.00 [ 0.00]( 1.64) 0.95 [ -4.70]( 1.59)
64 1.00 [ 0.00]( 0.92) 0.97 [ -2.97]( 0.49)
128 1.00 [ 0.00]( 0.57) 0.99 [ -1.15]( 0.57)
256 1.00 [ 0.00]( 0.38) 1.00 [ 0.03]( 0.79)
512 1.00 [ 0.00]( 0.04) 1.00 [ 0.43]( 0.34)
1024 1.00 [ 0.00]( 0.20) 1.00 [ 0.41]( 0.13)


==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
Copy 1.00 [ 0.00]( 2.52) 0.93 [ -6.90]( 6.75)
Scale 1.00 [ 0.00]( 6.38) 0.99 [ -1.18]( 7.45)
Add 1.00 [ 0.00]( 6.54) 0.97 [ -2.55]( 7.34)
Triad 1.00 [ 0.00]( 5.18) 0.95 [ -4.64]( 6.81)


==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
Copy 1.00 [ 0.00]( 0.74) 1.00 [ -0.20]( 1.69)
Scale 1.00 [ 0.00]( 6.25) 1.03 [ 3.46]( 0.55)
Add 1.00 [ 0.00]( 6.53) 1.05 [ 4.58]( 0.43)
Triad 1.00 [ 0.00]( 5.14) 0.98 [ -1.78]( 6.24)


==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.27) 0.98 [ -1.50]( 0.14)
2-clients 1.00 [ 0.00]( 1.32) 0.98 [ -2.35]( 0.54)
4-clients 1.00 [ 0.00]( 0.40) 0.98 [ -2.35]( 0.56)
8-clients 1.00 [ 0.00]( 0.97) 0.97 [ -2.72]( 0.50)
16-clients 1.00 [ 0.00]( 0.54) 0.96 [ -3.92]( 0.86)
32-clients 1.00 [ 0.00]( 1.38) 0.97 [ -3.10]( 0.44)
64-clients 1.00 [ 0.00]( 1.78) 0.97 [ -3.44]( 1.70)
128-clients 1.00 [ 0.00]( 1.09) 0.94 [ -5.75]( 2.67)
256-clients 1.00 [ 0.00]( 4.45) 0.97 [ -2.61]( 4.93)
512-clients 1.00 [ 0.00](54.70) 0.98 [ -1.64](55.09)


==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
1 1.00 [ -0.00]( 3.95) 0.97 [ 2.56](10.42)
2 1.00 [ -0.00]( 5.89) 0.83 [ 16.67](22.56)
4 1.00 [ -0.00](14.28) 1.00 [ -0.00](14.75)
8 1.00 [ -0.00]( 4.90) 0.84 [ 15.69]( 6.01)
16 1.00 [ -0.00]( 4.15) 1.00 [ -0.00]( 4.41)
32 1.00 [ -0.00]( 5.10) 1.01 [ -1.10]( 3.44)
64 1.00 [ -0.00]( 2.69) 1.04 [ -3.72]( 2.57)
128 1.00 [ -0.00]( 2.63) 0.94 [ 6.29]( 2.55)
256 1.00 [ -0.00](26.75) 1.51 [-50.57](11.40)
512 1.00 [ -0.00]( 2.93) 0.96 [ 3.52]( 3.56)

==================================================================
Test : ycsb-cassandra
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Metric tip SIS_CACHE(pct imp)
Throughput 1.00 1.00 (%diff: 0.27%)


==================================================================
Test : ycsb-mongodb
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Metric tip SIS_CACHE(pct imp)
Throughput 1.00 1.00 (%diff: -0.45%)


==================================================================
Test : DeathStarBench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Pinning scaling tip SIS_CACHE(pct imp)
1CCD 1 1.00 1.00 (%diff: -0.47%)
2CCD 2 1.00 0.98 (%diff: -2.34%)
4CCD 4 1.00 1.00 (%diff: -0.29%)
8CCD 8 1.00 1.01 (%diff: 0.54%)

>
> ----------------------------------------------------------------------
>
> [..snip..]
>

--
Thanks and Regards,
Prateek

2023-10-07 03:23:49

by Chen Yu

Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

Hi Prateek,

On 2023-10-05 at 11:52:13 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 9/26/2023 10:40 AM, Chen Yu wrote:
> > RFC -> v1:
> > - drop RFC
> > - Only record the short sleeping time for each task, to better honor the
> > burst sleeping tasks. (Mathieu Desnoyers)
> > - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
> > (Mathieu Desnoyers, Aaron Lu)
> > - Introduce a new helper function cache_hot_cpu() that considers
> > rq->cache_hot_timeout. (Aaron Lu)
> > - Add analysis of why inhibiting task migration could bring better throughput
> > for some benchmarks. (Gautham R. Shenoy)
> > - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
> > select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
> > (K Prateek Nayak)
> >
> > Thanks for your comments and review!
>
> Sorry for the delay! I'll leave the test results from a 3rd Generation
> EPYC system below.
>
> tl;dr
>
> - Small regression in tbench and netperf possible due to more searching
> for an idle CPU.
>
> - Small regression in schbench (old) at 256 workers albeit with large
> run to run variance.
>
> - Other benchmarks are more or less same.
>
> Test : schbench
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) SIS_CACHE[pct imp](CV)
> 1 1.00 [ -0.00]( 3.95) 0.97 [ 2.56](10.42)
> 2 1.00 [ -0.00]( 5.89) 0.83 [ 16.67](22.56)
> 4 1.00 [ -0.00](14.28) 1.00 [ -0.00](14.75)
> 8 1.00 [ -0.00]( 4.90) 0.84 [ 15.69]( 6.01)
> 16 1.00 [ -0.00]( 4.15) 1.00 [ -0.00]( 4.41)
> 32 1.00 [ -0.00]( 5.10) 1.01 [ -1.10]( 3.44)
> 64 1.00 [ -0.00]( 2.69) 1.04 [ -3.72]( 2.57)
> 128 1.00 [ -0.00]( 2.63) 0.94 [ 6.29]( 2.55)
> 256 1.00 [ -0.00](26.75) 1.51 [-50.57](11.40)

Thanks for the testing. The latency regression from schbench is
quite obvious and, as you mentioned, it is possibly due to the longer
scan time in select_idle_cpu(). I'll run the same test with a split
LLC to see whether I can reproduce the issue.
I'm also working with Mathieu on another direction, choosing the previous CPU
over the current CPU when the system is overloaded; that should be
more moderate, and I'll post the test results later.

thanks,
Chenyu

2023-10-17 09:53:10

by Madadi Vineeth Reddy

Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

Hi Chen Yu,

On 26/09/23 10:40, Chen Yu wrote:
> RFC -> v1:
> - drop RFC
> - Only record the short sleeping time for each task, to better honor the
> burst sleeping tasks. (Mathieu Desnoyers)
> - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
> (Mathieu Desnoyers, Aaron Lu)
> - Introduce a new helper function cache_hot_cpu() that considers
> rq->cache_hot_timeout. (Aaron Lu)
> - Add analysis of why inhibiting task migration could bring better throughput
> for some benchmarks. (Gautham R. Shenoy)
> - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
> select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
> (K Prateek Nayak)
>
> Thanks for your comments and review!
>
> ----------------------------------------------------------------------

Regarding the trade-off between a longer scan for an idle CPU and the cache
benefits, I ran some benchmarks.

Tested the patch on a Power system with 12 cores, 96 CPUs in total.
The system has two NUMA nodes.

Below are some of the benchmark results

schbench 99.0th latency (lower is better)
========
case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
normal 1-mthreads 1.00 [ 0.00]( 3.66) 1.00 [ 0.00]( 1.71)
normal 2-mthreads 1.00 [ 0.00]( 4.55) 1.02 [ -2.00]( 3.00)
normal 4-mthreads 1.00 [ 0.00]( 4.77) 0.96 [ +4.00]( 4.27)
normal 6-mthreads 1.00 [ 0.00]( 60.37) 2.66 [ -166.00]( 23.67)


The schbench results show that there is not much impact on wakeup latencies due to more iterations
in the search for an idle CPU in the select_idle_cpu() code path, and interestingly the numbers are slightly better
for SIS_CACHE in the 4-mthreads case. I think we can ignore the last case due to huge run-to-run variations.

producer_consumer avg time/access (lower is better)
========
loads per consumer iteration baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
5 1.00 [ 0.00]( 0.00) 0.87 [ +13.0]( 1.92)
20 1.00 [ 0.00]( 0.00) 0.92 [ +8.00]( 0.00)
50 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
100 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)

The main goal of the patch, improving cache locality, is reflected here, as SIS_CACHE improves this workload
mainly when the loads per consumer iteration are lower.

hackbench normalized time in seconds (lower is better)
========
case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
process-pipe 1-groups 1.00 [ 0.00]( 1.50) 1.02 [ -2.00]( 3.36)
process-pipe 2-groups 1.00 [ 0.00]( 4.76) 0.99 [ +1.00]( 5.68)
process-sockets 1-groups 1.00 [ 0.00]( 2.56) 1.00 [ 0.00]( 0.86)
process-sockets 2-groups 1.00 [ 0.00]( 0.50) 0.99 [ +1.00]( 0.96)
threads-pipe 1-groups 1.00 [ 0.00]( 3.87) 0.71 [ +29.0]( 3.56)
threads-pipe 2-groups 1.00 [ 0.00]( 1.60) 0.97 [ +3.00]( 3.44)
threads-sockets 1-groups 1.00 [ 0.00]( 7.65) 0.99 [ +1.00]( 1.05)
threads-sockets 2-groups 1.00 [ 0.00]( 3.12) 1.03 [ -3.00]( 1.70)

The hackbench results are similar on both kernels, except for the 29% improvement
in the threads-pipe case with 1 group.

Daytrader throughput (higher is better)
========

As per Ingo's suggestion, I ran a real-life workload, daytrader.

baseline:
===================================================================================
Instance 1
Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
================ =============== =============== ===============
10124.5 2 0 3970

SIS_CACHE:
===================================================================================
Instance 1
Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
================ =============== =============== ===============
10319.5 2 0 5771

In the above run, daytrader performance was 2% better with SIS_CACHE.

Thanks and Regards
Madadi Vineeth Reddy

2023-10-17 11:10:56

by Chen Yu

Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

Hi Madadi,

On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
>
> On 26/09/23 10:40, Chen Yu wrote:
> > RFC -> v1:
> > - drop RFC
> > - Only record the short sleeping time for each task, to better honor the
> > burst sleeping tasks. (Mathieu Desnoyers)
> > - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
> > (Mathieu Desnoyers, Aaron Lu)
> > - Introduce a new helper function cache_hot_cpu() that considers
> > rq->cache_hot_timeout. (Aaron Lu)
> > - Add analysis of why inhibiting task migration could bring better throughput
> > for some benchmarks. (Gautham R. Shenoy)
> > - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
> > select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
> > (K Prateek Nayak)
> >
> > Thanks for your comments and review!
> >
> > ----------------------------------------------------------------------
>
> Regarding making the scan for finding an idle cpu longer vs cache benefits,
> I ran some benchmarks.
>

Thanks very much for your interest and your time on the patch.

> Tested the patch on power system with 12 cores. Total of 96 CPU's.
> System has two NUMA nodes.
>
> Below are some of the benchmark results
>
> schbench 99.0th latency (lower is better)
> ========
> case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
> normal 1-mthreads 1.00 [ 0.00]( 3.66) 1.00 [ 0.00]( 1.71)
> normal 2-mthreads 1.00 [ 0.00]( 4.55) 1.02 [ -2.00]( 3.00)
> normal 4-mthreads 1.00 [ 0.00]( 4.77) 0.96 [ +4.00]( 4.27)
> normal 6-mthreads 1.00 [ 0.00]( 60.37) 2.66 [ -166.00]( 23.67)
>
>
> schbench results are showing that there is not much impact in wakeup latencies due to more iterations
> in search for an idle cpu in the select_idle_cpu code path and interestingly numbers are slightly better
> for SIS_CACHE in case of 4-mthreads.

The 4% improvement is within std%, so I suppose we did not see much difference in the 4-mthreads case.

> I think we can ignore the last case due to huge run to run variations.

Although the run-to-run variation is large, it seems that the decrease is within that range.
Prateek has also reported that there could be some regression in schbench
when the system is overloaded:
https://lore.kernel.org/lkml/[email protected]/
Could you also post the raw data printed by schbench? Using the latest schbench might also give the
latency in more detail.

> producer_consumer avg time/access (lower is better)
> ========
> loads per consumer iteration baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
> 5 1.00 [ 0.00]( 0.00) 0.87 [ +13.0]( 1.92)
> 20 1.00 [ 0.00]( 0.00) 0.92 [ +8.00]( 0.00)
> 50 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
> 100 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
>
> The main goal of the patch of improving cache locality is reflected as SIS_CACHE only improves in this workload,
> mainly when loads per consumer iteration is lower.
>
> hackbench normalized time in seconds (lower is better)
> ========
> case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
> process-pipe 1-groups 1.00 [ 0.00]( 1.50) 1.02 [ -2.00]( 3.36)
> process-pipe 2-groups 1.00 [ 0.00]( 4.76) 0.99 [ +1.00]( 5.68)
> process-sockets 1-groups 1.00 [ 0.00]( 2.56) 1.00 [ 0.00]( 0.86)
> process-sockets 2-groups 1.00 [ 0.00]( 0.50) 0.99 [ +1.00]( 0.96)
> threads-pipe 1-groups 1.00 [ 0.00]( 3.87) 0.71 [ +29.0]( 3.56)
> threads-pipe 2-groups 1.00 [ 0.00]( 1.60) 0.97 [ +3.00]( 3.44)
> threads-sockets 1-groups 1.00 [ 0.00]( 7.65) 0.99 [ +1.00]( 1.05)
> threads-sockets 2-groups 1.00 [ 0.00]( 3.12) 1.03 [ -3.00]( 1.70)
>
> hackbench results are similar in both kernels except the case where there is an improvement of
> 29% in case of threads-pipe case with 1 groups.
>
> Daytrader throughput (higher is better)
> ========
>
> As per Ingo suggestion, ran a real life workload daytrader
>
> baseline:
> ===================================================================================
> Instance 1
> Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
> ================ =============== =============== ===============
> 10124.5 2 0 3970
>
> SIS_CACHE:
> ===================================================================================
> Instance 1
> Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
> ================ =============== =============== ===============
> 10319.5 2 0 5771
>
> In the above run, daytrader perfomance was 2% better in case of SIS_CACHE.
>

Thanks for bringing the good news that a real-life workload benefits from this change.
I'll tune this patch a little to address the regression from schbench. Also,
I'm working with Mathieu on his proposal to make it easier for the wakee to choose its previous
CPU (similar to SIS_CACHE, but a little simpler), and we'll check how to make more
platforms benefit from this change.
https://lore.kernel.org/lkml/[email protected]/

thanks,
Chenyu

2023-10-18 19:39:52

by Madadi Vineeth Reddy

Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

Hi Chen Yu,
On 17/10/23 16:39, Chen Yu wrote:
> Hi Madadi,
>
> On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
>> Hi Chen Yu,
>>
>> On 26/09/23 10:40, Chen Yu wrote:
>>> RFC -> v1:
>>> - drop RFC
>>> - Only record the short sleeping time for each task, to better honor the
>>> burst sleeping tasks. (Mathieu Desnoyers)
>>> - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
>>> (Mathieu Desnoyers, Aaron Lu)
>>> - Introduce a new helper function cache_hot_cpu() that considers
>>> rq->cache_hot_timeout. (Aaron Lu)
>>> - Add analysis of why inhibiting task migration could bring better throughput
>>> for some benchmarks. (Gautham R. Shenoy)
>>> - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
>>> select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
>>> (K Prateek Nayak)
>>>
>>> Thanks for your comments and review!
>>>
>>> ----------------------------------------------------------------------
>>
>> Regarding the trade-off between a longer scan for an idle CPU and the cache benefits,
>> I ran some benchmarks.
>>
>
> Thanks very much for your interest and your time on the patch.
>
>> Tested the patch on a Power system with 12 cores, 96 CPUs in total.
>> The system has two NUMA nodes.
>>
>> Below are some of the benchmark results
>>
>> schbench 99.0th latency (lower is better)
>> ========
>> case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
>> normal 1-mthreads 1.00 [ 0.00]( 3.66) 1.00 [ 0.00]( 1.71)
>> normal 2-mthreads 1.00 [ 0.00]( 4.55) 1.02 [ -2.00]( 3.00)
>> normal 4-mthreads 1.00 [ 0.00]( 4.77) 0.96 [ +4.00]( 4.27)
>> normal 6-mthreads 1.00 [ 0.00]( 60.37) 2.66 [ -166.00]( 23.67)
>>
>>
>> schbench results show that there is not much impact on wakeup latencies from the extra iterations
>> spent searching for an idle CPU in the select_idle_cpu() code path, and interestingly the numbers
>> are slightly better for SIS_CACHE in the 4-mthreads case.
>
> The 4% improvement is within the std%, so I suppose we did not see much difference in the 4-mthreads case.
>
>> I think we can ignore the last case due to huge run to run variations.
>
> Although the run-to-run variation is large, the decrease still seems to fall within that range.
> Prateek has also reported that there could be some regression from schbench when the system
> is overloaded:
> https://lore.kernel.org/lkml/[email protected]/
> Could you also post the raw data printed by schbench? Using the latest schbench might also
> give a more detailed breakdown of the latency.
>

raw data by schbench(old) with 6-mthreads
======================

Baseline (5 runs)
========
Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 981
99.5000th: 4424
99.9000th: 9200
min=0, max=29497

Latency percentiles (usec)
50.0000th: 23
75.0000th: 29
90.0000th: 35
95.0000th: 38
*99.0000th: 495
99.5000th: 3924
99.9000th: 9872
min=0, max=29997

Latency percentiles (usec)
50.0000th: 23
75.0000th: 30
90.0000th: 36
95.0000th: 39
*99.0000th: 1326
99.5000th: 4744
99.9000th: 10000
min=0, max=23394

Latency percentiles (usec)
50.0000th: 23
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 55
99.5000th: 3292
99.9000th: 9104
min=0, max=25196

Latency percentiles (usec)
50.0000th: 23
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 711
99.5000th: 4600
99.9000th: 9424
min=0, max=19997

SIS_CACHE (5 runs)
=========
Latency percentiles (usec)
50.0000th: 23
75.0000th: 30
90.0000th: 35
95.0000th: 38
*99.0000th: 1894
99.5000th: 5464
99.9000th: 10000
min=0, max=19157

Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 2396
99.5000th: 6664
99.9000th: 10000
min=0, max=24029

Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 2132
99.5000th: 6296
99.9000th: 10000
min=0, max=25313

Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 37
*99.0000th: 1090
99.5000th: 6232
99.9000th: 9744
min=0, max=27264

Latency percentiles (usec)
50.0000th: 22
75.0000th: 29
90.0000th: 34
95.0000th: 38
*99.0000th: 1786
99.5000th: 5240
99.9000th: 9968
min=0, max=24754

As indicated, the above data has large run-to-run variation, and in general the 99th percentile
latency is higher with SIS_CACHE.
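
For reference, the small sketch below (plugging in the *99.0000th values from the five runs
above) makes the spread and the shift in the mean easier to see:

import statistics

baseline_p99 = [981, 495, 1326, 55, 711]        # usec, old schbench, baseline runs
sis_cache_p99 = [1894, 2396, 2132, 1090, 1786]  # usec, old schbench, SIS_CACHE runs

for name, p99 in (("baseline", baseline_p99), ("SIS_CACHE", sis_cache_p99)):
    mean = statistics.mean(p99)
    std = statistics.stdev(p99)
    print(f"{name:10s} mean={mean:7.1f}us std={std:6.1f}us "
          f"({100.0 * std / mean:.0f}% of mean) min={min(p99)} max={max(p99)}")

Both sets swing by several hundred microseconds between runs, but the SIS_CACHE runs are
centered noticeably higher.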


schbench(new) with 6-mthreads
=============

Baseline
========
Wakeup Latencies percentiles (usec) runtime 30 (s) (209403 total samples)
50.0th: 8 (43672 samples)
90.0th: 13 (83908 samples)
* 99.0th: 20 (18323 samples)
99.9th: 775 (1785 samples)
min=1, max=8400
Request Latencies percentiles (usec) runtime 30 (s) (209543 total samples)
50.0th: 13648 (59873 samples)
90.0th: 14000 (82767 samples)
* 99.0th: 14320 (16342 samples)
99.9th: 18720 (1670 samples)
min=5130, max=38334
RPS percentiles (requests) runtime 30 (s) (31 total samples)
20.0th: 6968 (8 samples)
* 50.0th: 6984 (23 samples)
90.0th: 6984 (0 samples)
min=6835, max=6991
average rps: 6984.77


SIS_CACHE
=========
Wakeup Latencies percentiles (usec) runtime 30 (s) (209295 total samples)
50.0th: 9 (49267 samples)
90.0th: 14 (86522 samples)
* 99.0th: 21 (14091 samples)
99.9th: 1146 (1722 samples)
min=1, max=10427
Request Latencies percentiles (usec) runtime 30 (s) (209432 total samples)
50.0th: 13616 (62838 samples)
90.0th: 14000 (85301 samples)
* 99.0th: 14352 (16149 samples)
99.9th: 21408 (1660 samples)
min=5070, max=41866
RPS percentiles (requests) runtime 30 (s) (31 total samples)
20.0th: 6968 (7 samples)
* 50.0th: 6984 (21 samples)
90.0th: 6984 (0 samples)
min=6672, max=6996
average rps: 6981.07

With the new schbench, I didn't observe run-to-run variation, and there was also no regression
for SIS_CACHE at the 99th percentile.
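
To put rough numbers on that, here is a small sketch comparing the headline values from the
two new-schbench reports above:

# Baseline vs SIS_CACHE, values taken from the new schbench output above.
metrics = {
    "wakeup p99 (usec)":  (20, 21),
    "request p99 (usec)": (14320, 14352),
    "average rps":        (6984.77, 6981.07),
}

for name, (base, test) in metrics.items():
    delta = test - base
    print(f"{name:20s} {base:>9} -> {test:<9} "
          f"delta={delta:+.2f} ({100.0 * delta / base:+.2f}%)")

The differences are 1 usec at the wakeup 99th percentile, about 0.2% for the request 99th
percentile, and about 0.05% for RPS.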


>> producer_consumer avg time/access (lower is better)
>> ========
>> loads per consumer iteration baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
>> 5 1.00 [ 0.00]( 0.00) 0.87 [ +13.0]( 1.92)
>> 20 1.00 [ 0.00]( 0.00) 0.92 [ +8.00]( 0.00)
>> 50 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
>> 100 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
>>
>> The patch's main goal of improving cache locality is reflected here, as SIS_CACHE shows an improvement
>> in this workload, mainly when the loads per consumer iteration are lower.
>>
>> hackbench normalized time in seconds (lower is better)
>> ========
>> case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
>> process-pipe 1-groups 1.00 [ 0.00]( 1.50) 1.02 [ -2.00]( 3.36)
>> process-pipe 2-groups 1.00 [ 0.00]( 4.76) 0.99 [ +1.00]( 5.68)
>> process-sockets 1-groups 1.00 [ 0.00]( 2.56) 1.00 [ 0.00]( 0.86)
>> process-sockets 2-groups 1.00 [ 0.00]( 0.50) 0.99 [ +1.00]( 0.96)
>> threads-pipe 1-groups 1.00 [ 0.00]( 3.87) 0.71 [ +29.0]( 3.56)
>> threads-pipe 2-groups 1.00 [ 0.00]( 1.60) 0.97 [ +3.00]( 3.44)
>> threads-sockets 1-groups 1.00 [ 0.00]( 7.65) 0.99 [ +1.00]( 1.05)
>> threads-sockets 2-groups 1.00 [ 0.00]( 3.12) 1.03 [ -3.00]( 1.70)
>>
>> hackbench results are similar for both kernels, except for the threads-pipe case with 1 group,
>> which shows a 29% improvement.
>>
>> Daytrader throughput (higher is better)
>> ========
>>
>> As per Ingo suggestion, ran a real life workload daytrader
>>
>> baseline:
>> ===================================================================================
>> Instance 1
>> Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
>> ================ =============== =============== ===============
>> 10124.5 2 0 3970
>>
>> SIS_CACHE:
>> ===================================================================================
>> Instance 1
>> Throughputs Ave. Resp. Time Min. Resp. Time Max. Resp. Time
>> ================ =============== =============== ===============
>> 10319.5 2 0 5771
>>
>> In the above run, daytrader performance was 2% better with SIS_CACHE.
>>
>
> Thanks for bringing this good news: a real-life workload benefits from this change.
> I'll tune this patch a little to address the regression from schbench. I should also mention
> that I'm working with Mathieu on his proposal to make it easier for the wakee to choose its
> previous CPU (similar to SIS_CACHE, but a little simpler), and we'll check how to make more
> platforms benefit from this change.
> https://lore.kernel.org/lkml/[email protected]/

Oh, OK. Thanks for the pointer!

>
> thanks,
> Chenyu
>

Thanks and Regards
Madadi Vineeth Reddy

2023-10-19 10:58:03

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup

On 2023-10-19 at 01:02:16 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
> On 17/10/23 16:39, Chen Yu wrote:
> > Hi Madadi,
> >
> > On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
> >> Hi Chen Yu,
> >>
> >> On 26/09/23 10:40, Chen Yu wrote:
> >>> RFC -> v1:
> >>> - drop RFC
> >>> - Only record the short sleeping time for each task, to better honor the
> >>> burst sleeping tasks. (Mathieu Desnoyers)
> >>> - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
> >>> (Mathieu Desnoyers, Aaron Lu)
> >>> - Introduce a new helper function cache_hot_cpu() that considers
> >>> rq->cache_hot_timeout. (Aaron Lu)
> >>> - Add analysis of why inhibiting task migration could bring better throughput
> >>> for some benchmarks. (Gautham R. Shenoy)
> >>> - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
> >>> select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
> >>> (K Prateek Nayak)
> >>>
> >>> Thanks for your comments and review!
> >>>
> >>> ----------------------------------------------------------------------
> >>
> >> Regarding the trade-off between a longer scan for an idle CPU and the cache benefits,
> >> I ran some benchmarks.
> >>
> >
> > Thanks very much for your interest and your time on the patch.
> >
> >> Tested the patch on a Power system with 12 cores, 96 CPUs in total.
> >> The system has two NUMA nodes.
> >>
> >> Below are some of the benchmark results
> >>
> >> schbench 99.0th latency (lower is better)
> >> ========
> >> case load baseline[pct imp](std%) SIS_CACHE[pct imp]( std%)
> >> normal 1-mthreads 1.00 [ 0.00]( 3.66) 1.00 [ 0.00]( 1.71)
> >> normal 2-mthreads 1.00 [ 0.00]( 4.55) 1.02 [ -2.00]( 3.00)
> >> normal 4-mthreads 1.00 [ 0.00]( 4.77) 0.96 [ +4.00]( 4.27)
> >> normal 6-mthreads 1.00 [ 0.00]( 60.37) 2.66 [ -166.00]( 23.67)
> >>
> >>
> >> schbench results show that there is not much impact on wakeup latencies from the extra iterations
> >> spent searching for an idle CPU in the select_idle_cpu() code path, and interestingly the numbers
> >> are slightly better for SIS_CACHE in the 4-mthreads case.
> >
> > The 4% improvement is within the std%, so I suppose we did not see much difference in the 4-mthreads case.
> >
> >> I think we can ignore the last case due to huge run to run variations.
> >
> > Although the run-to-run variation is large, the decrease still seems to fall within that range.
> > Prateek has also reported that there could be some regression from schbench when the system
> > is overloaded:
> > https://lore.kernel.org/lkml/[email protected]/
> > Could you also post the raw data printed by schbench? Using the latest schbench might also
> > give a more detailed breakdown of the latency.
> >
>
> raw data by schbench(old) with 6-mthreads
> ======================
>
> Baseline (5 runs)
> ========
> Latency percentiles (usec)
> 50.0000th: 22
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 37
> *99.0000th: 981
> 99.5000th: 4424
> 99.9000th: 9200
> min=0, max=29497
>
> Latency percentiles (usec)
> 50.0000th: 23
> 75.0000th: 29
> 90.0000th: 35
> 95.0000th: 38
> *99.0000th: 495
> 99.5000th: 3924
> 99.9000th: 9872
> min=0, max=29997
>
> Latency percentiles (usec)
> 50.0000th: 23
> 75.0000th: 30
> 90.0000th: 36
> 95.0000th: 39
> *99.0000th: 1326
> 99.5000th: 4744
> 99.9000th: 10000
> min=0, max=23394
>
> Latency percentiles (usec)
> 50.0000th: 23
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 37
> *99.0000th: 55
> 99.5000th: 3292
> 99.9000th: 9104
> min=0, max=25196
>
> Latency percentiles (usec)
> 50.0000th: 23
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 37
> *99.0000th: 711
> 99.5000th: 4600
> 99.9000th: 9424
> min=0, max=19997
>
> SIS_CACHE (5 runs)
> =========
> Latency percentiles (usec)
> 50.0000th: 23
> 75.0000th: 30
> 90.0000th: 35
> 95.0000th: 38
> *99.0000th: 1894
> 99.5000th: 5464
> 99.9000th: 10000
> min=0, max=19157
>
> Latency percentiles (usec)
> 50.0000th: 22
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 37
> *99.0000th: 2396
> 99.5000th: 6664
> 99.9000th: 10000
> min=0, max=24029
>
> Latency percentiles (usec)
> 50.0000th: 22
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 37
> *99.0000th: 2132
> 99.5000th: 6296
> 99.9000th: 10000
> min=0, max=25313
>
> Latency percentiles (usec)
> 50.0000th: 22
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 37
> *99.0000th: 1090
> 99.5000th: 6232
> 99.9000th: 9744
> min=0, max=27264
>
> Latency percentiles (usec)
> 50.0000th: 22
> 75.0000th: 29
> 90.0000th: 34
> 95.0000th: 38
> *99.0000th: 1786
> 99.5000th: 5240
> 99.9000th: 9968
> min=0, max=24754
>
> As indicated, the above data has large run-to-run variation, and in general the 99th percentile
> latency is higher with SIS_CACHE.
>
>
> schbench(new) with 6-mthreads
> =============
>
> Baseline
> ========
> Wakeup Latencies percentiles (usec) runtime 30 (s) (209403 total samples)
> 50.0th: 8 (43672 samples)
> 90.0th: 13 (83908 samples)
> * 99.0th: 20 (18323 samples)
> 99.9th: 775 (1785 samples)
> min=1, max=8400
> Request Latencies percentiles (usec) runtime 30 (s) (209543 total samples)
> 50.0th: 13648 (59873 samples)
> 90.0th: 14000 (82767 samples)
> * 99.0th: 14320 (16342 samples)
> 99.9th: 18720 (1670 samples)
> min=5130, max=38334
> RPS percentiles (requests) runtime 30 (s) (31 total samples)
> 20.0th: 6968 (8 samples)
> * 50.0th: 6984 (23 samples)
> 90.0th: 6984 (0 samples)
> min=6835, max=6991
> average rps: 6984.77
>
>
> SIS_CACHE
> =========
> Wakeup Latencies percentiles (usec) runtime 30 (s) (209295 total samples)
> 50.0th: 9 (49267 samples)
> 90.0th: 14 (86522 samples)
> * 99.0th: 21 (14091 samples)
> 99.9th: 1146 (1722 samples)
> min=1, max=10427
> Request Latencies percentiles (usec) runtime 30 (s) (209432 total samples)
> 50.0th: 13616 (62838 samples)
> 90.0th: 14000 (85301 samples)
> * 99.0th: 14352 (16149 samples)
> 99.9th: 21408 (1660 samples)
> min=5070, max=41866
> RPS percentiles (requests) runtime 30 (s) (31 total samples)
> 20.0th: 6968 (7 samples)
> * 50.0th: 6984 (21 samples)
> 90.0th: 6984 (0 samples)
> min=6672, max=6996
> average rps: 6981.07
>
> With the new schbench, I didn't observe run-to-run variation, and there was also no regression
> for SIS_CACHE at the 99th percentile.
>

Thanks for the test, Madadi. In my opinion we can stick with the new schbench
in the future. I'll double-check on my test machine.

thanks,
Chenyu