2022-06-09 00:02:27

by Tianchen Ding

Subject: Re: [PATCH v4 2/2] sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle

On 2022/6/9 00:35, Tianchen Ding wrote:
> Wakelist can help avoid cache bouncing and take overhead off the waker
> cpu. So far, using wakelist within the same llc only happens on
> WF_ON_CPU, and this limitation could be removed to further improve
> wakeup performance.
>
> The commit 518cd6234178 ("sched: Only queue remote wakeups when
> crossing cache boundaries") disabled queuing tasks on wakelist when
> the cpus share llc. This is because, at that time, the scheduler must
> send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist also
> supports TIF_POLLING, so this is not a problem now when the wakee cpu is
> in idle polling.
>
> Benefits:
> Queuing the task on an idle cpu can help improve performance on the
> waker cpu and utilization on the wakee cpu, and further improve locality
> because the wakee cpu can handle its own rq. This patch helps improve
> rt (response time) on our real java workloads where wakeup happens
> frequently.
>
> Consider the normal condition (CPU0 and CPU1 share the same llc).
> Before this patch:
>
> CPU0                                            CPU1
>
> select_task_rq()                                idle
> rq_lock(CPU1->rq)
> enqueue_task(CPU1->rq)
> notify CPU1 (by sending IPI or CPU1 polling)
>
>                                                 resched()
>
> After this patch:
>
> CPU0                                            CPU1
>
> select_task_rq()                                idle
> add to wakelist of CPU1
> notify CPU1 (by sending IPI or CPU1 polling)
>
>                                                 rq_lock(CPU1->rq)
>                                                 enqueue_task(CPU1->rq)
>                                                 resched()
>
> We see CPU0 can finish its work earlier. It only needs to put the task
> on the wakelist and return.
> Since CPU1 is idle anyway, we let it handle its own runqueue data.
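
To make the condition change concrete, here is a rough userspace sketch of the wakelist decision as the changelog describes it. It is a simplified model under my own assumptions, not the code in the patch: `struct wakeup` and both function names are hypothetical, and the real check in the scheduler looks at more state than these three flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one wakeup; these are not kernel types. */
struct wakeup {
	bool share_llc;   /* waker and wakee cpus share a last-level cache */
	bool wf_on_cpu;   /* the wakeup was flagged WF_ON_CPU */
	bool wakee_idle;  /* the wakee cpu has nothing runnable */
};

/* Before: within one llc, only WF_ON_CPU wakeups go to the wakelist. */
static bool use_wakelist_before(const struct wakeup *w)
{
	assert(w != NULL);
	if (!w->share_llc)
		return true;	/* crossing a cache boundary: always queue */
	return w->wf_on_cpu;
}

/* After: an idle wakee cpu also qualifies, even inside the same llc. */
static bool use_wakelist_after(const struct wakeup *w)
{
	assert(w != NULL);
	if (!w->share_llc)
		return true;
	return w->wf_on_cpu || w->wakee_idle;
}
```

In this model the only input whose result changes is a same-llc wakeup without WF_ON_CPU that targets an idle cpu, which is the case drawn in the two diagrams above.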
>
> This patch makes no difference to the number of IPIs.
> This patch only takes effect when the wakee cpu is:
> 1) idle polling
> 2) idle not polling
>
> For 1), there will be no IPI with or without this patch.
>
> For 2), there will always be an IPI before or after this patch.
> Before this patch: waker cpu will enqueue the task and check preempt.
> Since the idle task is sure to be preempted, waker cpu must send a
> resched IPI.
> After this patch: waker cpu will put the task on the wakelist of wakee
> cpu, and send an IPI.
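
The IPI argument for cases 1) and 2) can be written down as a tiny model. This is only a sketch of the reasoning above; `enum idle_state` and the two functions are hypothetical names, not kernel interfaces, and the model covers only an idle wakee cpu.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of how the wakee cpu is idling; not a kernel type. */
enum idle_state { IDLE_POLLING, IDLE_NOT_POLLING };

/*
 * Before: the waker enqueues the task itself, and the idle task always
 * has to be preempted. A polling cpu notices the request on its own,
 * a non-polling cpu needs a resched IPI.
 */
static int ipis_before_patch(enum idle_state s)
{
	assert(s == IDLE_POLLING || s == IDLE_NOT_POLLING);
	return s == IDLE_POLLING ? 0 : 1;
}

/*
 * After: the waker only adds the task to the wakee's wakelist. A polling
 * cpu again notices without an interrupt, a non-polling cpu gets one IPI.
 */
static int ipis_after_patch(enum idle_state s)
{
	assert(s == IDLE_POLLING || s == IDLE_NOT_POLLING);
	return s == IDLE_POLLING ? 0 : 1;
}
```

Both functions return the same count for either idle state. What changes is which cpu takes the rq lock and does the enqueue, not how many interrupts are sent.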
>
> Benchmark:
> We've tested schbench, unixbench, and hackbench on both x86 and arm64.
>
> On x86 (Intel Xeon Platinum 8269CY):
> schbench -m 2 -t 8
>
> Latency percentiles (usec)    before    after
>     50.0000th:                     8        6
>     75.0000th:                    10        7
>     90.0000th:                    11        8
>     95.0000th:                    12        8
>    *99.0000th:                    13       10
>     99.5000th:                    15       11
>     99.9000th:                    18       14
>
> Unixbench with full threads (104)
>                                             before         after
> Dhrystone 2 using register variables    3011862938    3009935994   -0.06%
> Double-Precision Whetstone                617119.3      617298.5    0.03%
> Execl Throughput                           27667.3       27627.3   -0.14%
> File Copy 1024 bufsize 2000 maxblocks     785871.4      784906.2   -0.12%
> File Copy 256 bufsize 500 maxblocks       210113.6      212635.4    1.20%
> File Copy 4096 bufsize 8000 maxblocks    2328862.2     2320529.1   -0.36%
> Pipe Throughput                        145535622.8   145323033.2   -0.15%
> Pipe-based Context Switching             3221686.4     3583975.4   11.25%
> Process Creation                          101347.1      103345.4    1.97%
> Shell Scripts (1 concurrent)              120193.5      123977.8    3.15%
> Shell Scripts (8 concurrent)               17233.4       17138.4   -0.55%
> System Call Overhead                     5300604.8     5312213.6    0.22%
>
> hackbench -g 1 -l 100000
>                                             before         after
> Time                                         3.246         2.251
>
> On arm64 (Ampere Altra):
> schbench -m 2 -t 8
>
> Latency percentiles (usec)    before    after
>     50.0000th:                    14       10
>     75.0000th:                    19       14
>     90.0000th:                    22       16
>     95.0000th:                    23       16
>    *99.0000th:                    24       17
>     99.5000th:                    24       17
>     99.9000th:                    28       25
>
> Unixbench with full threads (80)
>                                             before         after
> Dhrystone 2 using register variables    3536194249    3536476016   -0.01%
> Double-Precision Whetstone                629383.6      629333.3   -0.01%
> Execl Throughput                           65920.5       66288.8   -0.49%
> File Copy 1024 bufsize 2000 maxblocks    1038402.1     1050181.2    1.13%
> File Copy 256 bufsize 500 maxblocks       311054.2      310317.2   -0.24%
> File Copy 4096 bufsize 8000 maxblocks    2276795.6       2297703    0.92%
> Pipe Throughput                        130409359.9   130390848.7   -0.01%
> Pipe-based Context Switching             3148440.7     3383705.1    7.47%
> Process Creation                          111574.3      119728.6    7.31%
> Shell Scripts (1 concurrent)              122980.7      122657.4   -0.26%
> Shell Scripts (8 concurrent)               17482.8       17476.8   -0.03%
> System Call Overhead                     4424103.4     4430062.6    0.13%
>
> Dhrystone 2 using register variables    3536194249    3537019613    0.02%
> Double-Precision Whetstone                629383.6      629431.6    0.01%
> Execl Throughput                           65920.5       65846.2   -0.11%
> File Copy 1024 bufsize 2000 maxblocks    1063722.8     1064026.8    0.03%
> File Copy 256 bufsize 500 maxblocks       322684.5      318724.5   -1.23%
> File Copy 4096 bufsize 8000 maxblocks    2348285.3     2328804.8   -0.83%
> Pipe Throughput                        133542875.3   131619389.8   -1.44%
> Pipe-based Context Switching             3215356.1     3576945.1   11.25%
> Process Creation                          108520.5      120184.6   10.75%
> Shell Scripts (1 concurrent)              122636.3        121888   -0.61%
> Shell Scripts (8 concurrent)               17462.1       17381.4   -0.46%
> System Call Overhead                     4429998.9     4435006.7    0.11%

Oops... I forgot to remove the previous result.
Let me resend one.