2020-08-28 06:17:36

by Tianxianting

Subject: [PATCH] aio: make aio wait path to account iowait time

The normal AIO wait path (read_events() ->
wait_event_interruptible_hrtimeout()) does not account iowait time. Make it
account iowait time, so that the reported iowait truly reflects the system
I/O situation when using a tool like 'top'.
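
For illustration, the userspace sequence that ends up in this wait path is
io_setup() -> io_submit() -> io_getevents(). A minimal test program of that
shape is sketched below (hypothetical, illustration only, not part of the
patch; the target device is the one used in the fio test further down):

/* Hypothetical example: submit one O_DIRECT read, then block in
 * io_getevents(), i.e. in read_events() in the kernel.
 * Build with: gcc -o aio-wait aio-wait.c -laio */
#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	int fd = open("/dev/sda3", O_RDONLY | O_DIRECT);

	if (fd < 0 || io_setup(1, &ctx) < 0 ||
	    posix_memalign(&buf, 4096, 8192))
		return 1;

	io_prep_pread(&cb, fd, buf, 8192, 0);
	if (io_submit(ctx, 1, cbs) != 1)
		return 1;

	/* Sleeps waiting for the completion; with this patch the sleep
	 * is accounted as iowait ('D' state). */
	io_getevents(ctx, 1, 1, &ev, NULL);

	io_destroy(ctx);
	return 0;
}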

The test results are below.

Test environment:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4108 CPU @ 1.80GHz
Stepping: 4
CPU MHz: 801.660

AIO test command:
fio -ioengine=libaio -bs=8k -direct=1 -numjobs 32 -rw=read -size=10G
-filename=/dev/sda3 -name="Max throughput" -iodepth=128 -runtime=60

Before the test, nr_requests was raised to 512 (the default is 256) to keep the
backend device busy handling I/O requests, giving io_getevents() more chances
to wait for I/O completion:
echo 512 > /sys/block/sda/queue/nr_requests

The fio test result with the AIO iowait time accounting patch is shown below.
Almost all fio threads are in the 'D' state waiting for I/O completion, and the
iowait time is accounted as 48.0%:
top - 19:19:23 up 29 min, 2 users, load average: 14.60, 9.45, 9.45
Tasks: 456 total, 4 running, 247 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.4 us, 1.0 sy, 0.0 ni, 50.6 id, 48.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 19668915+total, 19515264+free, 866476 used, 670028 buff/cache
KiB Swap: 4194300 total, 4194300 free, 0 used. 19449948+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16135 root 20 0 294092 63724 63060 S 1.7 0.0 0:03.31 fio
16173 root 20 0 272352 3540 1792 D 1.7 0.0 0:03.85 fio
16175 root 20 0 272360 3544 1796 D 1.7 0.0 0:03.85 fio
16185 root 20 0 272400 3556 1808 D 1.7 0.0 0:03.84 fio
16187 root 20 0 272408 3552 1804 D 1.7 0.0 0:03.82 fio
16190 root 20 0 272420 3500 1804 R 1.7 0.0 0:03.88 fio
16169 root 20 0 272336 3444 1740 D 1.3 0.0 0:03.75 fio
16170 root 20 0 272340 3504 1804 R 1.3 0.0 0:03.80 fio
16172 root 20 0 272348 3500 1800 D 1.3 0.0 0:03.86 fio
16174 root 20 0 272356 3544 1796 D 1.3 0.0 0:03.77 fio
16179 root 20 0 272376 3528 1780 D 1.3 0.0 0:03.79 fio
16180 root 20 0 272380 3500 1800 D 1.3 0.0 0:03.85 fio
16181 root 20 0 272384 3552 1804 D 1.3 0.0 0:03.87 fio
16182 root 20 0 272388 3520 1772 D 1.3 0.0 0:03.80 fio
16183 root 20 0 272392 3552 1804 D 1.3 0.0 0:03.77 fio
16186 root 20 0 272404 3500 1804 D 1.3 0.0 0:03.88 fio
16188 root 20 0 272412 3500 1800 D 1.3 0.0 0:03.89 fio
16191 root 20 0 272424 3500 1800 D 1.3 0.0 0:03.92 fio
16192 root 20 0 272428 3500 1800 D 1.3 0.0 0:03.87 fio
16194 root 20 0 272436 3500 1804 D 1.3 0.0 0:03.82 fio
16195 root 20 0 272440 3500 1800 R 1.3 0.0 0:03.82 fio
16196 root 20 0 272444 3552 1804 D 1.3 0.0 0:03.84 fio
16198 root 20 0 272452 3500 1804 D 1.3 0.0 0:03.89 fio
16199 root 20 0 272456 3504 1800 D 1.3 0.0 0:03.84 fio
16200 root 20 0 272460 3552 1804 D 1.3 0.0 0:03.85 fio
16171 root 20 0 272344 3504 1800 D 1.0 0.0 0:03.84 fio
16176 root 20 0 272364 3520 1772 D 1.0 0.0 0:03.76 fio
16177 root 20 0 272368 3556 1808 D 1.0 0.0 0:03.74 fio
16178 root 20 0 272372 3500 1804 D 1.0 0.0 0:03.90 fio
16184 root 20 0 272396 3500 1800 D 1.0 0.0 0:03.83 fio
16189 root 20 0 272416 3500 1804 D 1.0 0.0 0:03.86 fio
16193 root 20 0 272432 3500 1804 D 1.0 0.0 0:03.85 fio
16197 root 20 0 272448 3556 1808 D 1.0 0.0 0:03.75 fio

The fio test result without the AIO iowait time accounting patch is shown below.
Almost all fio threads are in the 'S' state waiting for I/O completion, and no
iowait time is accounted; iowait is 0.0%:
top - 19:20:44 up 31 min, 2 users, load average: 12.50, 10.15, 9.72
Tasks: 458 total, 2 running, 249 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.4 us, 0.9 sy, 0.0 ni, 98.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 19668915+total, 19513945+free, 879652 used, 670040 buff/cache
KiB Swap: 4194300 total, 4194300 free, 0 used. 19448636+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16243 root 20 0 294092 63736 63068 S 1.7 0.0 0:03.06 fio
16277 root 20 0 272336 3568 1868 S 1.7 0.0 0:03.59 fio
16287 root 20 0 272376 3564 1864 S 1.7 0.0 0:03.64 fio
16291 root 20 0 272392 3620 1868 S 1.7 0.0 0:03.63 fio
16298 root 20 0 272420 3564 1868 S 1.7 0.0 0:03.61 fio
16302 root 20 0 272436 3560 1868 S 1.7 0.0 0:03.61 fio
16303 root 20 0 272440 3552 1800 S 1.7 0.0 0:03.62 fio
16308 root 20 0 272460 3568 1864 S 1.7 0.0 0:03.60 fio
16278 root 20 0 272340 3568 1868 S 1.3 0.0 0:03.59 fio
16279 root 20 0 272344 3508 1800 S 1.3 0.0 0:03.60 fio
16280 root 20 0 272348 3564 1864 S 1.3 0.0 0:03.60 fio
16281 root 20 0 272352 3624 1872 S 1.3 0.0 0:03.57 fio
16283 root 20 0 272360 3612 1860 S 1.3 0.0 0:03.60 fio
16285 root 20 0 272368 3592 1840 S 1.3 0.0 0:03.62 fio
16286 root 20 0 272372 3580 1828 S 1.3 0.0 0:03.61 fio
16288 root 20 0 272380 3620 1868 S 1.3 0.0 0:03.55 fio
16289 root 20 0 272384 3564 1868 S 1.3 0.0 0:03.59 fio
16292 root 20 0 272396 3536 1836 S 1.3 0.0 0:03.62 fio
16293 root 20 0 272400 3624 1872 S 1.3 0.0 0:03.63 fio
16295 root 20 0 272408 3620 1868 S 1.3 0.0 0:03.61 fio
16297 root 20 0 272416 3568 1868 S 1.3 0.0 0:03.62 fio
16300 root 20 0 272428 3564 1864 R 1.3 0.0 0:03.61 fio
16304 root 20 0 272444 3564 1864 S 1.3 0.0 0:03.59 fio
16305 root 20 0 272448 3456 1760 S 1.3 0.0 0:03.65 fio
16307 root 20 0 272456 3568 1864 S 1.3 0.0 0:03.64 fio
16282 root 20 0 272356 3556 1860 S 1.0 0.0 0:03.55 fio
16284 root 20 0 272364 3612 1860 S 1.0 0.0 0:03.57 fio
16290 root 20 0 272388 3616 1864 S 1.0 0.0 0:03.54 fio
16294 root 20 0 272404 3624 1872 S 1.0 0.0 0:03.60 fio
16296 root 20 0 272412 3564 1864 S 1.0 0.0 0:03.60 fio
16299 root 20 0 272424 3540 1840 S 1.0 0.0 0:03.62 fio
16301 root 20 0 272432 3568 1868 S 1.0 0.0 0:03.63 fio
16306 root 20 0 272452 3624 1872 S 1.0 0.0 0:03.60 fio

Signed-off-by: Xianting Tian <[email protected]>
---
fs/aio.c | 2 +-
include/linux/wait.h | 26 ++++++++++++++++++++++----
2 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 5736bff48..8d00548e0 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1290,7 +1290,7 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr,
if (until == 0)
aio_read_events(ctx, min_nr, nr, event, &ret);
else
- wait_event_interruptible_hrtimeout(ctx->wait,
+ io_wait_event_hrtimeout(ctx->wait,
aio_read_events(ctx, min_nr, nr, event, &ret),
until);
return ret;
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 898c890fc..fb5902a25 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -312,6 +312,13 @@ do { \
(void)___wait_event(wq_head, condition, TASK_UNINTERRUPTIBLE, 0, 0, \
io_schedule())

+#define __io_wait_event_hrtimeout(wq_head, condition, timeout) \
+({ \
+ int __ret = 0; \
+ __ret = __wait_event_hrtimeout(wq_head, condition, timeout, \
+ TASK_UNINTERRUPTIBLE, io_schedule()); \
+})
+
/*
* io_wait_event() -- like wait_event() but with io_schedule()
*/
@@ -323,6 +330,15 @@ do { \
__io_wait_event(wq_head, condition); \
} while (0)

+
+#define io_wait_event_hrtimeout(wq_head, condition, timeout) \
+do { \
+ might_sleep(); \
+ if (condition) \
+ break; \
+ __io_wait_event_hrtimeout(wq_head, condition, timeout); \
+} while (0)
+
#define __wait_event_freezable(wq_head, condition) \
___wait_event(wq_head, condition, TASK_INTERRUPTIBLE, 0, 0, \
freezable_schedule())
@@ -500,7 +516,7 @@ do { \
__ret; \
})

-#define __wait_event_hrtimeout(wq_head, condition, timeout, state) \
+#define __wait_event_hrtimeout(wq_head, condition, timeout, state, cmd) \
({ \
int __ret = 0; \
struct hrtimer_sleeper __t; \
@@ -517,7 +533,7 @@ do { \
__ret = -ETIME; \
break; \
} \
- schedule()); \
+ cmd); \
\
hrtimer_cancel(&__t.timer); \
destroy_hrtimer_on_stack(&__t.timer); \
@@ -546,7 +562,8 @@ do { \
might_sleep(); \
if (!(condition)) \
__ret = __wait_event_hrtimeout(wq_head, condition, timeout, \
- TASK_UNINTERRUPTIBLE); \
+ TASK_UNINTERRUPTIBLE, \
+ schedule()); \
__ret; \
})

@@ -572,7 +589,8 @@ do { \
might_sleep(); \
if (!(condition)) \
__ret = __wait_event_hrtimeout(wq, condition, timeout, \
- TASK_INTERRUPTIBLE); \
+ TASK_INTERRUPTIBLE, \
+ schedule()); \
__ret; \
})

--
2.17.1


2020-08-28 08:59:37

by Jan Kara

Subject: Re: [PATCH] aio: make aio wait path to account iowait time

On Fri 28-08-20 14:07:12, Xianting Tian wrote:
> The normal AIO wait path (read_events() ->
> wait_event_interruptible_hrtimeout()) does not account iowait time. Make it
> account iowait time, so that the reported iowait truly reflects the system
> I/O situation when using a tool like 'top'.
>
> The test results are below.
>
> Test environment:
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Byte Order: Little Endian
> CPU(s): 32
> On-line CPU(s) list: 0-31
> Thread(s) per core: 2
> Core(s) per socket: 8
> Socket(s): 2
> NUMA node(s): 2
> Vendor ID: GenuineIntel
> CPU family: 6
> Model: 85
> Model name: Intel(R) Xeon(R) Silver 4108 CPU @ 1.80GHz
> Stepping: 4
> CPU MHz: 801.660
>
> AIO test command:
> fio -ioengine=libaio -bs=8k -direct=1 -numjobs 32 -rw=read -size=10G
> -filename=/dev/sda3 -name="Max throughput" -iodepth=128 -runtime=60
>
> Before the test, nr_requests was raised to 512 (the default is 256) to keep the
> backend device busy handling I/O requests, giving io_getevents() more chances
> to wait for I/O completion:
> echo 512 > /sys/block/sda/queue/nr_requests
>
> The fio test result with the AIO iowait time accounting patch is shown below.
> Almost all fio threads are in the 'D' state waiting for I/O completion, and the
> iowait time is accounted as 48.0%:
> top - 19:19:23 up 29 min, 2 users, load average: 14.60, 9.45, 9.45
> Tasks: 456 total, 4 running, 247 sleeping, 0 stopped, 0 zombie
> %Cpu(s): 0.4 us, 1.0 sy, 0.0 ni, 50.6 id, 48.0 wa, 0.0 hi, 0.0 si, 0.0 st
> KiB Mem : 19668915+total, 19515264+free, 866476 used, 670028 buff/cache
> KiB Swap: 4194300 total, 4194300 free, 0 used. 19449948+avail Mem
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 16135 root 20 0 294092 63724 63060 S 1.7 0.0 0:03.31 fio
> 16173 root 20 0 272352 3540 1792 D 1.7 0.0 0:03.85 fio
> 16175 root 20 0 272360 3544 1796 D 1.7 0.0 0:03.85 fio
> 16185 root 20 0 272400 3556 1808 D 1.7 0.0 0:03.84 fio
> 16187 root 20 0 272408 3552 1804 D 1.7 0.0 0:03.82 fio
> 16190 root 20 0 272420 3500 1804 R 1.7 0.0 0:03.88 fio
> 16169 root 20 0 272336 3444 1740 D 1.3 0.0 0:03.75 fio
> 16170 root 20 0 272340 3504 1804 R 1.3 0.0 0:03.80 fio
> 16172 root 20 0 272348 3500 1800 D 1.3 0.0 0:03.86 fio
> 16174 root 20 0 272356 3544 1796 D 1.3 0.0 0:03.77 fio
> 16179 root 20 0 272376 3528 1780 D 1.3 0.0 0:03.79 fio
> 16180 root 20 0 272380 3500 1800 D 1.3 0.0 0:03.85 fio
> 16181 root 20 0 272384 3552 1804 D 1.3 0.0 0:03.87 fio
> 16182 root 20 0 272388 3520 1772 D 1.3 0.0 0:03.80 fio
> 16183 root 20 0 272392 3552 1804 D 1.3 0.0 0:03.77 fio
> 16186 root 20 0 272404 3500 1804 D 1.3 0.0 0:03.88 fio
> 16188 root 20 0 272412 3500 1800 D 1.3 0.0 0:03.89 fio
> 16191 root 20 0 272424 3500 1800 D 1.3 0.0 0:03.92 fio
> 16192 root 20 0 272428 3500 1800 D 1.3 0.0 0:03.87 fio
> 16194 root 20 0 272436 3500 1804 D 1.3 0.0 0:03.82 fio
> 16195 root 20 0 272440 3500 1800 R 1.3 0.0 0:03.82 fio
> 16196 root 20 0 272444 3552 1804 D 1.3 0.0 0:03.84 fio
> 16198 root 20 0 272452 3500 1804 D 1.3 0.0 0:03.89 fio
> 16199 root 20 0 272456 3504 1800 D 1.3 0.0 0:03.84 fio
> 16200 root 20 0 272460 3552 1804 D 1.3 0.0 0:03.85 fio
> 16171 root 20 0 272344 3504 1800 D 1.0 0.0 0:03.84 fio
> 16176 root 20 0 272364 3520 1772 D 1.0 0.0 0:03.76 fio
> 16177 root 20 0 272368 3556 1808 D 1.0 0.0 0:03.74 fio
> 16178 root 20 0 272372 3500 1804 D 1.0 0.0 0:03.90 fio
> 16184 root 20 0 272396 3500 1800 D 1.0 0.0 0:03.83 fio
> 16189 root 20 0 272416 3500 1804 D 1.0 0.0 0:03.86 fio
> 16193 root 20 0 272432 3500 1804 D 1.0 0.0 0:03.85 fio
> 16197 root 20 0 272448 3556 1808 D 1.0 0.0 0:03.75 fio
>
> The fio test result without the AIO iowait time accounting patch is shown below.
> Almost all fio threads are in the 'S' state waiting for I/O completion, and no
> iowait time is accounted; iowait is 0.0%:
> top - 19:20:44 up 31 min, 2 users, load average: 12.50, 10.15, 9.72
> Tasks: 458 total, 2 running, 249 sleeping, 0 stopped, 0 zombie
> %Cpu(s): 0.4 us, 0.9 sy, 0.0 ni, 98.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> KiB Mem : 19668915+total, 19513945+free, 879652 used, 670040 buff/cache
> KiB Swap: 4194300 total, 4194300 free, 0 used. 19448636+avail Mem
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 16243 root 20 0 294092 63736 63068 S 1.7 0.0 0:03.06 fio
> 16277 root 20 0 272336 3568 1868 S 1.7 0.0 0:03.59 fio
> 16287 root 20 0 272376 3564 1864 S 1.7 0.0 0:03.64 fio
> 16291 root 20 0 272392 3620 1868 S 1.7 0.0 0:03.63 fio
> 16298 root 20 0 272420 3564 1868 S 1.7 0.0 0:03.61 fio
> 16302 root 20 0 272436 3560 1868 S 1.7 0.0 0:03.61 fio
> 16303 root 20 0 272440 3552 1800 S 1.7 0.0 0:03.62 fio
> 16308 root 20 0 272460 3568 1864 S 1.7 0.0 0:03.60 fio
> 16278 root 20 0 272340 3568 1868 S 1.3 0.0 0:03.59 fio
> 16279 root 20 0 272344 3508 1800 S 1.3 0.0 0:03.60 fio
> 16280 root 20 0 272348 3564 1864 S 1.3 0.0 0:03.60 fio
> 16281 root 20 0 272352 3624 1872 S 1.3 0.0 0:03.57 fio
> 16283 root 20 0 272360 3612 1860 S 1.3 0.0 0:03.60 fio
> 16285 root 20 0 272368 3592 1840 S 1.3 0.0 0:03.62 fio
> 16286 root 20 0 272372 3580 1828 S 1.3 0.0 0:03.61 fio
> 16288 root 20 0 272380 3620 1868 S 1.3 0.0 0:03.55 fio
> 16289 root 20 0 272384 3564 1868 S 1.3 0.0 0:03.59 fio
> 16292 root 20 0 272396 3536 1836 S 1.3 0.0 0:03.62 fio
> 16293 root 20 0 272400 3624 1872 S 1.3 0.0 0:03.63 fio
> 16295 root 20 0 272408 3620 1868 S 1.3 0.0 0:03.61 fio
> 16297 root 20 0 272416 3568 1868 S 1.3 0.0 0:03.62 fio
> 16300 root 20 0 272428 3564 1864 R 1.3 0.0 0:03.61 fio
> 16304 root 20 0 272444 3564 1864 S 1.3 0.0 0:03.59 fio
> 16305 root 20 0 272448 3456 1760 S 1.3 0.0 0:03.65 fio
> 16307 root 20 0 272456 3568 1864 S 1.3 0.0 0:03.64 fio
> 16282 root 20 0 272356 3556 1860 S 1.0 0.0 0:03.55 fio
> 16284 root 20 0 272364 3612 1860 S 1.0 0.0 0:03.57 fio
> 16290 root 20 0 272388 3616 1864 S 1.0 0.0 0:03.54 fio
> 16294 root 20 0 272404 3624 1872 S 1.0 0.0 0:03.60 fio
> 16296 root 20 0 272412 3564 1864 S 1.0 0.0 0:03.60 fio
> 16299 root 20 0 272424 3540 1840 S 1.0 0.0 0:03.62 fio
> 16301 root 20 0 272432 3568 1868 S 1.0 0.0 0:03.63 fio
> 16306 root 20 0 272452 3624 1872 S 1.0 0.0 0:03.60 fio
>
> Signed-off-by: Xianting Tian <[email protected]>

Nice! The patch looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/aio.c | 2 +-
> include/linux/wait.h | 26 ++++++++++++++++++++++----
> 2 files changed, 23 insertions(+), 5 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 5736bff48..8d00548e0 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1290,7 +1290,7 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr,
> if (until == 0)
> aio_read_events(ctx, min_nr, nr, event, &ret);
> else
> - wait_event_interruptible_hrtimeout(ctx->wait,
> + io_wait_event_hrtimeout(ctx->wait,
> aio_read_events(ctx, min_nr, nr, event, &ret),
> until);
> return ret;
> diff --git a/include/linux/wait.h b/include/linux/wait.h
> index 898c890fc..fb5902a25 100644
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -312,6 +312,13 @@ do { \
> (void)___wait_event(wq_head, condition, TASK_UNINTERRUPTIBLE, 0, 0, \
> io_schedule())
>
> +#define __io_wait_event_hrtimeout(wq_head, condition, timeout) \
> +({ \
> + int __ret = 0; \
> + __ret = __wait_event_hrtimeout(wq_head, condition, timeout, \
> + TASK_UNINTERRUPTIBLE, io_schedule()); \
> +})
> +
> /*
> * io_wait_event() -- like wait_event() but with io_schedule()
> */
> @@ -323,6 +330,15 @@ do { \
> __io_wait_event(wq_head, condition); \
> } while (0)
>
> +
> +#define io_wait_event_hrtimeout(wq_head, condition, timeout) \
> +do { \
> + might_sleep(); \
> + if (condition) \
> + break; \
> + __io_wait_event_hrtimeout(wq_head, condition, timeout); \
> +} while (0)
> +
> #define __wait_event_freezable(wq_head, condition) \
> ___wait_event(wq_head, condition, TASK_INTERRUPTIBLE, 0, 0, \
> freezable_schedule())
> @@ -500,7 +516,7 @@ do { \
> __ret; \
> })
>
> -#define __wait_event_hrtimeout(wq_head, condition, timeout, state) \
> +#define __wait_event_hrtimeout(wq_head, condition, timeout, state, cmd) \
> ({ \
> int __ret = 0; \
> struct hrtimer_sleeper __t; \
> @@ -517,7 +533,7 @@ do { \
> __ret = -ETIME; \
> break; \
> } \
> - schedule()); \
> + cmd); \
> \
> hrtimer_cancel(&__t.timer); \
> destroy_hrtimer_on_stack(&__t.timer); \
> @@ -546,7 +562,8 @@ do { \
> might_sleep(); \
> if (!(condition)) \
> __ret = __wait_event_hrtimeout(wq_head, condition, timeout, \
> - TASK_UNINTERRUPTIBLE); \
> + TASK_UNINTERRUPTIBLE, \
> + schedule()); \
> __ret; \
> })
>
> @@ -572,7 +589,8 @@ do { \
> might_sleep(); \
> if (!(condition)) \
> __ret = __wait_event_hrtimeout(wq, condition, timeout, \
> - TASK_INTERRUPTIBLE); \
> + TASK_INTERRUPTIBLE, \
> + schedule()); \
> __ret; \
> })
>
> --
> 2.17.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-08-28 09:11:40

by Peter Zijlstra

Subject: Re: [PATCH] aio: make aio wait path to account iowait time

On Fri, Aug 28, 2020 at 02:07:12PM +0800, Xianting Tian wrote:
> The normal AIO wait path (read_events() ->
> wait_event_interruptible_hrtimeout()) does not account iowait time. Make it
> account iowait time, so that the reported iowait truly reflects the system
> I/O situation when using a tool like 'top'.

Do be aware though that io_schedule() is potentially far more expensive
than regular schedule() and io-wait accounting as a whole is a
trainwreck.

When in_iowait is set schedule() and ttwu() will have to do additional
atomic ops, and (much) worse, PSI will take additional locks.

And all that for a number that, IMO, is mostly useless, see the comment
with nr_iowait().

But, if you don't care about performance, and want to see a shiny random
number generator, by all means, use io_schedule().
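
(For reference, a simplified sketch of what io_schedule() adds on top of a
plain schedule(); paraphrased from kernel/sched/core.c of this era rather
than quoted literally, details differ between kernel versions:)

/* simplified sketch, not the literal mainline code */
void io_schedule(void)
{
	int old_iowait = current->in_iowait;

	current->in_iowait = 1;			/* mark this sleep as iowait */
	blk_schedule_flush_plug(current);	/* flush plugged IO before sleeping */
	schedule();				/* __schedule() sees in_iowait and does
						 * atomic_inc(&rq->nr_iowait) plus
						 * delayacct/PSI (TSK_IOWAIT) accounting;
						 * ttwu() undoes it on wakeup */
	current->in_iowait = old_iowait;
}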

2020-08-28 09:42:32

by Jan Kara

Subject: Re: [PATCH] aio: make aio wait path to account iowait time

On Fri 28-08-20 11:07:29, [email protected] wrote:
> On Fri, Aug 28, 2020 at 02:07:12PM +0800, Xianting Tian wrote:
> > The normal AIO wait path (read_events() ->
> > wait_event_interruptible_hrtimeout()) does not account iowait time. Make it
> > account iowait time, so that the reported iowait truly reflects the system
> > I/O situation when using a tool like 'top'.
>
> Do be aware though that io_schedule() is potentially far more expensive
> than regular schedule() and io-wait accounting as a whole is a
> trainwreck.

Hum, I didn't know that io_schedule() is that much more expensive. Thanks
for info.

> When in_iowait is set schedule() and ttwu() will have to do additional
> atomic ops, and (much) worse, PSI will take additional locks.
>
> And all that for a number that, IMO, is mostly useless, see the comment
> with nr_iowait().

Well, I understand the limited usefulness of the system or even per CPU
percentage spent in IO wait. However whether a particular task is sleeping
waiting for IO or not is IMO useful diagnostic information, and there are
several places in the kernel that take that into account (PSI, hangcheck
timer, cpufreq, ...). So I don't see that properly accounting that a task
is waiting for IO is just "expensive random number generator" as you
mention below :). But I'm open to being educated...

> But, if you don't care about performance, and want to see a shiny random
> number generator, by all means, use io_schedule().

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-08-28 10:53:37

by Peter Zijlstra

Subject: Re: [PATCH] aio: make aio wait path to account iowait time

On Fri, Aug 28, 2020 at 11:41:29AM +0200, Jan Kara wrote:
> On Fri 28-08-20 11:07:29, [email protected] wrote:
> > On Fri, Aug 28, 2020 at 02:07:12PM +0800, Xianting Tian wrote:
> > > The normal AIO wait path (read_events() ->
> > > wait_event_interruptible_hrtimeout()) does not account iowait time. Make it
> > > account iowait time, so that the reported iowait truly reflects the system
> > > I/O situation when using a tool like 'top'.
> >
> > Do be aware though that io_schedule() is potentially far more expensive
> > than regular schedule() and io-wait accounting as a whole is a
> > trainwreck.
>
> Hum, I didn't know that io_schedule() is that much more expensive. Thanks
> for info.

It's all relative, but it can add up under contention. And since these
storage thingies are getting faster every year, I'm assuming these
schedule rates are increasing along with it.

> > When in_iowait is set schedule() and ttwu() will have to do additional
> > atomic ops, and (much) worse, PSI will take additional locks.
> >
> > And all that for a number that, IMO, is mostly useless, see the comment
> > with nr_iowait().
>
> Well, I understand the limited usefulness of the system or even per CPU
> percentage spent in IO wait. However whether a particular task is sleeping
> waiting for IO or not

So strict per-task state is not a problem, and we could easily change
get_task_state() to distinguish between IO-wait or not, basically
duplicate S/D state into an IO-wait variant of the same. Although even
this has ABI implications :-(
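
(A purely hypothetical sketch of what that could look like -- the helper name
and the 'W' letter are invented here and are not mainline code; it only
illustrates the "duplicate S/D into an IO-wait variant" idea and the ABI
problem that comes with a new state letter:)

/* hypothetical, for illustration only */
static char task_state_char_iowait(struct task_struct *tsk)
{
	char c = task_state_to_char(tsk);	/* existing helper: R/S/D/T/Z/... */

	/* distinguish "sleeping in io_schedule()" from a plain sleep */
	if (tsk->in_iowait && (c == 'S' || c == 'D'))
		return 'W';			/* new letter == the ABI issue above */
	return c;
}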

> is IMO a useful diagnostic information and there are
> several places in the kernel that take that into account (PSI, hangcheck
> timer, cpufreq, ...).

So PSI is the one I hate most. We spend an awful lot of time to not
have to take the old rq->lock on wakeup, and PSI reintroduced it for
accounting purposes -- I hate accounting overhead. :/

There's a number of high frequency scheduling workloads where it really
adds up, which is the reason we got rid of it in the first place.

OTOH, PSI gives more sensible numbers, although it goes side-ways when
you introduce affinity masks / cpusets.

The menu-cpufreq gov is known to be crazy and we're all working hard on
replacing it.

And the tick-sched usage is, iirc, the nohz case of iowait.

> So I don't see that properly accounting that a task
> is waiting for IO is just "expensive random number generator" as you
> mention below :). But I'm open to being educated...

It's the userspace iowait, and in particular the per-cpu iowait numbers
that I hate. Only on UP does any of that make sense.

But we can't remove them because ABI :-(

2020-08-28 12:01:45

by Tianxianting

Subject: RE: [PATCH] aio: make aio wait path to account iowait time

Thanks peterz, jan.
So, enabling aio iowait time accounting is a bad idea :(
I gained a lot from you, thanks.
