2022-03-31 02:48:15

by Michael Marod

[permalink] [raw]
Subject: NVME performance regression in Linux 5.x due to lack of block level IO queueing

I've noticed in an AWS Cloud environment that I/O performance on an Ubuntu Bionic 18.04 i3 instance with Linux 5.4.0 is significantly worse than on an Ubuntu Xenial 16.04 i3 instance with Linux 4.4.0. I've been digging into the issue, and it seems that block-level I/O queueing is not working as well (if at all?) on the Linux 5 instance: when I run performance tests with fio, iostat reports a 0.00 average queue size on the Linux 5.4.0 instance, while it rises to 0.5 or higher on the 4.4.0 one.

In my setup, I have two NVMe devices (/dev/nvme0n1 and /dev/nvme1n1) configured in a raid0 array on /dev/md127, which is mounted on /opt. I've tested the devices both directly and through the md RAID layer, and it does not seem to make a difference. In the output below I am writing through /opt, which exercises the file system and the RAID layer. Also, fio was compiled from source at its latest version on both instances.

# fio -name=randrw -filename=/opt/testfile -direct=1 -thread -rw=randrw -ioengine=psync -bs=4k -size=10G -numjobs=1 -group_reporting=1 -runtime=120

Note: the two instances run different versions of iostat, which is why the output formats differ.

# Linux 4.4.0-1137-aws:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 0.00 3248.00 3173.00 12992.00 12692.00 8.00 0.44 0.07 0.10 0.03 0.07 44.00
nvme0n1 0.00 0.00 3229.00 3292.00 12916.00 13168.00 8.00 0.47 0.07 0.10 0.04 0.07 46.80

# Linux 5.4.0-1068-aws:
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme0n1 2842.57 2789.11 11370.30 11156.44 0.00 0.00 0.00 0.00 0.10 0.04 0.00 4.00 4.00 0.18 99.01
nvme1n1 2849.50 2748.51 11398.02 10994.06 0.00 0.00 0.00 0.00 0.10 0.04 0.00 4.00 4.00 0.18 99.01
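
As a sanity check on the tables above, the average queue size can be estimated from the other columns in the same row via Little's law (mean in-flight requests ≈ IOPS × mean latency). A minimal sketch using the numbers reported above (the helper function is mine, not part of iostat):

```python
def estimated_queue_depth(r_per_s, r_await_ms, w_per_s, w_await_ms):
    """Little's law: mean in-flight requests = arrival rate * mean latency.

    Rates are IOs/s and latencies are in milliseconds, hence the /1000.
    """
    return (r_per_s * r_await_ms + w_per_s * w_await_ms) / 1000.0

# Linux 4.4.0, nvme1n1: iostat reported avgqu-sz = 0.44
print(estimated_queue_depth(3248.00, 0.10, 3173.00, 0.03))  # ~0.42

# Linux 5.4.0, nvme0n1: iostat reported aqu-sz = 0.00, yet the same
# row's rates and latencies imply ~0.40 in-flight requests on average
print(estimated_queue_depth(2842.57, 0.10, 2789.11, 0.04))  # ~0.40
```

So the 0.00 aqu-sz on 5.4.0 is hard to reconcile with the rest of its own row.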

For another (possibly simpler) demonstration, I ran blktrace/blkparse on both the Linux 4.4.0 and 5.4.0 instances on the same hardware (i3.4xlarge), then used fio for a simple stress test that issues 5 write requests.

# fio -name=write -filename=/opt/testfile -direct=1 -thread -rw=write -ioengine=psync -bs=4k -size=4k -numjobs=5 -group_reporting=1 -runtime=5

The blkparse output shows that the Linux 4.4.0 instance had 5 "Writes Queued" while the Linux 5.4.0 one had 0. Additionally, there were 0 "IO unplugs" and 5 "Timer unplugs" on the Linux 4.4.0 instance and 5 "Timer unplugs" and 0 "IO unplugs" on the Linux 5.4.0 one.
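
Counting the per-CPU summary lines by hand is error-prone; the counters in blkparse's summary sections can be tallied with a short script. A sketch (assuming the summary format shown below; the function name is mine):

```python
import re

def tally_counters(summary, names=("Writes Queued", "IO unplugs", "Timer unplugs")):
    """Sum selected per-CPU counters from a blkparse summary.

    Matches lines such as "Writes Queued: 5, 20KiB" or "IO unplugs: 1",
    and stops at the "Total" section to avoid double counting.
    """
    per_cpu = summary.split("Total", 1)[0]
    return {
        name: sum(int(n) for n in re.findall(re.escape(name) + r":\s*(\d+)", per_cpu))
        for name in names
    }
```

Run over the two captures below, this makes the "Writes Queued" difference between the two kernels easy to confirm.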

// On the Bionic machine with Linux 5.4.0-1068-aws #72~18.04.1-Ubuntu SMP Thu Mar 3 08:49:49 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux:
# blktrace -d /dev/nvme0n1 -d /dev/nvme1n1 -o - | blkparse -i -
259,0 12 1 0.000000000 114067 A WS 401408 + 8 <- (9,127) 274432
259,0 12 2 0.000003433 114067 G WS 401408 + 8 [fio]
259,0 12 3 0.000004078 114067 P N [fio]
259,0 12 4 0.000004548 114067 U N [fio] 1
259,0 12 5 0.000005654 114067 D WS 401408 + 8 [fio]
259,0 4 1 0.000045379 0 C WS 401408 + 8 [0]
259,0 14 1 0.000210927 0 C WS 401408 + 8 [0]
259,0 14 2 0.000254591 114065 A WS 401408 + 8 <- (9,127) 274432
259,0 14 3 0.000257373 114065 G WS 401408 + 8 [fio]
259,0 14 4 0.000257853 114065 P N [fio]
259,0 14 5 0.000258340 114065 U N [fio] 1
259,0 14 6 0.000259293 114065 D WS 401408 + 8 [fio]
259,0 8 2 0.000290583 0 C WS 401408 + 8 [0]
259,0 13 1 0.000080964 114068 A WS 401408 + 8 <- (9,127) 274432
259,0 13 2 0.000083702 114068 G WS 401408 + 8 [fio]
259,0 13 3 0.000084247 114068 P N [fio]
259,0 13 4 0.000084654 114068 U N [fio] 1
259,0 13 5 0.000085801 114068 D WS 401408 + 8 [fio]
259,0 3 1 1266874889.709450585 114066 A WS 401408 + 8 <- (9,127) 274432
259,0 3 2 1266874889.709455280 114066 G WS 401408 + 8 [fio]
259,0 3 3 1266874889.709456267 114066 P N [fio]
259,0 3 4 1266874889.709456843 114066 U N [fio] 1
259,0 3 5 1266874889.709458888 114066 D WS 401408 + 8 [fio]
259,0 15 1 0.000164738 114069 A WS 401408 + 8 <- (9,127) 274432
259,0 15 2 0.000168016 114069 G WS 401408 + 8 [fio]
259,0 15 3 0.000168571 114069 P N [fio]
259,0 15 4 0.000169037 114069 U N [fio] 1
259,0 15 5 0.000169993 114069 D WS 401408 + 8 [fio]
259,0 10 1 0.000128797 0 C WS 401408 + 8 [0]
^C259,0 8 1 1266874889.709500996 0 C WS 401408 + 8 [0]
CPU3 (259,0):
Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 3
IO unplugs: 1 Timer unplugs: 0
CPU4 (259,0):
Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 1, 4KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 3
IO unplugs: 0 Timer unplugs: 0
CPU8 (259,0):
Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 2, 8KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 3
IO unplugs: 0 Timer unplugs: 0
CPU10 (259,0):
Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 1, 4KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 3
IO unplugs: 0 Timer unplugs: 0
CPU12 (259,0):
Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 3
IO unplugs: 1 Timer unplugs: 0
CPU13 (259,0):
Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 3
IO unplugs: 1 Timer unplugs: 0
CPU14 (259,0):
Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 1, 4KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 3
IO unplugs: 1 Timer unplugs: 0
CPU15 (259,0):
Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 3
IO unplugs: 1 Timer unplugs: 0

Total (259,0):
Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
Read Dispatches: 0, 0KiB Write Dispatches: 5, 20KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 5, 20KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
IO unplugs: 5 Timer unplugs: 0

Throughput (R/W): 0KiB/s / 0KiB/s
Events (259,0): 30 entries
Skips: 0 forward (0 - 0.0%)

// On the Xenial machine with 4.4.0-1137-aws #151-Ubuntu SMP Tue Mar 1 09:47:43 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
# blktrace -d /dev/nvme0n1 -d /dev/nvme1n1 -o - | blkparse -i -
259,1 1 1 0.000000000 82519 Q WS 399368 + 8 [fio]
259,1 1 2 0.000000980 82519 G WS 399368 + 8 [fio]
259,1 1 3 0.000003468 82519 UT N [fio] 1
259,1 1 4 0.000003867 82519 I WS 399368 + 8 [fio]
259,1 1 5 0.000011785 1393 D WS 399368 + 8 [kworker/1:1H]
259,1 2 1 1266874889.709465021 82517 Q WS 399368 + 8 [fio]
259,1 2 2 1266874889.709467133 82517 G WS 399368 + 8 [fio]
259,1 2 3 1266874889.709471568 82517 UT N [fio] 1
259,1 2 4 1266874889.709471568 82517 I WS 399368 + 8 [fio]
259,1 2 5 1266874889.709476205 980 D WS 399368 + 8 [kworker/2:1H]
259,1 15 1 0.000051309 0 C WS 399368 + 8 [0]
259,1 3 1 0.000079179 82520 Q WS 399368 + 8 [fio]
259,1 3 2 0.000080079 82520 G WS 399368 + 8 [fio]
259,1 3 3 0.000082140 82520 UT N [fio] 1
259,1 3 4 0.000082391 82520 I WS 399368 + 8 [fio]
259,1 3 5 0.000087819 975 D WS 399368 + 8 [kworker/3:1H]
259,1 12 1 0.000125455 0 C WS 399368 + 8 [0]
259,1 5 1 0.000152778 82521 Q WS 399368 + 8 [fio]
259,1 5 2 0.000153811 82521 G WS 399368 + 8 [fio]
259,1 5 3 0.000155811 82521 UT N [fio] 1
259,1 5 4 0.000156314 82521 I WS 399368 + 8 [fio]
259,1 5 5 0.000168577 1552 D WS 399368 + 8 [kworker/5:1H]
259,1 7 1 0.000206251 0 C WS 399368 + 8 [0]
259,1 9 1 0.000234725 82518 Q WS 399368 + 8 [fio]
259,1 9 2 0.000235496 82518 G WS 399368 + 8 [fio]
259,1 9 3 0.000237147 82518 UT N [fio] 1
259,1 9 4 0.000237382 82518 I WS 399368 + 8 [fio]
259,1 9 5 0.000242425 2001 D WS 399368 + 8 [kworker/9:1H]
259,1 15 2 0.000290161 0 C WS 399368 + 8 [0]
259,1 13 1 1266874889.709520875 0 C WS 399368 + 8 [0]
^CCPU1 (259,1):
Reads Queued: 0, 0KiB Writes Queued: 1, 4KiB
Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 2
IO unplugs: 0 Timer unplugs: 1
CPU2 (259,1):
Reads Queued: 0, 0KiB Writes Queued: 1, 4KiB
Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 2
IO unplugs: 0 Timer unplugs: 1
CPU3 (259,1):
Reads Queued: 0, 0KiB Writes Queued: 1, 4KiB
Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 2
IO unplugs: 0 Timer unplugs: 1
CPU5 (259,1):
Reads Queued: 0, 0KiB Writes Queued: 1, 4KiB
Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 2
IO unplugs: 0 Timer unplugs: 1
CPU7 (259,1):
Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 1, 4KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 2
IO unplugs: 0 Timer unplugs: 0
CPU9 (259,1):
Reads Queued: 0, 0KiB Writes Queued: 1, 4KiB
Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 2
IO unplugs: 0 Timer unplugs: 1
CPU12 (259,1):
Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 1, 4KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 2
IO unplugs: 0 Timer unplugs: 0
CPU13 (259,1):
Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 1, 4KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 2
IO unplugs: 0 Timer unplugs: 0
CPU15 (259,1):
Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 2, 8KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 2
IO unplugs: 0 Timer unplugs: 0

Total (259,1):
Reads Queued: 0, 0KiB Writes Queued: 5, 20KiB
Read Dispatches: 0, 0KiB Write Dispatches: 5, 20KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 5, 20KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
IO unplugs: 0 Timer unplugs: 5

Throughput (R/W): 0KiB/s / 0KiB/s
Events (259,1): 30 entries
Skips: 0 forward (0 - 0.0%)

Please CC me directly as I am not subscribed to LKML.

Thanks,

-- Michael Marod ([email protected])


2022-03-31 04:09:31

by Michael Marod

[permalink] [raw]
Subject: Re: NVME performance regression in Linux 5.x due to lack of block level IO queueing

Fix: There were 0 "IO unplugs" and 5 "Timer unplugs" on the Linux 4.4.0 instance and 5 "IO unplugs" and 0 "Timer unplugs" on the Linux 5.4.0 one.

> On Mar 30, 2022, at 10:31 AM, Michael Marod <[email protected]> wrote:
>
> It has been noticed in an AWS Cloud environment that the IO performance on an Ubuntu Bionic 18.04 i3 instance with Linux 5.4.0 is significantly worse than that of an Ubuntu Xenial 16.04 i3 instance with Linux 4.4.0. I've been digging into the issue and it seems like block level queueing is not working as well (if at all?) on the Linux 5 instance. When I've run performance tests using fio, iostat reports 0.00 average queue size for the Linux 5.4.0 instance but goes up to 0.5 or higher on the 4.4.0 one.

2022-03-31 04:20:52

by Michael Marod

[permalink] [raw]
Subject: Re: NVME performance regression in Linux 5.x due to lack of block level IO queueing

Cross-posting from linux-kernel to linux-block, since this primarily concerns the block layer.

I also realized that "Fix" was the wrong word in my previous email -- "Edit" would have made more sense. The issue still remains; I only meant to correct the wording of my first email.

Finally, I wanted to point out that on both systems the I/O scheduler is set to "none".
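
For reference, the active scheduler can be read from sysfs; the bracketed entry is the one in use. A minimal sketch (device names as in the setup above; the parse_active helper is mine):

```shell
# Print the current I/O scheduler for each NVMe device.
# The sysfs file looks like "[none] mq-deadline kyber", with the
# active scheduler shown in brackets.
for dev in nvme0n1 nvme1n1; do
  f="/sys/block/$dev/queue/scheduler"
  if [ -r "$f" ]; then
    printf '%s: %s\n' "$dev" "$(cat "$f")"
  fi
done

# Extract the bracketed (active) entry from such a line.
parse_active() { printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'; }
```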

Thanks,
Michael

> On Mar 30, 2022, at 10:49 AM, Michael Marod <[email protected]> wrote:
>
> Fix: There were 0 "IO unplugs" and 5 "Timer unplugs" on the Linux 4.4.0 instance and 5 "IO unplugs" and 0 "Timer unplugs" on the Linux 5.4.0 one.
>
>> On Mar 30, 2022, at 10:31 AM, Michael Marod <[email protected]> wrote:
>>
>> It has been noticed in an AWS Cloud environment that the IO performance on an Ubuntu Bionic 18.04 i3 instance with Linux 5.4.0 is significantly worse than that of an Ubuntu Xenial 16.04 i3 instance with Linux 4.4.0. I've been digging into the issue and it seems like block level queueing is not working as well (if at all?) on the Linux 5 instance. When I've run performance tests using fio, iostat reports 0.00 average queue size for the Linux 5.4.0 instance but goes up to 0.5 or higher on the 4.4.0 one.
>> 259,1 7 1 0.000206251 0 C WS 399368 + 8 [0]
>> 259,1 9 1 0.000234725 82518 Q WS 399368 + 8 [fio]
>> 259,1 9 2 0.000235496 82518 G WS 399368 + 8 [fio]
>> 259,1 9 3 0.000237147 82518 UT N [fio] 1
>> 259,1 9 4 0.000237382 82518 I WS 399368 + 8 [fio]
>> 259,1 9 5 0.000242425 2001 D WS 399368 + 8 [kworker/9:1H]
>> 259,1 15 2 0.000290161 0 C WS 399368 + 8 [0]
>> 259,1 13 1 1266874889.709520875 0 C WS 399368 + 8 [0]
>> ^CCPU1 (259,1):
>> Reads Queued: 0, 0KiB Writes Queued: 1, 4KiB
>> Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
>> Reads Requeued: 0 Writes Requeued: 0
>> Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
>> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
>> Read depth: 0 Write depth: 2
>> IO unplugs: 0 Timer unplugs: 1
>> CPU2 (259,1):
>> Reads Queued: 0, 0KiB Writes Queued: 1, 4KiB
>> Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
>> Reads Requeued: 0 Writes Requeued: 0
>> Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
>> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
>> Read depth: 0 Write depth: 2
>> IO unplugs: 0 Timer unplugs: 1
>> CPU3 (259,1):
>> Reads Queued: 0, 0KiB Writes Queued: 1, 4KiB
>> Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
>> Reads Requeued: 0 Writes Requeued: 0
>> Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
>> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
>> Read depth: 0 Write depth: 2
>> IO unplugs: 0 Timer unplugs: 1
>> CPU5 (259,1):
>> Reads Queued: 0, 0KiB Writes Queued: 1, 4KiB
>> Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
>> Reads Requeued: 0 Writes Requeued: 0
>> Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
>> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
>> Read depth: 0 Write depth: 2
>> IO unplugs: 0 Timer unplugs: 1
>> CPU7 (259,1):
>> Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
>> Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
>> Reads Requeued: 0 Writes Requeued: 0
>> Reads Completed: 0, 0KiB Writes Completed: 1, 4KiB
>> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
>> Read depth: 0 Write depth: 2
>> IO unplugs: 0 Timer unplugs: 0
>> CPU9 (259,1):
>> Reads Queued: 0, 0KiB Writes Queued: 1, 4KiB
>> Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
>> Reads Requeued: 0 Writes Requeued: 0
>> Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
>> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
>> Read depth: 0 Write depth: 2
>> IO unplugs: 0 Timer unplugs: 1
>> CPU12 (259,1):
>> Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
>> Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
>> Reads Requeued: 0 Writes Requeued: 0
>> Reads Completed: 0, 0KiB Writes Completed: 1, 4KiB
>> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
>> Read depth: 0 Write depth: 2
>> IO unplugs: 0 Timer unplugs: 0
>> CPU13 (259,1):
>> Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
>> Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
>> Reads Requeued: 0 Writes Requeued: 0
>> Reads Completed: 0, 0KiB Writes Completed: 1, 4KiB
>> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
>> Read depth: 0 Write depth: 2
>> IO unplugs: 0 Timer unplugs: 0
>> CPU15 (259,1):
>> Reads Queued: 0, 0KiB Writes Queued: 0, 0KiB
>> Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB
>> Reads Requeued: 0 Writes Requeued: 0
>> Reads Completed: 0, 0KiB Writes Completed: 2, 8KiB
>> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
>> Read depth: 0 Write depth: 2
>> IO unplugs: 0 Timer unplugs: 0
>>
>> Total (259,1):
>> Reads Queued: 0, 0KiB Writes Queued: 5, 20KiB
>> Read Dispatches: 0, 0KiB Write Dispatches: 5, 20KiB
>> Reads Requeued: 0 Writes Requeued: 0
>> Reads Completed: 0, 0KiB Writes Completed: 5, 20KiB
>> Read Merges: 0, 0KiB Write Merges: 0, 0KiB
>> IO unplugs: 0 Timer unplugs: 5
>>
>> Throughput (R/W): 0KiB/s / 0KiB/s
>> Events (259,1): 30 entries
>> Skips: 0 forward (0 - 0.0%)
>>
>> Please CC me directly as I am not subscribed to LKML.
>>
>> Thanks,
>>
>> -- Michael Marod ([email protected])
>


2022-03-31 05:25:41

by Christoph Hellwig

[permalink] [raw]
Subject: Re: NVME performance regression in Linux 5.x due to lack of block level IO queueing

5.4 is very old. Do you still have a problem with the latest kernel (5.17)?

2022-04-01 14:36:04

by Chaitanya Kulkarni

[permalink] [raw]
Subject: Re: NVME performance regression in Linux 5.x due to lack of block level IO queueing


> Thanks for the info. I don't know of anything block or nvme specific that might
> explain an order of magnitude perf difference.
>
> Could you try the same test without the filesystems? You mentioned using mdraid,
> so try '--filename=/dev/mdX'. If that also shows similar performance
> difference, try using one of your nvme member drives directly, like
> '--filename=/dev/nvme1n1'. That should isolate which subsystem is contributing
> to the difference.
>

Along with Keith's suggestions, you can always capture perf numbers and
compare before and after, mainly for nvme_queue_rq() and nvme_irq().

-ck
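The perf comparison suggested above could be sketched as follows (a minimal sketch: the 30-second window and output file names are arbitrary choices, and the whole thing is guarded so it is a no-op on machines without perf):

```shell
# Record a system-wide profile on each kernel while the fio job runs,
# then compare the relative cost of nvme_queue_rq() and nvme_irq().
if command -v perf >/dev/null 2>&1; then
    # System-wide, with call graphs, for 30 seconds:
    perf record -a -g -o "perf-$(uname -r).data" -- sleep 30
    # Summarize and pull out the two symbols of interest:
    perf report --stdio -i "perf-$(uname -r).data" \
        | grep -E 'nvme_queue_rq|nvme_irq'
fi
```

Running this once per kernel and diffing the two grep results shows whether either function's share of CPU time changed between versions.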


2022-04-02 15:04:09

by Michael Marod

[permalink] [raw]
Subject: Re: NVME performance regression in Linux 5.x due to lack of block level IO queueing

Good call -- turns out that cache issue is resolved in 5.17. I tried a number of kernels and narrowed it down to a problem that started after 4.9 and before 4.15, and ended some time after 5.13: 4.9 is good, 4.15 is bad, 5.13 is bad, and 5.17 is good. I did not bisect it all the way down to the specific versions where the behavior changed.

Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme1n1 2758.00 2783.00 11032.00 11132.00 0.00 0.00 0.00 0.00 0.10 0.03 0.36 4.00 4.00 0.18 100.00
nvme0n1 2830.00 2875.00 11320.00 11500.00 0.00 0.00 0.00 0.00 0.10 0.03 0.39 4.00 4.00 0.18 100.00
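For reference, iostat derives aqu-sz from the weighted time-in-queue counter in /proc/diskstats (the 11th statistics field after the device name): the delta between two samples, divided by the sampling interval, gives the average queue size. A minimal sketch of that arithmetic, with illustrative numbers:

```shell
# aqu_sz: average queue size from two samples of the weighted
# time-in-queue counter (milliseconds), taken interval_ms apart.
aqu_sz() {
    # $1 = counter at t0 (ms), $2 = counter at t1 (ms), $3 = interval (ms)
    awk -v a="$1" -v b="$2" -v i="$3" 'BEGIN { printf "%.2f\n", (b - a) / i }'
}

# 440 ms of weighted queue time accumulated over a 1000 ms interval,
# as on the 4.4.0 box earlier in the thread, gives an aqu-sz of 0.44:
aqu_sz 1000 1440 1000
```

So an aqu-sz of 0.00 means the counter barely moved between samples, i.e. requests spent essentially no measured time queued at the block layer.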

With regard to performance between 4.4.0 and 5.17: for a single thread, 4.4.0 still outperformed 5.17. However, 5.17 was significantly better with multiple threads. In fact, it is so much better I hardly believe the results (a 10x improvement!). Is it expected that a single thread would be slower on 5.17, but recent improvements make it possible to run many threads in parallel more efficiently?

# /usr/local/bin/fio -name=randrw -filename=/opt/foo -direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=4k -size=10G -numjobs=16 -group_reporting=1 -runtime=120

// Ubuntu 16.04 / Linux 4.4.0:
Run status group 0 (all jobs):
READ: bw=54.5MiB/s (57.1MB/s), 54.5MiB/s-54.5MiB/s (57.1MB/s-57.1MB/s), io=6537MiB (6854MB), run=120002-120002msec
WRITE: bw=54.5MiB/s (57.2MB/s), 54.5MiB/s-54.5MiB/s (57.2MB/s-57.2MB/s), io=6544MiB (6862MB), run=120002-120002msec

// Ubuntu 18.04 / Linux 5.4.0:
Run status group 0 (all jobs):
READ: bw=23.5MiB/s (24.7MB/s), 23.5MiB/s-23.5MiB/s (24.7MB/s-24.7MB/s), io=2821MiB (2959MB), run=120002-120002msec
WRITE: bw=23.5MiB/s (24.6MB/s), 23.5MiB/s-23.5MiB/s (24.6MB/s-24.6MB/s), io=2819MiB (2955MB), run=120002-120002msec

// Ubuntu 18.04 / Linux 5.17:
Run status group 0 (all jobs):
READ: bw=244MiB/s (255MB/s), 244MiB/s-244MiB/s (255MB/s-255MB/s), io=28.6GiB (30.7GB), run=120001-120001msec
WRITE: bw=244MiB/s (256MB/s), 244MiB/s-244MiB/s (256MB/s-256MB/s), io=28.6GiB (30.7GB), run=120001-120001msec

Thanks,
Michael

2022-04-04 04:51:08

by Keith Busch

[permalink] [raw]
Subject: Re: NVME performance regression in Linux 5.x due to lack of block level IO queueing

On Thu, Mar 31, 2022 at 11:22:03PM +0000, Michael Marod wrote:
> # /usr/local/bin/fio -name=randrw -filename=/opt/foo -direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=4k -size=10G -numjobs=16 -group_reporting=1 -runtime=120
>
> // Ubuntu 16.04 / Linux 4.4.0:
> Run status group 0 (all jobs):
> READ: bw=54.5MiB/s (57.1MB/s), 54.5MiB/s-54.5MiB/s (57.1MB/s-57.1MB/s), io=6537MiB (6854MB), run=120002-120002msec
> WRITE: bw=54.5MiB/s (57.2MB/s), 54.5MiB/s-54.5MiB/s (57.2MB/s-57.2MB/s), io=6544MiB (6862MB), run=120002-120002msec
>
> // Ubuntu 18.04 / Linux 5.4.0:
> Run status group 0 (all jobs):
> READ: bw=23.5MiB/s (24.7MB/s), 23.5MiB/s-23.5MiB/s (24.7MB/s-24.7MB/s), io=2821MiB (2959MB), run=120002-120002msec
> WRITE: bw=23.5MiB/s (24.6MB/s), 23.5MiB/s-23.5MiB/s (24.6MB/s-24.6MB/s), io=2819MiB (2955MB), run=120002-120002msec
>
> // Ubuntu 18.04 / Linux 5.17:
> Run status group 0 (all jobs):
> READ: bw=244MiB/s (255MB/s), 244MiB/s-244MiB/s (255MB/s-255MB/s), io=28.6GiB (30.7GB), run=120001-120001msec
> WRITE: bw=244MiB/s (256MB/s), 244MiB/s-244MiB/s (256MB/s-256MB/s), io=28.6GiB (30.7GB), run=120001-120001msec

Thanks for the info. I don't know of anything block or nvme specific that might
explain an order of magnitude perf difference.

Could you try the same test without the filesystems? You mentioned using mdraid,
so try '--filename=/dev/mdX'. If that also shows similar performance
difference, try using one of your nvme member drives directly, like
'--filename=/dev/nvme1n1'. That should isolate which subsystem is contributing
to the difference.
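The isolation steps above, sketched as commands. The device paths are taken from the thread's raid0 setup, so double-check them against yours; the wrapper only prints the commands unless RUN=yes, since random writes to a raw block device destroy whatever filesystem is on it:

```shell
# Guarded runner: prints the fio invocation as a dry run unless RUN=yes.
run_fio() {
    [ "${RUN:-no}" = yes ] || { echo "dry-run: fio $*"; return 0; }
    fio "$@"
}

# Same job parameters as the thread's multi-thread test:
COMMON="--name=randrw --direct=1 --thread --rw=randrw --ioengine=psync \
--bs=4k --size=10G --numjobs=16 --group_reporting=1 --runtime=120"

# 1) Bypass the filesystem but keep mdraid in the path:
run_fio $COMMON --filename=/dev/md127

# 2) Bypass mdraid too and hit one member drive directly:
run_fio $COMMON --filename=/dev/nvme1n1
```

Comparing the three results (filesystem, md device, raw drive) on each kernel shows whether the filesystem, mdraid, or the nvme driver contributes the difference.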