Hey Team,
I am sending this e-mail to report a performance regression caused by
commit 244adf6426 ("ext4: make dioread_nolock the default"). The
symptoms of the regression and our analysis are listed below.
Reproduction Environment:
m5.24xlarge AWS EC2 instance with 96 vCPUs & 384 GiB of memory
Disk: io1 16 TiB AWS EBS volume that should be able to achieve 64,000
IOPS with a throughput of 1000 MiB/s.
Reproduction steps:
We have seen a sharp regression in the achieved disk IOPS and
throughput when running the random buffered-write fio test shown below.
Format and mount:
$ sudo mkdir -p /dbdata
$ mke2fs -E lazy_itable_init=0,lazy_journal_init=0 -m 1 -t ext4 -b 4096 /dev/nvme1n1
$ mount -t ext4 -o noatime,nodiratime,data=ordered /dev/nvme1n1 /dbdata
Mount options:
/dev/nvme1n1 on /dbdata type ext4 (rw,noatime,nodiratime,data=ordered)
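Note that dioread_nolock is the default on these kernels and therefore
does not show up in the mount output above; the effective options can
be checked through procfs (illustrative check, device name as in the
repro above):

    # should list dioread_nolock when it is in effect
    cat /proc/fs/ext4/nvme1n1/options | grep dioread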
Running the below script:
[root@ip-xx-xx-xx-xx ec2-user]# cat fio_test.sh
#!/bin/bash
for i in `seq 1 2`; do
        rm -rf /dbdata/*
        /usr/bin/fio --name=16kb_rand_write_only_2048_jobs \
                --directory=/dbdata --rw=randwrite --ioengine=sync --fdatasync=1 \
                --buffered=1 --bs=16k --max-jobs=2048 --numjobs=2048 --runtime=30 \
                --thread --filesize=28800000 --fsync=1 --group_reporting \
                --create_only=1 > /dev/null
        echo 1 > /proc/sys/vm/drop_caches
        echo "start test ${i}"
        /usr/bin/fio --name=16kb_rand_write_only_2048_jobs \
                --directory=/dbdata --rw=randwrite --ioengine=sync --fdatasync=1 \
                --buffered=1 --bs=16k --max-jobs=2048 --numjobs=2048 --runtime=60 \
                --time_based --thread --filesize=28800000 --fsync=1 --group_reporting \
                | grep iops
done
Symptoms:
The regression was first seen after upgrading from kernel 4.14 to 5.10,
as shown below. It is still reproducible on the latest upstream kernel
5.19, as the previously mentioned commit has been in the Linux kernel
since 5.6.
Kernel 4.14.287-215.504.amzn2.x86_64
[root@ip-xx-xx-xx-xx ec2-user]# ./fio_test.sh
start test 1
write: io=56335MB, bw=320416KB/s, iops=20026, runt=180038msec
start test 2
write: io=55111MB, bw=313421KB/s, iops=19588, runt=180056msec
Kernel 5.10.130-118.517.amzn2.x86_64
[root@ip-xx-xx-xx-xx ec2-user]# ./fio_test.sh
start test 1
write: io=911744KB, bw=5055.5KB/s, iops=315, runt=180350msec
start test 2
write: io=1075.6MB, bw=6089.4KB/s, iops=380, runt=180872msec
Analysis:
A kernel bisect pointed to the previously mentioned commit as the
culprit behind the reported regression. We tried remounting the ext4
filesystem with dioread_nolock disabled, and that was enough to bring
performance back to what it was before the commit (see the sketch
below).
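For reference, the remount that restored the old behaviour was along
these lines (illustrative, using the mountpoint from the repro above):

    # dioread_lock / nodioread_nolock turn the new default off again
    mount -o remount,dioread_lock /dbdata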
We did some performance debugging using perf-based flame graphs.
The dioread_nolock flame graphs show that 93.47% of the time is spent
in fsync calls with the following path:
fsync -> entry_SYSCALL_64_after_hwframe -> do_syscall_64 -> __x64_sys_fsync -> ext4_sync_file -> file_write_and_wait_range -> __filemap_fdatawrite_range -> do_writepages -> ext4_writepages -> __ext4_journal_start_sb -> jbd2__journal_start -> start_this_handle -> add_transaction_credits -> prepare_to_wait_event -> _raw_spin_lock_irqsave -> native_queued_spin_lock_slowpath
With dioread_lock, 51% of the time is spent in:
fsync -> entry_SYSCALL_64_after_hwframe -> do_syscall_64 -> __x64_sys_fsync -> ext4_sync_file -> jbd2_log_wait_commit -> prepare_to_wait_event -> _raw_spin_lock_irqsave -> native_queued_spin_lock_slowpath
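For anyone who wants to redo the profiling: the data was collected with
standard perf tooling, roughly as sketched below (illustrative
invocations, not the exact command lines we used; stackcollapse-perf.pl
and flamegraph.pl are Brendan Gregg's FlameGraph scripts and are
assumed to be on PATH):

    # record system-wide stacks for one minute while fio is running
    # (once per mount option)
    perf record -a -g -o perf.data -- sleep 60
    # render a flame graph from the recorded stacks
    perf script -i perf.data | stackcollapse-perf.pl | flamegraph.pl > fio-fsync.svg
    # compare two recordings (e.g. dioread_nolock vs dioread_lock)
    perf diff perf.data.old perf.data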
A quick perf diff between the dioread_nolock run (baseline) and the
dioread_lock run shows, somewhat ironically, higher spin-lock
contention with dioread_nolock:
# Event 'cycles'
#
# Baseline  Delta Abs  Shared Object      Symbol
# ........  .........  .................  .................................................
#
    96.46%    -14.86%  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
     1.05%     +2.19%  fio                [.] 0x0000000000019f00
     0.00%     +1.52%  [kernel.kallsyms]  [k] start_this_handle
     0.04%     +1.31%  [kernel.kallsyms]  [k] osq_lock
     0.03%     +0.73%  [kernel.kallsyms]  [k] mwait_idle_with_hints.constprop.0
     0.01%     +0.60%  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
     0.01%     +0.58%  [kernel.kallsyms]  [k] _raw_read_lock
     0.03%     +0.48%  [kernel.kallsyms]  [k] add_transaction_credits
     0.01%     +0.34%  [kernel.kallsyms]  [k] queued_read_lock_slowpath
The flame graphs also show that with dioread_nolock enabled, most of
the CPU time is spent under the add_transaction_credits function, which
is responsible for making sure there is enough journal space before
adding the handle to the running transaction.
While running the reproduction with dioread_nolock enabled and using
/proc/sysrq we were able to see that the fio processes are waiting for
journal space most of the time, which is why the lock contention on
journal->j_state_lock was higher with dioread_nolock enabled.
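(For completeness, the blocked-task dumps below were captured via the
sysrq interface, i.e. something along the lines of:

    # dump all tasks in uninterruptible (D) state to the kernel log
    echo w > /proc/sysrq-trigger

with the output then read from dmesg / the system log.)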
Aug 19 14:52:09 ip-172-31-8-128 kernel: task:fio state:D stack:    0 pid:34280 ppid: 26341 flags:0x00000080
Aug 19 14:52:09 ip-172-31-8-128 kernel: Call Trace:
Aug 19 14:52:09 ip-172-31-8-128 kernel: __schedule+0x1f9/0x660
Aug 19 14:52:09 ip-172-31-8-128 kernel: schedule+0x46/0xb0
Aug 19 14:52:09 ip-172-31-8-128 kernel: schedule_preempt_disabled+0xa/0x10
Aug 19 14:52:09 ip-172-31-8-128 kernel: __mutex_lock.constprop.0+0x12d/0x460
Aug 19 14:52:09 ip-172-31-8-128 kernel: mutex_lock_io+0x39/0x50
Aug 19 14:52:09 ip-172-31-8-128 kernel: __jbd2_log_wait_for_space+0xc5/0x1a0 [jbd2]
Aug 19 14:52:09 ip-172-31-8-128 kernel: add_transaction_credits+0x288/0x2a0 [jbd2]
Aug 19 14:52:09 ip-172-31-8-128 kernel: start_this_handle+0x12d/0x4d0 [jbd2]
Aug 19 14:52:09 ip-172-31-8-128 kernel: ? jbd2__journal_start+0x8d/0x1e0 [jbd2]
Aug 19 14:52:09 ip-172-31-8-128 kernel: ? kmem_cache_alloc+0x132/0x260
Aug 19 14:52:09 ip-172-31-8-128 kernel: jbd2__journal_start+0xf7/0x1e0 [jbd2]
Aug 19 14:52:09 ip-172-31-8-128 kernel: __ext4_journal_start_sb+0xf3/0x110 [ext4]
Aug 19 14:52:09 ip-172-31-8-128 kernel: ext4_dirty_inode+0x3d/0x80 [ext4]
Aug 19 14:52:09 ip-172-31-8-128 kernel: __mark_inode_dirty+0x192/0x2f0
Aug 19 14:52:09 ip-172-31-8-128 kernel: generic_update_time+0x68/0xc0
Aug 19 14:52:09 ip-172-31-8-128 kernel: file_update_time+0x123/0x140
Aug 19 14:52:09 ip-172-31-8-128 kernel: ? generic_write_checks+0x61/0xc0
Aug 19 14:52:09 ip-172-31-8-128 kernel: ext4_buffered_write_iter+0x5a/0x160 [ext4]
Aug 19 14:52:09 ip-172-31-8-128 kernel: new_sync_write+0x11c/0x1b0
Aug 19 14:52:09 ip-172-31-8-128 kernel: vfs_write+0x1bd/0x260
Aug 19 14:52:09 ip-172-31-8-128 kernel: ksys_write+0x5f/0xe0
Aug 19 14:52:09 ip-172-31-8-128 kernel: do_syscall_64+0x33/0x40
Aug 19 14:52:09 ip-172-31-8-128 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 19 14:52:09 ip-172-31-8-128 kernel: RIP: 0033:0x7f13bde28559
Aug 19 14:52:09 ip-172-31-8-128 kernel: RSP: 002b:00007f04e0dde8e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
Aug 19 14:52:09 ip-172-31-8-128 kernel: RAX: ffffffffffffffda RBX: 00007f0fa8005280 RCX: 00007f13bde28559
Aug 19 14:52:09 ip-172-31-8-128 kernel: RDX: 0000000000004000 RSI: 00007f0fa8005280 RDI: 00000000000006bf
Aug 19 14:52:09 ip-172-31-8-128 kernel: RBP: 0000000000004000 R08: 0000000000000000 R09: 0000000000000001
Aug 19 14:52:09 ip-172-31-8-128 kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007f13a9e562c8
Aug 19 14:52:09 ip-172-31-8-128 kernel: R13: 0000000000004000 R14: 00007f13a9e63c40 R15: 00007f13a9e63ae0
We did the same with dioread_lock enabled and couldn't see any
reference to the __jbd2_log_wait_for_space function, which is the one
responsible for waiting for enough journal space to become available.
Aug 19 15:22:35 ip-172-31-8-128 kernel: task:fio state:D stack:    0 pid:77710 ppid: 71030 flags:0x00004080
Aug 19 15:22:35 ip-172-31-8-128 kernel: Call Trace:
Aug 19 15:22:35 ip-172-31-8-128 kernel: __schedule+0x1f9/0x660
Aug 19 15:22:35 ip-172-31-8-128 kernel: schedule+0x46/0xb0
Aug 19 15:22:35 ip-172-31-8-128 kernel: wait_transaction_locked+0x8a/0xd0 [jbd2]
Aug 19 15:22:35 ip-172-31-8-128 kernel: ? add_wait_queue_exclusive+0x70/0x70
Aug 19 15:22:35 ip-172-31-8-128 kernel: add_transaction_credits+0xd6/0x2a0 [jbd2]
Aug 19 15:22:35 ip-172-31-8-128 kernel: start_this_handle+0x12d/0x4d0 [jbd2]
Aug 19 15:22:35 ip-172-31-8-128 kernel: ? jbd2__journal_start+0x8d/0x1e0 [jbd2]
Aug 19 15:22:35 ip-172-31-8-128 kernel: ? kmem_cache_alloc+0x132/0x260
Aug 19 15:22:35 ip-172-31-8-128 kernel: jbd2__journal_start+0xf7/0x1e0 [jbd2]
Aug 19 15:22:35 ip-172-31-8-128 kernel: __ext4_journal_start_sb+0xf3/0x110 [ext4]
Aug 19 15:22:35 ip-172-31-8-128 kernel: ext4_dirty_inode+0x3d/0x80 [ext4]
Aug 19 15:22:35 ip-172-31-8-128 kernel: __mark_inode_dirty+0x192/0x2f0
Aug 19 15:22:35 ip-172-31-8-128 kernel: generic_update_time+0x68/0xc0
Aug 19 15:22:35 ip-172-31-8-128 kernel: file_update_time+0x123/0x140
Aug 19 15:22:35 ip-172-31-8-128 kernel: ? generic_write_checks+0x61/0xc0
Aug 19 15:22:35 ip-172-31-8-128 kernel: ext4_buffered_write_iter+0x5a/0x160 [ext4]
Aug 19 15:22:35 ip-172-31-8-128 kernel: new_sync_write+0x11c/0x1b0
Aug 19 15:22:35 ip-172-31-8-128 kernel: vfs_write+0x1bd/0x260
Aug 19 15:22:35 ip-172-31-8-128 kernel: ksys_write+0x5f/0xe0
Aug 19 15:22:35 ip-172-31-8-128 kernel: do_syscall_64+0x33/0x40
Aug 19 15:22:35 ip-172-31-8-128 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 19 15:22:35 ip-172-31-8-128 kernel: RIP: 0033:0x7fd6d8083559
Aug 19 15:22:35 ip-172-31-8-128 kernel: RSP: 002b:00007fd59d7f78e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
Aug 19 15:22:35 ip-172-31-8-128 kernel: RAX: ffffffffffffffda RBX: 00007fd580000dd0 RCX: 00007fd6d8083559
Aug 19 15:22:35 ip-172-31-8-128 kernel: RDX: 0000000000004000 RSI: 00007fd580000dd0 RDI: 0000000000000088
Aug 19 15:22:35 ip-172-31-8-128 kernel: RBP: 0000000000004000 R08: 0000000000000000 R09: 0000000000000001
Aug 19 15:22:35 ip-172-31-8-128 kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007fd6bfe33b08
Aug 19 15:22:35 ip-172-31-8-128 kernel: R13: 0000000000004000 R14: 00007fd6bfe41480 R15: 00007fd6bfe41320
Running the funclatency BCC script also shows that fsync operations
take much longer with dioread_nolock enabled:
dioread_nolock enabled
operation = open
msecs : count distribution
0 -> 1 : 4086 |****************************************|
operation = fsync
msecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 92 |******************** |
16 -> 31 : 153 |********************************** |
32 -> 63 : 177 |****************************************|
64 -> 127 : 18 |**** |
128 -> 255 : 77 |***************** |
256 -> 511 : 15 |*** |
512 -> 1023 : 19 |**** |
1024 -> 2047 : 51 |*********** |
2048 -> 4095 : 137 |****************************** |
4096 -> 8191 : 82 |****************** |
8192 -> 16383 : 118 |************************** |
16384 -> 32767 : 104 |*********************** |
32768 -> 65535 : 140 |******************************* |
operation = write
msecs : count distribution
0 -> 1 : 10265 |****************************************|
2 -> 3 : 491 |* |
4 -> 7 : 433 |* |
8 -> 15 : 670 |** |
16 -> 31 : 1043 |**** |
32 -> 63 : 438 |* |
64 -> 127 : 177 | |
128 -> 255 : 129 | |
256 -> 511 : 14 | |
512 -> 1023 : 80 | |
1024 -> 2047 : 24 | |
dioread_lock enabled
operation = fsync
msecs : count distribution
0 -> 1 : 1 | |
2 -> 3 : 115 | |
4 -> 7 : 795 | |
8 -> 15 : 2402 | |
16 -> 31 : 3226 |* |
32 -> 63 : 6271 |** |
64 -> 127 : 109773 |****************************************|
128 -> 255 : 222 | |
operation = write
msecs : count distribution
0 -> 1 : 698243 |****************************************|
2 -> 3 : 967 | |
4 -> 7 : 3233 | |
8 -> 15 : 16189 | |
16 -> 31 : 19443 |* |
32 -> 63 : 47206 |** |
64 -> 127 : 6850 | |
128 -> 255 : 149 | |
operation = open
msecs : count distribution
0 -> 1 : 8140 |****************************************|
Further tracing shows that the performance degradation is visible in
the funclatency histogram of add_transaction_credits, where we wait
until we can add credits for the handle to the running transaction.
With dioread_nolock it takes 154 msec on average, versus 0.27 msec with
dioread_lock enabled.
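The add_transaction_credits histograms below come from the bcc
funclatency tool; the invocation was roughly as follows (illustrative,
the tool path varies by distribution, default output is in nanoseconds):

    /usr/share/bcc/tools/funclatency add_transaction_credits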
dioread_nolock enabled
3 warnings generated.
Tracing 1 functions for "add_transaction_credits"... Hit Ctrl-C to end.
^C
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 12677 |********************|
256 -> 511 : 5374 |******** |
512 -> 1023 : 6898 |********** |
1024 -> 2047 : 4114 |****** |
2048 -> 4095 : 3736 |***** |
4096 -> 8191 : 1954 |*** |
8192 -> 16383 : 329 | |
16384 -> 32767 : 111 | |
32768 -> 65535 : 157 | |
65536 -> 131071 : 191 | |
131072 -> 262143 : 234 | |
262144 -> 524287 : 169 | |
524288 -> 1048575 : 190 | |
1048576 -> 2097151 : 270 | |
2097152 -> 4194303 : 459 | |
4194304 -> 8388607 : 2203 |*** |
8388608 -> 16777215 : 6032 |********* |
16777216 -> 33554431 : 4827 |******* |
33554432 -> 67108863 : 3846 |****** |
67108864 -> 134217727 : 3720 |***** |
134217728 -> 268435455 : 6606 |********** |
268435456 -> 536870911 : 6502 |********** |
536870912 -> 1073741823 : 5072 |******** |
1073741824 -> 2147483647 : 2180 |*** |
2147483648 -> 4294967295 : 229 | |
4294967296 -> 8589934591 : 7 | |
avg = 154580459 nsecs, total: 12084945771325 nsecs, count: 78179
dioread_lock
3 warnings generated.
Tracing 1 functions for "add_transaction_credits"... Hit Ctrl-C to end.
^C
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 54708 | |
256 -> 511 : 952229 |************ |
512 -> 1023 : 2954850 |****************************************|
1024 -> 2047 : 2833787 |************************************** |
2048 -> 4095 : 2843252 |************************************** |
4096 -> 8191 : 1952274 |************************** |
8192 -> 16383 : 385997 |***** |
16384 -> 32767 : 56495 | |
32768 -> 65535 : 39379 | |
65536 -> 131071 : 27116 | |
131072 -> 262143 : 14277 | |
262144 -> 524287 : 5726 | |
524288 -> 1048575 : 3224 | |
1048576 -> 2097151 : 2192 | |
2097152 -> 4194303 : 6344 | |
4194304 -> 8388607 : 11606 | |
8388608 -> 16777215 : 21528 | |
16777216 -> 33554431 : 42377 | |
33554432 -> 67108863 : 40907 | |
67108864 -> 134217727 : 801 | |
134217728 -> 268435455 : 64 | |
avg = 276573 nsecs, total: 3419652820836 nsecs, count: 12364366
We thought we might be seeing another manifestation of
https://patchwork.ozlabs.org/project/linux-ext4/patch/[email protected]/
but unfortunately unsetting EXT4_GET_BLOCKS_PRE_IO in
https://elixir.bootlin.com/linux/v5.10.135/source/fs/ext4/inode.c#L2410
didn't make any difference; it did not bring performance back to the
level we see with dioread_lock enabled.
Below are the workarounds that reclaimed the performance we had before
commit 244adf6426 ("ext4: make dioread_nolock the default");
illustrative commands for the second and third one follow after the list:
1) Mounting with dioread_lock.
2) Enabling nodelalloc: since commit c8980e1980 ("ext4: disable
   dioread_nolock whenever delayed allocation is disabled"),
   dioread_nolock is disabled whenever delayed allocation is disabled.
3) Increasing the ext4 journal size from 128 MiB to 1 GiB also fixes
   the problem.
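A hedged sketch of workarounds 2) and 3), using the device and options
from the repro above (the -J size value is in MiB; an existing journal
can instead be resized with tune2fs, as Ted describes later in this
thread):

    # workaround 2: disable delayed allocation, which also turns off dioread_nolock
    mount -t ext4 -o noatime,nodiratime,data=ordered,nodelalloc /dev/nvme1n1 /dbdata

    # workaround 3: format with a 1 GiB journal instead of the 128 MiB we got
    mke2fs -E lazy_itable_init=0,lazy_journal_init=0 -m 1 -t ext4 -b 4096 -J size=1024 /dev/nvme1n1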
We know that dioread_nolock is mainly beneficial for parallel direct
reads, since with it enabled we avoid the DIO read locking. With
buffered writes, however, ext4 allocates an uninitialized (unwritten)
extent before the buffered write and converts it to initialized after
the I/O completes, and this seems to be the cause of the reported
regression. We should also note that this is only beneficial for
extent-based files. The ext4 documentation
(https://www.kernel.org/doc/Documentation/filesystems/ext4.txt) still
states that dioread_lock is the default, which is no longer the case
since commit 244adf6426 ("ext4: make dioread_nolock the default"), so
the documentation will need updating unless we revert the previously
mentioned commit for the reasons listed above.
I think we should revert the previously mentioned commit, as this is a
clear performance regression; otherwise, if this is working as
designed, we need to make it explicit in the documentation and commit
message that this option has higher journaling overhead and requires
more journal space. I will submit a revert patch for commit 244adf6426
("ext4: make dioread_nolock the default") if that's what we agree on
after the discussion.
Thank you.
Hazem
On 19.09.22 17:18, hazem ahmed mohamed wrote:
>
> I am sending this e-mail to report a performance regression that’s
> caused by commit 244adf6426(ext4: make dioread_nolock the default) , I
> am listing the performance regression symptoms below & our analysis
> for the reported regression.
FWIW, that patch went into v5.6-rc1~113^2~12
And BTW: it seems 0-day noticed back then that 244adf6426 caused a
performance regression as well, but that apparently was ignored:
https://lore.kernel.org/all/20201024120829.GK31092@shao2-debian/
Anyway, now to the main reason why I write this mail:
[TLDR: I'm adding this regression report to the list of tracked
regressions; all text from me you find below is based on a few template
paragraphs you might have encountered already in similar form.]
Thanks for the report. To be sure below issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, my Linux kernel regression
tracking bot:
#regzbot ^introduced 244adf6426
#regzbot ignore-activity
This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply -- ideally with also
telling regzbot about it, as explained here:
https://linux-regtracking.leemhuis.info/tracked-regression/
Reminder for developers: When fixing the issue, add 'Link:' tags
pointing to the report (the mail this one replies to), as explained in
the Linux kernel's documentation; the webpage mentioned at the end of
the last paragraph explains why this is important for tracked regressions.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.
Looks like I replied only to Thorsten instead of replying to all, so I
am moving the discussion to the wider thread for better visibility.
---------- Forwarded message ---------
From: Thorsten Leemhuis <[email protected]>
Date: Tue, 20 Sept 2022 at 14:31
Subject: Re: Ext4: Buffered random writes performance regression with
dioread_nolock enabled
To: hazem ahmed mohamed <[email protected]>
On 20.09.22 15:21, hazem ahmed mohamed wrote:
> Thanks Thorsten, I am surprised that we merged this commit while it
> has been showing regression since Day-0, unless there is an objection
> I will submit a revert patch until we know what's going on here.
Please keep replies public, then others can learn from the conversation
and weigh in. In this particular case I'd explained that a quick revert
after all this time is likely a bad thing, as there is always a risk
that it creates regressions of its own. :-/
Ciao, Thorsten
On Mon, Sep 19, 2022 at 03:06:46PM +0000, Mohamed Abuelfotoh, Hazem wrote:
> Hey Team,
>
>
> * I am sending this e-mail to report a performance regression that’s caused by commit 244adf6426(ext4: make dioread_nolock the default) , I am listing the performance regression symptoms below & our analysis for the reported regression.
Performance regressions are always tricky; dioread_nolock improves on
some workloads, and can cause regressions for others. In this
particular case, the choice to make it the default was to also fix a
direct I/O vs. writeback race which can result in stale data being
revealed (which is a security issue).
That being said...
1) as you've noted, this commit has been around since 5.6.
2) as you noted,
Increasing the journal size from ext4 128 MiB to 1GiB will also
fix the problem .
Since 2016, the commit bbd2f78cf63a ("libext2fs: allow the default
journal size to go as large as a gigabyte") has been in e2fsprogs
v1.43.2 and newer (the current version of e2fsprogs is v1.46.5; v1.43.2
was released in September 2016, six years ago). Quoting the commit
description:
Recent research has shown that for a metadata-heavy workload, a 128 MB
journal can be a bottleneck on HDDs, and that the optimal journal size
is proportional to the number of unique metadata blocks that can be
modified (and written into the journal) in a 30 second window. One
gigabyte should be sufficient for most workloads, which will be used
for file systems larger than 128 gigabytes.
So this should not be a problem in practice, and if there are users
who are using antedeluvian versions of e2fsprogs, or who have old file
systems which were created many years ago, it's quite easy to adjust
the journal size. For example, to adjust the journal to be 2GiB (2048
MiB), just run the commands:
tune2fs -O ^has_journal /dev/sdXX
tune2fs -O has_journal -J size=2048 /dev/sdXX
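A quick way to double-check the resulting journal size is to look at
the journal inode (inode 8) with debugfs from e2fsprogs, e.g.:
    debugfs -R 'stat <8>' /dev/sdXX
and read the Size: field of that inode.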
Hence, I disagree that we should revert commit 244adf6426. It may be
that for your workload and your file system configuration, using the
mount option nodioread_nolock (or dioread_lock) may make sense. But
there were also workloads for which using dioread_nolock improved
benchmark numbers, so the question of which is the better default is
not at all obvious.
That being said, I do have plans for a new writeback scheme which will
replace dioread_nolock *and* dioread_lock, and which will hopefully be
faster than either approach.
- Ted
P.S. I'm puzzled by your comment, "we have to note that this should
be only beneficial with extent-based files" --- while this is true,
why does this matter? Unless you're dealing with an ancient file
system that was originally created as ext2 or ext3 and then converted
to ext4, *all* ext4 files should be extent-based...
Hazem started separate e-mail threads on this issue (separated by
about an hour), and I replied to the earlier one here:
https://lore.kernel.org/all/[email protected]/
TL;DR:
1) The patch landed in 5.6, and improved performance for some
workloads, and also fixed a potential security problem (exposure of
stale data caused by a race).
2) If you are using a storage device >= 128GB and e2fsprogs v1.43.2
(released six years ago) or newer, the journal size will be 1GB, which
Hazem reported resolved the problem.
3) I disagree that we should revert this commit, as it only changes a
default. If you prefer the older behavior, you can change it with a
mount option.
- Ted
On 22/09/19 04:18PM, hazem ahmed mohamed wrote:
> Hey Team,
>
>
>
> I am sending this e-mail to report a performance regression that’s
> caused by commit 244adf6426(ext4: make dioread_nolock the default) , I
> am listing the performance regression symptoms below & our analysis
> for the reported regression.
>
Although you did mention that you already tested on the latest kernel
and saw a similar performance regression there too, there is no harm in
double-checking whether you have this patch [1].
It fixes a similar problem AFAIU.
[1]: https://lore.kernel.org/linux-ext4/[email protected]/
-ritesh
On 21.09.22 04:41, Theodore Ts'o wrote:
> Hazem started separate e-mail threads on this issue (separated by
> about an hour), and I replied to the earlier one here:
>
> https://lore.kernel.org/all/[email protected]/
#regzbot monitor: https://lore.kernel.org/all/[email protected]/
> TL;DR:
>
> 1) The patch landed in 5.6, and improved performance for some
> workloads, and also fixed a potential security problem (exposure of
> stale data caused by a race).
>
> 2) If you are using a storage device >= 128GB, and a version of
> e2fsprogs v1.43.2 (released six years ago), the journal size will be
> 1GB, which Hazem reported resolved the problem.
>
> 3) I disagree that we should revert this commit, as it only changes a
> default. If you prefer the older behavior, you can change it with a
> mount option.
Great, thx for clarifying, in that case let me remove this from the
tracking:
#regzbot invalid: caused by a change of defaults that (among others) was
done for security reasons, see Ted's answer in
https://lore.kernel.org/all/[email protected]/ for details
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.
> Since 2016, the commit bbd2f78cf63a ("libext2fs: allow the default
> journal size to go as large as a gigabyte") has been in e2fsprogs
> v1.43.2 and newer (the current version of e2fsprogs v1.46.5; v1.43.2
> was released in September 2016, six years ago).
Thanks Theo for sharing that; I will look into updating the e2fsprogs package I am using so that it includes that commit.
> P.S. I'm puzzled by your comment, "we have to note that this should
> be only beneficial with extent-based files" --- while this is true,
> why does this matter? Unless you're dealing with an ancient file
> system that was originally created as ext2 or ext3 and then converted
> to ext4, *all* ext4 files should be extent-based...
Yes, I am using an ext4 filesystem, so that shouldn't be relevant in my case. I just mentioned it because it is also given in the ext4 mount-options documentation as the justification for why dioread_lock is the default, so we likely need to update the documentation as well: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt
> Note that dioread_nolock code path is only used for extent-based files.
> Because of the restrictions this options comprises it is off by default (e.g. dioread_lock).
My last question would be: is the higher journaling overhead justified, or at least expected, with dioread_nolock enabled? Is it caused by the fact that we need to allocate an uninitialized extent before the buffered write?
Thank you.
Hazem
Thanks for sharing that. Unfortunately that patch was merged in kernel
5.7, and I am still seeing the regression on later versions that
include it.
Hazem.
On Wed, 21 Sept 2022 at 05:44, Ritesh Harjani (IBM)
<[email protected]> wrote:
>
> On 22/09/19 04:18PM, hazem ahmed mohamed wrote:
> > Hey Team,
> >
> >
> >
> > I am sending this e-mail to report a performance regression that’s
> > caused by commit 244adf6426(ext4: make dioread_nolock the default) , I
> > am listing the performance regression symptoms below & our analysis
> > for the reported regression.
> >
>
> Although you did mention you already tested on the latest kernel too and there
> too you saw a similar performance regression.
> But no harm in also double checking if you have this patch [1].
> This does fixes a similar problem AFAIU.
>
>
> [1]: https://lore.kernel.org/linux-ext4/[email protected]/
>
> -ritesh