2022-03-01 04:21:13

by Ritesh Harjani

Subject: Parallel fsck performance degradation case discussion

Hello,

I am working to help merge ext4's parallel fsck in upstream e2fsprogs.
Ted has provided some details here[1] on some of the work needed, to get it
accepted/merged into upstream.

However, in this email, I mostly wanted to discuss some performance (perf) observations
and to check whether we have already done multi-threaded fsck testing on such test cases.

So, I was doing some testing with different FS layouts and with different disk types
to see its performance benefits. Here are some of the observations. I wanted to know
if it is in line with your observations too.
Mainly, I want to discuss Case-4, to see if it is already a known limitation.

Case-1: Huge no. of 0 byte sized inodes (22M inodes)
We do see performance benefits with pfsck in this use case (I saw around 3x improvement with ramfs).
This is also true for all disk/device setups i.e. ramfs based ext4 FS using loop device,
on HDD and on NVMes (perf improvements can vary based on disk types too).

Case-2: Huge no. of 4KB-32KB sized inodes/directories (22M inodes)
We do see performance benefits with pfsck in this use case as well (again around 3x improvement with ramfs).
This is also true for all disk/device setups i.e. ramfs based ext4 FS using loop device,
on HDD and on NVMes (perf improvements can vary based on disk types).

Case-3: Large directories (with many 0 byte files within these directories)
In this case, mostly pass-2 takes significant time, but again we do see performance
improvements with pass-1 for all different disk/device setups.

Case-4: Files with heavy fragmentation i.e. lots of extents.
(creating this FS layout roughly by running script1.sh followed by script2.sh mentioned at the end of this email)
In this case we start seeing performance degradation if the I/O device is fast enough.
1. On a single HDD, we see a significant perf reduction of roughly 30% or more (with pfsck compared to non-pfsck).
2. With a single NVMe, similar perf reduction or more.
3. ramfs based single loop device setup - ~100% perf reduction.
4. ramfs based 4 loop devices with dm_delay on top and with SW raid0 config (md0) (i.e. with 4 dm-delay devices of 50G each in raid0).
a. With a delay of 0ms we see a performance degradation of ~100%, i.e. the run time doubles (10s v/s 20s).
Below is the perf profile where the performance degradation is seen (with pfsck -m 4):
   26.37%  e2fsck  e2fsck              [.] rb_insert_extent
   13.54%  e2fsck  e2fsck              [.] ext2fs_rb_next
    9.72%  e2fsck  libc-2.31.so        [.] _int_free
    7.83%  e2fsck  libc-2.31.so        [.] malloc
    7.45%  e2fsck  e2fsck              [.] rb_test_clear_bmap_extent
    6.46%  e2fsck  e2fsck              [.] rb_test_bmap
    4.60%  e2fsck  libpthread-2.31.so  [.] __pthread_rwlock_rdlock
    4.39%  e2fsck  libpthread-2.31.so  [.] __pthread_rwlock_unlock

b. But with the above disk setup (4 dm-delay devices in raid0), a ~36% to 3x performance improvement is observed when the
delay is within the range of [1ms - 500ms] (for every read/write).
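
For reference, the 4-device dm-delay + raid0 layout from item 4 could be assembled roughly as below. This is only a sketch under assumptions, not the exact commands used for the numbers above: the helper names (delay_table, setup_delayed_raid0), the backing-file names (diskN.img) and the dm device names (delayN) are made up for illustration, and running it needs root plus losetup, dmsetup and mdadm.

```sh
#!/bin/sh
# Sketch only: helper names, paths and device names are illustrative assumptions.

# Print a device-mapper "delay" table line for one device:
#   <start> <length_in_512B_sectors> delay <device> <offset> <delay_ms>
delay_table() {
    sectors=$1 dev=$2 delay_ms=$3
    echo "0 $sectors delay $dev 0 $delay_ms"
}

# Build 4 loop devices (50G files under directory $1, e.g. a ramfs mount),
# put dm-delay with a delay of $2 ms on top of each, and stripe them
# into /dev/md0 (raid0).
setup_delayed_raid0() {
    backing_dir=$1 delay_ms=$2
    members=""
    for i in 0 1 2 3; do
        truncate -s 50G "$backing_dir/disk$i.img" || return 1
        loopdev=$(losetup --find --show "$backing_dir/disk$i.img") || return 1
        sectors=$(blockdev --getsz "$loopdev") || return 1
        dmsetup create "delay$i" \
            --table "$(delay_table "$sectors" "$loopdev" "$delay_ms")" || return 1
        members="$members /dev/mapper/delay$i"
    done
    # $members is intentionally unquoted so it splits into 4 arguments.
    mdadm --create /dev/md0 --level=0 --raid-devices=4 $members
}
```

Something like "setup_delayed_raid0 /mnt/ram 0" would then correspond to case 4.a, and a non-zero second argument to case 4.b.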

Now, I understand we might say that parallel fsck benefits can mostly be seen in case of parallel I/O,
because otherwise pfsck might add some extra overhead due to thread spawning, allocating per-thread
structures and the merge logic. But should that account for a significant perf degradation in such a fragmented-files use case?

From my observations so far, I see in case-4.a), most of the time is being spent in merging of block_found_map bitmap.
On measuring some stats and when testing with -m 1 (i.e. thread-0), I see e2fsck_pass1_merge_context() alone
taking 18sec out of 32sec (which is total time for pass-1).

<stats log>
============
[Thread 0] Scanned group range [0, 1599), inodes 169076
e2fsck_pass1_merge_context [0]: bg range [0, 1599] elapsed time: 18.580 count=25573571
elapsed time: 32.863

"count" in the above stat measures the total no. of extent entries found in thread_ctx->block_found_map
(by adding an rb_count_bmap() function). Since there is only one thread here, that also means it is the total no.
of extent entries. The above data is shown with "-m 1" just to show the exact entry count.
With "-m 4" the performance is degraded as well.
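
To put the numbers from the stats log in proportion, the small helpers below work out the share of time spent in the merge step, the average merge cost per extent entry, and the relative slowdown. This is just a sketch for the arithmetic; the helper names are made up and are not part of e2fsprogs.

```sh
#!/bin/sh
# Sketch only: helper names are illustrative; inputs are taken from the
# stats log and from case 4.a above.

# Percentage of the total time spent in the merge step.
merge_share_pct() {
    awk -v m="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * m / t }'
}

# Average merge cost per extent entry, in nanoseconds.
ns_per_entry() {
    awk -v m="$1" -v n="$2" 'BEGIN { printf "%.0f\n", m * 1e9 / n }'
}

# Run-time change of pfsck relative to plain fsck, in percent
# (positive means pfsck is slower).
slowdown_pct() {
    awk -v base="$1" -v pfsck="$2" \
        'BEGIN { printf "%.0f\n", 100 * (pfsck - base) / base }'
}

# Example calls:
#   merge_share_pct 18.580 32.863
#   ns_per_entry 18.580 25573571
#   slowdown_pct 10 20
```

With the logged values, the merge step comes out at a bit over half of the total time, and at somewhat under a microsecond per extent entry.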

I have also tested this on raid0 using 2 HDDs, and perf degradation was observed there too.
(I don't have the exact data handy for this, but I can collect it again if needed.)
As far as I recall, it was definitely a significant reduction in perf numbers.

So I was wondering if this is a known limitation around pfsck and if it has popped up in any of your tests too.
Also please do let me know if I have missed anything obvious here?

In some of my earlier testing, I had tested with Lustre e2fsprogs (master-pfsck branch) and had similar observations
as mentioned above. But recently all my tests were based on the following tree[2] (with patch[3] included).
I have these setups available with me, so if anything needs to be tested from my end, I can do that.

References
============
[1]: https://lore.kernel.org/all/YMN10sXgoTR%[email protected]/
[2]: https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/log/?h=pfsck
[3]: https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=699di448eee4b991acafaae4e4f8222be332d6837


Thanks for your help!!
-ritesh

--

<script1.sh>
============
# MNT must point to the mounted ext4 filesystem under test.
fragmented_filesize=$((10 * 1024 * 1024 * 1024))
dir_cnt=0
while [ $dir_cnt -lt 8192 ]; do
    mkdir "$MNT/n$dir_cnt" || break
    inode_cnt=0
    while [ $inode_cnt -lt 8192 ]; do
        if [ $inode_cnt -eq 0 ]; then
            # First file in each directory: preallocate 10G, to be fragmented by script2.sh.
            xfs_io -fc "falloc 0 $fragmented_filesize" "$MNT/n$dir_cnt/n$inode_cnt"
        else
            # All other files are 0 byte files.
            touch "$MNT/n$dir_cnt/n$inode_cnt" || break
        fi
        inode_cnt=$((inode_cnt+1))
    done
    dir_cnt=$((dir_cnt+1))
done
exit

<script2.sh>
==============
dir_cnt=0
while [ $dir_cnt -lt 8192 ]; do
    # Fragment only the first (fallocated) file, n0, of each directory.
    inode_cnt=0
    "$XFSTESTS_PATH/src/punch-alternating" "$MNT/n$dir_cnt/n$inode_cnt"
    dir_cnt=$((dir_cnt+1))
done
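
As a rough sanity check on the extent counts these two scripts produce, the arithmetic can be sketched as below. It assumes a 4 KiB filesystem block size and that punch-alternating, run with its defaults, punches out every other block of the file; both are assumptions on my side, and the helper names are made up for illustration.

```sh
#!/bin/sh
# Sketch only: helper names and the 4 KiB block size are assumptions.

# Extents left in a fully allocated file after punching every other block.
extents_per_file() {
    size_bytes=$1 block_size=$2
    echo $(( size_bytes / block_size / 2 ))
}

# Number of such files (rounded down) needed to reach a given extent count.
files_for_extent_count() {
    total=$1 per_file=$2
    echo $(( total / per_file ))
}

# Example calls:
#   extents_per_file $((10 * 1024 * 1024 * 1024)) 4096
#   files_for_extent_count 25573571 1310720
```

Under these assumptions a single 10G file ends up with 1310720 extents, so the ~25.5M entries in the stats log are in the range of what about 20 such files would give. This is only a ballpark, since block_found_map should also contain the metadata blocks.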


2022-03-21 22:48:36

by Ritesh Harjani

Subject: Re: Parallel fsck performance degradation case discussion


Sorry about the delay in getting back to this. I was caught up in completing
some of the other open activities, but I will now be spending more time on
getting this work in shape for merging.

On 22/03/03 12:50PM, Andreas Dilger wrote:
> On Feb 28, 2022, at 7:57 PM, Ritesh Harjani <[email protected]> wrote:
> >
> > Hello,
> >
> > I am working to help merge ext4's parallel fsck in upstream e2fsprogs.
> > Ted has provided some details here[1] on some of the work needed, to
> > get it accepted/merged into upstream.
>
> Hi Ritesh,
> thanks for working on this and running the testing.
>
> > However, in this email, I mostly wanted to discuss some performance(perf) observations and to check if we have done our multi-thread fsck testing
> > on such test cases or not.
> >
> > So, I was doing some testing with different FS layouts and with different
> > disk types to see its performance benefits. Here are some of the observations.
> > I wanted to know if it is in line with your observations too.
>
> Most of our testing was done with large multi-disk declustered-parity RAID
> (i.e. in the range of 40-160 HDDs in a single volume). I've attached an
> image showing our results.

Thanks for sharing the data.

Yes, the e2fsck timings with a multi-disk (HDD) raid0 setup and a large
inode count are in line with my observations too.


>
> > Also to mainly discuss Case-4, to see if it is already a known limitation.
> >
> > Case-1: Huge no. of 0 byte sized inodes (22M inodes)
> > We do see performance benefits with pfsck in this use case (I saw around 3x improvement with ramfs).
> > This is also true for all disk/device setups i.e. ramfs based ext4 FS using loop device,
> > on HDD and on NVMes (perf improvements can vary based on disk types too).
> >
> > Case-2: Huge no. of 4KB-32KB sized inodes/directories (22M inodes)
> > We do see performance benefits with pfsck in this use case as well (again around 3x improvement with ramfs).
> > This is also true for all disk/device setups i.e. ramfs based ext4 FS using loop device,
> > on HDD and on NVMes (perf improvements can vary based on disk types).
> >
> > Case-3: Large directories (with many 0 byte files within these directories)
> > In this case, mostly pass-2 takes significant time, but again we do see
> > performance improvements with pass-1 for all different disk/device setups.
>
> Yes, the pass2/3 scanning can definitely be parallelized for workloads like
> this, but we haven't had time to do this work yet. There was a preliminary
> patch to do some of the block fetching in parallel, but not the actual scan:
> https://review.whamcloud.com/44428

Sure, thanks for sharing the info.

>
> > Case-4: Files with heavy fragmentation i.e. lots of extents.
> > (creating this FS layout roughly by running script1.sh followed by script2.sh mentioned at the end of this email)
> > In this case we start seeing performance degradation if the I/O device is fast enough.
> > 1. On a single HDD, we see significant perf reduction > ~30% (with pfsck compare to non pfsck).
> > 2. With single nvme, similar perf reduction or more.
> > 3. ramfs based single loop device setup - ~100% perf reduction.
> > 4. ramfs based 4 loop devices with dm_delay on top and with SW raid0 config (md0) (i.e. with 4 dm-delay devices of 50G each in raid0).
> > a. With delay of 0ms we see a performance degradation of around ~100%. (10s v/s 20s)
> > Below is the perf profile where the performance degradation is seen (with pfsck -m 4)
> > 26.37% e2fsck e2fsck [.] rb_insert_extent
> > 13.54% e2fsck e2fsck [.] ext2fs_rb_next
> > 9.72% e2fsck libc-2.31.so [.] _int_free
> > 7.83% e2fsck libc-2.31.so [.] malloc
> > 7.45% e2fsck e2fsck [.] rb_test_clear_bmap_extent
> > 6.46% e2fsck e2fsck [.] rb_test_bmap
> > 4.60% e2fsck libpthread-2.31.so [.] __pthread_rwlock_rdlock
> > 4.39% e2fsck libpthread-2.31.so [.] __pthread_rwlock_unlock
>
> This is a pretty unrealistic scenario IMHO, with a huge number of fragmented
> chunks on a single HDD. So it isn't clear if this is worthwhile to optimize
> the parallel mode, and instead run in a single thread?

The above perf profile is not from a single HDD.

So, in my testing, what I have observed so far is that with relatively fast devices
(including HDDs in a raid0 setup) and an FS layout with heavily fragmented files,
the pfsck bottleneck is in the rb tree insert/merge logic. In fact, pfsck takes more
time than normal fsck.

I am guessing that since this is not a common use case,
we might not have tested this before? Do you remember if that was the case?


Now, given that pfsck performs quite well in most of the common use cases with multi-disk
raid setups, I think it makes more sense to get the current patches
rebased (after getting the libext2fs abstraction changes done).
In parallel, I will also look at the heavily fragmented file use case, to see
if there are any optimizations possible in the rbtree insert/merge code paths which
can remove this observed bottleneck.

Thanks for your help!!
-ritesh


>
> Cheers, Andreas