2022-04-21 11:35:11

by Xiao Ni

Subject: Re: [PATCH v2 00/12] Improve Raid5 Lock Contention

On Thu, Apr 21, 2022 at 3:55 AM Logan Gunthorpe <[email protected]> wrote:
>
> Hi,
>
> This is v2 of this series which addresses Christoph's feedback and
> fixes some bugs. The first posting is at [1]. A git branch is
> available at [2].
>
> --
>
> I've been doing some work trying to improve the bulk write performance
> of raid5 on large systems with fast NVMe drives. The bottleneck appears
> largely to be lock contention on the hash_lock and device_lock. This
> series improves the situation slightly by addressing a couple of
> low-hanging-fruit opportunities to take the locks fewer times in the
> request path.
>
> Patch 9 adjusts how batching works by keeping a reference to the
> previous stripe_head in raid5_make_request(). Under most situations,
> this removes the need to take the hash_lock in stripe_add_to_batch_list(),
> which should reduce the number of times the lock is taken by a factor of
> about 2.
>
> Patch 12 pivots the way raid5_make_request() works. Before the patch, the
> code must find the stripe_head for every 4KB page in the request, so each
> stripe_head must be found once for every data disk. The patch changes this
> so that all the data disks can be added to a stripe_head at once, and the
> number of times the stripe_head must be found (and thus the number of
> times the hash_lock is taken) should be reduced by a factor roughly equal
> to the number of data disks.
>
> The remaining patches are just cleanup and prep patches for those two
> patches.
>
> In apples-to-apples testing of this series on a small VM with 5 ram
> disks, I saw a bandwidth increase of roughly 14%, and contention on
> the hash_lock (as reported by lock_stat) dropped by more than a factor
> of 5 (though it is still significantly contended).
>
> Testing on larger systems with NVMe drives showed similar modest bandwidth
> increases, from 3% to 20% depending on the parameters. Oddly, smaller
> arrays had larger gains, likely because they start from lower bandwidths;
> I would have expected larger gains with larger arrays (given that even
> fewer locks should have been taken in raid5_make_request()).


Hi Logan

Could you share the commands to get the test result (lock contention
and performance)?

Regards
Xiao


2022-04-22 17:47:28

by Logan Gunthorpe

Subject: Re: [PATCH v2 00/12] Improve Raid5 Lock Contention



On 2022-04-21 02:45, Xiao Ni wrote:
> Could you share the commands to get the test result (lock contention
> and performance)?

Sure. The performance we were focused on was large block writes, so we
set up raid5 instances with varying numbers of disks and ran the
following fio script directly on the md device.

[simple]
filename=/dev/md0
ioengine=libaio
rw=write
direct=1
size=8G
blocksize=2m
iodepth=16
runtime=30s
time_based=1
offset_increment=8G
numjobs=12

(We also played around with tuning this but didn't find substantial
changes once the bottleneck was hit)
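
The arrays themselves were nothing special. Something along these lines
is representative, though the device names and disk count here are
illustrative rather than the exact ones we used:

# e.g. a 5-disk raid5; we used varying numbers of NVMe drives
# (or brd ram disks for the VM tests)
mdadm --create /dev/md0 --level=5 --raid-devices=5 /dev/nvme[0-4]n1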

We tuned md with parameters like:

echo 4 > /sys/block/md0/md/group_thread_cnt
echo 8192 > /sys/block/md0/md/stripe_cache_size

For lock contention stats, we just used lockstat[1]; roughly like:

echo 1 > /proc/sys/kernel/lock_stat
fio test.fio
echo 0 > /proc/sys/kernel/lock_stat
cat /proc/lock_stat

And compared the before and after.
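
If it's useful, the entries for the raid5 locks can be pulled out of
that output with something like this (the exact lock class names depend
on the kernel build):

grep -A 5 -E 'hash_locks|device_lock' /proc/lock_stat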

Logan

[1] https://www.kernel.org/doc/html/latest/locking/lockstat.html

2022-04-25 06:43:17

by Guoqing Jiang

Subject: Re: [PATCH v2 00/12] Improve Raid5 Lock Contention



On 4/22/22 12:02 AM, Logan Gunthorpe wrote:
>
> On 2022-04-21 02:45, Xiao Ni wrote:
>> Could you share the commands to get the test result (lock contention
>> and performance)?
> Sure. The performance we were focused on was large block writes. So we
> setup raid5 instances with varying number of disks and ran the following
> fio script directly on the drive.
>
> [simple]
> filename=/dev/md0
> ioengine=libaio
> rw=write
> direct=1
> size=8G
> blocksize=2m
> iodepth=16
> runtime=30s
> time_based=1
> offset_increment=8G
> numjobs=12
> 
> (We also played around with tuning this but didn't find substantial
> changes once the bottleneck was hit)

Nice, I suppose other IO patterns keep the same performance as before.

> We tuned md with parameters like:
>
> echo 4 > /sys/block/md0/md/group_thread_cnt
> echo 8192 > /sys/block/md0/md/stripe_cache_size
>
> For lock contention stats, we just used lockstat[1]; roughly like:
>
> echo 1 > /proc/sys/kernel/lock_stat
> fio test.fio
> echo 0 > /proc/sys/kernel/lock_stat
> cat /proc/lock_stat
>
> And compared the before and after.

Thanks for your effort. Besides the performance test, please also try to
run the mdadm test suite to avoid regressions.

Thanks,
Guoqing

2022-04-25 18:45:49

by Logan Gunthorpe

Subject: Re: [PATCH v2 00/12] Improve Raid5 Lock Contention



On 2022-04-24 02:00, Guoqing Jiang wrote:
>
>
> On 4/22/22 12:02 AM, Logan Gunthorpe wrote:
>>
>> On 2022-04-21 02:45, Xiao Ni wrote:
>>> Could you share the commands to get the test result (lock contention
>>> and performance)?
>> Sure. The performance we were focused on was large block writes. So we
>> setup raid5 instances with varying number of disks and ran the following
>> fio script directly on the drive.
>>
>> [simple]
>> filename=/dev/md0
>> ioengine=libaio
>> rw=write
>> direct=1
>> size=8G
>> blocksize=2m
>> iodepth=16
>> runtime=30s
>> time_based=1
>> offset_increment=8G
>> numjobs=12
>> 
>> (We also played around with tuning this but didn't find substantial
>> changes once the bottleneck was hit)
>
> Nice, I suppose other IO patterns keep the same performance as before.
>
>> We tuned md with parameters like:
>>
>> echo 4 > /sys/block/md0/md/group_thread_cnt
>> echo 8192 > /sys/block/md0/md/stripe_cache_size
>>
>> For lock contention stats, we just used lockstat[1]; roughly like:
>>
>> echo 1 > /proc/sys/kernel/lock_stat
>> fio test.fio
>> echo 0 > /proc/sys/kernel/lock_stat
>> cat /proc/lock_stat
>>
>> And compared the before and after.
>
> Thanks for your effort, besides the performance test, please try to run
> mdadm test suites to avoid regression.

Yeah, is there any documentation for that? I tried to look into it but
couldn't figure out how it's run.

I do know that lkp-tests has run it on this series, as I did get an error
report from it. But while I'm pretty sure that error has been resolved, I
was never able to figure out how to run the tests locally.

Thanks,

Logan

2022-04-26 02:54:15

by Xiao Ni

Subject: Re: [PATCH v2 00/12] Improve Raid5 Lock Contention

On Mon, Apr 25, 2022 at 11:39 PM Logan Gunthorpe <[email protected]> wrote:
>
>
>
> On 2022-04-24 02:00, Guoqing Jiang wrote:
> >
> >
> > On 4/22/22 12:02 AM, Logan Gunthorpe wrote:
> >>
> >> On 2022-04-21 02:45, Xiao Ni wrote:
> >>> Could you share the commands to get the test result (lock contention
> >>> and performance)?
> >> Sure. The performance we were focused on was large block writes. So we
> >> setup raid5 instances with varying number of disks and ran the following
> >> fio script directly on the drive.
> >>
> >> [simple]
> >> filename=/dev/md0
> >> ioengine=libaio
> >> rw=write
> >> direct=1
> >> size=8G
> >> blocksize=2m
> >> iodepth=16
> >> runtime=30s
> >> time_based=1
> >> offset_increment=8G
> >> numjobs=12
> >> 
> >> (We also played around with tuning this but didn't find substantial
> >> changes once the bottleneck was hit)
> >
> > Nice, I suppose other IO patterns keep the same performance as before.
> >
> >> We tuned md with parameters like:
> >>
> >> echo 4 > /sys/block/md0/md/group_thread_cnt
> >> echo 8192 > /sys/block/md0/md/stripe_cache_size
> >>
> >> For lock contention stats, we just used lockstat[1]; roughly like:
> >>
> >> echo 1 > /proc/sys/kernel/lock_stat
> >> fio test.fio
> >> echo 0 > /proc/sys/kernel/lock_stat
> >> cat /proc/lock_stat
> >>
> >> And compared the before and after.
> >
> > Thanks for your effort, besides the performance test, please try to run
> > mdadm test suites to avoid regression.
>
> Yeah, is there any documentation for that? I tried to look into it but
> couldn't figure out how it's run.
>
> I do know that lkp-tests has run it on this series as I did get an error
> from it. But while I'm pretty sure that error has been resolved, I was
> never able to figure out how to run them locally.
>

Hi Logan

You can clone the mdadm repo from
git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
In the top-level directory there is a script named 'test' (note that it is
not under the tests directory; the test cases themselves live under the
tests directory).
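
Roughly (I believe the test script needs to be run as root and uses the
locally built mdadm, so build first):

git clone git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
cd mdadm
make
./test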

Regards
Xiao

2022-04-29 21:57:47

by Logan Gunthorpe

Subject: Re: [PATCH v2 00/12] Improve Raid5 Lock Contention



On 2022-04-25 10:12, Xiao Ni wrote:
>> I do know that lkp-tests has run it on this series as I did get an error
>> from it. But while I'm pretty sure that error has been resolved, I was
>> never able to figure out how to run them locally.
>>
>
> Hi Logan
>
> You can clone the mdadm repo at
> git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
> Then you can find there is a script test under the directory. It's not
> under the tests directory.
> The test cases are under tests directory.

So I've been fighting with this, and it seems there are just a ton of
failures in these tests even without my changes. Running the latest mdadm
master (52c67fcdd6dad) on a stock v5.17.5 kernel, I see major brokenness:
about 17 of the 44 tests that ran failed. I had to run with
--disable-integrity because those tests seem to hang in an infinite loop
waiting for the md array to go into the U state (even though it appears
idle).

Even though I ran the tests with '--keep-going', the testing stopped
after 07revert-grow reported errors in dmesg -- even though the only
errors printed to dmesg were those of mdadm segfaulting.
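
For reference, the invocation ended up being roughly this (run as root):

./test --keep-going --disable-integrity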

Running on md/md-next seems to get a bit further (to
10ddf-create-fail-rebuild) and stops with the same segfaulting issue (or
perhaps the 07 test only randomly fails first -- I haven't run it that
many times). Though most of the tests between these points fail anyway.

My upcoming v3 patches cause no failures beyond those already seen on
the md/md-next branch. But it seems these tests have rotted to the point
that they aren't all that useful; or maybe there are a ton of
regressions in the kernel already and nobody was paying much attention.
I have also tried to test certain cases that appear to be broken in
recent kernels anyway (for example, reducing the number of disks in a
raid5 array hangs on the first stripe of the reshape).
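
For reference, that case can be reproduced with something like the
following (sizes and counts are illustrative; mdadm refuses to reduce
raid-devices until the array size has been shrunk first, and a
--backup-file may be required depending on the mdadm/kernel version):

mdadm --grow /dev/md0 --array-size=<new smaller size>
mdadm --grow /dev/md0 --raid-devices=4 --backup-file=/root/md0-backup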

In any case, I have a very rough ad-hoc test suite I've been expanding
that is targeted at testing my specific changes; testing these changes
has definitely been challenging. I've published my tests here:

https://github.com/Eideticom/raid5-tests

Logan