2012-08-22 06:01:03

by NeilBrown

Subject: Re: ext4 write performance regression in 3.6-rc1 on RAID0/5

On Wed, 22 Aug 2012 11:57:02 +0800 Yuanhan Liu <[email protected]>
wrote:

>
> -#define NR_STRIPES 256
> +#define NR_STRIPES 1024

Changing one magic number into another magic number might help your case, but
it is not really a general solution.

Possibly making sure that max_nr_stripes is at least some multiple of the
chunk size might make sense, but I wouldn't want to see a very large multiple.
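
Purely to illustrate that idea (the helper name, the 4x multiple and the
4 KiB page size below are assumptions, not actual md code), the sizing rule
could look something like:

/*
 * Illustrative sketch only, not actual md code.  One stripe_head covers
 * PAGE_SIZE bytes per device, so a full-stripe write of a given chunk
 * size needs chunk_size / PAGE_SIZE stripe_heads.  Sizing the cache at a
 * small multiple of that keeps full-stripe writes possible without
 * letting the cache grow very large.
 */
#include <stdio.h>

#define PAGE_SIZE_BYTES  4096   /* assume 4 KiB pages */
#define NR_STRIPES_MIN   256    /* the current compile-time default */
#define CHUNK_MULTIPLE   4      /* a deliberately small multiple */

static int suggested_nr_stripes(int chunk_size_bytes)
{
	int heads_per_full_stripe = chunk_size_bytes / PAGE_SIZE_BYTES;
	int nr = CHUNK_MULTIPLE * heads_per_full_stripe;

	return nr > NR_STRIPES_MIN ? nr : NR_STRIPES_MIN;
}

int main(void)
{
	/* e.g. a 512 KiB chunk needs 128 stripe_heads per full stripe */
	printf("512 KiB chunk -> %d stripe_heads\n",
	       suggested_nr_stripes(512 * 1024));
	return 0;
}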

I think the problems with RAID5 are deeper than that. Hopefully I'll figure
out exactly what the best fix is soon - I'm trying to look into it.

I don't think the size of the cache is a big part of the solution. I think
correct scheduling of IO is the real answer.

Thanks,
NeilBrown

2012-08-22 06:30:55

by Yuanhan Liu

Subject: Re: ext4 write performance regression in 3.6-rc1 on RAID0/5

On Wed, Aug 22, 2012 at 04:00:25PM +1000, NeilBrown wrote:
> On Wed, 22 Aug 2012 11:57:02 +0800 Yuanhan Liu <[email protected]>
> wrote:
>
> >
> > -#define NR_STRIPES 256
> > +#define NR_STRIPES 1024
>
> Changing one magic number into another magic number might help your case, but
> it is not really a general solution.

Agreed.

>
> Possibly making sure that max_nr_stripes is at least some multiple of the
> chunk size might make sense, but I wouldn't want to see a very large multiple.
>
> I think the problems with RAID5 are deeper than that. Hopefully I'll figure
> out exactly what the best fix is soon - I'm trying to look into it.
>
> I don't think the size of the cache is a big part of the solution. I think
> correct scheduling of IO is the real answer.

Yes, it should not be. But with a smaller max_nr_stripes, the chance of
getting a full-stripe write is lower, and maybe that is why we block in
get_active_stripe() more often and also issue more reads.

The ideal case would be no reads at all: with max_nr_stripes set to 32768
(the maximum we can set now), the reads almost disappear (nearly zero;
please see the iostat output I attached in an earlier email).
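
(For reference, a minimal userspace sketch of how that value can be bumped
at run time; it just does the equivalent of writing 32768 to
/sys/block/md0/md/stripe_cache_size, and md0 is an assumption here.)

#include <stdio.h>

int main(void)
{
	/* md0 is an assumption; adjust to the array being tested */
	FILE *f = fopen("/sys/block/md0/md/stripe_cache_size", "w");

	if (!f) {
		perror("stripe_cache_size");
		return 1;
	}
	fprintf(f, "%d\n", 32768);
	return fclose(f) != 0;
}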

Anyway, I do agree this should not be a big part of the solution. If
we can handle those stripes faster, I guess 256 would be enough.

Thanks,
Yuanhan Liu

2012-08-22 07:14:52

by Andreas Dilger

Subject: Re: ext4 write performance regression in 3.6-rc1 on RAID0/5

On 2012-08-22, at 12:00 AM, NeilBrown wrote:
> On Wed, 22 Aug 2012 11:57:02 +0800 Yuanhan Liu <[email protected]>
> wrote:
>>
>> -#define NR_STRIPES 256
>> +#define NR_STRIPES 1024
>
> Changing one magic number into another magic number might help your case, but
> it is not really a general solution.

We've actually been carrying a patch in Lustre for a few years that
increases NR_STRIPES to 2048 and makes it a configurable module
parameter. This made a noticeable improvement to performance on
fast systems.
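
Roughly speaking, the configurable part amounts to something like the
sketch below; the parameter name, default and permissions are illustrative
rather than copied from the actual patch.

#include <linux/module.h>

/* Illustrative only: the parameter name and default are not from the
 * actual Lustre patch.
 */
static int nr_stripes = 2048;
module_param(nr_stripes, int, 0444);
MODULE_PARM_DESC(nr_stripes,
		 "initial number of stripe_heads in the raid5 stripe cache");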

> Possibly making sure that max_nr_stripes is at least some multiple of the
> chunk size might make sense, but I wouldn't want to see a very large multiple.
>
> I think the problems with RAID5 are deeper than that. Hopefully I'll figure
> out exactly what the best fix is soon - I'm trying to look into it.

The other MD RAID-5/6 patches that we carry change the page submission
order to reduce how much page merging the elevator has to do, plus a
patch that allows zero-copy IO submission if the caller marks the page for
direct IO (indicating it will not be modified until after the IO completes).
This avoids a lot of overhead on fast systems.

This isn't really my area of expertise, but the patches against RHEL6
can be seen at http://review.whamcloud.com/1142 if you want to
take a look. I don't know whether that code is at all relevant to what
is in 3.x today.

> I don't think the size of the cache is a big part of the solution. I think
> correct scheduling of IO is the real answer.

My experience is that on fast systems the IO scheduler just gets in the
way. Submitting larger contiguous IOs to each disk in the first place
is far better than trying to merge small IOs again at the back end.

Cheers, Andreas

2012-08-22 20:47:15

by Dan Williams

Subject: Re: ext4 write performance regression in 3.6-rc1 on RAID0/5

On Tue, Aug 21, 2012 at 11:00 PM, NeilBrown <[email protected]> wrote:
> On Wed, 22 Aug 2012 11:57:02 +0800 Yuanhan Liu <[email protected]>
> wrote:
>
>>
>> -#define NR_STRIPES 256
>> +#define NR_STRIPES 1024
>
> Changing one magic number into another magic number might help your case, but
> it is not really a general solution.
>
> Possibly making sure that max_nr_stripes is at least some multiple of the
> chunk size might make sense, but I wouldn't want to see a very large multiple.
>
> I think the problems with RAID5 are deeper than that. Hopefully I'll figure
> out exactly what the best fix is soon - I'm trying to look into it.
>
> I don't think the size of the cache is a big part of the solution. I think
> correct scheduling of IO is the real answer.

Not sure if this is what we are seeing here, but we still have the
unresolved fast-parity effect, whereby a slower parity calculation allows
more time for writes to coalesce. I saw this effect when playing with
xor offload.

--
Dan

2012-08-22 22:00:16

by NeilBrown

Subject: Re: ext4 write performance regression in 3.6-rc1 on RAID0/5

On Wed, 22 Aug 2012 13:47:07 -0700 Dan Williams <[email protected]> wrote:

> On Tue, Aug 21, 2012 at 11:00 PM, NeilBrown <[email protected]> wrote:
> > On Wed, 22 Aug 2012 11:57:02 +0800 Yuanhan Liu <[email protected]>
> > wrote:
> >
> >>
> >> -#define NR_STRIPES 256
> >> +#define NR_STRIPES 1024
> >
> > Changing one magic number into another magic number might help your case, but
> > it is not really a general solution.
> >
> > Possibly making sure that max_nr_stripes is at least some multiple of the
> > chunk size might make sense, but I wouldn't want to see a very large multiple.
> >
> > I think the problems with RAID5 are deeper than that. Hopefully I'll figure
> > out exactly what the best fix is soon - I'm trying to look into it.
> >
> > I don't think the size of the cache is a big part of the solution. I think
> > correct scheduling of IO is the real answer.
>
> Not sure if this is what we are seeing here, but we still have the
> unresolved fast-parity effect, whereby a slower parity calculation allows
> more time for writes to coalesce. I saw this effect when playing with
> xor offload.

I did find a case where inserting a printk made it go faster again.
Replacing that with msleep(2) worked as well. :-)

I'm looking for a more robust solution though.
Thanks for the reminder.

NeilBrown