2010-12-28 09:07:51

by Rogier Wolff

Subject: Uneven load on my raid disks.


Hi,

I have a big filesystem (4x2TB RAID5 = 6TB).

I have a shitload (*) of files on there. 4 days ago, I decided I
wanted to clean up some disk space, and ordered an rm -rf of some 110
directories with a subset of that shitload of files. If I do one rm
-rf, it will do everything linearly: read one directory from one disk,
read the inodes from another disk, read the required bitmaps from the
next disk, remove the file. Only the write-backs of the just-cleared
inodes and of the bitmaps would be executed in parallel.

So that's why I started the remove of these 110 directories in
parallel: to keep my 4 disks busy. About 1.5 times the number of disks
that I have would have been optimal, but I thought I wouldn't incur
too much of a penalty by starting all of them together. In any case it
would be done the next morning... 4 days and counting, almost a TB
freed...
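
(For the record, I started them with something along these lines; the
directory path here is made up:

for d in /bigfs/oldstuff/*/ ; do rm -rf "$d" & done ; wait

i.e. one rm -rf per directory, all launched into the background at once.)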

However, it seems that it is NOT keeping all of my four disks busy:
according to iostat my "sdb" disk is executing about 10 times more
reads than the other drives, and it is therefore the performance
bottleneck of the whole operation.
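
(I'm watching this with plain

iostat -x 10

and sdb's reads-per-second stay roughly a factor of ten above those of
the other three drives.)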

This would mean that all of the inodes or all of the bitmaps are on my
"sdb" disk. This sounds reasonable if I forgot to tell mke2fs that
this was a RAID. I don't think I forgot... How can I check whether this
is the case?

http://www.goplexian.com/2010/09/6-tips-for-improving-hard-drive.html

says:
dumpe2fs -h /dev/md0 | grep RAID

should show whether my ext4 partition is on a RAID. Well, mine is, but
the command produces no output. Did I forget those RAID options during
formatting? Can this be tuned after the fact?
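
(My guess is that the expected values can be derived from the array
geometry: look up the chunk size with something like

mdadm --detail /dev/md0 | grep -i chunk

and then stride = chunk size / filesystem block size, and stripe width =
stride times the number of data disks, i.e. stride * 3 for a 4-disk
RAID5. With 4k blocks and, say, a 512k chunk that would be stride=128,
stripe_width=384.)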

dumpe2fs without -h takes a long time. I just stopped all rm processes
and now only dumpe2fs is running. It seems to be causing about 10 times
more load on sdb than on the other drives. strace shows that it is in
a loop where it seeks/reads. It is alternating reads of 4k and 280
bytes.
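
(I'm attaching to it with something along the lines of

strace -e trace=lseek,read -p $(pidof dumpe2fs)

in case anyone wants to reproduce the observation.)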


Roger.

(*) Guess high, then multiply by ten.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ


2010-12-28 09:18:26

by Rogier Wolff

Subject: Re: Uneven load on my raid disks.

On Tue, Dec 28, 2010 at 10:07:49AM +0100, Rogier Wolff wrote:
> dumpe2fs without -h takes a long time. I just stopped all rm processes
> and now only dumpe2fs is running. It seems to be causing about 10 times
> more load on sdb than on the other drives. strace shows that it is in
> a loop where it seeks/reads. It is alternating reads of 4k and 280
> bytes.

OK. I did this wrong: there were also other things running. Whatever
dumpe2fs is doing distributes nicely over the four disks.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2010-12-29 16:50:03

by Kay Diederichs

Subject: Re: Uneven load on my raid disks.

Rogier Wolff wrote:
...
>
> http://www.goplexian.com/2010/09/6-tips-for-improving-hard-drive.html
>
> says:
> dumpe2fs -h /dev/md0 | grep RAID
>
> should show whether my ext4 partition is on a RAID. Well, mine is, but
> the command produces no output. Did I forget those RAID options during
> formatting? Can this be tuned after the fact?
>

% tune2fs -l /dev/md0

...
RAID stride: 128
RAID stripe width: 768
...

runs much faster than dumpe2fs.
The command can also adjust the values.
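
For adjusting, it would be something along the lines of (using the
values shown above; yours will differ):

% tune2fs -E stride=128,stripe_width=768 /dev/md0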

HTH,
Kay

2010-12-29 22:17:28

by Theodore Ts'o

Subject: Re: Uneven load on my raid disks.

On Wed, Dec 29, 2010 at 05:40:12PM +0100, Kay Diederichs wrote:
> >says: dumpe2fs -h /dev/md0 | grep RAID
>
> % tune2fs -l /dev/md0
>
> ...
> RAID stride: 128
> RAID stripe width: 768
> ...
>
> runs much faster than dumpe2fs.
> The command can also adjust the values.

Actually, "tune2fs -l" and "dumpe2fs -h" both run in about the same
amount of time. dumpe2fs without the -h option runs slower than
tune2fs -l, true. But that's because it reads and prints out
information regarding the block and inode allocation bitmaps.

- Ted

2010-12-30 10:32:03

by Rogier Wolff

Subject: Re: Uneven load on my raid disks.

On Wed, Dec 29, 2010 at 05:17:15PM -0500, Ted Ts'o wrote:
> On Wed, Dec 29, 2010 at 05:40:12PM +0100, Kay Diederichs wrote:
> > >says: dumpe2fs -h /dev/md0 | grep RAID
> >
> > % tune2fs -l /dev/md0
> >
> > ...
> > RAID stride: 128
> > RAID stripe width: 768
> > ...
> >
> > runs much faster than dumpe2fs.
> > The command can also adjust the values.
>
> Actually, "tune2fs -l" and "dumpe2fs -h" both run in about the same
> amount of time. dumpe2fs without the -h option runs slower than
> tune2fs -l, true. But that's because it reads and prints out
> information regarding the block and inode allocation bitmaps.

And the annoying thing is that it apparently uses a library function
that only returns after reading all that data.

So while it could print the superblock info and the first few block
groups, I'm left waiting.

My remove-of-200-million-files has completed. It took a week:
200,000,000 / (7*24*3600) ≈ 330.7.

So it deleted around 330 files per second. With one IO operation per
delete, the four disks, each doing close to 75 IOs per second, have
performed reasonably; and at an average of one IO per remove, the
filesystem has also performed reasonably. It seems I forgot the
-E stride= option on mkfs.

The manual of tune2fs hints that this can be tuned after the fact with
tune2fs. I seriously doubt it. Correct?

TUNE2FS(8)
...
-E extended-options
Set extended options for the filesystem. Extended options are
comma separated, and may take an argument using the equals ('=')
sign. The following extended options are supported:

stride=stride-size
...
stripe_width=stripe-width



Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2011-01-03 17:19:07

by Eric Sandeen

Subject: Re: Uneven load on my raid disks.

On 12/30/2010 04:32 AM, Rogier Wolff wrote:
> On Wed, Dec 29, 2010 at 05:17:15PM -0500, Ted Ts'o wrote:
>> On Wed, Dec 29, 2010 at 05:40:12PM +0100, Kay Diederichs wrote:
>>>> says: dumpe2fs -h /dev/md0 | grep RAID
>>>
>>> % tune2fs -l /dev/md0
>>>
>>> ...
>>> RAID stride: 128
>>> RAID stripe width: 768
>>> ...
>>>
>>> runs much faster than dumpe2fs.
>>> The command can also adjust the values.
>>
>> Actually, "tune2fs -l" and "dumpe2fs -h" both run in about the same
>> amount of time. dumpe2fs without the -h option runs slower than
>> tune2fs -l, true. But that's because it reads and prints out
>> information regarding the block and inode allocation bitmaps.
>
> And the annoying thing is that it apparently uses a library function
> that only returns after reading all that data.
>
> So while it could print the superblock info and the first few block
> groups, I'm left waiting.
>
> My remove-of-200-million-files has completed. It took a week:
> 200,000,000 / (7*24*3600) ≈ 330.7.
>
> So it deleted around 330 files per second. With one IO operation per
> delete, the four disks, each doing close to 75 IOs per second, have
> performed reasonably; and at an average of one IO per remove, the
> filesystem has also performed reasonably. It seems I forgot the
> -E stride= option on mkfs.
>
> The manual of tune2fs hints that this can be tuned after the fact with
> tune2fs. I seriously doubt it. Correct?
>
> TUNE2FS(8)
> ...
> -E extended-options
> Set extended options for the filesystem. Extended options are
> comma separated, and may take an argument using the equals ('=')
> sign. The following extended options are supported:
>
> stride=stride-size
> ...
> stripe_width=stripe-width
>

It will change the superblock values, but you're right, it does not
appear to actually move around any metadata or inode tables.

Interestingly, there are some facilities for doing this if the inode
size gets changed:

/*
 * We need to scan for inode and block bitmaps that may need to be
 * moved. This can take place if the filesystem was formatted for
 * RAID arrays using the mke2fs's extended option "stride".
 */
static int group_desc_scan_and_fix(ext2_filsys fs, ext2fs_block_bitmap bmap)


-Eric

>
> Roger.
>