2008-06-07 14:22:54

by Justin Piszcz

Subject: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors

First, the original benchmarks with six SATA drives, with the formatting fixed
(right justification and the same decimal point precision throughout):
http://home.comcast.net/~jpiszcz/20080607/raid-benchmarks-decimal-fix-and-right-justified/disks.html

Now for the VelociRaptors! Ever wonder what kind of speed is possible with
3- through 10-disk RAID5 arrays? I ran a loop to find out. Each configuration
is run three times, and the result reported for each RAID5 disk set is the
average of its three runs.

In short? The 965 chipset no longer does justice to faster drives; a new
chipset and motherboard are needed. Reading from or writing to just 4-5
VelociRaptors saturates the bus/965 chipset.

Here is a picture of the 12 VelociRaptors I tested with:
http://home.comcast.net/~jpiszcz/20080607/raid5-benchmarks-3to10-veliciraptors/raptors.jpg

Here are the bonnie++ results:
http://home.comcast.net/~jpiszcz/20080607/raid5-benchmarks-3to10-veliciraptors/veliciraptor-raid.html

For those who want the results in text:
http://home.comcast.net/~jpiszcz/20080607/raid5-benchmarks-3to10-veliciraptors/veliciraptor-raid.txt

System used, same/similar as before:
Motherboard: Intel DG965WH
Memory: 8GiB
Kernel: 2.6.25.4
Distribution: Debian Testing x86_64
Filesystem: XFS with default mkfs.xfs parameters [auto-optimized for SW RAID]
Mount options: defaults,noatime,nodiratime,logbufs=8,logbsize=262144 0 1
Chunk size: 1024KiB
RAID5 Layout: Default (left-symmetric)
Mdadm Superblock used: 0.90

Optimizations used (the last one is for the CFQ scheduler; it improves
performance by a modest 5-10MiB/s):
http://home.comcast.net/~jpiszcz/raid/20080601/raid5.html

# Tell user what's going on.
echo "Optimizing RAID Arrays..."

# Define DISKS.
cd /sys/block
DISKS=$(/bin/ls -1d sd[a-z])

# Set read-ahead.
# 65536 x 512-byte sectors = 32 MiB.
echo "Setting read-ahead to 32 MiB for /dev/md3"
blockdev --setra 65536 /dev/md3

# Set stripe-cache_size for RAID5.
echo "Setting stripe_cache_size to 16 MiB for /dev/md3"
echo 16384 > /sys/block/md3/md/stripe_cache_size

# Disable NCQ on all disks.
echo "Disabling NCQ on all disks..."
for i in $DISKS
do
    echo "Disabling NCQ on $i"
    echo 1 > /sys/block/"$i"/device/queue_depth
done

# Fix slice_idle.
# See http://www.nextre.it/oracledocs/ioscheduler_03.html
echo "Fixing slice_idle to 0..."
for i in $DISKS
do
    echo "Changing slice_idle to 0 on $i"
    echo 0 > /sys/block/"$i"/queue/iosched/slice_idle
done
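
As a sanity check on the read-ahead figure in the script above: blockdev
--setra takes a count of 512-byte sectors, so the conversion to MiB can be
sketched as follows. The helper name ra_sectors_to_mib is made up for this
illustration and is not part of the original script.

```shell
#!/bin/sh
# Convert a `blockdev --setra` value (512-byte sectors) to MiB.
# ra_sectors_to_mib is a hypothetical helper, not part of the tuning script.
ra_sectors_to_mib() {
    echo $(( $1 * 512 / 1024 / 1024 ))
}

# The value used for /dev/md3 in the script above.
echo "read-ahead: $(ra_sectors_to_mib 65536) MiB"
```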

----

Order of tests:

1. Create RAID (mdadm)

Example:


if [ "$num_disks" -eq 3 ]; then
    mdadm --create /dev/md3 --verbose --level=5 -n "$num_disks" -c 1024 -e 0.90 \
        /dev/sd[c-e]1 --assume-clean --run
fi

2. Run optimize script (above)

See above.

3. mkfs.xfs -f /dev/md3

mkfs.xfs auto-optimizes for the underlying devices in an mdadm software RAID.

4. Run bonnie++ as shown below 3 times, averaged:

/usr/bin/time /usr/sbin/bonnie++ -u 1000 -d /x/test -s 16384 -m p34 -n 16:100000:16:64 > $HOME/test"$run"_$num_disks-disks.txt 2>&1
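
The four steps above could be tied together roughly as in the sketch below. It
only prints the commands rather than running them. The device ranges for more
than three disks are an assumption (members starting at sdc, as in the 3-disk
example), optimize.sh is a placeholder name for the tuning script, and the
mount and teardown steps are left out because the post does not show them.

```shell
#!/bin/sh
# Print (do not execute) the per-array benchmark plan described above.
# Assumption: member disks start at sdc, as in the 3-disk example.
member_glob() {
    last=$(printf '%s' "cdefghijkl" | cut -c "$1")
    printf '/dev/sd[c-%s]1\n' "$last"
}

print_plan() {
    for num_disks in 3 4 5 6 7 8 9 10; do
        echo "mdadm --create /dev/md3 --verbose --level=5 -n $num_disks -c 1024 -e 0.90 $(member_glob "$num_disks") --assume-clean --run"
        echo "./optimize.sh   # hypothetical name for the tuning script"
        echo "mkfs.xfs -f /dev/md3"
        # Three bonnie++ runs per array; the results are averaged afterwards.
        for run in 1 2 3; do
            echo "bonnie++ -u 1000 -d /x/test -s 16384 -m p34 -n 16:100000:16:64 > test${run}_${num_disks}-disks.txt"
        done
    done
}

print_plan
```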


----

A little more info: after 4-5 dd's I have already maxed out what the chipset
can offer. See below:

knoppix@Knoppix:~$ vmstat 1
procs ----------memory---------- -swap ----io---- --system-- -----cpu----
 r  b swpd    free   buff  cache si so  bi     bo   in    cs us sy  id wa
 1  0    0 2755556   6176 203584  0  0 153      1   25   371  3  1  84 11
 0  0    0 2755556   6176 203588  0  0   0      0   66   257  0  0 100  0
 0  1    0 2605400 152204 203584  0  0   0 146028  257   396  0  5  77 18
 0  1    0 2478176 277520 203604  0  0   0 125316  345   794  1  4  75 20
 1  0    0 2349472 403984 203592  0  0   0 119136  297   256  0  5  75 20
 2  1    0 2117292 631172 203512  0  0   0 232336  498  1019  0  8  66 26
 0  2    0 2014400 731968 203556  0  0   0 241472  542  2078  1 11  63 25
 3  0    0 2013412 733756 203492  0  0   0 302104  672  2760  0 14  59 27
 0  3    0 2013576 735624 203520  0  0   0 362524  808  3356  0 15  56 29
 0  4    0 2039312 736728 174860  0  0 120 425484  956  4899  1 20  52 26
 0  4    0 2050236 738508 163712  0  0   0 482868 1008  5030  1 24  46 29
 5  3    0 2050192 737916 163756  0  0   0 531532 1175  6033  0 26  43 31
 3  4    0 2050220 738028 163744  0  0   0 606560 1312  6664  1 32  38 30
 1  5    0 2049432 739184 163628  0  0   0 592756 1291  7195  1 30  35 34
 8  3    0 2049488 738868 163580  0  0   0 675228 1721 10540  1 38  30 31
Here, ~5 raptor 300s, no more linear improvement after this:
 4  4    0 2050048 737816 163744  0  0   0 677820 1771 10514  1 36  32 31
 6  4    0 2048764 738612 163684  0  0   0 697640 1842 13231  1 40  27 33
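
To read the "bo" column above as throughput, a small conversion helps. This is
a sketch that assumes vmstat reports blocks of 1024 bytes per second (as its
man page states for Linux); bo_to_mib is a made-up helper name.

```shell
#!/bin/sh
# Convert a vmstat "bo" figure (blocks out per second) to MiB/s,
# assuming 1024-byte blocks. bo_to_mib is a hypothetical helper.
bo_to_mib() {
    awk -v bo="$1" 'BEGIN { printf "%.1f\n", bo / 1024 }'
}

bo_to_mib 146028   # first non-idle sample in the trace
bo_to_mib 677820   # first sample after the "~5 raptor 300s" marker
```

With these inputs the plateau works out to roughly 660-680 MiB/s of writes.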


2008-06-07 15:56:31

by David Lethe

Subject: RE: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors

This is all interesting, but it has no relevance to the real world,
where computers run application software. You have a great foundation
here, but it won't help anybody who is running a database, mail, or
file/backup server, because the I/Os are too large and too homogeneous.
You will get profoundly different sweet spots for RAID configurations once
you model your bench to match something that people actually run. I am
not criticizing you for this; it is just that now that I have a taste of
what you have accomplished, I want more, more, more :)

David



2008-06-08 01:46:31

by Dan Williams

Subject: Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors

On Sat, Jun 7, 2008 at 7:22 AM, Justin Piszcz <[email protected]> wrote:
> # Set stripe-cache_size for RAID5.
> echo "Setting stripe_cache_size to 16 MiB for /dev/md3"

Sorry to sound like a broken record, 16MiB is not correct.

size=$((num_disks * 4 * 16384 / 1024))
echo "Setting stripe_cache_size to $size MiB for /dev/md3"

...and commit 8b3e6cdc should improve the performance / stripe_cache_size ratio.

> echo 16384 > /sys/block/md3/md/stripe_cache_size
>

Thanks for putting this data together.

Regards,
Dan
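
The stripe cache arithmetic in the reply above can be checked with a short
sketch. It assumes 4 KiB pages and that each stripe cache entry holds one page
per member disk, which is how the quoted formula reads; stripe_cache_mib is a
made-up helper name.

```shell
#!/bin/sh
# Approximate stripe cache memory in MiB, following the formula quoted above:
#   num_disks * 4 (KiB per page) * stripe_cache_size entries / 1024
# Assumes 4 KiB pages; stripe_cache_mib is a hypothetical helper.
stripe_cache_mib() {
    num_disks=$1
    entries=$2
    echo $(( num_disks * 4 * entries / 1024 ))
}

for n in 3 10; do
    echo "$n disks: $(stripe_cache_mib "$n" 16384) MiB"
done
```

Under those assumptions a setting of 16384 costs 192 MiB on a 3-disk array and
640 MiB on a 10-disk array, not a fixed 16 MiB.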

2008-06-09 07:51:36

by thomas62186218

Subject: Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors

Thank you for sharing these results. One issue that I consistently see
with these results is miserable random IO performance. Looking at these
numbers, even a low-end RAID controller with 128MB of cache will outrun
md-based RAIDs in random IO benchmarks. In today's world of virtual
machines, etc, random IO is far more common than sequential IO. What
can be done with md (or something else) to alleviate this problem?

-Thomas



2008-06-09 08:43:54

by Keld Jørn Simonsen

Subject: Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors

On Mon, Jun 09, 2008 at 03:51:07AM -0400, [email protected] wrote:
> Thank you for sharing these results. One issue that I consistently see
> with these results is miserable random IO performance. Looking at these
> numbers, even a low-end RAID controller with 128MB of cache will outrun
> md-based RAIDs in random IO benchmarks. In today's world of virtual
> machines, etc, random IO is far more common than sequential IO. What
> can be done with md (or something else) to alleviate this problem?

Have you got any numbers to back this up?

What benchmark are you using for random IO?

Anyway, the numbers that Justin reported were from an outdated motherboard.

My take is that Linux MD RAID can outperform most HW RAID by a factor of two
on random IO.

Best regards
keld


2008-06-09 13:42:57

by David Lethe

Subject: RE: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors

For faster random I/O:
* Decrease chunk size
* Migrate files that have higher random I/O to a RAID1 set, using disks
with the lowest access time/latency
* If possible, use the /dev/shm file system
* Determine I/O size of apps that produce most of the random I/O, and
make sure that md+filesystem matches. If most random I/O is 32KB, then
don't waste bandwidth by making md read 256KB at a time, or making it
issue 2x16KB I/Os. Also, don't build md sets like a 4-drive RAID5 (do a
5-drive RAID5 set instead), because the non-parity data isn't a multiple of 2. A
10-drive RAID5 set with heavy random I/O is also profoundly wrong
because you are just removing the opportunity to have all of those heads
processing random I/O.
* If you only have one partition on a md set, then partition it into a
few file systems. This may provide greater opportunity for caching I/Os.
* Experiment with different file systems, and optimize accordingly.
* Turn off journaling, or at least move journals to RAID1 devices.
* Add RAM and try to increase buffer cache in attempt to improve cache
hit percentage (this works up to a point)
* Buy a small SSD and migrate files that get pounded with random I/O to
that device. (Make sure you don't get a flash SSD, but a DRAM based SSD
that satisfies random I/O in nanoseconds instead of millisecs). They are
expensive, but the appropriate device. This is how companies such as
Google & Ebay manage to get things done.
The biggest thing to remember about random I/O is that it is
expensive, so just step back and think about ways to minimize the I/O
requests to disk in the first place, and/or to spread the I/O across
multiple raidsets that can work independently to satisfy your load. Not all
of the suggestions above will work for everybody. You must understand the
nature of the bottleneck.

David
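
The point above about matching the md geometry to the application's I/O size
comes down to simple arithmetic, sketched below. The helper names are made up,
and reading "multiple of 2" as "the data-disk count is a power of two" is an
interpretation, not something the message states.

```shell
#!/bin/sh
# RAID5 geometry helpers; the names are hypothetical.
# A RAID5 stripe holds (total_disks - 1) data chunks plus one parity chunk.
full_stripe_kib() {      # args: total_disks chunk_kib
    echo $(( ($1 - 1) * $2 ))
}

# Succeeds when the data-disk count is a power of two (1, 2, 4, 8, ...).
data_disks_pow2() {      # arg: total_disks
    d=$(( $1 - 1 ))
    [ "$d" -gt 0 ] && [ $(( d & (d - 1) )) -eq 0 ]
}

for n in 4 5 10; do
    if data_disks_pow2 "$n"; then
        fit="power-of-two data disks"
    else
        fit="data disks not a power of two"
    fi
    echo "$n-drive RAID5, 1024 KiB chunk: full stripe $(full_stripe_kib "$n" 1024) KiB ($fit)"
done
```

With the 1024 KiB chunk used in the benchmark, a full stripe on the 10-drive
array is 9216 KiB, far larger than a 32KB random request.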


2008-06-09 14:27:39

by Keld Jørn Simonsen

Subject: Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors

On Mon, Jun 09, 2008 at 08:41:18AM -0500, David Lethe wrote:
> For faster random I/O:
> * Decrease chunk size
> * Migrate files that have higher random I/O to a RAID1 set, using disks
> with the lowest access time/latency
> * If possible, use the /dev/shm file system
> * Determine I/O size of apps that produce most of the random I/O, and
> make sure that md+filesystem matches. If most random I/O is 32KB, then
> don't waste bandwidth by making md read 256KB at a time, or making it
> read 2x16KB I/Os. Also don't build md sets like 4-drive RAID5, (Do a
> 5-drive RAID5 set), because non-parity data isn't a multiple of 2. A
> 10-drive RAID5 set with heavy random I/O is also profoundly wrong
> because you are just removing the opportunity to have all of those heads
> processing random I/O.
> * If you only have one partition on a md set, then partition it into a
> few file systems. This may provide greater opportunity for caching I/Os.
> * Experiment with different file systems, and optimize accordingly.
> * Turn off journaling, or at least move journals to RAID1 devices.
> * Add RAM and try to increase buffer cache in attempt to improve cache
> hit percentage (this works up to a point)
> * Buy a small SSD and migrate files that get pounded with random I/O to
> that device. (Make sure you don't get a flash SSD, but a DRAM based SSD
> that satisfies random I/O in nanoseconds instead of millisecs). They are
> expensive, but the appropriate device. This is how companies such as
> Google & Ebay manage to get things done.
> The biggest thing to remember about random I/O is that it is
> expensive, so just step back and think about ways to minimize the I/O
> requests to disk in the first place, and/or to spread the I/O across
> multiple raidsets that can work independently to satisfy your load. All
> suggestions above will not work for everybody. You must understand the
> nature of the bottleneck.


For faster random IO I would suggest using raid10,f2. For random
reading it performs like raid0, something like more than double the
speed of a normal single-drive file system. For random writes raid10,f2
performs like most other mirrored raids, given that data needs to be
written twice.

Try and see if you can get any HW raids to match that performance.

best regards
keld
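
For anyone who wants to try the raid10,f2 suggestion, the sketch below builds
the corresponding mdadm command line and prints it without running it. The md
device and member partitions are placeholders, not taken from this thread, and
raid10_f2_cmd is a made-up helper name.

```shell
#!/bin/sh
# Print an mdadm command line for a RAID10 array with the "far 2" layout.
# The md device and member partitions passed in are placeholders.
raid10_f2_cmd() {        # args: md_device member...
    md=$1
    shift
    echo "mdadm --create $md --level=10 --layout=f2 --raid-devices=$# $*"
}

raid10_f2_cmd /dev/md4 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
```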

> David
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> [email protected]
> Sent: Monday, June 09, 2008 2:51 AM
> To: [email protected]; [email protected]
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte
> Veliciraptors
>
> Thank you for sharing these results. One issue that I consistently see
> with these results is miserable random IO performance. Looking at these
> numbers, even a low-end RAID controller with 128MB of cache will outrun
> md-based RAIDs in random IO benchmarks. In today's world of virtual
> machines, etc, random IO is far more common than sequential IO. What
> can be done with md (or something else) to alleviate this problem?
>
> -Thomas
>
>
> -----Original Message-----
> From: Dan Williams <[email protected]>
> To: Justin Piszcz <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; Alan Piszcz <[email protected]>
> Sent: Sat, 7 Jun 2008 6:46 pm
> Subject: Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte
> Veliciraptors
>
>
>
>
>
>
>
>
>
>
> On Sat, Jun 7, 2008 at 7:22 AM, Justin Piszcz <[email protected]>
> wrote:
> > First, the original benchmarks with 6-SATA drives with fixed formatting, using
> > right justification and the same decimal point precision throughout:
> > http://home.comcast.net/~jpiszcz/20080607/raid-benchmarks-decimal-fix-and-right-justified/disks.html
> >
> > Now for the veliciraptors! Ever wonder what kind of speed is possible with
> > 3 disk, 4,5,6,7,8,9,10-disk RAID5s? I ran a loop to find out, each run is
> > executed three times and the average is taken of all three runs per each
> > RAID5 disk set.
> >
> > In short? The 965 no longer does justice with faster drives, a new chipset
> > and motherboard are needed. After reading or writing to 4-5 veliciraptors
> > it saturates the bus/965 chipset.
> >
> > Here is a picture of the 12 veliciraptors I tested with:
> > http://home.comcast.net/~jpiszcz/20080607/raid5-benchmarks-3to10-veliciraptors/raptors.jpg
> >
> > Here are the bonnie++ results:
> > http://home.comcast.net/~jpiszcz/20080607/raid5-benchmarks-3to10-veliciraptors/veliciraptor-raid.html
> >
> > For those who want the results in text:
> > http://home.comcast.net/~jpiszcz/20080607/raid5-benchmarks-3to10-veliciraptors/veliciraptor-raid.txt
> >
> > System used, same/similar as before:
> > Motherboard: Intel DG965WH
> > Memory: 8GiB
> > Kernel: 2.6.25.4
> > Distribution: Debian Testing x86_64
> > Filesystem: XFS with default mkfs.xfs parameters [auto-optimized for SW RAID]
> > Mount options: defaults,noatime,nodiratime,logbufs=8,logbsize=262144 0 1
> > Chunk size: 1024KiB
> > RAID5 Layout: Default (left-symmetric)
> > Mdadm Superblock used: 0.90
> >
> > Optimizations used (last one is for the CFQ scheduler), it improves
> > performance by a modest 5-10MiB/s:
> > http://home.comcast.net/~jpiszcz/raid/20080601/raid5.html
> >
> > # Tell user what's going on.
> > echo "Optimizing RAID Arrays..."
> >
> > # Define DISKS.
> > cd /sys/block
> > DISKS=$(/bin/ls -1d sd[a-z])
> >
> > # Set read-ahead.
> > # > That's actually 65k x 512byte blocks so 32MiB
> > echo "Setting read-ahead to 32 MiB for /dev/md3"
> > blockdev --setra 65536 /dev/md3
> >
> > # Set stripe-cache_size for RAID5.
> > echo "Setting stripe_cache_size to 16 MiB for /dev/md3"
>
> Sorry to sound like a broken record, 16MiB is not correct.
>
> size=$((num_disks * 4 * 16384 / 1024))
> echo "Setting stripe_cache_size to $size MiB for /dev/md3"
>
> ...and commit 8b3e6cdc should improve the performance /
> stripe_cache_size ratio.
>
> > echo 16384 > /sys/block/md3/md/stripe_cache_size
> >
> > # Disable NCQ on all disks.
> > echo "Disabling NCQ on all disks..."
> > for i in $DISKS
> > do
> > echo "Disabling NCQ on $i"
> > echo 1 > /sys/block/"$i"/device/queue_depth
> > done
> >
> > # Fix slice_idle.
> > # See http://www.nextre.it/oracledocs/ioscheduler_03.html
> > echo "Fixing slice_idle to 0..."
> > for i in $DISKS
> > do
> > echo "Changing slice_idle to 0 on $i"
> > echo 0 > /sys/block/"$i"/queue/iosched/slice_idle
> > done
> >
>
> Thanks for putting this data together.
>
> Regards,
> Dan
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2008-06-09 15:01:19

by David Lethe

[permalink] [raw]
Subject: RE: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors



-----Original Message-----
From: Keld Jørn Simonsen [mailto:[email protected]]
Sent: Monday, June 09, 2008 9:27 AM
To: David Lethe
Cc: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
Subject: Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors

On Mon, Jun 09, 2008 at 08:41:18AM -0500, David Lethe wrote:
> For faster random I/O:
> * Decrease chunk size
> * Migrate files that have higher random I/O to a RAID1 set, using disks
> with the lowest access time/latency
> * If possible, use the /dev/shm file system
> * Determine I/O size of apps that produce most of the random I/O, and
> make sure that md+filesystem matches. If most random I/O is 32KB, then
> don't waste bandwidth by making md read 256KB at a time, or making it
> read 2x16KB I/Os. Also don't build md sets like 4-drive RAID5, (Do a
> 5-drive RAID5 set), because non-parity data isn't a multiple of 2. A
> 10-drive RAID5 set with heavy random I/O is also profoundly wrong
> because you are just removing the opportunity to have all of those heads
> processing random I/O.
> * If you only have one partition on a md set, then partition it into a
> few file systems. This may provide greater opportunity for caching I/Os.
> * Experiment with different file systems, and optimize accordingly.
> * Turn off journaling, or at least move journals to RAID1 devices.
> * Add RAM and try to increase buffer cache in attempt to improve cache
> hit percentage (this works up to a point)
> * Buy a small SSD and migrate files that get pounded with random I/O to
> that device. (Make sure you don't get a flash SSD, but a DRAM based SSD
> that satisfies random I/O in nanoseconds instead of millisecs). They are
> expensive, but the appropriate device. This is how companies such as
> Google & Ebay manage to get things done.
> The biggest thing to remember about random I/O, is that they are
> expensive, so just step back and think about ways to minimize the I/O
> requests to disk in the first place, and/or to spread the I/O across
> multiple raidsets that can work independently to satisfy your load. All
> suggestions above will not work for everybody. You must understand the
> nature of the bottleneck.


For faster random IO I would suggest using raid10,f2 for the random
reading; it performs like raid0, something like more than double the
speed of a normal single-drive file system. For random writes raid10,f2
performs like most other mirrored raids, given that data needs to be
written twice.

Try and see if you can get any HW raids to match that performance.

best regards
keld

--------------------------------------------------------------------------------
Keld:
That is counter-intuitive. The issue is random IOPs, not throughput. I do not
understand how a RAID10 would provide more IOs per sec than RAID1. Or, since
you are using RAID10, then how could RAID10 serve more random I/Os than a pair
of RAID1 filesystems? RAID0 dictates that each disk will supply half
of the data you want per application I/O request. At least with RAID1, then each
disk can get all the data you want with a single request, and dual-porting/load balancing
will allow both disks to work independently of each other on reads so the disk with
the least amount of load at any time can work on the request. That is why RAID1 can be
faster than JBOD.

Granted, writes are handled differently, but with any RAID0 implementation you still have to write
half of the data to each disk, requiring 2 I/Os + journaling & housekeeping.
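
Whether a single request is actually split across member disks depends on its size and alignment relative to the chunk size. A minimal sketch for checking this, assuming plain striping where consecutive chunks land on consecutive disks (the function name chunks_touched is hypothetical, and all sizes are in KiB):

```shell
#!/bin/sh
# chunks_touched OFFSET_KIB LENGTH_KIB CHUNK_KIB
# Prints how many chunks (and so, at most, how many member disks)
# one contiguous request spans on a striped array.
chunks_touched() {
    offset=$1 length=$2 chunk=$3
    first=$(( offset / chunk ))                 # chunk holding the first KiB
    last=$(( (offset + length - 1) / chunk ))   # chunk holding the last KiB
    echo $(( last - first + 1 ))
}

chunks_touched 0 32 1024      # aligned 32 KiB request, 1024 KiB chunk -> 1
chunks_touched 1008 32 1024   # same request straddling a boundary    -> 2
chunks_touched 0 32 16        # 32 KiB request, 16 KiB chunk          -> 2
```

With a chunk much larger than the typical request, most random requests land on one disk and the other disks stay free to serve other requests; with a chunk smaller than the request, every request occupies more than one disk.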


David

2008-06-09 23:15:39

by Keld Jørn Simonsen

[permalink] [raw]
Subject: Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors

On Mon, Jun 09, 2008 at 09:56:14AM -0500, David Lethe wrote:
>
>
> From: Keld Jørn Simonsen [mailto:[email protected]]
> Sent: Monday, June 09, 2008 9:27 AM
> To: David Lethe
> Cc: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
> Subject: Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors
>
> For faster random IO I would suggest using raid10,f2 for the random
> reading; it performs like raid0, something like more than double the
> speed of a normal single-drive file system. For random writes raid10,f2
> performs like most other mirrored raids, given that data needs to be
> written twice.
>
> Try and see if you can get any HW raids to match that performance.
>
> best regards
> keld
>
> --------------------------------------------------------------------------------
> Keld:
> That is counter-intuitive. The issue is random IOPs, not throughput.

That probably depends on your use. I run Linux mirrors, and for that
purpose thruput of random IO, especially reading, is key.

For databases it is probably something else, probably IOPS. Here I also
think that Linux MD raid has good performance. Once again I think my pet
RAID type, raid10,f2, has something to offer, especially with lower
random seek rates, as the track span is shorter, and on the outer,
faster tracks.

And other uses may have other bottlenecks. In general I think that
thruput is an important figure, as it shows how fast a system can
process a given amount of data. Areas where this may count include web servers,
file servers, print servers, ordinary workstations.

I actually think those 2 measures for random IO: IO thruput, and IO transactions per
second, for read and write, are the two most important measures.
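
The two measures are linked by the request size: thruput is roughly transactions per second times the size of each request. A minimal sketch of that arithmetic, where the function name throughput_kib_s is hypothetical and the inputs are illustrative values, not measurements from this thread:

```shell
#!/bin/sh
# throughput_kib_s IOPS REQUEST_KIB
# Thruput in KiB/s for a given transaction rate and request size.
throughput_kib_s() {
    echo $(( $1 * $2 ))
}

throughput_kib_s 200 4      # 200 transactions/s of 4 KiB    -> 800
throughput_kib_s 200 1024   # 200 transactions/s of 1024 KiB -> 204800
```

The same transaction rate gives very different thruput figures depending on request size, which is why both numbers are worth reporting.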

For the IO transactions per second I agree that your suggestions are good
advice.

I would like to have good benchmarking tools for this, and also I would
like figures on how Linux MD compares to different HW RAID.

> I do not
> understand how a RAID10 would provide more IOs per sec than RAID1. Or, since
> you are using RAID10, then how could RAID10 serve more random I/Os than a pair
> of RAID1 filesystems?

In theory you are right. The MD implementation of RAID1 does not seem to
handle random seeks so well, AFAIK. Also, with raid10,f2 the seeks are
confined to less than half of the disk arm movement, which does speed
things up a little.

> RAID0 dictates that each disk will supply half
> of the data you want per application I/O request. At least with RAID1, then each
> disk can get all the data you want with a single request, and dual-porting/load balancing
> will allow both disks to work independently of each other on reads so the disk with
> the least amount of load at any time can work on the request. That is why RAID1 can be
> faster than JBOD.
>
> Granted writes are handled differently, but with any RAID0 implementation you still have to write
> half of the data to each disk, requiring 2 I/Os + journaling & housekeeping.

yes, indeed.

best regards
keld

2008-06-11 20:30:49

by Bill Davidsen

[permalink] [raw]
Subject: Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors

Justin Piszcz wrote:
> First, the original benchmarks with 6-SATA drives with fixed
> formatting, using
> right justification and the same decimal point precision throughout:
> http://home.comcast.net/~jpiszcz/20080607/raid-benchmarks-decimal-fix-and-right-justified/disks.html
>
>
> Now for the veliciraptors! Ever wonder what kind of speed is possible
> with
> 3 disk, 4,5,6,7,8,9,10-disk RAID5s? I ran a loop to find out, each run is
> executed three times and the average is taken of all three runs per
> each RAID5 disk set.
>
> In short? The 965 no longer does justice with faster drives, a new
> chipset
> and motherboard are needed. After reading or writing to 4-5 veliciraptors
> it saturates the bus/965 chipset.

This is very interesting, but a 16GB chunk size bears no relationship to
anything I would run in the real world, and I suspect most people are in
the same category.

--
Bill Davidsen <[email protected]>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark

2008-06-11 20:48:30

by Justin Piszcz

[permalink] [raw]
Subject: Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors



On Wed, 11 Jun 2008, Bill Davidsen wrote:

> Justin Piszcz wrote:
>> First, the original benchmarks with 6-SATA drives with fixed formatting,
>> using
>> right justification and the same decimal point precision throughout:
>> http://home.comcast.net/~jpiszcz/20080607/raid-benchmarks-decimal-fix-and-right-justified/disks.html
>>
>> Now for the veliciraptors! Ever wonder what kind of speed is possible with
>> 3 disk, 4,5,6,7,8,9,10-disk RAID5s? I ran a loop to find out, each run is
>> executed three times and the average is taken of all three runs per each
>> RAID5 disk set.
>>
>> In short? The 965 no longer does justice with faster drives, a new chipset
>> and motherboard are needed. After reading or writing to 4-5 veliciraptors
>> it saturates the bus/965 chipset.
>
> This is very interesting, but a 16GB chunk size bears no relationship to
> anything I would run in the real world, and I suspect most people are in the
> same category.

I based my bonnie++ test on:
http://everything2.org/?node_id=1479435

So I could compare to his results.

I use a 1024k (1MiB) with 16384 stripe, this offered the best overall
read/write/rewrite performance AFAIK.

Justin.

2008-06-11 20:53:48

by Justin Piszcz

[permalink] [raw]
Subject: Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors



On Wed, 11 Jun 2008, Justin Piszcz wrote:

>
>
> On Wed, 11 Jun 2008, Bill Davidsen wrote:
>
>> Justin Piszcz wrote:
>>> First, the original benchmarks with 6-SATA drives with fixed formatting,
>>> using
>>> right justification and the same decimal point precision throughout:
>>> http://home.comcast.net/~jpiszcz/20080607/raid-benchmarks-decimal-fix-and-right-justified/disks.html
>>> Now for the veliciraptors! Ever wonder what kind of speed is possible with
>>> 3 disk, 4,5,6,7,8,9,10-disk RAID5s? I ran a loop to find out, each run is
>>> executed three times and the average is taken of all three runs per each
>>> RAID5 disk set.
>>>
>>> In short? The 965 no longer does justice with faster drives, a new chipset
>>> and motherboard are needed. After reading or writing to 4-5 veliciraptors
>>> it saturates the bus/965 chipset.
>>
>> This is very interesting, but a 16GB chunk size bears no relationship to
>> anything I would run in the real world, and I suspect most people are in
>> the same category.
>
> I based my bonnie++ test on:
> http://everything2.org/?node_id=1479435
>
> So I could compare to his results.
>
> I use a 1024k (1MiB) with 16384 stripe, this offered the best overall
> read/write/rewrite performance AFAIK.

1024k chunk size (raid5 chunk size)
echo 16384 > stripe_cache_size
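
Note that the value written to stripe_cache_size is a count of cache entries, not a size in bytes or MiB. Following the formula Dan Williams gave earlier in the thread (num_disks * 4 * 16384 / 1024), the memory it pins can be estimated as below. This is a rough sketch assuming a 4 KiB page size, as is usual on x86_64; the function name stripe_cache_mib is hypothetical:

```shell
#!/bin/sh
# stripe_cache_mib ENTRIES NUM_DISKS [PAGE_KIB]
# Memory used by the md stripe cache, in MiB:
#   entries * page size * number of member disks
stripe_cache_mib() {
    entries=$1 disks=$2 page_kib=${3:-4}   # page size defaults to 4 KiB (assumption)
    echo $(( entries * page_kib * disks / 1024 ))
}

# The array sizes benchmarked in this thread, all with 16384 entries.
for n in 3 4 5 6 7 8 9 10; do
    echo "$n disks: $(stripe_cache_mib 16384 "$n") MiB"
done
```

For 16384 entries this works out to 192 MiB on the 3-disk array and 640 MiB on the 10-disk array.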

2008-06-12 19:09:41

by Bill Davidsen

[permalink] [raw]
Subject: Re: Linux MD RAID 5 Benchmarks Across (3 to 10) 300 Gigabyte Veliciraptors

Justin Piszcz wrote:
>
>
> On Wed, 11 Jun 2008, Justin Piszcz wrote:
>
>>
>>
>> On Wed, 11 Jun 2008, Bill Davidsen wrote:
>>
>>> Justin Piszcz wrote:
>>>> First, the original benchmarks with 6-SATA drives with fixed
>>>> formatting, using
>>>> right justification and the same decimal point precision throughout:
>>>> http://home.comcast.net/~jpiszcz/20080607/raid-benchmarks-decimal-fix-and-right-justified/disks.html
>>>> Now for the veliciraptors! Ever wonder what kind of speed is
>>>> possible with
>>>> 3 disk, 4,5,6,7,8,9,10-disk RAID5s? I ran a loop to find out, each
>>>> run is
>>>> executed three times and the average is taken of all three runs per
>>>> each RAID5 disk set.
>>>>
>>>> In short? The 965 no longer does justice with faster drives, a new
>>>> chipset
>>>> and motherboard are needed. After reading or writing to 4-5
>>>> veliciraptors
>>>> it saturates the bus/965 chipset.
>>>
>>> This is very interesting, but a 16GB chunk size bears no
>>> relationship to anything I would run in the real world, and I
>>> suspect most people are in the same category.
>>
>> I based my bonnie++ test on:
>> http://everything2.org/?node_id=1479435
>>
>> So I could compare to his results.
>>
>> I use a 1024k (1MiB) with 16384 stripe, this offered the best overall
>> read/write/rewrite performance AFAIK.
>
> 1024k chunk size (raid5 chunk size)
> echo 16384 > stripe_cache_size

Please don't explain any more, I'm confused enough already. I can't make
those numbers match 16G no matter how I add them: either the contents of
the column labeled "size:chunk size" isn't the size of the chunk, or you
have a multiplier floating around that I don't see. And you eliminated
the degraded performance. Since your stripe_cache_size is less than
(raid5 chunk size)*(#disks), I would expect the reads in degraded mode
to be dog slow because they don't fit in cache, even if 1024k is what I
call chunk size, and certainly not if chunk size is 16G.

--
Bill Davidsen <[email protected]>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark