2012-08-15 00:50:16

by Andy Lutomirski

Subject: O_DIRECT to md raid 6 is slow

If I do:
# dd if=/dev/zero of=/dev/md0p1 bs=8M
then iostat -m 5 says:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   26.88   35.27    0.00   37.85

Device:          tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb           265.20         1.16        54.79          5        273
sdc           266.20         1.47        54.73          7        273
sdd           264.20         1.38        54.54          6        272
sdf           286.00         1.84        54.74          9        273
sde           266.60         1.04        54.75          5        273
sdg           265.00         1.02        54.74          5        273
md0         55808.00         0.00       218.00          0       1090

If I do:
# dd if=/dev/zero of=/dev/md0p1 bs=8M oflag=direct
then iostat -m 5 says:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   11.70   12.94    0.00   75.36

Device:          tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb           831.00         8.58        30.42         42        152
sdc           832.80         8.05        29.99         40        149
sdd           832.00         9.10        29.78         45        148
sdf           838.40         9.11        29.72         45        148
sde           828.80         7.91        29.79         39        148
sdg           850.80         8.00        30.18         40        150
md0          1012.60         0.00       101.27          0        506

It looks like md isn't recognizing that I'm writing whole stripes when
I'm in O_DIRECT mode.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC
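
One way to read the two iostat samples above: on a 6-drive RAID 6, a full-stripe write puts 6 chunks on disk for every 4 chunks of data, so the member writes should add up to roughly 1.5 times the md0 write rate, with member reads near zero. The sketch below checks both samples against that expectation. The 6/4 ratio as a model and the function name are assumptions made for illustration; the figures are copied from the iostat output.

```python
# Rough sanity check of the iostat samples, assuming a 6-drive RAID 6
# (4 data + 2 parity chunks per stripe). All rates are MB/s from the post.

def write_amplification(md_write, member_writes):
    """Ratio of the summed member write rate to the md device write rate."""
    return sum(member_writes) / md_write

# Assumed ideal for full-stripe writes: 6 chunks written per 4 data chunks.
FULL_STRIPE_RATIO = 6 / 4

buffered = write_amplification(218.00, [54.79, 54.73, 54.54, 54.74, 54.75, 54.74])
direct = write_amplification(101.27, [30.42, 29.99, 29.78, 29.72, 29.79, 30.18])
direct_reads = sum([8.58, 8.05, 9.10, 9.11, 7.91, 8.00])

print(f"ideal ratio    : {FULL_STRIPE_RATIO:.2f}")
print(f"buffered ratio : {buffered:.2f}")
print(f"O_DIRECT ratio : {direct:.2f}")
print(f"O_DIRECT reads : {direct_reads:.2f} MB/s across all members")
```

The buffered sample lands close to the ideal ratio, while the O_DIRECT sample shows extra member writes plus substantial member reads, which fits the suspicion that md is not treating those writes as whole stripes.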


2012-08-15 01:07:11

by kedacomkernel

Subject: Re: O_DIRECT to md raid 6 is slow

On 2012-08-15 08:49 Andy Lutomirski <[email protected]> Wrote:
>If I do:
># dd if=/dev/zero of=/dev/md0p1 bs=8M
>then iostat -m 5 says:
>
>avg-cpu: %user %nice %system %iowait %steal %idle
> 0.00 0.00 26.88 35.27 0.00 37.85
>
>Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
>sdb 265.20 1.16 54.79 5 273
>sdc 266.20 1.47 54.73 7 273
>sdd 264.20 1.38 54.54 6 272
>sdf 286.00 1.84 54.74 9 273
>sde 266.60 1.04 54.75 5 273
>sdg 265.00 1.02 54.74 5 273
>md0 55808.00 0.00 218.00 0 1090
>
>If I do:
># dd if=/dev/zero of=/dev/md0p1 bs=8M oflag=direct
>then iostat -m 5 says:
>avg-cpu: %user %nice %system %iowait %steal %idle
> 0.00 0.00 11.70 12.94 0.00 75.36
>
>Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
>sdb 831.00 8.58 30.42 42 152
>sdc 832.80 8.05 29.99 40 149
>sdd 832.00 9.10 29.78 45 148
>sdf 838.40 9.11 29.72 45 148
>sde 828.80 7.91 29.79 39 148
>sdg 850.80 8.00 30.18 40 150
>md0 1012.60 0.00 101.27 0 506
>
>It looks like md isn't recognizing that I'm writing whole stripes when
>I'm in O_DIRECT mode.
>
kernel version?

>--Andy
>
>--
>Andy Lutomirski
>AMA Capital Management, LLC
>--
>To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>the body of a message to [email protected]
>More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-08-15 01:12:48

by Andy Lutomirski

Subject: Re: O_DIRECT to md raid 6 is slow

Ubuntu's 3.2.0-27-generic. I can test on a newer kernel tomorrow.

--Andy

On Tue, Aug 14, 2012 at 6:07 PM, kedacomkernel <[email protected]> wrote:
> On 2012-08-15 08:49 Andy Lutomirski <[email protected]> Wrote:
>>If I do:
>># dd if=/dev/zero of=/dev/md0p1 bs=8M
>>then iostat -m 5 says:
>>
>>avg-cpu: %user %nice %system %iowait %steal %idle
>> 0.00 0.00 26.88 35.27 0.00 37.85
>>
>>Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
>>sdb 265.20 1.16 54.79 5 273
>>sdc 266.20 1.47 54.73 7 273
>>sdd 264.20 1.38 54.54 6 272
>>sdf 286.00 1.84 54.74 9 273
>>sde 266.60 1.04 54.75 5 273
>>sdg 265.00 1.02 54.74 5 273
>>md0 55808.00 0.00 218.00 0 1090
>>
>>If I do:
>># dd if=/dev/zero of=/dev/md0p1 bs=8M oflag=direct
>>then iostat -m 5 says:
>>avg-cpu: %user %nice %system %iowait %steal %idle
>> 0.00 0.00 11.70 12.94 0.00 75.36
>>
>>Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
>>sdb 831.00 8.58 30.42 42 152
>>sdc 832.80 8.05 29.99 40 149
>>sdd 832.00 9.10 29.78 45 148
>>sdf 838.40 9.11 29.72 45 148
>>sde 828.80 7.91 29.79 39 148
>>sdg 850.80 8.00 30.18 40 150
>>md0 1012.60 0.00 101.27 0 506
>>
>>It looks like md isn't recognizing that I'm writing whole stripes when
>>I'm in O_DIRECT mode.
>>
> kernel version?
>
>>--Andy
>>
>>--
>>Andy Lutomirski
>>AMA Capital Management, LLC



--
Andy Lutomirski
AMA Capital Management, LLC

2012-08-15 01:23:21

by kedacomkernel

Subject: Re: Re: O_DIRECT to md raid 6 is slow

On 2012-08-15 09:12 Andy Lutomirski <[email protected]> Wrote:
>Ubuntu's 3.2.0-27-generic. I can test on a newer kernel tomorrow.
I guess it may be missing the blk_plug call.
Can you apply this patch and retest?

Move unplugging for direct I/O from around ->direct_IO() down to
do_blockdev_direct_IO(). This implicitly adds plugging for direct
writes.

CC: Li Shaohua <[email protected]>
Acked-by: Jeff Moyer <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
 fs/direct-io.c |    5 +++++
 mm/filemap.c   |    4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

--- linux-next.orig/mm/filemap.c 2012-08-05 16:24:47.859465122 +0800
+++ linux-next/mm/filemap.c 2012-08-05 16:24:48.407465135 +0800
@@ -1412,12 +1412,8 @@ generic_file_aio_read(struct kiocb *iocb
 			retval = filemap_write_and_wait_range(mapping, pos,
 					pos + iov_length(iov, nr_segs) - 1);
 			if (!retval) {
-				struct blk_plug plug;
-
-				blk_start_plug(&plug);
 				retval = mapping->a_ops->direct_IO(READ, iocb,
 							iov, pos, nr_segs);
-				blk_finish_plug(&plug);
 			}
 			if (retval > 0) {
 				*ppos = pos + retval;
--- linux-next.orig/fs/direct-io.c 2012-07-07 21:46:39.531508198 +0800
+++ linux-next/fs/direct-io.c 2012-08-05 16:24:48.411465136 +0800
@@ -1062,6 +1062,7 @@ do_blockdev_direct_IO(int rw, struct kio
 	unsigned long user_addr;
 	size_t bytes;
 	struct buffer_head map_bh = { 0, };
+	struct blk_plug plug;

 	if (rw & WRITE)
 		rw = WRITE_ODIRECT;
@@ -1177,6 +1178,8 @@ do_blockdev_direct_IO(int rw, struct kio
 				PAGE_SIZE - user_addr / PAGE_SIZE);
 	}

+	blk_start_plug(&plug);
+
 	for (seg = 0; seg < nr_segs; seg++) {
 		user_addr = (unsigned long)iov[seg].iov_base;
 		sdio.size += bytes = iov[seg].iov_len;
@@ -1235,6 +1238,8 @@ do_blockdev_direct_IO(int rw, struct kio
 	if (sdio.bio)
 		dio_bio_submit(dio, &sdio);

+	blk_finish_plug(&plug);
+
 	/*
 	 * It is possible that, we return short IO due to end of file.
 	 * In that case, we need to release all the pages we got hold on.



2012-08-15 12:08:37

by John Robinson

Subject: Re: O_DIRECT to md raid 6 is slow

On 15/08/2012 01:49, Andy Lutomirski wrote:
> If I do:
> # dd if=/dev/zero of=/dev/md0p1 bs=8M
[...]
> It looks like md isn't recognizing that I'm writing whole stripes when
> I'm in O_DIRECT mode.

I see your md device is partitioned. Is the partition itself stripe-aligned?

Cheers,

John.

2012-08-15 17:58:04

by Andy Lutomirski

Subject: Re: O_DIRECT to md raid 6 is slow

On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
<[email protected]> wrote:
> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>
>> If I do:
>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>
> [...]
>
>> It looks like md isn't recognizing that I'm writing whole stripes when
>> I'm in O_DIRECT mode.
>
>
> I see your md device is partitioned. Is the partition itself stripe-aligned?

Crud.

md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
[6/6] [UUUUUU]

IIUC this means that I/O should be aligned on 2MB boundaries (512k
chunk * 4 non-parity disks). gdisk put my partition on a 2048 sector
(i.e. 1MB) boundary.

Sadly, /sys/block/md0/md0p1/alignment_offset reports 0 (instead of 1MB).

Fixing this has no effect, though.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC
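
The alignment arithmetic in this message can be sketched as follows: the full stripe is the chunk size times the number of data disks, and a partition is stripe-aligned only if its byte offset is a multiple of that. The 512-byte sector size is an assumption (the post only says "2048 sector"), and the helper names are made up for illustration.

```python
# Stripe-alignment check for a partition on top of a parity RAID array.
SECTOR_BYTES = 512  # assumption: the "2048 sector" figure uses 512-byte sectors

def full_stripe_bytes(chunk_kib, total_disks, parity_disks):
    """Data bytes in one full stripe: chunk size times the number of data disks."""
    return chunk_kib * 1024 * (total_disks - parity_disks)

def is_stripe_aligned(start_sector, stripe_bytes):
    """True if the partition's byte offset is a multiple of the full stripe size."""
    return (start_sector * SECTOR_BYTES) % stripe_bytes == 0

# 6-drive RAID 6 with 512k chunks, as in the /proc/mdstat excerpt above.
stripe = full_stripe_bytes(chunk_kib=512, total_disks=6, parity_disks=2)
print(stripe // 1024, "KiB per full stripe")
print("start at sector 2048 aligned:", is_stripe_aligned(2048, stripe))
print("start at sector 4096 aligned:", is_stripe_aligned(4096, stripe))
```

With these numbers the 1MB offset chosen by gdisk falls in the middle of a 2MB stripe, while a start at sector 4096 would sit on a stripe boundary.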

2012-08-15 22:00:34

by Stan Hoeppner

Subject: Re: O_DIRECT to md raid 6 is slow

On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
> <[email protected]> wrote:
>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>
>>> If I do:
>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>
>> [...]
>>
>>> It looks like md isn't recognizing that I'm writing whole stripes when
>>> I'm in O_DIRECT mode.
>>
>>
>> I see your md device is partitioned. Is the partition itself stripe-aligned?
>
> Crud.
>
> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
> 11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [6/6] [UUUUUU]
>
> IIUC this means that I/O should be aligned on 2MB boundaries (512k
> chunk * 4 non-parity disks). gdisk put my partition on a 2048 sector
> (i.e. 1MB) boundary.

It's time to blow away the array and start over. You're already
misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
but for a handful of niche all streaming workloads with little/no
rewrite, such as video surveillance or DVR workloads.

Yes, 512KB is the md 1.2 default. And yes, it is insane. Here's why:
Deleting a single file changes only a few bytes of directory metadata.
With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
modify the directory block in question, calculate parity, then write out
3MB of data to rust. So you consume 6MB of bandwidth to write less than
a dozen bytes. With a 12 drive RAID6 that's 12MB of bandwidth to modify
a few bytes of metadata. Yes, insane.

Parity RAID sucks in general because of RMW, but it is orders of
magnitude worse when one chooses to use an insane chunk size to boot,
and especially so with a large drive count.

It seems people tend to use large chunk sizes because array
initialization is a bit faster, and running block x-fer "tests" with dd
buffered sequential reads/writes makes their Levi's expand. Then they
are confused when their actual workloads are horribly slow.

Recreate your array, partition aligned, and manually specify a sane
chunk size of something like 32KB. You'll be much happier with real
workloads.

--
Stan

2012-08-15 22:11:08

by Andy Lutomirski

Subject: Re: O_DIRECT to md raid 6 is slow

On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner <[email protected]> wrote:
> On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
>> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
>> <[email protected]> wrote:
>>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>>
>>>> If I do:
>>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>>
>>> [...]
>>>
>>>> It looks like md isn't recognizing that I'm writing whole stripes when
>>>> I'm in O_DIRECT mode.
>>>
>>>
>>> I see your md device is partitioned. Is the partition itself stripe-aligned?
>>
>> Crud.
>>
>> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
>> 11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
>> [6/6] [UUUUUU]
>>
>> IIUC this means that I/O should be aligned on 2MB boundaries (512k
>> chunk * 4 non-parity disks). gdisk put my partition on a 2048 sector
>> (i.e. 1MB) boundary.
>
> It's time to blow away the array and start over. You're already
> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
> but for a handful of niche all streaming workloads with little/no
> rewrite, such as video surveillance or DVR workloads.
>
> Yes, 512KB is the md 1.2 default. And yes, it is insane. Here's why:
> Deleting a single file changes only a few bytes of directory metadata.
> With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
> modify the directory block in question, calculate parity, then write out
> 3MB of data to rust. So you consume 6MB of bandwidth to write less than
> a dozen bytes. With a 12 drive RAID6 that's 12MB of bandwidth to modify
> a few bytes of metadata. Yes, insane.

Grr. I thought the bad old days of filesystem and related defaults
sucking were over. cryptsetup aligns sanely these days, xfs is
sensible, etc. wtf? <rant>Why is there no sensible filesystem for
huge disks? zfs can't cp --reflink and has all kinds of source
availability and licensing issues, xfs can't dedupe at all, and btrfs
isn't nearly stable enough.</rant>

Anyhow, I'll try the patch from Wu Fengguang. There's still a bug here...

--Andy

2012-08-15 23:07:52

by Miquel van Smoorenburg

Subject: Re: O_DIRECT to md raid 6 is slow

In article <[email protected]> you write:
>It's time to blow away the array and start over. You're already
>misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
>but for a handful of niche all streaming workloads with little/no
>rewrite, such as video surveillance or DVR workloads.
>
>Yes, 512KB is the md 1.2 default. And yes, it is insane. Here's why:
>Deleting a single file changes only a few bytes of directory metadata.
>With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
>modify the directory block in question, calculate parity, then write out
>3MB of data to rust. So you consume 6MB of bandwidth to write less than
>a dozen bytes. With a 12 drive RAID6 that's 12MB of bandwidth to modify
>a few bytes of metadata. Yes, insane.

Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
to read that 4K block, and the corresponding 4K block on the
parity drive, recalculate parity, and write back 4K of data and 4K
of parity. (read|read) modify (write|write). You do not have to
do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.

>Parity RAID sucks in general because of RMW, but it is orders of
>magnitude worse when one chooses to use an insane chunk size to boot,
>and especially so with a large drive count.

If you have a lot of parallel readers (readers >> disks) then
you want chunk sizes of about 2*mean_read_size, so that for each
read you just have 1 seek on 1 disk.

If you have just a few readers (readers <<<< disks) that read
really large blocks then you want a small chunk size to keep
all disks busy.

If you have no readers and just writers and you write large
blocks, then you might want a small chunk size too, so that
you can write data+parity over the stripe in one go, bypassing rmw.

Also, 256K or 512K isn't all that big nowadays, there's not much
latency difference between reading 32K or 512K..

>Recreate your array, partition aligned, and manually specify a sane
>chunk size of something like 32KB. You'll be much happier with real
>workloads.

Aligning is a good idea, and on modern distributions partitions,
LVM lv's etc are generally created with 1MB alignment. But using
a small chunksize like 32K? That depends on the workload, but
in most cases I'd advise against it.

Mike.
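
The (read|read) modify (write|write) cycle described above relies on XOR parity being updatable from just the old data block and the old parity block. A minimal sketch of that identity, assuming plain single-parity XOR as in RAID5 (the second RAID6 syndrome needs Galois-field arithmetic and is not covered here); the helper names and the 4-data-block stripe are illustrative only.

```python
import os
from functools import reduce

def xor_blocks(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def full_parity(data_blocks):
    """Parity computed from scratch over every data block in the stripe."""
    return reduce(xor_blocks, data_blocks)

def rmw_parity(old_parity, old_data, new_data):
    """Shortcut: remove the old block's contribution, then add the new block's."""
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)

BLOCK = 4096
stripe = [os.urandom(BLOCK) for _ in range(4)]  # hypothetical 4 data blocks
parity = full_parity(stripe)

new_block = os.urandom(BLOCK)
shortcut = rmw_parity(parity, stripe[2], new_block)  # touches one data block only
stripe[2] = new_block

print("shortcut parity matches full recomputation:", shortcut == full_parity(stripe))
```

Only the block being rewritten and the parity block are read, regardless of how many other data blocks the stripe has.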

2012-08-15 23:50:44

by Stan Hoeppner

Subject: Re: O_DIRECT to md raid 6 is slow

On 8/15/2012 5:10 PM, Andy Lutomirski wrote:
> On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner <[email protected]> wrote:
>> On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
>>> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
>>> <[email protected]> wrote:
>>>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>>>
>>>>> If I do:
>>>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>>>
>>>> [...]
>>>>
>>>>> It looks like md isn't recognizing that I'm writing whole stripes when
>>>>> I'm in O_DIRECT mode.
>>>>
>>>>
>>>> I see your md device is partitioned. Is the partition itself stripe-aligned?
>>>
>>> Crud.
>>>
>>> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
>>> 11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
>>> [6/6] [UUUUUU]
>>>
>>> IIUC this means that I/O should be aligned on 2MB boundaries (512k
>>> chunk * 4 non-parity disks). gdisk put my partition on a 2048 sector
>>> (i.e. 1MB) boundary.
>>
>> It's time to blow away the array and start over. You're already
>> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
>> but for a handful of niche all streaming workloads with little/no
>> rewrite, such as video surveillance or DVR workloads.
>>
>> Yes, 512KB is the md 1.2 default. And yes, it is insane. Here's why:
>> Deleting a single file changes only a few bytes of directory metadata.
>> With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
>> modify the directory block in question, calculate parity, then write out
>> 3MB of data to rust. So you consume 6MB of bandwidth to write less than
>> a dozen bytes. With a 12 drive RAID6 that's 12MB of bandwidth to modify
>> a few bytes of metadata. Yes, insane.
>
> Grr. I thought the bad old days of filesystem and related defaults
> sucking were over.

The previous md chunk default of 64KB wasn't horribly bad, though still
maybe a bit high for a lot of common workloads. I didn't have eyes/ears
on the discussion and/or testing process that led to the 'new' 512KB
default. Obviously something went horribly wrong here. 512KB isn't a
show stopper as a default for 0/1/10, but is 8-16 times too large for
parity RAID.

> cryptsetup aligns sanely these days, xfs is
> sensible, etc.

XFS won't align with the 512KB chunk default of metadata 1.2. The
largest XFS journal stripe unit (su--chunk) is 256KB, and even that
isn't recommended. Thus mkfs.xfs throws an error due to the 512KB
stripe. See the md and xfs archives for more details, specifically Dave
Chinner's colorful comments on the md 512KB default.

> wtf? <rant>Why is there no sensible filesystem for
> huge disks? zfs can't cp --reflink and has all kinds of source
> availability and licensing issues, xfs can't dedupe at all, and btrfs
> isn't nearly stable enough.</rant>

Deduplication isn't a responsibility of a filesystem. TTBOMK there are
two, and only two, COW filesystems in existence: ZFS and BTRFS. And
these are the only two to offer a native dedupe capability. They did it
because they could, with COW, not necessarily because they *should*.
There are dozens of other single node, cluster, and distributed
filesystems in use today and none of them support COW, and thus none
support dedup. So to *expect* a 'sensible' filesystem to include dedupe
is wishful thinking at best.

> Anyhow, I'll try the patch from Wu Fengguang. There's still a bug here...

Always one somewhere.

--
Stan

2012-08-16 01:09:09

by Andy Lutomirski

Subject: Re: O_DIRECT to md raid 6 is slow

On Wed, Aug 15, 2012 at 4:50 PM, Stan Hoeppner <[email protected]> wrote:
> On 8/15/2012 5:10 PM, Andy Lutomirski wrote:
>> On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner <[email protected]> wrote:
>>> On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
>>>> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
>>>> <[email protected]> wrote:
>>>>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>>>>
>>>>>> If I do:
>>>>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>>>>
>>>>> [...]
>>
>> Grr. I thought the bad old days of filesystem and related defaults
>> sucking were over.
>
> The previous md chunk default of 64KB wasn't horribly bad, though still
> maybe a bit high for a lot of common workloads. I didn't have eyes/ears
> on the discussion and/or testing process that led to the 'new' 512KB
> default. Obviously something went horribly wrong here. 512KB isn't a
> show stopper as a default for 0/1/10, but is 8-16 times too large for
> parity RAID.
>
>> cryptsetup aligns sanely these days, xfs is
>> sensible, etc.
>
> XFS won't align with the 512KB chunk default of metadata 1.2. The
> largest XFS journal stripe unit (su--chunk) is 256KB, and even that
> isn't recommended. Thus mkfs.xfs throws an error due to the 512KB
> stripe. See the md and xfs archives for more details, specifically Dave
> Chinner's colorful comments on the md 512KB default.

Heh -- that's why the math didn't make any sense :)

>
>> wtf? <rant>Why is there no sensible filesystem for
>> huge disks? zfs can't cp --reflink and has all kinds of source
>> availability and licensing issues, xfs can't dedupe at all, and btrfs
>> isn't nearly stable enough.</rant>
>
> Deduplication isn't a responsibility of a filesystem. TTBOMK there are
> two, and only two, COW filesystems in existence: ZFS and BTRFS. And
> these are the only two to offer a native dedupe capability. They did it
> because they could, with COW, not necessarily because they *should*.
> There are dozens of other single node, cluster, and distributed
> filesystems in use today and none of them support COW, and thus none
> support dedup. So to *expect* a 'sensible' filesystem to include dedupe
> is wishful thinking at best.

I should clarify my rant for the record. I don't care about in-fs
dedupe. I want COW so userspace can dedupe and generally replace
hardlinks with sensible cowlinks. I'm also working on some fun tools
that *require* reflinks for anything resembling decent performance.

--Andy

2012-08-16 06:47:53

by Roman Mamedov

Subject: Re: O_DIRECT to md raid 6 is slow

On Wed, 15 Aug 2012 18:50:44 -0500
Stan Hoeppner <[email protected]> wrote:

> TTBOMK there are two, and only two, COW filesystems in existence: ZFS and BTRFS.

There is also NILFS2: http://www.nilfs.org/en/
And in general, any https://en.wikipedia.org/wiki/Log-structured_file_system
is COW by design, but afaik of those only NILFS is also in the mainline Linux
kernel AND is not aimed just for some niche like flash-based devices, but for
general-purpose usage.

--
With respect,
Roman

~~~~~~~~~~~~~~~~~~~~~~~~~~~
"Stallman had a printer,
with code he could not see.
So he began to tinker,
and set the software free."



2012-08-16 11:05:30

by Stan Hoeppner

Subject: Re: O_DIRECT to md raid 6 is slow

On 8/15/2012 6:07 PM, Miquel van Smoorenburg wrote:
> In article <[email protected]> you write:
>> It's time to blow away the array and start over. You're already
>> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
>> but for a handful of niche all streaming workloads with little/no
>> rewrite, such as video surveillance or DVR workloads.
>>
>> Yes, 512KB is the md 1.2 default. And yes, it is insane. Here's why:
>> Deleting a single file changes only a few bytes of directory metadata.
>> With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
>> modify the directory block in question, calculate parity, then write out
>> 3MB of data to rust. So you consume 6MB of bandwidth to write less than
>> a dozen bytes. With a 12 drive RAID6 that's 12MB of bandwidth to modify
>> a few bytes of metadata. Yes, insane.
>
> Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
> to read that 4K block, and the corresponding 4K block on the
> parity drive, recalculate parity, and write back 4K of data and 4K
> of parity. (read|read) modify (write|write). You do not have to
> do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.

See: http://www.spinics.net/lists/xfs/msg12627.html

Dave usually knows what he's talking about, and I didn't see Neil nor
anyone else correcting him on his description of md RMW behavior. What
I stated above is pretty much exactly what Dave stated, but for the fact
I got the RMW read bytes wrong--should be 2MB/3MB for a 6 drive md/RAID6
and 5MB/6MB for 12 drives.

>> Parity RAID sucks in general because of RMW, but it is orders of
>> magnitude worse when one chooses to use an insane chunk size to boot,
>> and especially so with a large drive count.
[snip]
> Also, 256K or 512K isn't all that big nowadays, there's not much
> latency difference between reading 32K or 512K..

You're forgetting 3 very important things:

1. All filesystems have metadata
2. All (worth using) filesystems have a metadata journal
3. All workloads include some, if not major, metadata operations

When writing journal and directory metadata there is a huge difference
between a 32KB and 512KB chunk especially as the drive count in the
array increases. Rarely does a filesystem pack enough journal
operations into a single writeout to fill a 512KB stripe, let alone a
4MB stripe. With a 32KB chunk you see full stripe width journal writes
frequently, minimizing the number of RMW writes to the journal, even up
to 16 data spindle parity arrays (18 drive RAID6). Using a 512KB chunk
will cause most journal writes to be partial stripe writes, triggering
RMW for most journal writes. The same is true for directory metadata
writes.

Everyone knows that parity RAID sucks for anything but purely streaming
workloads with little metadata. With most/all other workloads, using a
large chunk size, such as the md metadata 1.2 default of 512KB, with
parity RAID, simply makes it much worse, whether the RMW cycle affects
all disks or just one data disk and one parity disk.

>> Recreate your array, partition aligned, and manually specify a sane
>> chunk size of something like 32KB. You'll be much happier with real
>> workloads.
>
> Aligning is a good idea,

Understatement of the century. Just as critical, if not more so, FS
stripe alignment is mandatory with parity RAID lest full stripe writeout
can/will trigger RMW.

> and on modern distributions partitions,
> LVM lv's etc are generally created with 1MB alignment. But using
> a small chunksize like 32K? That depends on the workload, but
> in most cases I'd advise against it.

People should ignore your advice in this regard. A small chunk size is
optimal for nearly all workloads on a parity array for the reasons I
stated above. It's the large chunk that is extremely workload
dependent, as again, it only fits well with low metadata streaming
workloads.

--
Stan
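
The full-stripe argument above comes down to how a write of a given size splits into whole stripes and a partial remainder. A small sketch of that split follows; the 256 KiB journal write is a hypothetical figure chosen for illustration, and the function names are not from any tool.

```python
def stripe_kib(chunk_kib, data_disks):
    """Full stripe width in KiB: chunk size times the number of data spindles."""
    return chunk_kib * data_disks

def split_write(offset_kib, length_kib, stripe):
    """Return (full_stripes, partial_kib) covered by a write at the given offset."""
    end = offset_kib + length_kib
    first_full = -(-offset_kib // stripe) * stripe  # round offset up to a stripe boundary
    last_full = (end // stripe) * stripe            # round end down to a stripe boundary
    if last_full <= first_full:
        return 0, length_kib                        # no whole stripe is covered
    full = (last_full - first_full) // stripe
    partial = (first_full - offset_kib) + (end - last_full)
    return full, partial

# Stripe widths mentioned in the message.
print(stripe_kib(32, 16), "KiB: 32KB chunk, 16 data spindles")
print(stripe_kib(512, 8), "KiB: 512KB chunk, 8 data spindles")

# Hypothetical 256 KiB journal write at offset 0 on a 4-data-disk array.
for chunk in (32, 512):
    full, partial = split_write(0, 256, stripe_kib(chunk, 4))
    print(f"{chunk}KB chunk: {full} full stripes, {partial} KiB partial")
```

Under these assumptions the same write covers two whole stripes with a 32KB chunk and none with a 512KB chunk, so only the latter would need a read before parity can be computed.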

2012-08-16 22:03:06

by Miquel van Smoorenburg

Subject: Re: O_DIRECT to md raid 6 is slow

On 16-08-12 1:05 PM, Stan Hoeppner wrote:
> On 8/15/2012 6:07 PM, Miquel van Smoorenburg wrote:
>> Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
>> to read that 4K block, and the corresponding 4K block on the
>> parity drive, recalculate parity, and write back 4K of data and 4K
>> of parity. (read|read) modify (write|write). You do not have to
>> do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.
>
> See: http://www.spinics.net/lists/xfs/msg12627.html
>
> Dave usually knows what he's talking about, and I didn't see Neil nor
> anyone else correcting him on his description of md RMW behavior.

Well he's wrong, or you're interpreting it incorrectly.

I did a simple test:

* created a 1G partition on 3 separate disks
* created a md raid5 array with 512K chunksize:
mdadm -C /dev/md0 -l 5 -c $((1024*512)) -n 3 /dev/sdb1 /dev/sdc1 /dev/sdd1
* ran disk monitoring using 'iostat -k 5 /dev/sdb1 /dev/sdc1 /dev/sdd1'
* wrote a single 4K block:
dd if=/dev/zero bs=4K count=1 oflag=direct seek=30 of=/dev/md0

Output from iostat over the period in which the 4K write was done. Look
at kB read and kB written:

Device:          tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb1            0.60         0.00         1.60          0          8
sdc1            0.60         0.80         0.80          4          4
sdd1            0.60         0.00         1.60          0          8

As you can see, a single 4K read, and a few writes. You see a few blocks
more written than you'd expect because the superblock is updated too.

Mike.

2012-08-17 07:31:38

by Stan Hoeppner

Subject: Re: O_DIRECT to md raid 6 is slow

On 8/16/2012 4:50 PM, Miquel van Smoorenburg wrote:
> On 16-08-12 1:05 PM, Stan Hoeppner wrote:
>> On 8/15/2012 6:07 PM, Miquel van Smoorenburg wrote:
>>> Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
>>> to read that 4K block, and the corresponding 4K block on the
>>> parity drive, recalculate parity, and write back 4K of data and 4K
>>> of parity. (read|read) modify (write|write). You do not have to
>>> do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.
>>
>> See: http://www.spinics.net/lists/xfs/msg12627.html
>>
>> Dave usually knows what he's talking about, and I didn't see Neil nor
>> anyone else correcting him on his description of md RMW behavior.
>
> Well he's wrong, or you're interpreting it incorrectly.
>
> I did a simple test:
>
> * created a 1G partition on 3 separate disks
> * created a md raid5 array with 512K chunksize:
> mdadm -C /dev/md0 -l 5 -c $((1024*512)) -n 3 /dev/sdb1 /dev/sdc1
> /dev/sdd1
> * ran disk monitoring using 'iostat -k 5 /dev/sdb1 /dev/sdc1 /dev/sdd1'
> * wrote a single 4K block:
> dd if=/dev/zero bs=4K count=1 oflag=direct seek=30 of=/dev/md0
>
> Output from iostat over the period in which the 4K write was done. Look
> at kB read and kB written:
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sdb1 0.60 0.00 1.60 0 8
> sdc1 0.60 0.80 0.80 4 4
> sdd1 0.60 0.00 1.60 0 8
>
> As you can see, a single 4K read, and a few writes. You see a few blocks
> more written than you'd expect because the superblock is updated too.

I'm no dd expert, but this looks like you're simply writing a 4KB block
to a new stripe, using an offset, but not to an existing stripe, as the
array is in a virgin state. So it doesn't appear this test is going to
trigger RMW. Don't you now need to do another write in the same
stripe to trigger RMW? Maybe I'm just reading this wrong.

--
Stan

2012-08-17 11:16:45

by Miquel van Smoorenburg

Subject: Re: O_DIRECT to md raid 6 is slow

On 08/17/2012 09:31 AM, Stan Hoeppner wrote:
> On 8/16/2012 4:50 PM, Miquel van Smoorenburg wrote:
>> I did a simple test:
>>
>> * created a 1G partition on 3 separate disks
>> * created a md raid5 array with 512K chunksize:
>> mdadm -C /dev/md0 -l 5 -c $((1024*512)) -n 3 /dev/sdb1 /dev/sdc1
>> /dev/sdd1
>> * ran disk monitoring using 'iostat -k 5 /dev/sdb1 /dev/sdc1 /dev/sdd1'
>> * wrote a single 4K block:
>> dd if=/dev/zero bs=4K count=1 oflag=direct seek=30 of=/dev/md0
>>
>> Output from iostat over the period in which the 4K write was done. Look
>> at kB read and kB written:
>>
>> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
>> sdb1 0.60 0.00 1.60 0 8
>> sdc1 0.60 0.80 0.80 4 4
>> sdd1 0.60 0.00 1.60 0 8
>>
>> As you can see, a single 4K read, and a few writes. You see a few blocks
>> more written than you'd expect because the superblock is updated too.
>
> I'm no dd expert, but this looks like you're simply writing a 4KB block
> to a new stripe, using an offset, but not to an existing stripe, as the
> array is in a virgin state. So it doesn't appear this test is going to
> trigger RMW. Don't you now need to do another write in the same
> stripe to trigger RMW? Maybe I'm just reading this wrong.

That shouldn't matter, but that is easily checked, of course, by writing
some random data first, then doing the dd 4K write, also with
random data, somewhere in the same area:

# dd if=/dev/urandom bs=1M count=3 of=/dev/md0
3+0 records in
3+0 records out
3145728 bytes (3.1 MB) copied, 0.794494 s, 4.0 MB/s

Now the first 6 chunks are filled with random data, so let's write 4K
somewhere in there:

# dd if=/dev/urandom bs=4k count=1 seek=25 of=/dev/md0
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.10149 s, 40.4 kB/s

Output from iostat over the period in which the 4K write was done:

Device:          tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb1            0.60         0.00         1.60          0          8
sdc1            0.60         0.80         0.80          4          4
sdd1            0.60         0.00         1.60          0          8

Mike.

2012-08-19 23:34:32

by Stan Hoeppner

Subject: Re: O_DIRECT to md raid 6 is slow

On 8/19/2012 9:01 AM, David Brown wrote:
> I'm sort of jumping in to this thread, so my apologies if I repeat
> things other people have said already.

I'm glad you jumped in David. You made a critical statement of fact
below which clears some things up. If you had stated it early on,
before Miquel stole the thread and moved it to LKML proper, it would
have short circuited a lot of this discussion. Which is:

> AFAIK, there is scope for a few performance optimisations in raid6. One
> is that for small writes which only need to change one block, raid5 uses
> a "short-cut" RMW cycle (read the old data block, read the old parity
> block, calculate the new parity block, write the new data and parity
> blocks). A similar short-cut could be implemented in raid6, though it
> is not clear how much a difference it would really make.

Thus my original statement was correct, or at least half correct[1], as
it pertained to md/RAID6. Then Miquel switched the discussion to
md/RAID5 and stated I was all wet. I wasn't, and neither was Dave
Chinner. I was simply unaware of this md/RAID5 single block write RMW
shortcut. I'm copying lkml proper on this simply to set the record
straight. Not that anyone was paying attention, but it needs to be in
the same thread in the archives. The takeaway:

md/RAID6 must read all devices in a RMW cycle.

md/RAID5 takes a shortcut for single block writes, and must only read
one drive for the RMW cycle.

[1] The only thing that's not clear at this point is if md/RAID6 also
always writes back all chunks during RMW, or only the chunk that has
changed.

--
Stan

2012-08-20 00:02:08

by NeilBrown

[permalink] [raw]
Subject: Re: O_DIRECT to md raid 6 is slow

On Sun, 19 Aug 2012 18:34:28 -0500 Stan Hoeppner <[email protected]>
wrote:

> On 8/19/2012 9:01 AM, David Brown wrote:
> > I'm sort of jumping in to this thread, so my apologies if I repeat
> > things other people have said already.
>
> I'm glad you jumped in David. You made a critical statement of fact
> below which clears some things up. If you had stated it early on,
> before Miquel stole the thread and moved it to LKML proper, it would
> have short circuited a lot of this discussion. Which is:
>
> > AFAIK, there is scope for a few performance optimisations in raid6. One
> > is that for small writes which only need to change one block, raid5 uses
> > a "short-cut" RMW cycle (read the old data block, read the old parity
> > block, calculate the new parity block, write the new data and parity
> > blocks). A similar short-cut could be implemented in raid6, though it
> > is not clear how much a difference it would really make.
>
> Thus my original statement was correct, or at least half correct[1], as
> it pertained to md/RAID6. Then Miquel switched the discussion to
> md/RAID5 and stated I was all wet. I wasn't, and neither was Dave
> Chinner. I was simply unaware of this md/RAID5 single block write RMW
> shortcut. I'm copying lkml proper on this simply to set the record
> straight. Not that anyone was paying attention, but it needs to be in
> the same thread in the archives. The takeaway:
>

Since we are trying to set the record straight....

> md/RAID6 must read all devices in a RMW cycle.

md/RAID6 must read all data devices (i.e. not parity devices) which it is not
going to write to, in an RMW cycle (which the code actually calls RCW -
reconstruct-write).

>
> md/RAID5 takes a shortcut for single block writes, and must only read
> one drive for the RMW cycle.

md/RAID5 uses an alternate mechanism when the number of data blocks that need
to be written is less than half the number of data blocks in a stripe. In
this alternate mechanism (which the code calls RMW - read-modify-write),
md/RAID5 reads all the blocks that it is about to write to, plus the parity
block. It then computes the new parity and writes it out along with the new
data.
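
The two strategies can be compared with a small sketch. This is not the md code: it assumes plain XOR parity over one byte string per block, the helper names are made up, and choose() simply encodes the "less than half" rule as stated above.

```python
# Sketch of the two md/RAID5 write strategies described above.
# Assumptions: one byte string per block, XOR parity, hypothetical helpers.
from functools import reduce


def xor_blocks(*blocks):
    """XOR equal-length byte strings together."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def rcw(stripe, new):
    """Reconstruct-write: read the data blocks NOT being written."""
    reads = [i for i in range(len(stripe)) if i not in new]
    data = [new.get(i, stripe[i]) for i in range(len(stripe))]
    return xor_blocks(*data), len(reads)


def rmw(stripe, parity, new):
    """Read-modify-write: read the blocks being written, plus parity."""
    old = [stripe[i] for i in new]
    new_parity = xor_blocks(parity, *old, *new.values())
    return new_parity, len(new) + 1


def choose(n_data, n_new):
    """Rule as stated in the thread: RMW if fewer than half are written."""
    return "rmw" if n_new < n_data / 2 else "rcw"


stripe = [bytes([i] * 4) for i in (1, 2, 3, 4, 5)]   # 5 data blocks
parity = xor_blocks(*stripe)
new = {2: b"\xaa" * 4}                                # overwrite one block

p_rcw, reads_rcw = rcw(stripe, new)
p_rmw, reads_rmw = rmw(stripe, parity, new)
assert p_rcw == p_rmw          # both strategies give the same parity
print(choose(len(stripe), len(new)), reads_rmw, reads_rcw)   # rmw 2 4
```

With five data blocks and one block to write, RMW needs two reads (old data plus old parity) where RCW needs four.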

>
> [1] The only thing that's not clear at this point is if md/RAID6 also
> always writes back all chunks during RMW, or only the chunk that has
> changed.

Do you seriously imagine anyone would write code to write out data which it
is known has not changed? Sad. :-)

NeilBrown



2012-08-20 08:21:23

by David Brown

[permalink] [raw]
Subject: Re: O_DIRECT to md raid 6 is slow

On 20/08/2012 02:01, NeilBrown wrote:
> On Sun, 19 Aug 2012 18:34:28 -0500 Stan Hoeppner <[email protected]>
> wrote:
>
>
> Since we are trying to set the record straight....
>
>> md/RAID6 must read all devices in a RMW cycle.
>
> md/RAID6 must read all data devices (i.e. not parity devices) which it is not
> going to write to, in an RMW cycle (which the code actually calls RCW -
> reconstruct-write).
>
>>
>> md/RAID5 takes a shortcut for single block writes, and must only read
>> one drive for the RMW cycle.
>
> md/RAID5 uses an alternate mechanism when the number of data blocks that need
> to be written is less than half the number of data blocks in a stripe. In
> this alternate mechanism (which the code calls RMW - read-modify-write),
> md/RAID5 reads all the blocks that it is about to write to, plus the parity
> block. It then computes the new parity and writes it out along with the new
> data.
>

I've learned something here too - I thought this mechanism was only used
for a single block write. Thanks for the correction, Neil.

If you (or anyone else) are ever interested in implementing the same
thing in raid6, the maths is not actually too bad (now that I've thought
about it). (I understand the theory here, but I'm afraid I don't have
the experience with kernel programming to do the implementation.)

To change a few data blocks, you need to read in the old data blocks
(Da, Db, etc.) and the old parities (P, Q).

Calculate the xor differences Xa = Da + D'a, Xb = Db + D'b, etc.

The new P parity is P' = P + Xa + Xb +...

The new Q parity is Q' = Q + (g^a).Xa + (g^b).Xb + ...
The power series there is just the normal raid6 Q-parity calculation
with most entries set to 0, and the Xa, Xb, etc. in the appropriate spots.

If the raid6 Q-parity function already has short-cuts for handling zero
entries (I haven't looked, but the mechanism might be in place to
slightly speed up dual-failure recovery), then all the blocks are in place.
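
A minimal sketch of that update, checked against a full recomputation, might look like the following. It assumes GF(2^8) with the polynomial 0x11d and generator g = 2, the field conventionally used for RAID6 Q parity, and it updates Q starting from the old Q. All helper names are made up, and this is plain Python for illustration, not kernel code.

```python
# Sketch of a RAID6 read-modify-write parity update.
# Assumptions: GF(2^8) with polynomial 0x11d and generator g = 2;
# helper names are hypothetical.


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return result


def gf_pow2(n):
    """Return g^n for g = 2."""
    value = 1
    for _ in range(n):
        value = gf_mul(value, 2)
    return value


def full_pq(data):
    """Compute P and Q from scratch over all data blocks."""
    p = bytearray(len(data[0]))
    q = bytearray(len(data[0]))
    for idx, block in enumerate(data):
        coeff = gf_pow2(idx)
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)


def delta_pq(p, q, changes):
    """Update P and Q from {index: (old_block, new_block)} only."""
    p, q = bytearray(p), bytearray(q)
    for idx, (old, new) in changes.items():
        coeff = gf_pow2(idx)
        for j, (o, n) in enumerate(zip(old, new)):
            x = o ^ n                    # xor difference of old and new data
            p[j] ^= x                    # fold the difference into P
            q[j] ^= gf_mul(coeff, x)     # fold g^idx times the difference into Q
    return bytes(p), bytes(q)


data = [bytes([17 * i + k for k in range(4)]) for i in range(6)]
p, q = full_pq(data)
new_block = b"\xde\xad\xbe\xef"
p2, q2 = delta_pq(p, q, {3: (data[3], new_block)})
data[3] = new_block
assert (p2, q2) == full_pq(data)   # shortcut matches full recomputation
```

Only the changed data block and the two old parities are read here; the other five data blocks are never touched by delta_pq.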

2012-08-21 14:51:57

by Miquel van Smoorenburg

[permalink] [raw]
Subject: Re: O_DIRECT to md raid 6 is slow

On 08/20/2012 01:34 AM, Stan Hoeppner wrote:
> I'm glad you jumped in David. You made a critical statement of fact
> below which clears some things up. If you had stated it early on,
> before Miquel stole the thread and moved it to LKML proper, it would
> have short circuited a lot of this discussion. Which is:

I'm sorry about that, that's because of the software that I use to
follow most mailing lists. I didn't notice that the discussion was cc'ed
to both lkml and l-r. I should fix that.

> Thus my original statement was correct, or at least half correct[1], as
> it pertained to md/RAID6. Then Miquel switched the discussion to
> md/RAID5 and stated I was all wet. I wasn't, and neither was Dave
> Chinner. I was simply unaware of this md/RAID5 single block write RMW
> shortcut

Well, all I tried to say is that a small write of, say, 4K, to a
raid5/raid6 array does not need to re-write the whole stripe (i.e.
chunksize * nr_disks) but just 4K * nr_disks, or the RMW variant of that.

Mike.

2012-08-22 04:00:30

by Stan Hoeppner

[permalink] [raw]
Subject: Re: O_DIRECT to md raid 6 is slow

On 8/21/2012 9:51 AM, Miquel van Smoorenburg wrote:
> On 08/20/2012 01:34 AM, Stan Hoeppner wrote:
>> I'm glad you jumped in David. You made a critical statement of fact
>> below which clears some things up. If you had stated it early on,
>> before Miquel stole the thread and moved it to LKML proper, it would
>> have short circuited a lot of this discussion. Which is:
>
> I'm sorry about that, that's because of the software that I use to
> follow most mailinglist. I didn't notice that the discussion was cc'ed
> to both lkml and l-r. I should fix that.

Oh, my bad. I thought it was intentional.

Don't feel too bad about it. When I tried to copy lkml back in on the
one message I screwed up as well. I thought Tbird had filled in the full
address but it didn't.

>> Thus my original statement was correct, or at least half correct[1], as
>> it pertained to md/RAID6. Then Miquel switched the discussion to
>> md/RAID5 and stated I was all wet. I wasn't, and neither was Dave
>> Chinner. I was simply unaware of this md/RAID5 single block write RMW
>> shortcut
>
> Well, all I tried to say is that a small write of, say, 4K, to a
> raid5/raid6 array does not need to re-write the whole stripe (i.e.
> chunksize * nr_disks) but just 4K * nr_disks, or the RMW variant of that.

And I'm glad you did. Before that I didn't know about these efficiency
shortcuts and exactly how md does writeback on partial stripe updates.

Even with these optimizations, a default 512KB chunk is too big, for the
reasons I stated, the big one being the fact that you'll rarely fill a
full stripe, meaning nearly every write will incur an RMW cycle.
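
To put rough numbers on the chunk-size argument, here is a small sketch that counts how many stripes a write covers completely and how many only partially. The helper name and the geometries (4 data disks, 512 KiB and 64 KiB chunks) are assumptions picked for illustration, not measurements from this thread.

```python
# Does a write cover whole stripes, or only part of one?
# Assumptions: hypothetical helper; chunk sizes and disk counts are examples.


def stripe_coverage(offset, length, chunk_size, data_disks):
    """Return (full_stripes, partial_stripes) touched by a write."""
    stripe_size = chunk_size * data_disks
    end = offset + length
    first = offset // stripe_size
    last = (end - 1) // stripe_size
    touched = last - first + 1
    first_full = -(-offset // stripe_size)      # ceiling division
    last_full = end // stripe_size
    full = max(0, last_full - first_full)
    return full, touched - full


KIB = 1024
# 4 data disks with 512 KiB chunks: a full stripe is 2 MiB of data.
print(stripe_coverage(0, 4 * KIB, 512 * KIB, 4))       # (0, 1)
print(stripe_coverage(0, 1024 * KIB, 512 * KIB, 4))    # (0, 1)
print(stripe_coverage(0, 2048 * KIB, 512 * KIB, 4))    # (1, 0)
print(stripe_coverage(0, 2048 * KIB, 64 * KIB, 4))     # (8, 0)
```

In this example a 1 MiB write never fills a stripe with 512 KiB chunks, while the same geometry with 64 KiB chunks turns a 2 MiB write into eight full stripes.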

--
Stan