2011-02-23 04:36:42

by L A Walsh

Subject: write 'O_DIRECT' file w/odd amount of data: desirable result?




I understand, somewhat, what is happening.
I have two different utils, 'dd' and 'mbuffer', both of
which have a 'direct' option to write to disk.
mbuffer came from my distro, with a '--direct' option added.

I'm not sure if it's truncating the write to the
lower bound of the sector size or the file-allocation-unit size,
but from a dump piped into {cat, dd, mbuffer}, the
output sizes are:

file            size          delta
--------------  ------------  ----------
dumptest.cat    5776419696
dumptest.dd     5776343040    76656
dumptest.mbuff  5368709120    407710576

params:

dd of=dumptest.dd bs=512M oflag=direct
mbuffer -b 5 -s 512m --direct -f -o dumptest.mbuff

original file size MOD 512M = 407710576 (the amount 'mbuffer' came up short).

The disk it is being written to is a RAID with a span
size of 640k (64k io * 10 data disks), formatted with 'xfs'
to indicate that (stripe-unit=64k, stripe-width=10).

This gives a 'coincidental' (??) interpretation for
the output from 'dd', where the original file size MOD
640K = 76656 (the amount 'dd' is short).

Was that a coincidence, or is it deterministic?
And why didn't 'mbuffer' have the same shortfall -- its loss was
related only to its 512M buffer size.

In any event, shouldn't the kernel yield the correct answer
in either case? That would be consistent with the processor Linux
was natively developed on, the x86, where a misaligned memory
access doesn't cause a fault at the user level, but is handled
correctly, with a slight speed penalty for the unaligned
data parts.

Shouldn't the linux kernel behave similarly?

Note that the mbuffer program did indicate an error (which didn't
help the 'dump' program, which had already exited with what it
thought was 'success'), though a bit cryptic:
buffer: error: outputThread: error writing to dumptest.mbuff at offset
0x140000000: Invalid argument

summary: 5509 MByte in 8.4 sec - average of 658 MB/s
mbuffer: warning: error during output to dumptest.mbuff: Invalid argument

dd indicated no warning or error.
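
For reference, the failure is easy to reproduce outside either tool.
A minimal C sketch (the file name and the 512-byte alignment here are
assumptions; the alignment actually required depends on the device):

#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("dumptest.direct", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 512, 4096)) return 1;  /* aligned buffer */
    memset(buf, 0, 4096);

    if (write(fd, buf, 4096) != 4096)   /* sector-multiple length: ok */
        perror("aligned write");
    if (write(fd, buf, 513) < 0)        /* odd length: fails */
        perror("odd-length write");     /* -> "Invalid argument" */

    free(buf);
    close(fd);
    return 0;
}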

----
I'm not sure what either program did internally, but no doubt neither
expected an error on the final write, and neither handled the result
properly.

However, wouldn't it be a good thing for linux to do 'the right thing'
and successfully complete the last partial write (whichever case it
is!), even if it has to be internally buffered and slightly slowed?
It seems correctness of the function should be given preference over
adherence to some limitation where possible.
Software should be forgiving and tolerant and 'err' to the side of
least harm -- which I'd argue is getting the data to the disk, NOT
generating some 'abnormal end' (ABEND) condition that the software
can't handle.

I'd think of it like a page-fault on a record not in memory. The
remainder of the I/O record is a 'zero-filled' buffer that pads out
the rest of the sector, while the file size is set to the size
actually written. ??
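
That zero-fill idea can already be sketched from userspace. Assuming
a 512-byte sector, a sector-aligned starting offset, and 'write_tail'
as a made-up helper name: pad the odd final write up to a sector
multiple, then trim the file back to the true length:

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SECTOR 512   /* assumed sector size */

/* Write an odd-sized tail through an O_DIRECT fd by zero-padding it
 * to a sector multiple, then truncating back to the true length.
 * 'offset' must itself be sector-aligned. */
ssize_t write_tail(int fd, const char *data, size_t len, off_t offset)
{
    size_t padded = (len + SECTOR - 1) / SECTOR * SECTOR;
    void *buf;
    if (posix_memalign(&buf, SECTOR, padded))
        return -1;
    memset(buf, 0, padded);              /* zero-fill the remainder */
    memcpy(buf, data, len);

    ssize_t n = pwrite(fd, buf, padded, offset);  /* aligned, so allowed */
    free(buf);
    if (n < 0)
        return -1;
    return ftruncate(fd, offset + len) ? -1 : (ssize_t) len;  /* drop padding */
}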

Vanilla kernel 2.6.35-7 x86_64 (SMP PREEMPT)








2011-02-23 10:38:33

by Pádraig Brady

Subject: Re: write 'O_DIRECT' file w/odd amount of data: desirable result?

On 23/02/11 04:30, Linda Walsh wrote:
> I have two different utils, 'dd' and 'mbuffer', both of
> which have a 'direct' option to write to disk.
....
> but from a dump piped into {cat, dd, mbuffer}, the
> output sizes are:
....
> dd of=dumptest.dd bs=512M oflag=direct
> mbuffer -b 5 -s 512m --direct -f -o dumptest.mbuff
....
> dd indicated no warning or error.

Note dd will turn off O_DIRECT for the last write
if it's less than the block size.
http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=5929322c
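
That is, roughly this technique (a sketch, not the verbatim coreutils
code; Linux allows toggling O_DIRECT through F_SETFL):

#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <unistd.h>

/* Drop O_DIRECT before an odd-sized final write so a normal
 * buffered write can complete it. */
ssize_t final_write(int fd, const void *buf, size_t len)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags >= 0 && (flags & O_DIRECT))
        fcntl(fd, F_SETFL, flags & ~O_DIRECT);
    return write(fd, buf, len);
}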

Note also you mentioned that you piped from dump to dd.
For dd reading from a pipe I strongly suggest you specify iflag=fullblock
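
The reason: read() from a pipe can return fewer bytes than requested,
so without fullblock, dd may write odd-sized blocks mid-stream. A
sketch of the accumulation loop that fullblock implies ('full_read'
is a made-up name):

#include <unistd.h>

/* Keep reading until the buffer is full or EOF; only the final
 * block of the stream can then be short. */
ssize_t full_read(int fd, char *buf, size_t count)
{
    size_t total = 0;
    while (total < count) {
        ssize_t n = read(fd, buf + total, count - total);
        if (n < 0)
            return -1;    /* read error */
        if (n == 0)
            break;        /* EOF */
        total += n;
    }
    return total;
}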

If there is still an issue, it seems from your figures that the kernel
is throwing away data and not indicating this through the last
non-O_DIRECT write().

cheers,
Pádraig.

2011-02-24 01:18:39

by Pádraig Brady

Subject: Re: write 'O_DIRECT' file w/odd amount of data: desirable result?

On 23/02/11 18:04, Linda A. Walsh wrote:
> I tried using 'iflag=fullblock' as you recommended, and it made the
> problem 'consistent' with the output of 'mbuffer', i.e. it transferred
> less data and the truncation was consistent with a 512M divisor,
> indicating it was 'cat's default output record size that was causing
> the difference.

Right. That's expected, as with 'fullblock' both mbuffer and dd
will read/write 512M at a time. Both will fail in the same
way when they try to write the odd-sized chunk at the end.
This was only changed for dd in coreutils 7.5
(where it reverts to a standard write for the last chunk).

> I've tried significantly shorter files and NOT had this problem
> (record size=64k, and 2 files, one at 64k+57k). Both copied
> fine.
> Something to do with large file buffers.

Small blocks cause an issue on ext[34] at least.
I modified dd here to behave like yours and got:
$ truncate -s513 small
$ dd oflag=direct if=small of=small.out
./dd: writing `small.out': Invalid argument

> Of *SIGNIFICANT* note. In trying to create an empty file of the size
> used, from scratch, using 'xfs_mkfile', I got an error:
>
>> xfs_mkfile 5776419696 testfile
> pwrite64: Invalid argument

Looks like that uses the same O_DIRECT write
method, with the same issues?
You could try fallocate(1), which is newly available
in util-linux and might be supported by your xfs.
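
At the syscall level that is roughly (a sketch using fallocate(2),
which fallocate(1) wraps; needs kernel and filesystem support, and
posix_fallocate() is the portable fallback):

#define _GNU_SOURCE          /* for fallocate() */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    /* allocate the full size without pushing zeros through O_DIRECT */
    if (fallocate(fd, 0, 0, 5776419696LL) < 0)
        perror("fallocate");
    close(fd);
    return 0;
}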

cheers,
Pádraig.

p.s. If dd were written today, it would default to using fullblock.
For backwards and POSIX compat, though, we must keep
the current default behavior.

p.p.s. There are situations where fullblock is required,
and I'll patch dd soon to auto-apply that option when appropriate.
[io]flag=direct is one of those cases, I think.

p.p.p.s. coreutils 8.11 should have the oflag=nocache option,
which will write to disk without using up your page cache,
while also avoiding the O_DIRECT constraints.
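
Roughly, that writes normally and then hints the kernel to drop the
cached pages (a simplified sketch; 'write_nocache' is just an
illustrative name, and a real version would loop on short writes):

#include <fcntl.h>
#include <unistd.h>

/* Buffered write, then drop the written pages from the page cache.
 * No O_DIRECT alignment constraints apply. */
int write_nocache(int fd, const void *buf, size_t len)
{
    if (write(fd, buf, len) != (ssize_t) len)
        return -1;
    fdatasync(fd);   /* pages must be clean before they can be dropped */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}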

2011-02-24 09:26:31

by Dave Chinner

Subject: Re: write 'O_DIRECT' file w/odd amount of data: desirable result?

On Wed, Feb 23, 2011 at 10:04:30AM -0800, Linda A. Walsh wrote:
>
>
> FWIW -- xfs-oss, included as 'last line' was of minor interest; known bug on
> this kernel?:
> Linux Ishtar 2.6.35.7-T610-Vanilla-1 #2 SMP PREEMPT Mon Oct 11
> 17:19:41 PDT 2010 x86_64 x86_64 x86_64 GNU/Linux
....
> Of *SIGNIFICANT* note. In trying to create an empty file of the size
> used, from scratch, using 'xfs_mkfile', I got an error:
>
> > xfs_mkfile 5776419696 testfile
> pwrite64: Invalid argument

xfs_mkfile does not create an "empty" file. It creates a file that
is full of zeros.

And you're getting that error because:

5776419696 / 512 = 11,282,069.7188

the last write is not a multiple of the sector size, and xfs_mkfile
uses direct IO. It has always failed when you try to do this. If you
want to create allocated, zeroed files of arbitrary size, then use:

xfs_io -f -c "truncate $size" -c "resvsp 0 $size" $filename

to preallocate it. It'll be much, much faster than xfs_mkfile.
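
If you need it from C, the ioctl underneath resvsp is roughly this
(a sketch; the header location and availability vary with xfsprogs
versions):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <xfs/xfs.h>   /* XFS_IOC_RESVSP64, xfs_flock64_t (xfsprogs-devel) */

int main(void)
{
    int fd = open("testfile", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    xfs_flock64_t fl = { 0 };
    fl.l_whence = SEEK_SET;          /* range is absolute */
    fl.l_start  = 0;
    fl.l_len    = 5776419696LL;      /* reserve the full size */

    if (ftruncate(fd, fl.l_len) < 0 ||          /* set the file size */
        ioctl(fd, XFS_IOC_RESVSP64, &fl) < 0)   /* preallocate blocks */
        perror("resvsp");
    close(fd);
    return 0;
}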

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-03-02 02:28:07

by L A Walsh

Subject: Re: RFE kernel option to do the desirable thing, w/regards to 'O_DIRECT' and mis-aligned data


Thanks for the shorthand, Dave, but I wasn't really trying to use
xfs_mkfile to make the file that was failing -- I was using it as
an example supporting the idea that both cases should succeed: if a
write to an O_DIRECT file is a partial write, it should be allowed
to succeed, and the kernel, knowing the device's minimum write size
from the driver, could buffer the last sector.
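
Userspace can already ask the driver for that minimum write size;
e.g. a sketch using the BLKSSZGET ioctl on the underlying block
device (the device path is just an example):

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sda", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    int sector = 0;
    if (ioctl(fd, BLKSSZGET, &sector) == 0)   /* logical sector size */
        printf("logical sector size: %d\n", sector);
    close(fd);
    return 0;
}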

To deal with back-compat issues, it could be keyed off a proc
var like /proc/kernel/fs/direct_IO_handling using bitfields (or
multiple vars, if you don't like bitfields),
with the bits defined as:

Bit 0  Allow partial writes that start at an aligned position
Bit 1  Allow non-aligned writes
Bit 2  Allow partial reads that start at an aligned position
Bit 3  Allow non-aligned reads
Bit 4  Use the general FS cache for affected sectors

It's a bit of 'overkill' for what I wanted (just the case controlled
by Bit 0), but for the sake of completeness I thought all of these
combinations should be specifiable.

A default of 0 keeps the current behavior, with mis-aligned data
accesses failing, while setting the various bits would allow the
kernel to handle mis-aligned accesses automatically, much like the
x86 processor handles mis-aligned integer or stack accesses
automatically (perhaps at a performance penalty, but with a tendency
toward 'working' rather than failing, where possible).

It seems better to put that logic in the kernel than to saddle every
application using DIRECT I/O with handling the non-aligned cases.
This seems especially useful given the long-term trend toward
increasing use of static-memory devices, which will likely support
arbitrary direct I/O sizes.

Linda Walsh