After some comments from Oliver Diedrich (editor of heise.de), who told me
he couldn't make O_DIRECT work on 2.4.17, I tried with different kernel
versions and file systems:
This is the result:
2.4.14 - Ext[23] - redhat7.2 glibc: OK (at least the bytes are written)
2.4.17 - ReiserFS - Debian Sid : FAILS (0 bytes file, write returns -1)
2.4.17 - Ext2 - Debian Woody : OK (bytes written)
2.4.17 - Ext3 - Debian Woody : FAILS (0 bytes file, write returns -1)
Oliver Diedrich also said he could make O_DIRECT work with ext3 and 2.4.17.
Is this normal? Does it really work on 2.4.14? Or does it not work, and the
kernel simply isn't bypassing the cache?
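For reference, the test was essentially of this shape. This is only a minimal
sketch rather than the exact program used; the 4096-byte buffer alignment,
the transfer size and the file name are assumptions:

/*
 * Minimal O_DIRECT write test sketch. On these kernels the buffer, file
 * offset and length all need to be aligned to the filesystem block size.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char buf[4096] __attribute__ ((aligned (4096)));
    int fd;
    ssize_t n;

    fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    memset(buf, 'A', sizeof(buf));
    n = write(fd, buf, sizeof(buf));    /* -1 in the failing cases above */
    printf("write returned %d\n", (int) n);
    close(fd);
    return 0;
}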
Funny behaviour...
Regards,
--
ricardo
"I just stopped using Windows and now you tell me to use Mirrors?"
- said Aunt Tillie, just before downloading 2.5.3 kernel.
Ricardo Galli wrote:
>
> After some comments from Oliver Diedrich (editor of heise.de), who told me
> he couldn't make O_DIRECT work on 2.4.17, I tried with different kernel
> versions and file systems:
>
> This is the result:
>
> 2.4.14 - Ext[23] - redhat7.2 glibc: OK (at least the bytes are written)
> 2.4.17 - ReiserFS - Debian Sid : FAILS (0 bytes file, write returns -1)
> 2.4.17 - Ext2 - Debian Woody : OK (bytes written)
> 2.4.17 - Ext3 - Debian Woody : FAILS (0 bytes file, write returns -1)
>
> Oliver Diedrich also said he could make O_DIRECT work with ext3 and 2.4.17.
>
> Is this normal? Does it really work on 2.4.14? Or does it not work, and the
> kernel simply isn't bypassing the cache?
>
ext2 is the only filesystem which has O_DIRECT support.
-
On 01/02/02 21:44, Andrew Morton wrote:
> Ricardo Galli wrote:
> > After some comments from Oliver Diedrich (editor of heise.de), who told
> > me he couldn't make O_DIRECT work on 2.4.17, I tried with different
> > kernel versions and file systems:
> >
> > This is the result:
> >
> > 2.4.14 - Ext[23] - redhat7.2 glibc: OK (at least the bytes are written)
> > 2.4.17 - ReiserFS - Debian Sid : FAILS (0 bytes file, write returns -1)
> > 2.4.17 - Ext2 - Debian Woody : OK (bytes written)
> > 2.4.17 - Ext3 - Debian Woody : FAILS (0 bytes file, write returns -1)
> >
> > Oliver Diedrich also said he could make O_DIRECT work with ext3 and
> > 2.4.17.
> >
> > Is this normal? Does it really work on 2.4.14? Or does it not work, and
> > the kernel simply isn't bypassing the cache?
>
> ext2 is the only filesystem which has O_DIRECT support.
Does that mean that the successful test with ext3 and 2.4.14 is bogus?
--
ricardo
"I just stopped using Windows and now you tell me to use Mirrors?"
- said Aunt Tillie, just before downloading 2.5.3 kernel.
Ricardo Galli wrote:
>
> > ext2 is the only filesystem which has O_DIRECT support.
>
> Does that mean that the successful test with ext3 and 2.4.14 is bogus?
>
Yep. O_DIRECT was added around 2.4.10, was tugged out for a while
and then went back in again.
-
On Fri, 2002-02-01 at 14:44, Andrew Morton wrote:
> Ricardo Galli wrote:
> >
> > After some comments from Oliver Diedrich (editor of heise.de), who told me
> > he couldn't make O_DIRECT work on 2.4.17, I tried with different kernel
> > versions and file systems:
> >
> > This is the result:
> >
> > 2.4.14 - Ext[23] - redhat7.2 glibc: OK (at least the bytes are written)
> > 2.4.17 - ReiserFS - Debian Sid : FAILS (0 bytes file, write returns -1)
> > 2.4.17 - Ext2 - Debian Woody : OK (bytes written)
> > 2.4.17 - Ext3 - Debian Woody : FAILS (0 bytes file, write returns -1)
> >
> > Oliver Diedrich also said he could make O_DIRECT work with ext3 and 2.4.17.
> >
> > Is this normal? Does it really work on 2.4.14? Or does it not work, and the
> > kernel simply isn't bypassing the cache?
> >
>
> ext2 is the only filesystem which has O_DIRECT support.
And XFS ;-)
Steve
On Fri, Feb 01, 2002 at 03:05:38PM -0600, Steve Lord wrote:
> ext2 is the only filesystem which has O_DIRECT support.
And XFS ;-)
I sent reiserfs O_DIRECT support patches to someone a while ago. I
can look to resurrect these (assuming I can find them!)
Chris Mason is always going to be a better source for these anyhow, he
certainly understands any complex nuances there may be. Chris, do you
have any cycles to comment on this please?
--cw
Chris Wedgwood wrote:
>On Fri, Feb 01, 2002 at 03:05:38PM -0600, Steve Lord wrote:
>
> > ext2 is the only filesystem which has O_DIRECT support.
>
> And XFS ;-)
>
>I sent reiserfs O_DIRECT support patches to someone a while ago. I
>can look to resurrect these (assuming I can find them!)
>
>Chris Mason is always going to be a better source for these anyhow, he
>certainly understands any complex nuances there may be. Chris, do you
>have any cycles to comment on this please?
>
>
>
>
> --cw
You might try sending them to me if you want them to be reviewed and
hopefully go in.
Cc [email protected] if you do, because I will be away until the 24th.
Hans
On Saturday, February 02, 2002 01:35:56 AM -0800 Chris Wedgwood <[email protected]> wrote:
> On Fri, Feb 01, 2002 at 03:05:38PM -0600, Steve Lord wrote:
>
> > ext2 is the only filesystem which has O_DIRECT support.
>
> And XFS ;-)
>
> I sent reiserfs O_DIRECT support patches to someone a while ago. I
> can look to resurrect these (assuming I can find them!)
>
> Chris Mason is always going to be a better source for these anyhow, he
> certainly understands any complex nuances there may be. Chris, do you
> have any cycles to comment on this please?
I've dug your patch out of my archives; it should be safer now that
we've got the expanding truncate patch into the kernel (2.4.18pre).
I'm porting it forward now.
thanks,
chris
In article <[email protected]> you wrote:
>> Oliver Diedrich also said he could make O_DIRECT work with ext3 and 2.4.17.
>>
>> Is this normal? Does it really work on 2.4.14? Or does it not work, and the
>> kernel simply isn't bypassing the cache?
>>
>
> ext2 is the only filesystem which has O_DIRECT support.
You forgot JFS and XFS. Also there is a patch for NFS, but this one
requires a prototype change for ->direct_IO.
Christoph
--
Of course it doesn't work. We've performed a software upgrade.
Ok, the tricky part of direct io on reiserfs is the tails. But,
since direct io isn't allowed on non-page aligned file sizes, we'll
never have direct io onto a normal file tail.
< 2.4.18 reiserfs versions allowed expanding truncates to set i_size
without creating the corresponding metadata, so we still have to deal
with that. It means we could have a packed tail on any file size,
including those bigger than the 16k limit after which we don't create
tails any more.
Chris and I had initially decided to unpack the tails on file open
if O_DIRECT is used, but it seems cleaner to add a
reiserfs_get_block_direct_io, and have it return -EINVAL if a read
went to a tail. writes that happen to a tail will trigger tail
conversion.
Anyway, this patch is very lightly tested; I'll try all the corner
cases on Sunday.
-chris
# against 2.4.18-pre7
#
--- temp.1/fs/reiserfs/inode.c Mon, 28 Jan 2002 09:51:50 -0500
+++ temp.1(w)/fs/reiserfs/inode.c Sat, 02 Feb 2002 12:26:50 -0500
@@ -445,6 +445,20 @@
return reiserfs_get_block(inode, block, bh_result, GET_BLOCK_NO_HOLE) ;
}
+static int reiserfs_get_block_direct_io (struct inode * inode, long block,
+ struct buffer_head * bh_result, int create) {
+ int ret ;
+
+ ret = reiserfs_get_block(inode, block, bh_result, create) ;
+
+ /* don't allow direct io onto tail pages */
+ if (ret == 0 && buffer_mapped(bh_result) && bh_result->b_blocknr == 0) {
+ ret = -EINVAL ;
+ }
+ return ret ;
+}
+
+
/*
** helper function for when reiserfs_get_block is called for a hole
** but the file tail is still in a direct item
@@ -2050,11 +2064,20 @@
return ret ;
}
+static int reiserfs_direct_io(int rw, struct inode *inode,
+ struct kiobuf *iobuf, unsigned long blocknr,
+ int blocksize)
+{
+ return generic_direct_IO(rw, inode, iobuf, blocknr, blocksize,
+ reiserfs_get_block_direct_io) ;
+}
+
struct address_space_operations reiserfs_address_space_operations = {
writepage: reiserfs_writepage,
readpage: reiserfs_readpage,
sync_page: block_sync_page,
prepare_write: reiserfs_prepare_write,
commit_write: reiserfs_commit_write,
- bmap: reiserfs_aop_bmap
+ bmap: reiserfs_aop_bmap,
+ direct_IO: reiserfs_direct_io,
} ;
On Sat, Feb 02, 2002 at 01:20:08PM -0500, Chris Mason wrote:
>
> Ok, the tricky part of direct io on reiserfs is the tails. But,
> since direct io isn't allowed on non-page aligned file sizes, we'll
> never have direct io onto a normal file tail.
>
> < 2.4.18 reiserfs versions allowed expanding truncates to set i_size
> without creating the corresponding metadata, so we still have to deal
> with that. It means we could have a packed tail on any file size,
> including those bigger than the 16k limit after which we don't create
> tails any more.
>
> Chris and I had initially decided to unpack the tails on file open
> if O_DIRECT is used, but it seems cleaner to add a
> reiserfs_get_block_direct_io, and have it return -EINVAL if a read
> went to a tail. writes that happen to a tail will trigger tail
> conversion.
This is a safe approach (no risk of corruption etc.). However, to provide
the same semantics as the other filesystems it would be even better if
we could unpack the tail within reiserfs_get_block_direct_io rather than
returning -EINVAL, but ok, most apps should work fine anyway (and at
worst people can work around the magic by remounting reiserfs with notail
before writing the data that will later need to be handled via
O_DIRECT).
thanks for the patch,
Andrea
On Saturday, February 02, 2002 08:54:38 PM +0100 Andrea Arcangeli <[email protected]> wrote:
>> Chris and I had initially decided to unpack the tails on file open
>> if O_DIRECT is used, but it seems cleaner to add a
>> reiserfs_get_block_direct_io, and have it return -EINVAL if a read
>> went to a tail. writes that happen to a tail will trigger tail
>> conversion.
>
> This is a safe approach (no risk of corruption etc.). However, to provide
> the same semantics as the other filesystems it would be even better if
> we could unpack the tail within reiserfs_get_block_direct_io rather than
> returning -EINVAL, but ok, most apps should work fine anyway (and at
> worst people can work around the magic by remounting reiserfs with notail
> before writing the data that will later need to be handled via
> O_DIRECT).
In the normal case, O_DIRECT can't be done on a file with a tail.
The way I read generic_file_direct_IO, O_DIRECT is only done in
units that start block aligned, and continue for a block aligned
length. So, this can never include a packed file tail.
We should only need to worry if i_size on the file is wrong, and allows a
read/write to a block aligned chunk on a file with a tail, which should
only be legal in the expanding truncate case from older kernels. The
-EINVAL return should only happen in this (very unlikely) case.
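Illustratively, the rule amounts to something like the check below; this is
just a sketch of the constraint being described, not a copy of the actual
generic_file_direct_IO code:

/* A direct I/O request is only well formed if the starting offset and the
 * length are both multiples of the block size, so in the normal case it
 * can never land on a sub-block packed tail. */
static int direct_io_args_ok(long long offset, unsigned long count,
                             unsigned int blocksize)
{
    unsigned long mask = blocksize - 1;    /* blocksize is a power of two */

    return ((offset & mask) == 0) && ((count & mask) == 0);
}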
-chris
Chris Mason wrote:
>
>On Saturday, February 02, 2002 08:54:38 PM +0100 Andrea Arcangeli <[email protected]> wrote:
>
>>>Chris and I had initially decided to unpack the tails on file open
>>>if O_DIRECT is used, but it seems cleaner to add a
>>>reiserfs_get_block_direct_io, and have it return -EINVAL if a read
>>>went to a tail. writes that happen to a tail will trigger tail
>>>conversion.
>>>
>>This is a safe approach (no risk of corruption etc.). However, to provide
>>the same semantics as the other filesystems it would be even better if
>>we could unpack the tail within reiserfs_get_block_direct_io rather than
>>returning -EINVAL, but ok, most apps should work fine anyway (and at
>>worst people can work around the magic by remounting reiserfs with notail
>>before writing the data that will later need to be handled via
>>O_DIRECT).
>>
>
>In the normal case, O_DIRECT can't be done on a file with a tail.
>
>The way I read generic_file_direct_IO, O_DIRECT is only done in
>units that start block aligned, and continue for a block aligned
>length. So, this can never include a packed file tail.
>
>We should only need to worry if i_size on the file is wrong, and allows a
>read/write to a block aligned chunk on a file with a tail, which should
>only be legal in the expanding truncate case from older kernels. The
>-EINVAL return should only happen in this (very unlikely) case.
>
>-chris
>
Can't you fall back to buffered I/O for the tail? OK it complicates the
code, probably a lot, but it keeps things sane from the user's point of
view.
Steve
On Sat, Feb 02, 2002 at 02:16:41PM -0600, Stephen Lord wrote:
> Can't you fall back to buffered I/O for the tail? OK it complicates the
> code, probably a lot, but it keeps things sane from the user's point of
> view.
For O_DIRECT, IMHO you should fail not fallback. You're simply lying
to the underlying program otherwise.
In the ibu fs I am hacking on, the idea for O_DIRECT is to fail a read
if the file is small enough to fit in the inode. If the O_DIRECT
action is a write, then I will invalidate the data in the inode,
then follow the standard path (which eventually calls get_block()).
For file tails (a different case from small-file-in-inode), I
imagine it would be prudent to support O_DIRECT for all actions
except reading the file tail. If you want to be complicated, you
could provide userspace with a way to say "this is a dense file"
and/or simply not create a tail at all...
Jeff
Jeff Garzik <[email protected]> writes:
> On Sat, Feb 02, 2002 at 02:16:41PM -0600, Stephen Lord wrote:
> > Can't you fall back to buffered I/O for the tail? OK it complicates the
> > code, probably a lot, but it keeps things sane from the user's point of
> > view.
>
> For O_DIRECT, IMHO you should fail not fallback. You're simply lying
> to the underlying program otherwise.
It's just impossible to write a tail which is smaller than a disk block
without another buffer.
-Andi
Jeff Garzik wrote:
>On Sat, Feb 02, 2002 at 02:16:41PM -0600, Stephen Lord wrote:
>
>>Can't you fall back to buffered I/O for the tail? OK it complicates the
>>code, probably a lot, but it keeps things sane from the user's point of
>>view.
>>
>
>For O_DIRECT, IMHO you should fail not fallback. You're simply lying
>to the underlying program otherwise.
>
By fallback I meant just for the tail, not the whole file.
I have been there before. I had to implement the mixed mode buffered/direct
I/O on Unicos because a change in underlying disk subsystems stopped
customer applications from working - the allowed boundaries for
O_DIRECT stopped working when the sales people sold them some new
disks. This also meant you could get most of the speed benefits of
O_DIRECT without having to align your I/O; it also meant really
large I/Os could be made to automatically bypass cache to avoid
cache thrashing.
What we had were two flags, one which indicated use direct I/O, and another
which indicated return an error to user space rather than go through
buffers. So 'lie to me and make it work' or 'don't lie to me' options,
I suppose.
>
>
>In the ibu fs I am hacking on, the idea for O_DIRECT is to fail a read
>if the file is small enough to fit in the inode. If the O_DIRECT
>action is a write, then I will invalidate the data in the inode,
>then follow the standard path (which eventually calls get_block()).
>
>For file tails (a different case from small-file-in-inode), I
>imagine it would be prudent to support O_DIRECT for all actions
>except reading the file tail. If you want to be complicated, you
>could provide userspace with a way to say "this is a dense file"
>and/or simply not create a tail at all...
>
I suspect the reason XFS never did small files in the inode was because of
the problems with implementing mmap and O_DIRECT.
>
> Jeff
>
>
Steve
On Sun, Feb 03, 2002 at 07:40:57AM -0600, Stephen Lord wrote:
    What we had were two flags, one which indicated use direct I/O,
    and another which indicated return an error to user space rather
    than go through buffers. So 'lie to me and make it work' or 'don't
    lie to me' options, I suppose.
This seems way too complex in the case of reiserfs... you're only going
to see tails for small files (typically under 16k) and for the tail
part when it is less than a block.
Since O_DIRECT must be block sized and block aligned, I'm not sure
if this is a problem at present...
    I suspect the reason XFS never did small files in the inode was
    because of the problems with implementing mmap and O_DIRECT.
How does IRIX deal with O_DIRECT read/writes of a mapped area?
Invalidate them or just accept things as being incoherent?
--cw
Chris Wedgwood wrote:
>On Sun, Feb 03, 2002 at 07:40:57AM -0600, Stephen Lord wrote:
>
> What we had were two flags, one which indicated use direct I/O,
> and another which indicated return an error to user space rather
> than go through buffers. So 'lie to me and make it work' or 'don't
> lie to me' options, I suppose.
>
>This seems way too complex in the case of reiserfs... you're only going
>to see tails for small files (typically under 16k) and for the tail
>part when it is less than a block.
>
>Since O_DIRECT must be block sized and block aligned, I'm not sure
>if this is a problem at present...
>
I agree it is not a big issue in this case - my interpretation of tails
was that the end of any file could be packed, but if it is only small
files.....
>
>
> I suspect the reason XFS never did small files in the inode was
> because of the problems with implementing mmap and O_DIRECT.
>
>How does IRIX deal with O_DIRECT read/writes of a mapped area?
>Invalidate them or just accept things as being incoherent?
>
They are invalidated at the start of the I/O, but page faults are not blocked
out for the duration of the I/O, so the coherency is weak. However, if an
application is doing a combination of mmapped and direct I/O to a file
at the same time, then it should generally have some form of user space
synchronization anyway. For an application doing its own synchronization
of different I/Os they are coherent.
>
>
>
> --cw
>
Steve
On Sun, Feb 03, 2002 at 09:05:04AM -0600, Stephen Lord wrote:
    I agree it is not a big issue in this case - my interpretation of
    tails was that the end of any file could be packed, but if it is only
    small files.....
But you can't mmap (say) a 1k file right now... so right now this
isn't a problem, but at some point a larger mmap granularity would be
nice --- especially on architectures with small (or untagged) TLBs.
I'm guessing so as not to break backwards compatibility we will have
to support variable page-sizes (creating a plethora of nasties I
imagine).
    They are invalidated at the start of the I/O
Cool. That much I'd like to see under Linux
    but page faults are not blocked out for the duration of the I/O so
    the coherency is weak.
I was thinking this would also be good, basically invalidate those
pages and remove them from the VMAs, marking them as unusable pending
IO completion --- the logic here being if you were to fault on an
invalidated page during IO you deserve to block indefinitely until the
IO completes.
    However, if an application is doing a combination of mmapped and
    direct I/O to a file at the same time, then it should generally
    have some form of user space synchronization anyway.
I hadn't considered that. I imagined an application doing either but
not both, and the kernel enforcing this. However, in the case when
you want to mmap a large file, you may want to manipulate some pages
using mmap whilst writing others with O_DIRECT. Although, in such
cases arguably you could use multiple mappings.
--cw
Chris Wedgwood wrote:
>
> On Sun, Feb 03, 2002 at 09:05:04AM -0600, Stephen Lord wrote:
>
> I agree it is not a big issue in this case - my interpretation of
> tails was that the end of any file could be packed, but if it is only
> small files.....
>
> But you can't mmap (say) a 1k file right now... so right now this
huh? You can mmap a file of any size > 0. Is this a reiserfs
limitation or something?
Jeff
--
Jeff Garzik | "I went through my candy like hot oatmeal
Building 1024 | through an internally-buttered weasel."
MandrakeSoft | - goats.com
Andi Kleen wrote:
>
> Jeff Garzik <[email protected]> writes:
>
> > On Sat, Feb 02, 2002 at 02:16:41PM -0600, Stephen Lord wrote:
> > > Can't you fall back to buffered I/O for the tail? OK it complicates the
> > > code, probably a lot, but it keeps things sane from the user's point of
> > > view.
> >
> > For O_DIRECT, IMHO you should fail not fallback. You're simply lying
> > to the underlying program otherwise.
>
> It's just impossible to write a tail which is smaller than a disk block
> without another buffer.
I argue, for reiserfs:
For O_DIRECT writes, the preferred behavior is to write disk blocks
obtained through the normal methods (get_block, etc.), and fully support
inodes for which file tails do not exist.
For O_DIRECT reads, if the data is determined to be in a file tail,
->direct_IO should either (a) fail or (b) dump the file tail to a normal
disk block before performing ->direct_IO.
--
Jeff Garzik | "I went through my candy like hot oatmeal
Building 1024 | through an internally-buttered weasel."
MandrakeSoft | - goats.com
On Sun, 2002-02-03 at 16:44, Chris Wedgwood wrote:
> On Sun, Feb 03, 2002 at 09:05:04AM -0600, Stephen Lord wrote:
>
> but page faults are not blocked out for the duration of the I/O so
> the coherency is weak.
>
> I was thinking this would also be good, basically invalidate those
> pages and remove them from the VMAs, marking them as unusable pending
> IO completion --- the logic here being if you were to fault on an
> invalidated page during IO you deserve to block indefinitely until the
> IO completes.
>
> However, if an application is doing a combination of mmapped and
> direct I/O to a file at the same time, then it should generally
> have some form of user space synchronization anyway.
>
> I hadn't considered that. I imagined an application doing either but
> not both, and the kernel enforcing this. However, in the case when
> you want to mmap a large file, you may want to manipulate some pages
> using mmap whilst writing others with O_DIRECT. Although, in such
> cases arguably you could use multiple mappings.
>
>
If an application is single threaded then it cannot be doing both at
the same time - so all we need to do is flush and invalidate mappings
at the start of I/O. This is really only needed for the range covered by
the direct read/write.
If an application is multithreaded and is doing mmap and direct I/O
from different threads without doing its own synchronization, then it
is broken; there is no ordering guarantee provided by the kernel as
to what happens first.
>
> --cw
Steve
--
Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: [email protected]
On Monday, February 04, 2002 10:04:45 AM -0500 Jeff Garzik <[email protected]> wrote:
> Chris Wedgwood wrote:
>>
>> On Sun, Feb 03, 2002 at 09:05:04AM -0600, Stephen Lord wrote:
>>
>> I agree it is not a big issue in this case - my interpretation of
>> tails was that the end of any file could be packed, but if it is only
>> small files.....
>>
>> But you can't mmap (say) a 1k file right now... so right now this
>
> huh? You can mmap a file of any size > 0. Is this a reiserfs
> limitation or something?
>
No, reiserfs can mmap files of size 1k. Data past the end of file is
zeroed on write.
-chris
On Monday, February 04, 2002 10:13:37 AM -0500 Jeff Garzik <[email protected]> wrote:
> Andi Kleen wrote:
>>
>> Jeff Garzik <[email protected]> writes:
>>
>> > On Sat, Feb 02, 2002 at 02:16:41PM -0600, Stephen Lord wrote:
>> > > Can't you fall back to buffered I/O for the tail? OK it complicates the
>> > > code, probably a lot, but it keeps things sane from the user's point of
>> > > view.
>> >
>> > For O_DIRECT, IMHO you should fail not fallback. You're simply lying
>> > to the underlying program otherwise.
>>
>> It's just impossible to write a tail which is smaller than a disk block
>> without another buffer.
>
> I argue, for reiserfs:
>
> For O_DIRECT writes, the preferred behavior is to write disk blocks
> obtained through the normal methods (get_block, etc.), and fully support
> inodes for which file tails do not exist.
Done ;-)
>
> For O_DIRECT reads, if the data is determined to be in a file tail,
> ->direct_IO should either (a) fail or (b) dump the file tail to a normal
> disk block before performing ->direct_IO.
The current patch does (a). Another option is to change the reiserfs open
code to detect the tail and return -EINVAL for O_DIRECT. This gives the
application a better way to fall back to normal open methods than returning
an error during the read.
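That would let the application do something along these lines (a sketch,
assuming the open itself is where the EINVAL comes back; the helper name is
made up for illustration):

/* Try O_DIRECT first, and drop back to ordinary buffered I/O if the
 * filesystem refuses it with EINVAL. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

static int open_maybe_direct(const char *path, int *is_direct)
{
    int fd = open(path, O_RDWR | O_DIRECT);

    if (fd >= 0) {
        *is_direct = 1;
        return fd;
    }
    if (errno != EINVAL)        /* a real error, not an O_DIRECT refusal */
        return -1;
    *is_direct = 0;             /* fall back to a normal, buffered open */
    return open(path, O_RDWR);
}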
Just to restate, the current O_DIRECT code can never hit a reiserfs tail in
the normal case. By definition, reiserfs tails are not block aligned, and
O_DIRECT writes are. The only time it is a concern is with a screwy
interaction between expanding truncates and tails on kernels < 2.4.17.
Since most O_DIRECT users are databases, and tails are never created on
files > 16k in size, I don't expect anyone to ever see the reiserfs
triggered -EINVAL from the current patch (famous last words).
-chris
> If an application is multithreaded and is doing mmap and direct I/O
> from different threads without doing its own synchronization, then it
> is broken, there is no ordering guarantee provided by the kernel as
> to what happens first.
Providing we don't allow asynchronous I/O with O_DIRECT once asynchronous
I/O is merged.
Alan
On Mon, 2002-02-04 at 09:46, Alan Cox wrote:
> > If an application is multithreaded and is doing mmap and direct I/O
> > from different threads without doing its own synchronization, then it
> > is broken, there is no ordering guarantee provided by the kernel as
> > to what happens first.
>
> Providing we don't allow asynchronous I/O with O_DIRECT once asynchronous
> I/O is merged.
But async I/O itself needs synchronisation (being English in this email ;-)
to be meaningful. If I issue a bunch of async I/O calls which overlap with
each other then the outcome is really undefined in terms of what ends up
on the disk. Scheduling of the actual I/O operations is really no different
from them being synchronous calls from different user space threads.
The only questions you can really ask is 'is read atomic with respect to
write?' and 'are writes atomic with respect to each other?'. So when you
perform a read it sees data from before or after writes, but never sees
data from half way through a write. And for multiple write calls the output
appears as if one write happened after the other, not intermingled
with each other.
Irix actually takes the viewpoint that it only needs to make a best effort
at synchronizing between direct I/O and other modes of I/O. Multiple
direct writers are allowed into a file at once, and direct writers and
buffered readers are also allowed to operate in parallel. At this point
coherency is really up to the applications. I am not presenting this as
a recommended model for linux, just reporting what it does.
>
> Alan
Steve
--
Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: [email protected]
On February 4, 2002 05:02 pm, Steve Lord wrote:
> But async I/O itself needs synchronisation (being English in this email ;-)
> to be meaningful. If I issue a bunch of async I/O calls which overlap with
> each other then the outcome is really undefined in terms of what ends up
> on the disk. Scheduling of the actual I/O operations is really no different
> from them being synchronous calls from different user space threads.
>
> The only questions you can really ask is 'is read atomic with respect to
> write?' and 'are writes atomic with respect to each other?'. So when you
> perform a read it sees data from before or after writes, but never sees
> data from half way through a write. And for multiple write calls the output
> appears as if one write happened after the other, not intermingled
> with each other.
Why is it not ok to have the writes come out intermingled, if that's what the
user has asked for? (Implicitly, by not synchronizing the writes.)
> Irix actually takes the viewpoint that it only needs to make a best effort
> at synchronizing between direct I/O and other modes of I/O. Multiple
> direct writers are allowed into a file at once, and direct writers and
> buffered readers are also allowed to operate in parallel. At this point
> coherency is really up to the applications. I am not presenting this as
> a recommended model for linux, just reporting what it does.
I'm having a little trouble with this. Suppose an application does direct
IO on a file but, unbeknownst to it, some other program has done buffered
IO on the file, so that there are still dirty blocks in the page cache,
waiting to land by surprise on top of unbuffered data. A third program
may come along to do buffered IO on the file, and find stale blocks in
cache. Am I missing something here?
--
Daniel
On Mon, Feb 04, 2002 at 03:46:20PM +0000, Alan Cox wrote:
> > If an application is multithreaded and is doing mmap and direct I/O
> > from different threads without doing its own synchronization, then it
> > is broken, there is no ordering guarantee provided by the kernel as
> > to what happens first.
>
> Providing we don't allow asynchronous I/O with O_DIRECT once asynchronous
> I/O is merged.
Oh, but async + O_DIRECT is a good thing. The fundamental
ordering comes down at the block layer. Things are synchronous there.
An application using async I/O knows that ordering is not guaranteed.
Applications using O_DIRECT know they are skipping the buffer cache.
"Caveat emptor" and "Don't do that then" apply to stupid applications.
The big issues I see are O_DIRECT alignment size (see my patch
to allow hardsectsize alignment on O_DIRECT ops) and whether or not to
synchronize with the caches upon O_DIRECT write. Keeping the
page/buffer caches in sync with O_DIRECT writes is a bit of work,
especially with writes smaller than sb_blocksize. You can either do
that work, or you can say that applications and people using O_DIRECT
should know the caches might be inconsistent. Large O_DIRECT users,
such as databases, already know this. They are happily ignorant of
cache inconsistencies. All they care about is hardsectsize O_DIRECT
operations.
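For illustration, the kind of operation they care about looks roughly like
this. It's a sketch assuming a 512-byte hard sector size (a careful program
would query it, e.g. with the BLKSSZGET ioctl on the device) and assuming
the kernel accepts hardsectsize-aligned O_DIRECT, as with the patch
mentioned above; the file name is a placeholder:

/* One sector-aligned O_DIRECT write: buffer, offset and length are all
 * multiples of the hard sector size. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t sect = 512;
    void *buf;
    int fd;
    ssize_t n;

    if (posix_memalign(&buf, sect, sect))    /* sector-aligned buffer */
        return 1;
    memset(buf, 0, sect);

    fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;
    n = write(fd, buf, sect);                /* one hard sector, aligned */
    close(fd);
    free(buf);
    return n == (ssize_t) sect ? 0 : 1;
}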
Joel
--
Life's Little Instruction Book #267
"Lie on your back and look at the stars."
http://www.jlbec.org/
[email protected]
Joel Becker wrote:
> should know the caches might be inconsistent. Large O_DIRECT users,
> such as databases, already know this. They are happily ignorant of
> cache inconsistencies. All they care about is hardsectsize O_DIRECT
> operations.
I have a similar inclination, inspired by the implementation of
"NTFS TNG": hard sector size should always equal sb->blocksize. This
allows for fine-grained operations at the O_DIRECT level, logical block
sizes > PAGE_CACHE_SIZE, easy implementation of fragments (>= hard sect
size), O_DIRECT for fragments, and other stuff.
This works right now in 2.4 and 2.5 with no modification to the VFS
core.
Jeff
--
Jeff Garzik | "I went through my candy like hot oatmeal
Building 1024 | through an internally-buttered weasel."
MandrakeSoft | - goats.com
On Mon, Feb 04, 2002 at 01:49:10PM -0500, Jeff Garzik wrote:
> I have a similar inclination, inspired by the implementation of
> "NTFS TNG": hard sector size should always equal sb->blocksize. This
> allows for fine-grained operations at the O_DIRECT level, logical block
> sizes > PAGE_CACHE_SIZE, easy implementation of fragments (>= hard sect
> size), O_DIRECT for fragments, and other stuff.
I'm not sure I get you here. When I say hardsectsize, I mean
get_hardsectsize(dev), not super->s_blocksize. On ext2, s_blocksize is
1k, 2k, or 4k. Databases want to use O_DIRECT aligned at 512b. This
can be done (again, see my patch), and I would think it necessary.
If you meant that s_blocksize should match get_hardsectsize, I
agree. If you meant the other way around, then consumers that want to
do O_DIRECT operations at 512b alignments won't be able to.
Joel
--
"All alone at the end of the evening
When the bright lights have faded to blue.
I was thinking about a woman who had loved me
And I never knew"
http://www.jlbec.org/
[email protected]
On Mon, 2002-02-04 at 12:22, Daniel Phillips wrote:
> On February 4, 2002 05:02 pm, Steve Lord wrote:
> > But async I/O itself needs synchronisation (being English in this email ;-)
> > to be meaningful. If I issue a bunch of async I/O calls which overlap with
> > each other then the outcome is really undefined in terms of what ends up
> > on the disk. Scheduling of the actual I/O operations is really no different
> > from them being synchronous calls from different user space threads.
> >
> > The only questions you can really ask is 'is read atomic with respect to
> > write?' and 'are writes atomic with respect to each other?'. So when you
> > perform a read it sees data from before or after writes, but never sees
> > data from half way through a write. And for multiple write calls the output
> > appears as if one write happened after the other, not intermingled
> > with each other.
>
> Why is it not ok to have the writes come out intermingled, if that's what the
> user has asked for? (Implicitly, by not synchronizing the writes.)
I cannot quote a source, but I have heard people say POSIX - or some
other standard; all I can find on Google is people saying read is
atomic wrt write, but there is no definition of writes wrt other
writes.
>
> > Irix actually takes the viewpoint that it only needs to make a best effort
> > at synchronizing between direct I/O and other modes of I/O. Multiple
> > direct writers are allowed into a file at once, and direct writers and
> > buffered readers are also allowed to operate in parallel. At this point
> > coherency is really up to the applications. I am not presenting this as
> > a recommended model for linux, just reporting what it does.
>
> I'm having a little trouble with this. Suppose an application does direct
> IO on a file but, unbeknownst to it, some other program has done buffered
> IO on the file, so that there are still dirty blocks in the page cache,
> waiting to land by surprise on top of unbuffered data. A third program
> may come along to do buffered IO on the file, and find stale blocks in
> cache. Am I missing something here?
No, you are not. I did not say it was totally coherent: at the start of
the direct I/O the caches are made coherent, but they can drift apart
if buffered or mmapped I/O is ongoing during the operation,
and yes, those blocks are stale in the cache.
In normal life people do not seem to mix direct I/O and other forms of
I/O in parallel.
If you want full coherency you have to lock out page faults and buffered
I/O during direct I/O. You also need deadlock avoidance code for the
case where someone does this:
fd = open("file", O_DIRECT|O_RDWR);
mem = mmap(NULL, 40960, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 20480);
read(fd, mem, 32768);
Steve
>
> --
> Daniel
--
Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: [email protected]
Joel Becker wrote:
> On Mon, Feb 04, 2002 at 01:49:10PM -0500, Jeff Garzik wrote:
> > hard sector size should always equal sb->blocksize.
> If you meant that s_blocksize should match get_hardsectsize, I
yes. get_hardsectsize returns hard sector size, so this is what I
meant.
--
Jeff Garzik | "I went through my candy like hot oatmeal
Building 1024 | through an internally-buttered weasel."
MandrakeSoft | - goats.com