2002-10-04 20:14:33

by Steve Lord

Subject: 2.5 O_DIRECT problem

Hi Andrew,

I ran into a problem with 2.5's O_DIRECT read path.

generic_file_direct_IO usually ends up in generic_direct_IO,
which does bounds checking on the I/O and then flushes any
cached data.

Once we return to generic_file_direct_IO, we call invalidate_inode_pages2
whenever there are cached pages, regardless of whether generic_direct_IO
succeeded.

If we make a non-aligned O_DIRECT read call, we still end up calling
invalidate_inode_pages2, but we never do the filemap_fdatawrite and
filemap_fdatawait calls. The end result is that we throw the buffered
data away.
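
To make the ordering concrete, here is a trimmed-down sketch of the
current flow (modelled on 2.5.40's fs/direct-io.c, with arguments and
error handling abbreviated):

	/*
	 * Sketch of the pre-fix ordering, heavily trimmed.
	 *
	 * generic_direct_IO() bails out with -EINVAL on a misaligned
	 * request *before* it reaches filemap_fdatawrite()/
	 * filemap_fdatawait(), but the caller still invalidates.
	 */
	ssize_t generic_file_direct_IO(int rw, struct inode *inode,
			const struct iovec *iov, loff_t offset,
			unsigned long nr_segs)
	{
		ssize_t retval;

		/* may return -EINVAL without ever flushing dirty pages */
		retval = inode->i_mapping->a_ops->direct_IO(rw, inode,
						iov, offset, nr_segs);

		/* ...yet dirty pagecache is dropped unconditionally */
		if (inode->i_mapping->nrpages)
			invalidate_inode_pages2(inode->i_mapping);
		return retval;
	}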

Either the flush needs to happen before the bounds checks, or the
invalidate should only be done on a successful write. It looks
pretty hard to detect the latter case with the current structure:
we can get EINVAL from the bounds check and possibly from an
aligned but invalid memory address being passed in.

This is worse for xfs than for other filesystems since if we do the
invalidate without the flush first, we end up with a delayed allocate
extent in memory and no data to flush into it. Various spots in the
filesystem dislike this.

I can fix it inside xfs by doing my own flush and invalidate first, but
the generic code should really be fixed.
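
For illustration, the workaround would look roughly like this on the
xfs side (hypothetical sketch, not the actual xfs code; the helper
name is made up):

	/*
	 * Hypothetical filesystem-side workaround: write back and
	 * invalidate cached pages ourselves before calling into the
	 * generic O_DIRECT path, so its early -EINVAL return can no
	 * longer cause dirty data to be thrown away.
	 */
	static int xfs_flush_and_invalidate(struct address_space *mapping)
	{
		int error = 0;

		if (mapping->nrpages) {
			error = filemap_fdatawrite(mapping);
			if (!error)
				error = filemap_fdatawait(mapping);
			if (!error)
				invalidate_inode_pages2(mapping);
		}
		return error;
	}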

Thoughts?

Steve

--

Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: [email protected]


2002-10-04 20:45:54

by Andrew Morton

Subject: Re: 2.5 O_DIRECT problem

Steve Lord wrote:
>
> On Fri, 2002-10-04 at 15:29, Andrew Morton wrote:
> > Steve Lord wrote:
> > > Either the flush needs to happen before the bounds checks, or the
> > > invalidate should only be done on a successful write. It looks
> > > pretty hard to detect the latter case with the current structure:
> > > we can get EINVAL from the bounds check and possibly from an
> > > aligned but invalid memory address being passed in.
> >
> > Yes I agree; let's just do the sync before any checks.
> >
> > I think it should be moved into generic_file_direct_IO(),
> > because that's the place where the invalidation happens, yes?
>
> OK, sounds good to me, I will let my tests churn away on that
> version and see what happens. I think something else is doing
> the same thing to me elsewhere, but it might well be an xfs
> specific case.

I queued the below patch.

BTW, I'm sitting on a patch from Badari which allows the direct-io
code to perform 512-byte-aligned and multiple-of-512-byte-sized
IO against a 4k blocksize filesystem. We make (reasonable)
assumptions about the return value from get_block(): scale it
from softblocksize up to sectors and then add an offset.
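
In other words the mapping works out to something like this (just an
illustration of the arithmetic; the names are mine, not the patch's):

	/*
	 * Illustration only: scale a filesystem block number, as
	 * returned by get_block() against a (1 << blkbits) soft
	 * blocksize, to 512-byte sectors, then add the sector offset
	 * of this chunk within that block.
	 */
	static unsigned long block_to_sector(unsigned long fs_block,
			unsigned blkbits, loff_t file_offset)
	{
		unsigned blkfactor = blkbits - 9;  /* fs block -> sectors */
		unsigned long sector = fs_block << blkfactor;

		/* which 512-byte sector inside the fs block? */
		sector += (file_offset >> 9) & ((1 << blkfactor) - 1);
		return sector;
	}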

And we do weird things with the ZERO_PAGE to cater for the case
where the filesystem block is buffer_new(): we use bits of the
ZERO_PAGE to fill out the gaps in the BIOs, zeroing out the
unwritten parts of the disk blocks.
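
The gap filling boils down to working out how much of the new block the
user's data does not cover (a sketch of the idea only, not the patch's
code):

	/*
	 * Sketch only: for a buffer_new() block which the user's
	 * buffer covers only partially, compute the leading and
	 * trailing byte ranges which still have to be written as
	 * zeroes, with the ZERO_PAGE as the source for those bio
	 * segments.
	 */
	static void new_block_gaps(unsigned blkbits, unsigned offset_in_block,
			unsigned len, unsigned *front, unsigned *back)
	{
		unsigned blocksize = 1 << blkbits;

		*front = offset_in_block;                  /* zeroes before the data */
		*back = blocksize - offset_in_block - len; /* zeroes after it */
	}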

We're also currently requiring that the filesystem pass its backing
block_device * into generic_direct_IO so we can run bdev_hardsectsize()
at the right time, which I'm not 100% happy with.

But relaxing the 4k alignment requirement has great value, so
we'll persist with that. I'll include that in the next -mm;
you may want to take a look at it.



patch:

If the alignment checks in generic_direct_IO() fail, we end up not
forcing writeback of dirty pagecache pages, but we still run
invalidate_inode_pages2(). The net result is that dirty pagecache gets
incorrectly removed. I guess this will expose unwritten disk blocks.

So move the sync up into generic_file_direct_IO(), where we perform the
invalidation. That way we know that pagecache and disk are in sync before
we do anything else.



fs/direct-io.c | 21 ++++++++++-----------
1 files changed, 10 insertions(+), 11 deletions(-)

--- 2.5.40/fs/direct-io.c~direct-io-invalidation-fix Fri Oct 4 13:41:37 2002
+++ 2.5.40-akpm/fs/direct-io.c Fri Oct 4 13:41:37 2002
@@ -620,13 +620,11 @@ generic_direct_IO(int rw, struct inode *
int seg;
size_t size;
unsigned long addr;
- struct address_space *mapping = inode->i_mapping;
unsigned blocksize_mask = (1 << inode->i_blkbits) - 1;
ssize_t retval = -EINVAL;

- if (offset & blocksize_mask) {
+ if (offset & blocksize_mask)
goto out;
- }

/* Check the memory alignment. Blocks cannot straddle pages */
for (seg = 0; seg < nr_segs; seg++) {
@@ -636,14 +634,6 @@ generic_direct_IO(int rw, struct inode *
goto out;
}

- if (mapping->nrpages) {
- retval = filemap_fdatawrite(mapping);
- if (retval == 0)
- retval = filemap_fdatawait(mapping);
- if (retval)
- goto out;
- }
-
retval = direct_io_worker(rw, inode, iov, offset, nr_segs, get_blocks);
out:
return retval;
@@ -656,8 +646,17 @@ generic_file_direct_IO(int rw, struct in
struct address_space *mapping = inode->i_mapping;
ssize_t retval;

+ if (mapping->nrpages) {
+ retval = filemap_fdatawrite(mapping);
+ if (retval == 0)
+ retval = filemap_fdatawait(mapping);
+ if (retval)
+ goto out;
+ }
+
retval = mapping->a_ops->direct_IO(rw, inode, iov, offset, nr_segs);
if (inode->i_mapping->nrpages)
invalidate_inode_pages2(inode->i_mapping);
+out:
return retval;
}
