2007-10-26 21:28:47

by Karl Schendel

[permalink] [raw]
Subject: [PATCH] Fix bad data from non-direct-io read after direct-io write

This patch fixes a race between direct IO writes and non-direct IO
reads on the same file. The symptom is a stale file page seen by
any non-direct-IO reader, which persists until the page is invalidated
somehow (e.g. page rewritten again, or memory pressure, or reboot).

An improper return test caused direct-IO's after-write page invalidations
to be skipped. If we're writing page N, and the reader is reading
page N-x for small x, and the read code decides to readahead, it's
not too hard to cause a race that leaves an old, stale copy of the
page in the page cache. Retval is usually +nonzero after the
mapping->a_ops->direct_IO call!

Signed-off-by: Karl Schendel <[email protected]>

---

By the way, I agree that the userland situation is stupid, and I'm
addressing that in the application (happens to be the Ingres DBMS).
However, the kernel shouldn't compound the stupidity.

I'll try to watch for replies, but it would be very useful to
cc me at [email protected] if any discussion is needed;
I'm not subscribed to lkml.


--- linux-2.6.23.1-base/mm/filemap.c 2007-10-12 12:43:44.000000000 -0400
+++ linux-2.6.23.1/mm/filemap.c 2007-10-26 16:12:08.000000000 -0400
@@ -2194,7 +2194,7 @@ generic_file_direct_IO(int rw, struct ki
}

retval = mapping->a_ops->direct_IO(rw, iocb, iov, offset, nr_segs);
- if (retval)
+ if (retval < 0)
goto out;

/*


2007-10-26 21:35:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Fix bad data from non-direct-io read after direct-io write


Hmm. If I read this right, this bug seems to have been introduced by
commit 65b8291c4000e5f38fc94fb2ca0cb7e8683c8a1b ("dio: invalidate clean
pages before dio write") back in March.

Before that, we'd call invalidate_inode_pages2_range() unconditionally
after the call mapping->a_ops->direct_IO() if it was a write and there
were cached pages in the mapping (well, "unconditionally" in the sense
that it didn't depend on the return value of the ->direct_IO() call).

However, with both the old and the new code _and_ with your patch, the
return code - in case the invalidate failed - was corrupted. So we may
actually end up doing some IO, but then returning the "wrong" error code
from the invalidate. Hmm?

Somebody who cares about direct-IO and who - unlike me - doesn't think
it's a total and idiotic crock should think hard about this. I'm including
Karl's email, but also an alternate patch for consideration.

And maybe some day we can all agree that direct_IO is crap and should not
be done.

Linus

--
diff --git a/mm/filemap.c b/mm/filemap.c
index 5209e47..032371a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2510,22 +2510,17 @@ generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
}

retval = mapping->a_ops->direct_IO(rw, iocb, iov, offset, nr_segs);
- if (retval)
+ if (retval < 0)
goto out;

/*
* Finally, try again to invalidate clean pages which might have been
* faulted in by get_user_pages() if the source of the write was an
* mmap()ed region of the file we're writing. That's a pretty crazy
- * thing to do, so we don't support it 100%. If this invalidation
- * fails and we have -EIOCBQUEUED we ignore the failure.
+ * thing to do, so we don't support it 100%.
*/
- if (rw == WRITE && mapping->nrpages) {
- int err = invalidate_inode_pages2_range(mapping,
- offset >> PAGE_CACHE_SHIFT, end);
- if (err && retval >= 0)
- retval = err;
- }
+ if (rw == WRITE && mapping->nrpages)
+ invalidate_inode_pages2_range(mapping, offset >> PAGE_CACHE_SHIFT, end);
out:
return retval;
}

On Fri, 26 Oct 2007, Karl Schendel wrote:
>
> This patch fixes a race between direct IO writes and non-direct IO
> reads on the same file. The symptom is a stale file page seen by
> any non-direct-IO reader, which persists until the page is invalidated
> somehow (e.g. page rewritten again, or memory pressure, or reboot).
>
> An improper return test caused direct-IO's after-write page invalidations
> to be skipped. If we're writing page N, and the reader is reading
> page N-x for small x, and the read code decides to readahead, it's
> not too hard to cause a race that leaves an old, stale copy of the
> page in the page cache. Retval is usually +nonzero after the
> mapping->a_ops->direct_IO call!
>
> Signed-off-by: Karl Schendel <[email protected]>
>
> ---
>
> By the way, I agree that the userland situation is stupid, and I'm
> addressing that in the application (happens to be the Ingres DBMS).
> However, the kernel shouldn't compound the stupidity.
>
> I'll try to watch for replies, but it would be very useful to
> cc me at [email protected] if any discussion is needed;
> I'm not subscribed to lkml.
>
>
> --- linux-2.6.23.1-base/mm/filemap.c 2007-10-12 12:43:44.000000000 -0400
> +++ linux-2.6.23.1/mm/filemap.c 2007-10-26 16:12:08.000000000 -0400
> @@ -2194,7 +2194,7 @@ generic_file_direct_IO(int rw, struct ki
> }
>
> retval = mapping->a_ops->direct_IO(rw, iocb, iov, offset, nr_segs);
> - if (retval)
> + if (retval < 0)
> goto out;
>
> /*
>

2007-10-26 22:10:58

by Karl Schendel

[permalink] [raw]
Subject: Re: [PATCH] Fix bad data from non-direct-io read after direct-io write

Linus Torvalds wrote:

> .... So we may
> actually end up doing some IO, but then returning the "wrong" error code
> from the invalidate. Hmm?
>

A point. In an all-seeing, all-caring universe, it would be the read
hitting the cached page that couldn't be invalidated that would get
the error, not the write. I can't get too worked up over that, though.

In any case, as you say, if the write worked it should report as such.
Perhaps this equivalent but slightly cleaned-up patch instead?

Karl

--- linux-2.6.23.1-base/mm/filemap.c 2007-10-12 12:43:44.000000000 -0400
+++ linux-2.6.23.1/mm/filemap.c 2007-10-26 18:00:00.000000000 -0400
@@ -2194,21 +2194,17 @@ generic_file_direct_IO(int rw, struct ki
}

retval = mapping->a_ops->direct_IO(rw, iocb, iov, offset, nr_segs);
- if (retval)
- goto out;

/*
* Finally, try again to invalidate clean pages which might have been
- * faulted in by get_user_pages() if the source of the write was an
- * mmap()ed region of the file we're writing. That's a pretty crazy
- * thing to do, so we don't support it 100%. If this invalidation
- * fails and we have -EIOCBQUEUED we ignore the failure.
+ * cached by non-direct readahead, or faulted in by get_user_pages()
+ * if the source of the write was an mmap'ed region of the file
+ * we're writing. Either one is a pretty crazy thing to do,
+ * so we don't support it 100%. If this invalidation
+ * fails, tough, the write still worked...
*/
- if (rw == WRITE && mapping->nrpages) {
- int err = invalidate_inode_pages2_range(mapping,
- offset >> PAGE_CACHE_SHIFT, end);
- if (err && retval >= 0)
- retval = err;
+ if (retval >= 0 && rw == WRITE && mapping->nrpages) {
+ invalidate_inode_pages2_range(mapping, offset >> PAGE_CACHE_SHIFT, end);
}
out:
return retval;

2007-10-26 22:30:27

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH] Fix bad data from non-direct-io read after direct-io write

Linus Torvalds wrote:
> Hmm. If I read this right, this bug seems to have been introduced by
> commit 65b8291c4000e5f38fc94fb2ca0cb7e8683c8a1b ("dio: invalidate clean
> pages before dio write") back in March.

Agreed. And it's a really dumb bug. ->direct_io will almost always
return -EIOCBQUEUED for aio dio so it won't be invalidating for aio dio
writes. (Notice that the testing in that commit mentions two racing
processes, I bet U$1M that I only tested sync dio :/)

I think that test should be changed to

if (retval < 0 && retval != -EIOCBQUEUED)
goto out;


> However, with both the old and the new code _and_ with your patch, the
> return code - in case the invalidate failed - was corrupted. So we may
> actually end up doing some IO, but then returning the "wrong" error code
> from the invalidate. Hmm?

If the invalidation fails then the app is left with stale data in the
page cache and current data on disk. The return code corruption you're
referring to was intended to communicate this scary situation to the app
with EIO.

It sucks. Does it suck more than returning success for the dio write
when later buffered reads will return stale data? I dunno. What does
the peanut gallery think?

> And maybe some day we can all agree that direct_IO is crap and should not
> be done.

Chris (Mason) and I certainly love the idea of getting rid of
fs/direct-io.c. Getting O_DIRECT working with the page-granular
buffered locking rules while doing large IOs (and, as far as I know,
potentially sector-granular) without noticeable performance regressions
is a mess.

- z

2007-10-26 22:41:52

by Karl Schendel

[permalink] [raw]
Subject: Re: [PATCH] Fix bad data from non-direct-io read after direct-io write

Zach Brown wrote:
> Linus Torvalds wrote:
>> Hmm. If I read this right, this bug seems to have been introduced by
>> commit 65b8291c4000e5f38fc94fb2ca0cb7e8683c8a1b ("dio: invalidate clean
>> pages before dio write") back in March.
>
> Agreed. And it's a really dumb bug. ->direct_io will almost always
> return -EIOCBQUEUED for aio dio so it won't be invalidating for aio dio
> writes. (Notice that the testing in that commit mentions two racing
> processes, I bet U$1M that I only tested sync dio :/)

Well, actually, in this case both processes are doing sync IO.
It's just that the writer is direct and the reader isn't, with
the reader usually behind the writer but close enough that
readahead crosses the writer reasonably often. With the
if (retval) test, we only invalidate if nothing got written
at all!

I hadn't even thought of aio directio. Yeah, the invalidate
should happen for -EIOCBQUEUED as well, I guess.

The disparity in direct-io-ness on the part of the reader
vs writer is a userland dum-dum, no question. (That's
getting fixed, too.)

Karl

2007-10-26 22:43:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Fix bad data from non-direct-io read after direct-io write



On Fri, 26 Oct 2007, Zach Brown wrote:
>
> I think that test should be changed to

How about not testing at all? Which was what the old code did.

Just do the invalidate unconditionally for any writes, and screw the end
result of the invalidate, since we cannot afford to overwrite the previous
return value anyway in any realistic scenario?

Linus

2007-10-26 22:54:21

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH] Fix bad data from non-direct-io read after direct-io write

Linus Torvalds wrote:
>
> On Fri, 26 Oct 2007, Zach Brown wrote:
>> I think that test should be changed to
>
> How about not testing at all? Which was what the old code did.
>
> Just do the invalidate unconditionally for any writes, and screw the end
> result of the invalidate, since we cannot afford to overwrite the previous
> return value anyway in any realistic scenario?

I'm reasonably comfortable with that, sure. This second invalidation
only catches reads which userspace raced with the write, and that's
already racy by definition.

I can throw together a patch if you haven't already committed one by the
time you read this ;).

- z

2007-10-26 23:16:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Fix bad data from non-direct-io read after direct-io write



On Fri, 26 Oct 2007, Zach Brown wrote:
>
> I can throw together a patch if you haven't already committed one by the
> time you read this ;).

I'm not touching that code except to send out possible patches for others
to test and comment on. I have no test-cases, nor any real interest in it.
So yeah, please send me a tested and thought-through patch.

It sounded like Karl actually had a test-case to trigger this, but maybe
I'm confused (and it sounds a bit unlikely, since it should be hard to
trigger.. Although maybe you can trigger it by doing a direct_IO write
with the *source* being an mmap() of the file you're writing to, and
depending on the write itself re-populating the destination in the page
cache?)

Karl?

Linus

2007-10-26 23:28:29

by Karl Schendel

[permalink] [raw]
Subject: Re: [PATCH] Fix bad data from non-direct-io read after direct-io write

Linus Torvalds wrote:
>
> On Fri, 26 Oct 2007, Zach Brown wrote:
>> I can throw together a patch if you haven't already committed one by the
>> time you read this ;).
>
> I'm not touching that code except to send out possible patches for others
> to test and comment on. I have no test-cases, nor any real interest in it.
> So yeah, please send me a tested and thought-through patch.
>
> It sounded like Karl actually had a test-case to trigger this...

Yes, I do - I'd been tripping over it once every couple weeks,
and I finally figured out how to hold my mouth right so that it
fails (almost) every time.

The below 3rd try takes on your suggestion of just invalidating
no matter what the retval from the direct_IO call. I ran it
thru the test-case several times and it has worked every time.
The post-invalidate is probably still too early for async-directio,
but I don't have a testcase for that; just sync. And, this
won't be any worse in the async case.

Karl

---

--- linux-2.6.23.1-base/mm/filemap.c 2007-10-12 12:43:44.000000000 -0400
+++ linux-2.6.23.1/mm/filemap.c 2007-10-26 19:21:20.000000000 -0400
@@ -2194,21 +2194,17 @@ generic_file_direct_IO(int rw, struct ki
}

retval = mapping->a_ops->direct_IO(rw, iocb, iov, offset, nr_segs);
- if (retval)
- goto out;

/*
* Finally, try again to invalidate clean pages which might have been
- * faulted in by get_user_pages() if the source of the write was an
- * mmap()ed region of the file we're writing. That's a pretty crazy
- * thing to do, so we don't support it 100%. If this invalidation
- * fails and we have -EIOCBQUEUED we ignore the failure.
+ * cached by non-direct readahead, or faulted in by get_user_pages()
+ * if the source of the write was an mmap'ed region of the file
+ * we're writing. Either one is a pretty crazy thing to do,
+ * so we don't support it 100%. If this invalidation
+ * fails, tough, the write still worked...
*/
if (rw == WRITE && mapping->nrpages) {
- int err = invalidate_inode_pages2_range(mapping,
- offset >> PAGE_CACHE_SHIFT, end);
- if (err && retval >= 0)
- retval = err;
+ invalidate_inode_pages2_range(mapping, offset >> PAGE_CACHE_SHIFT, end);
}
out:
return retval;

2007-10-26 23:38:46

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH] Fix bad data from non-direct-io read after direct-io write

Linus Torvalds wrote:
>
> On Fri, 26 Oct 2007, Zach Brown wrote:
>> I can throw together a patch if you haven't already committed one by the
>> time you read this ;).
>
> I'm not touching that code except to send out possible patches for others
> to test and comment on. I have no test-cases, nor any real interest in it.
> So yeah, please send me a tested and thought-through patch.

Sure thing. It looks like Karl just sent out the patch I had in mind,
so I'll run it through the tests on Monday. I assume everyone can
handle waiting that long.

> It sounded like Karl actually had a test-case to trigger this, but maybe
> I'm confused (and it sounds a bit unlikely, since it should be hard to
> trigger..

It's disappointingly easy to trigger with ext3 and ordered extending
writes. The invalidation can race with jbd pinning the bhs while it
commits the transaction. It happens that ext3 returns failure from
->releasepage if jbd is committing the transaction.

http://git.kernel.org/?p=linux/kernel/git/zab/aio-dio-regress.git;a=blob;f=c/aio-dio-invalidate-failure.c

> Although maybe you can trigger it by doing a direct_IO write
> with the *source* being an mmap() of the file you're writing to, and
> depending on the write itself re-populating the destination in the page
> cache?)

dio will definitely bring the page in with get_user_pages() but I'd like
to think that that pinned page could be safely invalidated out of the
page cache while dio holds a page reference. I've never tried it, though.

- z

2007-10-30 18:45:53

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH] Fix bad data from non-direct-io read after direct-io write


> Yes, I do - I'd been tripping over it once every couple weeks,
> and I finally figured out how to hold my mouth right so that it
> fails (almost) every time.

OK, I tested and verified Karl's fix and wrote some commentary around it.
(Would a aio-dio git repo on kernel.org for these kind of fixes be well
received?)

----
dio: fix cache invalidation after sync writes

Commit commit 65b8291c4000e5f38fc94fb2ca0cb7e8683c8a1b ("dio: invalidate clean
pages before dio write") introduced a bug which stopped dio from ever
invalidating the page cache after writes. It still invalidated it before
writes so most users were fine.

Karl Schendel reported hitting this bug ( http://lkml.org/lkml/2007/10/26/481 )
when he had a buffered reader immediately reading file data after an O_DIRECT
wirter had written the data. The kernel issued read-ahead beyond the position
of the reader which overlapped with the O_DIRECT writer. The failure to
invalidate after writes caused the reader to see stale data from the
read-ahead.

The following patch is originally from Karl. The following commentary is his:

The below 3rd try takes on your suggestion of just invalidating
no matter what the retval from the direct_IO call. I ran it
thru the test-case several times and it has worked every time.
The post-invalidate is probably still too early for async-directio,
but I don't have a testcase for that; just sync. And, this
won't be any worse in the async case.

I added a test to the aio-dio-regress repository which mimics Karl's IO
pattern. It verifed the bad behaviour and that the patch fixed it. I agree
with Karl, this still doesn't help the case where a buffered reader follows an
AIO O_DIRECT writer. That will require a bit more work.

This gives up on the idea of returning EIO to indicate to userspace that stale
data remains if the invalidation failed.

Signed-off-by: Zach Brown <[email protected]>

--- linux-2.6.23.1-base/mm/filemap.c 2007-10-12 12:43:44.000000000 -0400
+++ linux-2.6.23.1/mm/filemap.c 2007-10-26 19:21:20.000000000 -0400
@@ -2194,21 +2194,17 @@ generic_file_direct_IO(int rw, struct ki
}

retval = mapping->a_ops->direct_IO(rw, iocb, iov, offset, nr_segs);
- if (retval)
- goto out;

/*
* Finally, try again to invalidate clean pages which might have been
- * faulted in by get_user_pages() if the source of the write was an
- * mmap()ed region of the file we're writing. That's a pretty crazy
- * thing to do, so we don't support it 100%. If this invalidation
- * fails and we have -EIOCBQUEUED we ignore the failure.
+ * cached by non-direct readahead, or faulted in by get_user_pages()
+ * if the source of the write was an mmap'ed region of the file
+ * we're writing. Either one is a pretty crazy thing to do,
+ * so we don't support it 100%. If this invalidation
+ * fails, tough, the write still worked...
*/
if (rw == WRITE && mapping->nrpages) {
- int err = invalidate_inode_pages2_range(mapping,
- offset >> PAGE_CACHE_SHIFT, end);
- if (err && retval >= 0)
- retval = err;
+ invalidate_inode_pages2_range(mapping, offset >> PAGE_CACHE_SHIFT, end);
}
out:
return retval;

2007-10-30 19:14:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Fix bad data from non-direct-io read after direct-io write



On Tue, 30 Oct 2007, Zach Brown wrote:
>
> OK, I tested and verified Karl's fix and wrote some commentary around it.

Thanks, will apply.

> (Would a aio-dio git repo on kernel.org for these kind of fixes be well
> received?)

If it's one of these "in a blue moon" things I don't think it matters.
Whatever is easier.

In general, patches have lots of advantages: (a) you can apply it
regardless of version, and (b) pushing the email containing them back and
forth with commentary etc is very powerful.

So I would generally see git as a "maintainer handling issue", not a
"these kinds of fixes" issue. Git doesn't obviate the need for having
people look at patches (and that implies sending them out). But git _is_ a
good way to push the finished product out, if it's a recurring thing.

Linus