2008-06-18 18:10:56

by Jeff Moyer

[permalink] [raw]
Subject: [patch] aio: invalidate async directio writes

Hi, Andrew,

This is a follow-up to:

commit bdb76ef5a4bc8676a81034a443f1eda450b4babb
Author: Zach Brown <[email protected]>
Date: Tue Oct 30 11:45:46 2007 -0700

dio: fix cache invalidation after sync writes

Commit 65b8291c4000e5f38fc94fb2ca0cb7e8683c8a1b ("dio: invalidate
clean pages before dio write") introduced a bug which stopped dio from
ever invalidating the page cache after writes. It still invalidated it
before writes so most users were fine.

Karl Schendel reported ( http://lkml.org/lkml/2007/10/26/481 ) hitting
this bug when he had a buffered reader immediately reading file data
after an O_DIRECT [writer] had written the data. The kernel issued
read-ahead beyond the position of the reader which overlapped with the
O_DIRECT writer. The failure to invalidate after writes caused the
reader to see stale data from the read-ahead.

The following patch is originally from Karl. The following commentary
is his:

The below 3rd try takes on your suggestion of just invalidating
no matter what the retval from the direct_IO call. I ran it
thru the test-case several times and it has worked every time.
The post-invalidate is probably still too early for async-directio,
but I don't have a testcase for that; just sync. And, this
won't be any worse in the async case.

I added a test to the aio-dio-regress repository which mimics Karl's IO
pattern. It verified the bad behaviour and that the patch fixed it. I
agree with Karl, this still doesn't help the case where a buffered
reader follows an AIO O_DIRECT writer. That will require a bit more
work.

This gives up on the idea of returning EIO to indicate to userspace that
stale data remains if the invalidation failed.

Note the second-to-last paragraph, where it mentions that this does not fix
the AIO case. I updated the regression test to also perform asynchronous
I/O and verified that the problem does exist.
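
For reference, the I/O pattern the updated test exercises boils down to
roughly the following (a sketch only, not the actual aio-dio-regress
test; the file name, block size, and loop count are made up). Build
with -laio:

/*
 * An AIO O_DIRECT writer fills a preallocated file with pattern blocks,
 * and a buffered reader reads each block back right after the write
 * completes.  Read-ahead triggered by the previous buffered read can
 * overlap the in-flight O_DIRECT write; without post-write invalidation
 * the reader then sees stale read-ahead pages.
 */
#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ	(64 * 1024)
#define NRBLKS	256

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	char rbuf[BLKSZ];
	void *wbuf;
	int wfd, rfd, i;

	wfd = open("aio_dio_testfile", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
	rfd = open("aio_dio_testfile", O_RDONLY);
	if (wfd < 0 || rfd < 0 || posix_memalign(&wbuf, 4096, BLKSZ))
		exit(1);
	/* preallocate the file so buffered read-ahead has pages to cache */
	if (ftruncate(wfd, (off_t)NRBLKS * BLKSZ))
		exit(1);
	if (io_setup(1, &ctx))
		exit(1);

	for (i = 0; i < NRBLKS; i++) {
		off_t off = (off_t)i * BLKSZ;

		/* AIO O_DIRECT write of a pattern block */
		memset(wbuf, i & 0xff, BLKSZ);
		io_prep_pwrite(&cb, wfd, wbuf, BLKSZ, off);
		if (io_submit(ctx, 1, cbs) != 1)
			exit(1);
		/* wait for the write to complete */
		if (io_getevents(ctx, 1, 1, &ev, NULL) != 1 || ev.res != BLKSZ)
			exit(1);

		/*
		 * Buffered read of the block we just wrote.  Read-ahead
		 * issued for this range by the previous iteration's read
		 * may have raced with the write above.
		 */
		if (pread(rfd, rbuf, BLKSZ, off) != BLKSZ)
			exit(1);
		if (memcmp(rbuf, wbuf, BLKSZ)) {
			fprintf(stderr, "stale data in block %d\n", i);
			exit(1);
		}
	}

	io_destroy(ctx);
	return 0;
}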

To fix the problem, we need to invalidate the pages that were under write
I/O after the I/O completes. Because the I/O completion handler can be called
in interrupt context (and invalidate_inode_pages2 cannot be called in interrupt
context), this patch opts to defer the completion to a workqueue. That
workqueue is responsible for invalidating the page cache pages and completing
the I/O.

I verified that the test case passes with the following patch applied.
Since this affects the code path for all async writers, I had our
performance group run a patched kernel through an OLTP workload. I'm
told that the noise level for this test is in the 5% range. The results
showed a 2.9% degradation in performance at 80 users, and 80 users was
the peak performance for the test. For other user counts, the patched
kernel varied from a percent better to a couple of percent worse. So,
overall I think the patch is worth taking, given that it fixes a
potential data corrupter. (Sorry I can't report the actual numbers,
since the tests were run on unreleased hardware.)

Comments, as always, are encouraged.

Cheers,

Jeff

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 9e81add..90fa9e2 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -131,6 +131,8 @@ struct dio {
int is_async; /* is IO async ? */
int io_error; /* IO error in completion path */
ssize_t result; /* IO result */
+ struct list_head done_list; /* AIO DIO writes to be completed
+ * in process context */
};

/*
@@ -260,6 +262,40 @@ static int dio_complete(struct dio *dio, loff_t offset, int ret)
return ret;
}

+static void aio_complete_fn(void *data);
+static DECLARE_WORK(aio_complete_work, aio_complete_fn, NULL);
+static DEFINE_SPINLOCK(iocb_completion_list_lock);
+static LIST_HEAD(iocb_completion_list);
+
+static void aio_complete_fn(void *data)
+{
+ unsigned long flags;
+ LIST_HEAD(private);
+ struct dio *dio, *tmp;
+
+ spin_lock_irqsave(&iocb_completion_list_lock, flags);
+ list_splice_init(&iocb_completion_list, &private);
+ spin_unlock_irqrestore(&iocb_completion_list_lock, flags);
+
+ list_for_each_entry_safe(dio, tmp, &private, done_list) {
+ struct kiocb *iocb = dio->iocb;
+ loff_t offset = iocb->ki_pos;
+ struct file *file = iocb->ki_filp;
+ struct address_space *mapping = file->f_mapping;
+ int ret;
+ pgoff_t end = (offset + dio->size - 1) >> PAGE_CACHE_SHIFT;
+
+ /* and now invalidate the page cache */
+ ret = dio_complete(dio, offset, 0);
+ if (mapping->nrpages)
+ invalidate_inode_pages2_range(mapping,
+ offset >> PAGE_CACHE_SHIFT, end);
+ aio_complete(dio->iocb, ret, 0);
+ list_del(&dio->done_list);
+ kfree(dio);
+ }
+}
+
static int dio_bio_complete(struct dio *dio, struct bio *bio);
/*
* Asynchronous IO callback.
@@ -280,9 +316,22 @@ static void dio_bio_end_aio(struct bio *bio, int error)
spin_unlock_irqrestore(&dio->bio_lock, flags);

if (remaining == 0) {
- int ret = dio_complete(dio, dio->iocb->ki_pos, 0);
- aio_complete(dio->iocb, ret, 0);
- kfree(dio);
+ /* For async O_DIRECT writes, we need to invalidate the
+ * page cache after the write completes. Kick off a
+ * workqueue to do this and issue the completion in process
+ * context.
+ */
+ if (dio->rw == READ) {
+ int ret = dio_complete(dio, dio->iocb->ki_pos, 0);
+ aio_complete(dio->iocb, ret, 0);
+ kfree(dio);
+ } else {
+ unsigned long flags;
+ spin_lock_irqsave(&iocb_completion_list_lock, flags);
+ list_add(&dio->done_list, &iocb_completion_list);
+ spin_unlock_irqrestore(&iocb_completion_list_lock, flags);
+ schedule_work(&aio_complete_work);
+ }
}
}


2008-06-18 18:22:45

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch] aio: invalidate async directio writes

On Wed, Jun 18, 2008 at 02:09:51PM -0400, Jeff Moyer wrote:
> + /* For async O_DIRECT writes, we need to invalidate the
> + * page cache after the write completes. Kick off a
> + * workqueue to do this and issue the completion in process
> + * context.
> + */
> + if (dio->rw == READ) {
> + int ret = dio_complete(dio, dio->iocb->ki_pos, 0);
> + aio_complete(dio->iocb, ret, 0);
> + kfree(dio);
> + } else {
> + unsigned long flags;
> + spin_lock_irqsave(&iocb_completion_list_lock, flags);
> + list_add(&dio->done_list, &iocb_completion_list);
> + spin_unlock_irqrestore(&iocb_completion_list_lock, flags);
> + schedule_work(&aio_complete_work);
> + }

Can we please move all these aio_complete calls to user context? Having
AIO contexts completing from irq context is a major pain for complex
filesystems like XFS.

2008-06-18 19:46:40

by Jeff Moyer

[permalink] [raw]
Subject: Re: [patch] aio: invalidate async directio writes

Christoph Hellwig <[email protected]> writes:

> On Wed, Jun 18, 2008 at 02:09:51PM -0400, Jeff Moyer wrote:
>> + /* For async O_DIRECT writes, we need to invalidate the
>> + * page cache after the write completes. Kick off a
>> + * workqueue to do this and issue the completion in process
>> + * context.
>> + */
>> + if (dio->rw == READ) {
>> + int ret = dio_complete(dio, dio->iocb->ki_pos, 0);
>> + aio_complete(dio->iocb, ret, 0);
>> + kfree(dio);
>> + } else {
>> + unsigned long flags;
>> + spin_lock_irqsave(&iocb_completion_list_lock, flags);
>> + list_add(&dio->done_list, &iocb_completion_list);
>> + spin_unlock_irqrestore(&iocb_completion_list_lock, flags);
>> + schedule_work(&aio_complete_work);
>> + }
>
> Can we please move all these aio_complete calls to user context? Having
> AIO contexts completing from irq context is a major pain for complex
> filesystems like XFS.

Can you help me understand why this is a pain? I'm having trouble
making the connection.

Thanks!

Jeff

2008-06-18 19:48:46

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch] aio: invalidate async directio writes

On Wed, Jun 18, 2008 at 03:45:28PM -0400, Jeff Moyer wrote:
> > Can we please move all these aio_complete calls to user context? Having
> > AIO contexts completing from irq context is a major pain for complex
> > filesystems like XFS.
>
> Can you help me understand why this is a pain? I'm having trouble
> making the connection.

With your patch we complete aio dio write requests in user context, which
is great for filesystems that need to do more complex work in the
completion handler, e.g. XFS for the unwritten extent conversion. But
doing this only for the write case is of limited use; we should be doing
this for the read case, too.

See fs/xfs/linux-2.6/xfs_aops.c:xfs_end_io_direct() for what I mean.

2008-06-19 07:51:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] aio: invalidate async directio writes

On Wed, 2008-06-18 at 14:09 -0400, Jeff Moyer wrote:
> To fix the problem, we need to invalidate the pages that were under write
> I/O after the I/O completes. Because the I/O completion handler can be called
> in interrupt context (and invalidate_inode_pages2 cannot be called in interrupt
> context), this patch opts to defer the completion to a workqueue. That
> workqueue is responsible for invalidating the page cache pages and completing
> the I/O.
>
> I verified that the test case passes with the following patch applied.

I'm utterly ignorant of all things [AD]IO, but doesn't deferring the
invalidate open up/widen a race window?

2008-06-19 13:52:14

by Jeff Moyer

[permalink] [raw]
Subject: Re: [patch] aio: invalidate async directio writes

Peter Zijlstra <[email protected]> writes:

> I'm utterly ignorant of all things [AD]IO, but doesn't deferring the
> invalidate open up/widen a race window?

We weren't doing the invalidate at all before this patch. This patch
introduces the invalidation, but we can't do it in interrupt context.

Cheers,

Jeff

2008-06-19 13:59:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] aio: invalidate async directio writes

On Thu, 2008-06-19 at 09:50 -0400, Jeff Moyer wrote:
> > I'm utterly ignorant of all things [AD]IO, but doesn't deferring the
> > invalidate open up/widen a race window?
>
> We weren't doing the invalidate at all before this patch. This patch
> introduces the invalidation, but we can't do it in interrupt context.

Sure, I understand that, so this patch goes from always wrong, to
sometimes wrong. I'm just wondering if this non-determinism will hurt
us.

2008-06-19 14:07:02

by Jeff Moyer

[permalink] [raw]
Subject: Re: [patch] aio: invalidate async directio writes

Peter Zijlstra <[email protected]> writes:

>> We weren't doing the invalidate at all before this patch. This patch
>> introduces the invalidation, but we can't do it in interrupt context.
>
> Sure, I understand that, so this patch goes from always wrong, to
> sometimes wrong. I'm just wondering if this non-determinism will hurt
> us.

Actually, it goes from sometimes wrong to sometimes wrong, but less
likely so. We already have the non-determinism. And, as I mentioned, a
test case written to expose this very problem doesn't hit it after this
patch.

I'll see if I can't come up with a more deterministic solution. Any
ideas on the matter would be welcome. ;)

Thanks for taking a look, Peter.

Cheers,

Jeff

2008-06-19 17:23:19

by Zach Brown

[permalink] [raw]
Subject: Re: [patch] aio: invalidate async directio writes


> +static DECLARE_WORK(aio_complete_work, aio_complete_fn, NULL);
> +static DEFINE_SPINLOCK(iocb_completion_list_lock);
> +static LIST_HEAD(iocb_completion_list);

It seems like a bad idea to funnel all AIO DIO completion in the system
through one cacheline. Should we have per-cpu lists and work structs?

> + unsigned long flags;
> + spin_lock_irqsave(&iocb_completion_list_lock, flags);
> + list_add(&dio->done_list, &iocb_completion_list);
> + spin_unlock_irqrestore(&iocb_completion_list_lock, flags);
> + schedule_work(&aio_complete_work);

And we should probably use list_add_tail() here so that we don't reverse
the order of IO completion and end_io() callbacks.

And hopefully going per-cpu could simplify the locking so that we don't
have even more per-io locking.
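
Roughly something like this is what I have in mind (an untested sketch
only; all names are made up, and initialization of the per-cpu lists and
work items is left out):

struct dio_complete_pcpu {
	spinlock_t		lock;	/* bio completion still runs in irq context */
	struct list_head	list;	/* dios awaiting process-context completion */
	struct work_struct	work;
};

static DEFINE_PER_CPU(struct dio_complete_pcpu, dio_complete_pcpu);

static void dio_defer_complete(struct dio *dio)
{
	struct dio_complete_pcpu *c = &get_cpu_var(dio_complete_pcpu);
	unsigned long flags;

	spin_lock_irqsave(&c->lock, flags);
	/* tail-add so completion order matches IO completion order */
	list_add_tail(&dio->done_list, &c->list);
	spin_unlock_irqrestore(&c->lock, flags);
	schedule_work(&c->work);
	put_cpu_var(dio_complete_pcpu);
}

The work function would then splice the local CPU's list and run the same
dio_complete()/invalidate/aio_complete() sequence as in your patch.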

- z

2008-06-19 17:50:56

by Zach Brown

[permalink] [raw]
Subject: Re: [patch] aio: invalidate async directio writes


> I'm utterly ignorant of all things [AD]IO, but doesn't deferring the
> invalidate open up/widen a race window?

The case we care about making consistent is buffered reads which the
user politely issues only after O_DIRECT writes have completed.

If they issue buffered reads that race with O_DIRECT writes, well, they
get to see weird versions of the data. Just like if they issue buffered
reads that race with buffered writes.

But we must make sure that reads issued after the O_DIRECT writes are
not satisfied by cached data which was populated during a nasty racing
read. So we invalidate cached pages in the range of the write after the
O_DIRECT write is on disk but before we tell the user that the write has
completed.

This is made even worse by the observation that racing buffered reads
might be issued behind the user's back as part of read-ahead. People
have hit this very issue in systems where they have a ring buffer in a
file, an O_DIRECT writer, and a buffered reader that reads the blocks
just after they're written.

- z