There is a race between ext4 buffered writes and direct I/O reads when
the dioread_nolock mount option is enabled. The problem is that we
clear the PageWriteback flag at end_io time but, with dioread_nolock,
perform the uninitialized-to-initialized extent conversion later. If an
O_DIRECT read request comes in during this window, ext4 will return
zeros instead of the recently written data.
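Schematically (step names approximate):

    buffered write                          O_DIRECT read
    --------------                          -------------
    write page into an unwritten extent
    end_io: clear PageWriteback
                                            filemap_write_and_wait_range()
                                              sees no writeback, proceeds
                                            look up extent: still unwritten,
                                              so the read returns zeros
    workqueue: convert extent to written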
This patch checks whether there are any pending uninitialized-to-initialized
extent conversion requests before doing an O_DIRECT read, to close the race.
Note that this is just a band-aid fix. The fundamental issue is that we
clear the PageWriteback flag before the I/O is really complete, which is
error-prone. To fix the fundamental issue, we may need to implement an
extent tree cache that we can use to look up pending to-be-converted extents.
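For concreteness, one shape such a cache might take (all names here are
hypothetical; this sketches an interface, not an implementation):

    /*
     * Hypothetical per-inode tree of ranges that have completed IO
     * but still await unwritten->written conversion, keyed by
     * logical block so readers can query a range cheaply.
     */
    struct ext4_pending_extent {
            struct rb_node  node;
            ext4_lblk_t     lblk;   /* first logical block */
            ext4_lblk_t     len;    /* number of blocks */
    };

    /* True if [lblk, lblk + len) overlaps any pending conversion. */
    bool ext4_test_pending_conversion(struct inode *inode,
                                      ext4_lblk_t lblk, ext4_lblk_t len);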
Signed-off-by: Jiaying Zhang <[email protected]>
---
fs/ext4/indirect.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 6c27111..ca90d73 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
}
retry:
- if (rw == READ && ext4_should_dioread_nolock(inode))
+ if (rw == READ && ext4_should_dioread_nolock(inode)) {
+ if (unlikely(!list_empty(&ei->i_completed_io_list))) {
+ mutex_lock(&inode->i_mutex);
+ ext4_flush_completed_IO(inode);
+ mutex_unlock(&inode->i_mutex);
+ }
ret = __blockdev_direct_IO(rw, iocb, inode,
inode->i_sb->s_bdev, iov,
offset, nr_segs,
ext4_get_block, NULL, NULL, 0);
- else {
+ } else {
ret = blockdev_direct_IO(rw, iocb, inode,
inode->i_sb->s_bdev, iov,
offset, nr_segs,
--
1.7.3.1
On Thu, Aug 18, 2011 at 06:28:45PM -0700, Jiaying Zhang wrote:
> @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
> }
>
> retry:
> - if (rw == READ && ext4_should_dioread_nolock(inode))
> + if (rw == READ && ext4_should_dioread_nolock(inode)) {
> + if (unlikely(!list_empty(&ei->i_completed_io_list))) {
> + mutex_lock(&inode->i_mutex);
> + ext4_flush_completed_IO(inode);
> + mutex_unlock(&inode->i_mutex);
> + }
Doesn't this largely invalidate the reasons for using dioread_nolock
in the first place, which was to avoid taking the i_mutex for
performance reasons? If we are willing to solve the problem this way,
I wonder if we'd be better off simply telling users to disable
dioread_nolock if they care about cache consistency between DIO reads
and buffered writes?
(Yes, I do understand that in the hopefully common case where a user
is not trying to do buffered writes and DIO reads at the same time,
we'll just take and release the mutex very quickly, but still, it's
got to have a performance impact.)
I seem to recall a conversation I had with Stephen Tweedie over a
decade ago, where he noted that many other Unix systems made
absolutely no cache consistency guarantees with respect to DIO and the
page cache, but he wanted to set a higher standard for Linux. Which
is fair enough, but I wonder whether, since the raison d'etre of
dioread_nolock is to avoid the i_mutex lock, we should simply say
that one of the side effects of dioread_nolock is that (for now) the
cache consistency guarantees are repealed if this mount option is
chosen.
What do folks think?
- Ted
Hi Ted,
On Fri, Aug 19, 2011 at 1:57 PM, Ted Ts'o <[email protected]> wrote:
> On Thu, Aug 18, 2011 at 06:28:45PM -0700, Jiaying Zhang wrote:
>> @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>>       }
>>
>>  retry:
>> -     if (rw == READ && ext4_should_dioread_nolock(inode))
>> +     if (rw == READ && ext4_should_dioread_nolock(inode)) {
>> +             if (unlikely(!list_empty(&ei->i_completed_io_list))) {
>> +                     mutex_lock(&inode->i_mutex);
>> +                     ext4_flush_completed_IO(inode);
>> +                     mutex_unlock(&inode->i_mutex);
>> +             }
>
> Doesn't this largely invalidate the reasons for using dioread_nolock
> in the first place, which was to avoid taking the i_mutex for
> performance reasons? If we are willing to solve the problem this way,
> I wonder if we'd be better off simply telling users to disable
> dioread_nolock if they care about cache consistency between DIO reads
> and buffered writes?
>
> (Yes, I do understand that in the hopefully common case where a user
> is not trying to do buffered writes and DIO reads at the same time,
> we'll just take and release the mutex very quickly, but still, it's
> got to have a performance impact.)
My hope is that in the common case applications just do a lot of
DIO reads, so the unlikely(!list_empty(&ei->i_completed_io_list))
check fails and we never take the i_mutex lock. It is certainly not an
optimal solution, because we only care about data consistency for the
blocks being read, not for all uncompleted writes. But again, the more
general solution would require some kind of extent tree, which would
take some development effort.
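For illustration, a rough sketch of that narrower check, taking i_mutex
only when a completed-but-unconverted io_end actually overlaps the range
being read (untested, and the helper name is made up; it assumes the
io_end offset/size fields and i_completed_io_lock as they exist today):

    /*
     * Sketch only: return true if any io_end on the completed list
     * overlaps [offset, offset + count).  The DIO read path could
     * call this instead of the plain list_empty() check and flush
     * under i_mutex only when it returns true.
     */
    static bool ext4_range_pending_conversion(struct inode *inode,
                                              loff_t offset, size_t count)
    {
            struct ext4_inode_info *ei = EXT4_I(inode);
            ext4_io_end_t *io;
            unsigned long flags;
            bool pending = false;

            spin_lock_irqsave(&ei->i_completed_io_lock, flags);
            list_for_each_entry(io, &ei->i_completed_io_list, list) {
                    if (io->offset < offset + count &&
                        io->offset + io->size > offset) {
                            pending = true;
                            break;
                    }
            }
            spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
            return pending;
    }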
Jiaying
>
> I seem to recall a conversation I had with Stephen Tweedie over a
> decade ago, where he noted that many other Unix systems made
> absolutely no cache consistency guarantees with respect to DIO and the
> page cache, but he wanted to set a higher standard for Linux. Which
> is fair enough, but I wonder whether, since the raison d'etre of
> dioread_nolock is to avoid the i_mutex lock, we should simply say
> that one of the side effects of dioread_nolock is that (for now) the
> cache consistency guarantees are repealed if this mount option is
> chosen.
>
> What do folks think?
>
>                                        - Ted
>