2011-01-13 22:23:16

by Eric Sandeen

Subject: [PATCH] ext4: serialize unaligned asynchronous DIO



ext4 has a data corruption case when doing non-block-aligned
asynchronous direct IO into a sparse file, as demonstrated
by xfstest 240.

The root cause is that while ext4 preallocates space in the
hole, mappings of that space still look "new" and
dio_zero_block() will zero out the unwritten portions. When
more than one AIO thread is going, they both find this "new"
block and race to zero out their portion; this is uncoordinated
and causes data corruption.
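
For illustration only (this is an editor's sketch, not part of the patch or
of xfstest 240 itself), the racy pattern can be reproduced from userspace
roughly like this: two sector-aligned but not filesystem-block-aligned async
direct writes submitted together into a hole of a sparse file, which is what
the qemu sector-63 case generates. File name, sizes and offsets below are
made up, error checking is omitted, and it builds with -laio:

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb[2], *cbs[2] = { &cb[0], &cb[1] };
	struct io_event ev[2];
	void *buf[2];
	int fd, i;

	/* sparse file: both writes below land in a hole, inside i_size */
	fd = open("sparse.img", O_RDWR | O_CREAT | O_DIRECT, 0644);
	ftruncate(fd, 1 << 20);
	io_setup(8, &ctx);

	for (i = 0; i < 2; i++) {
		posix_memalign(&buf[i], 512, 512);
		memset(buf[i], 'a' + i, 512);
		/*
		 * Offsets 512 and 1024: 512-byte aligned (so O_DIRECT is
		 * happy) but both inside the same 4k filesystem block, so
		 * each iocb makes dio_zero_block() zero out the rest of
		 * that "new" block - including the other iocb's data.
		 */
		io_prep_pwrite(&cb[i], fd, buf[i], 512, 512 * (i + 1));
	}
	io_submit(ctx, 2, cbs);		/* both AIOs in flight at once */
	io_getevents(ctx, 2, 2, ev, NULL);

	io_destroy(ctx);
	close(fd);
	return 0;
}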

Dave Chinner fixed this for xfs by simply serializing all
unaligned asynchronous direct IO. I've done the same here.
This is a very big hammer, and I'm not very pleased with
stuffing this into ext4_file_write(). But since ext4 is
DIO_LOCKING, we need to serialize it at this high level.

I tried to move this into ext4_ext_direct_IO, but by then
we have the i_mutex already, and we will wait on the
work queue to do conversions - which must also take the
i_mutex. So that won't work.

This was originally exposed by qemu-kvm installing to
a raw disk image with a normal sector-63 alignment. I've
tested a backport of this patch with qemu, and it does
avoid the corruption. It is also quite a lot slower
(14 min for package installs, vs. 8 min for well-aligned)
but I'll take slow correctness over fast corruption any day.

Mingming suggested that perhaps we can track outstanding
conversions, and wait on that instead so that non-sparse
files won't be affected, but I've had trouble making that
work so far, and would like to get the corruption hole
plugged ASAP. Perhaps adding a printk_once() warning of
the perf degradation on this path would be useful?

Signed-off-by: Eric Sandeen <[email protected]>
---

Index: linux-2.6/fs/ext4/ext4.h
===================================================================
--- linux-2.6.orig/fs/ext4/ext4.h
+++ linux-2.6/fs/ext4/ext4.h
@@ -848,6 +848,7 @@ struct ext4_inode_info {
atomic_t i_ioend_count; /* Number of outstanding io_end structs */
/* current io_end structure for async DIO write*/
ext4_io_end_t *cur_aio_dio;
+ struct mutex i_aio_mutex; /* big hammer for unaligned AIO */

spinlock_t i_block_reservation_lock;

Index: linux-2.6/fs/ext4/file.c
===================================================================
--- linux-2.6.orig/fs/ext4/file.c
+++ linux-2.6/fs/ext4/file.c
@@ -55,11 +55,31 @@ static int ext4_release_file(struct inod
return 0;
}

+static int
+ext4_unaligned_aio(struct inode *inode, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos)
+{
+ struct super_block *sb = inode->i_sb;
+ int blockmask = sb->s_blocksize - 1;
+ size_t count = iov_length(iov, nr_segs);
+ loff_t final_size = pos + count;
+
+ if (pos >= inode->i_size)
+ return 0;
+
+ if ((pos & blockmask) || (final_size & blockmask))
+ return 1;
+
+ return 0;
+}
+
static ssize_t
ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos)
{
struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
+ int unaligned_aio = 0;
+ int ret;

/*
* If we have encountered a bitmap-format file, the size limit
@@ -78,9 +98,21 @@ ext4_file_write(struct kiocb *iocb, cons
nr_segs = iov_shorten((struct iovec *)iov, nr_segs,
sbi->s_bitmap_maxbytes - pos);
}
+ } else if (unlikely((iocb->ki_filp->f_flags & O_DIRECT) &&
+ !is_sync_kiocb(iocb)))
+ unaligned_aio = ext4_unaligned_aio(inode, iov, nr_segs, pos);
+
+ if (unaligned_aio) {
+ mutex_lock(&EXT4_I(inode)->i_aio_mutex);
+ ext4_ioend_wait(inode);
}

- return generic_file_aio_write(iocb, iov, nr_segs, pos);
+ ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
+
+ if (unaligned_aio)
+ mutex_unlock(&EXT4_I(inode)->i_aio_mutex);
+
+ return ret;
}

static const struct vm_operations_struct ext4_file_vm_ops = {
Index: linux-2.6/fs/ext4/super.c
===================================================================
--- linux-2.6.orig/fs/ext4/super.c
+++ linux-2.6/fs/ext4/super.c
@@ -875,6 +875,7 @@ static void init_once(void *foo)
init_rwsem(&ei->xattr_sem);
#endif
init_rwsem(&ei->i_data_sem);
+ mutex_init(&ei->i_aio_mutex);
inode_init_once(&ei->vfs_inode);
}




2011-01-14 04:15:18

by Theodore Ts'o

Subject: Re: [PATCH] ext4: serialize unaligned asynchronous DIO

On Thu, Jan 13, 2011 at 04:23:14PM -0600, Eric Sandeen wrote:
> Mingming suggested that perhaps we can track outstanding
> conversions, and wait on that instead so that non-sparse
> files won't be affected, but I've had trouble making that
> work so far, and would like to get the corruption hole
> plugged ASAP. Perhaps adding a printk_once() warning of
> the perf degradation on this path would be useful?

Yeah, I think a printk_once(), or maybe better yet, an ext4_msg()
warning ratelimited to once a day, is the way to go. I'd print the
inode number and process name that did the offending async DIO, so it
can help out the system administrator.

I've looked over the rest of the patch, and it seems good. Just one
question:

> +static int
> +ext4_unaligned_aio(struct inode *inode, const struct iovec *iov,
> + unsigned long nr_segs, loff_t pos)
> +{
> + struct super_block *sb = inode->i_sb;
> + int blockmask = sb->s_blocksize - 1;
> + size_t count = iov_length(iov, nr_segs);
> + loff_t final_size = pos + count;
> +
> + if (pos >= inode->i_size)
> + return 0;

Why is it ok if the write is extending the file? Are you depending on
some other lock (i_data_sem, perhaps?) to serialize the write in that
case? If so, could you please add a comment to that effect?

- Ted

2011-01-14 04:41:15

by Eric Sandeen

Subject: Re: [PATCH] ext4: serialize unaligned asynchronous DIO

On 1/13/11 10:15 PM, Ted Ts'o wrote:
> On Thu, Jan 13, 2011 at 04:23:14PM -0600, Eric Sandeen wrote:
>> Mingming suggested that perhaps we can track outstanding
>> conversions, and wait on that instead so that non-sparse
>> files won't be affected, but I've had trouble making that
>> work so far, and would like to get the corruption hole
>> plugged ASAP. Perhaps adding a printk_once() warning of
>> the perf degradation on this path would be useful?
>
> Yeah, I think a printk_once(), or maybe better yet, an ext4_msg()
> warning ratelimited to once a day, is the way to go. I'd print the
> inode number and process name that did the offending async DIO, so it
> can help out the system administrator.

I'll add something like that.

> I've looked over the rest of the patch, and it seems good. Just one
> question:
>
>> +static int
>> +ext4_unaligned_aio(struct inode *inode, const struct iovec *iov,
>> + unsigned long nr_segs, loff_t pos)
>> +{
>> + struct super_block *sb = inode->i_sb;
>> + int blockmask = sb->s_blocksize - 1;
>> + size_t count = iov_length(iov, nr_segs);
>> + loff_t final_size = pos + count;
>> +
>> + if (pos >= inode->i_size)
>> + return 0;
>
> Why is it ok if the write is extending the file? Are you depending on
> some other lock (i_data_sem, perhaps?) to serialize the write in that
> case? If so, could you please add a comment to that effect?

We only have this problem if we are going down the unwritten extent
route, which only happens for writes inside i_size:

	if (rw == WRITE && final_size <= inode->i_size) {
		/*
		 * We could direct write to holes and fallocate.
	...

I can add a comment, good point. In fact I'll liberally add a few
comments, sorry, I usually do that but didn't tidy this patch up
prior to sending. :)
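
For illustration, a comment along these lines would capture that reasoning
(editor's sketch only; the exact wording isn't in this thread, and the
surrounding check is the one already shown in ext4_unaligned_aio()):

	/*
	 * Extending writes (pos >= i_size) never take the async
	 * unwritten-extent path - ext4_ext_direct_IO() only uses it
	 * when final_size <= i_size - so they can't hit the
	 * dio_zero_block() race and don't need serializing.
	 */
	if (pos >= inode->i_size)
		return 0;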

-Eric

> - Ted


2011-01-14 17:28:49

by Eric Sandeen

Subject: [PATCH V2] ext4: serialize unaligned asynchronous DIO



ext4 has a data corruption case when doing non-block-aligned
asynchronous direct IO into a sparse file, as demonstrated
by xfstest 240.

The root cause is that while ext4 preallocates space in the
hole, mappings of that space still look "new" and
dio_zero_block() will zero out the unwritten portions. When
more than one AIO thread is going, they both find this "new"
block and race to zero out their portion; this is uncoordinated
and causes data corruption.

Dave Chinner fixed this for xfs by simply serializing all
unaligned asynchronous direct IO. I've done the same here.
This is a very big hammer, and I'm not very pleased with
stuffing this into ext4_file_write(). But since ext4 is
DIO_LOCKING, we need to serialize it at this high level.

I tried to move this into ext4_ext_direct_IO, but by then
we have the i_mutex already, and we will wait on the
work queue to do conversions - which must also take the
i_mutex. So that won't work.

This was originally exposed by qemu-kvm installing to
a raw disk image with a normal sector-63 alignment. I've
tested a backport of this patch with qemu, and it does
avoid the corruption. It is also quite a lot slower
(14 min for package installs, vs. 8 min for well-aligned)
but I'll take slow correctness over fast corruption any day.

Mingming suggested that perhaps we can track outstanding
conversions, and wait on that instead so that non-sparse
files won't be affected, but I've had trouble making that
work so far, and would like to get the corruption hole
plugged ASAP. Perhaps adding a printk_once() warning of
the perf degradation on this path would be useful?

Signed-off-by: Eric Sandeen <[email protected]>
---

V2: Add comments and daily printk

Index: linux-2.6/fs/ext4/ext4.h
===================================================================
--- linux-2.6.orig/fs/ext4/ext4.h
+++ linux-2.6/fs/ext4/ext4.h
@@ -848,6 +848,7 @@ struct ext4_inode_info {
atomic_t i_ioend_count; /* Number of outstanding io_end structs */
/* current io_end structure for async DIO write*/
ext4_io_end_t *cur_aio_dio;
+ struct mutex i_aio_mutex; /* big hammer for unaligned AIO */

spinlock_t i_block_reservation_lock;

Index: linux-2.6/fs/ext4/file.c
===================================================================
--- linux-2.6.orig/fs/ext4/file.c
+++ linux-2.6/fs/ext4/file.c
@@ -55,11 +55,42 @@ static int ext4_release_file(struct inod
return 0;
}

+/*
+ * This tests whether the IO in question is block-aligned or
+ * not. ext4 utilizes unwritten extents when hole-filling
+ * during direct IO, and they are converted to written only
+ * after the IO is complete. Until they are mapped, these
+ * blocks appear as holes, so dio_zero_block() will assume
+ * that it needs to zero out portions of the start and/or
+ * end block. If 2 AIO threads are at work on the same block,
+ * they must be synchronized or one thread will zero the others'
+ * data, causing corruption.
+ */
+static int
+ext4_unaligned_aio(struct inode *inode, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos)
+{
+ struct super_block *sb = inode->i_sb;
+ int blockmask = sb->s_blocksize - 1;
+ size_t count = iov_length(iov, nr_segs);
+ loff_t final_size = pos + count;
+
+ if (pos >= inode->i_size)
+ return 0;
+
+ if ((pos & blockmask) || (final_size & blockmask))
+ return 1;
+
+ return 0;
+}
+
static ssize_t
ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos)
{
struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
+ int unaligned_aio = 0;
+ int ret;

/*
* If we have encountered a bitmap-format file, the size limit
@@ -78,9 +109,30 @@ ext4_file_write(struct kiocb *iocb, cons
nr_segs = iov_shorten((struct iovec *)iov, nr_segs,
sbi->s_bitmap_maxbytes - pos);
}
+ } else if (unlikely((iocb->ki_filp->f_flags & O_DIRECT) &&
+ !is_sync_kiocb(iocb)))
+ unaligned_aio = ext4_unaligned_aio(inode, iov, nr_segs, pos);
+
+ /* Unaligned direct AIO must be serialized; see comment above */
+ if (unaligned_aio) {
+ static unsigned long unaligned_warn_time;
+
+ /* Warn about this once per day */
+ if (printk_timed_ratelimit(&unaligned_warn_time, 60*60*24*HZ))
+ ext4_msg(inode->i_sb, KERN_WARNING,
+ "Unaligned AIO/DIO on inode %ld by %s; "
+ "performance will be poor.",
+ inode->i_ino, current->comm);
+ mutex_lock(&EXT4_I(inode)->i_aio_mutex);
+ ext4_ioend_wait(inode);
}

- return generic_file_aio_write(iocb, iov, nr_segs, pos);
+ ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
+
+ if (unaligned_aio)
+ mutex_unlock(&EXT4_I(inode)->i_aio_mutex);
+
+ return ret;
}

static const struct vm_operations_struct ext4_file_vm_ops = {
Index: linux-2.6/fs/ext4/super.c
===================================================================
--- linux-2.6.orig/fs/ext4/super.c
+++ linux-2.6/fs/ext4/super.c
@@ -875,6 +875,7 @@ static void init_once(void *foo)
init_rwsem(&ei->xattr_sem);
#endif
init_rwsem(&ei->i_data_sem);
+ mutex_init(&ei->i_aio_mutex);
inode_init_once(&ei->vfs_inode);
}



2011-01-18 16:23:32

by Eric Sandeen

Subject: Re: [PATCH V2] ext4: serialize unaligned asynchronous DIO

On 01/14/2011 11:28 AM, Eric Sandeen wrote:

> Mingming suggested that perhaps we can track outstanding
> conversions, and wait on that instead so that non-sparse
> files won't be affected, but I've had trouble making that
> work so far, and would like to get the corruption hole
> plugged ASAP. Perhaps adding a printk_once() warning of
> the perf degradation on this path would be useful?

Ted, if you haven't merged this already, you might hold off.

I've got a version going which only synchronizes IO on the inode
if there is a pending unwritten extent conversion, which speeds
things up a bit. It's still not the most optimal solution
for this suboptimal workload, but it's as simple as the one
proposed here, and faster.

I'll send in a bit.

-Eric

> Signed-off-by: Eric Sandeen <[email protected]>
> ---
>
> V2: Add comments and daily printk
>
> Index: linux-2.6/fs/ext4/ext4.h
> ===================================================================
> --- linux-2.6.orig/fs/ext4/ext4.h
> +++ linux-2.6/fs/ext4/ext4.h
> @@ -848,6 +848,7 @@ struct ext4_inode_info {
> atomic_t i_ioend_count; /* Number of outstanding io_end structs */
> /* current io_end structure for async DIO write*/
> ext4_io_end_t *cur_aio_dio;
> + struct mutex i_aio_mutex; /* big hammer for unaligned AIO */
>
> spinlock_t i_block_reservation_lock;
>
> Index: linux-2.6/fs/ext4/file.c
> ===================================================================
> --- linux-2.6.orig/fs/ext4/file.c
> +++ linux-2.6/fs/ext4/file.c
> @@ -55,11 +55,42 @@ static int ext4_release_file(struct inod
> return 0;
> }
>
> +/*
> + * This tests whether the IO in question is block-aligned or
> + * not. ext4 utilizes unwritten extents when hole-filling
> + * during direct IO, and they are converted to written only
> + * after the IO is complete. Until they are mapped, these
> + * blocks appear as holes, so dio_zero_block() will assume
> + * that it needs to zero out portions of the start and/or
> + * end block. If 2 AIO threads are at work on the same block,
> + * they must be synchronized or one thread will zero the others'
> + * data, causing corruption.
> + */
> +static int
> +ext4_unaligned_aio(struct inode *inode, const struct iovec *iov,
> + unsigned long nr_segs, loff_t pos)
> +{
> + struct super_block *sb = inode->i_sb;
> + int blockmask = sb->s_blocksize - 1;
> + size_t count = iov_length(iov, nr_segs);
> + loff_t final_size = pos + count;
> +
> + if (pos >= inode->i_size)
> + return 0;
> +
> + if ((pos & blockmask) || (final_size & blockmask))
> + return 1;
> +
> + return 0;
> +}
> +
> static ssize_t
> ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
> unsigned long nr_segs, loff_t pos)
> {
> struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
> + int unaligned_aio = 0;
> + int ret;
>
> /*
> * If we have encountered a bitmap-format file, the size limit
> @@ -78,9 +109,30 @@ ext4_file_write(struct kiocb *iocb, cons
> nr_segs = iov_shorten((struct iovec *)iov, nr_segs,
> sbi->s_bitmap_maxbytes - pos);
> }
> + } else if (unlikely((iocb->ki_filp->f_flags & O_DIRECT) &&
> + !is_sync_kiocb(iocb)))
> + unaligned_aio = ext4_unaligned_aio(inode, iov, nr_segs, pos);
> +
> + /* Unaligned direct AIO must be serialized; see comment above */
> + if (unaligned_aio) {
> + static unsigned long unaligned_warn_time;
> +
> + /* Warn about this once per day */
> + if (printk_timed_ratelimit(&unaligned_warn_time, 60*60*24*HZ))
> + ext4_msg(inode->i_sb, KERN_WARNING,
> + "Unaligned AIO/DIO on inode %ld by %s; "
> + "performance will be poor.",
> + inode->i_ino, current->comm);
> + mutex_lock(&EXT4_I(inode)->i_aio_mutex);
> + ext4_ioend_wait(inode);
> }
>
> - return generic_file_aio_write(iocb, iov, nr_segs, pos);
> + ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
> +
> + if (unaligned_aio)
> + mutex_unlock(&EXT4_I(inode)->i_aio_mutex);
> +
> + return ret;
> }
>
> static const struct vm_operations_struct ext4_file_vm_ops = {
> Index: linux-2.6/fs/ext4/super.c
> ===================================================================
> --- linux-2.6.orig/fs/ext4/super.c
> +++ linux-2.6/fs/ext4/super.c
> @@ -875,6 +875,7 @@ static void init_once(void *foo)
> init_rwsem(&ei->xattr_sem);
> #endif
> init_rwsem(&ei->i_data_sem);
> + mutex_init(&ei->i_aio_mutex);
> inode_init_once(&ei->vfs_inode);
> }
>
>


2011-01-21 16:00:04

by Eric Sandeen

Subject: Re: [PATCH V2] ext4: serialize unaligned asynchronous DIO

On 1/14/11 11:28 AM, Eric Sandeen wrote:
>
>
> ext4 has a data corruption case when doing non-block-aligned
> asynchronous direct IO into a sparse file, as demonstrated
> by xfstest 240.
>
> The root cause is that while ext4 preallocates space in the
> hole, mappings of that space still look "new" and
> dio_zero_block() will zero out the unwritten portions. When
> more than one AIO thread is going, they both find this "new"
> block and race to zero out their portion; this is uncoordinated
> and causes data corruption.
>
> Dave Chinner fixed this for xfs by simply serializing all
> unaligned asynchronous direct IO. I've done the same here.
> This is a very big hammer, and I'm not very pleased with
> stuffing this into ext4_file_write(). But since ext4 is
> DIO_LOCKING, we need to serialize it at this high level.
>
> I tried to move this into ext4_ext_direct_IO, but by then
> we have the i_mutex already, and we will wait on the
> work queue to do conversions - which must also take the
> i_mutex. So that won't work.
>
> This was originally exposed by qemu-kvm installing to
> a raw disk image with a normal sector-63 alignment. I've
> tested a backport of this patch with qemu, and it does
> avoid the corruption. It is also quite a lot slower
> (14 min for package installs, vs. 8 min for well-aligned)
> but I'll take slow correctness over fast corruption any day.
>
> Mingming suggested that perhaps we can track outstanding
> conversions, and wait on that instead so that non-sparse
> files won't be affected, but I've had trouble making that
> work so far, and would like to get the corruption hole
> plugged ASAP. Perhaps adding a printk_once() warning of
> the perf degradation on this path would be useful?
>

I've sent a patch to do the above, as V3, twice now, but the list
is eating it. Ted, you were cc'd so hopefully you got it?

-Eric

2011-01-21 18:26:57

by Eric Sandeen

Subject: [PATCH V3 RESEND 2] ext4: serialize unaligned asynchronous DIO

ext4 has a data corruption case when doing non-block-aligned
asynchronous direct IO into a sparse file, as demonstrated
by xfstest 240.

The root cause is that while ext4 preallocates space in the
hole, mappings of that space still look "new" and
dio_zero_block() will zero out the unwritten portions. When
more than one AIO thread is going, they both find this "new"
block and race to zero out their portion; this is uncoordinated
and causes data corruption.

Dave Chinner fixed this for xfs by simply serializing all
unaligned asynchronous direct IO. I've done the same here.
The difference is that we only wait on conversions, not all IO.
This is a very big hammer, and I'm not very pleased with
stuffing this into ext4_file_write(). But since ext4 is
DIO_LOCKING, we need to serialize it at this high level.

I tried to move this into ext4_ext_direct_IO, but by then
we have the i_mutex already, and we will wait on the
work queue to do conversions - which must also take the
i_mutex. So that won't work.

This was originally exposed by qemu-kvm installing to
a raw disk image with a normal sector-63 alignment. I've
tested a backport of this patch with qemu, and it does
avoid the corruption. It is also quite a lot slower
(14 min for package installs, vs. 8 min for well-aligned)
but I'll take slow correctness over fast corruption any day.

Mingming suggested that we can track outstanding
conversions, and wait on those so that non-sparse
files won't be affected, and I've implemented that here;
unaligned AIO to nonsparse files won't take a perf hit.

Signed-off-by: Eric Sandeen <[email protected]>
---

V2: Add comments and daily printk
V3: Only do serialization when there are outstanding conversions

Index: linux-2.6-bisect/fs/ext4/ext4.h
===================================================================
--- linux-2.6-bisect.orig/fs/ext4/ext4.h
+++ linux-2.6-bisect/fs/ext4/ext4.h
@@ -859,6 +859,8 @@ struct ext4_inode_info {
/* current io_end structure for async DIO write*/
ext4_io_end_t *cur_aio_dio;
atomic_t i_ioend_count; /* Number of outstanding io_end structs */
+ atomic_t i_aiodio_unwritten; /* Nr. of inflight conversions pending */
+ struct mutex i_aio_mutex; /* big hammer for unaligned AIO */

/*
* Transactions that contain inode's metadata needed to complete
@@ -2100,6 +2102,11 @@ static inline void set_bitmap_uptodate(s

#define in_range(b, first, len) ((b) >= (first) && (b) <= (first) + (len) - 1)

+/* For ioend & aio unwritten conversion wait queues */
+#define WQ_HASH_SZ 37
+#define to_ioend_wq(v) (&ioend_wq[((unsigned long)v) % WQ_HASH_SZ])
+extern wait_queue_head_t ioend_wq[WQ_HASH_SZ];
+
#endif /* __KERNEL__ */

#endif /* _EXT4_H */
Index: linux-2.6-bisect/fs/ext4/extents.c
===================================================================
--- linux-2.6-bisect.orig/fs/ext4/extents.c
+++ linux-2.6-bisect/fs/ext4/extents.c
@@ -3154,9 +3154,10 @@ ext4_ext_handle_uninitialized_extents(ha
* that this IO needs to convertion to written when IO is
* completed
*/
- if (io)
+ if (io && !(io->flag & EXT4_IO_END_UNWRITTEN)) {
io->flag = EXT4_IO_END_UNWRITTEN;
- else
+ atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
+ } else
ext4_set_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
if (ext4_should_dioread_nolock(inode))
map->m_flags |= EXT4_MAP_UNINIT;
@@ -3446,9 +3447,10 @@ int ext4_ext_map_blocks(handle_t *handle
* that we need to perform convertion when IO is done.
*/
if ((flags & EXT4_GET_BLOCKS_PRE_IO)) {
- if (io)
+ if (io && !(io->flag & EXT4_IO_END_UNWRITTEN)) {
io->flag = EXT4_IO_END_UNWRITTEN;
- else
+ atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
+ } else
ext4_set_inode_state(inode,
EXT4_STATE_DIO_UNWRITTEN);
}
Index: linux-2.6-bisect/fs/ext4/file.c
===================================================================
--- linux-2.6-bisect.orig/fs/ext4/file.c
+++ linux-2.6-bisect/fs/ext4/file.c
@@ -55,11 +55,47 @@ static int ext4_release_file(struct inod
return 0;
}

+static void ext4_aiodio_wait(struct inode *inode)
+{
+ wait_queue_head_t *wq = to_ioend_wq(inode);
+
+ wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_aiodio_unwritten) == 0));
+}
+
+/*
+ * This tests whether the IO in question is block-aligned or not.
+ * Ext4 utilizes unwritten extents when hole-filling during direct IO, and they
+ * are converted to written only after the IO is complete. Until they are
+ * mapped, these blocks appear as holes, so dio_zero_block() will assume that
+ * it needs to zero out portions of the start and/or end block. If 2 AIO
+ * threads are at work on the same unwritten block, they must be synchronized
+ * or one thread will zero the other's data, causing corruption.
+ */
+static int
+ext4_unaligned_aio(struct inode *inode, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos)
+{
+ struct super_block *sb = inode->i_sb;
+ int blockmask = sb->s_blocksize - 1;
+ size_t count = iov_length(iov, nr_segs);
+ loff_t final_size = pos + count;
+
+ if (pos >= inode->i_size)
+ return 0;
+
+ if ((pos & blockmask) || (final_size & blockmask))
+ return 1;
+
+ return 0;
+}
+
static ssize_t
ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos)
{
struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
+ int unaligned_aio = 0;
+ int ret;

/*
* If we have encountered a bitmap-format file, the size limit
@@ -78,9 +114,31 @@ ext4_file_write(struct kiocb *iocb, cons
nr_segs = iov_shorten((struct iovec *)iov, nr_segs,
sbi->s_bitmap_maxbytes - pos);
}
+ } else if (unlikely((iocb->ki_filp->f_flags & O_DIRECT) &&
+ !is_sync_kiocb(iocb))) {
+ unaligned_aio = ext4_unaligned_aio(inode, iov, nr_segs, pos);
}

- return generic_file_aio_write(iocb, iov, nr_segs, pos);
+ /* Unaligned direct AIO must be serialized; see comment above */
+ if (unaligned_aio) {
+ static unsigned long unaligned_warn_time;
+
+ /* Warn about this once per day */
+ if (printk_timed_ratelimit(&unaligned_warn_time, 60*60*24*HZ))
+ ext4_msg(inode->i_sb, KERN_WARNING,
+ "Unaligned AIO/DIO on inode %ld by %s; "
+ "performance will be poor.",
+ inode->i_ino, current->comm);
+ mutex_lock(&EXT4_I(inode)->i_aio_mutex);
+ ext4_aiodio_wait(inode);
+ }
+
+ ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
+
+ if (unaligned_aio)
+ mutex_unlock(&EXT4_I(inode)->i_aio_mutex);
+
+ return ret;
}

static const struct vm_operations_struct ext4_file_vm_ops = {
Index: linux-2.6-bisect/fs/ext4/page-io.c
===================================================================
--- linux-2.6-bisect.orig/fs/ext4/page-io.c
+++ linux-2.6-bisect/fs/ext4/page-io.c
@@ -32,9 +32,7 @@

static struct kmem_cache *io_page_cachep, *io_end_cachep;

-#define WQ_HASH_SZ 37
-#define to_ioend_wq(v) (&ioend_wq[((unsigned long)v) % WQ_HASH_SZ])
-static wait_queue_head_t ioend_wq[WQ_HASH_SZ];
+wait_queue_head_t ioend_wq[WQ_HASH_SZ];

int __init ext4_init_pageio(void)
{
@@ -102,6 +100,7 @@ int ext4_end_io_nolock(ext4_io_end_t *io
struct inode *inode = io->inode;
loff_t offset = io->offset;
ssize_t size = io->size;
+ wait_queue_head_t *wq;
int ret = 0;

ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu,list->next 0x%p,"
@@ -126,7 +125,16 @@ int ext4_end_io_nolock(ext4_io_end_t *io
if (io->iocb)
aio_complete(io->iocb, io->result, 0);
/* clear the DIO AIO unwritten flag */
- io->flag &= ~EXT4_IO_END_UNWRITTEN;
+ if (io->flag & EXT4_IO_END_UNWRITTEN) {
+ io->flag &= ~EXT4_IO_END_UNWRITTEN;
+ /* Wake up anyone waiting on unwritten extent conversion */
+ wq = to_ioend_wq(io->inode);
+ if (atomic_dec_and_test(&EXT4_I(inode)->i_aiodio_unwritten) &&
+ waitqueue_active(wq)) {
+ wake_up_all(wq);
+ }
+ }
+
return ret;
}

Index: linux-2.6-bisect/fs/ext4/super.c
===================================================================
--- linux-2.6-bisect.orig/fs/ext4/super.c
+++ linux-2.6-bisect/fs/ext4/super.c
@@ -829,6 +829,7 @@ static struct inode *ext4_alloc_inode(st
ei->i_sync_tid = 0;
ei->i_datasync_tid = 0;
atomic_set(&ei->i_ioend_count, 0);
+ atomic_set(&ei->i_aiodio_unwritten, 0);

return &ei->vfs_inode;
}
@@ -865,6 +866,7 @@ static void init_once(void *foo)
init_rwsem(&ei->xattr_sem);
#endif
init_rwsem(&ei->i_data_sem);
+ mutex_init(&ei->i_aio_mutex);
inode_init_once(&ei->vfs_inode);
}




2011-01-21 23:27:35

by Theodore Ts'o

Subject: Re: [PATCH V3 RESEND 2] ext4: serialize unaligned asynchronous DIO

ACK, I'll be catching up on bug fix patches to send to Linus over the
weekend.

- Ted

2011-02-07 02:33:49

by Theodore Ts'o

Subject: Re: [PATCH V3 RESEND 2] ext4: serialize unaligned asynchronous DIO

On Fri, Jan 21, 2011 at 12:26:54PM -0600, Eric Sandeen wrote:
> ext4 has a data corruption case when doing non-block-aligned
> asynchronous direct IO into a sparse file, as demonstrated
> by xfstest 240.

Hey Eric,

One question about this patch. You are currently using a hashed array
of size 37 for the waitqueue; what about using a similarly sized
hashed array for the aio_mutex? We only take it for unaligned
AIO writes, and I'm trying to work on reducing the size of the ext4 inode,
since it gets rather large in the inode cache....

- Ted

2011-02-07 15:59:40

by Theodore Ts'o

Subject: Re: [PATCH V3 RESEND 2] ext4: serialize unaligned asynchronous DIO

On Sun, Feb 06, 2011 at 09:33:36PM -0500, Ted Ts'o wrote:
>
> Hey Eric,
>
> One question about this patch. You are currently using a hashed array
> of size 37 for the waitqueue; what about using a similarly sized
> hashed array for the aio_mutex? We only take it for unaligned
> mutexes, and I'm trying to work on reducing the size of ext4 inode,
> since it gets rather large in the inode cache....
>
> - Ted

Like this.... I also changed to_ioend_wq() and ioend_wq[] to be
ext4_ioend_wq() and ext4__ioend_wq[], since they are now global
variables.


commit 7520bb0f2980ef79d17dcbec2783760b37490ffc
Author: Eric Sandeen <[email protected]>
Date: Mon Feb 7 10:57:28 2011 -0500

ext4: serialize unaligned asynchronous DIO

ext4 has a data corruption case when doing non-block-aligned
asynchronous direct IO into a sparse file, as demonstrated
by xfstest 240.

The root cause is that while ext4 preallocates space in the
hole, mappings of that space still look "new" and
dio_zero_block() will zero out the unwritten portions. When
more than one AIO thread is going, they both find this "new"
block and race to zero out their portion; this is uncoordinated
and causes data corruption.

Dave Chinner fixed this for xfs by simply serializing all
unaligned asynchronous direct IO. I've done the same here.
The difference is that we only wait on conversions, not all IO.
This is a very big hammer, and I'm not very pleased with
stuffing this into ext4_file_write(). But since ext4 is
DIO_LOCKING, we need to serialize it at this high level.

I tried to move this into ext4_ext_direct_IO, but by then
we have the i_mutex already, and we will wait on the
work queue to do conversions - which must also take the
i_mutex. So that won't work.

This was originally exposed by qemu-kvm installing to
a raw disk image with a normal sector-63 alignment. I've
tested a backport of this patch with qemu, and it does
avoid the corruption. It is also quite a lot slower
(14 min for package installs, vs. 8 min for well-aligned)
but I'll take slow correctness over fast corruption any day.

Mingming suggested that we can track outstanding
conversions, and wait on those so that non-sparse
files won't be affected, and I've implemented that here;
unaligned AIO to nonsparse files won't take a perf hit.

Signed-off-by: Eric Sandeen <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0c8d97b..f8857d4 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -848,6 +848,7 @@ struct ext4_inode_info {
atomic_t i_ioend_count; /* Number of outstanding io_end structs */
/* current io_end structure for async DIO write*/
ext4_io_end_t *cur_aio_dio;
+ atomic_t i_aiodio_unwritten; /* Nr. of inflight conversions pending */

spinlock_t i_block_reservation_lock;

@@ -2119,6 +2120,13 @@ static inline void set_bitmap_uptodate(struct buffer_head *bh)

#define in_range(b, first, len) ((b) >= (first) && (b) <= (first) + (len) - 1)

+/* For ioend & aio unwritten conversion wait queues */
+#define EXT4_WQ_HASH_SZ 37
+#define ext4_ioend_wq(v) (&ext4__ioend_wq[((unsigned long)v) % EXT4_WQ_HASH_SZ])
+#define ext4_aio_mutex(v) (&ext4__aio_mutex[((unsigned long)v) % EXT4_WQ_HASH_SZ])
+extern wait_queue_head_t ext4__ioend_wq[EXT4_WQ_HASH_SZ];
+extern struct mutex ext4__aio_mutex[EXT4_WQ_HASH_SZ];
+
#endif /* __KERNEL__ */

#endif /* _EXT4_H */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 63a7581..ccce8a7 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3174,9 +3174,10 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
* that this IO needs to convertion to written when IO is
* completed
*/
- if (io)
+ if (io && !(io->flag & EXT4_IO_END_UNWRITTEN)) {
io->flag = EXT4_IO_END_UNWRITTEN;
- else
+ atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
+ } else
ext4_set_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
if (ext4_should_dioread_nolock(inode))
map->m_flags |= EXT4_MAP_UNINIT;
@@ -3463,9 +3464,10 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
* that we need to perform convertion when IO is done.
*/
if ((flags & EXT4_GET_BLOCKS_PRE_IO)) {
- if (io)
+ if (io && !(io->flag & EXT4_IO_END_UNWRITTEN)) {
io->flag = EXT4_IO_END_UNWRITTEN;
- else
+ atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
+ } else
ext4_set_inode_state(inode,
EXT4_STATE_DIO_UNWRITTEN);
}
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2e8322c..1278b6b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -55,11 +55,47 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
return 0;
}

+static void ext4_aiodio_wait(struct inode *inode)
+{
+ wait_queue_head_t *wq = ext4_ioend_wq(inode);
+
+ wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_aiodio_unwritten) == 0));
+}
+
+/*
+ * This tests whether the IO in question is block-aligned or not.
+ * Ext4 utilizes unwritten extents when hole-filling during direct IO, and they
+ * are converted to written only after the IO is complete. Until they are
+ * mapped, these blocks appear as holes, so dio_zero_block() will assume that
+ * it needs to zero out portions of the start and/or end block. If 2 AIO
+ * threads are at work on the same unwritten block, they must be synchronized
+ * or one thread will zero the other's data, causing corruption.
+ */
+static int
+ext4_unaligned_aio(struct inode *inode, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos)
+{
+ struct super_block *sb = inode->i_sb;
+ int blockmask = sb->s_blocksize - 1;
+ size_t count = iov_length(iov, nr_segs);
+ loff_t final_size = pos + count;
+
+ if (pos >= inode->i_size)
+ return 0;
+
+ if ((pos & blockmask) || (final_size & blockmask))
+ return 1;
+
+ return 0;
+}
+
static ssize_t
ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos)
{
struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
+ int unaligned_aio = 0;
+ int ret;

/*
* If we have encountered a bitmap-format file, the size limit
@@ -78,9 +114,31 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
nr_segs = iov_shorten((struct iovec *)iov, nr_segs,
sbi->s_bitmap_maxbytes - pos);
}
+ } else if (unlikely((iocb->ki_filp->f_flags & O_DIRECT) &&
+ !is_sync_kiocb(iocb))) {
+ unaligned_aio = ext4_unaligned_aio(inode, iov, nr_segs, pos);
}

- return generic_file_aio_write(iocb, iov, nr_segs, pos);
+ /* Unaligned direct AIO must be serialized; see comment above */
+ if (unaligned_aio) {
+ static unsigned long unaligned_warn_time;
+
+ /* Warn about this once per day */
+ if (printk_timed_ratelimit(&unaligned_warn_time, 60*60*24*HZ))
+ ext4_msg(inode->i_sb, KERN_WARNING,
+ "Unaligned AIO/DIO on inode %ld by %s; "
+ "performance will be poor.",
+ inode->i_ino, current->comm);
+ mutex_lock(ext4_aio_mutex(inode));
+ ext4_aiodio_wait(inode);
+ }
+
+ ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
+
+ if (unaligned_aio)
+ mutex_unlock(ext4_aio_mutex(inode));
+
+ return ret;
}

static const struct vm_operations_struct ext4_file_vm_ops = {
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 4e9b0a2..955cc30 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -32,14 +32,8 @@

static struct kmem_cache *io_page_cachep, *io_end_cachep;

-#define WQ_HASH_SZ 37
-#define to_ioend_wq(v) (&ioend_wq[((unsigned long)v) % WQ_HASH_SZ])
-static wait_queue_head_t ioend_wq[WQ_HASH_SZ];
-
int __init ext4_init_pageio(void)
{
- int i;
-
io_page_cachep = KMEM_CACHE(ext4_io_page, SLAB_RECLAIM_ACCOUNT);
if (io_page_cachep == NULL)
return -ENOMEM;
@@ -48,9 +42,6 @@ int __init ext4_init_pageio(void)
kmem_cache_destroy(io_page_cachep);
return -ENOMEM;
}
- for (i = 0; i < WQ_HASH_SZ; i++)
- init_waitqueue_head(&ioend_wq[i]);
-
return 0;
}

@@ -62,7 +53,7 @@ void ext4_exit_pageio(void)

void ext4_ioend_wait(struct inode *inode)
{
- wait_queue_head_t *wq = to_ioend_wq(inode);
+ wait_queue_head_t *wq = ext4_ioend_wq(inode);

wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_ioend_count) == 0));
}
@@ -87,7 +78,7 @@ void ext4_free_io_end(ext4_io_end_t *io)
for (i = 0; i < io->num_io_pages; i++)
put_io_page(io->pages[i]);
io->num_io_pages = 0;
- wq = to_ioend_wq(io->inode);
+ wq = ext4_ioend_wq(io->inode);
if (atomic_dec_and_test(&EXT4_I(io->inode)->i_ioend_count) &&
waitqueue_active(wq))
wake_up_all(wq);
@@ -102,6 +93,7 @@ int ext4_end_io_nolock(ext4_io_end_t *io)
struct inode *inode = io->inode;
loff_t offset = io->offset;
ssize_t size = io->size;
+ wait_queue_head_t *wq;
int ret = 0;

ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu,list->next 0x%p,"
@@ -126,7 +118,16 @@ int ext4_end_io_nolock(ext4_io_end_t *io)
if (io->iocb)
aio_complete(io->iocb, io->result, 0);
/* clear the DIO AIO unwritten flag */
- io->flag &= ~EXT4_IO_END_UNWRITTEN;
+ if (io->flag & EXT4_IO_END_UNWRITTEN) {
+ io->flag &= ~EXT4_IO_END_UNWRITTEN;
+ /* Wake up anyone waiting on unwritten extent conversion */
+ wq = ext4_ioend_wq(io->inode);
+ if (atomic_dec_and_test(&EXT4_I(inode)->i_aiodio_unwritten) &&
+ waitqueue_active(wq)) {
+ wake_up_all(wq);
+ }
+ }
+
return ret;
}

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 86b0548..07e790b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -833,6 +833,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
ei->i_sync_tid = 0;
ei->i_datasync_tid = 0;
atomic_set(&ei->i_ioend_count, 0);
+ atomic_set(&ei->i_aiodio_unwritten, 0);

return &ei->vfs_inode;
}
@@ -4800,11 +4801,21 @@ static void ext4_exit_feat_adverts(void)
kfree(ext4_feat);
}

+/* Shared across all ext4 file systems */
+wait_queue_head_t ext4__ioend_wq[EXT4_WQ_HASH_SZ];
+struct mutex ext4__aio_mutex[EXT4_WQ_HASH_SZ];
+
static int __init ext4_init_fs(void)
{
- int err;
+ int i, err;

ext4_check_flag_values();
+
+ for (i=0; i < EXT4_WQ_HASH_SZ; i++) {
+ mutex_init(&ext4__aio_mutex[i]);
+ init_waitqueue_head(&ext4__ioend_wq[i]);
+ }
+
err = ext4_init_pageio();
if (err)
return err;

2011-02-07 17:58:23

by Eric Sandeen

Subject: Re: [PATCH V3 RESEND 2] ext4: serialize unaligned asynchronous DIO

On 02/07/2011 09:59 AM, Ted Ts'o wrote:
> On Sun, Feb 06, 2011 at 09:33:36PM -0500, Ted Ts'o wrote:
>>
>> Hey Eric,
>>
>> One question about this patch. You are currently using a hashed array
>> of size 37 for the waitqueue; what about using a similarly sized
>> hashed array for the aio_mutex? We only take it for unaligned
>> AIO writes, and I'm trying to work on reducing the size of the ext4 inode,
>> since it gets rather large in the inode cache....
>>
>> - Ted
>
> Like this.... I also changed to_ioend_wq() and ioend_wq[] to be
> ext4_ioend_wq() and ext4__ioend_wq[], since they are now global
> variables.
>

Looks fine to me, thanks, feel free to put an akpm-style comment
in the commit about what you fixed up.

-Eric

2011-02-07 22:18:33

by Mingming Cao

Subject: Re: [PATCH V3 RESEND 2] ext4: serialize unaligned asynchronous DIO

On Mon, 2011-02-07 at 11:58 -0600, Eric Sandeen wrote:
> On 02/07/2011 09:59 AM, Ted Ts'o wrote:
> > On Sun, Feb 06, 2011 at 09:33:36PM -0500, Ted Ts'o wrote:
> >>
> >> Hey Eric,
> >>
> >> One question about this patch. You are currently using a hashed array
> >> of size 37 for the waitqueue; what about using a similarly sized
> >> hashed array for the aio_mutex? We only take it for unaligned
> >> AIO writes, and I'm trying to work on reducing the size of the ext4 inode,
> >> since it gets rather large in the inode cache....
> >>
> >> - Ted
> >
> > Like this.... I also changed to_ioend_wq() and ioend_wq[] to be
> > ext4_ioend_wq() and ext4__ioend_wq[], since they are now global
> > variables.
> >
>
> Looks fine to me, thanks, feel free to put an akpm-style comment
> in the commit about what you fixed up.
>
> -Eric

Ted, Eric,

If it matters, you could add Reviewed-by: Mingming Cao <[email protected]>
from me. The modified version looks fine to me.

Mingming



2012-02-23 13:23:15

by Philipp Hahn

Subject: Re: backport "ext4: serialize unaligned asynchronous DIO" to 2.6.32

Hello Ted, hello Eric,

On Monday February 7th 2011 16:59:36 Ted Ts'o wrote:
> commit 7520bb0f2980ef79d17dcbec2783760b37490ffc
> Author: Eric Sandeen <[email protected]>
> Date: Mon Feb 7 10:57:28 2011 -0500
>
> ext4: serialize unaligned asynchronous DIO
>
> ext4 has a data corruption case when doing non-block-aligned
> asynchronous direct IO into a sparse file, as demonstrated
> by xfstest 240.

I hope you remember that bug, because I encountered this data corruption
on Debian's 2.6.32(.51) kernel as well.

On the other hand RedHat seems to have back-ported that fix to RHEL5 (2.6.18)
and probably RHEL6 (2.6.32) as well, but I don't have a subscription, so I
can't verify that:
<http://rpmfind.net/linux/RPM/centos/updates/5.7/x86_64/RPMS/kernel-devel-2.6.18-274.12.1.el5.x86_64.html>
<https://bugzilla.redhat.com/show_bug.cgi?id=689830>

The Xen-people also encountered it and asked for someone to backport it:
<http://osdir.com/ml/xen-development/2011-07/msg00474.html>

I tried to backport it from 2.6.38~rc5 to 2.6.32.51 and thus far it seems to
fix the bug. But several other things were re-named and re-organized between
those versions, so it was not straightforward.

Since I'm no ext4 expert, I'd like to ask you to have a look at this backport.
Is it sound or are there some tests I can throw at it to get it tested more
thoroughly?
Does it qualify for <mailto:[email protected]>?

Thanks in advance
Philipp Hahn
--
Philipp Hahn Open Source Software Engineer [email protected]
Univention GmbH Linux for Your Business fon: +49 421 22 232- 0
Mary-Somerville-Str.1 D-28359 Bremen fax: +49 421 22 232-99
http://www.univention.de/



2012-02-23 15:15:28

by Eric Sandeen

Subject: Re: backport "ext4: serialize unaligned asynchronous DIO" to 2.6.32


On 2/23/12 7:23 AM, Philipp Hahn wrote:
> Hello Ted, hello Eric,
>
> On Monday February 7th 2011 16:59:36 Ted Ts'o wrote:
>> commit 7520bb0f2980ef79d17dcbec2783760b37490ffc

upstream is actually e9e3bcecf44c04b9e6b505fd8e2eb9cea58fb94d but you probably knew that.

>> Author: Eric Sandeen <[email protected]>
>> Date: Mon Feb 7 10:57:28 2011 -0500
>>
>> ext4: serialize unaligned asynchronous DIO
>>
>> ext4 has a data corruption case when doing non-block-aligned
>> asynchronous direct IO into a sparse file, as demonstrated
>> by xfstest 240.
>
> I hope you remember that bug, because I encountered this data corruption bug
> on Debians 2.6.32(.51) kernel as well.

I remember it well ;)

I also backported it to RHEL6, but that "2.6.32" kernel also had a few ext4 updates of its own.

Still, grabbing a centos6 kernel src.rpm and looking might help you.

FWIW I also backported f46c483357c2d87606bbefb511321e3efd4baae0 and
f2d28a2ebcb525a6ec7e2152106ddb385ef52b73 as helpers.

> On the other hand RedHat seems to have back-ported that fix to RHEL5 (2.6.18)
> and probably RHEL6 (2.6.32) as well, but I don't have a subscription, so I
> can't verify that:
> <http://rpmfind.net/linux/RPM/centos/updates/5.7/x86_64/RPMS/kernel-devel-2.6.18-274.12.1.el5.x86_64.html>
> <https://bugzilla.redhat.com/show_bug.cgi?id=689830>

615309 is the RHEL6 bug. Sadly it's marked private.

> The Xen-people also encountered it and asked for someone to backport it:
> <http://osdir.com/ml/xen-development/2011-07/msg00474.html>
>
> I tried to backport it from 2.6.38~rc5 to 2.6.32.51 and thus far it seems to
> fix the bug. But several other things were re-named and re-organized between
> those versions, so it was not slreight forward.
>
> Since I'm no ext4 expert, I'd like to ask you to have a look at this backport.
> Is it sound or are there some tests I can throw at it to get it tested more
> thoroughly?

xfstests test #240 tests it specifically:

# FS QA Test No. 240
#
# Test that non-block-aligned aio+dio into holes does not leave
# zero'd out portions of the file
#
# QEMU IO to a file-backed device with misaligned partitions
# can send this sort of IO
#
# This test need only be run in the case where the logical block size
# of the device can be smaller than the file system block size.

and running all the xfstests over your result would probably be good.

FWIW, there was a related xfs fix as well:

https://bugzilla.redhat.com/show_bug.cgi?id=669272

that one's not marked private <eyeroll>

> Does is classify for <mailto:[email protected]>?

probably so.

To be honest I am really pressed for time right now, and this was somewhat tricky code. I won't be able to review it immediately, but I can try to get to it at some point if my schedule lightens up... :(

-Eric

> Thanks in advance
> Philipp Hahn
