2009-07-29 00:28:10

by Curt Wohlgemuth

[permalink] [raw]
Subject: [PATCH RFC] Insure direct IO writes do not use the page cache

This insures that direct IO writes to fallocate'd file space do not use the
page cache.


Signed-off-by: Curt Wohlgemuth <[email protected]>

---

I implemented Aneesh's ideas for this solution, but have some questions.

I've verified that the page cache isn't used, regardless of whether
FALLOC_FL_KEEP_SIZE is used in the fallocate() call or not.

The changes:
- New inode state flag to indicate that a DIO write is ongoing

- ext4_ext_convert_to_initialized() will mark any new extent it creates
as uninitialized, if this new EXT4_STATE_DIO_WRITE flag is set.

- ext4_direct_IO() will set this flag for any write.

It now calls blockdev_direct_IO_own_locking() to do the I/O.

After return from blockdev_direct_IO_own_locking() it clears the flag
and calls a new routine to mark all extents containing the returned
blocks as initialized.

I'm a bit uncertain about the use of DIO_OWN_LOCKING; I looked at the XFS
code to see if it acquired any other locks while using this flag, but didn't
see any. Suggestions/corrections welcome.


diff -Naur orig/fs/ext4/ext4.h new/fs/ext4/ext4.h
--- orig/fs/ext4/ext4.h 2009-07-28 17:02:25.000000000 -0700
+++ new/fs/ext4/ext4.h 2009-07-28 17:02:19.000000000 -0700
@@ -272,6 +272,7 @@
#define EXT4_STATE_XATTR 0x00000004 /* has in-inode xattrs */
#define EXT4_STATE_NO_EXPAND 0x00000008 /* No space for expansion */
#define EXT4_STATE_DA_ALLOC_CLOSE 0x00000010 /* Alloc DA blks on close */
+#define EXT4_STATE_DIO_WRITE 0x00000020 /* This is a DIO write */

/* Used to pass group descriptor data when online resize is done */
struct ext4_new_group_input {
diff -Naur orig/fs/ext4/ext4_extents.h new/fs/ext4/ext4_extents.h
--- orig/fs/ext4/ext4_extents.h 2009-07-28 17:02:40.000000000 -0700
+++ new/fs/ext4/ext4_extents.h 2009-07-28 17:02:34.000000000 -0700
@@ -242,5 +242,7 @@
ext4_lblk_t *, ext4_fsblk_t *);
extern void ext4_ext_drop_refs(struct ext4_ext_path *);
extern int ext4_ext_check_inode(struct inode *inode);
+extern int ext4_ext_mark_extents_init(struct inode *inode, loff_t offset,
+ int nr_bytes);
#endif /* _EXT4_EXTENTS */

diff -Naur orig/fs/ext4/extents.c new/fs/ext4/extents.c
--- orig/fs/ext4/extents.c 2009-07-28 17:02:54.000000000 -0700
+++ new/fs/ext4/extents.c 2009-07-28 17:02:49.000000000 -0700
@@ -2563,6 +2563,13 @@
ex3->ee_block = cpu_to_le32(iblock);
ext4_ext_store_pblock(ex3, newblock);
ex3->ee_len = cpu_to_le16(allocated);
+
+ /* For direct I/O, we mark this as uninitialized, and
+ * set it as initialized when we finish the direct
+ * IO. */
+ if (EXT4_I(inode)->i_state & EXT4_STATE_DIO_WRITE)
+ ext4_ext_mark_uninitialized(ex3);
+
err = ext4_ext_insert_extent(handle, inode, path, ex3);
if (err == -ENOSPC) {
err = ext4_ext_zeroout(inode, &orig_ex);
@@ -2698,6 +2705,13 @@
ex2->ee_block = cpu_to_le32(iblock);
ext4_ext_store_pblock(ex2, newblock);
ex2->ee_len = cpu_to_le16(allocated);
+
+ /* For direct I/O, we mark this as uninitialized, and
+ * set it as initialized when we finish the direct
+ * IO. */
+ if (EXT4_I(inode)->i_state & EXT4_STATE_DIO_WRITE)
+ ext4_ext_mark_uninitialized(ex2);
+
if (ex2 != ex)
goto insert;
/*
@@ -2763,6 +2777,109 @@
return err;
}

+
+static int handle_uninit_extent(struct ext4_ext_path *path,
+ struct inode *inode,
+ struct ext4_extent *ex)
+{
+ handle_t *handle;
+ int credits;
+ int started;
+ int depth = ext_depth(inode);
+ int ret = 0;
+
+ if (ext4_ext_is_uninitialized(ex)) {
+ handle = ext4_journal_current_handle();
+ started = 0;
+
+ if (!handle) {
+ credits =
+ ext4_chunk_trans_blocks(inode, 1);
+ handle = ext4_journal_start(inode,
+ credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+ started = 1;
+ }
+ /* Mark as initialized */
+ ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex));
+
+ ret = ext4_ext_dirty(handle, inode, path + depth);
+
+ if (started)
+ ext4_journal_stop(handle);
+ }
+out:
+ return ret;
+}
+
+/* Called after direct I/O writes, to make sure that all extents are marked
+ * as initialized. */
+int ext4_ext_mark_extents_init(struct inode *inode,
+ loff_t offset,
+ int nr_bytes)
+{
+ struct ext4_ext_path *path = NULL;
+ struct ext4_extent *ex;
+ int depth;
+ int nr_blocks = nr_bytes >> inode->i_blkbits;
+ ext4_lblk_t first_block = offset >> inode->i_blkbits;
+ ext4_lblk_t block;
+ int ret;
+
+ /* Handle the extent for the first block */
+ path = ext4_ext_find_extent(inode, first_block, NULL);
+ if (IS_ERR(path)) {
+ ret = PTR_ERR(path);
+ path = NULL;
+ goto out;
+ }
+ depth = ext_depth(inode);
+ ex = path[depth].p_ext;
+
+ ret = handle_uninit_extent(path, inode, ex);
+ if (ret != 0) {
+ goto out;
+ }
+
+ /* Check the rest of the blocks; if they're in different
+ * extents, handle those as well. */
+ for (block = first_block + 1;
+ block < first_block + nr_blocks;
+ block++) {
+
+ /* Only look for new extents now */
+ if (block >= ex->ee_block &&
+ block < (ex->ee_block + ex->ee_len))
+ continue;
+
+ ext4_ext_drop_refs(path);
+ path = ext4_ext_find_extent(inode, first_block, path);
+ if (IS_ERR(path)) {
+ ret = PTR_ERR(path);
+ path = NULL;
+ goto out;
+ }
+ depth = ext_depth(inode);
+ ex = path[depth].p_ext;
+
+ ret = handle_uninit_extent(path, inode, ex);
+ if (ret != 0) {
+ goto out;
+ }
+ }
+out:
+ if (path) {
+ ext4_ext_drop_refs(path);
+ kfree(path);
+ }
+
+ return ret;
+}
+
+
/*
* Block allocation/map/preallocation routine for extents based files
*
diff -Naur orig/fs/ext4/inode.c new/fs/ext4/inode.c
--- orig/fs/ext4/inode.c 2009-07-28 17:02:04.000000000 -0700
+++ new/fs/ext4/inode.c 2009-07-28 17:23:59.000000000 -0700
@@ -3318,11 +3318,33 @@
ei->i_disksize = inode->i_size;
ext4_journal_stop(handle);
}
- }
+ /* This prevents initializing extents too early */
+ if (ei->i_flags & EXT4_EXTENTS_FL)
+ ei->i_state |= EXT4_STATE_DIO_WRITE;
+ }
+
+ ret = blockdev_direct_IO_own_locking(rw, iocb, inode,
+ inode->i_sb->s_bdev, iov,
+ offset, nr_segs,
+ ext4_get_block, NULL);
+
+ /* We now need to look at all extents for the blocks that were
+ * written out, and mark them as initialized. Note that they've
+ * already been converted, simply not yet marked as init. */
+ if (rw == WRITE && ei->i_flags & EXT4_EXTENTS_FL) {
+ BUG_ON((ei->i_state & EXT4_STATE_DIO_WRITE) == 0);
+ ei->i_state |= ~EXT4_STATE_DIO_WRITE;
+
+ if (ret > 0) {
+ int ret2;

- ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
- offset, nr_segs,
- ext4_get_block, NULL);
+ ret2 = ext4_ext_mark_extents_init(inode, offset, ret);
+ if (ret2 != 0) {
+ ret = ret2;
+ goto out;
+ }
+ }
+ }

if (orphan) {
int err;


2009-07-29 16:10:48

by Curt Wohlgemuth

[permalink] [raw]
Subject: Re: [PATCH RFC] Insure direct IO writes do not use the page cache

Although replying to self is somewhat bad etiquette...

I've found at least one issue with this patch: Although the semantics
seem correct, since the late-converted-to-init extents are not merged
with neighbors, you can easily end up with thousands of extents :-( .
Each write to fallocate'd space results in its own initialized extent.

I'm not sure how expensive it would be to merge the extents when they
are converted to initialized after the DIO write goes through.

Curt


On Tue, Jul 28, 2009 at 5:28 PM, Curt Wohlgemuth<[email protected]> wrote:
> This insures that direct IO writes to fallocate'd file space do not use the
> page cache.
>
>
> ? ? ?Signed-off-by: Curt Wohlgemuth <[email protected]>
>
> ---
>
> I implemented Aneesh's ideas for this solution, but have some questions.
>
> I've verified that the page cache isn't ?used, regardless of whether
> FALLOC_FL_KEEP_SIZE is used in the fallocate() call or not.
>
> The changes:
> ? - New inode state flag to indicate that a DIO write is ongoing
>
> ? - ext4_ext_convert_to_initialized() will mark any new extent it creates
> ? ? as uninitialized, if this new EXT4_STATE_DIO_WRITE flag is set.
>
> ? - ext4_direct_IO() will set this flag for any write.
>
> ? ? It now calls blockdev_direct_IO_own_locking() to do the I/O.
>
> ? ? After return from blockdev_direct_IO_own_locking() it clears the flag
> ? ? and calls a new routine to mark all extents containing the returned
> ? ? blocks as initialized.
>
> I'm a bit uncertain about the use of DIO_OWN_LOCKING; I looked at the XFS
> code to see if it acquired any other locks while using this flag, but didn't
> see any. ?Suggestions/corrections welcome.
>
>
> diff -Naur orig/fs/ext4/ext4.h new/fs/ext4/ext4.h
> --- orig/fs/ext4/ext4.h 2009-07-28 17:02:25.000000000 -0700
> +++ new/fs/ext4/ext4.h ?2009-07-28 17:02:19.000000000 -0700
> @@ -272,6 +272,7 @@
> ?#define EXT4_STATE_XATTR ? ? ? ? ? ? ? 0x00000004 /* has in-inode xattrs */
> ?#define EXT4_STATE_NO_EXPAND ? ? ? ? ? 0x00000008 /* No space for expansion */
> ?#define EXT4_STATE_DA_ALLOC_CLOSE ? ? ?0x00000010 /* Alloc DA blks on close */
> +#define EXT4_STATE_DIO_WRITE ? ? ? ? ? 0x00000020 /* This is a DIO write */
>
> ?/* Used to pass group descriptor data when online resize is done */
> ?struct ext4_new_group_input {
> diff -Naur orig/fs/ext4/ext4_extents.h new/fs/ext4/ext4_extents.h
> --- orig/fs/ext4/ext4_extents.h 2009-07-28 17:02:40.000000000 -0700
> +++ new/fs/ext4/ext4_extents.h ?2009-07-28 17:02:34.000000000 -0700
> @@ -242,5 +242,7 @@
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ext4_lblk_t *, ext4_fsblk_t *);
> ?extern void ext4_ext_drop_refs(struct ext4_ext_path *);
> ?extern int ext4_ext_check_inode(struct inode *inode);
> +extern int ext4_ext_mark_extents_init(struct inode *inode, loff_t offset,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? int nr_bytes);
> ?#endif /* _EXT4_EXTENTS */
>
> diff -Naur orig/fs/ext4/extents.c new/fs/ext4/extents.c
> --- orig/fs/ext4/extents.c ? ? ?2009-07-28 17:02:54.000000000 -0700
> +++ new/fs/ext4/extents.c ? ? ? 2009-07-28 17:02:49.000000000 -0700
> @@ -2563,6 +2563,13 @@
> ? ? ? ? ? ? ? ? ? ? ? ?ex3->ee_block = cpu_to_le32(iblock);
> ? ? ? ? ? ? ? ? ? ? ? ?ext4_ext_store_pblock(ex3, newblock);
> ? ? ? ? ? ? ? ? ? ? ? ?ex3->ee_len = cpu_to_le16(allocated);
> +
> + ? ? ? ? ? ? ? ? ? ? ? /* For direct I/O, we mark this as uninitialized, and
> + ? ? ? ? ? ? ? ? ? ? ? ?* set it as initialized when we finish the direct
> + ? ? ? ? ? ? ? ? ? ? ? ?* IO. */
> + ? ? ? ? ? ? ? ? ? ? ? if (EXT4_I(inode)->i_state & EXT4_STATE_DIO_WRITE)
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ext4_ext_mark_uninitialized(ex3);
> +
> ? ? ? ? ? ? ? ? ? ? ? ?err = ext4_ext_insert_extent(handle, inode, path, ex3);
> ? ? ? ? ? ? ? ? ? ? ? ?if (err == -ENOSPC) {
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?err = ?ext4_ext_zeroout(inode, &orig_ex);
> @@ -2698,6 +2705,13 @@
> ? ? ? ?ex2->ee_block = cpu_to_le32(iblock);
> ? ? ? ?ext4_ext_store_pblock(ex2, newblock);
> ? ? ? ?ex2->ee_len = cpu_to_le16(allocated);
> +
> + ? ? ? /* For direct I/O, we mark this as uninitialized, and
> + ? ? ? ?* set it as initialized when we finish the direct
> + ? ? ? ?* IO. */
> + ? ? ? if (EXT4_I(inode)->i_state & EXT4_STATE_DIO_WRITE)
> + ? ? ? ? ? ? ? ext4_ext_mark_uninitialized(ex2);
> +
> ? ? ? ?if (ex2 != ex)
> ? ? ? ? ? ? ? ?goto insert;
> ? ? ? ?/*
> @@ -2763,6 +2777,109 @@
> ? ? ? ?return err;
> ?}
>
> +
> +static int handle_uninit_extent(struct ext4_ext_path *path,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct inode *inode,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct ext4_extent *ex)
> +{
> + ? ? ? handle_t *handle;
> + ? ? ? int credits;
> + ? ? ? int started;
> + ? ? ? int depth = ext_depth(inode);
> + ? ? ? int ret = 0;
> +
> + ? ? ? if (ext4_ext_is_uninitialized(ex)) {
> + ? ? ? ? ? ? ? handle = ext4_journal_current_handle();
> + ? ? ? ? ? ? ? started = 0;
> +
> + ? ? ? ? ? ? ? if (!handle) {
> + ? ? ? ? ? ? ? ? ? ? ? credits =
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ?ext4_chunk_trans_blocks(inode, 1);
> + ? ? ? ? ? ? ? ? ? ? ? handle = ext4_journal_start(inode,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? credits);
> + ? ? ? ? ? ? ? ? ? ? ? if (IS_ERR(handle)) {
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ret = PTR_ERR(handle);
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? goto out;
> + ? ? ? ? ? ? ? ? ? ? ? }
> + ? ? ? ? ? ? ? ? ? ? ? started = 1;
> + ? ? ? ? ? ? ? }
> + ? ? ? ? ? ? ? /* Mark as initialized */
> + ? ? ? ? ? ? ? ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex));
> +
> + ? ? ? ? ? ? ? ret = ext4_ext_dirty(handle, inode, path + depth);
> +
> + ? ? ? ? ? ? ? if (started)
> + ? ? ? ? ? ? ? ? ? ? ? ext4_journal_stop(handle);
> + ? ? ? }
> +out:
> + ? ? ? return ret;
> +}
> +
> +/* Called after direct I/O writes, to make sure that all extents are marked
> + * as initialized. */
> +int ext4_ext_mark_extents_init(struct inode *inode,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? loff_t offset,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? int nr_bytes)
> +{
> + ? ? ? struct ext4_ext_path *path = NULL;
> + ? ? ? struct ext4_extent *ex;
> + ? ? ? int depth;
> + ? ? ? int nr_blocks = nr_bytes >> inode->i_blkbits;
> + ? ? ? ext4_lblk_t first_block = offset >> inode->i_blkbits;
> + ? ? ? ext4_lblk_t block;
> + ? ? ? int ret;
> +
> + ? ? ? /* Handle the extent for the first block */
> + ? ? ? path ?= ext4_ext_find_extent(inode, first_block, NULL);
> + ? ? ? if (IS_ERR(path)) {
> + ? ? ? ? ? ? ? ret = PTR_ERR(path);
> + ? ? ? ? ? ? ? path = NULL;
> + ? ? ? ? ? ? ? goto out;
> + ? ? ? }
> + ? ? ? depth = ext_depth(inode);
> + ? ? ? ex ? ?= path[depth].p_ext;
> +
> + ? ? ? ret = handle_uninit_extent(path, inode, ex);
> + ? ? ? if (ret != 0) {
> + ? ? ? ? ? ? ? goto out;
> + ? ? ? }
> +
> + ? ? ? /* Check the rest of the blocks; if they're in different
> + ? ? ? ?* extents, handle those as well. */
> + ? ? ? for (block = first_block + 1;
> + ? ? ? ? ? ?block < first_block + nr_blocks;
> + ? ? ? ? ? ?block++) {
> +
> + ? ? ? ? ? ? ? /* Only look for new extents now */
> + ? ? ? ? ? ? ? if (block >= ex->ee_block &&
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? block < (ex->ee_block + ex->ee_len))
> + ? ? ? ? ? ? ? ? ? ? ? continue;
> +
> + ? ? ? ? ? ? ? ext4_ext_drop_refs(path);
> + ? ? ? ? ? ? ? path ?= ext4_ext_find_extent(inode, first_block, path);
> + ? ? ? ? ? ? ? if (IS_ERR(path)) {
> + ? ? ? ? ? ? ? ? ? ? ? ret = PTR_ERR(path);
> + ? ? ? ? ? ? ? ? ? ? ? path = NULL;
> + ? ? ? ? ? ? ? ? ? ? ? goto out;
> + ? ? ? ? ? ? ? }
> + ? ? ? ? ? ? ? depth = ext_depth(inode);
> + ? ? ? ? ? ? ? ex ? ?= path[depth].p_ext;
> +
> + ? ? ? ? ? ? ? ret = handle_uninit_extent(path, inode, ex);
> + ? ? ? ? ? ? ? if (ret != 0) {
> + ? ? ? ? ? ? ? ? ? ? ? goto out;
> + ? ? ? ? ? ? ? }
> + ? ? ? }
> +out:
> + ? ? ? if (path) {
> + ? ? ? ? ? ? ? ext4_ext_drop_refs(path);
> + ? ? ? ? ? ? ? kfree(path);
> + ? ? ? }
> +
> + ? ? ? return ret;
> +}
> +
> +
> ?/*
> ?* Block allocation/map/preallocation routine for extents based files
> ?*
> diff -Naur orig/fs/ext4/inode.c new/fs/ext4/inode.c
> --- orig/fs/ext4/inode.c ? ? ? ?2009-07-28 17:02:04.000000000 -0700
> +++ new/fs/ext4/inode.c 2009-07-28 17:23:59.000000000 -0700
> @@ -3318,11 +3318,33 @@
> ? ? ? ? ? ? ? ? ? ? ? ?ei->i_disksize = inode->i_size;
> ? ? ? ? ? ? ? ? ? ? ? ?ext4_journal_stop(handle);
> ? ? ? ? ? ? ? ?}
> - ? ? ? }
> + ? ? ? ? ? ? ? /* This prevents initializing extents too early */
> + ? ? ? ? ? ? ? if (ei->i_flags & EXT4_EXTENTS_FL)
> + ? ? ? ? ? ? ? ? ? ? ? ei->i_state |= EXT4_STATE_DIO_WRITE;
> + ? ? ? }
> +
> + ? ? ? ret = blockdev_direct_IO_own_locking(rw, iocb, inode,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? inode->i_sb->s_bdev, iov,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? offset, nr_segs,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ext4_get_block, NULL);
> +
> + ? ? ? /* We now need to look at all extents for the blocks that were
> + ? ? ? ?* written out, and mark them as initialized. ?Note that they've
> + ? ? ? ?* already been converted, simply not yet marked as init. */
> + ? ? ? if (rw == WRITE && ei->i_flags & EXT4_EXTENTS_FL) {
> + ? ? ? ? ? ? ? BUG_ON((ei->i_state & EXT4_STATE_DIO_WRITE) == 0);
> + ? ? ? ? ? ? ? ei->i_state |= ~EXT4_STATE_DIO_WRITE;
> +
> + ? ? ? ? ? ? ? if (ret > 0) {
> + ? ? ? ? ? ? ? ? ? ? ? int ret2;
>
> - ? ? ? ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?offset, nr_segs,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ext4_get_block, NULL);
> + ? ? ? ? ? ? ? ? ? ? ? ret2 = ext4_ext_mark_extents_init(inode, offset, ret);
> + ? ? ? ? ? ? ? ? ? ? ? if (ret2 != 0) {
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ret = ret2;
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? goto out;
> + ? ? ? ? ? ? ? ? ? ? ? }
> + ? ? ? ? ? ? ? }
> + ? ? ? }
>
> ? ? ? ?if (orphan) {
> ? ? ? ? ? ? ? ?int err;
>

2009-07-29 17:18:19

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH RFC] Insure direct IO writes do not use the page cache

Curt Wohlgemuth wrote:
> Although replying to self is somewhat bad etiquette...

nah :)

> I've found at least one issue with this patch: Although the semantics
> seem correct, since the late-converted-to-init extents are not merged
> with neighbors, you can easily end up with thousands of extents :-( .
> Each write to fallocate'd space results in its own initialized extent.
>
> I'm not sure how expensive it would be to merge the extents when they
> are converted to initialized after the DIO write goes through.
>
> Curt
>

hm I think I've seen other cases where things don't get merged as well
as I'd expect.

I haven't replied to the first mail yet because I have a lot of
remembering to do about xfs first, but I'm fairly certain that at least
your use of blockdev_direct_IO_own_locking() is not correct. See for
example all the comments around __blockdev_direct_IO about i_mutex, and
all the xfs_ilock/xfs_iolock calls in xfs_read/xfs_write.

There is a lot of locking for the fs to handle if you want to go that route.

Also, IIRC xfs does the conversion to written (vs. unwritten) extents in
an IO completion handler, just FWIW.

-Eric

2009-07-29 17:41:44

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH RFC] Insure direct IO writes do not use the page cache

Eric Sandeen wrote:

...

> I haven't replied to the first mail yet because I have a lot of
> remembering to do about xfs first, but I'm fairly certain that at least
> your use of blockdev_direct_IO_own_locking() is not correct. See for
> example all the comments around __blockdev_direct_IO about i_mutex, and
> all the xfs_ilock/xfs_iolock calls in xfs_read/xfs_write.

Oh and i_alloc_sem too, forgot about that one :)

-Eric

2009-07-29 17:47:06

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH RFC] Insure direct IO writes do not use the page cache

On Tue, 2009-07-28 at 17:28 -0700, Curt Wohlgemuth wrote:
> This insures that direct IO writes to fallocate'd file space do not use the
> page cache.
>
>
> Signed-off-by: Curt Wohlgemuth <[email protected]>
>
> ---
>
> I implemented Aneesh's ideas for this solution, but have some questions.
>
> I've verified that the page cache isn't used, regardless of whether
> FALLOC_FL_KEEP_SIZE is used in the fallocate() call or not.
>
> The changes:
> - New inode state flag to indicate that a DIO write is ongoing
>
> - ext4_ext_convert_to_initialized() will mark any new extent it creates
> as uninitialized, if this new EXT4_STATE_DIO_WRITE flag is set.
>
> - ext4_direct_IO() will set this flag for any write.
>
> It now calls blockdev_direct_IO_own_locking() to do the I/O.
>
> After return from blockdev_direct_IO_own_locking() it clears the flag
> and calls a new routine to mark all extents containing the returned
> blocks as initialized.
>
> I'm a bit uncertain about the use of DIO_OWN_LOCKING; I looked at the XFS
> code to see if it acquired any other locks while using this flag, but didn't
> see any. Suggestions/corrections welcome.
>

I was coding up something different approach, but you beat me first:)

I looked at your patch briefly, I am not sure about using
DIO_OWN_LOCKING directly. For writes, that requires filesystem to handle
the race between the buffered IO and direct IO, assuming i_mutex is not
hold for write. But in ext4 case, since we use the generic routine, we
already hold the inode's i_mutex for write. Not sure if that would cause
any locking problem though...

XFS seems hold the i_mutex if it founds there are dirty pages in cache,
but in the case of no dirty pages, it will drop the lock completely.
during the DIO IO the lock is not hold at all.

Mingming
>
> diff -Naur orig/fs/ext4/ext4.h new/fs/ext4/ext4.h
> --- orig/fs/ext4/ext4.h 2009-07-28 17:02:25.000000000 -0700
> +++ new/fs/ext4/ext4.h 2009-07-28 17:02:19.000000000 -0700
> @@ -272,6 +272,7 @@
> #define EXT4_STATE_XATTR 0x00000004 /* has in-inode xattrs */
> #define EXT4_STATE_NO_EXPAND 0x00000008 /* No space for expansion */
> #define EXT4_STATE_DA_ALLOC_CLOSE 0x00000010 /* Alloc DA blks on close */
> +#define EXT4_STATE_DIO_WRITE 0x00000020 /* This is a DIO write */
>
> /* Used to pass group descriptor data when online resize is done */
> struct ext4_new_group_input {
> diff -Naur orig/fs/ext4/ext4_extents.h new/fs/ext4/ext4_extents.h
> --- orig/fs/ext4/ext4_extents.h 2009-07-28 17:02:40.000000000 -0700
> +++ new/fs/ext4/ext4_extents.h 2009-07-28 17:02:34.000000000 -0700
> @@ -242,5 +242,7 @@
> ext4_lblk_t *, ext4_fsblk_t *);
> extern void ext4_ext_drop_refs(struct ext4_ext_path *);
> extern int ext4_ext_check_inode(struct inode *inode);
> +extern int ext4_ext_mark_extents_init(struct inode *inode, loff_t offset,
> + int nr_bytes);
> #endif /* _EXT4_EXTENTS */
>
> diff -Naur orig/fs/ext4/extents.c new/fs/ext4/extents.c
> --- orig/fs/ext4/extents.c 2009-07-28 17:02:54.000000000 -0700
> +++ new/fs/ext4/extents.c 2009-07-28 17:02:49.000000000 -0700
> @@ -2563,6 +2563,13 @@
> ex3->ee_block = cpu_to_le32(iblock);
> ext4_ext_store_pblock(ex3, newblock);
> ex3->ee_len = cpu_to_le16(allocated);
> +
> + /* For direct I/O, we mark this as uninitialized, and
> + * set it as initialized when we finish the direct
> + * IO. */
> + if (EXT4_I(inode)->i_state & EXT4_STATE_DIO_WRITE)
> + ext4_ext_mark_uninitialized(ex3);
> +
> err = ext4_ext_insert_extent(handle, inode, path, ex3);
> if (err == -ENOSPC) {
> err = ext4_ext_zeroout(inode, &orig_ex);
> @@ -2698,6 +2705,13 @@
> ex2->ee_block = cpu_to_le32(iblock);
> ext4_ext_store_pblock(ex2, newblock);
> ex2->ee_len = cpu_to_le16(allocated);
> +
> + /* For direct I/O, we mark this as uninitialized, and
> + * set it as initialized when we finish the direct
> + * IO. */
> + if (EXT4_I(inode)->i_state & EXT4_STATE_DIO_WRITE)
> + ext4_ext_mark_uninitialized(ex2);
> +
> if (ex2 != ex)
> goto insert;
> /*
> @@ -2763,6 +2777,109 @@
> return err;
> }
>
> +
> +static int handle_uninit_extent(struct ext4_ext_path *path,
> + struct inode *inode,
> + struct ext4_extent *ex)
> +{
> + handle_t *handle;
> + int credits;
> + int started;
> + int depth = ext_depth(inode);
> + int ret = 0;
> +
> + if (ext4_ext_is_uninitialized(ex)) {
> + handle = ext4_journal_current_handle();
> + started = 0;
> +
> + if (!handle) {
> + credits =
> + ext4_chunk_trans_blocks(inode, 1);
> + handle = ext4_journal_start(inode,
> + credits);
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + goto out;
> + }
> + started = 1;
> + }
> + /* Mark as initialized */
> + ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex));
> +
> + ret = ext4_ext_dirty(handle, inode, path + depth);
> +
> + if (started)
> + ext4_journal_stop(handle);
> + }
> +out:
> + return ret;
> +}
> +
> +/* Called after direct I/O writes, to make sure that all extents are marked
> + * as initialized. */
> +int ext4_ext_mark_extents_init(struct inode *inode,
> + loff_t offset,
> + int nr_bytes)
> +{
> + struct ext4_ext_path *path = NULL;
> + struct ext4_extent *ex;
> + int depth;
> + int nr_blocks = nr_bytes >> inode->i_blkbits;
> + ext4_lblk_t first_block = offset >> inode->i_blkbits;
> + ext4_lblk_t block;
> + int ret;
> +
> + /* Handle the extent for the first block */
> + path = ext4_ext_find_extent(inode, first_block, NULL);
> + if (IS_ERR(path)) {
> + ret = PTR_ERR(path);
> + path = NULL;
> + goto out;
> + }
> + depth = ext_depth(inode);
> + ex = path[depth].p_ext;
> +
> + ret = handle_uninit_extent(path, inode, ex);
> + if (ret != 0) {
> + goto out;
> + }
> +
> + /* Check the rest of the blocks; if they're in different
> + * extents, handle those as well. */
> + for (block = first_block + 1;
> + block < first_block + nr_blocks;
> + block++) {
> +
> + /* Only look for new extents now */
> + if (block >= ex->ee_block &&
> + block < (ex->ee_block + ex->ee_len))
> + continue;
> +
> + ext4_ext_drop_refs(path);
> + path = ext4_ext_find_extent(inode, first_block, path);
> + if (IS_ERR(path)) {
> + ret = PTR_ERR(path);
> + path = NULL;
> + goto out;
> + }
> + depth = ext_depth(inode);
> + ex = path[depth].p_ext;
> +
> + ret = handle_uninit_extent(path, inode, ex);
> + if (ret != 0) {
> + goto out;
> + }
> + }
> +out:
> + if (path) {
> + ext4_ext_drop_refs(path);
> + kfree(path);
> + }
> +
> + return ret;
> +}
> +
> +
> /*
> * Block allocation/map/preallocation routine for extents based files
> *
> diff -Naur orig/fs/ext4/inode.c new/fs/ext4/inode.c
> --- orig/fs/ext4/inode.c 2009-07-28 17:02:04.000000000 -0700
> +++ new/fs/ext4/inode.c 2009-07-28 17:23:59.000000000 -0700
> @@ -3318,11 +3318,33 @@
> ei->i_disksize = inode->i_size;
> ext4_journal_stop(handle);
> }
> - }
> + /* This prevents initializing extents too early */
> + if (ei->i_flags & EXT4_EXTENTS_FL)
> + ei->i_state |= EXT4_STATE_DIO_WRITE;
> + }
> +
> + ret = blockdev_direct_IO_own_locking(rw, iocb, inode,
> + inode->i_sb->s_bdev, iov,
> + offset, nr_segs,
> + ext4_get_block, NULL);
> +
> + /* We now need to look at all extents for the blocks that were
> + * written out, and mark them as initialized. Note that they've
> + * already been converted, simply not yet marked as init. */
> + if (rw == WRITE && ei->i_flags & EXT4_EXTENTS_FL) {
> + BUG_ON((ei->i_state & EXT4_STATE_DIO_WRITE) == 0);
> + ei->i_state |= ~EXT4_STATE_DIO_WRITE;
> +
> + if (ret > 0) {
> + int ret2;
>
> - ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
> - offset, nr_segs,
> - ext4_get_block, NULL);
> + ret2 = ext4_ext_mark_extents_init(inode, offset, ret);
> + if (ret2 != 0) {
> + ret = ret2;
> + goto out;
> + }
> + }
> + }
>
> if (orphan) {
> int err;
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



2009-07-29 18:10:15

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH RFC] Insure direct IO writes do not use the page cache

On Tue, Jul 28, 2009 at 05:28:05PM -0700, Curt Wohlgemuth wrote:
> This insures that direct IO writes to fallocate'd file space do not use the
> page cache.

This isn't a full review, but I haven't found the time to write a
complete summary of all of the issues around direct I/O yet; but since
people are working on the solutions, I thought I'd raise a few issues
that I noticed while doing a quick read-through of the patch. It's
not a full review, sorry --- ENOTIME.

One thing that we really need to do is in handle_uninit_extent(), but
a comment that it's only going to be used by at the end of direct I/O,
and then put in a call to ext4_handle_sync() to mark the handle as
synchronous. That way, the DIO write won't return until the journal
transaction which commits the metadata operation is complete.

This is going to make the DIO write performance with a journal have
horrendous performance, but at least it will be correct.
(Specifically, DIO writes have the semantics that if the alignment
constraints are respected, the write is supposed to be synchronous ---
which means that if the write completes, and the system crashes
immediately afterwards, the data has to be accessible after the system
reboots. If the extent tree change isn't committed, then the written
data won't be visible to userspace, so it might as well be lost.) I
don't think we need to do anything special in the no journal case,
since by definition without a journal metadata updates aren't
guaranteed to be up to date.

I think I've only mentioned this on the weekly ext4 concall, so let me
fill in the optimizations I have in mind that should hopefully make
DIO less painful when writing into preallocated space.

1) If extent tree block in question isn't already part of a journalled
transaction, and there is space in the extent block so we don't have
to split the extent tree node (which would require allocating a new
block), we can update the extent block in place, bypassing the
journal. This will allow us to avoid waiting for the journal commit.

2) We can modify the ext4_ext_convert_to_initialized() to be more
aggressive about initializing data blocks if we know we are doing DIO,
since zero'ing an aligned 16 to 32 blocks and then waiting for the
journal commit once is cheaper than converting the extent one block at
a time and waiting for the journal commit after each block write.

Does that make sense?

- Ted

2009-07-29 19:48:14

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH RFC] Insure direct IO writes do not use the page cache

Eric Sandeen wrote:
...

> Also, IIRC xfs does the conversion to written (vs. unwritten) extents in
> an IO completion handler, just FWIW.

After talking w/ Jeff Moyer, realized that w/o using the io completion,
there's a race with AIO.

Userspace will get notified that the write is done before the extents
get flipped to initialized ....

-Eric

2009-07-29 22:17:16

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH RFC] Insure direct IO writes do not use the page cache

On Wed, 2009-07-29 at 14:48 -0500, Eric Sandeen wrote:
> Eric Sandeen wrote:
> ...
>
> > Also, IIRC xfs does the conversion to written (vs. unwritten) extents in
> > an IO completion handler, just FWIW.
>
> After talking w/ Jeff Moyer, realized that w/o using the io completion,
> there's a race with AIO.
>
> Userspace will get notified that the write is done before the extents
> get flipped to initialized ....

Yes, for the AIO case, the DIO submit the IO and returns without waiting
for IO to complete. If we do conversion in ext4_direct_IO, then we could
convert the extents before the real IO hit to disk. Could cause stale
data exposed if crash before the data written to disk.
>
> -Eric
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



2009-07-30 11:06:24

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH RFC] Insure direct IO writes do not use the page cache

On Tue, Jul 28, 2009 at 05:28:05PM -0700, Curt Wohlgemuth wrote:
> This insures that direct IO writes to fallocate'd file space do not use the
> page cache.
>
>
> Signed-off-by: Curt Wohlgemuth <[email protected]>
>
> ---
>
> I implemented Aneesh's ideas for this solution, but have some questions.
>
> I've verified that the page cache isn't used, regardless of whether
> FALLOC_FL_KEEP_SIZE is used in the fallocate() call or not.
>
> The changes:
> - New inode state flag to indicate that a DIO write is ongoing
>
> - ext4_ext_convert_to_initialized() will mark any new extent it creates
> as uninitialized, if this new EXT4_STATE_DIO_WRITE flag is set.

That will have issues with a parallel get_block due to writepages. ext4_da_writepages
will end up calling get_block on uninit extent without holding inode->i_mutex. So
you can have a direct_IO -> get_block and a writepages -> get_block going together
now if we mark extent as uninit based on a inode flag we will have to fix the
writepages call path also. Instead of inode flag you may want to track the extent
(you can look at the patch from Chris Mason implementing data=guarded for ext3. Chris
Mason's patch track buffer_head what you want is to track extent. And convert the
extent to init using end_io call back.


>
> - ext4_direct_IO() will set this flag for any write.
>
> It now calls blockdev_direct_IO_own_locking() to do the I/O.
>
> After return from blockdev_direct_IO_own_locking() it clears the flag
> and calls a new routine to mark all extents containing the returned
> blocks as initialized.
>
> I'm a bit uncertain about the use of DIO_OWN_LOCKING; I looked at the XFS
> code to see if it acquired any other locks while using this flag, but didn't
> see any. Suggestions/corrections welcome.

-aneesh

2009-07-30 18:30:54

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC] Insure direct IO writes do not use the page cache

> On Tue, Jul 28, 2009 at 05:28:05PM -0700, Curt Wohlgemuth wrote:
> > This insures that direct IO writes to fallocate'd file space do not use the
> > page cache.
>
> This isn't a full review, but I haven't found the time to write a
> complete summary of all of the issues around direct I/O yet; but since
> people are working on the solutions, I thought I'd raise a few issues
> that I noticed while doing a quick read-through of the patch. It's
> not a full review, sorry --- ENOTIME.
>
> One thing that we really need to do is in handle_uninit_extent(), but
> a comment that it's only going to be used by at the end of direct I/O,
> and then put in a call to ext4_handle_sync() to mark the handle as
> synchronous. That way, the DIO write won't return until the journal
> transaction which commits the metadata operation is complete.
>
> This is going to make the DIO write performance with a journal have
> horrendous performance, but at least it will be correct.
> (Specifically, DIO writes have the semantics that if the alignment
> constraints are respected, the write is supposed to be synchronous ---
> which means that if the write completes, and the system crashes
> immediately afterwards, the data has to be accessible after the system
> reboots. If the extent tree change isn't committed, then the written
> data won't be visible to userspace, so it might as well be lost.) I
> don't think we need to do anything special in the no journal case,
> since by definition without a journal metadata updates aren't
> guaranteed to be up to date.
>
> I think I've only mentioned this on the weekly ext4 concall, so let me
> fill in the optimizations I have in mind that should hopefully make
> DIO less painful when writing into preallocated space.
>
> 1) If extent tree block in question isn't already part of a journalled
> transaction, and there is space in the extent block so we don't have
> to split the extent tree node (which would require allocating a new
> block), we can update the extent block in place, bypassing the
> journal. This will allow us to avoid waiting for the journal commit.
I have to say I'm a bit worried about modify-in-place tricks - it's
not trivial to make sure buffer is not part of any transaction in the
journal, since the buffer head could have been evicted from memory, but
the transaction still is not fully checkpointed. Hence in memory, you
don't have any evidence of the fact that if the machine crashes, your
modify-in-place gets overwritten by journal-replay.

> 2) We can modify the ext4_ext_convert_to_initialized() to be more
> aggressive about initializing data blocks if we know we are doing DIO,
> since zero'ing an aligned 16 to 32 blocks and then waiting for the
> journal commit once is cheaper than converting the extent one block at
> a time and waiting for the journal commit after each block write.
Definitely. I'm not following the discussion too much in detail but
what seems to me is the following could work:
The direct IO path would first send all the data to disk to the
desired location (get_block wouldn't do any conversion, just map blocks).
When this is done, we convert all the touched extents to initialized ones
from ext4_direct_IO, update i_size if needed, and wait for transaction
commit.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2009-07-30 18:39:19

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH RFC] Insure direct IO writes do not use the page cache

Jan Kara wrote:
>> On Tue, Jul 28, 2009 at 05:28:05PM -0700, Curt Wohlgemuth wrote:

...

>> 2) We can modify the ext4_ext_convert_to_initialized() to be more
>> aggressive about initializing data blocks if we know we are doing DIO,
>> since zero'ing an aligned 16 to 32 blocks and then waiting for the
>> journal commit once is cheaper than converting the extent one block at
>> a time and waiting for the journal commit after each block write.
> Definitely. I'm not following the discussion too much in detail but
> what seems to me is the following could work:
> The direct IO path would first send all the data to disk to the
> desired location (get_block wouldn't do any conversion, just map blocks).
> When this is done, we convert all the touched extents to initialized ones
> from ext4_direct_IO, update i_size if needed, and wait for transaction
> commit.
>
> Honza

This is all about right, but it's tricky, because right now, get_block
is called in the direct IO path from get_more_blocks(), and it's called
with create == 0 unless OWN_LOCKING is specified. If we do get_block w/
create == 0 and find prealloc'd blocks, then we're given back unmapped
buffer heads. This looks like a hole, and so DIO falls back to buffered.

Right now the only way to get create == 1 sent to get_blocks via
directio is to do OWN_LOCKING, which implies... we have to do our own
locking, and it'll take some time to get it right I think.

-Eric

2009-07-30 18:44:49

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC] Insure direct IO writes do not use the page cache

On Thu 30-07-09 13:39:12, Eric Sandeen wrote:
> Jan Kara wrote:
> >> On Tue, Jul 28, 2009 at 05:28:05PM -0700, Curt Wohlgemuth wrote:
>
> ...
>
> >> 2) We can modify the ext4_ext_convert_to_initialized() to be more
> >> aggressive about initializing data blocks if we know we are doing DIO,
> >> since zero'ing an aligned 16 to 32 blocks and then waiting for the
> >> journal commit once is cheaper than converting the extent one block at
> >> a time and waiting for the journal commit after each block write.
> > Definitely. I'm not following the discussion too much in detail but
> > what seems to me is the following could work:
> > The direct IO path would first send all the data to disk to the
> > desired location (get_block wouldn't do any conversion, just map blocks).
> > When this is done, we convert all the touched extents to initialized ones
> > from ext4_direct_IO, update i_size if needed, and wait for transaction
> > commit.
> >
> > Honza
>
> This is all about right, but it's tricky, because right now, get_block
> is called in the direct IO path from get_more_blocks(), and it's called
> with create == 0 unless OWN_LOCKING is specified. If we do get_block w/
> create == 0 and find prealloc'd blocks, then we're given back unmapped
> buffer heads. This looks like a hole, and so DIO falls back to buffered.
>
> Right now the only way to get create == 1 sent to get_blocks via
> directio is to do OWN_LOCKING, which implies... we have to do our own
> locking, and it'll take some time to get it right I think.
But the get_block function called by get_more_blocks() is specified in
ext4_direct_IO. So we can provide it with a special direct_IO version of
get_block function which happily maps also uninitialized extents... It's a
slight hack, but maintainable IMHO.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-07-30 19:16:46

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH RFC] Insure direct IO writes do not use the page cache

Jan Kara wrote:
> On Thu 30-07-09 13:39:12, Eric Sandeen wrote:
...

Honza
>> This is all about right, but it's tricky, because right now, get_block
>> is called in the direct IO path from get_more_blocks(), and it's called
>> with create == 0 unless OWN_LOCKING is specified. If we do get_block w/
>> create == 0 and find prealloc'd blocks, then we're given back unmapped
>> buffer heads. This looks like a hole, and so DIO falls back to buffered.
>>
>> Right now the only way to get create == 1 sent to get_blocks via
>> directio is to do OWN_LOCKING, which implies... we have to do our own
>> locking, and it'll take some time to get it right I think.
> But the get_block function called by get_more_blocks() is specified in
> ext4_direct_IO. So we can provide it with a special direct_IO version of
> get_block function which happily maps also uninitialized extents... It's a
> slight hack, but maintainable IMHO.

Hm, perhaps. xfs does have a xfs_get_blocks_direct() that it uses for DIO.

Although returning mapped unwritten blocks ... well, we'll have to be
sure it doesn't cause any problems elsewhere. :)

-Eric

> Honza


2009-07-30 20:34:01

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH RFC] Insure direct IO writes do not use the page cache

On Thu, Jul 30, 2009 at 08:30:53PM +0200, Jan Kara wrote:
> I have to say I'm a bit worried about modify-in-place tricks - it's
> not trivial to make sure buffer is not part of any transaction in the
> journal, since the buffer head could have been evicted from memory, but
> the transaction still is not fully checkpointed. Hence in memory, you
> don't have any evidence of the fact that if the machine crashes, your
> modify-in-place gets overwritten by journal-replay.

Yeah, good point; tracking which blocks might get overwritten on a
journal replay is tough. What we *could* do that would make this easier
is to insert a revoke record for all extent tree blocks after the
blocks have been written to disk (since at that point there's no need
for that block to be replayed).

Whether or not this optimization is worth it largely depends on time
between how many blocks are getting allocated using fallocate(), and
what the average number of blocks are that get written at a time by
the application (normally enterprise databases) when write into the
unitialized area. If the average size is say, 32k, and the amount of
space they allocate is say, 32 megs, then without doing any special
DIO optimization, on average we will end up having to do 1024
synchronous waits on a journal commit. If the database doesn't use
any fallocates at all, then it will have to do a 32 meg write to
initialize the area, followed by 32 megs of data writes, written
randomly 32k at a time.

So being aggressive with pre-zeroing extra datablocks when we convert
uninit extents to initialized extents mean that we still have to do
some percentage of zero'izing data writes combined with the extra
journal traffic, so it's likely we haven't reduced the total disk
bandwidth by much, and the latency improvements of not having to do
the 32meg zero writes gets offset with the data=ordered latency hits
when we do the journal commit.

So it would seem to me that if we really want to get the full benefit
of preallocation in the DIO case, we really do need to think about
seeing if it's possible bypass the journal.

It may be useful here to write a benchmark that simulates the behavior
of an eneterprise database using fallocate, so we can see what the
performance hit is of making sure we don't lose data on a crash, and
then how much of that performance hit we can claw back with various
optimizations.

- Ted

2009-07-31 16:10:26

by Curt Wohlgemuth

[permalink] [raw]
Subject: Re: [PATCH RFC] Insure direct IO writes do not use the page cache

Thanks to all for ideas and corrections for my original patch. I'd like to
summarize the issues that I've seen raised:

1. Using blockdev_direct_IO_own_locking() as it stands, without additional
locking in the ext4 code, is incorrect.

2. The conversion from uninit to initialized extents should be done in an IO
completion handler.

3. When the uninit-to-init extents are converted, the handle must be marked
as synchronous.

But this will make DIO writes (to fallocated space) with a journal have
bad performance.

4. Ted mentioned some optimizations possible for extent conversion (when the
extent block isn't part of a transaction, and no new block is required).
Jan says that verifying that the extent block is not part of a
transaction can be difficult.

Also we could increase the extent size that we're willing to zero out the
data blocks for.

5. Aneesh mentioned that we could use extent tracking a la Chris Mason's
patch for data=guarded (I confess, I haven't looked at this yet).

6. Jan's other thought is to use a new ext4_get_blocks_direct() routine as
the get_block callback to blockdev_direct_IO() -- so no use of
_own_locking(). This would simply return blocks from uninit extents;
extent conversion (including possible splitting) would then be done in
ext4_direct_IO().

7. Ted's last comment is about the tradeoffs between getting the journal
transaction correct vs aggressive zeroout of data blocks -- seeing if
it's possible to bypass the journal in the case of preallocated DIO
writes.

Looking through these, it seems to me that there are two major problems:

a. How to correctly do extent conversion in the face of locking issues and
races with other requests (e.g. AIO)

b. How to efficiently do this extent conversion in the face of correct
journal semantics.

Have I missed anything?

Jan's idea of a new get_block callback for DIO seems like the simplest
solution to (a) above. No locking changes would seem to be needed, I think.
Does this seem reasonable?

Problem (b) is one that I would defer to others with more experience with
journals than I have.

Thanks,
Curt

2009-07-31 17:58:59

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH RFC] Insure direct IO writes do not use the page cache

On Thu, 2009-07-30 at 16:33 -0400, Theodore Tso wrote:
> On Thu, Jul 30, 2009 at 08:30:53PM +0200, Jan Kara wrote:
> > I have to say I'm a bit worried about modify-in-place tricks - it's
> > not trivial to make sure buffer is not part of any transaction in the
> > journal, since the buffer head could have been evicted from memory, but
> > the transaction still is not fully checkpointed. Hence in memory, you
> > don't have any evidence of the fact that if the machine crashes, your
> > modify-in-place gets overwritten by journal-replay.
>
> Yeah, good point; tracking which blocks might get overwritten on a
> journal replay is tough. What we *could* do that would make this easier
> is to insert a revoke record for all extent tree blocks after the
> blocks have been written to disk (since at that point there's no need
> for that block to be replayed).
>
> Whether or not this optimization is worth it largely depends on time
> between how many blocks are getting allocated using fallocate(), and
> what the average number of blocks are that get written at a time by
> the application (normally enterprise databases) when write into the
> unitialized area. If the average size is say, 32k, and the amount of
> space they allocate is say, 32 megs, then without doing any special
> DIO optimization, on average we will end up having to do 1024
> synchronous waits on a journal commit. If the database doesn't use
> any fallocates at all, then it will have to do a 32 meg write to
> initialize the area, followed by 32 megs of data writes, written
> randomly 32k at a time.
>
> So being aggressive with pre-zeroing extra datablocks when we convert
> uninit extents to initialized extents mean that we still have to do
> some percentage of zero'izing data writes combined with the extra
> journal traffic, so it's likely we haven't reduced the total disk
> bandwidth by much, and the latency improvements of not having to do
> the 32meg zero writes gets offset with the data=ordered latency hits
> when we do the journal commit.
>
> So it would seem to me that if we really want to get the full benefit
> of preallocation in the DIO case, we really do need to think about
> seeing if it's possible bypass the journal.
>
> It may be useful here to write a benchmark that simulates the behavior
> of an eneterprise database using fallocate, so we can see what the
> performance hit is of making sure we don't lose data on a crash, and
> then how much of that performance hit we can claw back with various
> optimizations.
>

Eric and I looked at xfs code together the other day, xfs code did not
ensure DIO sync metadata (conversion) before return back to userspace.
It does ensure the workqueue kickoff the conversion and journal commit,
but it seems not waiting for it to complete. This seems confirmed by xfs
expert on IRC, who expressed DIO means only bypass page cache, but not
necessarily means sync on data and metadata unless file is opened with
SYNC mode.


Mingming



2009-07-31 18:03:11

by Michael Rubin

[permalink] [raw]
Subject: Re: [PATCH RFC] Insure direct IO writes do not use the page cache

On Fri, Jul 31, 2009 at 10:58 AM, Mingming<[email protected]> wrote:
> Eric and I looked at xfs code together the other day, xfs code did not
> ensure DIO sync metadata (conversion) before return back to userspace.
> It does ensure the workqueue kickoff the conversion and journal commit,
> but it seems not waiting for it to complete. This seems confirmed by xfs
> expert on IRC, who expressed DIO means only bypass page cache, but not
> necessarily means sync on data and metadata unless file is opened with
> SYNC mode.

This will our expectations for DIO on ext4 also.

mrubin

2009-07-31 18:03:48

by Michael Rubin

[permalink] [raw]
Subject: Re: [PATCH RFC] Insure direct IO writes do not use the page cache

On Fri, Jul 31, 2009 at 11:03 AM, Michael Rubin<[email protected]> wrote:
> This will our expectations for DIO on ext4 also.

<sigh>

I meant "This will fulfill our expectations for DIO on ext4 also."

mrubin

2009-08-01 06:56:57

by Mingming Cao

[permalink] [raw]
Subject: [PATCH RFC] ext4 direct IO for holes, fallocate

Hi,

I want to throw out my first attempt to fix the DIO fallocate issue, to
see if the approach makes sense. This is the approach I take will

1) DIO write to hole will not fall back to buffered IO. holes are
marked as unwritten extents after block allocation.
2) New extents allocated at the end of file is also treated as unwritten
extents
3) those unwritten extents, and fallocate extents, will be converted to
written extents (and update disk isize if needed)when the IO is
complete. The conversion is triggered using end_io call back function
passing from ext4 fs to direct IO.

4) We need to tell direct IO do not force create =0 for write on holes.
Does not want use blockdev_dio_own_locking(), instead,add another dio
type which indicating DIO that fs still needs the lock handling at VFS
layer, but wants to take care of the hole/unwritten stuff by fs itself.

untested patch below...

Signed-off-by: Mingming Cao <[email protected]>
---
fs/direct-io.c | 26 +++++++-
fs/ext4/Makefile | 4 -
fs/ext4/dio.c | 162 +++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/ext4.h | 22 ++++++-
fs/ext4/extents.c | 66 +++++++++++++++++++++
fs/ext4/inode.c | 83 ---------------------------
fs/ext4/super.c | 11 +++
include/linux/fs.h | 10 +++
8 files changed, 293 insertions(+), 91 deletions(-)

Index: linux-2.6.31-rc4/fs/ext4/inode.c
===================================================================
--- linux-2.6.31-rc4.orig/fs/ext4/inode.c 2009-07-31 12:57:53.000000000 -0700
+++ linux-2.6.31-rc4/fs/ext4/inode.c 2009-07-31 14:48:27.000000000 -0700
@@ -3279,89 +3279,6 @@ static int ext4_releasepage(struct page
}

/*
- * If the O_DIRECT write will extend the file then add this inode to the
- * orphan list. So recovery will truncate it back to the original size
- * if the machine crashes during the write.
- *
- * If the O_DIRECT write is intantiating holes inside i_size and the machine
- * crashes then stale disk data _may_ be exposed inside the file. But current
- * VFS code falls back into buffered path in that case so we are safe.
- */
-static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
- const struct iovec *iov, loff_t offset,
- unsigned long nr_segs)
-{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
- struct ext4_inode_info *ei = EXT4_I(inode);
- handle_t *handle;
- ssize_t ret;
- int orphan = 0;
- size_t count = iov_length(iov, nr_segs);
-
- if (rw == WRITE) {
- loff_t final_size = offset + count;
-
- if (final_size > inode->i_size) {
- /* Credits for sb + inode write */
- handle = ext4_journal_start(inode, 2);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out;
- }
- ret = ext4_orphan_add(handle, inode);
- if (ret) {
- ext4_journal_stop(handle);
- goto out;
- }
- orphan = 1;
- ei->i_disksize = inode->i_size;
- ext4_journal_stop(handle);
- }
- }
-
- ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
- offset, nr_segs,
- ext4_get_block, NULL);
-
- if (orphan) {
- int err;
-
- /* Credits for sb + inode write */
- handle = ext4_journal_start(inode, 2);
- if (IS_ERR(handle)) {
- /* This is really bad luck. We've written the data
- * but cannot extend i_size. Bail out and pretend
- * the write failed... */
- ret = PTR_ERR(handle);
- goto out;
- }
- if (inode->i_nlink)
- ext4_orphan_del(handle, inode);
- if (ret > 0) {
- loff_t end = offset + ret;
- if (end > inode->i_size) {
- ei->i_disksize = end;
- block_extend_i_size(inode, offset, ret);
- /*
- * We're going to return a positive `ret'
- * here due to non-zero-length I/O, so there's
- * no way of reporting error returns from
- * ext4_mark_inode_dirty() to userspace. So
- * ignore it.
- */
- ext4_mark_inode_dirty(handle, inode);
- }
- }
- err = ext4_journal_stop(handle);
- if (ret == 0)
- ret = err;
- }
-out:
- return ret;
-}
-
-/*
* Pages can be marked dirty completely asynchronously from ext4's journalling
* activity. By filemap_sync_pte(), try_to_unmap_one(), etc. We cannot do
* much here because ->set_page_dirty is called under VFS locks. The page is
Index: linux-2.6.31-rc4/fs/ext4/ext4.h
===================================================================
--- linux-2.6.31-rc4.orig/fs/ext4/ext4.h 2009-07-31 12:57:53.000000000 -0700
+++ linux-2.6.31-rc4/fs/ext4/ext4.h 2009-07-31 23:31:26.000000000 -0700
@@ -111,6 +111,15 @@ struct ext4_allocation_request {
unsigned int flags;
};

+typedef struct ext4_io_end{
+ struct inode *inode; /* file being written to */
+ unsigned int type; /* unwritten or written */
+ int error; /* I/O error code */
+ ext4_lblk_t offset; /* offset in the file */
+ size_t size; /* size of the extent */
+ struct work_struct work; /* data work queue */
+}ext4_io_end_t;
+
/*
* Special inodes numbers
*/
@@ -330,8 +339,8 @@ struct ext4_new_group_data {
/* Call ext4_da_update_reserve_space() after successfully
allocating the blocks */
#define EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE 0x0008
-
-
+#define EXT4_GET_BLOCKS_DIO 0x0010
+#define EXT4_GET_BLOCKS_CONVERT_UNINIT 0x0020
/*
* ioctl commands
*/
@@ -960,6 +969,9 @@ struct ext4_sb_info {

unsigned int s_log_groups_per_flex;
struct flex_groups *s_flex_groups;
+
+ /* workqueue for dio unwritten */
+ struct workqueue_struct *dio_unwritten_wq;
};

static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
@@ -1453,6 +1465,10 @@ extern __le16 ext4_group_desc_csum(struc
struct ext4_group_desc *gdp);
extern int ext4_group_desc_csum_verify(struct ext4_sb_info *sbi, __u32 group,
struct ext4_group_desc *gdp);
+/* dio.c */
+extern ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
+ const struct iovec *iov, loff_t offset,
+ unsigned long nr_segs);

static inline ext4_fsblk_t ext4_blocks_count(struct ext4_super_block *es)
{
@@ -1650,6 +1666,8 @@ extern void ext4_ext_init(struct super_b
extern void ext4_ext_release(struct super_block *);
extern long ext4_fallocate(struct inode *inode, int mode, loff_t offset,
loff_t len);
+extern long ext4_convert_unwritten(struct inode *inode, loff_t offset,
+ loff_t len);
extern int ext4_get_blocks(handle_t *handle, struct inode *inode,
sector_t block, unsigned int max_blocks,
struct buffer_head *bh, int flags);
Index: linux-2.6.31-rc4/fs/ext4/dio.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.31-rc4/fs/ext4/dio.c 2009-07-31 23:24:25.000000000 -0700
@@ -0,0 +1,162 @@
+/*
+ * linux/fs/ext4/dio.c
+ *
+ */
+
+#include <linux/fs.h>
+#include <linux/jbd2.h>
+#include <linux/buffer_head.h>
+#include <linux/writeback.h>
+#include <linux/bio.h>
+#include <linux/workqueue.h>
+
+#include "ext4_jbd2.h"
+
+struct workqueue_struct *ext4_unwritten_queue;
+
+/* Maximum number of blocks we map for direct IO at once. */
+#define DIO_MAX_BLOCKS 4096
+int ext4_get_block_dio(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create)
+{
+ handle_t *handle;
+ int ret = 0, started = 0;
+ unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
+ int dio_credits;
+ loff_t offset = (loff_t)iblock << inode->i_blkbits;
+ unsigned int flags = 0;
+
+ if (create) {
+ if (max_blocks > DIO_MAX_BLOCKS)
+ max_blocks = DIO_MAX_BLOCKS;
+ dio_credits = ext4_chunk_trans_blocks(inode, max_blocks);
+ handle = ext4_journal_start(inode, dio_credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+ started = 1;
+
+ flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
+ }
+
+ ret = ext4_get_blocks(handle, inode, iblock, max_blocks, bh_result,
+ flags);
+ if (ret > 0) {
+ bh_result->b_size = (ret << inode->i_blkbits);
+ ret = 0;
+ }
+ if (started)
+ ext4_journal_stop(handle);
+out:
+ return ret;
+}
+
+#define DIO_UNWRITTEN 0x1
+
+/*
+ * IO write completion for unwritten extents.
+ *
+ * convert a buffer range from unwritten to written extents.
+ */
+static void ext4_end_dio_unwritten(struct work_struct *work)
+{
+ ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
+ struct inode *inode = io->inode;
+ loff_t offset = io->offset;
+ size_t size = io->size;
+
+ ext4_convert_unwritten(inode, offset, size);
+
+ kfree(io);
+}
+
+ext4_io_end_t *ext4_init_io_end (struct inode *inode, unsigned int type)
+{
+ ext4_io_end_t *io = NULL;
+
+ io = kmalloc(sizeof(*io), GFP_NOFS);
+
+ if (io) {
+ io->inode = inode;
+ io->type = type;
+ io->offset = 0;
+ io->size = 0;
+ io->error = 0;
+ INIT_WORK(&io->work, ext4_end_dio_unwritten);
+ }
+
+ return io;
+}
+
+static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
+ ssize_t size, void *private)
+{
+ ext4_io_end_t *io_end = iocb->private;
+ struct workqueue_struct *wq;
+ /*FIX ME call ext4 code to convert the extents, commit the log*/
+
+ /* if not hole or unwritten extents, just simple return */
+ if (!io_endi || !size)
+ return;
+ io_end->offset = offset;
+ io_end->size = size;
+ wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
+
+ /* We need to convert unwritten extents to written */
+ queue_work(wq, &io_end->work);
+
+ if (is_sync_kiocb(iocb))
+ flush_workqueue(wq);
+
+ iocb->private = NULL;
+}
+/*
+ * If the O_DIRECT write will extend the file then add this inode to the
+ * orphan list. So recovery will truncate it back to the original size
+ * if the machine crashes during the write.
+ *
+ * If the O_DIRECT write is filling unintialized space inside i_size,
+ * convertion to initialized extent should happen after data written to disk
+ * to ensure we don't expose stale disk data after crash.
+ */
+ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
+ const struct iovec *iov, loff_t offset,
+ unsigned long nr_segs)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file->f_mapping->host;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ handle_t *handle;
+ ssize_t ret;
+ int orphan = 0;
+ size_t count = iov_length(iov, nr_segs);
+
+ if (rw == WRITE) {
+ /*
+ * we treat blocks allocated for DIO as
+ * as unwritten extents, need to convert them
+ * to written extents when IO is complete.
+ *
+ * In the case of AIO DIO, since VFS dio code
+ * does not wait for io complete, here
+ * i_size update needs to be buddled at the
+ * time DIO IO is completed, not here
+ */
+ iocb->private = ext4_init_io_end(inode, DIO_UNWRITTEN);
+ if (!iocb->private)
+ return -ENOMEM;
+ }
+
+ ret = blockdev_direct_IO_hole(rw, iocb, inode,
+ inode->i_sb->s_bdev, iov,
+ offset, nr_segs,
+ ext4_get_block_dio, ext4_end_io_dio);
+
+ /* In the case of AIO DIO, VFS dio,
+ * does not wait for io complete, here
+ * i_size update needs to be buddled at the
+ * time DIO IO is completed, not here
+ */
+ return ret;
+}
Index: linux-2.6.31-rc4/fs/ext4/Makefile
===================================================================
--- linux-2.6.31-rc4.orig/fs/ext4/Makefile 2009-07-31 14:51:38.000000000 -0700
+++ linux-2.6.31-rc4/fs/ext4/Makefile 2009-07-31 15:07:54.000000000 -0700
@@ -6,8 +6,8 @@ obj-$(CONFIG_EXT4_FS) += ext4.o

ext4-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
ioctl.o namei.o super.o symlink.o hash.o resize.o extents.o \
- ext4_jbd2.o migrate.o mballoc.o block_validity.o move_extent.o
-
+ ext4_jbd2.o migrate.o mballoc.o block_validity.o \
+ move_extent.o dio.o
ext4-$(CONFIG_EXT4_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext4-$(CONFIG_EXT4_FS_POSIX_ACL) += acl.o
ext4-$(CONFIG_EXT4_FS_SECURITY) += xattr_security.o
Index: linux-2.6.31-rc4/fs/ext4/super.c
===================================================================
--- linux-2.6.31-rc4.orig/fs/ext4/super.c 2009-07-31 15:26:23.000000000 -0700
+++ linux-2.6.31-rc4/fs/ext4/super.c 2009-07-31 15:57:52.000000000 -0700
@@ -578,6 +578,9 @@ static void ext4_put_super(struct super_
struct ext4_super_block *es = sbi->s_es;
int i, err;

+ flush_workqueue(sbi->dio_unwritten_wq);
+ destroy_workqueue(sbi->dio_unwritten_wq);
+
lock_super(sb);
lock_kernel();
if (sb->s_dirt)
@@ -2781,6 +2784,12 @@ no_journal:
clear_opt(sbi->s_mount_opt, NOBH);
}
}
+ EXT4_SB(sb)->dio_unwritten_wq = create_workqueue("ext4-dio-unwritten");
+ if (!EXT4_SB(sb)->dio_unwritten_wq) {
+ printk(KERN_ERR "EXT4-fs: failed to create DIO workqueue\n");
+ goto failed_mount_wq;
+ }
+
/*
* The jbd2_journal_load will have done any necessary log recovery,
* so we can safely mount the rest of the filesystem now.
@@ -2893,6 +2902,8 @@ cantfind_ext4:

failed_mount4:
ext4_msg(sb, KERN_ERR, "mount failed");
+ destroy_workqueue(EXT4_SB(sb)->dio_unwritten_wq);
+failed_mount_wq:
ext4_release_system_zone(sb);
if (sbi->s_journal) {
jbd2_journal_destroy(sbi->s_journal);
Index: linux-2.6.31-rc4/fs/direct-io.c
===================================================================
--- linux-2.6.31-rc4.orig/fs/direct-io.c 2009-07-31 17:12:30.000000000 -0700
+++ linux-2.6.31-rc4/fs/direct-io.c 2009-07-31 17:25:17.000000000 -0700
@@ -240,7 +240,7 @@ static int dio_complete(struct dio *dio,
if (dio->end_io && dio->result)
dio->end_io(dio->iocb, offset, transferred,
dio->map_bh.b_private);
- if (dio->lock_type == DIO_LOCKING)
+ if (dio->lock_type == DIO_LOCKING || dio->lock_type == DIO_LOCKING_HOLE)
/* lockdep: non-owner release */
up_read_non_owner(&dio->inode->i_alloc_sem);

@@ -516,6 +516,17 @@ static int get_more_blocks(struct dio *d
map_bh->b_size = fs_count << dio->inode->i_blkbits;

create = dio->rw & WRITE;
+ /*
+ * If the dio lock type does not indicate the underlying
+ * fs could handle unwritten extents, pass create as 0 to
+ * just do a plain block look up. If it is a hole, DIO will
+ * falls back to buffered IO to prevent stale data out after
+ * crash before IO complete.
+ *
+ * if the dio lock type is DIO_LOCKING_HOLE
+ * then dio will preserve the create flag, and let the
+ * underlying filesystem to handle the hole issue.
+ */
if (dio->lock_type == DIO_LOCKING) {
if (dio->block_in_file < (i_size_read(dio->inode) >>
dio->blkbits))
@@ -1042,7 +1053,8 @@ direct_io_worker(int rw, struct kiocb *i
* we can let i_mutex go now that its achieved its purpose
* of protecting us from looking up uninitialized blocks.
*/
- if ((rw == READ) && (dio->lock_type == DIO_LOCKING))
+ if ((rw == READ) && (dio->lock_type == DIO_LOCKING ||
+ dio->lock_type == DIO_LOCKING_HOLE))
mutex_unlock(&dio->inode->i_mutex);

/*
@@ -1097,6 +1109,10 @@ direct_io_worker(int rw, struct kiocb *i
* For reads, i_mutex is not held on entry, but it is taken and dropped before
* returning.
*
+ * DIO_LOCKING_HOLE (simple locking for regular files)
+ * Same as above, except that filesystem support unwritten extents thus
+ * allow DIO to write to holes, no need to fall back to buffered IO.
+ *
* DIO_OWN_LOCKING (filesystem provides synchronisation and handling of
* uninitialised data, allowing parallel direct readers and writers)
* For writes we are called without i_mutex, return without it, never touch it.
@@ -1190,7 +1206,8 @@ __blockdev_direct_IO(int rw, struct kioc
}
}

- if (dio_lock_type == DIO_LOCKING)
+ if (dio_lock_type == DIO_LOCKING ||
+ dio_lock_type == DIO_LOCKING_HOLE)
/* lockdep: not the owner will release it */
down_read_non_owner(&inode->i_alloc_sem);
}
@@ -1220,7 +1237,8 @@ __blockdev_direct_IO(int rw, struct kioc
vmtruncate(inode, isize);
}

- if (rw == READ && dio_lock_type == DIO_LOCKING)
+ if (rw == READ && (dio_lock_type == DIO_LOCKING ||
+ dio_lock_type == DIO_LOCKING_HOLE))
release_i_mutex = 0;

out:
Index: linux-2.6.31-rc4/include/linux/fs.h
===================================================================
--- linux-2.6.31-rc4.orig/include/linux/fs.h 2009-07-31 17:09:07.000000000 -0700
+++ linux-2.6.31-rc4/include/linux/fs.h 2009-07-31 17:20:39.000000000 -0700
@@ -2255,6 +2255,7 @@ enum {
DIO_LOCKING = 1, /* need locking between buffered and direct access */
DIO_NO_LOCKING, /* bdev; no locking at all between buffered/direct */
DIO_OWN_LOCKING, /* filesystem locks buffered and direct internally */
+ DIO_LOCKING_HOLE /* need locking, but allow DIO write to holes */
};

static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
@@ -2283,6 +2284,15 @@ static inline ssize_t blockdev_direct_IO
return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
nr_segs, get_block, end_io, DIO_OWN_LOCKING);
}
+
+static inline ssize_t blockdev_direct_IO_hole(int rw, struct kiocb *iocb,
+ struct inode *inode, struct block_device *bdev, const struct iovec *iov,
+ loff_t offset, unsigned long nr_segs, get_block_t get_block,
+ dio_iodone_t end_io)
+{
+ return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
+ nr_segs, get_block, end_io, DIO_LOCKING_HOLE);
+}
#endif

extern const struct file_operations generic_ro_fops;
Index: linux-2.6.31-rc4/fs/ext4/extents.c
===================================================================
--- linux-2.6.31-rc4.orig/fs/ext4/extents.c 2009-07-31 22:45:58.000000000 -0700
+++ linux-2.6.31-rc4/fs/ext4/extents.c 2009-07-31 23:37:20.000000000 -0700
@@ -3170,6 +3170,72 @@ retry:
return ret > 0 ? ret2 : ret;
}

+long ext4_convert_unwritten(struct inode *inode, loff_t offset, loff_t len)
+{
+ handle_t *handle;
+ ext4_lblk_t block;
+ loff_t new_size;
+ unsigned int max_blocks;
+ int ret = 0;
+ int ret2 = 0;
+ struct buffer_head map_bh;
+ unsigned int credits, blkbits = inode->i_blkbits;
+
+ if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+ return -EOPNOTSUPP;
+
+ block = offset >> blkbits;
+ /*
+ * We can't just convert len to max_blocks because
+ * If blocksize = 4096 offset = 3072 and len = 2048
+ */
+ max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
+ - block;
+ /*
+ * credits to insert 1 extent into extent tree
+ */
+ credits = ext4_chunk_trans_blocks(inode, max_blocks);
+ while (ret >= 0 && ret < max_blocks) {
+ block = block + ret;
+ max_blocks = max_blocks - ret;
+ handle = ext4_journal_start(inode, credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ break;
+ }
+ map_bh.b_state = 0;
+ ret = ext4_get_blocks(handle, inode, block,
+ max_blocks, &map_bh,
+ EXT4_GET_BLOCKS_CONVERT_UNINIT_EXT);
+ if (ret <= 0) {
+ WARN_ON(ret <= 0);
+ printk(KERN_ERR "%s: ext4_ext_get_blocks "
+ "returned error inode#%lu, block=%u, "
+ "max_blocks=%u", __func__,
+ inode->i_ino, block, max_blocks);
+ ext4_mark_inode_dirty(handle, inode);
+ ret2 = ext4_journal_stop(handle);
+ break;
+ }
+
+ if ((block + ret) >= (EXT4_BLOCK_ALIGN(offset + len,
+ blkbits) >> blkbits))
+ new_size = offset + len;
+ else
+ new_size = (block + ret) << blkbits;
+
+ if (new_size > i_size_read(inode))
+ i_size_write(inode, new_size);
+ if (new_size > EXT4_I(inode)->i_disksize)
+ ext4_update_i_disksize(inode, new_size);
+
+ ext4_mark_inode_dirty(handle, inode);
+ ret2 = ext4_journal_stop(handle);
+ if (ret2)
+ break;
+ }
+ return ret > 0 ? ret2 : ret;
+}
/*
* Callback function called for each extent to gather FIEMAP information.
*/




2009-08-03 09:36:31

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC] Insure direct IO writes do not use the page cache

On Thu 30-07-09 16:33:51, Theodore Tso wrote:
> On Thu, Jul 30, 2009 at 08:30:53PM +0200, Jan Kara wrote:
> > I have to say I'm a bit worried about modify-in-place tricks - it's
> > not trivial to make sure buffer is not part of any transaction in the
> > journal, since the buffer head could have been evicted from memory, but
> > the transaction still is not fully checkpointed. Hence in memory, you
> > don't have any evidence of the fact that if the machine crashes, your
> > modify-in-place gets overwritten by journal-replay.
>
> Yeah, good point; tracking which blocks might get overwritten on a
> journal replay is tough. What we *could* do that would make this easier
> is to insert a revoke record for all extent tree blocks after the
> blocks have been written to disk (since at that point there's no need
> for that block to be replayed).
Hmm, but will this help you? You'd have to wait for revoke records to
commit before you can be sure that journal replay won't overwrite your
in-place changes.
Looking at the O_DIRECT semantics, I don't think nobody really requires
the data being on disk after the write() returns and we crash - in
particular if we extend the file, the write will be just an ordinary
buffered write so in practice, it behaves like this already. Given the fact
that only a bit special applications use O_DIRECT, I think we can afford to
make a reservation that O_DIRECT writes even to a preallocated space do not
have any special data-consistency guarantees.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-08-03 16:47:48

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH RFC] ext4 direct IO for holes, fallocate

On Fri, Jul 31, 2009 at 11:56:55PM -0700, Mingming wrote:
> Hi,
>
> I want to throw out my first attempt to fix the DIO fallocate issue, to
> see if the approach makes sense. This is the approach I take will
>
> 1) DIO write to hole will not fall back to buffered IO. holes are
> marked as unwritten extents after block allocation.
> 2) New extents allocated at the end of file is also treated as unwritten
> extents
> 3) those unwritten extents, and fallocate extents, will be converted to
> written extents (and update disk isize if needed)when the IO is
> complete. The conversion is triggered using end_io call back function
> passing from ext4 fs to direct IO.
>
> 4) We need to tell direct IO do not force create =0 for write on holes.
> Does not want use blockdev_dio_own_locking(), instead,add another dio
> type which indicating DIO that fs still needs the lock handling at VFS
> layer, but wants to take care of the hole/unwritten stuff by fs itself.
>
> untested patch below...
>
> Signed-off-by: Mingming Cao <[email protected]>
> ---
> fs/direct-io.c | 26 +++++++-
> fs/ext4/Makefile | 4 -
> fs/ext4/dio.c | 162 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> fs/ext4/ext4.h | 22 ++++++-
> fs/ext4/extents.c | 66 +++++++++++++++++++++
> fs/ext4/inode.c | 83 ---------------------------
> fs/ext4/super.c | 11 +++
> include/linux/fs.h | 10 +++
> 8 files changed, 293 insertions(+), 91 deletions(-)
>
> Index: linux-2.6.31-rc4/fs/ext4/inode.c
> ===================================================================
> --- linux-2.6.31-rc4.orig/fs/ext4/inode.c 2009-07-31 12:57:53.000000000 -0700
> +++ linux-2.6.31-rc4/fs/ext4/inode.c 2009-07-31 14:48:27.000000000 -0700
> @@ -3279,89 +3279,6 @@ static int ext4_releasepage(struct page
> }
>
> /*
> - * If the O_DIRECT write will extend the file then add this inode to the
> - * orphan list. So recovery will truncate it back to the original size
> - * if the machine crashes during the write.
> - *
> - * If the O_DIRECT write is intantiating holes inside i_size and the machine
> - * crashes then stale disk data _may_ be exposed inside the file. But current
> - * VFS code falls back into buffered path in that case so we are safe.
> - */
> -static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
> - const struct iovec *iov, loff_t offset,
> - unsigned long nr_segs)
> -{
> - struct file *file = iocb->ki_filp;
> - struct inode *inode = file->f_mapping->host;
> - struct ext4_inode_info *ei = EXT4_I(inode);
> - handle_t *handle;
> - ssize_t ret;
> - int orphan = 0;
> - size_t count = iov_length(iov, nr_segs);
> -
> - if (rw == WRITE) {
> - loff_t final_size = offset + count;
> -
> - if (final_size > inode->i_size) {
> - /* Credits for sb + inode write */
> - handle = ext4_journal_start(inode, 2);
> - if (IS_ERR(handle)) {
> - ret = PTR_ERR(handle);
> - goto out;
> - }
> - ret = ext4_orphan_add(handle, inode);
> - if (ret) {
> - ext4_journal_stop(handle);
> - goto out;
> - }
> - orphan = 1;
> - ei->i_disksize = inode->i_size;
> - ext4_journal_stop(handle);
> - }
> - }
> -
> - ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
> - offset, nr_segs,
> - ext4_get_block, NULL);
> -
> - if (orphan) {
> - int err;
> -
> - /* Credits for sb + inode write */
> - handle = ext4_journal_start(inode, 2);
> - if (IS_ERR(handle)) {
> - /* This is really bad luck. We've written the data
> - * but cannot extend i_size. Bail out and pretend
> - * the write failed... */
> - ret = PTR_ERR(handle);
> - goto out;
> - }
> - if (inode->i_nlink)
> - ext4_orphan_del(handle, inode);
> - if (ret > 0) {
> - loff_t end = offset + ret;
> - if (end > inode->i_size) {
> - ei->i_disksize = end;
> - block_extend_i_size(inode, offset, ret);
> - /*
> - * We're going to return a positive `ret'
> - * here due to non-zero-length I/O, so there's
> - * no way of reporting error returns from
> - * ext4_mark_inode_dirty() to userspace. So
> - * ignore it.
> - */
> - ext4_mark_inode_dirty(handle, inode);
> - }
> - }
> - err = ext4_journal_stop(handle);
> - if (ret == 0)
> - ret = err;
> - }
> -out:
> - return ret;
> -}
> -
> -/*
> * Pages can be marked dirty completely asynchronously from ext4's journalling
> * activity. By filemap_sync_pte(), try_to_unmap_one(), etc. We cannot do
> * much here because ->set_page_dirty is called under VFS locks. The page is
> Index: linux-2.6.31-rc4/fs/ext4/ext4.h
> ===================================================================
> --- linux-2.6.31-rc4.orig/fs/ext4/ext4.h 2009-07-31 12:57:53.000000000 -0700
> +++ linux-2.6.31-rc4/fs/ext4/ext4.h 2009-07-31 23:31:26.000000000 -0700
> @@ -111,6 +111,15 @@ struct ext4_allocation_request {
> unsigned int flags;
> };
>
> +typedef struct ext4_io_end{
> + struct inode *inode; /* file being written to */
> + unsigned int type; /* unwritten or written */
> + int error; /* I/O error code */
> + ext4_lblk_t offset; /* offset in the file */
> + size_t size; /* size of the extent */
> + struct work_struct work; /* data work queue */
> +}ext4_io_end_t;
> +
> /*
> * Special inodes numbers
> */
> @@ -330,8 +339,8 @@ struct ext4_new_group_data {
> /* Call ext4_da_update_reserve_space() after successfully
> allocating the blocks */
> #define EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE 0x0008
> -
> -
> +#define EXT4_GET_BLOCKS_DIO 0x0010
> +#define EXT4_GET_BLOCKS_CONVERT_UNINIT 0x0020
> /*
> * ioctl commands
> */
> @@ -960,6 +969,9 @@ struct ext4_sb_info {
>
> unsigned int s_log_groups_per_flex;
> struct flex_groups *s_flex_groups;
> +
> + /* workqueue for dio unwritten */
> + struct workqueue_struct *dio_unwritten_wq;
> };
>
> static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
> @@ -1453,6 +1465,10 @@ extern __le16 ext4_group_desc_csum(struc
> struct ext4_group_desc *gdp);
> extern int ext4_group_desc_csum_verify(struct ext4_sb_info *sbi, __u32 group,
> struct ext4_group_desc *gdp);
> +/* dio.c */
> +extern ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
> + const struct iovec *iov, loff_t offset,
> + unsigned long nr_segs);
>
> static inline ext4_fsblk_t ext4_blocks_count(struct ext4_super_block *es)
> {
> @@ -1650,6 +1666,8 @@ extern void ext4_ext_init(struct super_b
> extern void ext4_ext_release(struct super_block *);
> extern long ext4_fallocate(struct inode *inode, int mode, loff_t offset,
> loff_t len);
> +extern long ext4_convert_unwritten(struct inode *inode, loff_t offset,
> + loff_t len);
> extern int ext4_get_blocks(handle_t *handle, struct inode *inode,
> sector_t block, unsigned int max_blocks,
> struct buffer_head *bh, int flags);
> Index: linux-2.6.31-rc4/fs/ext4/dio.c
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.31-rc4/fs/ext4/dio.c 2009-07-31 23:24:25.000000000 -0700
> @@ -0,0 +1,162 @@
> +/*
> + * linux/fs/ext4/dio.c
> + *
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/jbd2.h>
> +#include <linux/buffer_head.h>
> +#include <linux/writeback.h>
> +#include <linux/bio.h>
> +#include <linux/workqueue.h>
> +
> +#include "ext4_jbd2.h"
> +
> +struct workqueue_struct *ext4_unwritten_queue;
> +
> +/* Maximum number of blocks we map for direct IO at once. */
> +#define DIO_MAX_BLOCKS 4096
> +int ext4_get_block_dio(struct inode *inode, sector_t iblock,
> + struct buffer_head *bh_result, int create)
> +{
> + handle_t *handle;
> + int ret = 0, started = 0;
> + unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
> + int dio_credits;
> + loff_t offset = (loff_t)iblock << inode->i_blkbits;
> + unsigned int flags = 0;
> +
> + if (create) {
> + if (max_blocks > DIO_MAX_BLOCKS)
> + max_blocks = DIO_MAX_BLOCKS;
> + dio_credits = ext4_chunk_trans_blocks(inode, max_blocks);
> + handle = ext4_journal_start(inode, dio_credits);
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + goto out;
> + }
> + started = 1;
> +
> + flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;

That would imply we are always creating uninit extent. You will have to make
sure we do the right thing with bh_flag of bh_result in this case. We have
ignored bh_flag update for EXT4_GET_BLOCKS_CREATE_UNINIT_EXT because that was
initially done only from the fallocate call path which didn't look at the flag
at all.



> + }
> +
> + ret = ext4_get_blocks(handle, inode, iblock, max_blocks, bh_result,
> + flags);
> + if (ret > 0) {
> + bh_result->b_size = (ret << inode->i_blkbits);
> + ret = 0;
> + }
> + if (started)
> + ext4_journal_stop(handle);
> +out:
> + return ret;
> +}
> +
> +#define DIO_UNWRITTEN 0x1
> +
> +/*
> + * IO write completion for unwritten extents.
> + *
> + * convert a buffer range from unwritten to written extents.
> + */
> +static void ext4_end_dio_unwritten(struct work_struct *work)
> +{
> + ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
> + struct inode *inode = io->inode;
> + loff_t offset = io->offset;
> + size_t size = io->size;
> +
> + ext4_convert_unwritten(inode, offset, size);
> +
> + kfree(io);
> +}
> +
> +ext4_io_end_t *ext4_init_io_end (struct inode *inode, unsigned int type)
> +{
> + ext4_io_end_t *io = NULL;
> +
> + io = kmalloc(sizeof(*io), GFP_NOFS);
> +
> + if (io) {
> + io->inode = inode;
> + io->type = type;
> + io->offset = 0;
> + io->size = 0;
> + io->error = 0;
> + INIT_WORK(&io->work, ext4_end_dio_unwritten);
> + }
> +
> + return io;
> +}
> +
> +static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
> + ssize_t size, void *private)
> +{
> + ext4_io_end_t *io_end = iocb->private;
> + struct workqueue_struct *wq;
> + /*FIX ME call ext4 code to convert the extents, commit the log*/
> +
> + /* if not hole or unwritten extents, just simple return */
> + if (!io_endi || !size)
> + return;
> + io_end->offset = offset;
> + io_end->size = size;
> + wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
> +
> + /* We need to convert unwritten extents to written */
> + queue_work(wq, &io_end->work);
> +
> + if (is_sync_kiocb(iocb))
> + flush_workqueue(wq);
> +
> + iocb->private = NULL;
> +}
> +/*
> + * If the O_DIRECT write will extend the file then add this inode to the
> + * orphan list. So recovery will truncate it back to the original size
> + * if the machine crashes during the write.
> + *
> + * If the O_DIRECT write is filling unintialized space inside i_size,
> + * convertion to initialized extent should happen after data written to disk
> + * to ensure we don't expose stale disk data after crash.
> + */
> +ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
> + const struct iovec *iov, loff_t offset,
> + unsigned long nr_segs)
> +{
> + struct file *file = iocb->ki_filp;
> + struct inode *inode = file->f_mapping->host;
> + struct ext4_inode_info *ei = EXT4_I(inode);
> + handle_t *handle;
> + ssize_t ret;
> + int orphan = 0;
> + size_t count = iov_length(iov, nr_segs);
> +
> + if (rw == WRITE) {
> + /*
> + * we treat blocks allocated for DIO as
> + * as unwritten extents, need to convert them
> + * to written extents when IO is complete.
> + *
> + * In the case of AIO DIO, since VFS dio code
> + * does not wait for io complete, here
> + * i_size update needs to be buddled at the
> + * time DIO IO is completed, not here
> + */
> + iocb->private = ext4_init_io_end(inode, DIO_UNWRITTEN);
> + if (!iocb->private)
> + return -ENOMEM;
> + }
> +
> + ret = blockdev_direct_IO_hole(rw, iocb, inode,
> + inode->i_sb->s_bdev, iov,
> + offset, nr_segs,
> + ext4_get_block_dio, ext4_end_io_dio);
> +
> + /* In the case of AIO DIO, VFS dio,
> + * does not wait for io complete, here
> + * i_size update needs to be buddled at the
> + * time DIO IO is completed, not here
> + */
> + return ret;
> +}
> Index: linux-2.6.31-rc4/fs/ext4/Makefile
> ===================================================================
> --- linux-2.6.31-rc4.orig/fs/ext4/Makefile 2009-07-31 14:51:38.000000000 -0700
> +++ linux-2.6.31-rc4/fs/ext4/Makefile 2009-07-31 15:07:54.000000000 -0700
> @@ -6,8 +6,8 @@ obj-$(CONFIG_EXT4_FS) += ext4.o
>
> ext4-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
> ioctl.o namei.o super.o symlink.o hash.o resize.o extents.o \
> - ext4_jbd2.o migrate.o mballoc.o block_validity.o move_extent.o
> -
> + ext4_jbd2.o migrate.o mballoc.o block_validity.o \
> + move_extent.o dio.o
> ext4-$(CONFIG_EXT4_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
> ext4-$(CONFIG_EXT4_FS_POSIX_ACL) += acl.o
> ext4-$(CONFIG_EXT4_FS_SECURITY) += xattr_security.o
> Index: linux-2.6.31-rc4/fs/ext4/super.c
> ===================================================================
> --- linux-2.6.31-rc4.orig/fs/ext4/super.c 2009-07-31 15:26:23.000000000 -0700
> +++ linux-2.6.31-rc4/fs/ext4/super.c 2009-07-31 15:57:52.000000000 -0700
> @@ -578,6 +578,9 @@ static void ext4_put_super(struct super_
> struct ext4_super_block *es = sbi->s_es;
> int i, err;
>
> + flush_workqueue(sbi->dio_unwritten_wq);
> + destroy_workqueue(sbi->dio_unwritten_wq);
> +
> lock_super(sb);
> lock_kernel();
> if (sb->s_dirt)
> @@ -2781,6 +2784,12 @@ no_journal:
> clear_opt(sbi->s_mount_opt, NOBH);
> }
> }
> + EXT4_SB(sb)->dio_unwritten_wq = create_workqueue("ext4-dio-unwritten");
> + if (!EXT4_SB(sb)->dio_unwritten_wq) {
> + printk(KERN_ERR "EXT4-fs: failed to create DIO workqueue\n");
> + goto failed_mount_wq;
> + }
> +
> /*
> * The jbd2_journal_load will have done any necessary log recovery,
> * so we can safely mount the rest of the filesystem now.
> @@ -2893,6 +2902,8 @@ cantfind_ext4:
>
> failed_mount4:
> ext4_msg(sb, KERN_ERR, "mount failed");
> + destroy_workqueue(EXT4_SB(sb)->dio_unwritten_wq);
> +failed_mount_wq:
> ext4_release_system_zone(sb);
> if (sbi->s_journal) {
> jbd2_journal_destroy(sbi->s_journal);
> Index: linux-2.6.31-rc4/fs/direct-io.c
> ===================================================================
> --- linux-2.6.31-rc4.orig/fs/direct-io.c 2009-07-31 17:12:30.000000000 -0700
> +++ linux-2.6.31-rc4/fs/direct-io.c 2009-07-31 17:25:17.000000000 -0700
> @@ -240,7 +240,7 @@ static int dio_complete(struct dio *dio,
> if (dio->end_io && dio->result)
> dio->end_io(dio->iocb, offset, transferred,
> dio->map_bh.b_private);
> - if (dio->lock_type == DIO_LOCKING)
> + if (dio->lock_type == DIO_LOCKING || dio->lock_type == DIO_LOCKING_HOLE)
> /* lockdep: non-owner release */
> up_read_non_owner(&dio->inode->i_alloc_sem);
>
> @@ -516,6 +516,17 @@ static int get_more_blocks(struct dio *d
> map_bh->b_size = fs_count << dio->inode->i_blkbits;
>
> create = dio->rw & WRITE;
> + /*
> + * If the dio lock type does not indicate the underlying
> + * fs could handle unwritten extents, pass create as 0 to
> + * just do a plain block look up. If it is a hole, DIO will
> + * falls back to buffered IO to prevent stale data out after
> + * crash before IO complete.
> + *
> + * if the dio lock type is DIO_LOCKING_HOLE
> + * then dio will preserve the create flag, and let the
> + * underlying filesystem to handle the hole issue.
> + */
> if (dio->lock_type == DIO_LOCKING) {
> if (dio->block_in_file < (i_size_read(dio->inode) >>
> dio->blkbits))
> @@ -1042,7 +1053,8 @@ direct_io_worker(int rw, struct kiocb *i
> * we can let i_mutex go now that its achieved its purpose
> * of protecting us from looking up uninitialized blocks.
> */
> - if ((rw == READ) && (dio->lock_type == DIO_LOCKING))
> + if ((rw == READ) && (dio->lock_type == DIO_LOCKING ||
> + dio->lock_type == DIO_LOCKING_HOLE))
> mutex_unlock(&dio->inode->i_mutex);
>
> /*
> @@ -1097,6 +1109,10 @@ direct_io_worker(int rw, struct kiocb *i
> * For reads, i_mutex is not held on entry, but it is taken and dropped before
> * returning.
> *
> + * DIO_LOCKING_HOLE (simple locking for regular files)
> + * Same as above, except that filesystem support unwritten extents thus
> + * allow DIO to write to holes, no need to fall back to buffered IO.
> + *
> * DIO_OWN_LOCKING (filesystem provides synchronisation and handling of
> * uninitialised data, allowing parallel direct readers and writers)
> * For writes we are called without i_mutex, return without it, never touch it.
> @@ -1190,7 +1206,8 @@ __blockdev_direct_IO(int rw, struct kioc
> }
> }
>
> - if (dio_lock_type == DIO_LOCKING)
> + if (dio_lock_type == DIO_LOCKING ||
> + dio_lock_type == DIO_LOCKING_HOLE)
> /* lockdep: not the owner will release it */
> down_read_non_owner(&inode->i_alloc_sem);
> }
> @@ -1220,7 +1237,8 @@ __blockdev_direct_IO(int rw, struct kioc
> vmtruncate(inode, isize);
> }
>
> - if (rw == READ && dio_lock_type == DIO_LOCKING)
> + if (rw == READ && (dio_lock_type == DIO_LOCKING ||
> + dio_lock_type == DIO_LOCKING_HOLE))
> release_i_mutex = 0;
>
> out:
> Index: linux-2.6.31-rc4/include/linux/fs.h
> ===================================================================
> --- linux-2.6.31-rc4.orig/include/linux/fs.h 2009-07-31 17:09:07.000000000 -0700
> +++ linux-2.6.31-rc4/include/linux/fs.h 2009-07-31 17:20:39.000000000 -0700
> @@ -2255,6 +2255,7 @@ enum {
> DIO_LOCKING = 1, /* need locking between buffered and direct access */
> DIO_NO_LOCKING, /* bdev; no locking at all between buffered/direct */
> DIO_OWN_LOCKING, /* filesystem locks buffered and direct internally */
> + DIO_LOCKING_HOLE /* need locking, but allow DIO write to holes */
> };
>
> static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
> @@ -2283,6 +2284,15 @@ static inline ssize_t blockdev_direct_IO
> return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
> nr_segs, get_block, end_io, DIO_OWN_LOCKING);
> }
> +
> +static inline ssize_t blockdev_direct_IO_hole(int rw, struct kiocb *iocb,
> + struct inode *inode, struct block_device *bdev, const struct iovec *iov,
> + loff_t offset, unsigned long nr_segs, get_block_t get_block,
> + dio_iodone_t end_io)
> +{
> + return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
> + nr_segs, get_block, end_io, DIO_LOCKING_HOLE);
> +}
> #endif
>
> extern const struct file_operations generic_ro_fops;
> Index: linux-2.6.31-rc4/fs/ext4/extents.c
> ===================================================================
> --- linux-2.6.31-rc4.orig/fs/ext4/extents.c 2009-07-31 22:45:58.000000000 -0700
> +++ linux-2.6.31-rc4/fs/ext4/extents.c 2009-07-31 23:37:20.000000000 -0700
> @@ -3170,6 +3170,72 @@ retry:
> return ret > 0 ? ret2 : ret;
> }
>
> +long ext4_convert_unwritten(struct inode *inode, loff_t offset, loff_t len)
> +{
> + handle_t *handle;
> + ext4_lblk_t block;
> + loff_t new_size;
> + unsigned int max_blocks;
> + int ret = 0;
> + int ret2 = 0;
> + struct buffer_head map_bh;
> + unsigned int credits, blkbits = inode->i_blkbits;
> +
> + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> + return -EOPNOTSUPP;
> +
> + block = offset >> blkbits;
> + /*
> + * We can't just convert len to max_blocks because
> + * If blocksize = 4096 offset = 3072 and len = 2048
> + */
> + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> + - block;
> + /*
> + * credits to insert 1 extent into extent tree
> + */
> + credits = ext4_chunk_trans_blocks(inode, max_blocks);
> + while (ret >= 0 && ret < max_blocks) {
> + block = block + ret;
> + max_blocks = max_blocks - ret;
> + handle = ext4_journal_start(inode, credits);
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + break;
> + }
> + map_bh.b_state = 0;
> + ret = ext4_get_blocks(handle, inode, block,
> + max_blocks, &map_bh,
> + EXT4_GET_BLOCKS_CONVERT_UNINIT_EXT)

Where is EXT4_GET_BLOCKS_CONVERT_UNINIT_EXT defined ? Assuming it is
EXT4_GET_BLOCKS_CREATE, it implies we can endup doing block allocation
here. If we don't have enough space, we zero-out the full extent. Which
implies we would endup overwritting the already wrote contents.

;
> + if (ret <= 0) {
> + WARN_ON(ret <= 0);
> + printk(KERN_ERR "%s: ext4_ext_get_blocks "
> + "returned error inode#%lu, block=%u, "
> + "max_blocks=%u", __func__,
> + inode->i_ino, block, max_blocks);
> + ext4_mark_inode_dirty(handle, inode);
> + ret2 = ext4_journal_stop(handle);
> + break;
> + }
> +
> + if ((block + ret) >= (EXT4_BLOCK_ALIGN(offset + len,
> + blkbits) >> blkbits))
> + new_size = offset + len;
> + else
> + new_size = (block + ret) << blkbits;
> +
> + if (new_size > i_size_read(inode))
> + i_size_write(inode, new_size);

shouldn't i_size update always happen with inode->i_mutex held. I guess
we can't do a i_size update in end_io call back.


> + if (new_size > EXT4_I(inode)->i_disksize)
> + ext4_update_i_disksize(inode, new_size);
> +
> + ext4_mark_inode_dirty(handle, inode);
> + ret2 = ext4_journal_stop(handle);
> + if (ret2)
> + break;
> + }
> + return ret > 0 ? ret2 : ret;
> +}
> /*
> * Callback function called for each extent to gather FIEMAP information.
> */
>
>


Can you send it as a multi patch series. Having only ext4 specific changes
to look at will makes this review easier.

-aneesh

2009-08-03 23:40:12

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH RFC] ext4 direct IO for holes, fallocate

On Mon, 2009-08-03 at 22:17 +0530, Aneesh Kumar K.V wrote:
> On Fri, Jul 31, 2009 at 11:56:55PM -0700, Mingming wrote:
> > Hi,
> >
> > I want to throw out my first attempt to fix the DIO fallocate issue, to
> > see if the approach makes sense. This is the approach I take will
> >
> > 1) DIO write to hole will not fall back to buffered IO. holes are
> > marked as unwritten extents after block allocation.
> > 2) New extents allocated at the end of file is also treated as unwritten
> > extents
> > 3) those unwritten extents, and fallocate extents, will be converted to
> > written extents (and update disk isize if needed)when the IO is
> > complete. The conversion is triggered using end_io call back function
> > passing from ext4 fs to direct IO.
> >
> > 4) We need to tell direct IO do not force create =0 for write on holes.
> > Does not want use blockdev_dio_own_locking(), instead,add another dio
> > type which indicating DIO that fs still needs the lock handling at VFS
> > layer, but wants to take care of the hole/unwritten stuff by fs itself.
> >
> > untested patch below...
> >
> > Signed-off-by: Mingming Cao <[email protected]>
> > ---
> > fs/direct-io.c | 26 +++++++-
> > fs/ext4/Makefile | 4 -
> > fs/ext4/dio.c | 162 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > fs/ext4/ext4.h | 22 ++++++-
> > fs/ext4/extents.c | 66 +++++++++++++++++++++
> > fs/ext4/inode.c | 83 ---------------------------
> > fs/ext4/super.c | 11 +++
> > include/linux/fs.h | 10 +++
> > 8 files changed, 293 insertions(+), 91 deletions(-)
> >
> > Index: linux-2.6.31-rc4/fs/ext4/inode.c
> > ===================================================================
> > --- linux-2.6.31-rc4.orig/fs/ext4/inode.c 2009-07-31 12:57:53.000000000 -0700
> > +++ linux-2.6.31-rc4/fs/ext4/inode.c 2009-07-31 14:48:27.000000000 -0700
> > @@ -3279,89 +3279,6 @@ static int ext4_releasepage(struct page
> > }
> >
> > /*
> > - * If the O_DIRECT write will extend the file then add this inode to the
> > - * orphan list. So recovery will truncate it back to the original size
> > - * if the machine crashes during the write.
> > - *
> > - * If the O_DIRECT write is intantiating holes inside i_size and the machine
> > - * crashes then stale disk data _may_ be exposed inside the file. But current
> > - * VFS code falls back into buffered path in that case so we are safe.
> > - */
> > -static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
> > - const struct iovec *iov, loff_t offset,
> > - unsigned long nr_segs)
> > -{
> > - struct file *file = iocb->ki_filp;
> > - struct inode *inode = file->f_mapping->host;
> > - struct ext4_inode_info *ei = EXT4_I(inode);
> > - handle_t *handle;
> > - ssize_t ret;
> > - int orphan = 0;
> > - size_t count = iov_length(iov, nr_segs);
> > -
> > - if (rw == WRITE) {
> > - loff_t final_size = offset + count;
> > -
> > - if (final_size > inode->i_size) {
> > - /* Credits for sb + inode write */
> > - handle = ext4_journal_start(inode, 2);
> > - if (IS_ERR(handle)) {
> > - ret = PTR_ERR(handle);
> > - goto out;
> > - }
> > - ret = ext4_orphan_add(handle, inode);
> > - if (ret) {
> > - ext4_journal_stop(handle);
> > - goto out;
> > - }
> > - orphan = 1;
> > - ei->i_disksize = inode->i_size;
> > - ext4_journal_stop(handle);
> > - }
> > - }
> > -
> > - ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
> > - offset, nr_segs,
> > - ext4_get_block, NULL);
> > -
> > - if (orphan) {
> > - int err;
> > -
> > - /* Credits for sb + inode write */
> > - handle = ext4_journal_start(inode, 2);
> > - if (IS_ERR(handle)) {
> > - /* This is really bad luck. We've written the data
> > - * but cannot extend i_size. Bail out and pretend
> > - * the write failed... */
> > - ret = PTR_ERR(handle);
> > - goto out;
> > - }
> > - if (inode->i_nlink)
> > - ext4_orphan_del(handle, inode);
> > - if (ret > 0) {
> > - loff_t end = offset + ret;
> > - if (end > inode->i_size) {
> > - ei->i_disksize = end;
> > - block_extend_i_size(inode, offset, ret);
> > - /*
> > - * We're going to return a positive `ret'
> > - * here due to non-zero-length I/O, so there's
> > - * no way of reporting error returns from
> > - * ext4_mark_inode_dirty() to userspace. So
> > - * ignore it.
> > - */
> > - ext4_mark_inode_dirty(handle, inode);
> > - }
> > - }
> > - err = ext4_journal_stop(handle);
> > - if (ret == 0)
> > - ret = err;
> > - }
> > -out:
> > - return ret;
> > -}
> > -
> > -/*
> > * Pages can be marked dirty completely asynchronously from ext4's journalling
> > * activity. By filemap_sync_pte(), try_to_unmap_one(), etc. We cannot do
> > * much here because ->set_page_dirty is called under VFS locks. The page is
> > Index: linux-2.6.31-rc4/fs/ext4/ext4.h
> > ===================================================================
> > --- linux-2.6.31-rc4.orig/fs/ext4/ext4.h 2009-07-31 12:57:53.000000000 -0700
> > +++ linux-2.6.31-rc4/fs/ext4/ext4.h 2009-07-31 23:31:26.000000000 -0700
> > @@ -111,6 +111,15 @@ struct ext4_allocation_request {
> > unsigned int flags;
> > };
> >
> > +typedef struct ext4_io_end{
> > + struct inode *inode; /* file being written to */
> > + unsigned int type; /* unwritten or written */
> > + int error; /* I/O error code */
> > + ext4_lblk_t offset; /* offset in the file */
> > + size_t size; /* size of the extent */
> > + struct work_struct work; /* data work queue */
> > +}ext4_io_end_t;
> > +
> > /*
> > * Special inodes numbers
> > */
> > @@ -330,8 +339,8 @@ struct ext4_new_group_data {
> > /* Call ext4_da_update_reserve_space() after successfully
> > allocating the blocks */
> > #define EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE 0x0008
> > -
> > -
> > +#define EXT4_GET_BLOCKS_DIO 0x0010
> > +#define EXT4_GET_BLOCKS_CONVERT_UNINIT 0x0020
> > /*
> > * ioctl commands
> > */
> > @@ -960,6 +969,9 @@ struct ext4_sb_info {
> >
> > unsigned int s_log_groups_per_flex;
> > struct flex_groups *s_flex_groups;
> > +
> > + /* workqueue for dio unwritten */
> > + struct workqueue_struct *dio_unwritten_wq;
> > };
> >
> > static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
> > @@ -1453,6 +1465,10 @@ extern __le16 ext4_group_desc_csum(struc
> > struct ext4_group_desc *gdp);
> > extern int ext4_group_desc_csum_verify(struct ext4_sb_info *sbi, __u32 group,
> > struct ext4_group_desc *gdp);
> > +/* dio.c */
> > +extern ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
> > + const struct iovec *iov, loff_t offset,
> > + unsigned long nr_segs);
> >
> > static inline ext4_fsblk_t ext4_blocks_count(struct ext4_super_block *es)
> > {
> > @@ -1650,6 +1666,8 @@ extern void ext4_ext_init(struct super_b
> > extern void ext4_ext_release(struct super_block *);
> > extern long ext4_fallocate(struct inode *inode, int mode, loff_t offset,
> > loff_t len);
> > +extern long ext4_convert_unwritten(struct inode *inode, loff_t offset,
> > + loff_t len);
> > extern int ext4_get_blocks(handle_t *handle, struct inode *inode,
> > sector_t block, unsigned int max_blocks,
> > struct buffer_head *bh, int flags);
> > Index: linux-2.6.31-rc4/fs/ext4/dio.c
> > ===================================================================
> > --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> > +++ linux-2.6.31-rc4/fs/ext4/dio.c 2009-07-31 23:24:25.000000000 -0700
> > @@ -0,0 +1,162 @@
> > +/*
> > + * linux/fs/ext4/dio.c
> > + *
> > + */
> > +
> > +#include <linux/fs.h>
> > +#include <linux/jbd2.h>
> > +#include <linux/buffer_head.h>
> > +#include <linux/writeback.h>
> > +#include <linux/bio.h>
> > +#include <linux/workqueue.h>
> > +
> > +#include "ext4_jbd2.h"
> > +
> > +struct workqueue_struct *ext4_unwritten_queue;
> > +
> > +/* Maximum number of blocks we map for direct IO at once. */
> > +#define DIO_MAX_BLOCKS 4096
> > +int ext4_get_block_dio(struct inode *inode, sector_t iblock,
> > + struct buffer_head *bh_result, int create)
> > +{
> > + handle_t *handle;
> > + int ret = 0, started = 0;
> > + unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
> > + int dio_credits;
> > + loff_t offset = (loff_t)iblock << inode->i_blkbits;
> > + unsigned int flags = 0;
> > +
> > + if (create) {
> > + if (max_blocks > DIO_MAX_BLOCKS)
> > + max_blocks = DIO_MAX_BLOCKS;
> > + dio_credits = ext4_chunk_trans_blocks(inode, max_blocks);
> > + handle = ext4_journal_start(inode, dio_credits);
> > + if (IS_ERR(handle)) {
> > + ret = PTR_ERR(handle);
> > + goto out;
> > + }
> > + started = 1;
> > +
> > + flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
>
> That would imply we are always creating uninit extent. You will have to make
> sure we do the right thing with bh_flag of bh_result in this case. We have
> ignored bh_flag update for EXT4_GET_BLOCKS_CREATE_UNINIT_EXT because that was
> initially done only from the fallocate call path which didn't look at the flag
> at all.
>

I have checked the return from ext4_ext_get_blocks(), with the
EXT4_GET_BLOCKS_CREATE_UNINIT_EXT flag, on success of block fallocate
for holes/extend file size, the buffer is always marked mapped. If we do
a EXT4_GET_BLOCKS_CREATE_UNINIT_EXT on an already-fallocated-extent,
then the bh is marked as mapped too. They all go out to the "out"
at the end.
>
>
> > + }
> > +
> > + ret = ext4_get_blocks(handle, inode, iblock, max_blocks, bh_result,
> > + flags);
> > + if (ret > 0) {
> > + bh_result->b_size = (ret << inode->i_blkbits);
> > + ret = 0;
> > + }
> > + if (started)
> > + ext4_journal_stop(handle);
> > +out:
> > + return ret;
> > +}
> > +
> > +#define DIO_UNWRITTEN 0x1
> > +
> > +/*
> > + * IO write completion for unwritten extents.
> > + *
> > + * convert a buffer range from unwritten to written extents.
> > + */
> > +static void ext4_end_dio_unwritten(struct work_struct *work)
> > +{
> > + ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
> > + struct inode *inode = io->inode;
> > + loff_t offset = io->offset;
> > + size_t size = io->size;
> > +
> > + ext4_convert_unwritten(inode, offset, size);
> > +
> > + kfree(io);
> > +}
> > +
> > +ext4_io_end_t *ext4_init_io_end (struct inode *inode, unsigned int type)
> > +{
> > + ext4_io_end_t *io = NULL;
> > +
> > + io = kmalloc(sizeof(*io), GFP_NOFS);
> > +
> > + if (io) {
> > + io->inode = inode;
> > + io->type = type;
> > + io->offset = 0;
> > + io->size = 0;
> > + io->error = 0;
> > + INIT_WORK(&io->work, ext4_end_dio_unwritten);
> > + }
> > +
> > + return io;
> > +}
> > +
> > +static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
> > + ssize_t size, void *private)
> > +{
> > + ext4_io_end_t *io_end = iocb->private;
> > + struct workqueue_struct *wq;
> > + /*FIX ME call ext4 code to convert the extents, commit the log*/
> > +
> > + /* if not hole or unwritten extents, just simple return */
> > + if (!io_endi || !size)
> > + return;
> > + io_end->offset = offset;
> > + io_end->size = size;
> > + wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
> > +
> > + /* We need to convert unwritten extents to written */
> > + queue_work(wq, &io_end->work);
> > +
> > + if (is_sync_kiocb(iocb))
> > + flush_workqueue(wq);
> > +
> > + iocb->private = NULL;
> > +}
> > +/*
> > + * If the O_DIRECT write will extend the file then add this inode to the
> > + * orphan list. So recovery will truncate it back to the original size
> > + * if the machine crashes during the write.
> > + *
> > + * If the O_DIRECT write is filling unintialized space inside i_size,
> > + * convertion to initialized extent should happen after data written to disk
> > + * to ensure we don't expose stale disk data after crash.
> > + */
> > +ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
> > + const struct iovec *iov, loff_t offset,
> > + unsigned long nr_segs)
> > +{
> > + struct file *file = iocb->ki_filp;
> > + struct inode *inode = file->f_mapping->host;
> > + struct ext4_inode_info *ei = EXT4_I(inode);
> > + handle_t *handle;
> > + ssize_t ret;
> > + int orphan = 0;
> > + size_t count = iov_length(iov, nr_segs);
> > +
> > + if (rw == WRITE) {
> > + /*
> > + * we treat blocks allocated for DIO as
> > + * as unwritten extents, need to convert them
> > + * to written extents when IO is complete.
> > + *
> > + * In the case of AIO DIO, since VFS dio code
> > + * does not wait for io complete, here
> > + * i_size update needs to be buddled at the
> > + * time DIO IO is completed, not here
> > + */
> > + iocb->private = ext4_init_io_end(inode, DIO_UNWRITTEN);
> > + if (!iocb->private)
> > + return -ENOMEM;
> > + }
> > +
> > + ret = blockdev_direct_IO_hole(rw, iocb, inode,
> > + inode->i_sb->s_bdev, iov,
> > + offset, nr_segs,
> > + ext4_get_block_dio, ext4_end_io_dio);
> > +
> > + /* In the case of AIO DIO, VFS dio,
> > + * does not wait for io complete, here
> > + * i_size update needs to be buddled at the
> > + * time DIO IO is completed, not here
> > + */
> > + return ret;
> > +}
> > Index: linux-2.6.31-rc4/fs/ext4/Makefile
> > ===================================================================
> > --- linux-2.6.31-rc4.orig/fs/ext4/Makefile 2009-07-31 14:51:38.000000000 -0700
> > +++ linux-2.6.31-rc4/fs/ext4/Makefile 2009-07-31 15:07:54.000000000 -0700
> > @@ -6,8 +6,8 @@ obj-$(CONFIG_EXT4_FS) += ext4.o
> >
> > ext4-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
> > ioctl.o namei.o super.o symlink.o hash.o resize.o extents.o \
> > - ext4_jbd2.o migrate.o mballoc.o block_validity.o move_extent.o
> > -
> > + ext4_jbd2.o migrate.o mballoc.o block_validity.o \
> > + move_extent.o dio.o
> > ext4-$(CONFIG_EXT4_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
> > ext4-$(CONFIG_EXT4_FS_POSIX_ACL) += acl.o
> > ext4-$(CONFIG_EXT4_FS_SECURITY) += xattr_security.o
> > Index: linux-2.6.31-rc4/fs/ext4/super.c
> > ===================================================================
> > --- linux-2.6.31-rc4.orig/fs/ext4/super.c 2009-07-31 15:26:23.000000000 -0700
> > +++ linux-2.6.31-rc4/fs/ext4/super.c 2009-07-31 15:57:52.000000000 -0700
> > @@ -578,6 +578,9 @@ static void ext4_put_super(struct super_
> > struct ext4_super_block *es = sbi->s_es;
> > int i, err;
> >
> > + flush_workqueue(sbi->dio_unwritten_wq);
> > + destroy_workqueue(sbi->dio_unwritten_wq);
> > +
> > lock_super(sb);
> > lock_kernel();
> > if (sb->s_dirt)
> > @@ -2781,6 +2784,12 @@ no_journal:
> > clear_opt(sbi->s_mount_opt, NOBH);
> > }
> > }
> > + EXT4_SB(sb)->dio_unwritten_wq = create_workqueue("ext4-dio-unwritten");
> > + if (!EXT4_SB(sb)->dio_unwritten_wq) {
> > + printk(KERN_ERR "EXT4-fs: failed to create DIO workqueue\n");
> > + goto failed_mount_wq;
> > + }
> > +
> > /*
> > * The jbd2_journal_load will have done any necessary log recovery,
> > * so we can safely mount the rest of the filesystem now.
> > @@ -2893,6 +2902,8 @@ cantfind_ext4:
> >
> > failed_mount4:
> > ext4_msg(sb, KERN_ERR, "mount failed");
> > + destroy_workqueue(EXT4_SB(sb)->dio_unwritten_wq);
> > +failed_mount_wq:
> > ext4_release_system_zone(sb);
> > if (sbi->s_journal) {
> > jbd2_journal_destroy(sbi->s_journal);
> > Index: linux-2.6.31-rc4/fs/direct-io.c
> > ===================================================================
> > --- linux-2.6.31-rc4.orig/fs/direct-io.c 2009-07-31 17:12:30.000000000 -0700
> > +++ linux-2.6.31-rc4/fs/direct-io.c 2009-07-31 17:25:17.000000000 -0700
> > @@ -240,7 +240,7 @@ static int dio_complete(struct dio *dio,
> > if (dio->end_io && dio->result)
> > dio->end_io(dio->iocb, offset, transferred,
> > dio->map_bh.b_private);
> > - if (dio->lock_type == DIO_LOCKING)
> > + if (dio->lock_type == DIO_LOCKING || dio->lock_type == DIO_LOCKING_HOLE)
> > /* lockdep: non-owner release */
> > up_read_non_owner(&dio->inode->i_alloc_sem);
> >
> > @@ -516,6 +516,17 @@ static int get_more_blocks(struct dio *d
> > map_bh->b_size = fs_count << dio->inode->i_blkbits;
> >
> > create = dio->rw & WRITE;
> > + /*
> > + * If the dio lock type does not indicate the underlying
> > + * fs could handle unwritten extents, pass create as 0 to
> > + * just do a plain block look up. If it is a hole, DIO will
> > + * falls back to buffered IO to prevent stale data out after
> > + * crash before IO complete.
> > + *
> > + * if the dio lock type is DIO_LOCKING_HOLE
> > + * then dio will preserve the create flag, and let the
> > + * underlying filesystem to handle the hole issue.
> > + */
> > if (dio->lock_type == DIO_LOCKING) {
> > if (dio->block_in_file < (i_size_read(dio->inode) >>
> > dio->blkbits))
> > @@ -1042,7 +1053,8 @@ direct_io_worker(int rw, struct kiocb *i
> > * we can let i_mutex go now that its achieved its purpose
> > * of protecting us from looking up uninitialized blocks.
> > */
> > - if ((rw == READ) && (dio->lock_type == DIO_LOCKING))
> > + if ((rw == READ) && (dio->lock_type == DIO_LOCKING ||
> > + dio->lock_type == DIO_LOCKING_HOLE))
> > mutex_unlock(&dio->inode->i_mutex);
> >
> > /*
> > @@ -1097,6 +1109,10 @@ direct_io_worker(int rw, struct kiocb *i
> > * For reads, i_mutex is not held on entry, but it is taken and dropped before
> > * returning.
> > *
> > + * DIO_LOCKING_HOLE (simple locking for regular files)
> > + * Same as above, except that filesystem support unwritten extents thus
> > + * allow DIO to write to holes, no need to fall back to buffered IO.
> > + *
> > * DIO_OWN_LOCKING (filesystem provides synchronisation and handling of
> > * uninitialised data, allowing parallel direct readers and writers)
> > * For writes we are called without i_mutex, return without it, never touch it.
> > @@ -1190,7 +1206,8 @@ __blockdev_direct_IO(int rw, struct kioc
> > }
> > }
> >
> > - if (dio_lock_type == DIO_LOCKING)
> > + if (dio_lock_type == DIO_LOCKING ||
> > + dio_lock_type == DIO_LOCKING_HOLE)
> > /* lockdep: not the owner will release it */
> > down_read_non_owner(&inode->i_alloc_sem);
> > }
> > @@ -1220,7 +1237,8 @@ __blockdev_direct_IO(int rw, struct kioc
> > vmtruncate(inode, isize);
> > }
> >
> > - if (rw == READ && dio_lock_type == DIO_LOCKING)
> > + if (rw == READ && (dio_lock_type == DIO_LOCKING ||
> > + dio_lock_type == DIO_LOCKING_HOLE))
> > release_i_mutex = 0;
> >
> > out:
> > Index: linux-2.6.31-rc4/include/linux/fs.h
> > ===================================================================
> > --- linux-2.6.31-rc4.orig/include/linux/fs.h 2009-07-31 17:09:07.000000000 -0700
> > +++ linux-2.6.31-rc4/include/linux/fs.h 2009-07-31 17:20:39.000000000 -0700
> > @@ -2255,6 +2255,7 @@ enum {
> > DIO_LOCKING = 1, /* need locking between buffered and direct access */
> > DIO_NO_LOCKING, /* bdev; no locking at all between buffered/direct */
> > DIO_OWN_LOCKING, /* filesystem locks buffered and direct internally */
> > + DIO_LOCKING_HOLE /* need locking, but allow DIO write to holes */
> > };
> >
> > static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
> > @@ -2283,6 +2284,15 @@ static inline ssize_t blockdev_direct_IO
> > return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
> > nr_segs, get_block, end_io, DIO_OWN_LOCKING);
> > }
> > +
> > +static inline ssize_t blockdev_direct_IO_hole(int rw, struct kiocb *iocb,
> > + struct inode *inode, struct block_device *bdev, const struct iovec *iov,
> > + loff_t offset, unsigned long nr_segs, get_block_t get_block,
> > + dio_iodone_t end_io)
> > +{
> > + return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
> > + nr_segs, get_block, end_io, DIO_LOCKING_HOLE);
> > +}
> > #endif
> >
> > extern const struct file_operations generic_ro_fops;
> > Index: linux-2.6.31-rc4/fs/ext4/extents.c
> > ===================================================================
> > --- linux-2.6.31-rc4.orig/fs/ext4/extents.c 2009-07-31 22:45:58.000000000 -0700
> > +++ linux-2.6.31-rc4/fs/ext4/extents.c 2009-07-31 23:37:20.000000000 -0700
> > @@ -3170,6 +3170,72 @@ retry:
> > return ret > 0 ? ret2 : ret;
> > }
> >
> > +long ext4_convert_unwritten(struct inode *inode, loff_t offset, loff_t len)
> > +{
> > + handle_t *handle;
> > + ext4_lblk_t block;
> > + loff_t new_size;
> > + unsigned int max_blocks;
> > + int ret = 0;
> > + int ret2 = 0;
> > + struct buffer_head map_bh;
> > + unsigned int credits, blkbits = inode->i_blkbits;
> > +
> > + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > + return -EOPNOTSUPP;
> > +
> > + block = offset >> blkbits;
> > + /*
> > + * We can't just convert len to max_blocks because
> > + * If blocksize = 4096 offset = 3072 and len = 2048
> > + */
> > + max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > + - block;
> > + /*
> > + * credits to insert 1 extent into extent tree
> > + */
> > + credits = ext4_chunk_trans_blocks(inode, max_blocks);
> > + while (ret >= 0 && ret < max_blocks) {
> > + block = block + ret;
> > + max_blocks = max_blocks - ret;
> > + handle = ext4_journal_start(inode, credits);
> > + if (IS_ERR(handle)) {
> > + ret = PTR_ERR(handle);
> > + break;
> > + }
> > + map_bh.b_state = 0;
> > + ret = ext4_get_blocks(handle, inode, block,
> > + max_blocks, &map_bh,
> > + EXT4_GET_BLOCKS_CONVERT_UNINIT_EXT)
>
> Where is EXT4_GET_BLOCKS_CONVERT_UNINIT_EXT defined ? Assuming it is
> EXT4_GET_BLOCKS_CREATE, it implies we can endup doing block allocation
> here. If we don't have enough space, we zero-out the full extent. Which
> implies we would endup overwritting the already wrote contents.
>

Sorry the patch was not the latest one. The
EXT4_GET_BLOCKS_CONVERT_UNINIT_EXT flag is set different than
EXT4_GET_BLOCKS_CREATE. It there to notify that we should not do block
allocation, and avoid zero out the full extent. I was planning to add
another code to zero out only the uninitialized part of the extents.

> ;
> > + if (ret <= 0) {
> > + WARN_ON(ret <= 0);
> > + printk(KERN_ERR "%s: ext4_ext_get_blocks "
> > + "returned error inode#%lu, block=%u, "
> > + "max_blocks=%u", __func__,
> > + inode->i_ino, block, max_blocks);
> > + ext4_mark_inode_dirty(handle, inode);
> > + ret2 = ext4_journal_stop(handle);
> > + break;
> > + }
> > +
> > + if ((block + ret) >= (EXT4_BLOCK_ALIGN(offset + len,
> > + blkbits) >> blkbits))
> > + new_size = offset + len;
> > + else
> > + new_size = (block + ret) << blkbits;
> > +
> > + if (new_size > i_size_read(inode))
> > + i_size_write(inode, new_size);
>
> shouldn't i_size update always happen with inode->i_mutex held. I guess
> we can't do a i_size update in end_io call back.
>
>
Maybe you are right, i_size update should already been taken care of at
the ext4_direct_IO() code, I will correct this.

> > + if (new_size > EXT4_I(inode)->i_disksize)
> > + ext4_update_i_disksize(inode, new_size);
> > +
> > + ext4_mark_inode_dirty(handle, inode);
> > + ret2 = ext4_journal_stop(handle);
> > + if (ret2)
> > + break;
> > + }
> > + return ret > 0 ? ret2 : ret;
> > +}
> > /*
> > * Callback function called for each extent to gather FIEMAP information.
> > */
> >
> >
>
>
> Can you send it as a multi patch series. Having only ext4 specific changes
> to look at will makes this review easier.
>

Yes I am planning to.

About the DIO layer change, all we need is a way to pass create=1 for
holes/fallocate from DIO. If there is a clear way to do from DIO that
would be greate. Otherwise, I was actually thinking we could workaround
in ext4, which we create a seperate ext4 directio get_block() function
for write and read. So when ext4_get_block_dio_write() is called, it
could override the create flag passed from DIO, and do the mapping for
fallocate and hole. Again,the best solution is to fix DIO code to pass
the flag to allow hole/fallocate map/filling,for fs like ext4 who wants
to do so.

Mingming