2013-05-14 07:52:32

by Li Wang

[permalink] [raw]
Subject: [PATCH] Avoid unnecessarily writing back dirty pages before hole punching

For hole punching, currently ext4 will synchronously write back the
dirty pages fit into the hole, since the data on the disk responding
to those pages are to be deleted, it is benefical to directly release
those pages, no matter they are dirty or not, except the ordered case.

Signed-off-by: Li Wang <[email protected]>
Signed-off-by: Yunchuan Wen <[email protected]>
Reviewed-by: Theodore Ts'o <[email protected]>
Reviewed-by: Dmitry Monakhov <[email protected]>
---
fs/ext4/inode.c | 21 ++++++++++-------
fs/jbd2/transaction.c | 62 +++++++++++++++++++++++++++++++------------------
include/linux/jbd2.h | 2 ++
3 files changed, 55 insertions(+), 30 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0723774..3e00360 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3578,6 +3578,16 @@ int ext4_can_truncate(struct inode *inode)
return 0;
}

+static inline int ext4_begin_ordered_fallocate(struct inode *inode,
+ loff_t start, loff_t length)
+{
+ if (!EXT4_I(inode)->jinode)
+ return 0;
+ return jbd2_journal_begin_ordered_fallocate(EXT4_JOURNAL(inode),
+ EXT4_I(inode)->jinode,
+ start, length);
+}
+
/*
* ext4_punch_hole: punches a hole in a file by releaseing the blocks
* associated with the given offset and length
@@ -3611,17 +3621,12 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)

trace_ext4_punch_hole(inode, offset, length);

- /*
- * Write out all dirty pages to avoid race conditions
- * Then release them.
- */
- if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
- ret = filemap_write_and_wait_range(mapping, offset,
- offset + length - 1);
+ if (ext4_should_order_data(inode)) {
+ ret = ext4_begin_ordered_fallocate(inode, offset, length);
if (ret)
return ret;
}
-
+
mutex_lock(&inode->i_mutex);
/* It's not possible punch hole on append only file */
if (IS_APPEND(inode) || IS_IMMUTABLE(inode)) {
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 10f524c..1284e1b 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2305,6 +2305,36 @@ done:
return 0;
}

+
+static int jbd2_journal_begin_ordered_discard(journal_t *journal,
+ struct jbd2_inode *jinode,
+ loff_t start, loff_t end)
+{
+ transaction_t *inode_trans, *commit_trans;
+ int ret = 0;
+
+ /* This is a quick check to avoid locking if not necessary */
+ if (!jinode->i_transaction)
+ goto out;
+ /* Locks are here just to force reading of recent values, it is
+ * enough that the transaction was not committing before we started
+ * a transaction adding the inode to orphan list */
+ read_lock(&journal->j_state_lock);
+ commit_trans = journal->j_committing_transaction;
+ read_unlock(&journal->j_state_lock);
+ spin_lock(&journal->j_list_lock);
+ inode_trans = jinode->i_transaction;
+ spin_unlock(&journal->j_list_lock);
+ if (inode_trans == commit_trans) {
+ ret = filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
+ start, end);
+ if (ret)
+ jbd2_journal_abort(journal, ret);
+ }
+out:
+ return ret;
+}
+
/*
* File truncate and transaction commit interact with each other in a
* non-trivial way. If a transaction writing data block A is
@@ -2329,27 +2359,15 @@ int jbd2_journal_begin_ordered_truncate(journal_t *journal,
struct jbd2_inode *jinode,
loff_t new_size)
{
- transaction_t *inode_trans, *commit_trans;
- int ret = 0;
+ return jbd2_journal_begin_ordered_discard(journal, jinode,
+ new_size, LLONG_MAX);
+}

- /* This is a quick check to avoid locking if not necessary */
- if (!jinode->i_transaction)
- goto out;
- /* Locks are here just to force reading of recent values, it is
- * enough that the transaction was not committing before we started
- * a transaction adding the inode to orphan list */
- read_lock(&journal->j_state_lock);
- commit_trans = journal->j_committing_transaction;
- read_unlock(&journal->j_state_lock);
- spin_lock(&journal->j_list_lock);
- inode_trans = jinode->i_transaction;
- spin_unlock(&journal->j_list_lock);
- if (inode_trans == commit_trans) {
- ret = filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
- new_size, LLONG_MAX);
- if (ret)
- jbd2_journal_abort(journal, ret);
- }
-out:
- return ret;
+int jbd2_journal_begin_ordered_fallocate(journal_t *journal,
+ struct jbd2_inode *jinode,
+ loff_t start, loff_t length)
+{
+ return jbd2_journal_begin_ordered_discard(journal, jinode,
+ start, start + length - 1);
}
+
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 6e051f4..9940cfb 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1128,6 +1128,8 @@ extern int jbd2_journal_force_commit(journal_t *);
extern int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *inode);
extern int jbd2_journal_begin_ordered_truncate(journal_t *journal,
struct jbd2_inode *inode, loff_t new_size);
+extern int jbd2_journal_begin_ordered_fallocate(journal_t *journal,
+ struct jbd2_inode *inode, loff_t start, loff_t length);
extern void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
extern void jbd2_journal_release_jbd_inode(journal_t *journal, struct jbd2_inode *jinode);

--
1.7.9.5




2013-05-17 03:05:44

by Zheng Liu

[permalink] [raw]
Subject: Re: [PATCH] Avoid unnecessarily writing back dirty pages before hole punching

Hello Li,

Thanks for fixing this. Some comments below.

On Tue, May 14, 2013 at 03:44:54PM +0800, Li Wang wrote:
> For hole punching, currently ext4 will synchronously write back the
> dirty pages fit into the hole, since the data on the disk responding
> to those pages are to be deleted, it is benefical to directly release
> those pages, no matter they are dirty or not, except the ordered case.
>
> Signed-off-by: Li Wang <[email protected]>
> Signed-off-by: Yunchuan Wen <[email protected]>
> Reviewed-by: Theodore Ts'o <[email protected]>
> Reviewed-by: Dmitry Monakhov <[email protected]>

If I remember correctly, Ted and Dmirty only answer your question about
this problem, but they don't review the patch. So don't add
"Reviewed-by", please.

> ---
> fs/ext4/inode.c | 21 ++++++++++-------
> fs/jbd2/transaction.c | 62 +++++++++++++++++++++++++++++++------------------
> include/linux/jbd2.h | 2 ++
> 3 files changed, 55 insertions(+), 30 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 0723774..3e00360 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3578,6 +3578,16 @@ int ext4_can_truncate(struct inode *inode)
> return 0;
> }
>
> +static inline int ext4_begin_ordered_fallocate(struct inode *inode,
> + loff_t start, loff_t length)
> +{
> + if (!EXT4_I(inode)->jinode)
> + return 0;
> + return jbd2_journal_begin_ordered_fallocate(EXT4_JOURNAL(inode),
> + EXT4_I(inode)->jinode,
> + start, length);
> +}
> +
> /*
> * ext4_punch_hole: punches a hole in a file by releaseing the blocks
> * associated with the given offset and length
> @@ -3611,17 +3621,12 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>
> trace_ext4_punch_hole(inode, offset, length);
>
> - /*
> - * Write out all dirty pages to avoid race conditions
> - * Then release them.
> - */
> - if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> - ret = filemap_write_and_wait_range(mapping, offset,
> - offset + length - 1);
> + if (ext4_should_order_data(inode)) {
> + ret = ext4_begin_ordered_fallocate(inode, offset, length);
> if (ret)
> return ret;
> }
> -
> +
> mutex_lock(&inode->i_mutex);

I am wondering if we need to call ext4_begin_ordered_fallocate() after
i_mutex is taken. Otherwise we can not prevent other buffered writes to
redirty this range of pages.

> /* It's not possible punch hole on append only file */
> if (IS_APPEND(inode) || IS_IMMUTABLE(inode)) {
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 10f524c..1284e1b 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -2305,6 +2305,36 @@ done:
> return 0;
> }
>
> +
> +static int jbd2_journal_begin_ordered_discard(journal_t *journal,
> + struct jbd2_inode *jinode,
> + loff_t start, loff_t end)
> +{
> + transaction_t *inode_trans, *commit_trans;
> + int ret = 0;
> +
> + /* This is a quick check to avoid locking if not necessary */
> + if (!jinode->i_transaction)
> + goto out;
> + /* Locks are here just to force reading of recent values, it is
> + * enough that the transaction was not committing before we started
> + * a transaction adding the inode to orphan list */
> + read_lock(&journal->j_state_lock);
> + commit_trans = journal->j_committing_transaction;
> + read_unlock(&journal->j_state_lock);
> + spin_lock(&journal->j_list_lock);
> + inode_trans = jinode->i_transaction;
> + spin_unlock(&journal->j_list_lock);
> + if (inode_trans == commit_trans) {
> + ret = filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
> + start, end);
> + if (ret)
> + jbd2_journal_abort(journal, ret);
> + }
> +out:
> + return ret;
> +}
> +
> /*
> * File truncate and transaction commit interact with each other in a
> * non-trivial way. If a transaction writing data block A is
> @@ -2329,27 +2359,15 @@ int jbd2_journal_begin_ordered_truncate(journal_t *journal,
> struct jbd2_inode *jinode,
> loff_t new_size)
> {
> - transaction_t *inode_trans, *commit_trans;
> - int ret = 0;
> + return jbd2_journal_begin_ordered_discard(journal, jinode,
> + new_size, LLONG_MAX);

Coding style problem. Before you submit a patch, you could use
$LINUX/scripts/checkpatch.pl to check your patch in order to make sure
whether it is ready for submission or not.

> +}
>
> - /* This is a quick check to avoid locking if not necessary */
> - if (!jinode->i_transaction)
> - goto out;
> - /* Locks are here just to force reading of recent values, it is
> - * enough that the transaction was not committing before we started
> - * a transaction adding the inode to orphan list */
> - read_lock(&journal->j_state_lock);
> - commit_trans = journal->j_committing_transaction;
> - read_unlock(&journal->j_state_lock);
> - spin_lock(&journal->j_list_lock);
> - inode_trans = jinode->i_transaction;
> - spin_unlock(&journal->j_list_lock);
> - if (inode_trans == commit_trans) {
> - ret = filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
> - new_size, LLONG_MAX);
> - if (ret)
> - jbd2_journal_abort(journal, ret);
> - }
> -out:
> - return ret;
> +int jbd2_journal_begin_ordered_fallocate(journal_t *journal,
> + struct jbd2_inode *jinode,
> + loff_t start, loff_t length)
> +{
> + return jbd2_journal_begin_ordered_discard(journal, jinode,
> + start, start + length - 1);

The same as above.

Regards,
- Zheng

> }
> +
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 6e051f4..9940cfb 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1128,6 +1128,8 @@ extern int jbd2_journal_force_commit(journal_t *);
> extern int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *inode);
> extern int jbd2_journal_begin_ordered_truncate(journal_t *journal,
> struct jbd2_inode *inode, loff_t new_size);
> +extern int jbd2_journal_begin_ordered_fallocate(journal_t *journal,
> + struct jbd2_inode *inode, loff_t start, loff_t length);
> extern void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
> extern void jbd2_journal_release_jbd_inode(journal_t *journal, struct jbd2_inode *jinode);
>
> --
> 1.7.9.5
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2013-05-20 09:05:14

by Li Wang

[permalink] [raw]
Subject: [PATCH v2] ext4: Avoid unnecessarily writing back dirty pages before hole punching

For hole punching, currently ext4 will synchronously write back the
dirty pages fit into the hole, since the data on the disk responding
to those pages are to be deleted, it is benefical to directly release
those pages, no matter they are dirty or not, except the ordered case.

Signed-off-by: Li Wang <[email protected]>
Signed-off-by: Yunchuan Wen <[email protected]>
Reviewed-by: Zheng Liu <[email protected]>
Cc: Dmitry Monakhov <[email protected]>
---
Hi Zheng,
Thanks for your comments.
This is the revised version with the operation of writting back moved
down after the inode mutex held. But there is one thing I wanna confirm
is that whether the inode mutex could prevent the mmap() writer? I did
not take a careful look at the mmap() code, the straightforward thinking
is that mmap() write will directly dirty the pages without going through
the VFS generic_file_write() path.
BTW, I have one other question to confirm regarding the ext4 journal mode:
what is the advantage of data=ordered journal mode compared to data=writeback?
For overwriting write, it still may lead to the inconsistence between data and
metadata, that is, data is new and metadata is old. So its standpoint is
that it beats data=writeback in appending write?
---
fs/ext4/inode.c | 27 +++++++++++++---------
fs/jbd2/journal.c | 1 +
fs/jbd2/transaction.c | 61 +++++++++++++++++++++++++++++++------------------
include/linux/jbd2.h | 3 +++
4 files changed, 59 insertions(+), 33 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d6382b8..568b0bd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3569,6 +3569,16 @@ int ext4_can_truncate(struct inode *inode)
return 0;
}

+static inline int ext4_begin_ordered_fallocate(struct inode *inode,
+ loff_t start, loff_t length)
+{
+ if (!EXT4_I(inode)->jinode)
+ return 0;
+ return jbd2_journal_begin_ordered_fallocate(EXT4_JOURNAL(inode),
+ EXT4_I(inode)->jinode,
+ start, length);
+}
+
/*
* ext4_punch_hole: punches a hole in a file by releaseing the blocks
* associated with the given offset and length
@@ -3602,17 +3612,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)

trace_ext4_punch_hole(inode, offset, length);

- /*
- * Write out all dirty pages to avoid race conditions
- * Then release them.
- */
- if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
- ret = filemap_write_and_wait_range(mapping, offset,
- offset + length - 1);
- if (ret)
- return ret;
- }
-
mutex_lock(&inode->i_mutex);
/* It's not possible punch hole on append only file */
if (IS_APPEND(inode) || IS_IMMUTABLE(inode)) {
@@ -3644,6 +3643,12 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
first_page_offset = first_page << PAGE_CACHE_SHIFT;
last_page_offset = last_page << PAGE_CACHE_SHIFT;

+ if (ext4_should_order_data(inode)) {
+ ret = ext4_begin_ordered_fallocate(inode, offset, length);
+ if (ret)
+ return ret;
+ }
+
/* Now release the pages */
if (last_page_offset > first_page_offset) {
truncate_pagecache_range(inode, first_page_offset,
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 9545757..ccc483a 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -98,6 +98,7 @@ EXPORT_SYMBOL(jbd2_journal_file_inode);
EXPORT_SYMBOL(jbd2_journal_init_jbd_inode);
EXPORT_SYMBOL(jbd2_journal_release_jbd_inode);
EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);
+EXPORT_SYMBOL(jbd2_journal_begin_ordered_fallocate);
EXPORT_SYMBOL(jbd2_inode_cache);

static void __journal_abort_soft (journal_t *journal, int errno);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 10f524c..035c064 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2305,6 +2305,36 @@ done:
return 0;
}

+
+static int jbd2_journal_begin_ordered_discard(journal_t *journal,
+ struct jbd2_inode *jinode,
+ loff_t start, loff_t end)
+{
+ transaction_t *inode_trans, *commit_trans;
+ int ret = 0;
+
+ /* This is a quick check to avoid locking if not necessary */
+ if (!jinode->i_transaction)
+ goto out;
+ /* Locks are here just to force reading of recent values, it is
+ * enough that the transaction was not committing before we started
+ * a transaction adding the inode to orphan list */
+ read_lock(&journal->j_state_lock);
+ commit_trans = journal->j_committing_transaction;
+ read_unlock(&journal->j_state_lock);
+ spin_lock(&journal->j_list_lock);
+ inode_trans = jinode->i_transaction;
+ spin_unlock(&journal->j_list_lock);
+ if (inode_trans == commit_trans) {
+ ret = filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
+ start, end);
+ if (ret)
+ jbd2_journal_abort(journal, ret);
+ }
+out:
+ return ret;
+}
+
/*
* File truncate and transaction commit interact with each other in a
* non-trivial way. If a transaction writing data block A is
@@ -2329,27 +2359,14 @@ int jbd2_journal_begin_ordered_truncate(journal_t *journal,
struct jbd2_inode *jinode,
loff_t new_size)
{
- transaction_t *inode_trans, *commit_trans;
- int ret = 0;
+ return jbd2_journal_begin_ordered_discard(journal, jinode,
+ new_size, LLONG_MAX);
+}

- /* This is a quick check to avoid locking if not necessary */
- if (!jinode->i_transaction)
- goto out;
- /* Locks are here just to force reading of recent values, it is
- * enough that the transaction was not committing before we started
- * a transaction adding the inode to orphan list */
- read_lock(&journal->j_state_lock);
- commit_trans = journal->j_committing_transaction;
- read_unlock(&journal->j_state_lock);
- spin_lock(&journal->j_list_lock);
- inode_trans = jinode->i_transaction;
- spin_unlock(&journal->j_list_lock);
- if (inode_trans == commit_trans) {
- ret = filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
- new_size, LLONG_MAX);
- if (ret)
- jbd2_journal_abort(journal, ret);
- }
-out:
- return ret;
+int jbd2_journal_begin_ordered_fallocate(journal_t *journal,
+ struct jbd2_inode *jinode,
+ loff_t start, loff_t length)
+{
+ return jbd2_journal_begin_ordered_discard(journal, jinode,
+ start, start + length - 1);
}
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 6e051f4..6c63c5e 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1128,6 +1128,9 @@ extern int jbd2_journal_force_commit(journal_t *);
extern int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *inode);
extern int jbd2_journal_begin_ordered_truncate(journal_t *journal,
struct jbd2_inode *inode, loff_t new_size);
+extern int jbd2_journal_begin_ordered_fallocate(journal_t *journal,
+ struct jbd2_inode *inode, loff_t start,
+ loff_t length);
extern void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
extern void jbd2_journal_release_jbd_inode(journal_t *journal, struct jbd2_inode *jinode);

--
1.7.9.5



2013-05-21 03:50:18

by Zheng Liu

[permalink] [raw]
Subject: Re: [PATCH v2] ext4: Avoid unnecessarily writing back dirty pages before hole punching

On Mon, May 20, 2013 at 05:04:35PM +0800, Li Wang wrote:
> For hole punching, currently ext4 will synchronously write back the
> dirty pages fit into the hole, since the data on the disk responding
> to those pages are to be deleted, it is benefical to directly release
> those pages, no matter they are dirty or not, except the ordered case.
>
> Signed-off-by: Li Wang <[email protected]>
> Signed-off-by: Yunchuan Wen <[email protected]>
> Reviewed-by: Zheng Liu <[email protected]>

FWIW, you could Cc me but you couldn't add 'Reviewed-by' here because in
first version I give a comment to you, and I don't think the patch looks
good to me at that time. So please do not add 'Reviewed-by'. I paste a
description about 'Reviewed-by' from Documentation/SubmittingPatches for
you information:

Reviewed-by:, instead, indicates that the patch has been reviewed and found
acceptable according to the Reviewer's Statement:

Reviewer's statement of oversight

By offering my Reviewed-by: tag, I state that:

(a) I have carried out a technical review of this patch to
evaluate its appropriateness and readiness for inclusion into
the mainline kernel.

(b) Any problems, concerns, or questions relating to the patch
have been communicated back to the submitter. I am satisfied
with the submitter's response to my comments.

(c) While there may be things that could be improved with this
submission, I believe that it is, at this time, (1) a
worthwhile modification to the kernel, and (2) free of known
issues which would argue against its inclusion.

(d) While I have reviewed the patch and believe it to be sound, I
do not (unless explicitly stated elsewhere) make any
warranties or guarantees that it will achieve its stated
purpose or function properly in any given situation.

A Reviewed-by tag is a statement of opinion that the patch is an
appropriate modification of the kernel without any remaining serious
technical issues. Any interested reviewer (who has done the work) can
offer a Reviewed-by tag for a patch. This tag serves to give credit to
reviewers and to inform maintainers of the degree of review which has been
done on the patch. Reviewed-by: tags, when supplied by reviewers known to
understand the subject area and to perform thorough reviews, will normally
increase the likelihood of your patch getting into the kernel.

Otherwise the patch looks good to me. You can add:
Reviewed-by: Zheng Liu <[email protected]>

:-)

> Cc: Dmitry Monakhov <[email protected]>
> ---
> Hi Zheng,
> Thanks for your comments.
> This is the revised version with the operation of writting back moved
> down after the inode mutex held. But there is one thing I wanna confirm
> is that whether the inode mutex could prevent the mmap() writer? I did
> not take a careful look at the mmap() code, the straightforward thinking
> is that mmap() write will directly dirty the pages without going through
> the VFS generic_file_write() path.

As far as I understand, you are right. We couldn't avoid this race
condition with punching hole. Please correct me if I miss something.

> BTW, I have one other question to confirm regarding the ext4 journal mode:
> what is the advantage of data=ordered journal mode compared to data=writeback?

'data=ordered' mode means that when we try to commit metadata, we must
make sure that related data has been written out on disk, but
'data=writeback' mode doesn't guarantee this. So when we got a power
failure, we could read stale data with 'data=writeback' mode.

> For overwriting write, it still may lead to the inconsistence between data and
> metadata, that is, data is new and metadata is old.

No, a overwriting write means that metadata has been allocated. So we
don't need to allocate some blocks for these data. That means that we
don't need to worry about the corrupted metadata.

>So its standpoint is
> that it beats data=writeback in appending write?

> ---

One tip: you can write your question here in a patch. When maintainer
apply your patch, your question will be ignored automatically.
Otherwise maintainer will change your commit log manually.

Regards,
- Zheng

> fs/ext4/inode.c | 27 +++++++++++++---------
> fs/jbd2/journal.c | 1 +
> fs/jbd2/transaction.c | 61 +++++++++++++++++++++++++++++++------------------
> include/linux/jbd2.h | 3 +++
> 4 files changed, 59 insertions(+), 33 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d6382b8..568b0bd 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3569,6 +3569,16 @@ int ext4_can_truncate(struct inode *inode)
> return 0;
> }
>
> +static inline int ext4_begin_ordered_fallocate(struct inode *inode,
> + loff_t start, loff_t length)
> +{
> + if (!EXT4_I(inode)->jinode)
> + return 0;
> + return jbd2_journal_begin_ordered_fallocate(EXT4_JOURNAL(inode),
> + EXT4_I(inode)->jinode,
> + start, length);
> +}
> +
> /*
> * ext4_punch_hole: punches a hole in a file by releaseing the blocks
> * associated with the given offset and length
> @@ -3602,17 +3612,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>
> trace_ext4_punch_hole(inode, offset, length);
>
> - /*
> - * Write out all dirty pages to avoid race conditions
> - * Then release them.
> - */
> - if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> - ret = filemap_write_and_wait_range(mapping, offset,
> - offset + length - 1);
> - if (ret)
> - return ret;
> - }
> -
> mutex_lock(&inode->i_mutex);
> /* It's not possible punch hole on append only file */
> if (IS_APPEND(inode) || IS_IMMUTABLE(inode)) {
> @@ -3644,6 +3643,12 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> first_page_offset = first_page << PAGE_CACHE_SHIFT;
> last_page_offset = last_page << PAGE_CACHE_SHIFT;
>
> + if (ext4_should_order_data(inode)) {
> + ret = ext4_begin_ordered_fallocate(inode, offset, length);
> + if (ret)
> + return ret;
> + }
> +
> /* Now release the pages */
> if (last_page_offset > first_page_offset) {
> truncate_pagecache_range(inode, first_page_offset,
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 9545757..ccc483a 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -98,6 +98,7 @@ EXPORT_SYMBOL(jbd2_journal_file_inode);
> EXPORT_SYMBOL(jbd2_journal_init_jbd_inode);
> EXPORT_SYMBOL(jbd2_journal_release_jbd_inode);
> EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);
> +EXPORT_SYMBOL(jbd2_journal_begin_ordered_fallocate);
> EXPORT_SYMBOL(jbd2_inode_cache);
>
> static void __journal_abort_soft (journal_t *journal, int errno);
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 10f524c..035c064 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -2305,6 +2305,36 @@ done:
> return 0;
> }
>
> +
> +static int jbd2_journal_begin_ordered_discard(journal_t *journal,
> + struct jbd2_inode *jinode,
> + loff_t start, loff_t end)
> +{
> + transaction_t *inode_trans, *commit_trans;
> + int ret = 0;
> +
> + /* This is a quick check to avoid locking if not necessary */
> + if (!jinode->i_transaction)
> + goto out;
> + /* Locks are here just to force reading of recent values, it is
> + * enough that the transaction was not committing before we started
> + * a transaction adding the inode to orphan list */
> + read_lock(&journal->j_state_lock);
> + commit_trans = journal->j_committing_transaction;
> + read_unlock(&journal->j_state_lock);
> + spin_lock(&journal->j_list_lock);
> + inode_trans = jinode->i_transaction;
> + spin_unlock(&journal->j_list_lock);
> + if (inode_trans == commit_trans) {
> + ret = filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
> + start, end);
> + if (ret)
> + jbd2_journal_abort(journal, ret);
> + }
> +out:
> + return ret;
> +}
> +
> /*
> * File truncate and transaction commit interact with each other in a
> * non-trivial way. If a transaction writing data block A is
> @@ -2329,27 +2359,14 @@ int jbd2_journal_begin_ordered_truncate(journal_t *journal,
> struct jbd2_inode *jinode,
> loff_t new_size)
> {
> - transaction_t *inode_trans, *commit_trans;
> - int ret = 0;
> + return jbd2_journal_begin_ordered_discard(journal, jinode,
> + new_size, LLONG_MAX);
> +}
>
> - /* This is a quick check to avoid locking if not necessary */
> - if (!jinode->i_transaction)
> - goto out;
> - /* Locks are here just to force reading of recent values, it is
> - * enough that the transaction was not committing before we started
> - * a transaction adding the inode to orphan list */
> - read_lock(&journal->j_state_lock);
> - commit_trans = journal->j_committing_transaction;
> - read_unlock(&journal->j_state_lock);
> - spin_lock(&journal->j_list_lock);
> - inode_trans = jinode->i_transaction;
> - spin_unlock(&journal->j_list_lock);
> - if (inode_trans == commit_trans) {
> - ret = filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
> - new_size, LLONG_MAX);
> - if (ret)
> - jbd2_journal_abort(journal, ret);
> - }
> -out:
> - return ret;
> +int jbd2_journal_begin_ordered_fallocate(journal_t *journal,
> + struct jbd2_inode *jinode,
> + loff_t start, loff_t length)
> +{
> + return jbd2_journal_begin_ordered_discard(journal, jinode,
> + start, start + length - 1);
> }
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 6e051f4..6c63c5e 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1128,6 +1128,9 @@ extern int jbd2_journal_force_commit(journal_t *);
> extern int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *inode);
> extern int jbd2_journal_begin_ordered_truncate(journal_t *journal,
> struct jbd2_inode *inode, loff_t new_size);
> +extern int jbd2_journal_begin_ordered_fallocate(journal_t *journal,
> + struct jbd2_inode *inode, loff_t start,
> + loff_t length);
> extern void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
> extern void jbd2_journal_release_jbd_inode(journal_t *journal, struct jbd2_inode *jinode);
>
> --
> 1.7.9.5
>
>

2013-05-21 09:22:33

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2] ext4: Avoid unnecessarily writing back dirty pages before hole punching

On Mon 20-05-13 17:04:35, Li Wang wrote:
> For hole punching, currently ext4 will synchronously write back the
> dirty pages fit into the hole, since the data on the disk responding
> to those pages are to be deleted, it is benefical to directly release
> those pages, no matter they are dirty or not, except the ordered case.
>
> Signed-off-by: Li Wang <[email protected]>
> Signed-off-by: Yunchuan Wen <[email protected]>
> Reviewed-by: Zheng Liu <[email protected]>
> Cc: Dmitry Monakhov <[email protected]>
> ---
> Hi Zheng,
> Thanks for your comments.
> This is the revised version with the operation of writting back moved
> down after the inode mutex held. But there is one thing I wanna confirm
> is that whether the inode mutex could prevent the mmap() writer? I did
> not take a careful look at the mmap() code, the straightforward thinking
> is that mmap() write will directly dirty the pages without going through
> the VFS generic_file_write() path.
Currently there's no easy way to stop mmap writer. When I eventually get
to implementing the mapping range lock, it will be used exactly for that.

> BTW, I have one other question to confirm regarding the ext4 journal mode:
> what is the advantage of data=ordered journal mode compared to data=writeback?
> For overwriting write, it still may lead to the inconsistence between data and
> metadata, that is, data is new and metadata is old. So its standpoint is
> that it beats data=writeback in appending write?
The advantage of data=ordered mode is that in case of a system crash /
power failure, you are guaranteed that only data sometimes written to a
file is visible there - i.e., you will not ever expose uninitialized blocks
to user (which is a security concern on multiuser systems as those
uninitialized blocks can contain old versions of /etc/shadow or other
sensitive data).

Honza

> ---
> fs/ext4/inode.c | 27 +++++++++++++---------
> fs/jbd2/journal.c | 1 +
> fs/jbd2/transaction.c | 61 +++++++++++++++++++++++++++++++------------------
> include/linux/jbd2.h | 3 +++
> 4 files changed, 59 insertions(+), 33 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d6382b8..568b0bd 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3569,6 +3569,16 @@ int ext4_can_truncate(struct inode *inode)
> return 0;
> }
>
> +static inline int ext4_begin_ordered_fallocate(struct inode *inode,
> + loff_t start, loff_t length)
> +{
> + if (!EXT4_I(inode)->jinode)
> + return 0;
> + return jbd2_journal_begin_ordered_fallocate(EXT4_JOURNAL(inode),
> + EXT4_I(inode)->jinode,
> + start, length);
> +}
> +
I somewhat dislike the naming of the functions - especially 'fallocate'
seems a bit misleading as it's really about hole punching. So maybe we
could call the functions like:
ext4_begin_ordered_punch_hole()
and
jbd2_journal_begin_ordered_punch_hole()?

> /*
> * ext4_punch_hole: punches a hole in a file by releaseing the blocks
> * associated with the given offset and length
> @@ -3602,17 +3612,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>
> trace_ext4_punch_hole(inode, offset, length);
>
> - /*
> - * Write out all dirty pages to avoid race conditions
> - * Then release them.
> - */
> - if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> - ret = filemap_write_and_wait_range(mapping, offset,
> - offset + length - 1);
> - if (ret)
> - return ret;
> - }
> -
> mutex_lock(&inode->i_mutex);
> /* It's not possible punch hole on append only file */
> if (IS_APPEND(inode) || IS_IMMUTABLE(inode)) {
> @@ -3644,6 +3643,12 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> first_page_offset = first_page << PAGE_CACHE_SHIFT;
> last_page_offset = last_page << PAGE_CACHE_SHIFT;
>
> + if (ext4_should_order_data(inode)) {
> + ret = ext4_begin_ordered_fallocate(inode, offset, length);
> + if (ret)
> + return ret;
> + }
> +
> /* Now release the pages */
> if (last_page_offset > first_page_offset) {
> truncate_pagecache_range(inode, first_page_offset,
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 9545757..ccc483a 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -98,6 +98,7 @@ EXPORT_SYMBOL(jbd2_journal_file_inode);
> EXPORT_SYMBOL(jbd2_journal_init_jbd_inode);
> EXPORT_SYMBOL(jbd2_journal_release_jbd_inode);
> EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);
> +EXPORT_SYMBOL(jbd2_journal_begin_ordered_fallocate);
> EXPORT_SYMBOL(jbd2_inode_cache);
>
> static void __journal_abort_soft (journal_t *journal, int errno);
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 10f524c..035c064 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -2305,6 +2305,36 @@ done:
> return 0;
> }
>
> +
> +static int jbd2_journal_begin_ordered_discard(journal_t *journal,
> + struct jbd2_inode *jinode,
> + loff_t start, loff_t end)
> +{
I don't see a need for this internal function - it is exactly the same as
the external one (jbd2_journal_begin_ordered_punch_hole()).

> + transaction_t *inode_trans, *commit_trans;
> + int ret = 0;
> +
> + /* This is a quick check to avoid locking if not necessary */
> + if (!jinode->i_transaction)
> + goto out;
> + /* Locks are here just to force reading of recent values, it is
> + * enough that the transaction was not committing before we started
> + * a transaction adding the inode to orphan list */
> + read_lock(&journal->j_state_lock);
> + commit_trans = journal->j_committing_transaction;
> + read_unlock(&journal->j_state_lock);
> + spin_lock(&journal->j_list_lock);
> + inode_trans = jinode->i_transaction;
> + spin_unlock(&journal->j_list_lock);
> + if (inode_trans == commit_trans) {
> + ret = filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
> + start, end);
> + if (ret)
> + jbd2_journal_abort(journal, ret);
> + }
> +out:
> + return ret;
> +}
> +
> /*
> * File truncate and transaction commit interact with each other in a
> * non-trivial way. If a transaction writing data block A is
> @@ -2329,27 +2359,14 @@ int jbd2_journal_begin_ordered_truncate(journal_t *journal,
> struct jbd2_inode *jinode,
> loff_t new_size)
> {
> - transaction_t *inode_trans, *commit_trans;
> - int ret = 0;
> + return jbd2_journal_begin_ordered_discard(journal, jinode,
> + new_size, LLONG_MAX);
> +}
You can make this function inline and declare it in jbd2.h header file...

Honza

>
> - /* This is a quick check to avoid locking if not necessary */
> - if (!jinode->i_transaction)
> - goto out;
> - /* Locks are here just to force reading of recent values, it is
> - * enough that the transaction was not committing before we started
> - * a transaction adding the inode to orphan list */
> - read_lock(&journal->j_state_lock);
> - commit_trans = journal->j_committing_transaction;
> - read_unlock(&journal->j_state_lock);
> - spin_lock(&journal->j_list_lock);
> - inode_trans = jinode->i_transaction;
> - spin_unlock(&journal->j_list_lock);
> - if (inode_trans == commit_trans) {
> - ret = filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
> - new_size, LLONG_MAX);
> - if (ret)
> - jbd2_journal_abort(journal, ret);
> - }
> -out:
> - return ret;
> +int jbd2_journal_begin_ordered_fallocate(journal_t *journal,
> + struct jbd2_inode *jinode,
> + loff_t start, loff_t length)
> +{
> + return jbd2_journal_begin_ordered_discard(journal, jinode,
> + start, start + length - 1);
> }
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 6e051f4..6c63c5e 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1128,6 +1128,9 @@ extern int jbd2_journal_force_commit(journal_t *);
> extern int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *inode);
> extern int jbd2_journal_begin_ordered_truncate(journal_t *journal,
> struct jbd2_inode *inode, loff_t new_size);
> +extern int jbd2_journal_begin_ordered_fallocate(journal_t *journal,
> + struct jbd2_inode *inode, loff_t start,
> + loff_t length);
> extern void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
> extern void jbd2_journal_release_jbd_inode(journal_t *journal, struct jbd2_inode *jinode);
>
> --
> 1.7.9.5
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-05-27 14:25:36

by Li Wang

[permalink] [raw]
Subject: [PATCH] ext4: Avoid unnecessarily writing back dirty pages before hole punching

For hole punching, currently ext4 will synchronously write back the
dirty pages fit into the hole, since the data on the disk responding
to those pages are to be deleted, it is benefical to directly release
those pages, no matter they are dirty or not, except the ordered case.

Signed-off-by: Li Wang <[email protected]>
Signed-off-by: Yunchuan Wen <[email protected]>
Cc: Dmitry Monakhov <[email protected]>
Cc: Jan Kara <[email protected]>
---
Hi Zheng and Jan,
Thanks for your comments.
For data=ordered vs. data=writeback, my understanding is
that they both journal metadata, so metadata won't be corrupted
in both cases. And they both do not journal data, so data
may be lost under either case. So it is basically the same
under overwriting situation, that is, data may not be fully updated.
The difference lies in that for appending write, with data=writeback,
the commit of metadata is done asynchronously
with the write of data, so it may happen
that file size is increased, with data incompletely written,
leaves partly uninitialized data, as
pointed out by Jan, that results in security issues. For
data=ordered, metadata is committed after data are
written with slightly? lower performance, so reader won't read out
uninitialized data.
We introduce the internal function jbd2_journal_begin_ordered_discard()
because it will be called by both jbd2_journal_begin_ordered_punch_hole()
and jbd2_journal_begin_ordered_truncate(),
and we want to leave the function prototype of jbd2_journal_begin_ordered_
truncate() unchanged, and it has less arguments than the punch hole
counterpart. The other way is we implement them independently without the
internal begin_ordered_discard() function, however, in that case,
the two functions will suffer from sharing a
big and almost the same body, that is not elegant.
We have taken other suggestions from Jan.
---
fs/ext4/inode.c | 27 ++++++++++++++++-----------
fs/jbd2/journal.c | 2 +-
fs/jbd2/transaction.c | 29 ++++++-----------------------
include/linux/jbd2.h | 41 +++++++++++++++++++++++++++++++++++++++--
4 files changed, 62 insertions(+), 37 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d6382b8..6b0251e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3569,6 +3569,16 @@ int ext4_can_truncate(struct inode *inode)
return 0;
}

+static inline int ext4_begin_ordered_punch_hole(struct inode *inode,
+ loff_t start, loff_t length)
+{
+ if (!EXT4_I(inode)->jinode)
+ return 0;
+ return jbd2_journal_begin_ordered_punch_hole(EXT4_JOURNAL(inode),
+ EXT4_I(inode)->jinode,
+ start, length);
+}
+
/*
* ext4_punch_hole: punches a hole in a file by releaseing the blocks
* associated with the given offset and length
@@ -3602,17 +3612,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)

trace_ext4_punch_hole(inode, offset, length);

- /*
- * Write out all dirty pages to avoid race conditions
- * Then release them.
- */
- if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
- ret = filemap_write_and_wait_range(mapping, offset,
- offset + length - 1);
- if (ret)
- return ret;
- }
-
mutex_lock(&inode->i_mutex);
/* It's not possible punch hole on append only file */
if (IS_APPEND(inode) || IS_IMMUTABLE(inode)) {
@@ -3644,6 +3643,12 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
first_page_offset = first_page << PAGE_CACHE_SHIFT;
last_page_offset = last_page << PAGE_CACHE_SHIFT;

+ if (ext4_should_order_data(inode)) {
+ ret = ext4_begin_ordered_punch_hole(inode, offset, length);
+ if (ret)
+ return ret;
+ }
+
/* Now release the pages */
if (last_page_offset > first_page_offset) {
truncate_pagecache_range(inode, first_page_offset,
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 9545757..166ca5d 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -97,7 +97,7 @@ EXPORT_SYMBOL(jbd2_journal_force_commit);
EXPORT_SYMBOL(jbd2_journal_file_inode);
EXPORT_SYMBOL(jbd2_journal_init_jbd_inode);
EXPORT_SYMBOL(jbd2_journal_release_jbd_inode);
-EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);
+EXPORT_SYMBOL(jbd2_journal_begin_ordered_discard);
EXPORT_SYMBOL(jbd2_inode_cache);

static void __journal_abort_soft (journal_t *journal, int errno);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 10f524c..2d7a3bf 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2305,29 +2305,10 @@ done:
return 0;
}

-/*
- * File truncate and transaction commit interact with each other in a
- * non-trivial way. If a transaction writing data block A is
- * committing, we cannot discard the data by truncate until we have
- * written them. Otherwise if we crashed after the transaction with
- * write has committed but before the transaction with truncate has
- * committed, we could see stale data in block A. This function is a
- * helper to solve this problem. It starts writeout of the truncated
- * part in case it is in the committing transaction.
- *
- * Filesystem code must call this function when inode is journaled in
- * ordered mode before truncation happens and after the inode has been
- * placed on orphan list with the new inode size. The second condition
- * avoids the race that someone writes new data and we start
- * committing the transaction after this function has been called but
- * before a transaction for truncate is started (and furthermore it
- * allows us to optimize the case where the addition to orphan list
- * happens in the same transaction as write --- we don't have to write
- * any data in such case).
- */
-int jbd2_journal_begin_ordered_truncate(journal_t *journal,
+
+int jbd2_journal_begin_ordered_discard(journal_t *journal,
struct jbd2_inode *jinode,
- loff_t new_size)
+ loff_t start, loff_t end)
{
transaction_t *inode_trans, *commit_trans;
int ret = 0;
@@ -2346,10 +2327,12 @@ int jbd2_journal_begin_ordered_truncate(journal_t *journal,
spin_unlock(&journal->j_list_lock);
if (inode_trans == commit_trans) {
ret = filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
- new_size, LLONG_MAX);
+ start, end);
if (ret)
jbd2_journal_abort(journal, ret);
}
out:
return ret;
}
+
+
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 6e051f4..9543a5a 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1126,12 +1126,49 @@ extern int jbd2_journal_clear_err (journal_t *);
extern int jbd2_journal_bmap(journal_t *, unsigned long, unsigned long long *);
extern int jbd2_journal_force_commit(journal_t *);
extern int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *inode);
-extern int jbd2_journal_begin_ordered_truncate(journal_t *journal,
- struct jbd2_inode *inode, loff_t new_size);
+extern int jbd2_journal_begin_ordered_discard(journal_t *,
+ struct jbd2_inode *,
+ loff_t, loff_t);
extern void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
extern void jbd2_journal_release_jbd_inode(journal_t *journal, struct jbd2_inode *jinode);

/*
+ * File truncate and transaction commit interact with each other in a
+ * non-trivial way. If a transaction writing data block A is
+ * committing, we cannot discard the data by truncate until we have
+ * written them. Otherwise if we crashed after the transaction with
+ * write has committed but before the transaction with truncate has
+ * committed, we could see stale data in block A. This function is a
+ * helper to solve this problem. It starts writeout of the truncated
+ * part in case it is in the committing transaction.
+ *
+ * Filesystem code must call this function when inode is journaled in
+ * ordered mode before truncation happens and after the inode has been
+ * placed on orphan list with the new inode size. The second condition
+ * avoids the race that someone writes new data and we start
+ * committing the transaction after this function has been called but
+ * before a transaction for truncate is started (and furthermore it
+ * allows us to optimize the case where the addition to orphan list
+ * happens in the same transaction as write --- we don't have to write
+ * any data in such case).
+ */
+static inline int jbd2_journal_begin_ordered_truncate(journal_t *journal,
+ struct jbd2_inode *jinode,
+ loff_t new_size)
+{
+ return jbd2_journal_begin_ordered_discard(journal, jinode,
+ new_size, LLONG_MAX);
+}
+
+static inline int jbd2_journal_begin_ordered_punch_hole(journal_t *journal,
+ struct jbd2_inode *jinode,
+ loff_t start, loff_t length)
+{
+ return jbd2_journal_begin_ordered_discard(journal, jinode,
+ start, start + length - 1);
+}
+
+/*
* journal_head management
*/
struct journal_head *jbd2_journal_add_journal_head(struct buffer_head *bh);
--
1.7.9.5

2013-05-27 14:47:27

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: Avoid unnecessarily writing back dirty pages before hole punching

On Mon 27-05-13 22:25:36, Li Wang wrote:
> For hole punching, currently ext4 will synchronously write back the
> dirty pages fit into the hole, since the data on the disk responding
> to those pages are to be deleted, it is benefical to directly release
> those pages, no matter they are dirty or not, except the ordered case.
>
> Signed-off-by: Li Wang <[email protected]>
> Signed-off-by: Yunchuan Wen <[email protected]>
> Cc: Dmitry Monakhov <[email protected]>
> Cc: Jan Kara <[email protected]>
The patch looks good to me except for one nitpick below. Feel free to
add:

Reviewed-by: Jan Kara <[email protected]>

> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -2305,29 +2305,10 @@ done:
> return 0;
> }
>
> -/*
> - * File truncate and transaction commit interact with each other in a
> - * non-trivial way. If a transaction writing data block A is
> - * committing, we cannot discard the data by truncate until we have
> - * written them. Otherwise if we crashed after the transaction with
> - * write has committed but before the transaction with truncate has
> - * committed, we could see stale data in block A. This function is a
> - * helper to solve this problem. It starts writeout of the truncated
> - * part in case it is in the committing transaction.
> - *
> - * Filesystem code must call this function when inode is journaled in
> - * ordered mode before truncation happens and after the inode has been
> - * placed on orphan list with the new inode size. The second condition
> - * avoids the race that someone writes new data and we start
> - * committing the transaction after this function has been called but
> - * before a transaction for truncate is started (and furthermore it
> - * allows us to optimize the case where the addition to orphan list
> - * happens in the same transaction as write --- we don't have to write
> - * any data in such case).
> - */
> -int jbd2_journal_begin_ordered_truncate(journal_t *journal,
> +
> +int jbd2_journal_begin_ordered_discard(journal_t *journal,
> struct jbd2_inode *jinode,
> - loff_t new_size)
> + loff_t start, loff_t end)
Why don't you call this function jbd2_journal_begin_ordered_punch_hole()?
They are the same (well, except the end vs length difference but that seems
minor).

Honza

> {
> transaction_t *inode_trans, *commit_trans;
> int ret = 0;
> @@ -2346,10 +2327,12 @@ int jbd2_journal_begin_ordered_truncate(journal_t *journal,
> spin_unlock(&journal->j_list_lock);
> if (inode_trans == commit_trans) {
> ret = filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
> - new_size, LLONG_MAX);
> + start, end);
> if (ret)
> jbd2_journal_abort(journal, ret);
> }
> out:
> return ret;
> }
> +
> +
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 6e051f4..9543a5a 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1126,12 +1126,49 @@ extern int jbd2_journal_clear_err (journal_t *);
> extern int jbd2_journal_bmap(journal_t *, unsigned long, unsigned long long *);
> extern int jbd2_journal_force_commit(journal_t *);
> extern int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *inode);
> -extern int jbd2_journal_begin_ordered_truncate(journal_t *journal,
> - struct jbd2_inode *inode, loff_t new_size);
> +extern int jbd2_journal_begin_ordered_discard(journal_t *,
> + struct jbd2_inode *,
> + loff_t, loff_t);
> extern void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
> extern void jbd2_journal_release_jbd_inode(journal_t *journal, struct jbd2_inode *jinode);
>
> /*
> + * File truncate and transaction commit interact with each other in a
> + * non-trivial way. If a transaction writing data block A is
> + * committing, we cannot discard the data by truncate until we have
> + * written them. Otherwise if we crashed after the transaction with
> + * write has committed but before the transaction with truncate has
> + * committed, we could see stale data in block A. This function is a
> + * helper to solve this problem. It starts writeout of the truncated
> + * part in case it is in the committing transaction.
> + *
> + * Filesystem code must call this function when inode is journaled in
> + * ordered mode before truncation happens and after the inode has been
> + * placed on orphan list with the new inode size. The second condition
> + * avoids the race that someone writes new data and we start
> + * committing the transaction after this function has been called but
> + * before a transaction for truncate is started (and furthermore it
> + * allows us to optimize the case where the addition to orphan list
> + * happens in the same transaction as write --- we don't have to write
> + * any data in such case).
> + */
> +static inline int jbd2_journal_begin_ordered_truncate(journal_t *journal,
> + struct jbd2_inode *jinode,
> + loff_t new_size)
> +{
> + return jbd2_journal_begin_ordered_discard(journal, jinode,
> + new_size, LLONG_MAX);
> +}
> +
> +static inline int jbd2_journal_begin_ordered_punch_hole(journal_t *journal,
> + struct jbd2_inode *jinode,
> + loff_t start, loff_t length)
> +{
> + return jbd2_journal_begin_ordered_discard(journal, jinode,
> + start, start + length - 1);
> +}
> +
> +/*
> * journal_head management
> */
> struct journal_head *jbd2_journal_add_journal_head(struct buffer_head *bh);
> --
> 1.7.9.5
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-05-28 02:23:25

by Li Wang

[permalink] [raw]
Subject: [PATCH v4] ext4: Avoid unnecessarily writing back dirty pages before hole punching

For hole punching, currently ext4 will synchronously write back the
dirty pages fit into the hole, since the data on the disk responding
to those pages are to be deleted, it is benefical to directly release
those pages, no matter they are dirty or not, except the ordered case.

Signed-off-by: Li Wang <[email protected]>
Signed-off-by: Yunchuan Wen <[email protected]>
Cc: Dmitry Monakhov <[email protected]>
Reviewed-by: Zheng Liu <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
Hi Jan,
Did you mean this?
It seems you donot like the jbd2_journal_begin_ordered_discard:),
However, what do you think of calling jbd2_journal_begin_ordered_punch_hole()
from jbd2_journal_begin_ordered_truncate()? In my option,
the two guys stand at the same level. Nevertheless,
it is up to your choice.
---
fs/ext4/inode.c | 27 ++++++++++++++++-----------
fs/jbd2/journal.c | 2 +-
fs/jbd2/transaction.c | 29 ++++++-----------------------
include/linux/jbd2.h | 33 +++++++++++++++++++++++++++++++--
4 files changed, 54 insertions(+), 37 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d6382b8..844d1b8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3569,6 +3569,16 @@ int ext4_can_truncate(struct inode *inode)
return 0;
}

+static inline int ext4_begin_ordered_punch_hole(struct inode *inode,
+ loff_t start, loff_t length)
+{
+ if (!EXT4_I(inode)->jinode)
+ return 0;
+ return jbd2_journal_begin_ordered_punch_hole(EXT4_JOURNAL(inode),
+ EXT4_I(inode)->jinode,
+ start, start+length-1);
+}
+
/*
* ext4_punch_hole: punches a hole in a file by releaseing the blocks
* associated with the given offset and length
@@ -3602,17 +3612,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)

trace_ext4_punch_hole(inode, offset, length);

- /*
- * Write out all dirty pages to avoid race conditions
- * Then release them.
- */
- if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
- ret = filemap_write_and_wait_range(mapping, offset,
- offset + length - 1);
- if (ret)
- return ret;
- }
-
mutex_lock(&inode->i_mutex);
/* It's not possible punch hole on append only file */
if (IS_APPEND(inode) || IS_IMMUTABLE(inode)) {
@@ -3644,6 +3643,12 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
first_page_offset = first_page << PAGE_CACHE_SHIFT;
last_page_offset = last_page << PAGE_CACHE_SHIFT;

+ if (ext4_should_order_data(inode)) {
+ ret = ext4_begin_ordered_punch_hole(inode, offset, length);
+ if (ret)
+ return ret;
+ }
+
/* Now release the pages */
if (last_page_offset > first_page_offset) {
truncate_pagecache_range(inode, first_page_offset,
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 9545757..7af4e4f 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -97,7 +97,7 @@ EXPORT_SYMBOL(jbd2_journal_force_commit);
EXPORT_SYMBOL(jbd2_journal_file_inode);
EXPORT_SYMBOL(jbd2_journal_init_jbd_inode);
EXPORT_SYMBOL(jbd2_journal_release_jbd_inode);
-EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);
+EXPORT_SYMBOL(jbd2_journal_begin_ordered_punch_hole);
EXPORT_SYMBOL(jbd2_inode_cache);

static void __journal_abort_soft (journal_t *journal, int errno);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 10f524c..262b1c3 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2305,29 +2305,10 @@ done:
return 0;
}

-/*
- * File truncate and transaction commit interact with each other in a
- * non-trivial way. If a transaction writing data block A is
- * committing, we cannot discard the data by truncate until we have
- * written them. Otherwise if we crashed after the transaction with
- * write has committed but before the transaction with truncate has
- * committed, we could see stale data in block A. This function is a
- * helper to solve this problem. It starts writeout of the truncated
- * part in case it is in the committing transaction.
- *
- * Filesystem code must call this function when inode is journaled in
- * ordered mode before truncation happens and after the inode has been
- * placed on orphan list with the new inode size. The second condition
- * avoids the race that someone writes new data and we start
- * committing the transaction after this function has been called but
- * before a transaction for truncate is started (and furthermore it
- * allows us to optimize the case where the addition to orphan list
- * happens in the same transaction as write --- we don't have to write
- * any data in such case).
- */
-int jbd2_journal_begin_ordered_truncate(journal_t *journal,
+
+int jbd2_journal_begin_ordered_punch_hole(journal_t *journal,
struct jbd2_inode *jinode,
- loff_t new_size)
+ loff_t start, loff_t end)
{
transaction_t *inode_trans, *commit_trans;
int ret = 0;
@@ -2346,10 +2327,12 @@ int jbd2_journal_begin_ordered_truncate(journal_t *journal,
spin_unlock(&journal->j_list_lock);
if (inode_trans == commit_trans) {
ret = filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
- new_size, LLONG_MAX);
+ start, end);
if (ret)
jbd2_journal_abort(journal, ret);
}
out:
return ret;
}
+
+
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 6e051f4..8eb7865 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1126,12 +1126,41 @@ extern int jbd2_journal_clear_err (journal_t *);
extern int jbd2_journal_bmap(journal_t *, unsigned long, unsigned long long *);
extern int jbd2_journal_force_commit(journal_t *);
extern int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *inode);
-extern int jbd2_journal_begin_ordered_truncate(journal_t *journal,
- struct jbd2_inode *inode, loff_t new_size);
+extern int jbd2_journal_begin_ordered_punch_hole(journal_t *,
+ struct jbd2_inode *,
+ loff_t, loff_t);
extern void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
extern void jbd2_journal_release_jbd_inode(journal_t *journal, struct jbd2_inode *jinode);

/*
+ * File truncate and transaction commit interact with each other in a
+ * non-trivial way. If a transaction writing data block A is
+ * committing, we cannot discard the data by truncate until we have
+ * written them. Otherwise if we crashed after the transaction with
+ * write has committed but before the transaction with truncate has
+ * committed, we could see stale data in block A. This function is a
+ * helper to solve this problem. It starts writeout of the truncated
+ * part in case it is in the committing transaction.
+ *
+ * Filesystem code must call this function when inode is journaled in
+ * ordered mode before truncation happens and after the inode has been
+ * placed on orphan list with the new inode size. The second condition
+ * avoids the race that someone writes new data and we start
+ * committing the transaction after this function has been called but
+ * before a transaction for truncate is started (and furthermore it
+ * allows us to optimize the case where the addition to orphan list
+ * happens in the same transaction as write --- we don't have to write
+ * any data in such case).
+ */
+static inline int jbd2_journal_begin_ordered_truncate(journal_t *journal,
+ struct jbd2_inode *jinode,
+ loff_t new_size)
+{
+ return jbd2_journal_begin_ordered_punch_hole(journal, jinode,
+ new_size, LLONG_MAX);
+}
+
+/*
* journal_head management
*/
struct journal_head *jbd2_journal_add_journal_head(struct buffer_head *bh);
--
1.7.9.5

2013-05-28 09:04:22

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4] ext4: Avoid unnecessarily writing back dirty pages before hole punching

On Tue 28-05-13 10:23:25, Li Wang wrote:
> For hole punching, currently ext4 will synchronously write back the
> dirty pages fit into the hole, since the data on the disk responding
> to those pages are to be deleted, it is benefical to directly release
> those pages, no matter they are dirty or not, except the ordered case.
>
> Signed-off-by: Li Wang <[email protected]>
> Signed-off-by: Yunchuan Wen <[email protected]>
> Cc: Dmitry Monakhov <[email protected]>
> Reviewed-by: Zheng Liu <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> ---
> Hi Jan,
> Did you mean this?
> It seems you donot like the jbd2_journal_begin_ordered_discard:),
> However, what do you think of calling jbd2_journal_begin_ordered_punch_hole()
> from jbd2_journal_begin_ordered_truncate()? In my option,
> the two guys stand at the same level. Nevertheless,
> it is up to your choice.
Well, punch hole is a more generic version of truncate so it seems
perfectly fine for me to implement truncate using punch hole. Thanks for
updating the patch!

Honza


> ---
> fs/ext4/inode.c | 27 ++++++++++++++++-----------
> fs/jbd2/journal.c | 2 +-
> fs/jbd2/transaction.c | 29 ++++++-----------------------
> include/linux/jbd2.h | 33 +++++++++++++++++++++++++++++++--
> 4 files changed, 54 insertions(+), 37 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d6382b8..844d1b8 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3569,6 +3569,16 @@ int ext4_can_truncate(struct inode *inode)
> return 0;
> }
>
> +static inline int ext4_begin_ordered_punch_hole(struct inode *inode,
> + loff_t start, loff_t length)
> +{
> + if (!EXT4_I(inode)->jinode)
> + return 0;
> + return jbd2_journal_begin_ordered_punch_hole(EXT4_JOURNAL(inode),
> + EXT4_I(inode)->jinode,
> + start, start+length-1);
> +}
> +
> /*
> * ext4_punch_hole: punches a hole in a file by releaseing the blocks
> * associated with the given offset and length
> @@ -3602,17 +3612,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>
> trace_ext4_punch_hole(inode, offset, length);
>
> - /*
> - * Write out all dirty pages to avoid race conditions
> - * Then release them.
> - */
> - if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> - ret = filemap_write_and_wait_range(mapping, offset,
> - offset + length - 1);
> - if (ret)
> - return ret;
> - }
> -
> mutex_lock(&inode->i_mutex);
> /* It's not possible punch hole on append only file */
> if (IS_APPEND(inode) || IS_IMMUTABLE(inode)) {
> @@ -3644,6 +3643,12 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> first_page_offset = first_page << PAGE_CACHE_SHIFT;
> last_page_offset = last_page << PAGE_CACHE_SHIFT;
>
> + if (ext4_should_order_data(inode)) {
> + ret = ext4_begin_ordered_punch_hole(inode, offset, length);
> + if (ret)
> + return ret;
> + }
> +
> /* Now release the pages */
> if (last_page_offset > first_page_offset) {
> truncate_pagecache_range(inode, first_page_offset,
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 9545757..7af4e4f 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -97,7 +97,7 @@ EXPORT_SYMBOL(jbd2_journal_force_commit);
> EXPORT_SYMBOL(jbd2_journal_file_inode);
> EXPORT_SYMBOL(jbd2_journal_init_jbd_inode);
> EXPORT_SYMBOL(jbd2_journal_release_jbd_inode);
> -EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);
> +EXPORT_SYMBOL(jbd2_journal_begin_ordered_punch_hole);
> EXPORT_SYMBOL(jbd2_inode_cache);
>
> static void __journal_abort_soft (journal_t *journal, int errno);
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 10f524c..262b1c3 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -2305,29 +2305,10 @@ done:
> return 0;
> }
>
> -/*
> - * File truncate and transaction commit interact with each other in a
> - * non-trivial way. If a transaction writing data block A is
> - * committing, we cannot discard the data by truncate until we have
> - * written them. Otherwise if we crashed after the transaction with
> - * write has committed but before the transaction with truncate has
> - * committed, we could see stale data in block A. This function is a
> - * helper to solve this problem. It starts writeout of the truncated
> - * part in case it is in the committing transaction.
> - *
> - * Filesystem code must call this function when inode is journaled in
> - * ordered mode before truncation happens and after the inode has been
> - * placed on orphan list with the new inode size. The second condition
> - * avoids the race that someone writes new data and we start
> - * committing the transaction after this function has been called but
> - * before a transaction for truncate is started (and furthermore it
> - * allows us to optimize the case where the addition to orphan list
> - * happens in the same transaction as write --- we don't have to write
> - * any data in such case).
> - */
> -int jbd2_journal_begin_ordered_truncate(journal_t *journal,
> +
> +int jbd2_journal_begin_ordered_punch_hole(journal_t *journal,
> struct jbd2_inode *jinode,
> - loff_t new_size)
> + loff_t start, loff_t end)
> {
> transaction_t *inode_trans, *commit_trans;
> int ret = 0;
> @@ -2346,10 +2327,12 @@ int jbd2_journal_begin_ordered_truncate(journal_t *journal,
> spin_unlock(&journal->j_list_lock);
> if (inode_trans == commit_trans) {
> ret = filemap_fdatawrite_range(jinode->i_vfs_inode->i_mapping,
> - new_size, LLONG_MAX);
> + start, end);
> if (ret)
> jbd2_journal_abort(journal, ret);
> }
> out:
> return ret;
> }
> +
> +
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 6e051f4..8eb7865 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1126,12 +1126,41 @@ extern int jbd2_journal_clear_err (journal_t *);
> extern int jbd2_journal_bmap(journal_t *, unsigned long, unsigned long long *);
> extern int jbd2_journal_force_commit(journal_t *);
> extern int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *inode);
> -extern int jbd2_journal_begin_ordered_truncate(journal_t *journal,
> - struct jbd2_inode *inode, loff_t new_size);
> +extern int jbd2_journal_begin_ordered_punch_hole(journal_t *,
> + struct jbd2_inode *,
> + loff_t, loff_t);
> extern void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
> extern void jbd2_journal_release_jbd_inode(journal_t *journal, struct jbd2_inode *jinode);
>
> /*
> + * File truncate and transaction commit interact with each other in a
> + * non-trivial way. If a transaction writing data block A is
> + * committing, we cannot discard the data by truncate until we have
> + * written them. Otherwise if we crashed after the transaction with
> + * write has committed but before the transaction with truncate has
> + * committed, we could see stale data in block A. This function is a
> + * helper to solve this problem. It starts writeout of the truncated
> + * part in case it is in the committing transaction.
> + *
> + * Filesystem code must call this function when inode is journaled in
> + * ordered mode before truncation happens and after the inode has been
> + * placed on orphan list with the new inode size. The second condition
> + * avoids the race that someone writes new data and we start
> + * committing the transaction after this function has been called but
> + * before a transaction for truncate is started (and furthermore it
> + * allows us to optimize the case where the addition to orphan list
> + * happens in the same transaction as write --- we don't have to write
> + * any data in such case).
> + */
> +static inline int jbd2_journal_begin_ordered_truncate(journal_t *journal,
> + struct jbd2_inode *jinode,
> + loff_t new_size)
> +{
> + return jbd2_journal_begin_ordered_punch_hole(journal, jinode,
> + new_size, LLONG_MAX);
> +}
> +
> +/*
> * journal_head management
> */
> struct journal_head *jbd2_journal_add_journal_head(struct buffer_head *bh);
> --
> 1.7.9.5
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR