2009-03-23 20:32:22

by Jeff Layton

Subject: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

This may be a problem on other filesystems too, but the reproducer I
have involves NFS.

On NFS, the __mark_inode_dirty() call after writing back the inode is
done in the rpc_release handler for COMMIT calls. This call is done
asynchronously after the call completes.

Because there's no real coordination between __mark_inode_dirty() and
__sync_single_inode(), it's often the case that these two calls will
race and __mark_inode_dirty() will get called while I_SYNC is still set.
When this happens, __sync_single_inode() should detect that the inode
was redirtied while we were flushing it and call redirty_tail() to put
it back on the s_dirty list.

When redirty_tail() puts it back on the list, it only resets the
dirtied_when value if it's necessary to maintain the list order. Given
the right situation (the right I/O patterns and a lot of luck), this
could result in dirtied_when never getting updated on an inode that's
constantly being redirtied while pdflush is writing it back.

Since dirtied_when is based on jiffies, it's possible for it to persist
across 2 sign-bit flips of jiffies. When that happens, the time_after()
check in sync_sb_inodes no longer works correctly and writeouts by
pdflush of this inode and any inodes after it on the list stop.
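
To illustrate the failure mode, here's a standalone toy program (not kernel
code; the values are invented, and the macro is the core of time_after()
from include/linux/jiffies.h with the typechecks dropped). Once dirtied_when
lags behind by more than half the jiffies range, it suddenly compares as
being in the future:

#include <stdio.h>
#include <limits.h>

#define time_after(a, b)	((long)((b) - (a)) < 0)

int main(void)
{
	unsigned long dirtied_when = 1000;	/* stamped once, never refreshed */
	unsigned long start;			/* "now", as sampled by sync_sb_inodes */

	/* Shortly after dirtying: dirtied_when is correctly seen as in the past. */
	start = 2000;
	printf("%d\n", time_after(dirtied_when, start));	/* prints 0 */

	/*
	 * Once jiffies has advanced past dirtied_when by more than half the
	 * counter's range (a sign-bit flip), the unchanged stamp looks like
	 * it lies in the future, so the time_after() check in sync_sb_inodes
	 * stops writeback at this inode and everything queued behind it.
	 */
	start = dirtied_when + ULONG_MAX / 2 + 2;
	printf("%d\n", time_after(dirtied_when, start));	/* prints 1 */

	return 0;
}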

This patch fixes this by resetting the dirtied_when value on an inode
when we're adding it back onto an empty s_dirty list. Since we generally
write inodes from oldest to newest dirtied_when values, this has the
effect of making it so that these inodes don't end up with dirtied_when
values that are frozen.

I've also taken the liberty of fixing up the comments a bit and changed
the !time_after_eq() check in redirty_tail to be time_before(). That
should be functionally equivalent but I think it's more readable.

I wish this were just a theoretical problem, but we've had a customer
hit a variant of it in an older kernel. Newer upstream kernels have a
number of changes that make this problem less likely. As best I can tell
though, there is nothing that really prevents it.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/fs-writeback.c | 22 +++++++++++++++++-----
1 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e3fe991..bd2a7ff 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -184,19 +184,31 @@ static int write_inode(struct inode *inode, int sync)
  * furthest end of its superblock's dirty-inode list.
  *
  * Before stamping the inode's ->dirtied_when, we check to see whether it is
- * already the most-recently-dirtied inode on the s_dirty list. If that is
- * the case then the inode must have been redirtied while it was being written
- * out and we don't reset its dirtied_when.
+ * "newer" or equal to that of the most-recently-dirtied inode on the s_dirty
+ * list. If that is the case then we don't need to restamp it to maintain the
+ * order of the list.
+ *
+ * If s_dirty is empty however, then we need to go ahead and update
+ * dirtied_when for the inode. Not doing so will mean that inodes that are
+ * constantly being redirtied can end up with "stuck" dirtied_when values if
+ * they happen to consistently be the first one to go back on the list.
+ *
+ * Since we're using jiffies values in that field, letting dirtied_when grow
+ * too old will be problematic if jiffies wraps. It may also be causing
+ * pdflush to flush the inode too often since it'll always look like it was
+ * dirtied a long time ago.
  */
 static void redirty_tail(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
 
-	if (!list_empty(&sb->s_dirty)) {
+	if (list_empty(&sb->s_dirty)) {
+		inode->dirtied_when = jiffies;
+	} else {
 		struct inode *tail_inode;
 
 		tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
-		if (!time_after_eq(inode->dirtied_when,
+		if (time_before(inode->dirtied_when,
 				tail_inode->dirtied_when))
 			inode->dirtied_when = jiffies;
 	}
--
1.6.0.6


2009-03-24 04:41:42

by Ian Kent

Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Mon, 23 Mar 2009, Jeff Layton wrote:

> This may be a problem on other filesystems too, but the reproducer I
> have involves NFS.
>
> On NFS, the __mark_inode_dirty() call after writing back the inode is
> done in the rpc_release handler for COMMIT calls. This call is done
> asynchronously after the call completes.
>
> Because there's no real coordination between __mark_inode_dirty() and
> __sync_single_inode(), it's often the case that these two calls will
> race and __mark_inode_dirty() will get called while I_SYNC is still set.
> When this happens, __sync_single_inode() should detect that the inode
> was redirtied while we were flushing it and call redirty_tail() to put
> it back on the s_dirty list.
>
> When redirty_tail() puts it back on the list, it only resets the
> dirtied_when value if it's necessary to maintain the list order. Given
> the right situation (the right I/O patterns and a lot of luck), this
> could result in dirtied_when never getting updated on an inode that's
> constantly being redirtied while pdflush is writing it back.
>
> Since dirtied_when is based on jiffies, it's possible for it to persist
> across 2 sign-bit flips of jiffies. When that happens, the time_after()
> check in sync_sb_inodes no longer works correctly and writeouts by
> pdflush of this inode and any inodes after it on the list stop.
>
> This patch fixes this by resetting the dirtied_when value on an inode
> when we're adding it back onto an empty s_dirty list. Since we generally
> write inodes from oldest to newest dirtied_when values, this has the
> effect of making it so that these inodes don't end up with dirtied_when
> values that are frozen.
>
> I've also taken the liberty of fixing up the comments a bit and changed
> the !time_after_eq() check in redirty_tail to be time_before(). That
> should be functionally equivalent but I think it's more readable.
>
> I wish this were just a theoretical problem, but we've had a customer
> hit a variant of it in an older kernel. Newer upstream kernels have a
> number of changes that make this problem less likely. As best I can tell
> though, there is nothing that really prevents it.
>
> Signed-off-by: Jeff Layton <[email protected]>
Acked-by: Ian Kent <[email protected]>

The assumption is that all inodes heading for the s_dirty list will get
there by calling redirty_tail(). It looks like that's the case, but Andrew,
do you agree the assumption holds?

> ---
> fs/fs-writeback.c | 22 +++++++++++++++++-----
> 1 files changed, 17 insertions(+), 5 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index e3fe991..bd2a7ff 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -184,19 +184,31 @@ static int write_inode(struct inode *inode, int sync)
> * furthest end of its superblock's dirty-inode list.
> *
> * Before stamping the inode's ->dirtied_when, we check to see whether it is
> - * already the most-recently-dirtied inode on the s_dirty list. If that is
> - * the case then the inode must have been redirtied while it was being written
> - * out and we don't reset its dirtied_when.
> + * "newer" or equal to that of the most-recently-dirtied inode on the s_dirty
> + * list. If that is the case then we don't need to restamp it to maintain the
> + * order of the list.
> + *
> + * If s_dirty is empty however, then we need to go ahead and update
> + * dirtied_when for the inode. Not doing so will mean that inodes that are
> + * constantly being redirtied can end up with "stuck" dirtied_when values if
> + * they happen to consistently be the first one to go back on the list.
> + *
> + * Since we're using jiffies values in that field, letting dirtied_when grow
> + * too old will be problematic if jiffies wraps. It may also be causing
> + * pdflush to flush the inode too often since it'll always look like it was
> + * dirtied a long time ago.
> */
> static void redirty_tail(struct inode *inode)
> {
> struct super_block *sb = inode->i_sb;
>
> - if (!list_empty(&sb->s_dirty)) {
> + if (list_empty(&sb->s_dirty)) {
> + inode->dirtied_when = jiffies;
> + } else {
> struct inode *tail_inode;
>
> tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
> - if (!time_after_eq(inode->dirtied_when,
> + if (time_before(inode->dirtied_when,
> tail_inode->dirtied_when))
> inode->dirtied_when = jiffies;
> }
> --
> 1.6.0.6
>

2009-03-24 05:04:38

by Ian Kent

Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

Ian Kent wrote:
> On Mon, 23 Mar 2009, Jeff Layton wrote:
>
>> This may be a problem on other filesystems too, but the reproducer I
>> have involves NFS.
>>
>> On NFS, the __mark_inode_dirty() call after writing back the inode is
>> done in the rpc_release handler for COMMIT calls. This call is done
>> asynchronously after the call completes.
>>
>> Because there's no real coordination between __mark_inode_dirty() and
>> __sync_single_inode(), it's often the case that these two calls will
>> race and __mark_inode_dirty() will get called while I_SYNC is still set.
>> When this happens, __sync_single_inode() should detect that the inode
>> was redirtied while we were flushing it and call redirty_tail() to put
>> it back on the s_dirty list.
>>
>> When redirty_tail() puts it back on the list, it only resets the
>> dirtied_when value if it's necessary to maintain the list order. Given
>> the right situation (the right I/O patterns and a lot of luck), this
>> could result in dirtied_when never getting updated on an inode that's
>> constantly being redirtied while pdflush is writing it back.
>>
>> Since dirtied_when is based on jiffies, it's possible for it to persist
>> across 2 sign-bit flips of jiffies. When that happens, the time_after()
>> check in sync_sb_inodes no longer works correctly and writeouts by
>> pdflush of this inode and any inodes after it on the list stop.
>>
>> This patch fixes this by resetting the dirtied_when value on an inode
>> when we're adding it back onto an empty s_dirty list. Since we generally
>> write inodes from oldest to newest dirtied_when values, this has the
>> effect of making it so that these inodes don't end up with dirtied_when
>> values that are frozen.
>>
>> I've also taken the liberty of fixing up the comments a bit and changed
>> the !time_after_eq() check in redirty_tail to be time_before(). That
>> should be functionally equivalent but I think it's more readable.
>>
>> I wish this were just a theoretical problem, but we've had a customer
>> hit a variant of it in an older kernel. Newer upstream kernels have a
>> number of changes that make this problem less likely. As best I can tell
>> though, there is nothing that really prevents it.
>>
>> Signed-off-by: Jeff Layton <[email protected]>
> Acked-by: Ian Kent <[email protected]>
>
> The assumption is that all inodes heading for the s_dirty list will get
> there by calling redirty_tail(). It looks like that's the case, but Andrew,
> do you agree the assumption holds?

Oh .. hang on, that's not quite right.

Either they get there via redirty_tail(), or dirtied_when has been set to
jiffies at the time of the move (i.e. a newly dirtied inode).

>
>> ---
>> fs/fs-writeback.c | 22 +++++++++++++++++-----
>> 1 files changed, 17 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>> index e3fe991..bd2a7ff 100644
>> --- a/fs/fs-writeback.c
>> +++ b/fs/fs-writeback.c
>> @@ -184,19 +184,31 @@ static int write_inode(struct inode *inode, int sync)
>> * furthest end of its superblock's dirty-inode list.
>> *
>> * Before stamping the inode's ->dirtied_when, we check to see whether it is
>> - * already the most-recently-dirtied inode on the s_dirty list. If that is
>> - * the case then the inode must have been redirtied while it was being written
>> - * out and we don't reset its dirtied_when.
>> + * "newer" or equal to that of the most-recently-dirtied inode on the s_dirty
>> + * list. If that is the case then we don't need to restamp it to maintain the
>> + * order of the list.
>> + *
>> + * If s_dirty is empty however, then we need to go ahead and update
>> + * dirtied_when for the inode. Not doing so will mean that inodes that are
>> + * constantly being redirtied can end up with "stuck" dirtied_when values if
>> + * they happen to consistently be the first one to go back on the list.
>> + *
>> + * Since we're using jiffies values in that field, letting dirtied_when grow
>> + * too old will be problematic if jiffies wraps. It may also be causing
>> + * pdflush to flush the inode too often since it'll always look like it was
>> + * dirtied a long time ago.
>> */
>> static void redirty_tail(struct inode *inode)
>> {
>> struct super_block *sb = inode->i_sb;
>>
>> - if (!list_empty(&sb->s_dirty)) {
>> + if (list_empty(&sb->s_dirty)) {
>> + inode->dirtied_when = jiffies;
>> + } else {
>> struct inode *tail_inode;
>>
>> tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
>> - if (!time_after_eq(inode->dirtied_when,
>> + if (time_before(inode->dirtied_when,
>> tail_inode->dirtied_when))
>> inode->dirtied_when = jiffies;
>> }
>> --
>> 1.6.0.6
>>

2009-03-24 13:58:20

by Fengguang Wu

Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

Hi Jeff,

On Mon, Mar 23, 2009 at 04:30:33PM -0400, Jeff Layton wrote:
> This may be a problem on other filesystems too, but the reproducer I
> have involves NFS.
>
> On NFS, the __mark_inode_dirty() call after writing back the inode is
> done in the rpc_release handler for COMMIT calls. This call is done
> asynchronously after the call completes.
>
> Because there's no real coordination between __mark_inode_dirty() and
> __sync_single_inode(), it's often the case that these two calls will
> race and __mark_inode_dirty() will get called while I_SYNC is still set.
> When this happens, __sync_single_inode() should detect that the inode
> was redirtied while we were flushing it and call redirty_tail() to put
> it back on the s_dirty list.
>
> When redirty_tail() puts it back on the list, it only resets the
> dirtied_when value if it's necessary to maintain the list order. Given
> the right situation (the right I/O patterns and a lot of luck), this
> could result in dirtied_when never getting updated on an inode that's
> constantly being redirtied while pdflush is writing it back.
>
> Since dirtied_when is based on jiffies, it's possible for it to persist
> across 2 sign-bit flips of jiffies. When that happens, the time_after()
> check in sync_sb_inodes no longer works correctly and writeouts by
> pdflush of this inode and any inodes after it on the list stop.
>
> This patch fixes this by resetting the dirtied_when value on an inode
> when we're adding it back onto an empty s_dirty list. Since we generally
> write inodes from oldest to newest dirtied_when values, this has the
> effect of making it so that these inodes don't end up with dirtied_when
> values that are frozen.
>
> I've also taken the liberty of fixing up the comments a bit and changed
> the !time_after_eq() check in redirty_tail to be time_before(). That
> should be functionally equivalent but I think it's more readable.
>
> I wish this were just a theoretical problem, but we've had a customer
> hit a variant of it in an older kernel. Newer upstream kernels have a
> number of changes that make this problem less likely. As best I can tell
> though, there is nothing that really prevents it.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/fs-writeback.c | 22 +++++++++++++++++-----
> 1 files changed, 17 insertions(+), 5 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index e3fe991..bd2a7ff 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -184,19 +184,31 @@ static int write_inode(struct inode *inode, int sync)
> * furthest end of its superblock's dirty-inode list.
> *
> * Before stamping the inode's ->dirtied_when, we check to see whether it is
> - * already the most-recently-dirtied inode on the s_dirty list. If that is
> - * the case then the inode must have been redirtied while it was being written
> - * out and we don't reset its dirtied_when.
> + * "newer" or equal to that of the most-recently-dirtied inode on the s_dirty
> + * list. If that is the case then we don't need to restamp it to maintain the
> + * order of the list.
> + *
> + * If s_dirty is empty however, then we need to go ahead and update
> + * dirtied_when for the inode. Not doing so will mean that inodes that are
> + * constantly being redirtied can end up with "stuck" dirtied_when values if
> + * they happen to consistently be the first one to go back on the list.
> + *
> + * Since we're using jiffies values in that field, letting dirtied_when grow
> + * too old will be problematic if jiffies wraps. It may also be causing
> + * pdflush to flush the inode too often since it'll always look like it was
> + * dirtied a long time ago.
> */
> static void redirty_tail(struct inode *inode)
> {
> struct super_block *sb = inode->i_sb;
>
> - if (!list_empty(&sb->s_dirty)) {
> + if (list_empty(&sb->s_dirty)) {
> + inode->dirtied_when = jiffies;
> + } else {
> struct inode *tail_inode;
>
> tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
> - if (!time_after_eq(inode->dirtied_when,
> + if (time_before(inode->dirtied_when,
> tail_inode->dirtied_when))
> inode->dirtied_when = jiffies;
> }

I'm afraid your patch is equivalent to the following one.
Because once the first inode's dirtied_when is set to jiffies,
in order to keep the list in order, the following ones (mostly)
will also be updated. A domino effect.

Thanks,
Fengguang

---
fs/fs-writeback.c | 14 +-------------
1 file changed, 1 insertion(+), 13 deletions(-)

--- mm.orig/fs/fs-writeback.c
+++ mm/fs/fs-writeback.c
@@ -182,24 +182,12 @@ static int write_inode(struct inode *ino
 /*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
- *
- * Before stamping the inode's ->dirtied_when, we check to see whether it is
- * already the most-recently-dirtied inode on the s_dirty list. If that is
- * the case then the inode must have been redirtied while it was being written
- * out and we don't reset its dirtied_when.
  */
 static void redirty_tail(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
 
-	if (!list_empty(&sb->s_dirty)) {
-		struct inode *tail_inode;
-
-		tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
-		if (!time_after_eq(inode->dirtied_when,
-				tail_inode->dirtied_when))
-			inode->dirtied_when = jiffies;
-	}
+	inode->dirtied_when = jiffies;
 	list_move(&inode->i_list, &sb->s_dirty);
 }
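
To make the equivalence concrete, here is a small userspace toy (not kernel
code; all names and values are invented) that mimics the patched
redirty_tail() logic. Once the first redirtied inode lands on an empty
s_dirty and gets a fresh stamp, every inode redirtied after it compares
older than the list tail and is restamped as well:

#include <stdio.h>

#define time_before(a, b)	((long)((a) - (b)) < 0)

static unsigned long jiffies = 1000000;		/* pretend current time */

struct toy_inode {
	unsigned long dirtied_when;
};

/*
 * Mimics the patched redirty_tail(); tail_stamp stands in for the
 * dirtied_when of the most-recently-dirtied inode on s_dirty
 * (0 means the list is empty).
 */
static void toy_redirty_tail(struct toy_inode *inode, unsigned long *tail_stamp)
{
	if (*tail_stamp == 0)			/* empty s_dirty */
		inode->dirtied_when = jiffies;
	else if (time_before(inode->dirtied_when, *tail_stamp))
		inode->dirtied_when = jiffies;
	*tail_stamp = inode->dirtied_when;	/* this inode is now the tail */
}

int main(void)
{
	struct toy_inode inodes[3] = {
		{ .dirtied_when = 500 },	/* long-stuck timestamps */
		{ .dirtied_when = 600 },
		{ .dirtied_when = 700 },
	};
	unsigned long tail_stamp = 0;		/* s_dirty starts out empty */
	int i;

	for (i = 0; i < 3; i++) {
		jiffies++;			/* time moves on between redirties */
		toy_redirty_tail(&inodes[i], &tail_stamp);
		printf("inode %d: dirtied_when=%lu\n", i, inodes[i].dirtied_when);
	}
	return 0;
}

All three inodes come out stamped with the current jiffies, i.e. the same
result as setting dirtied_when unconditionally.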

2009-03-24 14:28:23

by Ian Kent

Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

Wu Fengguang wrote:
> Hi Jeff,
>
> On Mon, Mar 23, 2009 at 04:30:33PM -0400, Jeff Layton wrote:
>> This may be a problem on other filesystems too, but the reproducer I
>> have involves NFS.
>>
>> On NFS, the __mark_inode_dirty() call after writing back the inode is
>> done in the rpc_release handler for COMMIT calls. This call is done
>> asynchronously after the call completes.
>>
>> Because there's no real coordination between __mark_inode_dirty() and
>> __sync_single_inode(), it's often the case that these two calls will
>> race and __mark_inode_dirty() will get called while I_SYNC is still set.
>> When this happens, __sync_single_inode() should detect that the inode
>> was redirtied while we were flushing it and call redirty_tail() to put
>> it back on the s_dirty list.
>>
>> When redirty_tail() puts it back on the list, it only resets the
>> dirtied_when value if it's necessary to maintain the list order. Given
>> the right situation (the right I/O patterns and a lot of luck), this
>> could result in dirtied_when never getting updated on an inode that's
>> constantly being redirtied while pdflush is writing it back.
>>
>> Since dirtied_when is based on jiffies, it's possible for it to persist
>> across 2 sign-bit flips of jiffies. When that happens, the time_after()
>> check in sync_sb_inodes no longer works correctly and writeouts by
>> pdflush of this inode and any inodes after it on the list stop.
>>
>> This patch fixes this by resetting the dirtied_when value on an inode
>> when we're adding it back onto an empty s_dirty list. Since we generally
>> write inodes from oldest to newest dirtied_when values, this has the
>> effect of making it so that these inodes don't end up with dirtied_when
>> values that are frozen.
>>
>> I've also taken the liberty of fixing up the comments a bit and changed
>> the !time_after_eq() check in redirty_tail to be time_before(). That
>> should be functionally equivalent but I think it's more readable.
>>
>> I wish this were just a theoretical problem, but we've had a customer
>> hit a variant of it in an older kernel. Newer upstream kernels have a
>> number of changes that make this problem less likely. As best I can tell
>> though, there is nothing that really prevents it.
>>
>> Signed-off-by: Jeff Layton <[email protected]>
>> ---
>> fs/fs-writeback.c | 22 +++++++++++++++++-----
>> 1 files changed, 17 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>> index e3fe991..bd2a7ff 100644
>> --- a/fs/fs-writeback.c
>> +++ b/fs/fs-writeback.c
>> @@ -184,19 +184,31 @@ static int write_inode(struct inode *inode, int sync)
>> * furthest end of its superblock's dirty-inode list.
>> *
>> * Before stamping the inode's ->dirtied_when, we check to see whether it is
>> - * already the most-recently-dirtied inode on the s_dirty list. If that is
>> - * the case then the inode must have been redirtied while it was being written
>> - * out and we don't reset its dirtied_when.
>> + * "newer" or equal to that of the most-recently-dirtied inode on the s_dirty
>> + * list. If that is the case then we don't need to restamp it to maintain the
>> + * order of the list.
>> + *
>> + * If s_dirty is empty however, then we need to go ahead and update
>> + * dirtied_when for the inode. Not doing so will mean that inodes that are
>> + * constantly being redirtied can end up with "stuck" dirtied_when values if
>> + * they happen to consistently be the first one to go back on the list.
>> + *
>> + * Since we're using jiffies values in that field, letting dirtied_when grow
>> + * too old will be problematic if jiffies wraps. It may also be causing
>> + * pdflush to flush the inode too often since it'll always look like it was
>> + * dirtied a long time ago.
>> */
>> static void redirty_tail(struct inode *inode)
>> {
>> struct super_block *sb = inode->i_sb;
>>
>> - if (!list_empty(&sb->s_dirty)) {
>> + if (list_empty(&sb->s_dirty)) {
>> + inode->dirtied_when = jiffies;
>> + } else {
>> struct inode *tail_inode;
>>
>> tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
>> - if (!time_after_eq(inode->dirtied_when,
>> + if (time_before(inode->dirtied_when,
>> tail_inode->dirtied_when))
>> inode->dirtied_when = jiffies;
>> }
>
> I'm afraid your patch is equivalent to the following one.
> Because once the first inode's dirtied_when is set to jiffies,
> in order to keep the list in order, the following ones (mostly)
> will also be updated. A domino effect.
>
> Thanks,
> Fengguang
>
> ---
> fs/fs-writeback.c | 14 +-------------
> 1 file changed, 1 insertion(+), 13 deletions(-)
>
> --- mm.orig/fs/fs-writeback.c
> +++ mm/fs/fs-writeback.c
> @@ -182,24 +182,12 @@ static int write_inode(struct inode *ino
> /*
> * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
> * furthest end of its superblock's dirty-inode list.
> - *
> - * Before stamping the inode's ->dirtied_when, we check to see whether it is
> - * already the most-recently-dirtied inode on the s_dirty list. If that is
> - * the case then the inode must have been redirtied while it was being written
> - * out and we don't reset its dirtied_when.
> */
> static void redirty_tail(struct inode *inode)
> {
> struct super_block *sb = inode->i_sb;
>
> - if (!list_empty(&sb->s_dirty)) {
> - struct inode *tail_inode;
> -
> - tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
> - if (!time_after_eq(inode->dirtied_when,
> - tail_inode->dirtied_when))
> - inode->dirtied_when = jiffies;
> - }
> + inode->dirtied_when = jiffies;
> list_move(&inode->i_list, &sb->s_dirty);
> }

Oh .. of course .. at least it's simpler.

Ian

2009-03-24 14:29:23

by Jeff Layton

Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Tue, 24 Mar 2009 21:57:20 +0800
Wu Fengguang <[email protected]> wrote:

> Hi Jeff,
>
> On Mon, Mar 23, 2009 at 04:30:33PM -0400, Jeff Layton wrote:
> > This may be a problem on other filesystems too, but the reproducer I
> > have involves NFS.
> >
> > On NFS, the __mark_inode_dirty() call after writing back the inode is
> > done in the rpc_release handler for COMMIT calls. This call is done
> > asynchronously after the call completes.
> >
> > Because there's no real coordination between __mark_inode_dirty() and
> > __sync_single_inode(), it's often the case that these two calls will
> > race and __mark_inode_dirty() will get called while I_SYNC is still set.
> > When this happens, __sync_single_inode() should detect that the inode
> > was redirtied while we were flushing it and call redirty_tail() to put
> > it back on the s_dirty list.
> >
> > When redirty_tail() puts it back on the list, it only resets the
> > dirtied_when value if it's necessary to maintain the list order. Given
> > the right situation (the right I/O patterns and a lot of luck), this
> > could result in dirtied_when never getting updated on an inode that's
> > constantly being redirtied while pdflush is writing it back.
> >
> > Since dirtied_when is based on jiffies, it's possible for it to persist
> > across 2 sign-bit flips of jiffies. When that happens, the time_after()
> > check in sync_sb_inodes no longer works correctly and writeouts by
> > pdflush of this inode and any inodes after it on the list stop.
> >
> > This patch fixes this by resetting the dirtied_when value on an inode
> > when we're adding it back onto an empty s_dirty list. Since we generally
> > write inodes from oldest to newest dirtied_when values, this has the
> > effect of making it so that these inodes don't end up with dirtied_when
> > values that are frozen.
> >
> > I've also taken the liberty of fixing up the comments a bit and changed
> > the !time_after_eq() check in redirty_tail to be time_before(). That
> > should be functionally equivalent but I think it's more readable.
> >
> > I wish this were just a theoretical problem, but we've had a customer
> > hit a variant of it in an older kernel. Newer upstream kernels have a
> > number of changes that make this problem less likely. As best I can tell
> > though, there is nothing that really prevents it.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/fs-writeback.c | 22 +++++++++++++++++-----
> > 1 files changed, 17 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index e3fe991..bd2a7ff 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -184,19 +184,31 @@ static int write_inode(struct inode *inode, int sync)
> > * furthest end of its superblock's dirty-inode list.
> > *
> > * Before stamping the inode's ->dirtied_when, we check to see whether it is
> > - * already the most-recently-dirtied inode on the s_dirty list. If that is
> > - * the case then the inode must have been redirtied while it was being written
> > - * out and we don't reset its dirtied_when.
> > + * "newer" or equal to that of the most-recently-dirtied inode on the s_dirty
> > + * list. If that is the case then we don't need to restamp it to maintain the
> > + * order of the list.
> > + *
> > + * If s_dirty is empty however, then we need to go ahead and update
> > + * dirtied_when for the inode. Not doing so will mean that inodes that are
> > + * constantly being redirtied can end up with "stuck" dirtied_when values if
> > + * they happen to consistently be the first one to go back on the list.
> > + *
> > + * Since we're using jiffies values in that field, letting dirtied_when grow
> > + * too old will be problematic if jiffies wraps. It may also be causing
> > + * pdflush to flush the inode too often since it'll always look like it was
> > + * dirtied a long time ago.
> > */
> > static void redirty_tail(struct inode *inode)
> > {
> > struct super_block *sb = inode->i_sb;
> >
> > - if (!list_empty(&sb->s_dirty)) {
> > + if (list_empty(&sb->s_dirty)) {
> > + inode->dirtied_when = jiffies;
> > + } else {
> > struct inode *tail_inode;
> >
> > tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
> > - if (!time_after_eq(inode->dirtied_when,
> > + if (time_before(inode->dirtied_when,
> > tail_inode->dirtied_when))
> > inode->dirtied_when = jiffies;
> > }
>
> I'm afraid your patch is equivalent to the following one.
> Because once the first inode's dirtied_when is set to jiffies,
> in order to keep the list in order, the following ones (mostly)
> will also be updated. A domino effect.
>
> Thanks,
> Fengguang
>

Good point. One of our other engineers proposed a similar patch
originally. I considered it but wasn't clear whether there could be a
situation where unconditionally resetting dirtied_when would be a
problem. Now that I think about it though, I think you're right...

So maybe something like the patch below is the right thing to do? Or,
maybe when we believe that the inode was fully cleaned and then
redirtied, we'd just unconditionally stamp dirtied_when. Something like
this maybe?

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index bd2a7ff..596c96e 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -364,7 +364,8 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc)
 		 * Someone redirtied the inode while were writing back
 		 * the pages.
 		 */
-		redirty_tail(inode);
+		inode->dirtied_when = jiffies;
+		list_move(&inode->i_list, &sb->s_dirty);
 	} else if (atomic_read(&inode->i_count)) {
 		/*
 		 * The inode is clean, inuse

2009-03-24 14:48:26

by Jeff Layton

Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Tue, 24 Mar 2009 10:28:06 -0400
Jeff Layton <[email protected]> wrote:

> On Tue, 24 Mar 2009 21:57:20 +0800
> Wu Fengguang <[email protected]> wrote:
>
> > Hi Jeff,
> >
> > On Mon, Mar 23, 2009 at 04:30:33PM -0400, Jeff Layton wrote:
> > > This may be a problem on other filesystems too, but the reproducer I
> > > have involves NFS.
> > >
> > > On NFS, the __mark_inode_dirty() call after writing back the inode is
> > > done in the rpc_release handler for COMMIT calls. This call is done
> > > asynchronously after the call completes.
> > >
> > > Because there's no real coordination between __mark_inode_dirty() and
> > > __sync_single_inode(), it's often the case that these two calls will
> > > race and __mark_inode_dirty() will get called while I_SYNC is still set.
> > > When this happens, __sync_single_inode() should detect that the inode
> > > was redirtied while we were flushing it and call redirty_tail() to put
> > > it back on the s_dirty list.
> > >
> > > When redirty_tail() puts it back on the list, it only resets the
> > > dirtied_when value if it's necessary to maintain the list order. Given
> > > the right situation (the right I/O patterns and a lot of luck), this
> > > could result in dirtied_when never getting updated on an inode that's
> > > constantly being redirtied while pdflush is writing it back.
> > >
> > > Since dirtied_when is based on jiffies, it's possible for it to persist
> > > across 2 sign-bit flips of jiffies. When that happens, the time_after()
> > > check in sync_sb_inodes no longer works correctly and writeouts by
> > > pdflush of this inode and any inodes after it on the list stop.
> > >
> > > This patch fixes this by resetting the dirtied_when value on an inode
> > > when we're adding it back onto an empty s_dirty list. Since we generally
> > > write inodes from oldest to newest dirtied_when values, this has the
> > > effect of making it so that these inodes don't end up with dirtied_when
> > > values that are frozen.
> > >
> > > I've also taken the liberty of fixing up the comments a bit and changed
> > > the !time_after_eq() check in redirty_tail to be time_before(). That
> > > should be functionally equivalent but I think it's more readable.
> > >
> > > I wish this were just a theoretical problem, but we've had a customer
> > > hit a variant of it in an older kernel. Newer upstream kernels have a
> > > number of changes that make this problem less likely. As best I can tell
> > > though, there is nothing that really prevents it.
> > >
> > > Signed-off-by: Jeff Layton <[email protected]>
> > > ---
> > > fs/fs-writeback.c | 22 +++++++++++++++++-----
> > > 1 files changed, 17 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index e3fe991..bd2a7ff 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -184,19 +184,31 @@ static int write_inode(struct inode *inode, int sync)
> > > * furthest end of its superblock's dirty-inode list.
> > > *
> > > * Before stamping the inode's ->dirtied_when, we check to see whether it is
> > > - * already the most-recently-dirtied inode on the s_dirty list. If that is
> > > - * the case then the inode must have been redirtied while it was being written
> > > - * out and we don't reset its dirtied_when.
> > > + * "newer" or equal to that of the most-recently-dirtied inode on the s_dirty
> > > + * list. If that is the case then we don't need to restamp it to maintain the
> > > + * order of the list.
> > > + *
> > > + * If s_dirty is empty however, then we need to go ahead and update
> > > + * dirtied_when for the inode. Not doing so will mean that inodes that are
> > > + * constantly being redirtied can end up with "stuck" dirtied_when values if
> > > + * they happen to consistently be the first one to go back on the list.
> > > + *
> > > + * Since we're using jiffies values in that field, letting dirtied_when grow
> > > + * too old will be problematic if jiffies wraps. It may also be causing
> > > + * pdflush to flush the inode too often since it'll always look like it was
> > > + * dirtied a long time ago.
> > > */
> > > static void redirty_tail(struct inode *inode)
> > > {
> > > struct super_block *sb = inode->i_sb;
> > >
> > > - if (!list_empty(&sb->s_dirty)) {
> > > + if (list_empty(&sb->s_dirty)) {
> > > + inode->dirtied_when = jiffies;
> > > + } else {
> > > struct inode *tail_inode;
> > >
> > > tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
> > > - if (!time_after_eq(inode->dirtied_when,
> > > + if (time_before(inode->dirtied_when,
> > > tail_inode->dirtied_when))
> > > inode->dirtied_when = jiffies;
> > > }
> >
> > I'm afraid your patch is equivalent to the following one.
> > Because once the first inode's dirtied_when is set to jiffies,
> > in order to keep the list in order, the following ones (mostly)
> > will also be updated. A domino effect.
> >
> > Thanks,
> > Fengguang
> >
>
> Good point. One of our other engineers proposed a similar patch
> originally. I considered it but wasn't clear whether there could be a
> situation where unconditionally resetting dirtied_when would be a
> problem. Now that I think about it though, I think you're right...
>
> So maybe something like the patch below is the right thing to do? Or,
> maybe when we believe that the inode was fully cleaned and then
> redirtied, we'd just unconditionally stamp dirtied_when. Something like
> this maybe?
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index bd2a7ff..596c96e 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -364,7 +364,8 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc)
> * Someone redirtied the inode while were writing back
> * the pages.
> */
> - redirty_tail(inode);
> + inode->dirtied_when = jiffies;
> + list_move(&inode->i_list, &sb->s_dirty);
> } else if (atomic_read(&inode->i_count)) {
> /*
> * The inode is clean, inuse

Hmm...though it is still possible that you could consistently race in
such a way that after writepages(), I_DIRTY is never set but the
PAGECACHE_TAG_DIRTY is still set on the mapping. And then we'd be back
to the same problem of a stuck dirtied_when value.

So maybe someone can explain to me why we take such great pains to
preserve the dirtied_when value when we're putting the inode back on
the tail of s_dirty? Why not just unconditionally reset it?

--
Jeff Layton <[email protected]>

2009-03-24 15:14:44

by Ian Kent

Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

Jeff Layton wrote:
> On Tue, 24 Mar 2009 10:28:06 -0400
> Jeff Layton <[email protected]> wrote:
>
>> On Tue, 24 Mar 2009 21:57:20 +0800
>> Wu Fengguang <[email protected]> wrote:
>>
>>> Hi Jeff,
>>>
>>> On Mon, Mar 23, 2009 at 04:30:33PM -0400, Jeff Layton wrote:
>>>> This may be a problem on other filesystems too, but the reproducer I
>>>> have involves NFS.
>>>>
>>>> On NFS, the __mark_inode_dirty() call after writing back the inode is
>>>> done in the rpc_release handler for COMMIT calls. This call is done
>>>> asynchronously after the call completes.
>>>>
>>>> Because there's no real coordination between __mark_inode_dirty() and
>>>> __sync_single_inode(), it's often the case that these two calls will
>>>> race and __mark_inode_dirty() will get called while I_SYNC is still set.
>>>> When this happens, __sync_single_inode() should detect that the inode
>>>> was redirtied while we were flushing it and call redirty_tail() to put
>>>> it back on the s_dirty list.
>>>>
>>>> When redirty_tail() puts it back on the list, it only resets the
>>>> dirtied_when value if it's necessary to maintain the list order. Given
>>>> the right situation (the right I/O patterns and a lot of luck), this
>>>> could result in dirtied_when never getting updated on an inode that's
>>>> constantly being redirtied while pdflush is writing it back.
>>>>
>>>> Since dirtied_when is based on jiffies, it's possible for it to persist
>>>> across 2 sign-bit flips of jiffies. When that happens, the time_after()
>>>> check in sync_sb_inodes no longer works correctly and writeouts by
>>>> pdflush of this inode and any inodes after it on the list stop.
>>>>
>>>> This patch fixes this by resetting the dirtied_when value on an inode
>>>> when we're adding it back onto an empty s_dirty list. Since we generally
>>>> write inodes from oldest to newest dirtied_when values, this has the
>>>> effect of making it so that these inodes don't end up with dirtied_when
>>>> values that are frozen.
>>>>
>>>> I've also taken the liberty of fixing up the comments a bit and changed
>>>> the !time_after_eq() check in redirty_tail to be time_before(). That
>>>> should be functionally equivalent but I think it's more readable.
>>>>
>>>> I wish this were just a theoretical problem, but we've had a customer
>>>> hit a variant of it in an older kernel. Newer upstream kernels have a
>>>> number of changes that make this problem less likely. As best I can tell
>>>> though, there is nothing that really prevents it.
>>>>
>>>> Signed-off-by: Jeff Layton <[email protected]>
>>>> ---
>>>> fs/fs-writeback.c | 22 +++++++++++++++++-----
>>>> 1 files changed, 17 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>>>> index e3fe991..bd2a7ff 100644
>>>> --- a/fs/fs-writeback.c
>>>> +++ b/fs/fs-writeback.c
>>>> @@ -184,19 +184,31 @@ static int write_inode(struct inode *inode, int sync)
>>>> * furthest end of its superblock's dirty-inode list.
>>>> *
>>>> * Before stamping the inode's ->dirtied_when, we check to see whether it is
>>>> - * already the most-recently-dirtied inode on the s_dirty list. If that is
>>>> - * the case then the inode must have been redirtied while it was being written
>>>> - * out and we don't reset its dirtied_when.
>>>> + * "newer" or equal to that of the most-recently-dirtied inode on the s_dirty
>>>> + * list. If that is the case then we don't need to restamp it to maintain the
>>>> + * order of the list.
>>>> + *
>>>> + * If s_dirty is empty however, then we need to go ahead and update
>>>> + * dirtied_when for the inode. Not doing so will mean that inodes that are
>>>> + * constantly being redirtied can end up with "stuck" dirtied_when values if
>>>> + * they happen to consistently be the first one to go back on the list.
>>>> + *
>>>> + * Since we're using jiffies values in that field, letting dirtied_when grow
>>>> + * too old will be problematic if jiffies wraps. It may also be causing
>>>> + * pdflush to flush the inode too often since it'll always look like it was
>>>> + * dirtied a long time ago.
>>>> */
>>>> static void redirty_tail(struct inode *inode)
>>>> {
>>>> struct super_block *sb = inode->i_sb;
>>>>
>>>> - if (!list_empty(&sb->s_dirty)) {
>>>> + if (list_empty(&sb->s_dirty)) {
>>>> + inode->dirtied_when = jiffies;
>>>> + } else {
>>>> struct inode *tail_inode;
>>>>
>>>> tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
>>>> - if (!time_after_eq(inode->dirtied_when,
>>>> + if (time_before(inode->dirtied_when,
>>>> tail_inode->dirtied_when))
>>>> inode->dirtied_when = jiffies;
>>>> }
>>> I'm afraid your patch is equivalent to the following one.
>>> Because once the first inode's dirtied_when is set to jiffies,
>>> in order to keep the list in order, the following ones (mostly)
>>> will also be updated. A domino effect.
>>>
>>> Thanks,
>>> Fengguang
>>>
>> Good point. One of our other engineers proposed a similar patch
>> originally. I considered it but wasn't clear whether there could be a
>> situation where unconditionally resetting dirtied_when would be a
>> problem. Now that I think about it though, I think you're right...
>>
>> So maybe something like the patch below is the right thing to do? Or,
>> maybe when we believe that the inode was fully cleaned and then
>> redirtied, we'd just unconditionally stamp dirtied_when. Something like
>> this maybe?
>>
>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>> index bd2a7ff..596c96e 100644
>> --- a/fs/fs-writeback.c
>> +++ b/fs/fs-writeback.c
>> @@ -364,7 +364,8 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc)
>> * Someone redirtied the inode while were writing back
>> * the pages.
>> */
>> - redirty_tail(inode);
>> + inode->dirtied_when = jiffies;
>> + list_move(&inode->i_list, &sb->s_dirty);
>> } else if (atomic_read(&inode->i_count)) {
>> /*
>> * The inode is clean, inuse
>
> Hmm...though it is still possible that you could consistently race in
> such a way that after writepages(), I_DIRTY is never set but the
> PAGECACHE_TAG_DIRTY is still set on the mapping. And then we'd be back
> to the same problem of a stuck dirtied_when value.
>
> So maybe someone can explain to me why we take such great pains to
> preserve the dirtied_when value when we're putting the inode back on
> the tail of s_dirty? Why not just unconditionally reset it?

I think that redirty_tail() is the best place for this as it is a
central location where dirtied_when can be updated. Then all we have to
worry about is making sure it is called from all the locations needed.

I'm not sure that removing the comment (as the Wu Fengguang patch does) is a
good idea, but it probably needs to be revised to explain why redirty_tail()
now forces a rewrite of the entry's dirtied_when.

Ian

2009-03-25 01:29:42

by Fengguang Wu

Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

Hi Jeff,

On Tue, Mar 24, 2009 at 10:46:57PM +0800, Jeff Layton wrote:
> On Tue, 24 Mar 2009 10:28:06 -0400
> Jeff Layton <[email protected]> wrote:
>
> > On Tue, 24 Mar 2009 21:57:20 +0800
> > Wu Fengguang <[email protected]> wrote:
> >
> > > Hi Jeff,
> > >
> > > On Mon, Mar 23, 2009 at 04:30:33PM -0400, Jeff Layton wrote:
> > > > This may be a problem on other filesystems too, but the reproducer I
> > > > have involves NFS.
> > > >
> > > > On NFS, the __mark_inode_dirty() call after writing back the inode is
> > > > done in the rpc_release handler for COMMIT calls. This call is done
> > > > asynchronously after the call completes.
> > > >
> > > > Because there's no real coordination between __mark_inode_dirty() and
> > > > __sync_single_inode(), it's often the case that these two calls will
> > > > race and __mark_inode_dirty() will get called while I_SYNC is still set.
> > > > When this happens, __sync_single_inode() should detect that the inode
> > > > was redirtied while we were flushing it and call redirty_tail() to put
> > > > it back on the s_dirty list.
> > > >
> > > > When redirty_tail() puts it back on the list, it only resets the
> > > > dirtied_when value if it's necessary to maintain the list order. Given
> > > > the right situation (the right I/O patterns and a lot of luck), this
> > > > could result in dirtied_when never getting updated on an inode that's
> > > > constantly being redirtied while pdflush is writing it back.
> > > >
> > > > Since dirtied_when is based on jiffies, it's possible for it to persist
> > > > across 2 sign-bit flips of jiffies. When that happens, the time_after()
> > > > check in sync_sb_inodes no longer works correctly and writeouts by
> > > > pdflush of this inode and any inodes after it on the list stop.
> > > >
> > > > This patch fixes this by resetting the dirtied_when value on an inode
> > > > when we're adding it back onto an empty s_dirty list. Since we generally
> > > > write inodes from oldest to newest dirtied_when values, this has the
> > > > effect of making it so that these inodes don't end up with dirtied_when
> > > > values that are frozen.
> > > >
> > > > I've also taken the liberty of fixing up the comments a bit and changed
> > > > the !time_after_eq() check in redirty_tail to be time_before(). That
> > > > should be functionally equivalent but I think it's more readable.
> > > >
> > > > I wish this were just a theoretical problem, but we've had a customer
> > > > hit a variant of it in an older kernel. Newer upstream kernels have a
> > > > number of changes that make this problem less likely. As best I can tell
> > > > though, there is nothing that really prevents it.
> > > >
> > > > Signed-off-by: Jeff Layton <[email protected]>
> > > > ---
> > > > fs/fs-writeback.c | 22 +++++++++++++++++-----
> > > > 1 files changed, 17 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > index e3fe991..bd2a7ff 100644
> > > > --- a/fs/fs-writeback.c
> > > > +++ b/fs/fs-writeback.c
> > > > @@ -184,19 +184,31 @@ static int write_inode(struct inode *inode, int sync)
> > > > * furthest end of its superblock's dirty-inode list.
> > > > *
> > > > * Before stamping the inode's ->dirtied_when, we check to see whether it is
> > > > - * already the most-recently-dirtied inode on the s_dirty list. If that is
> > > > - * the case then the inode must have been redirtied while it was being written
> > > > - * out and we don't reset its dirtied_when.
> > > > + * "newer" or equal to that of the most-recently-dirtied inode on the s_dirty
> > > > + * list. If that is the case then we don't need to restamp it to maintain the
> > > > + * order of the list.
> > > > + *
> > > > + * If s_dirty is empty however, then we need to go ahead and update
> > > > + * dirtied_when for the inode. Not doing so will mean that inodes that are
> > > > + * constantly being redirtied can end up with "stuck" dirtied_when values if
> > > > + * they happen to consistently be the first one to go back on the list.
> > > > + *
> > > > + * Since we're using jiffies values in that field, letting dirtied_when grow
> > > > + * too old will be problematic if jiffies wraps. It may also be causing
> > > > + * pdflush to flush the inode too often since it'll always look like it was
> > > > + * dirtied a long time ago.
> > > > */
> > > > static void redirty_tail(struct inode *inode)
> > > > {
> > > > struct super_block *sb = inode->i_sb;
> > > >
> > > > - if (!list_empty(&sb->s_dirty)) {
> > > > + if (list_empty(&sb->s_dirty)) {
> > > > + inode->dirtied_when = jiffies;
> > > > + } else {
> > > > struct inode *tail_inode;
> > > >
> > > > tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
> > > > - if (!time_after_eq(inode->dirtied_when,
> > > > + if (time_before(inode->dirtied_when,
> > > > tail_inode->dirtied_when))
> > > > inode->dirtied_when = jiffies;
> > > > }
> > >
> > > I'm afraid your patch is equivalent to the following one.
> > > Because once the first inode's dirtied_when is set to jiffies,
> > > in order to keep the list in order, the following ones (mostly)
> > > will also be updated. A domino effect.
> > >
> > > Thanks,
> > > Fengguang
> > >
> >
> > Good point. One of our other engineers proposed a similar patch
> > originally. I considered it but wasn't clear whether there could be a
> > situation where unconditionally resetting dirtied_when would be a
> > problem. Now that I think about it though, I think you're right...
> >
> > So maybe something like the patch below is the right thing to do? Or,
> > maybe when we believe that the inode was fully cleaned and then
> > redirtied, we'd just unconditionally stamp dirtied_when. Something like
> > this maybe?
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index bd2a7ff..596c96e 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -364,7 +364,8 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc)
> > * Someone redirtied the inode while were writing back
> > * the pages.
> > */
> > - redirty_tail(inode);
> > + inode->dirtied_when = jiffies;
> > + list_move(&inode->i_list, &sb->s_dirty);
> > } else if (atomic_read(&inode->i_count)) {
> > /*
> > * The inode is clean, inuse
>
> Hmm...though it is still possible that you could consistently race in
> such a way that after writepages(), I_DIRTY is never set but the
> PAGECACHE_TAG_DIRTY is still set on the mapping. And then we'd be back
> to the same problem of a stuck dirtied_when value.

Jeff, did you spot real impacts of a stuck dirtied_when?
Or is it simply possible in theory?

IMHO it requires extremely strong conditions to happen: it takes
months for jiffies to wrap around the stuck value, and during that period
only one _single_ newly dirtied inode is needed to refresh the stuck
dirtied_when.
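
(For a rough sense of scale, assuming a 32-bit unsigned long for jiffies:
the time_after() check misbehaves once the stamp lags by 2^31 ticks, which
is roughly 248 days at HZ=100, 99 days at HZ=250, or 25 days at HZ=1000;
the exact figure depends on CONFIG_HZ.)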

However...

> So maybe someone can explain to me why we take such great pains to
> preserve the dirtied_when value when we're putting the inode back on
> the tail of s_dirty? Why not just unconditionally reset it?

...I see no obvious reasons against unconditionally resetting dirtied_when.

(a) Delaying an inode's writeback for 30s may be too long - its blocking
condition may well go away within 1s. (b) It would also be very undesirable
if one big file were repeatedly redirtied and its writeback thereby
delayed considerably.

However, redirty_tail() currently only tries to speed up writeback-after-redirty
in a _best effort_ way. It at best partially hides the above issues,
if there are any. In particular, if (b) is possible, the bug should
already show up at least in some situations.

For XFS, immediate sync of a redirtied inode is actually discouraged:

http://lkml.org/lkml/2008/1/16/491


Thanks,
Fengguang

2009-03-25 02:16:56

by Jeff Layton

Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Wed, 25 Mar 2009 09:28:29 +0800
Wu Fengguang <[email protected]> wrote:

> Hi Jeff,
>
> On Tue, Mar 24, 2009 at 10:46:57PM +0800, Jeff Layton wrote:
> > On Tue, 24 Mar 2009 10:28:06 -0400
> > Jeff Layton <[email protected]> wrote:
> >
> > > On Tue, 24 Mar 2009 21:57:20 +0800
> > > Wu Fengguang <[email protected]> wrote:
> > >
> > > > Hi Jeff,
> > > >
> > > > On Mon, Mar 23, 2009 at 04:30:33PM -0400, Jeff Layton wrote:
> > > > > This may be a problem on other filesystems too, but the reproducer I
> > > > > have involves NFS.
> > > > >
> > > > > On NFS, the __mark_inode_dirty() call after writing back the inode is
> > > > > done in the rpc_release handler for COMMIT calls. This call is done
> > > > > asynchronously after the call completes.
> > > > >
> > > > > Because there's no real coordination between __mark_inode_dirty() and
> > > > > __sync_single_inode(), it's often the case that these two calls will
> > > > > race and __mark_inode_dirty() will get called while I_SYNC is still set.
> > > > > When this happens, __sync_single_inode() should detect that the inode
> > > > > was redirtied while we were flushing it and call redirty_tail() to put
> > > > > it back on the s_dirty list.
> > > > >
> > > > > When redirty_tail() puts it back on the list, it only resets the
> > > > > dirtied_when value if it's necessary to maintain the list order. Given
> > > > > the right situation (the right I/O patterns and a lot of luck), this
> > > > > could result in dirtied_when never getting updated on an inode that's
> > > > > constantly being redirtied while pdflush is writing it back.
> > > > >
> > > > > Since dirtied_when is based on jiffies, it's possible for it to persist
> > > > > across 2 sign-bit flips of jiffies. When that happens, the time_after()
> > > > > check in sync_sb_inodes no longer works correctly and writeouts by
> > > > > pdflush of this inode and any inodes after it on the list stop.
> > > > >
> > > > > This patch fixes this by resetting the dirtied_when value on an inode
> > > > > when we're adding it back onto an empty s_dirty list. Since we generally
> > > > > write inodes from oldest to newest dirtied_when values, this has the
> > > > > effect of making it so that these inodes don't end up with dirtied_when
> > > > > values that are frozen.
> > > > >
> > > > > I've also taken the liberty of fixing up the comments a bit and changed
> > > > > the !time_after_eq() check in redirty_tail to be time_before(). That
> > > > > should be functionally equivalent but I think it's more readable.
> > > > >
> > > > > I wish this were just a theoretical problem, but we've had a customer
> > > > > hit a variant of it in an older kernel. Newer upstream kernels have a
> > > > > number of changes that make this problem less likely. As best I can tell
> > > > > though, there is nothing that really prevents it.
> > > > >
> > > > > Signed-off-by: Jeff Layton <[email protected]>
> > > > > ---
> > > > > fs/fs-writeback.c | 22 +++++++++++++++++-----
> > > > > 1 files changed, 17 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > > index e3fe991..bd2a7ff 100644
> > > > > --- a/fs/fs-writeback.c
> > > > > +++ b/fs/fs-writeback.c
> > > > > @@ -184,19 +184,31 @@ static int write_inode(struct inode *inode, int sync)
> > > > > * furthest end of its superblock's dirty-inode list.
> > > > > *
> > > > > * Before stamping the inode's ->dirtied_when, we check to see whether it is
> > > > > - * already the most-recently-dirtied inode on the s_dirty list. If that is
> > > > > - * the case then the inode must have been redirtied while it was being written
> > > > > - * out and we don't reset its dirtied_when.
> > > > > + * "newer" or equal to that of the most-recently-dirtied inode on the s_dirty
> > > > > + * list. If that is the case then we don't need to restamp it to maintain the
> > > > > + * order of the list.
> > > > > + *
> > > > > + * If s_dirty is empty however, then we need to go ahead and update
> > > > > + * dirtied_when for the inode. Not doing so will mean that inodes that are
> > > > > + * constantly being redirtied can end up with "stuck" dirtied_when values if
> > > > > + * they happen to consistently be the first one to go back on the list.
> > > > > + *
> > > > > + * Since we're using jiffies values in that field, letting dirtied_when grow
> > > > > + * too old will be problematic if jiffies wraps. It may also be causing
> > > > > + * pdflush to flush the inode too often since it'll always look like it was
> > > > > + * dirtied a long time ago.
> > > > > */
> > > > > static void redirty_tail(struct inode *inode)
> > > > > {
> > > > > struct super_block *sb = inode->i_sb;
> > > > >
> > > > > - if (!list_empty(&sb->s_dirty)) {
> > > > > + if (list_empty(&sb->s_dirty)) {
> > > > > + inode->dirtied_when = jiffies;
> > > > > + } else {
> > > > > struct inode *tail_inode;
> > > > >
> > > > > tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
> > > > > - if (!time_after_eq(inode->dirtied_when,
> > > > > + if (time_before(inode->dirtied_when,
> > > > > tail_inode->dirtied_when))
> > > > > inode->dirtied_when = jiffies;
> > > > > }
> > > >
> > > > I'm afraid your patch is equivalent to the following one.
> > > > Because once the first inode's dirtied_when is set to jiffies,
> > > > in order to keep the list in order, the following ones (mostly)
> > > > will also be updated. A domino effect.
> > > >
> > > > Thanks,
> > > > Fengguang
> > > >
> > >
> > > Good point. One of our other engineers proposed a similar patch
> > > originally. I considered it but wasn't clear whether there could be a
> > > situation where unconditionally resetting dirtied_when would be a
> > > problem. Now that I think about it though, I think you're right...
> > >
> > > So maybe something like the patch below is the right thing to do? Or,
> > > maybe when we believe that the inode was fully cleaned and then
> > > redirtied, we'd just unconditionally stamp dirtied_when. Something like
> > > this maybe?
> > >
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index bd2a7ff..596c96e 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -364,7 +364,8 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc)
> > > * Someone redirtied the inode while were writing back
> > > * the pages.
> > > */
> > > - redirty_tail(inode);
> > > + inode->dirtied_when = jiffies;
> > > + list_move(&inode->i_list, &sb->s_dirty);
> > > } else if (atomic_read(&inode->i_count)) {
> > > /*
> > > * The inode is clean, inuse
> >
> > Hmm...though it is still possible that you could consistently race in
> > such a way that after writepages(), I_DIRTY is never set but the
> > PAGECACHE_TAG_DIRTY is still set on the mapping. And then we'd be back
> > to the same problem of a stuck dirtied_when value.
>
> Jeff, did you spot real impacts of stuck dirtied_when?
> Or it's simply possible in theory?
>
> IMHO it requires extremely strong conditions to happen: It takes
> months to wrap around the value, during that period it takes only
> one _single_ newly dirtied inode to refresh the stuck dirtied_when.
>

Yes, we did see this with inodes on NFS...

We saw it in an older kernel on several machines from one customer
(RHEL4 2.6.9-based 32-bit kernel). Our support engineering group got a
vmcore from one of the boxes and it had a dirtied_when value on an
s_dirty inode that appeared to be in the future. The uptime on the box
indicated that jiffies had wrapped once.

I'm also pretty sure I could reproduce this on a 2.6.18-based kernel
given enough time (based on some debug patches + a reproducer program
I have). I ran the program overnight and dirtied_when never changed.

With these earlier kernels, the __mark_inode_dirty call after writeback
is done in a function that's called from nfs_writepages(). I_LOCK is
set there (these kernels predate the introduction of I_SYNC), so
I_DIRTY gets set but that codepath can never update dirtied_when.

Current mainline kernels aren't as susceptible to this problem on NFS.
The __mark_inode_dirty call there is done asynchronously as a side
effect of some other changes that went in to fix deadlocking problems.
So there, dirtied_when can get updated after writeback, but only if
the rpc_release callback wins the race with __sync_single_inode.

Given the right situation though (or maybe the right filesystem), it's
not too hard to imagine this problem occurring even in current mainline
code with an inode that's frequently being redirtied.
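
To make the failure mode concrete, here's a minimal standalone sketch (plain
userspace C, not kernel code). It mimics a 32-bit kernel's unsigned long
jiffies with unsigned int, and the timestamps are invented for illustration;
only the comparison style is taken from the time_after() definition in
include/linux/jiffies.h:

/*
 * How a "stuck" dirtied_when defeats the time_after() check in
 * sync_sb_inodes() once the inode's age crosses 2^31 ticks.
 */
#include <stdio.h>

#define time_after(a, b)        ((int)((b) - (a)) < 0)

int main(void)
{
        unsigned int dirtied_when = 1000u;  /* stamped long ago, never refreshed */
        unsigned int start_young = dirtied_when + 60u * 1000u;        /* ~1 min later at HZ=1000 */
        unsigned int start_stale = dirtied_when + (1u << 31) + 1000u; /* > 2^31 ticks later */

        /* Early on the check behaves: the inode looks older than "start",
         * so the writeback loop keeps processing it. */
        printf("young: time_after(dirtied_when, start) = %d\n",
               time_after(dirtied_when, start_young));  /* prints 0 */

        /* After the sign-bit flip the same stale stamp looks like it's in
         * the future, the loop breaks out, and writeback of this inode and
         * everything after it on the list stops. */
        printf("stale: time_after(dirtied_when, start) = %d\n",
               time_after(dirtied_when, start_stale));  /* prints 1 */

        return 0;
}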

> However...
>
> > So maybe someone can explain to me why we take such great pains to
> > preserve the dirtied_when value when we're putting the inode back on
> > the tail of s_dirty? Why not just unconditionally reset it?
>
> ...I see no obvious reasons against unconditionally resetting dirtied_when.
>
> (a) Delaying an inode's writeback for 30s maybe too long - its blocking
> condition may well go away within 1s. (b) And it would be very undesirable
> if one big file is repeatedly redirtied hence its writeback being
> delayed considerably.
>
> However, redirty_tail() currently only tries to speedup writeback-after-redirty
> in a _best effort_ way. It at best partially hides the above issues,
> if there are any. In particular, if (b) is possible, the bug should
> already show up at least in some situations.
>
> For XFS, immediately sync of redirtied inode is actually discouraged:
>
> http://lkml.org/lkml/2008/1/16/491
>
>

Ok, those are good points that I need to think about.

Thanks for the help so far. I'd welcome any suggestions you have on
how best to fix this.

--
Jeff Layton <[email protected]>

2009-03-25 02:26:45

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

Hi Ian,

On Tue, Mar 24, 2009 at 11:04:11PM +0800, Ian Kent wrote:
> Jeff Layton wrote:
> > [...]
> > Hmm...though it is still possible that you could consistently race in
> > such a way that after writepages(), I_DIRTY is never set but the
> > PAGECACHE_TAG_DIRTY is still set on the mapping. And then we'd be back
> > to the same problem of a stuck dirtied_when value.
> >
> > So maybe someone can explain to me why we take such great pains to
> > preserve the dirtied_when value when we're putting the inode back on
> > the tail of s_dirty? Why not just unconditionally reset it?
>
> I think that redirty_tail() is the best place for this as it is a
> central location where dirtied_when can be updated. Then all we have to
> worry about is making sure it is called from all the locations needed.
>
> I'm not sure that removing the comment is a good idea (the Wu Fengguang
> patch) but it probably needs to be revised to explain why dirtied_when
> is forcing a rewrite of the list entry times.

The comment basically says "we do not want to reset dirtied_when for
inodes redirtied while being written out". I guess there are two intentions:

- to retry its writeback as soon as possible and avoid long 30s delays;
- to keep a faithful dirtied_when.

The first one is best effort anyway, so changing it should not create new
bugs. Nor should the second, given the very limited use of
dirtied_when.

However, we do have another cheap solution that can retain both of the
original intentions. The main idea is to introduce a new s_more_io_wait
queue, and convert the current redirty_tail() calls to either
requeue_io_wait() or some completely_dirty_inode().

Thanks,
Fengguang
---
(a not-up-to-date patch)
writeback: introduce super_block.s_more_io_wait

Introduce super_block.s_more_io_wait to park inodes that for some reason cannot
be synced immediately. They will be revisited in the next s_io enqueue time(<=5s).

The new data flow after this patchset:

s_dirty --> s_io --> s_more_io/s_more_io_wait --+
   ^                                            |
   |                                            |
   +--------------------------------------------+

- to fill s_io:
        s_more_io +
        s_dirty(expired) +
        s_more_io_wait
        ---> s_io
- to drain s_io:
        s_io -+--> clean inodes goto inode_in_use/inode_unused
              |
              +--> s_more_io
              |
              +--> s_more_io_wait

Obviously there're no ordering or starvation problems in the queues:
- s_dirty is now a strict FIFO queue
- inode.dirtied_when is only set when made dirty
- once expired, the dirty inode will stay in s_*io* queues until made clean
- the dirty inodes in s_*io* will be revisited in order, hence small files won't
be starved by big dirty files.

Cc: David Chinner <[email protected]>
Cc: Michael Rubin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Fengguang Wu <[email protected]>
---
fs/fs-writeback.c | 19 +++++++++++++++----
fs/super.c | 1 +
include/linux/fs.h | 1 +
3 files changed, 17 insertions(+), 4 deletions(-)

--- linux-mm.orig/fs/fs-writeback.c
+++ linux-mm/fs/fs-writeback.c
@@ -172,6 +172,14 @@ static void requeue_io(struct inode *ino
list_move(&inode->i_list, &inode->i_sb->s_more_io);
}

+/*
+ * The inode should be retried after _sleeping_ for a while.
+ */
+static void requeue_io_wait(struct inode *inode)
+{
+ list_move(&inode->i_list, &inode->i_sb->s_more_io_wait);
+}
+
static void inode_sync_complete(struct inode *inode)
{
/*
@@ -200,7 +208,8 @@ static void move_expired_inodes(struct l

/*
* Queue all expired dirty inodes for io, eldest first:
- * (entrance) => s_dirty inodes
+ * (entrance) => s_more_io_wait inodes
+ * => s_dirty inodes
* => s_more_io inodes
* => remaining inodes in s_io => (dequeue for sync)
*/
@@ -209,13 +218,15 @@ static void queue_io(struct super_block
{
list_splice_init(&sb->s_more_io, &sb->s_io);
move_expired_inodes(&sb->s_dirty, &sb->s_io, older_than_this);
+ list_splice_init(&sb->s_more_io_wait, &sb->s_io);
}

int sb_has_dirty_inodes(struct super_block *sb)
{
- return !list_empty(&sb->s_dirty) ||
- !list_empty(&sb->s_io) ||
- !list_empty(&sb->s_more_io);
+ return !list_empty(&sb->s_dirty) ||
+ !list_empty(&sb->s_io) ||
+ !list_empty(&sb->s_more_io) ||
+ !list_empty(&sb->s_more_io_wait);
}
EXPORT_SYMBOL(sb_has_dirty_inodes);

--- linux-mm.orig/fs/super.c
+++ linux-mm/fs/super.c
@@ -64,6 +64,7 @@ static struct super_block *alloc_super(s
INIT_LIST_HEAD(&s->s_dirty);
INIT_LIST_HEAD(&s->s_io);
INIT_LIST_HEAD(&s->s_more_io);
+ INIT_LIST_HEAD(&s->s_more_io_wait);
INIT_LIST_HEAD(&s->s_files);
INIT_LIST_HEAD(&s->s_instances);
INIT_HLIST_HEAD(&s->s_anon);
--- linux-mm.orig/include/linux/fs.h
+++ linux-mm/include/linux/fs.h
@@ -1012,6 +1012,7 @@ struct super_block {
struct list_head s_dirty; /* dirty inodes */
struct list_head s_io; /* parked for writeback */
struct list_head s_more_io; /* parked for more writeback */
+ struct list_head s_more_io_wait; /* parked for sleep-then-retry */
struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */
struct list_head s_files;

2009-03-25 02:53:54

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Wed, Mar 25, 2009 at 10:15:28AM +0800, Jeff Layton wrote:
> On Wed, 25 Mar 2009 09:28:29 +0800
> Wu Fengguang <[email protected]> wrote:
>
> > [...]
> > Jeff, did you spot real impacts of stuck dirtied_when?
> > Or it's simply possible in theory?
> >
> > IMHO it requires extremely strong conditions to happen: It takes
> > months to wrap around the value, during that period it takes only
> > one _single_ newly dirtied inode to refresh the stuck dirtied_when.
> >
>
> Yes, we did see this with inodes on NFS...
>
> We saw it in an older kernel on several machines from one customer
> (RHEL4 2.6.9-based 32-bit kernel). Our support engineering group got a
> vmcore from one of the boxes and it had a dirtied_when value on an
> s_dirty inode that appeared to be in the future. The uptime on the box
> indicated that jiffies had wrapped once.
>
> I'm also pretty sure I could reproduce this on a 2.6.18-based kernel
> given enough time (based on some debug patches + a reproducer program
> I have). I ran the program overnight and dirtied_when never changed.
>
> With these earlier kernels, the __mark_inode_dirty call after writeback
> is done in a function that's called from nfs_writepages(). I_LOCK is
> set there (these kernels predate the introduction of I_SYNC), so
> I_DIRTY gets set but that codepath can never update dirtied_when.
>
> Current mainline kernels aren't as susceptible to this problem on NFS.
> The __mark_inode_dirty call there is done asynchronously as a side
> effect of some other changes that went in to fix deadlocking problems.
> So there, dirtied_when can get updated after writeback, but only if
> the rpc_release callback wins the race with __sync_single_inode.

There have been lots of writeback-queue updates after 2.6.18...
So my assumptions are surely not valid there.

> Given the right situation though (or maybe the right filesystem), it's
> not too hard to imagine this problem occurring even in current mainline
> code with an inode that's frequently being redirtied.

My reasoning with recent kernels is: for kupdate, s_dirty enqueues only
happen in __mark_inode_dirty() and redirty_tail(). Newly dirtied
inodes will be parked in s_dirty for 30s, during which time any
actively being-redirtied inode whose dirtied_when is an old stuck
value will be retried for writeback, re-inserted into a
non-empty s_dirty queue, and have its dirtied_when refreshed.
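
As a minimal sketch of that refresh (again standalone C rather than kernel
code, with 32-bit jiffies mimicked by unsigned int and invented timestamps),
the unpatched redirty_tail() comparison restamps a stale inode as soon as the
s_dirty tail holds any newer one:

/*
 * Toy model of the restamp-on-requeue behaviour: redirty_tail() keeps the
 * old dirtied_when only when that preserves list order, so a fresher inode
 * already parked on s_dirty forces the requeued inode to be restamped.
 */
#include <stdio.h>

#define time_after_eq(a, b)     ((int)((a) - (b)) >= 0)

/* redirty_tail() reduced to timestamps: the list becomes "the tail stamp". */
static unsigned int requeue(unsigned int dirtied_when, unsigned int now,
                            int s_dirty_empty, unsigned int tail_stamp)
{
        if (!s_dirty_empty && !time_after_eq(dirtied_when, tail_stamp))
                return now;             /* restamped to keep s_dirty ordered */
        return dirtied_when;            /* stamp kept as-is */
}

int main(void)
{
        unsigned int now = 90000000u;           /* "jiffies" */
        unsigned int stale = 1000u;             /* frequently redirtied inode, old stamp */
        unsigned int fresh_tail = now - 5000u;  /* some newly dirtied inode on s_dirty */

        /* One newly dirtied inode on s_dirty is enough to refresh the stamp. */
        printf("non-empty s_dirty: %u\n", requeue(stale, now, 0, fresh_tail));

        /* Only if s_dirty is empty on every requeue does the stamp stay stuck. */
        printf("empty s_dirty:     %u\n", requeue(stale, now, 1, 0));

        return 0;
}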

> > However...
> >
> > > So maybe someone can explain to me why we take such great pains to
> > > preserve the dirtied_when value when we're putting the inode back on
> > > the tail of s_dirty? Why not just unconditionally reset it?
> >
> > ...I see no obvious reasons against unconditionally resetting dirtied_when.
> >
> > (a) Delaying an inode's writeback for 30s maybe too long - its blocking
> > condition may well go away within 1s. (b) And it would be very undesirable
> > if one big file is repeatedly redirtied hence its writeback being
> > delayed considerably.
> >
> > However, redirty_tail() currently only tries to speedup writeback-after-redirty
> > in a _best effort_ way. It at best partially hides the above issues,
> > if there are any. In particular, if (b) is possible, the bug should
> > already show up at least in some situations.
> >
> > For XFS, immediately sync of redirtied inode is actually discouraged:
> >
> > http://lkml.org/lkml/2008/1/16/491
> >
> >
>
> Ok, those are good points that I need to think about.
>
> Thanks for the help so far. I'd welcome any suggestions you have on
> how best to fix this.

For NFS, is it desirable to retry a redirtied inode after 30s, or
after a shorter 5s, or after 0.1~5s? Or the exact timing simply
doesn't matter?

Thanks,
Fengguang

2009-03-25 02:56:56

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

Wu Fengguang wrote:
> Hi Jeff,
>
> On Tue, Mar 24, 2009 at 10:46:57PM +0800, Jeff Layton wrote:
>> [...]
>> Hmm...though it is still possible that you could consistently race in
>> such a way that after writepages(), I_DIRTY is never set but the
>> PAGECACHE_TAG_DIRTY is still set on the mapping. And then we'd be back
>> to the same problem of a stuck dirtied_when value.
>
> Jeff, did you spot real impacts of stuck dirtied_when?
> Or it's simply possible in theory?

We've seen it with Web Server logging to an NFS mounted filesystem, and
that is both continuous and frequent.

>
> IMHO it requires extremely strong conditions to happen: It takes
> months to wrap around the value, during that period it takes only
> one _single_ newly dirtied inode to refresh the stuck dirtied_when.

Sure, but with the above workload we see inodes continuously dirty and,
as they age, find their way to the tail of the queue. They continue to
age and when the time difference between dirtied_when and jiffies (or
start in generic_sync_sb_inodes()) becomes greater than 2^31 the logic
of the time_* macros inverts and dirtied_when appears to be in the
future. Then, in generic_sync_sb_inodes() the check:

/* Was this inode dirtied after sync_sb_inodes was called? */
if (time_after(inode->dirtied_when, start))
break;

always breaks out without doing anything and writeback for the filesystem
stops.

Also, from the investigation, we see that it takes a while before the
inode dirtied_when gets stuck so the problem isn't seen until around 50
days or more of uptime.

The other way to work around this without changing dirtied_when is to
use a range for the above check in generic_sync_sb_inodes(). Like,

if (dirtied_when is between start and "right now")
break;

But the problem with this is that there are other places these macros
could yield incorrect and possibly undesirable results, such as in
queue_io() (via move_expired_inodes()). Which is what led us to use the
more aggressive dirtied_when stamping.
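
To sketch what the range check, taken on its own, buys (standalone C rather
than a patch, with 32-bit jiffies mimicked by unsigned int; the macros are
adapted from the include/linux/jiffies.h definitions and the values are
invented): a stuck dirtied_when that has drifted past the 2^31 boundary
passes the current time_after() test but is not "between start and right
now", so the loop would keep writing instead of breaking out:

#include <stdio.h>

#define time_after(a, b)        ((int)((b) - (a)) < 0)
#define time_after_eq(a, b)     ((int)((a) - (b)) >= 0)
#define time_before_eq(a, b)    time_after_eq(b, a)
#define time_in_range(a, b, c)  (time_after_eq(a, b) && time_before_eq(a, c))

int main(void)
{
        unsigned int stuck = 1000u;                     /* frozen dirtied_when  */
        unsigned int start = stuck + (1u << 31) + 10u;  /* writeback pass start */
        unsigned int now   = start + 100u;              /* "right now"          */

        /* Current check: the stuck inode appears dirtied after start, so
         * generic_sync_sb_inodes() breaks out and writeback stops. */
        printf("time_after(stuck, start)         = %d\n",
               time_after(stuck, start));               /* prints 1 */

        /* Range check: the stuck value is not between start and now, so this
         * test would not trigger the early break and the inode is written. */
        printf("time_in_range(stuck, start, now) = %d\n",
               time_in_range(stuck, start, now));       /* prints 0 */

        return 0;
}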

>
> However...
>
>> So maybe someone can explain to me why we take such great pains to
>> preserve the dirtied_when value when we're putting the inode back on
>> the tail of s_dirty? Why not just unconditionally reset it?
>
> ...I see no obvious reasons against unconditionally resetting dirtied_when.
>
> (a) Delaying an inode's writeback for 30s maybe too long - its blocking
> condition may well go away within 1s. (b) And it would be very undesirable
> if one big file is repeatedly redirtied hence its writeback being
> delayed considerably.
>
> However, redirty_tail() currently only tries to speedup writeback-after-redirty
> in a _best effort_ way. It at best partially hides the above issues,
> if there are any. In particular, if (b) is possible, the bug should
> already show up at least in some situations.
>
> For XFS, immediately sync of redirtied inode is actually discouraged:
>
> http://lkml.org/lkml/2008/1/16/491

Yes, that's an interesting (and unfortunate) case.

It looks like the potential to re-write an already written inode is also
present because in generic_sync_sb_inodes() the inode could be marked as
dirty either "before" or after the writeback. I can't see any way to
detect and handle this within the current code.

Ian

2009-03-25 04:05:47

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Wed, Mar 25, 2009 at 10:56:36AM +0800, Ian Kent wrote:
> Wu Fengguang wrote:
> > [...]
> > Jeff, did you spot real impacts of stuck dirtied_when?
> > Or it's simply possible in theory?
>
> We've seen it with Web Server logging to an NFS mounted filesystem, and
> that is both continuous and frequent.

What's your kernel version? In old kernels, the s_dirty queue is
spliced wholesale into s_io, which is walked until the first not-yet-expired
inode is hit, and the remaining s_io inodes are then spliced back to s_dirty.

The new behavior is to move only expired inodes into s_io. So the
redirtied inodes will now be inserted into a *non-empty* s_dirty if
there are any newly dirtied inodes, and have their stuck dirtied_when
refreshed. This makes a huge difference.

> >
> > IMHO it requires extremely strong conditions to happen: It takes
> > months to wrap around the value, during that period it takes only
> > one _single_ newly dirtied inode to refresh the stuck dirtied_when.
>
> Sure, but with the above workload we see inodes continuously dirty and,
> as they age, find their way to the tail of the queue. They continue to
> age and when the time difference between dirtied_when and jiffies (or
> start in generic_sync_sb_inodes()) becomes greater than 2^31 the logic
> of the time_* macros inverts and dirtied_when appears to be in the
> future. Then, in generic_sync_sb_inodes() the check:
>
> /* Was this inode dirtied after sync_sb_inodes was called? */
> if (time_after(inode->dirtied_when, start))
> break;
>
> always breaks out without doing anything and writeback for the filesystem
> stops.
>
> Also, from the investigation, we see that it takes a while before the
> inode dirtied_when gets stuck so the problem isn't seen until around 50
> days or more of uptime.
>
> The other way to work around this without changing dirtied_when is to
> use a range for the above check in generic_sync_sb_inodes(). Like,
>
> if (dirtied_when is between start and "right now")
> break;
>
> But the problem with this is that there are other places these macros
> could yield incorrect and possibly undesirable results, such as in
> queue_io() (via move_expired_inodes()). Which is what led us to use the
> more aggressive dirtied_when stamping.
>
> >
> > However...
> >
> >> So maybe someone can explain to me why we take such great pains to
> >> preserve the dirtied_when value when we're putting the inode back on
> >> the tail of s_dirty? Why not just unconditionally reset it?
> >
> > ...I see no obvious reasons against unconditionally resetting dirtied_when.
> >
> > (a) Delaying an inode's writeback for 30s maybe too long - its blocking
> > condition may well go away within 1s. (b) And it would be very undesirable
> > if one big file is repeatedly redirtied hence its writeback being
> > delayed considerably.
> >
> > However, redirty_tail() currently only tries to speedup writeback-after-redirty
> > in a _best effort_ way. It at best partially hides the above issues,
> > if there are any. In particular, if (b) is possible, the bug should
> > already show up at least in some situations.
> >
> > For XFS, immediately sync of redirtied inode is actually discouraged:
> >
> > http://lkml.org/lkml/2008/1/16/491
>
> Yes, that's an interesting (and unfortunate) case.
>
> It looks like the potential to re-write an already written inode is also
> present because in generic_sync_sb_inodes() the inode could be marked as
> dirty either "before" or after the writeback. I can't see any way to
> detect and handle this within the current code.

I'm not sure. Can you elaborate the problem (and why it's a problem)
with flags, states etc.?

Thanks,
Fengguang

2009-03-25 05:03:50

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

Wu Fengguang wrote:
> On Wed, Mar 25, 2009 at 10:56:36AM +0800, Ian Kent wrote:
>> Wu Fengguang wrote:
>>> [...]
>>>
>>> Jeff, did you spot real impacts of stuck dirtied_when?
>>> Or it's simply possible in theory?
>> We've seen it with Web Server logging to an NFS mounted filesystem, and
>> that is both continuous and frequent.
>
> What's your kernel version? In old kernels, the s_dirty queue is
> spliced wholesale into s_io, which is walked until the first not-yet-expired
> inode is hit, and the remaining s_io inodes are then spliced back to s_dirty.

As you know, the problem has been observed in an older kernel.
But I'm looking at the code in the current Linus tree and also the mmotm
tree with what has been observed in mind.

>
> The new behavior is to move only expired inodes into s_io. So the
> redirtied inodes will now be inserted into a *non-empty* s_dirty if
> there are any newly dirtied inodes, and have their stuck dirtied_when
> refreshed. This makes a huge difference.

Yes, Jeff's work indicates that to be the case but, based on what
we've seen in older kernels, I'm having trouble seeing that it isn't
still a problem in the current code.

In the current code, if an inode that is constantly marked as dirty gets
to the tail of the s_dirty list, how does its dirtied_when get updated?

>
>>> IMHO it requires extremely strong conditions to happen: It takes
>>> months to wrap around the value, during that period it takes only
>>> one _single_ newly dirtied inode to refresh the stuck dirtied_when.
>> Sure, but with the above workload we see inodes continuously dirty and,
>> as they age, find their way to the tail of the queue. They continue to
>> age and when the time difference between dirtied_when and jiffies (or
>> start in generic_sync_sb_inodes()) becomes greater than 2^31 the logic
>> of the time_* macros inverts and dirtied_when appears to be in the
>> future. Then, in generic_sync_sb_inodes() the check:
>>
>> /* Was this inode dirtied after sync_sb_inodes was called? */
>> if (time_after(inode->dirtied_when, start))
>> break;
>>
>> always breaks out without doing anything and writeback for the filesystem
>> stops.
>>
>> Also, from the investigation, we see that it takes a while before the
>> inode dirtied_when gets stuck so the problem isn't seen until around 50
>> days or more of uptime.
>>
>> The other way to work around this without changing dirtied_when is to
>> use a range for the above check in generic_sync_sb_inodes(). Like,
>>
>> if (dirtied_when is between start and "right now")
>> break;
>>
>> But the problem with this is that there are other places these macros
>> could yield incorrect and possibly undesirable results, such as in
>> queue_io() (via move_expired_inodes()). Which is what led us to use the
>> more aggressive dirtied_when stamping.
>>
>>> However...
>>>
>>>> So maybe someone can explain to me why we take such great pains to
>>>> preserve the dirtied_when value when we're putting the inode back on
>>>> the tail of s_dirty? Why not just unconditionally reset it?
>>> ...I see no obvious reasons against unconditionally resetting dirtied_when.
>>>
>>> (a) Delaying an inode's writeback for 30s maybe too long - its blocking
>>> condition may well go away within 1s. (b) And it would be very undesirable
>>> if one big file is repeatedly redirtied hence its writeback being
>>> delayed considerably.
>>>
>>> However, redirty_tail() currently only tries to speedup writeback-after-redirty
>>> in a _best effort_ way. It at best partially hides the above issues,
>>> if there are any. In particular, if (b) is possible, the bug should
>>> already show up at least in some situations.
>>>
>>> For XFS, immediately sync of redirtied inode is actually discouraged:
>>>
>>> http://lkml.org/lkml/2008/1/16/491
>> Yes, that's an interesting (and unfortunate) case.
>>
>> It looks like the potential to re-write an already written inode is also
>> present because in generic_sync_sb_inodes() the inode could be marked as
>> dirty either "before" or after the writeback. I can't see any way to
>> detect and handle this within the current code.
>
> I'm not sure. Can you elaborate the problem (and why it's a problem)
> with flags, states etc.?

It's only a problem if writing the same inode out twice is considered a
problem. Probably not a big deal.

It looks to me like __mark_inode_dirty() can be called at any time while
I_SYNC is set in __sync_single_inode() and the inode_lock is not held.
So an inode can be marked dirty at any time between the inode_lock
release and when it is re-acquired, which means the inode could be
written out either before or after the dirty marking occurs. If the
inode is written after the dirty marking, it will still be seen as
dirty even though it has already been written out.

But this is just my impression from reading the code and I could be
mistaken.

Ian

2009-03-25 11:52:59

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Wed, 25 Mar 2009 10:50:37 +0800
Wu Fengguang <[email protected]> wrote:

> > Given the right situation though (or maybe the right filesystem), it's
> > not too hard to imagine this problem occurring even in current mainline
> > code with an inode that's frequently being redirtied.
>
> My reasoning with recent kernel is: for kupdate, s_dirty enqueues only
> happen in __mark_inode_dirty() and redirty_tail(). Newly dirtied
> inodes will be parked in s_dirty for 30s. During which time the
> actively being-redirtied inodes, if their dirtied_when is an old stuck
> value, will be retried for writeback and then re-inserted into a
> non-empty s_dirty queue and have their dirtied_when refreshed.
>

Doesn't that assume that there are new inodes that are being dirtied?
If you only have the same inodes being redirtied and never any new
ones, the problem still occurs, right?

> > >
> > > ...I see no obvious reasons against unconditionally resetting dirtied_when.
> > >
> > > (a) Delaying an inode's writeback for 30s maybe too long - its blocking
> > > condition may well go away within 1s. (b) And it would be very undesirable
> > > if one big file is repeatedly redirtied hence its writeback being
> > > delayed considerably.
> > >
> > > However, redirty_tail() currently only tries to speedup writeback-after-redirty
> > > in a _best effort_ way. It at best partially hides the above issues,
> > > if there are any. In particular, if (b) is possible, the bug should
> > > already show up at least in some situations.
> > >
> > > For XFS, immediately sync of redirtied inode is actually discouraged:
> > >
> > > http://lkml.org/lkml/2008/1/16/491
> > >
> > >
> >
> > Ok, those are good points that I need to think about.
> >
> > Thanks for the help so far. I'd welcome any suggestions you have on
> > how best to fix this.
>
> For NFS, is it desirable to retry a redirtied inode after 30s, or
> after a shorter 5s, or after 0.1~5s? Or the exact timing simply
> doesn't matter?
>

I don't really consider NFS to be a special case here. It just happens
to be where we saw the problem originally. Some of its characteristics
might make it easier to hit this, but I'm not certain of that.

--
Jeff Layton <[email protected]>

2009-03-25 12:18:22

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Wed, Mar 25, 2009 at 07:51:10PM +0800, Jeff Layton wrote:
> On Wed, 25 Mar 2009 10:50:37 +0800
> Wu Fengguang <[email protected]> wrote:
>
> > > Given the right situation though (or maybe the right filesystem), it's
> > > not too hard to imagine this problem occurring even in current mainline
> > > code with an inode that's frequently being redirtied.
> >
> > My reasoning with recent kernel is: for kupdate, s_dirty enqueues only
> > happen in __mark_inode_dirty() and redirty_tail(). Newly dirtied
> > inodes will be parked in s_dirty for 30s. During which time the
> > actively being-redirtied inodes, if their dirtied_when is an old stuck
> > value, will be retried for writeback and then re-inserted into a
> > non-empty s_dirty queue and have their dirtied_when refreshed.
> >
>
> Doesn't that assume that there are new inodes that are being dirtied?
> If you only have the same inodes being redirtied and never any new
> ones, the problem still occurs, right?

Yes. But will a production server run months without making one single
new dirtied inode? (Just out of curiosity. Not that I'm not willing to
fix this possible issue.:)

> > > > ...I see no obvious reasons against unconditionally resetting dirtied_when.
> > > >
> > > > (a) Delaying an inode's writeback for 30s maybe too long - its blocking
> > > > condition may well go away within 1s. (b) And it would be very undesirable
> > > > if one big file is repeatedly redirtied hence its writeback being
> > > > delayed considerably.
> > > >
> > > > However, redirty_tail() currently only tries to speedup writeback-after-redirty
> > > > in a _best effort_ way. It at best partially hides the above issues,
> > > > if there are any. In particular, if (b) is possible, the bug should
> > > > already show up at least in some situations.
> > > >
> > > > For XFS, immediately sync of redirtied inode is actually discouraged:
> > > >
> > > > http://lkml.org/lkml/2008/1/16/491
> > > >
> > > >
> > >
> > > Ok, those are good points that I need to think about.
> > >
> > > Thanks for the help so far. I'd welcome any suggestions you have on
> > > how best to fix this.
> >
> > For NFS, is it desirable to retry a redirtied inode after 30s, or
> > after a shorter 5s, or after 0.1~5s? Or the exact timing simply
> > doesn't matter?
> >
>
> I don't really consider NFS to be a special case here. It just happens
> to be where we saw the problem originally. Some of its characteristics
> might make it easier to hit this, but I'm not certain of that.

Now there are two possible solutions:
- unconditionally update dirtied_when in redirty_tail();
- keep dirtied_when and redirty inodes to a new dedicated queue.
The first one involves less code, the second one allows more flexible timing.

NFS/XFS could be a good starting point for discussing the
requirements, so that we can reach a suitable solution.

Thanks,
Fengguang

2009-03-25 13:14:47

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Wed, 25 Mar 2009 20:17:43 +0800
Wu Fengguang <[email protected]> wrote:

> On Wed, Mar 25, 2009 at 07:51:10PM +0800, Jeff Layton wrote:
> > On Wed, 25 Mar 2009 10:50:37 +0800
> > Wu Fengguang <[email protected]> wrote:
> >
> > > > Given the right situation though (or maybe the right filesystem), it's
> > > > not too hard to imagine this problem occurring even in current mainline
> > > > code with an inode that's frequently being redirtied.
> > >
> > > My reasoning with recent kernel is: for kupdate, s_dirty enqueues only
> > > happen in __mark_inode_dirty() and redirty_tail(). Newly dirtied
> > > inodes will be parked in s_dirty for 30s. During which time the
> > > actively being-redirtied inodes, if their dirtied_when is an old stuck
> > > value, will be retried for writeback and then re-inserted into a
> > > non-empty s_dirty queue and have their dirtied_when refreshed.
> > >
> >
> > Doesn't that assume that there are new inodes that are being dirtied?
> > If you only have the same inodes being redirtied and never any new
> > ones, the problem still occurs, right?
>
> Yes. But will a production server run months without making one single
> new dirtied inode? (Just out of curiosity. Not that I'm not willing to
> fix this possible issue.:)
>

Yes. It's not that the box will run that long without creating a
single new dirtied inode, but rather that it won't necessarily create
one on all of its mounts. It's often the case that someone has a
mountpoint for a dedicated purpose.

Consider a host that has a mountpoint that contains logfiles that are
being heavily written. There's nothing that says that they must rotate
those logs over a particular period (assuming the fs has enough space,
etc). If the same ones are constantly being redirtied and no new
ones are created, then I think this problem can easily happen.

> > > > > ...I see no obvious reasons against unconditionally resetting dirtied_when.
> > > > >
> > > > > (a) Delaying an inode's writeback for 30s maybe too long - its blocking
> > > > > condition may well go away within 1s. (b) And it would be very undesirable
> > > > > if one big file is repeatedly redirtied hence its writeback being
> > > > > delayed considerably.
> > > > >
> > > > > However, redirty_tail() currently only tries to speedup writeback-after-redirty
> > > > > in a _best effort_ way. It at best partially hides the above issues,
> > > > > if there are any. In particular, if (b) is possible, the bug should
> > > > > already show up at least in some situations.
> > > > >
> > > > > For XFS, immediately sync of redirtied inode is actually discouraged:
> > > > >
> > > > > http://lkml.org/lkml/2008/1/16/491
> > > > >
> > > > >
> > > >
> > > > Ok, those are good points that I need to think about.
> > > >
> > > > Thanks for the help so far. I'd welcome any suggestions you have on
> > > > how best to fix this.
> > >
> > > For NFS, is it desirable to retry a redirtied inode after 30s, or
> > > after a shorter 5s, or after 0.1~5s? Or the exact timing simply
> > > doesn't matter?
> > >
> >
> > I don't really consider NFS to be a special case here. It just happens
> > to be where we saw the problem originally. Some of its characteristics
> > might make it easier to hit this, but I'm not certain of that.
>
> Now there are now two possible solutions:
> - unconditionally update dirtied_when in redirty_tail();
> - keep dirtied_when and redirty inodes to a new dedicated queue.
> The first one involves less code, the second one allows more flexible timing.
>
> NFS/XFS could be a good starting point for discussing the
> requirements, so that we can reach a suitable solution.
>

It sounds like it, yes. I saw that you posted some patches in January
(including your s_more_io_wait patch). I'll give those a closer look.
Adding the new s_more_io_wait queue is interesting and might sidestep
this problem nicely.

--
Jeff Layton <[email protected]>

2009-03-25 13:19:20

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

Jeff Layton wrote:
> On Wed, 25 Mar 2009 20:17:43 +0800
> Wu Fengguang <[email protected]> wrote:
>
>> On Wed, Mar 25, 2009 at 07:51:10PM +0800, Jeff Layton wrote:
>>> On Wed, 25 Mar 2009 10:50:37 +0800
>>> Wu Fengguang <[email protected]> wrote:
>>>
>>>>> Given the right situation though (or maybe the right filesystem), it's
>>>>> not too hard to imagine this problem occurring even in current mainline
>>>>> code with an inode that's frequently being redirtied.
>>>> My reasoning with recent kernel is: for kupdate, s_dirty enqueues only
>>>> happen in __mark_inode_dirty() and redirty_tail(). Newly dirtied
>>>> inodes will be parked in s_dirty for 30s. During which time the
>>>> actively being-redirtied inodes, if their dirtied_when is an old stuck
>>>> value, will be retried for writeback and then re-inserted into a
>>>> non-empty s_dirty queue and have their dirtied_when refreshed.
>>>>
>>> Doesn't that assume that there are new inodes that are being dirtied?
>>> If you only have the same inodes being redirtied and never any new
>>> ones, the problem still occurs, right?
>> Yes. But will a production server run months without making one single
>> new dirtied inode? (Just out of curiosity. Not that I'm not willing to
>> fix this possible issue.:)
>>
>
> Yes. It's not that the box will run that long without creating a
> single new dirtied inode, but rather that it won't necessarily create
> one on all of its mounts. It's often the case that someone has a
> mountpoint for a dedicated purpose.
>
> Consider a host that has a mountpoint that contains logfiles that are
> being heavily written. There's nothing that says that they must rotate
> those logs over a particular period (assuming the fs has enough space,
> etc). If the same ones are constantly being redirtied and no new
> ones are created, then I think this problem can easily happen.
>
>>>>>> ...I see no obvious reasons against unconditionally resetting dirtied_when.
>>>>>>
>>>>>> (a) Delaying an inode's writeback for 30s maybe too long - its blocking
>>>>>> condition may well go away within 1s. (b) And it would be very undesirable
>>>>>> if one big file is repeatedly redirtied hence its writeback being
>>>>>> delayed considerably.
>>>>>>
>>>>>> However, redirty_tail() currently only tries to speedup writeback-after-redirty
>>>>>> in a _best effort_ way. It at best partially hides the above issues,
>>>>>> if there are any. In particular, if (b) is possible, the bug should
>>>>>> already show up at least in some situations.
>>>>>>
>>>>>> For XFS, immediately sync of redirtied inode is actually discouraged:
>>>>>>
>>>>>> http://lkml.org/lkml/2008/1/16/491
>>>>>>
>>>>>>
>>>>> Ok, those are good points that I need to think about.
>>>>>
>>>>> Thanks for the help so far. I'd welcome any suggestions you have on
>>>>> how best to fix this.
>>>> For NFS, is it desirable to retry a redirtied inode after 30s, or
>>>> after a shorter 5s, or after 0.1~5s? Or the exact timing simply
>>>> doesn't matter?
>>>>
>>> I don't really consider NFS to be a special case here. It just happens
>>> to be where we saw the problem originally. Some of its characteristics
>>> might make it easier to hit this, but I'm not certain of that.
>> Now there are now two possible solutions:
>> - unconditionally update dirtied_when in redirty_tail();
>> - keep dirtied_when and redirty inodes to a new dedicated queue.
>> The first one involves less code, the second one allows more flexible timing.
>>
>> NFS/XFS could be a good starting point for discussing the
>> requirements, so that we can reach a suitable solution.
>>
>
> It sounds like it, yes. I saw that you posted some patches in January
> (including your s_more_io_wait patch). I'll give those a closer look.
> Adding the new s_more_io_wait queue is interesting and might sidestep
> this problem nicely.
>

Yes, I was looking at that bit of code but, so far, I think it won't be
called for the case we are trying to describe.

Ian

2009-03-25 13:39:12

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

Ian Kent wrote:
> Jeff Layton wrote:
>> On Wed, 25 Mar 2009 20:17:43 +0800
>> Wu Fengguang <[email protected]> wrote:
>>
>>> On Wed, Mar 25, 2009 at 07:51:10PM +0800, Jeff Layton wrote:
>>>> On Wed, 25 Mar 2009 10:50:37 +0800
>>>> Wu Fengguang <[email protected]> wrote:
>>>>
>>>>>> Given the right situation though (or maybe the right filesystem), it's
>>>>>> not too hard to imagine this problem occurring even in current mainline
>>>>>> code with an inode that's frequently being redirtied.
>>>>> My reasoning with recent kernel is: for kupdate, s_dirty enqueues only
>>>>> happen in __mark_inode_dirty() and redirty_tail(). Newly dirtied
>>>>> inodes will be parked in s_dirty for 30s. During which time the
>>>>> actively being-redirtied inodes, if their dirtied_when is an old stuck
>>>>> value, will be retried for writeback and then re-inserted into a
>>>>> non-empty s_dirty queue and have their dirtied_when refreshed.
>>>>>
>>>> Doesn't that assume that there are new inodes that are being dirtied?
>>>> If you only have the same inodes being redirtied and never any new
>>>> ones, the problem still occurs, right?
>>> Yes. But will a production server run months without making one single
>>> new dirtied inode? (Just out of curiosity. Not that I'm not willing to
>>> fix this possible issue.:)
>>>
>> Yes. It's not that the box will run that long without creating a
>> single new dirtied inode, but rather that it won't necessarily create
>> one on all of its mounts. It's often the case that someone has a
>> mountpoint for a dedicated purpose.
>>
>> Consider a host that has a mountpoint that contains logfiles that are
>> being heavily written. There's nothing that says that they must rotate
>> those logs over a particular period (assuming the fs has enough space,
>> etc). If the same ones are constantly being redirtied and no new
>> ones are created, then I think this problem can easily happen.
>>
>>>>>>> ...I see no obvious reasons against unconditionally resetting dirtied_when.
>>>>>>>
>>>>>>> (a) Delaying an inode's writeback for 30s maybe too long - its blocking
>>>>>>> condition may well go away within 1s. (b) And it would be very undesirable
>>>>>>> if one big file is repeatedly redirtied hence its writeback being
>>>>>>> delayed considerably.
>>>>>>>
>>>>>>> However, redirty_tail() currently only tries to speedup writeback-after-redirty
>>>>>>> in a _best effort_ way. It at best partially hides the above issues,
>>>>>>> if there are any. In particular, if (b) is possible, the bug should
>>>>>>> already show up at least in some situations.
>>>>>>>
>>>>>>> For XFS, immediately sync of redirtied inode is actually discouraged:
>>>>>>>
>>>>>>> http://lkml.org/lkml/2008/1/16/491
>>>>>>>
>>>>>>>
>>>>>> Ok, those are good points that I need to think about.
>>>>>>
>>>>>> Thanks for the help so far. I'd welcome any suggestions you have on
>>>>>> how best to fix this.
>>>>> For NFS, is it desirable to retry a redirtied inode after 30s, or
>>>>> after a shorter 5s, or after 0.1~5s? Or the exact timing simply
>>>>> doesn't matter?
>>>>>
>>>> I don't really consider NFS to be a special case here. It just happens
>>>> to be where we saw the problem originally. Some of its characteristics
>>>> might make it easier to hit this, but I'm not certain of that.
>>> Now there are now two possible solutions:
>>> - unconditionally update dirtied_when in redirty_tail();
>>> - keep dirtied_when and redirty inodes to a new dedicated queue.
>>> The first one involves less code, the second one allows more flexible timing.
>>>
>>> NFS/XFS could be a good starting point for discussing the
>>> requirements, so that we can reach a suitable solution.
>>>
>> It sounds like it, yes. I saw that you posted some patches in January
>> (including your s_more_io_wait patch). I'll give those a closer look.
>> Adding the new s_more_io_wait queue is interesting and might sidestep
>> this problem nicely.
>>
>
> Yes, I was looking at that bit of code but, so far, I think it won't be
> called for the case we are trying to describe.

I take that back.
As Jeff pointed out, I haven't seen these patches and can't seem to find
them in my fsdevel list folder. Wu, can you send me a copy please?

Ian

2009-03-25 13:45:45

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Wed, Mar 25, 2009 at 09:38:47PM +0800, Ian Kent wrote:
> Ian Kent wrote:
> > Jeff Layton wrote:
> >> On Wed, 25 Mar 2009 20:17:43 +0800
> >> Wu Fengguang <[email protected]> wrote:
> >>
> >>> On Wed, Mar 25, 2009 at 07:51:10PM +0800, Jeff Layton wrote:
> >>>> On Wed, 25 Mar 2009 10:50:37 +0800
> >>>> Wu Fengguang <[email protected]> wrote:
> >>>>
> >>>>>> Given the right situation though (or maybe the right filesystem), it's
> >>>>>> not too hard to imagine this problem occurring even in current mainline
> >>>>>> code with an inode that's frequently being redirtied.
> >>>>> My reasoning with recent kernel is: for kupdate, s_dirty enqueues only
> >>>>> happen in __mark_inode_dirty() and redirty_tail(). Newly dirtied
> >>>>> inodes will be parked in s_dirty for 30s. During which time the
> >>>>> actively being-redirtied inodes, if their dirtied_when is an old stuck
> >>>>> value, will be retried for writeback and then re-inserted into a
> >>>>> non-empty s_dirty queue and have their dirtied_when refreshed.
> >>>>>
> >>>> Doesn't that assume that there are new inodes that are being dirtied?
> >>>> If you only have the same inodes being redirtied and never any new
> >>>> ones, the problem still occurs, right?
> >>> Yes. But will a production server run months without making one single
> >>> new dirtied inode? (Just out of curiosity. Not that I'm not willing to
> >>> fix this possible issue.:)
> >>>
> >> Yes. It's not that the box will run that long without creating a
> >> single new dirtied inode, but rather that it won't necessarily create
> >> one on all of its mounts. It's often the case that someone has a
> >> mountpoint for a dedicated purpose.
> >>
> >> Consider a host that has a mountpoint that contains logfiles that are
> >> being heavily written. There's nothing that says that they must rotate
> >> those logs over a particular period (assuming the fs has enough space,
> >> etc). If the same ones are constantly being redirtied and no new
> >> ones are created, then I think this problem can easily happen.
> >>
> >>>>>>> ...I see no obvious reasons against unconditionally resetting dirtied_when.
> >>>>>>>
> >>>>>>> (a) Delaying an inode's writeback for 30s maybe too long - its blocking
> >>>>>>> condition may well go away within 1s. (b) And it would be very undesirable
> >>>>>>> if one big file is repeatedly redirtied hence its writeback being
> >>>>>>> delayed considerably.
> >>>>>>>
> >>>>>>> However, redirty_tail() currently only tries to speedup writeback-after-redirty
> >>>>>>> in a _best effort_ way. It at best partially hides the above issues,
> >>>>>>> if there are any. In particular, if (b) is possible, the bug should
> >>>>>>> already show up at least in some situations.
> >>>>>>>
> >>>>>>> For XFS, immediately sync of redirtied inode is actually discouraged:
> >>>>>>>
> >>>>>>> http://lkml.org/lkml/2008/1/16/491
> >>>>>>>
> >>>>>>>
> >>>>>> Ok, those are good points that I need to think about.
> >>>>>>
> >>>>>> Thanks for the help so far. I'd welcome any suggestions you have on
> >>>>>> how best to fix this.
> >>>>> For NFS, is it desirable to retry a redirtied inode after 30s, or
> >>>>> after a shorter 5s, or after 0.1~5s? Or the exact timing simply
> >>>>> doesn't matter?
> >>>>>
> >>>> I don't really consider NFS to be a special case here. It just happens
> >>>> to be where we saw the problem originally. Some of its characteristics
> >>>> might make it easier to hit this, but I'm not certain of that.
> >>> Now there are now two possible solutions:
> >>> - unconditionally update dirtied_when in redirty_tail();
> >>> - keep dirtied_when and redirty inodes to a new dedicated queue.
> >>> The first one involves less code, the second one allows more flexible timing.
> >>>
> >>> NFS/XFS could be a good starting point for discussing the
> >>> requirements, so that we can reach a suitable solution.
> >>>
> >> It sounds like it, yes. I saw that you posted some patches in January
> >> (including your s_more_io_wait patch). I'll give those a closer look.
> >> Adding the new s_more_io_wait queue is interesting and might sidestep
> >> this problem nicely.
> >>
> >
> > Yes, I was looking at that bit of code but, so far, I think it won't be
> > called for the case we are trying to describe.

You mean this case?

	} else if (inode->i_state & I_DIRTY) {
		/*
		 * Someone redirtied the inode while were writing back
		 * the pages.
		 */
		redirty_tail(inode);
	} else if (atomic_read(&inode->i_count)) {

Sure we can replace the redirty_tail() with requeue_io_wait().
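
(requeue_io_wait() is the helper from the s_more_io_wait patchset;
presumably it is just a list_move onto the dedicated queue, by analogy
with requeue_io(). A sketch of the assumed shape, not the posted code:

static void requeue_io_wait(struct inode *inode)
{
	/* Assumed: park the inode on s_more_io_wait instead of restamping
	 * dirtied_when or putting it back on s_dirty. */
	list_move(&inode->i_list, &inode->i_sb->s_more_io_wait);
}
)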

> I take that back.
> As Jeff pointed out I haven't seen these patches and can't seem to find
> them in my fsdevel list folder, Wu can you send me a copy please?

OK, wait a minute...

Thanks,
Fengguang

2009-03-25 14:02:29

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Wed, 25 Mar 2009 22:38:47 +0900
Ian Kent <[email protected]> wrote:

> Ian Kent wrote:
> > Jeff Layton wrote:
> >> On Wed, 25 Mar 2009 20:17:43 +0800
> >> Wu Fengguang <[email protected]> wrote:
> >>
> >>> On Wed, Mar 25, 2009 at 07:51:10PM +0800, Jeff Layton wrote:
> >>>> On Wed, 25 Mar 2009 10:50:37 +0800
> >>>> Wu Fengguang <[email protected]> wrote:
> >>>>
> >>>>>> Given the right situation though (or maybe the right filesystem), it's
> >>>>>> not too hard to imagine this problem occurring even in current mainline
> >>>>>> code with an inode that's frequently being redirtied.
> >>>>> My reasoning with recent kernel is: for kupdate, s_dirty enqueues only
> >>>>> happen in __mark_inode_dirty() and redirty_tail(). Newly dirtied
> >>>>> inodes will be parked in s_dirty for 30s. During which time the
> >>>>> actively being-redirtied inodes, if their dirtied_when is an old stuck
> >>>>> value, will be retried for writeback and then re-inserted into a
> >>>>> non-empty s_dirty queue and have their dirtied_when refreshed.
> >>>>>
> >>>> Doesn't that assume that there are new inodes that are being dirtied?
> >>>> If you only have the same inodes being redirtied and never any new
> >>>> ones, the problem still occurs, right?
> >>> Yes. But will a production server run months without making one single
> >>> new dirtied inode? (Just out of curiosity. Not that I'm not willing to
> >>> fix this possible issue.:)
> >>>
> >> Yes. It's not that the box will run that long without creating a
> >> single new dirtied inode, but rather that it won't necessarily create
> >> one on all of its mounts. It's often the case that someone has a
> >> mountpoint for a dedicated purpose.
> >>
> >> Consider a host that has a mountpoint that contains logfiles that are
> >> being heavily written. There's nothing that says that they must rotate
> >> those logs over a particular period (assuming the fs has enough space,
> >> etc). If the same ones are constantly being redirtied and no new
> >> ones are created, then I think this problem can easily happen.
> >>
> >>>>>>> ...I see no obvious reasons against unconditionally resetting dirtied_when.
> >>>>>>>
> >>>>>>> (a) Delaying an inode's writeback for 30s maybe too long - its blocking
> >>>>>>> condition may well go away within 1s. (b) And it would be very undesirable
> >>>>>>> if one big file is repeatedly redirtied hence its writeback being
> >>>>>>> delayed considerably.
> >>>>>>>
> >>>>>>> However, redirty_tail() currently only tries to speedup writeback-after-redirty
> >>>>>>> in a _best effort_ way. It at best partially hides the above issues,
> >>>>>>> if there are any. In particular, if (b) is possible, the bug should
> >>>>>>> already show up at least in some situations.
> >>>>>>>
> >>>>>>> For XFS, immediately sync of redirtied inode is actually discouraged:
> >>>>>>>
> >>>>>>> http://lkml.org/lkml/2008/1/16/491
> >>>>>>>
> >>>>>>>
> >>>>>> Ok, those are good points that I need to think about.
> >>>>>>
> >>>>>> Thanks for the help so far. I'd welcome any suggestions you have on
> >>>>>> how best to fix this.
> >>>>> For NFS, is it desirable to retry a redirtied inode after 30s, or
> >>>>> after a shorter 5s, or after 0.1~5s? Or the exact timing simply
> >>>>> doesn't matter?
> >>>>>
> >>>> I don't really consider NFS to be a special case here. It just happens
> >>>> to be where we saw the problem originally. Some of its characteristics
> >>>> might make it easier to hit this, but I'm not certain of that.
> >>> Now there are now two possible solutions:
> >>> - unconditionally update dirtied_when in redirty_tail();
> >>> - keep dirtied_when and redirty inodes to a new dedicated queue.
> >>> The first one involves less code, the second one allows more flexible timing.
> >>>
> >>> NFS/XFS could be a good starting point for discussing the
> >>> requirements, so that we can reach a suitable solution.
> >>>
> >> It sounds like it, yes. I saw that you posted some patches in January
> >> (including your s_more_io_wait patch). I'll give those a closer look.
> >> Adding the new s_more_io_wait queue is interesting and might sidestep
> >> this problem nicely.
> >>
> >
> > Yes, I was looking at that bit of code but, so far, I think it won't be
> > called for the case we are trying to describe.
>
> I take that back.
> As Jeff pointed out I haven't seen these patches and can't seem to find
> them in my fsdevel list folder, Wu can you send me a copy please?
>

Actually, I think you were right. We still have this check in
generic_sync_sb_inodes() even with Wu's January 2008 patches:

	/* Was this inode dirtied after sync_sb_inodes was called? */
	if (time_after(inode->dirtied_when, start))
		break;

...this check is the crux of the problem. We're assuming that the
dirtied_when value will never appear to be in the future. If we change
this check so that it's checking that dirtied_when is between "start"
and "now", then this problem basically goes away.

We'll probably also need to change the test in move_expired_inodes
too, unless Wu's changes go in.
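
That test is the older_than_this expiry check in move_expired_inodes();
roughly, it needs the same kind of guard (a sketch; the same change
appears as a hunk in the patch posted later in the thread):

	/* Don't treat a dirtied_when that appears to be in the future as
	 * "not yet expired", or a stuck inode blocks the queue forever. */
	if (older_than_this &&
	    time_after(inode->dirtied_when, *older_than_this) &&
	    time_before_eq(inode->dirtied_when, jiffies))
		break;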

--
Jeff Layton <[email protected]>

2009-03-25 14:16:55

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Wed, Mar 25, 2009 at 10:00:49PM +0800, Jeff Layton wrote:
> On Wed, 25 Mar 2009 22:38:47 +0900
> Ian Kent <[email protected]> wrote:
>
> > Ian Kent wrote:
> > > Jeff Layton wrote:
> > >> On Wed, 25 Mar 2009 20:17:43 +0800
> > >> Wu Fengguang <[email protected]> wrote:
> > >>
> > >>> On Wed, Mar 25, 2009 at 07:51:10PM +0800, Jeff Layton wrote:
> > >>>> On Wed, 25 Mar 2009 10:50:37 +0800
> > >>>> Wu Fengguang <[email protected]> wrote:
> > >>>>
> > >>>>>> Given the right situation though (or maybe the right filesystem), it's
> > >>>>>> not too hard to imagine this problem occurring even in current mainline
> > >>>>>> code with an inode that's frequently being redirtied.
> > >>>>> My reasoning with recent kernel is: for kupdate, s_dirty enqueues only
> > >>>>> happen in __mark_inode_dirty() and redirty_tail(). Newly dirtied
> > >>>>> inodes will be parked in s_dirty for 30s. During which time the
> > >>>>> actively being-redirtied inodes, if their dirtied_when is an old stuck
> > >>>>> value, will be retried for writeback and then re-inserted into a
> > >>>>> non-empty s_dirty queue and have their dirtied_when refreshed.
> > >>>>>
> > >>>> Doesn't that assume that there are new inodes that are being dirtied?
> > >>>> If you only have the same inodes being redirtied and never any new
> > >>>> ones, the problem still occurs, right?
> > >>> Yes. But will a production server run months without making one single
> > >>> new dirtied inode? (Just out of curiosity. Not that I'm not willing to
> > >>> fix this possible issue.:)
> > >>>
> > >> Yes. It's not that the box will run that long without creating a
> > >> single new dirtied inode, but rather that it won't necessarily create
> > >> one on all of its mounts. It's often the case that someone has a
> > >> mountpoint for a dedicated purpose.
> > >>
> > >> Consider a host that has a mountpoint that contains logfiles that are
> > >> being heavily written. There's nothing that says that they must rotate
> > >> those logs over a particular period (assuming the fs has enough space,
> > >> etc). If the same ones are constantly being redirtied and no new
> > >> ones are created, then I think this problem can easily happen.
> > >>
> > >>>>>>> ...I see no obvious reasons against unconditionally resetting dirtied_when.
> > >>>>>>>
> > >>>>>>> (a) Delaying an inode's writeback for 30s maybe too long - its blocking
> > >>>>>>> condition may well go away within 1s. (b) And it would be very undesirable
> > >>>>>>> if one big file is repeatedly redirtied hence its writeback being
> > >>>>>>> delayed considerably.
> > >>>>>>>
> > >>>>>>> However, redirty_tail() currently only tries to speedup writeback-after-redirty
> > >>>>>>> in a _best effort_ way. It at best partially hides the above issues,
> > >>>>>>> if there are any. In particular, if (b) is possible, the bug should
> > >>>>>>> already show up at least in some situations.
> > >>>>>>>
> > >>>>>>> For XFS, immediately sync of redirtied inode is actually discouraged:
> > >>>>>>>
> > >>>>>>> http://lkml.org/lkml/2008/1/16/491
> > >>>>>>>
> > >>>>>>>
> > >>>>>> Ok, those are good points that I need to think about.
> > >>>>>>
> > >>>>>> Thanks for the help so far. I'd welcome any suggestions you have on
> > >>>>>> how best to fix this.
> > >>>>> For NFS, is it desirable to retry a redirtied inode after 30s, or
> > >>>>> after a shorter 5s, or after 0.1~5s? Or the exact timing simply
> > >>>>> doesn't matter?
> > >>>>>
> > >>>> I don't really consider NFS to be a special case here. It just happens
> > >>>> to be where we saw the problem originally. Some of its characteristics
> > >>>> might make it easier to hit this, but I'm not certain of that.
> > >>> Now there are now two possible solutions:
> > >>> - unconditionally update dirtied_when in redirty_tail();
> > >>> - keep dirtied_when and redirty inodes to a new dedicated queue.
> > >>> The first one involves less code, the second one allows more flexible timing.
> > >>>
> > >>> NFS/XFS could be a good starting point for discussing the
> > >>> requirements, so that we can reach a suitable solution.
> > >>>
> > >> It sounds like it, yes. I saw that you posted some patches in January
> > >> (including your s_more_io_wait patch). I'll give those a closer look.
> > >> Adding the new s_more_io_wait queue is interesting and might sidestep
> > >> this problem nicely.
> > >>
> > >
> > > Yes, I was looking at that bit of code but, so far, I think it won't be
> > > called for the case we are trying to describe.
> >
> > I take that back.
> > As Jeff pointed out I haven't seen these patches and can't seem to find
> > them in my fsdevel list folder, Wu can you send me a copy please?
> >
>
> Actually, I think you were right. We still have this check in
> generic_sync_sb_inodes() even with Wu's January 2008 patches:
>
> /* Was this inode dirtied after sync_sb_inodes was called? */
> if (time_after(inode->dirtied_when, start))
> break;

Yeah, ugly code. Jens' per-bdi flush daemons should eliminate it...

> ...this check is the crux of the problem. We're assuming that the
> dirtied_when value will never appear to be in the future. If we change
> this check so that it's checking that dirtied_when is between "start"
> and "now", then this problem basically goes away.

Yeah that turns the problem into a temporary and tolerable one.

> We'll probably also need to change the test in move_expired_inodes
> too, unless Wu's changes go in.

So the most simple (and complete) solution is still this one ;-)

Thanks,
Fengguang

---
fs/fs-writeback.c | 14 +-------------
1 file changed, 1 insertion(+), 13 deletions(-)

--- mm.orig/fs/fs-writeback.c
+++ mm/fs/fs-writeback.c
@@ -182,24 +182,12 @@ static int write_inode(struct inode *ino
 /*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
- *
- * Before stamping the inode's ->dirtied_when, we check to see whether it is
- * already the most-recently-dirtied inode on the s_dirty list. If that is
- * the case then the inode must have been redirtied while it was being written
- * out and we don't reset its dirtied_when.
  */
 static void redirty_tail(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
 
-	if (!list_empty(&sb->s_dirty)) {
-		struct inode *tail_inode;
-
-		tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
-		if (!time_after_eq(inode->dirtied_when,
-				   tail_inode->dirtied_when))
-			inode->dirtied_when = jiffies;
-	}
+	inode->dirtied_when = jiffies;
 	list_move(&inode->i_list, &sb->s_dirty);
 }

2009-03-25 14:30:31

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Wed, 25 Mar 2009 22:16:18 +0800
Wu Fengguang <[email protected]> wrote:

> On Wed, Mar 25, 2009 at 10:00:49PM +0800, Jeff Layton wrote:
> > On Wed, 25 Mar 2009 22:38:47 +0900
> > Ian Kent <[email protected]> wrote:
> >
> > > Ian Kent wrote:
> > > > Jeff Layton wrote:
> > > >> On Wed, 25 Mar 2009 20:17:43 +0800
> > > >> Wu Fengguang <[email protected]> wrote:
> > > >>
> > > >>> On Wed, Mar 25, 2009 at 07:51:10PM +0800, Jeff Layton wrote:
> > > >>>> On Wed, 25 Mar 2009 10:50:37 +0800
> > > >>>> Wu Fengguang <[email protected]> wrote:
> > > >>>>
> > > >>>>>> Given the right situation though (or maybe the right filesystem), it's
> > > >>>>>> not too hard to imagine this problem occurring even in current mainline
> > > >>>>>> code with an inode that's frequently being redirtied.
> > > >>>>> My reasoning with recent kernel is: for kupdate, s_dirty enqueues only
> > > >>>>> happen in __mark_inode_dirty() and redirty_tail(). Newly dirtied
> > > >>>>> inodes will be parked in s_dirty for 30s. During which time the
> > > >>>>> actively being-redirtied inodes, if their dirtied_when is an old stuck
> > > >>>>> value, will be retried for writeback and then re-inserted into a
> > > >>>>> non-empty s_dirty queue and have their dirtied_when refreshed.
> > > >>>>>
> > > >>>> Doesn't that assume that there are new inodes that are being dirtied?
> > > >>>> If you only have the same inodes being redirtied and never any new
> > > >>>> ones, the problem still occurs, right?
> > > >>> Yes. But will a production server run months without making one single
> > > >>> new dirtied inode? (Just out of curiosity. Not that I'm not willing to
> > > >>> fix this possible issue.:)
> > > >>>
> > > >> Yes. It's not that the box will run that long without creating a
> > > >> single new dirtied inode, but rather that it won't necessarily create
> > > >> one on all of its mounts. It's often the case that someone has a
> > > >> mountpoint for a dedicated purpose.
> > > >>
> > > >> Consider a host that has a mountpoint that contains logfiles that are
> > > >> being heavily written. There's nothing that says that they must rotate
> > > >> those logs over a particular period (assuming the fs has enough space,
> > > >> etc). If the same ones are constantly being redirtied and no new
> > > >> ones are created, then I think this problem can easily happen.
> > > >>
> > > >>>>>>> ...I see no obvious reasons against unconditionally resetting dirtied_when.
> > > >>>>>>>
> > > >>>>>>> (a) Delaying an inode's writeback for 30s maybe too long - its blocking
> > > >>>>>>> condition may well go away within 1s. (b) And it would be very undesirable
> > > >>>>>>> if one big file is repeatedly redirtied hence its writeback being
> > > >>>>>>> delayed considerably.
> > > >>>>>>>
> > > >>>>>>> However, redirty_tail() currently only tries to speedup writeback-after-redirty
> > > >>>>>>> in a _best effort_ way. It at best partially hides the above issues,
> > > >>>>>>> if there are any. In particular, if (b) is possible, the bug should
> > > >>>>>>> already show up at least in some situations.
> > > >>>>>>>
> > > >>>>>>> For XFS, immediately sync of redirtied inode is actually discouraged:
> > > >>>>>>>
> > > >>>>>>> http://lkml.org/lkml/2008/1/16/491
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>> Ok, those are good points that I need to think about.
> > > >>>>>>
> > > >>>>>> Thanks for the help so far. I'd welcome any suggestions you have on
> > > >>>>>> how best to fix this.
> > > >>>>> For NFS, is it desirable to retry a redirtied inode after 30s, or
> > > >>>>> after a shorter 5s, or after 0.1~5s? Or the exact timing simply
> > > >>>>> doesn't matter?
> > > >>>>>
> > > >>>> I don't really consider NFS to be a special case here. It just happens
> > > >>>> to be where we saw the problem originally. Some of its characteristics
> > > >>>> might make it easier to hit this, but I'm not certain of that.
> > > >>> Now there are now two possible solutions:
> > > >>> - unconditionally update dirtied_when in redirty_tail();
> > > >>> - keep dirtied_when and redirty inodes to a new dedicated queue.
> > > >>> The first one involves less code, the second one allows more flexible timing.
> > > >>>
> > > >>> NFS/XFS could be a good starting point for discussing the
> > > >>> requirements, so that we can reach a suitable solution.
> > > >>>
> > > >> It sounds like it, yes. I saw that you posted some patches in January
> > > >> (including your s_more_io_wait patch). I'll give those a closer look.
> > > >> Adding the new s_more_io_wait queue is interesting and might sidestep
> > > >> this problem nicely.
> > > >>
> > > >
> > > > Yes, I was looking at that bit of code but, so far, I think it won't be
> > > > called for the case we are trying to describe.
> > >
> > > I take that back.
> > > As Jeff pointed out I haven't seen these patches and can't seem to find
> > > them in my fsdevel list folder, Wu can you send me a copy please?
> > >
> >
> > Actually, I think you were right. We still have this check in
> > generic_sync_sb_inodes() even with Wu's January 2008 patches:
> >
> > /* Was this inode dirtied after sync_sb_inodes was called? */
> > if (time_after(inode->dirtied_when, start))
> > break;
>
> Yeah, ugly code. Jens' per-bdi flush daemons should eliminate it...
>

Ok, good to know. I need to look at those more closely I guess...

> > ...this check is the crux of the problem. We're assuming that the
> > dirtied_when value will never appear to be in the future. If we change
> > this check so that it's checking that dirtied_when is between "start"
> > and "now", then this problem basically goes away.
>
> Yeah that turns the problem into a temporary and tolerable one.
>

Yes.

> > We'll probably also need to change the test in move_expired_inodes
> > too, unless Wu's changes go in.
>
> So the most simple (and complete) solution is still this one ;-)
>

I suppose so. I guess that also takes care of the problem on XFS (and
maybe other filesystems too?) of inodes getting flushed too frequently
when they're redirtied.

The downside sounds like it'll mean that big files that are being
frequently redirtied might get less frequent writeout attempts. We can
easily dirty pages faster than we can write them out (at least with
most filesystems). Will that cause a problem where we accumulate too
many dirty pages for the inode? That also means that the I/O will be
more "spiky"...

pdflush writes out some data
inode goes back on s_dirty and dirtied_when gets restamped
wait 30s...
pdflush writes out more data
etc...

That seems sub-optimal.

--
Jeff Layton <[email protected]>

2009-03-25 14:38:54

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Wed, Mar 25, 2009 at 10:28:33PM +0800, Jeff Layton wrote:
> On Wed, 25 Mar 2009 22:16:18 +0800
> Wu Fengguang <[email protected]> wrote:
>
> > On Wed, Mar 25, 2009 at 10:00:49PM +0800, Jeff Layton wrote:
> > > On Wed, 25 Mar 2009 22:38:47 +0900
> > > Ian Kent <[email protected]> wrote:
> > >
> > > > Ian Kent wrote:
> > > > > Jeff Layton wrote:
> > > > >> On Wed, 25 Mar 2009 20:17:43 +0800
> > > > >> Wu Fengguang <[email protected]> wrote:
> > > > >>
> > > > >>> On Wed, Mar 25, 2009 at 07:51:10PM +0800, Jeff Layton wrote:
> > > > >>>> On Wed, 25 Mar 2009 10:50:37 +0800
> > > > >>>> Wu Fengguang <[email protected]> wrote:
> > > > >>>>
> > > > >>>>>> Given the right situation though (or maybe the right filesystem), it's
> > > > >>>>>> not too hard to imagine this problem occurring even in current mainline
> > > > >>>>>> code with an inode that's frequently being redirtied.
> > > > >>>>> My reasoning with recent kernel is: for kupdate, s_dirty enqueues only
> > > > >>>>> happen in __mark_inode_dirty() and redirty_tail(). Newly dirtied
> > > > >>>>> inodes will be parked in s_dirty for 30s. During which time the
> > > > >>>>> actively being-redirtied inodes, if their dirtied_when is an old stuck
> > > > >>>>> value, will be retried for writeback and then re-inserted into a
> > > > >>>>> non-empty s_dirty queue and have their dirtied_when refreshed.
> > > > >>>>>
> > > > >>>> Doesn't that assume that there are new inodes that are being dirtied?
> > > > >>>> If you only have the same inodes being redirtied and never any new
> > > > >>>> ones, the problem still occurs, right?
> > > > >>> Yes. But will a production server run months without making one single
> > > > >>> new dirtied inode? (Just out of curiosity. Not that I'm not willing to
> > > > >>> fix this possible issue.:)
> > > > >>>
> > > > >> Yes. It's not that the box will run that long without creating a
> > > > >> single new dirtied inode, but rather that it won't necessarily create
> > > > >> one on all of its mounts. It's often the case that someone has a
> > > > >> mountpoint for a dedicated purpose.
> > > > >>
> > > > >> Consider a host that has a mountpoint that contains logfiles that are
> > > > >> being heavily written. There's nothing that says that they must rotate
> > > > >> those logs over a particular period (assuming the fs has enough space,
> > > > >> etc). If the same ones are constantly being redirtied and no new
> > > > >> ones are created, then I think this problem can easily happen.
> > > > >>
> > > > >>>>>>> ...I see no obvious reasons against unconditionally resetting dirtied_when.
> > > > >>>>>>>
> > > > >>>>>>> (a) Delaying an inode's writeback for 30s maybe too long - its blocking
> > > > >>>>>>> condition may well go away within 1s. (b) And it would be very undesirable
> > > > >>>>>>> if one big file is repeatedly redirtied hence its writeback being
> > > > >>>>>>> delayed considerably.
> > > > >>>>>>>
> > > > >>>>>>> However, redirty_tail() currently only tries to speedup writeback-after-redirty
> > > > >>>>>>> in a _best effort_ way. It at best partially hides the above issues,
> > > > >>>>>>> if there are any. In particular, if (b) is possible, the bug should
> > > > >>>>>>> already show up at least in some situations.
> > > > >>>>>>>
> > > > >>>>>>> For XFS, immediately sync of redirtied inode is actually discouraged:
> > > > >>>>>>>
> > > > >>>>>>> http://lkml.org/lkml/2008/1/16/491
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>> Ok, those are good points that I need to think about.
> > > > >>>>>>
> > > > >>>>>> Thanks for the help so far. I'd welcome any suggestions you have on
> > > > >>>>>> how best to fix this.
> > > > >>>>> For NFS, is it desirable to retry a redirtied inode after 30s, or
> > > > >>>>> after a shorter 5s, or after 0.1~5s? Or the exact timing simply
> > > > >>>>> doesn't matter?
> > > > >>>>>
> > > > >>>> I don't really consider NFS to be a special case here. It just happens
> > > > >>>> to be where we saw the problem originally. Some of its characteristics
> > > > >>>> might make it easier to hit this, but I'm not certain of that.
> > > > >>> Now there are now two possible solutions:
> > > > >>> - unconditionally update dirtied_when in redirty_tail();
> > > > >>> - keep dirtied_when and redirty inodes to a new dedicated queue.
> > > > >>> The first one involves less code, the second one allows more flexible timing.
> > > > >>>
> > > > >>> NFS/XFS could be a good starting point for discussing the
> > > > >>> requirements, so that we can reach a suitable solution.
> > > > >>>
> > > > >> It sounds like it, yes. I saw that you posted some patches in January
> > > > >> (including your s_more_io_wait patch). I'll give those a closer look.
> > > > >> Adding the new s_more_io_wait queue is interesting and might sidestep
> > > > >> this problem nicely.
> > > > >>
> > > > >
> > > > > Yes, I was looking at that bit of code but, so far, I think it won't be
> > > > > called for the case we are trying to describe.
> > > >
> > > > I take that back.
> > > > As Jeff pointed out I haven't seen these patches and can't seem to find
> > > > them in my fsdevel list folder, Wu can you send me a copy please?
> > > >
> > >
> > > Actually, I think you were right. We still have this check in
> > > generic_sync_sb_inodes() even with Wu's January 2008 patches:
> > >
> > > /* Was this inode dirtied after sync_sb_inodes was called? */
> > > if (time_after(inode->dirtied_when, start))
> > > break;
> >
> > Yeah, ugly code. Jens' per-bdi flush daemons should eliminate it...
> >
>
> Ok, good to know. I need to look at those more closely I guess...
>
> > > ...this check is the crux of the problem. We're assuming that the
> > > dirtied_when value will never appear to be in the future. If we change
> > > this check so that it's checking that dirtied_when is between "start"
> > > and "now", then this problem basically goes away.
> >
> > Yeah that turns the problem into a temporary and tolerable one.
> >
>
> Yes.
>
> > > We'll probably also need to change the test in move_expired_inodes
> > > too, unless Wu's changes go in.
> >
> > So the most simple (and complete) solution is still this one ;-)
> >
>
> I suppose so. I guess that also takes care of the problem on XFS (and
> maybe other filesystems too?) of inodes getting flushed too frequently
> when they're redirtied.
>
> The downside sounds like that it'll mean that big files that are being
> frequently redirtied might get less frequent writeout attempts. We can
> easily dirty pages faster than we can write them out (at least with
> most filesystems). Will that cause problem where we accumulate too many
> dirty pages for the inode? That also means that the I/O will be more
> "spiky"...
>
> pdflush writes out some data
> inode goes back on s_dirty and dirtied_when gets restamped
> wait 30s...
> pdflush writes out more data
> etc...
>
> That seems sub-optimal.

Yup, adding a 30s delay on each redirty sounds like too much. That's why
Andrew tried to keep dirtied_when untouched, and why I proposed the
s_more_io_wait queue.

So let's refresh the s_more_io_wait patchset?
I'll do it tomorrow...in a fresh day :-)

Thanks,
Fengguang

2009-03-25 16:55:35

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Wed, Mar 25, 2009 at 08:17:43PM +0800, Wu Fengguang wrote:
> Now there are now two possible solutions:
> - unconditionally update dirtied_when in redirty_tail();
> - keep dirtied_when and redirty inodes to a new dedicated queue.
> The first one involves less code, the second one allows more flexible timing.
>
> NFS/XFS could be a good starting point for discussing the
> requirements, so that we can reach a suitable solution.

Note that the XFS requirement also applies to all filesystems that
perform some sort of metadata updates on I/O completion. That includes
at least ext4, btrfs and most likely the cluster filesystems too.

2009-03-25 20:09:21

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Wed, 2009-03-25 at 12:55 -0400, [email protected] wrote:
> On Wed, Mar 25, 2009 at 08:17:43PM +0800, Wu Fengguang wrote:
> > Now there are now two possible solutions:
> > - unconditionally update dirtied_when in redirty_tail();
> > - keep dirtied_when and redirty inodes to a new dedicated queue.
> > The first one involves less code, the second one allows more flexible timing.
> >
> > NFS/XFS could be a good starting point for discussing the
> > requirements, so that we can reach a suitable solution.
>
> Note that the XFS requirement also applies to all filesystems that
> perform some sort of metadata updats on I/O completeion. That includes
> at least ext4, btrfs and most likely the cluster filesystems too.

btrfs at least doesn't dirty the inode on I/O completion. It just puts
the changes directly into the btree blocks.

-chris

2009-03-26 17:05:31

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Wed, 25 Mar 2009 22:16:18 +0800
Wu Fengguang <[email protected]> wrote:

> >
> > Actually, I think you were right. We still have this check in
> > generic_sync_sb_inodes() even with Wu's January 2008 patches:
> >
> > /* Was this inode dirtied after sync_sb_inodes was called? */
> > if (time_after(inode->dirtied_when, start))
> > break;
>
> Yeah, ugly code. Jens' per-bdi flush daemons should eliminate it...
>

I had a look over Jens' patches and they seem to be more concerned with
how the queues and daemons are organized (per-bdi rather than per-sb).
The actual way that inodes flow between the queues and get written out
doesn't look like it really changes with his set.

They also don't eliminate the problematic check above. Regardless of
whether your or Jens' patches make it in, I think we'll still need
something like the following (untested) patch.

If this looks ok, I'll flesh out the comments some and "officially" post
it. Thoughts?

--------------[snip]-----------------

From d10adff2d5f9a15d19c438119dbb2c410bd26e3c Mon Sep 17 00:00:00 2001
From: Jeff Layton <[email protected]>
Date: Thu, 26 Mar 2009 12:54:52 -0400
Subject: [PATCH] writeback: guard against jiffies wraparound on inode->dirtied_when checks

The dirtied_when value on an inode is supposed to represent the first
time that an inode has one of its pages dirtied. This value is in units
of jiffies. This value is used in several places in the writeback code
to determine when to write out an inode.

The problem is that these checks assume that dirtied_when is updated
periodically. But if an inode is continuously being used for I/O it can
be persistently marked as dirty and will continue to age. Once the time
difference between dirtied_when and the jiffies value it is being
compared to is greater than (or equal to) half the maximum of the
jiffies type, the logic of the time_*() macros inverts and the opposite
of what is needed is returned. On 32-bit architectures that's just under
25 days (assuming HZ == 1000).

As the least-recently dirtied inode, it'll end up being the first one
that pdflush will try to write out. sync_sb_inodes does this check
however:

	/* Was this inode dirtied after sync_sb_inodes was called? */
	if (time_after(inode->dirtied_when, start))
		break;

...but now dirtied_when appears to be in the future. sync_sb_inodes
bails out without attempting to write any dirty inodes. When this
occurs, pdflush will stop writing out inodes for this superblock and
nothing will unwedge it until jiffies moves out of the problematic
window.

This patch fixes this problem by changing the time_after checks against
dirtied_when to also check whether dirtied_when appears to be in the
future. If it does, then we consider the value to be in the past.

This should shrink the problematic window to such a small period as not
to matter.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/fs-writeback.c | 11 +++++++----
1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e3fe991..dba69a5 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -196,8 +196,9 @@ static void redirty_tail(struct inode *inode)
 		struct inode *tail_inode;
 
 		tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
-		if (!time_after_eq(inode->dirtied_when,
-				   tail_inode->dirtied_when))
+		if (time_before(inode->dirtied_when,
+				tail_inode->dirtied_when) ||
+		    time_after(inode->dirtied_when, jiffies))
 			inode->dirtied_when = jiffies;
 	}
 	list_move(&inode->i_list, &sb->s_dirty);
@@ -231,7 +232,8 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 		struct inode *inode = list_entry(delaying_queue->prev,
 						struct inode, i_list);
 		if (older_than_this &&
-		    time_after(inode->dirtied_when, *older_than_this))
+		    time_after(inode->dirtied_when, *older_than_this) &&
+		    time_before_eq(inode->dirtied_when, jiffies))
 			break;
 		list_move(&inode->i_list, dispatch_queue);
 	}
@@ -493,7 +495,8 @@ void generic_sync_sb_inodes(struct super_block *sb,
 		}
 
 		/* Was this inode dirtied after sync_sb_inodes was called? */
-		if (time_after(inode->dirtied_when, start))
+		if (time_after(inode->dirtied_when, start) &&
+		    time_before_eq(inode->dirtied_when, jiffies))
 			break;
 
 		/* Is another pdflush already flushing this queue? */
--
1.5.5.6

2009-03-27 02:13:46

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Fri, Mar 27, 2009 at 01:03:27AM +0800, Jeff Layton wrote:
> On Wed, 25 Mar 2009 22:16:18 +0800
> Wu Fengguang <[email protected]> wrote:
>
> > >
> > > Actually, I think you were right. We still have this check in
> > > generic_sync_sb_inodes() even with Wu's January 2008 patches:
> > >
> > > /* Was this inode dirtied after sync_sb_inodes was called? */
> > > if (time_after(inode->dirtied_when, start))
> > > break;
> >
> > Yeah, ugly code. Jens' per-bdi flush daemons should eliminate it...
> >
>
> I had a look over Jens' patches and they seem to be more concerned with
> how the queues and daemons are organized (per-bdi rather than per-sb).
> The actual way that inodes flow between the queues and get written out
> doesn't look like it really changes with his set.

OK, sorry that I've not carefully reviewed the per-bdi flushing patchset.

> They also don't eliminate the problematic check above. Regardless of
> whether your or Jens' patches make it in, I think we'll still need
> something like the following (untested) patch.
>
> If this looks ok, I'll flesh out the comments some and "officially" post
> it. Thoughts?

It's good in itself. However, with the more_io_wait queue, the first two
chunks will be eliminated. Mind if I carry this patch with my patchset?

Thanks,
Fengguang

2009-03-27 11:18:38

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Fri, 27 Mar 2009 10:13:03 +0800
Wu Fengguang <[email protected]> wrote:

>
> > They also don't eliminate the problematic check above. Regardless of
> > whether your or Jens' patches make it in, I think we'll still need
> > something like the following (untested) patch.
> >
> > If this looks ok, I'll flesh out the comments some and "officially" post
> > it. Thoughts?
>
> It's good in itself. However, with the more_io_wait queue, the first two
> chunks will be eliminated. Mind if I carry this patch with my patchset?
>

It makes sense to roll that fix in with the stuff you're doing.

If it's going to be a little while before your patches get taken into
mainline though, it might not hurt to go ahead and push my patch in as
an interim fix. It shouldn't change the behavior of the code in the
normal case of a short-lived dirtied_when value, and should guard
against major problems when there's a long-lived one.
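
As a rough sanity check of that claim, here's a quick userspace-only
sketch (simplified copies of the jiffies comparison macros, so treat it
as an illustration rather than proof): in the normal case the extra
time_before_eq() test is always true and the result is unchanged, while
a stuck dirtied_when that has aged past the wrap point no longer looks
like it is "after" start.

#include <assert.h>

/* simplified copies of the jiffies comparison macros */
#define time_after(a, b)       ((long)((b) - (a)) < 0)
#define time_before_eq(a, b)   ((long)((b) - (a)) >= 0)

int main(void)
{
        unsigned long jiffies = 100000UL;
        unsigned long start = jiffies - 10UL;   /* sampled just before the loop */
        unsigned long dirtied_when;

        /* normal case: recently dirtied, the extra test changes nothing */
        dirtied_when = jiffies - 5UL;
        assert(time_after(dirtied_when, start) ==
               (time_after(dirtied_when, start) &&
                time_before_eq(dirtied_when, jiffies)));

        /* stuck dirtied_when that has aged past the wrap point */
        dirtied_when = jiffies - ((~0UL >> 1) + 100UL);
        assert(time_after(dirtied_when, start));        /* old check: "future" */
        assert(!(time_after(dirtied_when, start) &&
                 time_before_eq(dirtied_when, jiffies)));  /* new check: not after start */
        return 0;
}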

--
Jeff Layton <[email protected]>

2009-03-29 00:06:00

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] writeback: reset inode dirty time when adding it back to empty s_dirty list

On Fri, Mar 27, 2009 at 07:16:33PM +0800, Jeff Layton wrote:
> On Fri, 27 Mar 2009 10:13:03 +0800
> Wu Fengguang <[email protected]> wrote:
>
> >
> > > They also don't eliminate the problematic check above. Regardless of
> > > whether your or Jens' patches make it in, I think we'll still need
> > > something like the following (untested) patch.
> > >
> > > If this looks ok, I'll flesh out the comments some and "officially" post
> > > it. Thoughts?
> >
> > It's good in itself. However, with the more_io_wait queue, the first two
> > chunks will be eliminated. Mind if I carry this patch with my patchset?
> >
>
> It makes sense to roll that fix in with the stuff you're doing.
>
> If it's going to be a little while before your patches get taken into
> mainline though, it might not hurt to go ahead and push my patch in as
> an interim fix. It shouldn't change the behavior of the code in the
> normal case of a short-lived dirtied_when value, and should guard
> against major problems when there's a long-lived one.

I'm afraid my patchset will miss the 2.6.30 merge window, so it makes
sense to merge your patch first:

> From: Jeff Layton <[email protected]>
> Subject: [PATCH] writeback: guard against jiffies wraparound on inode->dirtied_when checks

Acked-by: Wu Fengguang <[email protected]>

Thanks,
Fengguang