2009-03-24 12:40:54

by Fengguang Wu

[permalink] [raw]
Subject: [PATCH] skip I_CLEAR state inodes

Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
add_dquot_ref().

clear_inode() will switch inode state from I_FREEING to I_CLEAR,
and do so _outside_ of inode_lock. So any I_FREEING testing is
incomplete without the testing of I_CLEAR.

Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
Jan Kara reminds fixing the other two cases. Thanks!

Reported-by: Masayoshi MIZUMA <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
fs/dquot.c | 2 +-
fs/drop_caches.c | 2 +-
fs/fs-writeback.c | 3 ++-
3 files changed, 4 insertions(+), 3 deletions(-)

--- mm.orig/fs/drop_caches.c
+++ mm/fs/drop_caches.c
@@ -18,7 +18,7 @@ static void drop_pagecache_sb(struct sup

spin_lock(&inode_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
- if (inode->i_state & (I_FREEING|I_WILL_FREE))
+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
continue;
if (inode->i_mapping->nrpages == 0)
continue;
--- mm.orig/fs/fs-writeback.c
+++ mm/fs/fs-writeback.c
@@ -538,7 +538,8 @@ void generic_sync_sb_inodes(struct super
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
struct address_space *mapping;

- if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+ if (inode->i_state &
+ (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
continue;
mapping = inode->i_mapping;
if (mapping->nrpages == 0)
--- mm.orig/fs/dquot.c
+++ mm/fs/dquot.c
@@ -793,7 +793,7 @@ static void add_dquot_ref(struct super_b
continue;
if (!dqinit_needed(inode, type))
continue;
- if (inode->i_state & (I_FREEING|I_WILL_FREE))
+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
continue;

__iget(inode);


2009-03-30 07:18:51

by Fengguang Wu

[permalink] [raw]
Subject: [PATCH][RESEND for 2.6.29-rc8-mm1] skip I_CLEAR state inodes

clear_inode() will switch inode state from I_FREEING to I_CLEAR,
and do so _outside_ of inode_lock. So any I_FREEING testing is
incomplete without a coupled testing of I_CLEAR.

So add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
add_dquot_ref().

Masayoshi MIZUMA discovered the bug in drop_pagecache_sb() and Jan Kara
reminds fixing the other two cases.

Reported-by: Masayoshi MIZUMA <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
fs/drop_caches.c | 2 +-
fs/fs-writeback.c | 3 ++-
fs/quota/dquot.c | 2 +-
3 files changed, 4 insertions(+), 3 deletions(-)

--- mm.orig/fs/drop_caches.c
+++ mm/fs/drop_caches.c
@@ -18,7 +18,7 @@ static void drop_pagecache_sb(struct sup

spin_lock(&inode_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
- if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
continue;
if (inode->i_mapping->nrpages == 0)
continue;
--- mm.orig/fs/fs-writeback.c
+++ mm/fs/fs-writeback.c
@@ -538,7 +538,8 @@ void generic_sync_sb_inodes(struct super
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
struct address_space *mapping;

- if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+ if (inode->i_state &
+ (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
continue;
mapping = inode->i_mapping;
if (mapping->nrpages == 0)
--- mm.orig/fs/quota/dquot.c
+++ mm/fs/quota/dquot.c
@@ -823,7 +823,7 @@ static void add_dquot_ref(struct super_b

spin_lock(&inode_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
- if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
continue;
if (!atomic_read(&inode->i_writecount))
continue;

2009-03-31 23:49:39

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH][RESEND for 2.6.29-rc8-mm1] skip I_CLEAR state inodes

On Mon, 30 Mar 2009 15:18:24 +0800
Wu Fengguang <[email protected]> wrote:

> clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> and do so _outside_ of inode_lock. So any I_FREEING testing is
> incomplete without a coupled testing of I_CLEAR.
>
> So add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> add_dquot_ref().
>
> Masayoshi MIZUMA discovered the bug in drop_pagecache_sb() and Jan Kara
> reminds fixing the other two cases.

ok...

But what is the user-visible consequence of this? You cc'ed
[email protected] so I assume it's serious. People will want to know
what problem we're fixing!

>
> --- mm.orig/fs/drop_caches.c
> +++ mm/fs/drop_caches.c
> @@ -18,7 +18,7 @@ static void drop_pagecache_sb(struct sup
>
> spin_lock(&inode_lock);
> list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> - if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> + if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
> continue;
> if (inode->i_mapping->nrpages == 0)
> continue;
> --- mm.orig/fs/fs-writeback.c
> +++ mm/fs/fs-writeback.c
> @@ -538,7 +538,8 @@ void generic_sync_sb_inodes(struct super
> list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> struct address_space *mapping;
>
> - if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> + if (inode->i_state &
> + (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
> continue;
> mapping = inode->i_mapping;
> if (mapping->nrpages == 0)
> --- mm.orig/fs/quota/dquot.c
> +++ mm/fs/quota/dquot.c
> @@ -823,7 +823,7 @@ static void add_dquot_ref(struct super_b
>
> spin_lock(&inode_lock);
> list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> - if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> + if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
> continue;
> if (!atomic_read(&inode->i_writecount))
> continue;

2009-04-01 00:53:38

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH][RESEND for 2.6.29-rc8-mm1] skip I_CLEAR state inodes

On Wed, Apr 01, 2009 at 07:43:32AM +0800, Andrew Morton wrote:
> On Mon, 30 Mar 2009 15:18:24 +0800
> Wu Fengguang <[email protected]> wrote:
>
> > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > incomplete without a coupled testing of I_CLEAR.
> >
> > So add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > add_dquot_ref().
> >
> > Masayoshi MIZUMA discovered the bug in drop_pagecache_sb() and Jan Kara
> > reminds fixing the other two cases.
>
> ok...
>
> But what is the user-visible consequence of this? You cc'ed
> [email protected] so I assume it's serious. People will want to know
> what problem we're fixing!

Sorry, the changelog could be expanded with the following paragraph:

Fix real kernel panics. Masayoshi MIZUMA has a nice panic flow:
----------------------------------------------------------------------
[process A] | [process B]
| |
| prune_icache() | drop_pagecache()
| spin_lock(&inode_lock) | drop_pagecache_sb()
| inode->i_state |= I_FREEING; | |
| spin_unlock(&inode_lock) | V
| | | spin_lock(&inode_lock)
| V | |
| dispose_list() | |
| list_del() | |
| clear_inode() | |
| inode->i_state = I_CLEAR | |
| | | V
| | | if (inode->i_state & (I_FREEING|I_WILL_FREE))
| | | continue; <==== NOT MATCH
| | |
| | | (DANGER from here on! Accessing disposing inode!)
| | |
| | | __iget()
| | | list_move() <===== PANIC on poisoned list !!
V V |
(time)
----------------------------------------------------------------------

Thanks,
Fengguang

> >
> > --- mm.orig/fs/drop_caches.c
> > +++ mm/fs/drop_caches.c
> > @@ -18,7 +18,7 @@ static void drop_pagecache_sb(struct sup
> >
> > spin_lock(&inode_lock);
> > list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > - if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> > + if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
> > continue;
> > if (inode->i_mapping->nrpages == 0)
> > continue;
> > --- mm.orig/fs/fs-writeback.c
> > +++ mm/fs/fs-writeback.c
> > @@ -538,7 +538,8 @@ void generic_sync_sb_inodes(struct super
> > list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > struct address_space *mapping;
> >
> > - if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> > + if (inode->i_state &
> > + (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
> > continue;
> > mapping = inode->i_mapping;
> > if (mapping->nrpages == 0)
> > --- mm.orig/fs/quota/dquot.c
> > +++ mm/fs/quota/dquot.c
> > @@ -823,7 +823,7 @@ static void add_dquot_ref(struct super_b
> >
> > spin_lock(&inode_lock);
> > list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > - if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> > + if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
> > continue;
> > if (!atomic_read(&inode->i_writecount))
> > continue;

2009-06-01 21:38:55

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] skip I_CLEAR state inodes

Wu Fengguang wrote:
> Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> add_dquot_ref().
>
> clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> and do so _outside_ of inode_lock. So any I_FREEING testing is
> incomplete without the testing of I_CLEAR.
>
> Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> Jan Kara reminds fixing the other two cases. Thanks!

Is there a reason it's not done for __sync_single_inode as well?

Jeff Layton asked the question and I'm following it up :)

__sync_single_inode currently only tests I_FREEING, but I think we are
safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
I_SYNC to be cleared before it changes I_STATE.

On the other hand, testing I_CLEAR here probably would be safe anyway,
and it'd be bonus points for consistency?

Same basic question for generic_sync_sb_inodes, which has a
BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
as well?

Thanks,
-Eric

2009-06-02 08:56:09

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] skip I_CLEAR state inodes

On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> Wu Fengguang wrote:
> > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > add_dquot_ref().
> >
> > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > incomplete without the testing of I_CLEAR.
> >
> > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > Jan Kara reminds fixing the other two cases. Thanks!
>
> Is there a reason it's not done for __sync_single_inode as well?

It missed the glance because it don't have an obvious '|' in the line ;)

> Jeff Layton asked the question and I'm following it up :)
>
> __sync_single_inode currently only tests I_FREEING, but I think we are
> safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> I_SYNC to be cleared before it changes I_STATE.

But I_SYNC is removed just before the I_FREEING test, so we still have
a small race window?

> On the other hand, testing I_CLEAR here probably would be safe anyway,
> and it'd be bonus points for consistency?

So let's add the I_CLEAR test?

> Same basic question for generic_sync_sb_inodes, which has a
> BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> as well?

Yes, we can add I_CLEAR here to catch more error condition.

Thanks,
Fengguang

---
skip I_CLEAR state inodes in writeback routines

The I_FREEING test in __sync_single_inode() is racy because
clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
and the test of I_FREEING.

Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.

Reported-by: Jeff Layton <[email protected]>
Reported-by: Eric Sandeen <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
fs/fs-writeback.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

--- linux.orig/fs/fs-writeback.c
+++ linux/fs/fs-writeback.c
@@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
spin_lock(&inode_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_SYNC;
- if (!(inode->i_state & I_FREEING)) {
+ if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
if (!(inode->i_state & I_DIRTY) &&
mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
/*
@@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
if (current_is_pdflush() && !writeback_acquire(bdi))
break;

- BUG_ON(inode->i_state & I_FREEING);
+ BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
__iget(inode);
pages_skipped = wbc->pages_skipped;
__writeback_single_inode(inode, wbc);

2009-06-02 10:28:08

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] skip I_CLEAR state inodes

On Tue, 2 Jun 2009 16:55:23 +0800
Wu Fengguang <[email protected]> wrote:

> On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > Wu Fengguang wrote:
> > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > add_dquot_ref().
> > >
> > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > incomplete without the testing of I_CLEAR.
> > >
> > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > Jan Kara reminds fixing the other two cases. Thanks!
> >
> > Is there a reason it's not done for __sync_single_inode as well?
>
> It missed the glance because it don't have an obvious '|' in the line ;)
>
> > Jeff Layton asked the question and I'm following it up :)
> >
> > __sync_single_inode currently only tests I_FREEING, but I think we are
> > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > I_SYNC to be cleared before it changes I_STATE.
>
> But I_SYNC is removed just before the I_FREEING test, so we still have
> a small race window?
>

Yes, I think so. __sync_single_inode clears I_SYNC but doesn't wake up
the wait queue until the end of the function. So I think it's possible
(though unlikely) that another thread can race in and change the state
to I_CLEAR before the I_FREEING check.

> > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > and it'd be bonus points for consistency?
>
> So let's add the I_CLEAR test?
>
> > Same basic question for generic_sync_sb_inodes, which has a
> > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > as well?
>
> Yes, we can add I_CLEAR here to catch more error condition.
>
> Thanks,
> Fengguang
>
> ---
> skip I_CLEAR state inodes in writeback routines
>
> The I_FREEING test in __sync_single_inode() is racy because
> clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> and the test of I_FREEING.
>
> Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
>
> Reported-by: Jeff Layton <[email protected]>
> Reported-by: Eric Sandeen <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>
> ---
> fs/fs-writeback.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> --- linux.orig/fs/fs-writeback.c
> +++ linux/fs/fs-writeback.c
> @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> spin_lock(&inode_lock);
> WARN_ON(inode->i_state & I_NEW);
> inode->i_state &= ~I_SYNC;
> - if (!(inode->i_state & I_FREEING)) {
> + if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> if (!(inode->i_state & I_DIRTY) &&
> mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> /*
> @@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
> if (current_is_pdflush() && !writeback_acquire(bdi))
> break;
>
> - BUG_ON(inode->i_state & I_FREEING);
> + BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
> __iget(inode);
> pages_skipped = wbc->pages_skipped;
> __writeback_single_inode(inode, wbc);

Acked-by: Jeff Layton <[email protected]>

2009-06-02 11:37:47

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] skip I_CLEAR state inodes

On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > Wu Fengguang wrote:
> > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > add_dquot_ref().
> > >
> > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > incomplete without the testing of I_CLEAR.
> > >
> > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > Jan Kara reminds fixing the other two cases. Thanks!
> >
> > Is there a reason it's not done for __sync_single_inode as well?
>
> It missed the glance because it don't have an obvious '|' in the line ;)
>
> > Jeff Layton asked the question and I'm following it up :)
> >
> > __sync_single_inode currently only tests I_FREEING, but I think we are
> > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > I_SYNC to be cleared before it changes I_STATE.
>
> But I_SYNC is removed just before the I_FREEING test, so we still have
> a small race window?
>
> > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > and it'd be bonus points for consistency?
>
> So let's add the I_CLEAR test?
>
> > Same basic question for generic_sync_sb_inodes, which has a
> > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > as well?
>
> Yes, we can add I_CLEAR here to catch more error condition.
>
> Thanks,
> Fengguang
>
> ---
> skip I_CLEAR state inodes in writeback routines
>
> The I_FREEING test in __sync_single_inode() is racy because
> clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> and the test of I_FREEING.
>
> Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
>
> Reported-by: Jeff Layton <[email protected]>
> Reported-by: Eric Sandeen <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>
> ---
> fs/fs-writeback.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> --- linux.orig/fs/fs-writeback.c
> +++ linux/fs/fs-writeback.c
> @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> spin_lock(&inode_lock);
> WARN_ON(inode->i_state & I_NEW);
> inode->i_state &= ~I_SYNC;
> - if (!(inode->i_state & I_FREEING)) {
> + if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> if (!(inode->i_state & I_DIRTY) &&
> mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
Is the whole if needed? I had an impression that everyone calling
__sync_single_inode() should better take care it does not race with inode
freeing... So WARN_ON would be more appropriate IMHO.

> /*
> @@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
> if (current_is_pdflush() && !writeback_acquire(bdi))
> break;
>
> - BUG_ON(inode->i_state & I_FREEING);
> + BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
> __iget(inode);
> pages_skipped = wbc->pages_skipped;
> __writeback_single_inode(inode, wbc);
Looking at this code, it looks a bit suspicious. What prevents this s_io
list scan to race with inode freeing? In particular generic_forget_inode()
can drop inode_lock to write the inode and in the mean time
generic_sync_sb_inodes() can come, get a reference to the inode and start
it's writeback... Subsequent iput() would then call generic_forget_inode()
on the inode again. So shouldn't we skip I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW
inodes in this scan like we do for later in the function for another scan?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-06-02 21:49:46

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] skip I_CLEAR state inodes

Jan Kara wrote:
> On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
>> On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
>>> Wu Fengguang wrote:
>>>> Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
>>>> add_dquot_ref().
>>>>
>>>> clear_inode() will switch inode state from I_FREEING to I_CLEAR,
>>>> and do so _outside_ of inode_lock. So any I_FREEING testing is
>>>> incomplete without the testing of I_CLEAR.
>>>>
>>>> Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
>>>> Jan Kara reminds fixing the other two cases. Thanks!
>>> Is there a reason it's not done for __sync_single_inode as well?
>> It missed the glance because it don't have an obvious '|' in the line ;)
>>
>>> Jeff Layton asked the question and I'm following it up :)
>>>
>>> __sync_single_inode currently only tests I_FREEING, but I think we are
>>> safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
>>> I_SYNC to be cleared before it changes I_STATE.
>> But I_SYNC is removed just before the I_FREEING test, so we still have
>> a small race window?

yep that's right.

inode->i_state &= ~I_SYNC;
>>> clear_inode->inode_sync_wait here and find it clear <<<
if (!(inode->i_state & I_FREEING)) {

...

>> --- linux.orig/fs/fs-writeback.c
>> +++ linux/fs/fs-writeback.c
>> @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
>> spin_lock(&inode_lock);
>> WARN_ON(inode->i_state & I_NEW);
>> inode->i_state &= ~I_SYNC;
>> - if (!(inode->i_state & I_FREEING)) {
>> + if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
>> if (!(inode->i_state & I_DIRTY) &&
>> mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {

> Is the whole if needed? I had an impression that everyone calling
> __sync_single_inode() should better take care it does not race with inode
> freeing... So WARN_ON would be more appropriate IMHO.

Maybe both then (both a WARN on and then the test (defensive here, I
guess)) because if we continue we may wander into a poisoned list
pointer and explode, right?

-Eric

2009-06-03 10:45:31

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] skip I_CLEAR state inodes

On Tue, 02 Jun 2009 16:48:57 -0500
Eric Sandeen <[email protected]> wrote:

> Jan Kara wrote:
> > On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> >> On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> >>> Wu Fengguang wrote:
> >>>> Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> >>>> add_dquot_ref().
> >>>>
> >>>> clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> >>>> and do so _outside_ of inode_lock. So any I_FREEING testing is
> >>>> incomplete without the testing of I_CLEAR.
> >>>>
> >>>> Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> >>>> Jan Kara reminds fixing the other two cases. Thanks!
> >>> Is there a reason it's not done for __sync_single_inode as well?
> >> It missed the glance because it don't have an obvious '|' in the line ;)
> >>
> >>> Jeff Layton asked the question and I'm following it up :)
> >>>
> >>> __sync_single_inode currently only tests I_FREEING, but I think we are
> >>> safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> >>> I_SYNC to be cleared before it changes I_STATE.
> >> But I_SYNC is removed just before the I_FREEING test, so we still have
> >> a small race window?
>
> yep that's right.
>
> inode->i_state &= ~I_SYNC;
> >>> clear_inode->inode_sync_wait here and find it clear <<<
> if (!(inode->i_state & I_FREEING)) {
>
> ...
>
> >> --- linux.orig/fs/fs-writeback.c
> >> +++ linux/fs/fs-writeback.c
> >> @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> >> spin_lock(&inode_lock);
> >> WARN_ON(inode->i_state & I_NEW);
> >> inode->i_state &= ~I_SYNC;
> >> - if (!(inode->i_state & I_FREEING)) {
> >> + if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> >> if (!(inode->i_state & I_DIRTY) &&
> >> mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>
> > Is the whole if needed? I had an impression that everyone calling
> > __sync_single_inode() should better take care it does not race with inode
> > freeing... So WARN_ON would be more appropriate IMHO.
>
> Maybe both then (both a WARN on and then the test (defensive here, I
> guess)) because if we continue we may wander into a poisoned list
> pointer and explode, right?
>

Right.

I think removing this check would be roughly equivalent to adding a
BUG_ON. It just wouldn't reliably pop since it would depend on the
timing of the race.

Keeping the check and adding a WARN seems like a reasonable thing to
do. Maybe after a few releases if no one has seen the WARN fire we can
look at removing the check (or maybe turn it into a BUG)?

--
Jeff Layton <[email protected]>

2009-06-03 13:32:39

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] skip I_CLEAR state inodes

[reply the easy part first]
On Tue, Jun 02, 2009 at 07:37:36PM +0800, Jan Kara wrote:
> On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> > On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > > Wu Fengguang wrote:
> > > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > > add_dquot_ref().
> > > >
> > > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > > incomplete without the testing of I_CLEAR.
> > > >
> > > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > > Jan Kara reminds fixing the other two cases. Thanks!
> > >
> > > Is there a reason it's not done for __sync_single_inode as well?
> >
> > It missed the glance because it don't have an obvious '|' in the line ;)
> >
> > > Jeff Layton asked the question and I'm following it up :)
> > >
> > > __sync_single_inode currently only tests I_FREEING, but I think we are
> > > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > > I_SYNC to be cleared before it changes I_STATE.
> >
> > But I_SYNC is removed just before the I_FREEING test, so we still have
> > a small race window?
> >
> > > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > > and it'd be bonus points for consistency?
> >
> > So let's add the I_CLEAR test?
> >
> > > Same basic question for generic_sync_sb_inodes, which has a
> > > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > > as well?
> >
> > Yes, we can add I_CLEAR here to catch more error condition.
> >
> > Thanks,
> > Fengguang
> >
> > ---
> > skip I_CLEAR state inodes in writeback routines
> >
> > The I_FREEING test in __sync_single_inode() is racy because
> > clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> > and the test of I_FREEING.
> >
> > Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
> >
> > Reported-by: Jeff Layton <[email protected]>
> > Reported-by: Eric Sandeen <[email protected]>
> > Signed-off-by: Wu Fengguang <[email protected]>
> > ---
> > fs/fs-writeback.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > --- linux.orig/fs/fs-writeback.c
> > +++ linux/fs/fs-writeback.c
> > @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> > spin_lock(&inode_lock);
> > WARN_ON(inode->i_state & I_NEW);
> > inode->i_state &= ~I_SYNC;
> > - if (!(inode->i_state & I_FREEING)) {
> > + if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> > if (!(inode->i_state & I_DIRTY) &&
> > mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> Is the whole if needed? I had an impression that everyone calling
> __sync_single_inode() should better take care it does not race with inode
> freeing... So WARN_ON would be more appropriate IMHO.

The caller doesn't matter here, because we temporarily dropped inode_lock
in __sync_single_inode() ?

2009-06-03 14:00:33

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] skip I_CLEAR state inodes

On Wed 03-06-09 21:32:02, Wu Fengguang wrote:
> [reply the easy part first]
> On Tue, Jun 02, 2009 at 07:37:36PM +0800, Jan Kara wrote:
> > On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> > > On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > > > Wu Fengguang wrote:
> > > > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > > > add_dquot_ref().
> > > > >
> > > > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > > > incomplete without the testing of I_CLEAR.
> > > > >
> > > > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > > > Jan Kara reminds fixing the other two cases. Thanks!
> > > >
> > > > Is there a reason it's not done for __sync_single_inode as well?
> > >
> > > It missed the glance because it don't have an obvious '|' in the line ;)
> > >
> > > > Jeff Layton asked the question and I'm following it up :)
> > > >
> > > > __sync_single_inode currently only tests I_FREEING, but I think we are
> > > > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > > > I_SYNC to be cleared before it changes I_STATE.
> > >
> > > But I_SYNC is removed just before the I_FREEING test, so we still have
> > > a small race window?
> > >
> > > > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > > > and it'd be bonus points for consistency?
> > >
> > > So let's add the I_CLEAR test?
> > >
> > > > Same basic question for generic_sync_sb_inodes, which has a
> > > > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > > > as well?
> > >
> > > Yes, we can add I_CLEAR here to catch more error condition.
> > >
> > > Thanks,
> > > Fengguang
> > >
> > > ---
> > > skip I_CLEAR state inodes in writeback routines
> > >
> > > The I_FREEING test in __sync_single_inode() is racy because
> > > clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> > > and the test of I_FREEING.
> > >
> > > Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
> > >
> > > Reported-by: Jeff Layton <[email protected]>
> > > Reported-by: Eric Sandeen <[email protected]>
> > > Signed-off-by: Wu Fengguang <[email protected]>
> > > ---
> > > fs/fs-writeback.c | 4 ++--
> > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > --- linux.orig/fs/fs-writeback.c
> > > +++ linux/fs/fs-writeback.c
> > > @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> > > spin_lock(&inode_lock);
> > > WARN_ON(inode->i_state & I_NEW);
> > > inode->i_state &= ~I_SYNC;
> > > - if (!(inode->i_state & I_FREEING)) {
> > > + if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> > > if (!(inode->i_state & I_DIRTY) &&
> > > mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> > Is the whole if needed? I had an impression that everyone calling
> > __sync_single_inode() should better take care it does not race with inode
> > freeing... So WARN_ON would be more appropriate IMHO.
>
> The caller doesn't matter here, because we temporarily dropped inode_lock
> in __sync_single_inode() ?
But the problem is basically what I described in the part you cut from
the email - if the caller of __sync_single_inode() does not have a
reference to the inode or some other means of protecting inode from being
released, then we have a problem because e.g. __writeback_single_inode()
can sleep and before I_SYNC is set, all the clear_inode() can finish and
we end up calling do_writepages() on invalidated mapping.
So in my opinion everybody calling __writeback_single_inode() and thus
__sync_single_inode() should do an equivalent of igrab() or be sure inode
cannot be freed e.g. because we have a file open.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-06-03 14:10:51

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] skip I_CLEAR state inodes

On Tue, Jun 02, 2009 at 07:37:36PM +0800, Jan Kara wrote:
> On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> > On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > > Wu Fengguang wrote:
> > > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > > add_dquot_ref().
> > > >
> > > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > > incomplete without the testing of I_CLEAR.
> > > >
> > > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > > Jan Kara reminds fixing the other two cases. Thanks!
> > >
> > > Is there a reason it's not done for __sync_single_inode as well?
> >
> > It missed the glance because it don't have an obvious '|' in the line ;)
> >
> > > Jeff Layton asked the question and I'm following it up :)
> > >
> > > __sync_single_inode currently only tests I_FREEING, but I think we are
> > > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > > I_SYNC to be cleared before it changes I_STATE.
> >
> > But I_SYNC is removed just before the I_FREEING test, so we still have
> > a small race window?
> >
> > > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > > and it'd be bonus points for consistency?
> >
> > So let's add the I_CLEAR test?
> >
> > > Same basic question for generic_sync_sb_inodes, which has a
> > > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > > as well?
> >
> > Yes, we can add I_CLEAR here to catch more error condition.
> >
> > Thanks,
> > Fengguang
> >
> > ---
> > skip I_CLEAR state inodes in writeback routines
> >
> > The I_FREEING test in __sync_single_inode() is racy because
> > clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> > and the test of I_FREEING.
> >
> > Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
> >
> > Reported-by: Jeff Layton <[email protected]>
> > Reported-by: Eric Sandeen <[email protected]>
> > Signed-off-by: Wu Fengguang <[email protected]>
> > ---
> > fs/fs-writeback.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > --- linux.orig/fs/fs-writeback.c
> > +++ linux/fs/fs-writeback.c
> > @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> > spin_lock(&inode_lock);
> > WARN_ON(inode->i_state & I_NEW);
> > inode->i_state &= ~I_SYNC;
> > - if (!(inode->i_state & I_FREEING)) {
> > + if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> > if (!(inode->i_state & I_DIRTY) &&
> > mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> Is the whole if needed? I had an impression that everyone calling
> __sync_single_inode() should better take care it does not race with inode
> freeing... So WARN_ON would be more appropriate IMHO.
>
> > /*
> > @@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
> > if (current_is_pdflush() && !writeback_acquire(bdi))
> > break;
> >
> > - BUG_ON(inode->i_state & I_FREEING);
> > + BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
> > __iget(inode);
> > pages_skipped = wbc->pages_skipped;
> > __writeback_single_inode(inode, wbc);
> Looking at this code, it looks a bit suspicious. What prevents this s_io
> list scan to race with inode freeing? In particular generic_forget_inode()

Good catch.

> can drop inode_lock to write the inode and in the mean time
> generic_sync_sb_inodes() can come, get a reference to the inode and start
> it's writeback... Subsequent iput() would then call generic_forget_inode()

Another possibility:

generic_forget_inode
inode->i_state |= I_WILL_FREE;
spin_unlock(&inode_lock);
generic_sync_sb_inodes()
spin_lock(&inode_lock);
__iget(inode);
__writeback_single_inode
// see non zero i_count
WARN_ON(inode->i_state & I_WILL_FREE);

I'm wondering why didn't we saw reports on the last WARN_ON()?
Did we missed something?

> on the inode again. So shouldn't we skip I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW
> inodes in this scan like we do for later in the function for another scan?

2009-06-03 14:16:45

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] skip I_CLEAR state inodes

On Wed 03-06-09 22:10:21, Wu Fengguang wrote:
> On Tue, Jun 02, 2009 at 07:37:36PM +0800, Jan Kara wrote:
> > On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> > > On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > > > Wu Fengguang wrote:
> > > > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > > > add_dquot_ref().
> > > > >
> > > > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > > > incomplete without the testing of I_CLEAR.
> > > > >
> > > > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > > > Jan Kara reminds fixing the other two cases. Thanks!
> > > >
> > > > Is there a reason it's not done for __sync_single_inode as well?
> > >
> > > It missed the glance because it don't have an obvious '|' in the line ;)
> > >
> > > > Jeff Layton asked the question and I'm following it up :)
> > > >
> > > > __sync_single_inode currently only tests I_FREEING, but I think we are
> > > > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > > > I_SYNC to be cleared before it changes I_STATE.
> > >
> > > But I_SYNC is removed just before the I_FREEING test, so we still have
> > > a small race window?
> > >
> > > > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > > > and it'd be bonus points for consistency?
> > >
> > > So let's add the I_CLEAR test?
> > >
> > > > Same basic question for generic_sync_sb_inodes, which has a
> > > > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > > > as well?
> > >
> > > Yes, we can add I_CLEAR here to catch more error condition.
> > >
> > > Thanks,
> > > Fengguang
> > >
> > > ---
> > > skip I_CLEAR state inodes in writeback routines
> > >
> > > The I_FREEING test in __sync_single_inode() is racy because
> > > clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> > > and the test of I_FREEING.
> > >
> > > Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
> > >
> > > Reported-by: Jeff Layton <[email protected]>
> > > Reported-by: Eric Sandeen <[email protected]>
> > > Signed-off-by: Wu Fengguang <[email protected]>
> > > ---
> > > fs/fs-writeback.c | 4 ++--
> > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > --- linux.orig/fs/fs-writeback.c
> > > +++ linux/fs/fs-writeback.c
> > > @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> > > spin_lock(&inode_lock);
> > > WARN_ON(inode->i_state & I_NEW);
> > > inode->i_state &= ~I_SYNC;
> > > - if (!(inode->i_state & I_FREEING)) {
> > > + if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> > > if (!(inode->i_state & I_DIRTY) &&
> > > mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> > Is the whole if needed? I had an impression that everyone calling
> > __sync_single_inode() should better take care it does not race with inode
> > freeing... So WARN_ON would be more appropriate IMHO.
> >
> > > /*
> > > @@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
> > > if (current_is_pdflush() && !writeback_acquire(bdi))
> > > break;
> > >
> > > - BUG_ON(inode->i_state & I_FREEING);
> > > + BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
> > > __iget(inode);
> > > pages_skipped = wbc->pages_skipped;
> > > __writeback_single_inode(inode, wbc);
> > Looking at this code, it looks a bit suspicious. What prevents this s_io
> > list scan to race with inode freeing? In particular generic_forget_inode()
>
> Good catch.
>
> > can drop inode_lock to write the inode and in the mean time
> > generic_sync_sb_inodes() can come, get a reference to the inode and start
> > it's writeback... Subsequent iput() would then call generic_forget_inode()
>
> Another possibility:
>
> generic_forget_inode
> inode->i_state |= I_WILL_FREE;
> spin_unlock(&inode_lock);
> generic_sync_sb_inodes()
> spin_lock(&inode_lock);
> __iget(inode);
> __writeback_single_inode
> // see non zero i_count
> WARN_ON(inode->i_state & I_WILL_FREE);
>
> I'm wondering why didn't we saw reports on the last WARN_ON()?
> Did we missed something?
I meant the above race in my description ;-). Anyway, the race can happen
only if we are unmounting the filesystem (normally, we bail out on
sb->s_flags & MS_ACTIVE check - yes, it's a bit hidden and it also took me
a while to understand why we weren't seeing tons of warnings...).

> > on the inode again. So shouldn't we skip I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW
> > inodes in this scan like we do for later in the function for another scan?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-06-03 14:47:56

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] skip I_CLEAR state inodes

On Wed, Jun 03, 2009 at 10:16:36PM +0800, Jan Kara wrote:
> On Wed 03-06-09 22:10:21, Wu Fengguang wrote:
> > On Tue, Jun 02, 2009 at 07:37:36PM +0800, Jan Kara wrote:
> > > On Tue 02-06-09 16:55:23, Wu Fengguang wrote:
> > > > On Tue, Jun 02, 2009 at 05:38:35AM +0800, Eric Sandeen wrote:
> > > > > Wu Fengguang wrote:
> > > > > > Add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
> > > > > > add_dquot_ref().
> > > > > >
> > > > > > clear_inode() will switch inode state from I_FREEING to I_CLEAR,
> > > > > > and do so _outside_ of inode_lock. So any I_FREEING testing is
> > > > > > incomplete without the testing of I_CLEAR.
> > > > > >
> > > > > > Masayoshi MIZUMA first discovered the bug in drop_pagecache_sb() and
> > > > > > Jan Kara reminds fixing the other two cases. Thanks!
> > > > >
> > > > > Is there a reason it's not done for __sync_single_inode as well?
> > > >
> > > > It missed the glance because it don't have an obvious '|' in the line ;)
> > > >
> > > > > Jeff Layton asked the question and I'm following it up :)
> > > > >
> > > > > __sync_single_inode currently only tests I_FREEING, but I think we are
> > > > > safe because __sync_single_inode sets I_SYNC, and clear_inode waits for
> > > > > I_SYNC to be cleared before it changes I_STATE.
> > > >
> > > > But I_SYNC is removed just before the I_FREEING test, so we still have
> > > > a small race window?
> > > >
> > > > > On the other hand, testing I_CLEAR here probably would be safe anyway,
> > > > > and it'd be bonus points for consistency?
> > > >
> > > > So let's add the I_CLEAR test?
> > > >
> > > > > Same basic question for generic_sync_sb_inodes, which has a
> > > > > BUG_ON(inode->i_state & I_FREEING), seems like this could check I_CLWAR
> > > > > as well?
> > > >
> > > > Yes, we can add I_CLEAR here to catch more error condition.
> > > >
> > > > Thanks,
> > > > Fengguang
> > > >
> > > > ---
> > > > skip I_CLEAR state inodes in writeback routines
> > > >
> > > > The I_FREEING test in __sync_single_inode() is racy because
> > > > clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC
> > > > and the test of I_FREEING.
> > > >
> > > > Also extend the coverage of BUG_ON(I_FREEING) to I_CLEAR.
> > > >
> > > > Reported-by: Jeff Layton <[email protected]>
> > > > Reported-by: Eric Sandeen <[email protected]>
> > > > Signed-off-by: Wu Fengguang <[email protected]>
> > > > ---
> > > > fs/fs-writeback.c | 4 ++--
> > > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > --- linux.orig/fs/fs-writeback.c
> > > > +++ linux/fs/fs-writeback.c
> > > > @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> > > > spin_lock(&inode_lock);
> > > > WARN_ON(inode->i_state & I_NEW);
> > > > inode->i_state &= ~I_SYNC;
> > > > - if (!(inode->i_state & I_FREEING)) {
> > > > + if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> > > > if (!(inode->i_state & I_DIRTY) &&
> > > > mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> > > Is the whole if needed? I had an impression that everyone calling
> > > __sync_single_inode() should better take care it does not race with inode
> > > freeing... So WARN_ON would be more appropriate IMHO.
> > >
> > > > /*
> > > > @@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
> > > > if (current_is_pdflush() && !writeback_acquire(bdi))
> > > > break;
> > > >
> > > > - BUG_ON(inode->i_state & I_FREEING);
> > > > + BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
> > > > __iget(inode);
> > > > pages_skipped = wbc->pages_skipped;
> > > > __writeback_single_inode(inode, wbc);
> > > Looking at this code, it looks a bit suspicious. What prevents this s_io
> > > list scan to race with inode freeing? In particular generic_forget_inode()
> >
> > Good catch.
> >
> > > can drop inode_lock to write the inode and in the mean time
> > > generic_sync_sb_inodes() can come, get a reference to the inode and start
> > > it's writeback... Subsequent iput() would then call generic_forget_inode()
> >
> > Another possibility:
> >
> > generic_forget_inode
> > inode->i_state |= I_WILL_FREE;
> > spin_unlock(&inode_lock);
> > generic_sync_sb_inodes()
> > spin_lock(&inode_lock);
> > __iget(inode);
> > __writeback_single_inode
> > // see non zero i_count
> > WARN_ON(inode->i_state & I_WILL_FREE);
> >
> > I'm wondering why didn't we saw reports on the last WARN_ON()?
> > Did we missed something?
> I meant the above race in my description ;-). Anyway, the race can happen
> only if we are unmounting the filesystem (normally, we bail out on
> sb->s_flags & MS_ACTIVE check - yes, it's a bit hidden and it also took me
> a while to understand why we weren't seeing tons of warnings...).

Ah OK. Just checked that all three callers of generic_sync_sb_inodes():
- writeback_inodes(): umount prevented
- pohmelfs_kill_super(): just before umount
- ubifs calls: too complex to be obvious..
At least the first two cases are safe, so we didn't see the error report ;)

> > > on the inode again. So shouldn't we skip I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW
> > > inodes in this scan like we do for later in the function for another scan?

Yes we should do this at least for safety. I_WILL_FREE means
generic_forget_inode() is going to writeback the inode on its own, so
generic_sync_sb_inodes() would better not to wade in.

Thanks,
Fengguang

2009-06-06 03:07:49

by Fengguang Wu

[permalink] [raw]
Subject: [PATCH] writeback: skip new or to-be-freed inodes

1) I_FREEING tests should be coupled with I_CLEAR

The two I_FREEING tests are racy because clear_inode() can set i_state to
I_CLEAR between the clear of I_SYNC and the test of I_FREEING.

2) skip I_WILL_FREE inodes in generic_sync_sb_inodes() to avoid possible
races with generic_forget_inode()

generic_forget_inode() sets I_WILL_FREE call writeback on its own, so
generic_sync_sb_inodes() shall not try to step in and create possible races:

generic_forget_inode
inode->i_state |= I_WILL_FREE;
spin_unlock(&inode_lock);
generic_sync_sb_inodes()
spin_lock(&inode_lock);
__iget(inode);
__writeback_single_inode
// see non zero i_count
may WARN here ==> WARN_ON(inode->i_state & I_WILL_FREE);
spin_unlock(&inode_lock);
may call generic_forget_inode again ==> iput(inode);

The above race and warning didn't turn up because writeback_inodes() holds
the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
early. But we are not sure the UBIFS calls and future callers will guarantee
that. So skip I_WILL_FREE inodes for the sake of safety.

CC: Jan Kara <[email protected]>
CC: Eric Sandeen <[email protected]>
Acked-by: Jeff Layton <[email protected]>
CC: Masayoshi MIZUMA <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
fs/fs-writeback.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

--- linux.orig/fs/fs-writeback.c
+++ linux/fs/fs-writeback.c
@@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
spin_lock(&inode_lock);
WARN_ON(inode->i_state & I_NEW);
inode->i_state &= ~I_SYNC;
- if (!(inode->i_state & I_FREEING)) {
+ if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
if (!(inode->i_state & I_DIRTY) &&
mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
/*
@@ -487,7 +487,7 @@ void generic_sync_sb_inodes(struct super
break;
}

- if (inode->i_state & I_NEW) {
+ if (inode->i_state & (I_NEW | I_WILL_FREE)) {
requeue_io(inode);
continue;
}
@@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
if (current_is_pdflush() && !writeback_acquire(bdi))
break;

- BUG_ON(inode->i_state & I_FREEING);
+ BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
__iget(inode);
pages_skipped = wbc->pages_skipped;
__writeback_single_inode(inode, wbc);

2009-06-08 07:04:19

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH] writeback: skip new or to-be-freed inodes

Wu Fengguang wrote:
> The above race and warning didn't turn up because writeback_inodes() holds
> the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
> early. But we are not sure the UBIFS calls and future callers will guarantee
> that. So skip I_WILL_FREE inodes for the sake of safety.

The inode states are a bit vague for me, but vs. UBIFS - feel
free to ask questions.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-06-08 09:29:43

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] writeback: skip new or to-be-freed inodes

Hi Artem,

On Mon, Jun 08, 2009 at 03:03:10PM +0800, Artem Bityutskiy wrote:
> Wu Fengguang wrote:
> > The above race and warning didn't turn up because writeback_inodes() holds
> > the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
> > early. But we are not sure the UBIFS calls and future callers will guarantee
> > that. So skip I_WILL_FREE inodes for the sake of safety.
>
> The inode states are a bit vague for me, but vs. UBIFS - feel
> free to ask questions.

Thank you. Basically I'm not sure if UBIFS guarantees it won't be
unmounted (hence the MS_ACTIVE bit is on) when calling
generic_sync_sb_inodes() in shrink_liability() and ubifs_sync_fs().

Thanks,
Fengguang

PS: our previous discussions

> > Another possibility:
> >
> > generic_forget_inode
> > inode->i_state |= I_WILL_FREE;
> > spin_unlock(&inode_lock);
> > generic_sync_sb_inodes()
> > spin_lock(&inode_lock);
> > __iget(inode);
> > __writeback_single_inode
> > // see non zero i_count
> > WARN_ON(inode->i_state & I_WILL_FREE);
> >
> > I'm wondering why didn't we saw reports on the last WARN_ON()?
> > Did we missed something?
> I meant the above race in my description ;-). Anyway, the race can happen
> only if we are unmounting the filesystem (normally, we bail out on
> sb->s_flags & MS_ACTIVE check - yes, it's a bit hidden and it also took me
> a while to understand why we weren't seeing tons of warnings...).

Ah OK. Just checked that all three callers of generic_sync_sb_inodes():
- writeback_inodes(): umount prevented
- pohmelfs_kill_super(): just before umount
- ubifs calls: too complex to be obvious..
At least the first two cases are safe, so we didn't see the error report ;)

2009-06-08 10:46:00

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] writeback: skip new or to-be-freed inodes

On Mon, Jun 08, 2009 at 05:29:30PM +0800, Wu Fengguang wrote:
> Thank you. Basically I'm not sure if UBIFS guarantees it won't be
> unmounted (hence the MS_ACTIVE bit is on) when calling
> generic_sync_sb_inodes() in shrink_liability() and ubifs_sync_fs().

Btw, the call in ubifs_sync_fs should be superflous in the
vfs-2.6#for-next tree. We now do make sure that all inodes are flushed
before calling ->sync_fs with the wait parameter.

shrink_liability is a more interesting case, I don't understand enough
of ubifs to comment on it.

2009-06-08 17:07:35

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] writeback: skip new or to-be-freed inodes

On Sat 06-06-09 11:07:25, Wu Fengguang wrote:
> 1) I_FREEING tests should be coupled with I_CLEAR
>
> The two I_FREEING tests are racy because clear_inode() can set i_state to
> I_CLEAR between the clear of I_SYNC and the test of I_FREEING.
>
> 2) skip I_WILL_FREE inodes in generic_sync_sb_inodes() to avoid possible
> races with generic_forget_inode()
>
> generic_forget_inode() sets I_WILL_FREE call writeback on its own, so
> generic_sync_sb_inodes() shall not try to step in and create possible races:
>
> generic_forget_inode
> inode->i_state |= I_WILL_FREE;
> spin_unlock(&inode_lock);
> generic_sync_sb_inodes()
> spin_lock(&inode_lock);
> __iget(inode);
> __writeback_single_inode
> // see non zero i_count
> may WARN here ==> WARN_ON(inode->i_state & I_WILL_FREE);
> spin_unlock(&inode_lock);
> may call generic_forget_inode again ==> iput(inode);
>
> The above race and warning didn't turn up because writeback_inodes() holds
> the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
> early. But we are not sure the UBIFS calls and future callers will guarantee
> that. So skip I_WILL_FREE inodes for the sake of safety.
OK, the patch looks fine. Acked-by: Jan Kara <[email protected]>

Honza
> CC: Jan Kara <[email protected]>
> CC: Eric Sandeen <[email protected]>
> Acked-by: Jeff Layton <[email protected]>
> CC: Masayoshi MIZUMA <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>
> ---
> fs/fs-writeback.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> --- linux.orig/fs/fs-writeback.c
> +++ linux/fs/fs-writeback.c
> @@ -316,7 +316,7 @@ __sync_single_inode(struct inode *inode,
> spin_lock(&inode_lock);
> WARN_ON(inode->i_state & I_NEW);
> inode->i_state &= ~I_SYNC;
> - if (!(inode->i_state & I_FREEING)) {
> + if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> if (!(inode->i_state & I_DIRTY) &&
> mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> /*
> @@ -487,7 +487,7 @@ void generic_sync_sb_inodes(struct super
> break;
> }
>
> - if (inode->i_state & I_NEW) {
> + if (inode->i_state & (I_NEW | I_WILL_FREE)) {
> requeue_io(inode);
> continue;
> }
> @@ -518,7 +518,7 @@ void generic_sync_sb_inodes(struct super
> if (current_is_pdflush() && !writeback_acquire(bdi))
> break;
>
> - BUG_ON(inode->i_state & I_FREEING);
> + BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
> __iget(inode);
> pages_skipped = wbc->pages_skipped;
> __writeback_single_inode(inode, wbc);
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-06-09 07:05:07

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH] writeback: skip new or to-be-freed inodes

Wu Fengguang wrote:
> Hi Artem,
>
> On Mon, Jun 08, 2009 at 03:03:10PM +0800, Artem Bityutskiy wrote:
>> Wu Fengguang wrote:
>>> The above race and warning didn't turn up because writeback_inodes() holds
>>> the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
>>> early. But we are not sure the UBIFS calls and future callers will guarantee
>>> that. So skip I_WILL_FREE inodes for the sake of safety.
>> The inode states are a bit vague for me, but vs. UBIFS - feel
>> free to ask questions.
>
> Thank you. Basically I'm not sure if UBIFS guarantees it won't be
> unmounted (hence the MS_ACTIVE bit is on) when calling
> generic_sync_sb_inodes() in shrink_liability() and ubifs_sync_fs().

To be frank my VFS knowledge is not good enough to give you a
good answer. MS_ACTIVE seems to be set by file-systems when
they are mounted, and cleaned by VFS when unmounting.
I guess MS_ACTIVE is used by FS-es to check whether unmounting
is in progress or not. Anyway, UBIFS does not use it.

The generic_sync_sb_inodes() is called only from within
VFS operations, e.g., from ->create, ->rename, ->mknod,
->write_begin, ->setattr, etc. I mean, it is called from
UBIFS implementations of the above calls. UBIFS never calls
generic_sync_sb_inodes() function by itself, e.g., from
the UBIFS background thread.

Also, all calls to generic_sync_sb_inodes() are from within
VFS operations which need writing, e.g., VFS called
'mnt_want_write()' for all of them.

I think VFS must protect the FS from being unmunted while
it is in the middle of an operation, right? I'm just not
sure how this mechanism works. This would mean that yes,
we cannot be unmounted while we are in
generic_sync_sb_inodes() called by UBIFS.

Did this help :-) ?

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-06-09 07:25:56

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH] writeback: skip new or to-be-freed inodes

[CCed Adiran Hunter]

Christoph Hellwig wrote:
> On Mon, Jun 08, 2009 at 05:29:30PM +0800, Wu Fengguang wrote:
>> Thank you. Basically I'm not sure if UBIFS guarantees it won't be
>> unmounted (hence the MS_ACTIVE bit is on) when calling
>> generic_sync_sb_inodes() in shrink_liability() and ubifs_sync_fs().
>
> Btw, the call in ubifs_sync_fs should be superflous in the
> vfs-2.6#for-next tree. We now do make sure that all inodes are flushed
> before calling ->sync_fs with the wait parameter.

OK, thanks for letting know. Once this is merged upstream,
I'll amend UBIFS.

> shrink_liability is a more interesting case, I don't understand enough
> of ubifs to comment on it.

Well... I'm not sure if I can tell why we need this in few
words. But I'll try.

UBIFS supports on-the-flight compression. This, and other factors
lead to a situation when UBIFS does not know how much the dirty
data in the page/inode caches will take on the flash. Indeed,
how do we know how well the data will be compressed?

UBIFS has so called "budgeting" subsystem. This subsystem is
responsible for accounting flash space. If user writes a new
data page, which goes to the page cache and sits there, the
budgeting sub-system decrements the free space counters by
4KiB. And so on.

At some point the free space counters in the budgeting subsystem
reache zero, which means we do not have more space. However,
in 99% of the cases this is not true, because the budgeting
subsystem's calculations are _very_ pessimistic, they always
assume the worst case scenario like the data are uncompressible.

So consider a situation when user writes a new data page. First
of all, the ->write_begin function will call a budgeting
sub-system function in order to reserve flash space for this
new data page.

The budgeting subsystem will see that space counters are zero.
And what it will do it will call the 'shrink_liability()'
function, which, among other things, may call the
'generic_sync_sb_inodes()' function, which will force write-back,
and this will give us some space. Indeed, when we actually
write the data back, we'll see how much flash space they
really take. And in 99% of cases they will take less than
we budgeted for, usually much less.

This is the rough idea. In practice things are more complex,
and there are factors like inability to know how much of dirty
space may be reclaimed, what will be the index size after
commit, etc. This all makes the budgeting subsystem complex
and difficult to understand. Moreover, we still consider it
as a work in progress, because we use too rough calculations,
and there are too many heuristics.

Here you may read some more information about UBIFS flash
accounting issues:
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_spaceacc

You may also find a lot of info here:
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_documentation

Especially in this document:
http://www.linux-mtd.infradead.org/doc/ubifs_whitepaper.pdf
but it is not easy reading. You may search for "budget"
in the doc.

HTH.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)