2005-10-18 08:27:18

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set

Hello,

@@ -183,6 +183,7 @@ __sync_single_inode(struct inode *inode,
list_move(&inode->i_list, &inode_in_use);
} else {
list_move(&inode->i_list, &inode_unused);
+ inodes_stat.nr_unused++;
}
}
wake_up_inode(inode);

Are you sure the above diff is correct? It was added somewhere between
2.6.5 and 2.6.8. I think it's wrong.

The only way I can imagine the i_count to be zero in the above path, is
that I_WILL_FREE is set. And if I_WILL_FREE is set, then we must not
increase nr_unused. So I believe the above change is buggy and it will
definitely overstate the number of unused inodes and it should be backed
out.

Note that __writeback_single_inode before calling __sync_single_inode, can
drop the spinlock and we can have both the dirty and locked bitflags
clear here:

spin_unlock(&inode_lock);
__wait_on_inode(inode);
iput(inode);
XXXXXXX
spin_lock(&inode_lock);
}
use inode again here

a construct like the above makes zero sense from a reference counting
standpoint.

Either we don't ever use the inode again after the iput, or the
inode_lock should be taken _before_ executing the iput (i.e. a __iput
would be required). Taking the inode_lock after iput means the iget was
useless if we keep using the inode after the iput.

So the only chance the 2.6 was safe to call __writeback_single_inode
with the i_count == 0, is that I_WILL_FREE is set (I_WILL_FREE will
prevent the VM to free the inode in XXXXX).

Potentially calling the above iput with I_WILL_FREE was also wrong
because it would recurse in iput_final (the second mainline bug).

The below (untested) patch fixes the nr_unused accounting, avoids
recursing in iput when I_WILL_FREE is set and makes sure (with the
BUG_ON) that we don't corrupt memory and that all holders that don't set
I_WILL_FREE, keeps a reference on the inode!

Comments welcome, thanks.

Signed-off-by: Andrea Arcangeli <[email protected]>

fs-writeback.c | 5 ++---
1 files changed, 2 insertions(+), 3 deletions(-)

Index: linux-2.6/fs/fs-writeback.c
--- linux-2.6/fs/fs-writeback.c.~1~ 2005-07-28 17:08:53.000000000 +0200
+++ linux-2.6/fs/fs-writeback.c 2005-10-17 15:43:53.000000000 +0200
@@ -230,7 +230,6 @@ __sync_single_inode(struct inode *inode,
* The inode is clean, unused
*/
list_move(&inode->i_list, &inode_unused);
- inodes_stat.nr_unused++;
}
}
wake_up_inode(inode);
@@ -246,6 +245,8 @@ __writeback_single_inode(struct inode *i
{
wait_queue_head_t *wqh;

+ BUG_ON(!atomic_read(&inode->i_count) ^ !!(inode->i_state & I_WILL_FREE));
+
if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_LOCK)) {
list_move(&inode->i_list, &inode->i_sb->s_dirty);
return 0;
@@ -259,11 +260,9 @@ __writeback_single_inode(struct inode *i

wqh = bit_waitqueue(&inode->i_state, __I_LOCK);
do {
- __iget(inode);
spin_unlock(&inode_lock);
__wait_on_bit(wqh, &wq, inode_wait,
TASK_UNINTERRUPTIBLE);
- iput(inode);
spin_lock(&inode_lock);
} while (inode->i_state & I_LOCK);
}


2005-10-19 00:13:27

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set

Andrea Arcangeli <[email protected]> wrote:
>
> Hello,
>
> @@ -183,6 +183,7 @@ __sync_single_inode(struct inode *inode,
> list_move(&inode->i_list, &inode_in_use);
> } else {
> list_move(&inode->i_list, &inode_unused);
> + inodes_stat.nr_unused++;
> }
> }
> wake_up_inode(inode);
>
> Are you sure the above diff is correct? It was added somewhere between
> 2.6.5 and 2.6.8. I think it's wrong.
>
> The only way I can imagine the i_count to be zero in the above path, is
> that I_WILL_FREE is set. And if I_WILL_FREE is set, then we must not
> increase nr_unused. So I believe the above change is buggy and it will
> definitely overstate the number of unused inodes and it should be backed
> out.

Well according to my assertion (below), the inode in __sync_single_inode()
cannot have a zero refcount, so the whole if() statement is never executed.

The thinking behind that increment is that __sync_single_inode() has just
taken a dirty, zero-refcount inode and has cleaned it. A dirty inode
cannot have previously been on inode_unused, hence we now are newly moving
it to inode_unused.

I'll stick a WARN_ON in there for now, wait and see if anyone can hit it.

> Note that __writeback_single_inode before calling __sync_single_inode, can
> drop the spinlock and we can have both the dirty and locked bitflags
> clear here:
>
> spin_unlock(&inode_lock);
> __wait_on_inode(inode);
> iput(inode);
> XXXXXXX
> spin_lock(&inode_lock);
> }
> use inode again here
>
> a construct like the above makes zero sense from a reference counting
> standpoint.
>
> Either we don't ever use the inode again after the iput, or the
> inode_lock should be taken _before_ executing the iput (i.e. a __iput
> would be required). Taking the inode_lock after iput means the iget was
> useless if we keep using the inode after the iput.
>
> So the only chance the 2.6 was safe to call __writeback_single_inode
> with the i_count == 0, is that I_WILL_FREE is set (I_WILL_FREE will
> prevent the VM to free the inode in XXXXX).
>
> Potentially calling the above iput with I_WILL_FREE was also wrong
> because it would recurse in iput_final (the second mainline bug).
>
> The below (untested) patch fixes the nr_unused accounting, avoids
> recursing in iput when I_WILL_FREE is set and makes sure (with the
> BUG_ON) that we don't corrupt memory and that all holders that don't set
> I_WILL_FREE, keeps a reference on the inode!
>

That's something which Bill snuck in there during some waitqueue rework.

I agree that the iget/iput is unneeded: all callers to
__writeback_single_inode() already have a ref on the inode: either via
sync_sb_inodes()'s iget() or via syscall(fd, ...).

So the BUG_ON() in __writeback_single_inode() becomes
BUG_ON(!atomic_read(&inode->i_count)); - I'll make it WARN_ON for now coz
I'm not so sure any more ;)

So... With updated comments:


diff -puN fs/fs-writeback.c~fix-nr_unused-accounting-and-avoid-recursing-in-iput-with-i_will_free-set fs/fs-writeback.c
--- 25/fs/fs-writeback.c~fix-nr_unused-accounting-and-avoid-recursing-in-iput-with-i_will_free-set Tue Oct 18 17:07:49 2005
+++ 25-akpm/fs/fs-writeback.c Tue Oct 18 17:12:53 2005
@@ -229,8 +229,8 @@ __sync_single_inode(struct inode *inode,
/*
* The inode is clean, unused
*/
+ WARN_ON(1);
list_move(&inode->i_list, &inode_unused);
- inodes_stat.nr_unused++;
}
}
wake_up_inode(inode);
@@ -238,14 +238,16 @@ __sync_single_inode(struct inode *inode,
}

/*
- * Write out an inode's dirty pages. Called under inode_lock.
+ * Write out an inode's dirty pages. Called under inode_lock. The caller has
+ * ref on the inode (either via __iget or via syscall against an fd).
*/
static int
-__writeback_single_inode(struct inode *inode,
- struct writeback_control *wbc)
+__writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
{
wait_queue_head_t *wqh;

+ WARN_ON(!atomic_read(&inode->i_count));
+
if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_LOCK)) {
list_move(&inode->i_list, &inode->i_sb->s_dirty);
return 0;
@@ -259,11 +261,9 @@ __writeback_single_inode(struct inode *i

wqh = bit_waitqueue(&inode->i_state, __I_LOCK);
do {
- __iget(inode);
spin_unlock(&inode_lock);
__wait_on_bit(wqh, &wq, inode_wait,
TASK_UNINTERRUPTIBLE);
- iput(inode);
spin_lock(&inode_lock);
} while (inode->i_state & I_LOCK);
}
@@ -541,14 +541,15 @@ void sync_inodes(int wait)
}

/**
- * write_inode_now - write an inode to disk
- * @inode: inode to write to disk
- * @sync: whether the write should be synchronous or not
+ * write_inode_now - write an inode to disk
+ * @inode: inode to write to disk
+ * @sync: whether the write should be synchronous or not
+ *
+ * This function commits an inode to disk immediately if it is dirty. This is
+ * primarily needed by knfsd.
*
- * This function commits an inode to disk immediately if it is
- * dirty. This is primarily needed by knfsd.
+ * The caller must have a ref on the inode.
*/
-
int write_inode_now(struct inode *inode, int sync)
{
int ret;
_

2005-10-19 00:40:28

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set

On Tue, Oct 18, 2005 at 05:13:35PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > Hello,
> >
> > @@ -183,6 +183,7 @@ __sync_single_inode(struct inode *inode,
> > list_move(&inode->i_list, &inode_in_use);
> > } else {
> > list_move(&inode->i_list, &inode_unused);
> > + inodes_stat.nr_unused++;
> > }
> > }
> > wake_up_inode(inode);
> >
> > Are you sure the above diff is correct? It was added somewhere between
> > 2.6.5 and 2.6.8. I think it's wrong.
> >
> > The only way I can imagine the i_count to be zero in the above path, is
> > that I_WILL_FREE is set. And if I_WILL_FREE is set, then we must not
> > increase nr_unused. So I believe the above change is buggy and it will
> > definitely overstate the number of unused inodes and it should be backed
> > out.
>
> Well according to my assertion (below), the inode in __sync_single_inode()
> cannot have a zero refcount, so the whole if() statement is never executed.

generic_forget_inode->write_inode_now->__writeback_single_inode->
__sync_single_inode

We do have I_WILL_FREE, but i_count will be zero.

>
> The thinking behind that increment is that __sync_single_inode() has just
> taken a dirty, zero-refcount inode and has cleaned it. A dirty inode
> cannot have previously been on inode_unused, hence we now are newly moving
> it to inode_unused.

nr_unused doesn't seem to count the number of inodes on the unused list.
It is actually counting the number of inodes whose i_count is 0. See
generic_forget_inode and invalidate_list to see what I mean.

generic_forget_inode took care of incrementing the unused count when
i_count went to zero. So, I don't think we need to worry about the
unused count in __writeback_single_inode.

-chris

2005-10-19 01:16:28

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set

Chris Mason <[email protected]> wrote:
>
> On Tue, Oct 18, 2005 at 05:13:35PM -0700, Andrew Morton wrote:
> > Andrea Arcangeli <[email protected]> wrote:
> > >
> > > Hello,
> > >
> > > @@ -183,6 +183,7 @@ __sync_single_inode(struct inode *inode,
> > > list_move(&inode->i_list, &inode_in_use);
> > > } else {
> > > list_move(&inode->i_list, &inode_unused);
> > > + inodes_stat.nr_unused++;
> > > }
> > > }
> > > wake_up_inode(inode);
> > >
> > > Are you sure the above diff is correct? It was added somewhere between
> > > 2.6.5 and 2.6.8. I think it's wrong.
> > >
> > > The only way I can imagine the i_count to be zero in the above path, is
> > > that I_WILL_FREE is set. And if I_WILL_FREE is set, then we must not
> > > increase nr_unused. So I believe the above change is buggy and it will
> > > definitely overstate the number of unused inodes and it should be backed
> > > out.
> >
> > Well according to my assertion (below), the inode in __sync_single_inode()
> > cannot have a zero refcount, so the whole if() statement is never executed.
>
> generic_forget_inode->write_inode_now->__writeback_single_inode->
> __sync_single_inode

oshit.

> We do have I_WILL_FREE, but i_count will be zero.

yup.

> >
> > The thinking behind that increment is that __sync_single_inode() has just
> > taken a dirty, zero-refcount inode and has cleaned it. A dirty inode
> > cannot have previously been on inode_unused, hence we now are newly moving
> > it to inode_unused.
>
> nr_unused doesn't seem to count the number of inodes on the unused list.
> It is actually counting the number of inodes whose i_count is 0. See
> generic_forget_inode and invalidate_list to see what I mean.

hm, OK. It'd be nice to make that more explicit. Something like this?

--- devel/fs/inode.c~generic_forget_inode-nr_unused-cleanup 2005-10-18 18:13:22.000000000 -0700
+++ devel-akpm/fs/inode.c 2005-10-18 18:13:57.000000000 -0700
@@ -1067,8 +1067,8 @@ static void generic_forget_inode(struct
if (!hlist_unhashed(&inode->i_hash)) {
if (!(inode->i_state & (I_DIRTY|I_LOCK)))
list_move(&inode->i_list, &inode_unused);
- inodes_stat.nr_unused++;
if (!sb || (sb->s_flags & MS_ACTIVE)) {
+ inodes_stat.nr_unused++; /* One more 0-ref inode */
spin_unlock(&inode_lock);
return;
}
@@ -1077,7 +1077,6 @@ static void generic_forget_inode(struct
write_inode_now(inode, 1);
spin_lock(&inode_lock);
inode->i_state &= ~I_WILL_FREE;
- inodes_stat.nr_unused--;
hlist_del_init(&inode->i_hash);
}
list_del_init(&inode->i_list);
_

> generic_forget_inode took care of incrementing the unused count when
> i_count went to zero. So, I don't think we need to worry about the
> unused count in __writeback_single_inode.
>

How about this for now?


diff -puN fs/fs-writeback.c~fix-nr_unused-accounting-and-avoid-recursing-in-iput-with-i_will_free-set fs/fs-writeback.c
--- devel/fs/fs-writeback.c~fix-nr_unused-accounting-and-avoid-recursing-in-iput-with-i_will_free-set 2005-10-18 18:02:51.000000000 -0700
+++ devel-akpm/fs/fs-writeback.c 2005-10-18 18:05:22.000000000 -0700
@@ -229,8 +229,8 @@ __sync_single_inode(struct inode *inode,
/*
* The inode is clean, unused
*/
+ WARN_ON(1);
list_move(&inode->i_list, &inode_unused);
- inodes_stat.nr_unused++;
}
}
wake_up_inode(inode);
@@ -238,14 +238,20 @@ __sync_single_inode(struct inode *inode,
}

/*
- * Write out an inode's dirty pages. Called under inode_lock.
+ * Write out an inode's dirty pages. Called under inode_lock. Either the
+ * caller has ref on the inode (either via __iget or via syscall against an fd)
+ * or the inode has I_WILL_FREE set (via generic_forget_inode)
*/
static int
-__writeback_single_inode(struct inode *inode,
- struct writeback_control *wbc)
+__writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
{
wait_queue_head_t *wqh;

+ if (!atomic_read(&inode->i_count))
+ WARN_ON(!(inode->i_flags & I_WILL_FREE));
+ else
+ WARN_ON(inode->i_flags & I_WILL_FREE);
+
if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_LOCK)) {
list_move(&inode->i_list, &inode->i_sb->s_dirty);
return 0;
@@ -259,11 +265,9 @@ __writeback_single_inode(struct inode *i

wqh = bit_waitqueue(&inode->i_state, __I_LOCK);
do {
- __iget(inode);
spin_unlock(&inode_lock);
__wait_on_bit(wqh, &wq, inode_wait,
TASK_UNINTERRUPTIBLE);
- iput(inode);
spin_lock(&inode_lock);
} while (inode->i_state & I_LOCK);
}
@@ -541,14 +545,15 @@ void sync_inodes(int wait)
}

/**
- * write_inode_now - write an inode to disk
- * @inode: inode to write to disk
- * @sync: whether the write should be synchronous or not
+ * write_inode_now - write an inode to disk
+ * @inode: inode to write to disk
+ * @sync: whether the write should be synchronous or not
+ *
+ * This function commits an inode to disk immediately if it is dirty. This is
+ * primarily needed by knfsd.
*
- * This function commits an inode to disk immediately if it is
- * dirty. This is primarily needed by knfsd.
+ * The caller must either have a ref on the inode or must have set I_WILL_FREE.
*/
-
int write_inode_now(struct inode *inode, int sync)
{
int ret;
_

2005-10-19 01:58:47

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set

On Tue, Oct 18, 2005 at 06:15:48PM -0700, Andrew Morton wrote:
> > > Well according to my assertion (below), the inode in __sync_single_inode()
> > > cannot have a zero refcount, so the whole if() statement is never executed.
> >
> > generic_forget_inode->write_inode_now->__writeback_single_inode->
> > __sync_single_inode
>
> oshit.

When does this ever happen? Just for private inodes released during
put_super right?

>
> > We do have I_WILL_FREE, but i_count will be zero.
>
> yup.
>
> > >
> > > The thinking behind that increment is that __sync_single_inode() has just
> > > taken a dirty, zero-refcount inode and has cleaned it. A dirty inode
> > > cannot have previously been on inode_unused, hence we now are newly moving
> > > it to inode_unused.
> >
> > nr_unused doesn't seem to count the number of inodes on the unused list.
> > It is actually counting the number of inodes whose i_count is 0. See
> > generic_forget_inode and invalidate_list to see what I mean.
>
> hm, OK. It'd be nice to make that more explicit. Something like this?

Well, I can't quite convince myself it is wrong, but when
(!sb || (sb->s_flags & MS_ACTIVE), we're dropping the
inode_lock with an inode with i_count == 0 and nr_unused hasn't been
incremented.

So, if someone (sync_sb_inodes?) comes in and runs __iget,
the counts end up wrong. Then again, whoever ran __iget would also run
iput and things would go horribly wrong anyway.

Did I mention the part where Andrea and I are hunting a bug where the
count of unused inodes goes negative and the everyone ends up spinning
in shrink_icache_memory? Andrea's patch doesn't fix the spinning, but
it might have fixed the unused inode count going negative. We're
waiting for another reproduce on the ppc64 race monster.

>
> --- devel/fs/inode.c~generic_forget_inode-nr_unused-cleanup 2005-10-18 18:13:22.000000000 -0700
> +++ devel-akpm/fs/inode.c 2005-10-18 18:13:57.000000000 -0700
> @@ -1067,8 +1067,8 @@ static void generic_forget_inode(struct
> if (!hlist_unhashed(&inode->i_hash)) {
> if (!(inode->i_state & (I_DIRTY|I_LOCK)))
> list_move(&inode->i_list, &inode_unused);
> - inodes_stat.nr_unused++;
> if (!sb || (sb->s_flags & MS_ACTIVE)) {
> + inodes_stat.nr_unused++; /* One more 0-ref inode */
> spin_unlock(&inode_lock);
> return;
> }
> @@ -1077,7 +1077,6 @@ static void generic_forget_inode(struct
> write_inode_now(inode, 1);
> spin_lock(&inode_lock);
> inode->i_state &= ~I_WILL_FREE;
> - inodes_stat.nr_unused--;
> hlist_del_init(&inode->i_hash);
> }
> list_del_init(&inode->i_list);
> _
>
> > generic_forget_inode took care of incrementing the unused count when
> > i_count went to zero. So, I don't think we need to worry about the
> > unused count in __writeback_single_inode.
> >
>
> How about this for now?

This part looks good.

-chris

2005-10-19 02:27:34

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set

Chris Mason <[email protected]> wrote:
>
> On Tue, Oct 18, 2005 at 06:15:48PM -0700, Andrew Morton wrote:
> > > > Well according to my assertion (below), the inode in __sync_single_inode()
> > > > cannot have a zero refcount, so the whole if() statement is never executed.
> > >
> > > generic_forget_inode->write_inode_now->__writeback_single_inode->
> > > __sync_single_inode
> >
> > oshit.
>
> When does this ever happen? Just for private inodes released during
> put_super right?

I suppose so, yes.

> >
> > > We do have I_WILL_FREE, but i_count will be zero.
> >
> > yup.
> >
> > > >
> > > > The thinking behind that increment is that __sync_single_inode() has just
> > > > taken a dirty, zero-refcount inode and has cleaned it. A dirty inode
> > > > cannot have previously been on inode_unused, hence we now are newly moving
> > > > it to inode_unused.
> > >
> > > nr_unused doesn't seem to count the number of inodes on the unused list.
> > > It is actually counting the number of inodes whose i_count is 0. See
> > > generic_forget_inode and invalidate_list to see what I mean.
> >
> > hm, OK. It'd be nice to make that more explicit. Something like this?
>
> Well, I can't quite convince myself it is wrong, but when
> (!sb || (sb->s_flags & MS_ACTIVE), we're dropping the
> inode_lock with an inode with i_count == 0 and nr_unused hasn't been
> incremented.
>
> So, if someone (sync_sb_inodes?) comes in and runs __iget,
> the counts end up wrong. Then again, whoever ran __iget would also run
> iput and things would go horribly wrong anyway.

Nope, it's equivalent:

--- devel/fs/inode.c~generic_forget_inode-nr_unused-cleanup 2005-10-18 18:13:22.000000000 -0700
+++ devel-akpm/fs/inode.c 2005-10-18 18:13:57.000000000 -0700
@@ -1067,8 +1067,8 @@ static void generic_forget_inode(struct
if (!hlist_unhashed(&inode->i_hash)) {
if (!(inode->i_state & (I_DIRTY|I_LOCK)))
list_move(&inode->i_list, &inode_unused);
- inodes_stat.nr_unused++;
if (!sb || (sb->s_flags & MS_ACTIVE)) {
+ inodes_stat.nr_unused++; /* One more 0-ref inode */
spin_unlock(&inode_lock);
return;
}
@@ -1077,7 +1077,6 @@ static void generic_forget_inode(struct
write_inode_now(inode, 1);
spin_lock(&inode_lock);
inode->i_state &= ~I_WILL_FREE;
- inodes_stat.nr_unused--;
hlist_del_init(&inode->i_hash);
}
list_del_init(&inode->i_list);
_


> Did I mention the part where Andrea and I are hunting a bug where the
> count of unused inodes goes negative and the everyone ends up spinning
> in shrink_icache_memory?

No.

> Andrea's patch doesn't fix the spinning, but
> it might have fixed the unused inode count going negative. We're
> waiting for another reproduce on the ppc64 race monster.

I assume you have BUG_ON(inode_stat.nr_unused < 0)s in there everywhere?

In fact WARN_ON(inode_stat.nr_unused < 100) might be better - something's
obviously doing a bogus decrement a lot of times.


2005-10-19 02:58:10

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set

On Tue, Oct 18, 2005 at 07:26:46PM -0700, Andrew Morton wrote:
> > > hm, OK. It'd be nice to make that more explicit. Something like this?
> >
> > Well, I can't quite convince myself it is wrong, but when
> > (!sb || (sb->s_flags & MS_ACTIVE), we're dropping the
> > inode_lock with an inode with i_count == 0 and nr_unused hasn't been
> > incremented.
> >
> > So, if someone (sync_sb_inodes?) comes in and runs __iget,
> > the counts end up wrong. Then again, whoever ran __iget would also run
> > iput and things would go horribly wrong anyway.
>
> Nope, it's equivalent:

The math ends up the same, but for your version there is a window where
the lock isn't held and the count doesn't reflect reality. I don't know
if anyone can race in and mess with the inode though. It is still on
various lists, but if we're only in that part of generic_forget_inode
during unmount, the super semaphore will keep out most potential racers.

I need to read harder.

> > Did I mention the part where Andrea and I are hunting a bug where the
> > count of unused inodes goes negative and the everyone ends up spinning
> > in shrink_icache_memory?
>
> No.
>
> > Andrea's patch doesn't fix the spinning, but
> > it might have fixed the unused inode count going negative. We're
> > waiting for another reproduce on the ppc64 race monster.
>
> I assume you have BUG_ON(inode_stat.nr_unused < 0)s in there everywhere?
>
> In fact WARN_ON(inode_stat.nr_unused < 100) might be better - something's
> obviously doing a bogus decrement a lot of times.
>

It goes negative in the invalidate_inodes run during unmount. I
think Andrea's patch will solve that part, hopefully we'll know more
tomorrow.

-chris

2005-10-19 07:49:06

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set

On Tue, Oct 18, 2005 at 05:13:35PM -0700, Andrew Morton wrote:
> Well according to my assertion (below), the inode in __sync_single_inode()
> cannot have a zero refcount, so the whole if() statement is never executed.

It can, but only if it comes from I_WILL_FREE path
(generic_forget_inode). See the write_inode_now in
generic_forget_inode.

My BUG_ON already makes sure that when i_count is zero, I_WILL_FREE is
set (I_WILL_FREE will prevent the inode to be freed by the vm as well,
so it's ok).

2005-10-25 02:21:24

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set

On Tue, Oct 18, 2005 at 07:26:46PM -0700, Andrew Morton wrote:
> Chris Mason <[email protected]> wrote:
> >
> > On Tue, Oct 18, 2005 at 06:15:48PM -0700, Andrew Morton wrote:
> > > > > Well according to my assertion (below), the inode in __sync_single_inode()
> > > > > cannot have a zero refcount, so the whole if() statement is never executed.
> > > >
> > > > generic_forget_inode->write_inode_now->__writeback_single_inode->
> > > > __sync_single_inode
> > >
> > > oshit.
> >
> > When does this ever happen? Just for private inodes released during
> > put_super right?
>
> I suppose so, yes.

It's not related to the bug, but prune_icache can jump in at any
time during generic_shutdown_super, except during the invalidate_inodes
runs. Something like this:

proc1 proc2
generic_shutdown_super
s->s_flags &= ~MS_ACTIVE
invalidate_inodes
put_super
shrink_icache_memory
prune_icache
invalidate_inode_pages
try_to_release_page

I doubt any FS triggers this. They would need to generate inodes
with pages during the put_super call, and get them onto the unused list.
But, I think prune_icache should just skip any inodes where the super
doesn't have MS_ACTIVE set.

At any rate, this wasn't the race I was looking for. Aside from the
bugs fixed by Andrea's patch, we were seeing inodes go negative thanks
to a bad interaction between a latency fix and a backport of something
else from mainline. Our latency code has a goto again, and mainline
has a big fat comment explaining why goto again isn't needed.

If the super->s_inodes list was long enough to reschedule in invalidate_list,
we would process the same inodes in multiple times without removing them.

The short version is that no additional patches should be needed for
mainline.

-chris

2005-10-25 14:15:00

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] fix nr_unused accounting, and avoid recursing in iput with I_WILL_FREE set

On Mon, Oct 24, 2005 at 10:21:02PM -0400, Chris Mason wrote:
> The short version is that no additional patches should be needed for
> mainline.

This one may be needed too. Perhaps it's unnecessary for the MS_ACTIVE
case (I would assume in that case by design nobody can find the inode
while MS_ACTIVE is not set), but the unhashed case sounds more
interesting. At the moment I'm unsure who is using the unhashed
last-iput feature to get rid of the inode but the below looks needed at
the light of that feature.

Signed-off-by: Andrea Arcangeli <[email protected]>

diff -r 2c7561cc445e fs/inode.c
--- a/fs/inode.c Mon Oct 24 00:24:54 2005 +0011
+++ b/fs/inode.c Tue Oct 25 16:06:25 2005 +0200
@@ -1088,6 +1088,7 @@
if (inode->i_data.nrpages)
truncate_inode_pages(&inode->i_data, 0);
clear_inode(inode);
+ wake_up_inode(inode);
destroy_inode(inode);
}