Andrew, Kirill,
I have been investigating a panic on an IBM X255 machine (4 physical
processors, 8 virtual with hyper-threading, 24GB RAM). I have narrowed
this down to a race condition between isofs mount and kswapd. I believe
this race condition is associated with the generic_forget_inode() issue
that you discussed in your email exchange back in September:
Kirill Korotaev <dev@xxxxx> wrote:
>
> Hello,
>
> 1. I found that generic_forget_inode() calls write_inode_now()
> dropping inode_lock and destroys inode after that. The problem
> is that write_inode_now() can sleep and during this sleep someone
> can find inode in the hash, w/o I_FREEING state and with i_count = 0.
Although I have been testing with RHEL4 (2.6.9-5.ELsmp), I believe (by
code inspection) that the problem also exists in Linus' tree. I have
attempted to reproduce the problem with the 2.6.11 kernel, but I have
not succeeded in doing so - probably because of variations in low memory
usage, or timing differences...?
During the isofs mount, isofs_fill_super() allocates an inode and then
quickly frees it with iput (both cases are marked in the following code
snippet:
/*
* Read the root inode, which _may_ result in changing
* the s_rock flag. Once we have the final s_rock value,
* we then decide whether to use the Joliet descriptor.
*/
----> inode = isofs_iget(s, sbi->s_firstdatazone, 0);
BC_LOG(1, 0xddddddd0, &inode, 4);
/*
* If this disk has both Rock Ridge and Joliet on it, then we
* want to use Rock Ridge by default. This can be overridden
* by using the norock mount option. There is still one other
* possibility that is not taken into account: a Rock Ridge
* CD with Unicode names. Until someone sees such a beast, it
* will not be supported.
*/
if (sbi->s_rock == 1) {
joliet_level = 0;
} else if (joliet_level) {
sbi->s_rock = 0;
if (sbi->s_firstdatazone != first_data_zone) {
sbi->s_firstdatazone = first_data_zone;
printk(KERN_DEBUG
"ISOFS: changing to secondary root\n");
BC_TAG(1, 0xddddddd1);
----> iput(inode);
inode = isofs_iget(s, sbi->s_firstdatazone, 0);
}
}
generic_forget_inode() is eventually called (within the context of
iput), the inode is placed on the unused list, and the inode_lock is
dropped.
kswapd calls prune_icache(), locks the inode_lock, and pulls the same
inode off of the unused list. Upon completion, prune_icache() calls
dispose_list() for the inodes that it has collected.
generic_forget_inode() calls write_inode_now(), which calls
__writeback_single_inode() which calls __sync_single_inode().
__sync_single_inode() panics when attempting to move the inode onto the
unused list (the last call to list_move). This is due to the poison
values that were previously loaded into the next and prev list pointers
by list_del().
I initially thought that I might be able to fix this particular case by
removing the two lines of code below (delineated with the
FIX_INODE_PANIC constant). However, I believe that even if this didn't
break other code, it is not a complete fix, because __sync_single_inode
will eventually put the same inode on the unused list, the inode_lock
will be dropped, and you essentially have the same scenario:
generic_forget_inode() will cause a panic when calling list_del_init()
for this inode.
It seems to me that there needs to be a flag associated with the inode
(or the superblock?) that prevents kswapd from selecting this inode. Or,
since at this point the problem is specifically related to isofs, maybe
the isofs_fill_super() logic should be changed?
I would appreciate any suggestions for how this can best be fixed - or
if you have a fix in mind, I would be happy to test it (in the context
of the 2.6.9-5.ELsmp redhat kernel).
Thanks!
- Russ
static void generic_forget_inode(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
if (!hlist_unhashed(&inode->i_hash)) {
#define FIX_INODE_PANIC
#ifndef FIX_INODE_PANIC
if (!(inode->i_state & (I_DIRTY|I_LOCK)))
list_move(&inode->i_list, &inode_unused);
#endif /* FIX_INODE_PANIC */
inodes_stat.nr_unused++;
spin_unlock(&inode_lock);
if (!sb || (sb->s_flags & MS_ACTIVE))
return;
write_inode_now(inode, 1);
spin_lock(&inode_lock);
inodes_stat.nr_unused--;
hlist_del_init(&inode->i_hash);
}
list_del_init(&inode->i_list);
inode->i_state|=I_FREEING;
inodes_stat.nr_inodes--;
spin_unlock(&inode_lock);
if (inode->i_data.nrpages)
truncate_inode_pages(&inode->i_data, 0);
clear_inode(inode);
destroy_inode(inode);
}
--
- Russ
Russ Weight
IBM Linux Technology Center (LTC)
(503) 578-3461, T/L 775-3461
[email protected]
Russ Weight <[email protected]> wrote:
>
> generic_forget_inode() is eventually called (within the context of
> iput), the inode is placed on the unused list, and the inode_lock is
> dropped.
>
> kswapd calls prune_icache(), locks the inode_lock, and pulls the same
> inode off of the unused list. Upon completion, prune_icache() calls
> dispose_list() for the inodes that it has collected.
>
> generic_forget_inode() calls write_inode_now(), which calls
> __writeback_single_inode() which calls __sync_single_inode().
> __sync_single_inode() panics when attempting to move the inode onto the
> unused list (the last call to list_move). This is due to the poison
> values that were previously loaded into the next and prev list pointers
> by list_del().
It's not clear what the actual bug is here. When you say that
__sync_single_inode() panics over the list pointers, who was it that
poisoned them? dispose_list()?
Certainly isofs_fill_super() could trivially be rewritten to not do the
iget()/iput() but we should be sure that that's really the bug. The inode
lifetime management is rather messy, I'm afraid.
On Fri, 2005-03-18 at 14:17 -0800, Andrew Morton wrote:
> Russ Weight <[email protected]> wrote:
> >
> > generic_forget_inode() is eventually called (within the context of
> > iput), the inode is placed on the unused list, and the inode_lock is
> > dropped.
> >
> > kswapd calls prune_icache(), locks the inode_lock, and pulls the same
> > inode off of the unused list. Upon completion, prune_icache() calls
> > dispose_list() for the inodes that it has collected.
> >
> > generic_forget_inode() calls write_inode_now(), which calls
> > __writeback_single_inode() which calls __sync_single_inode().
> > __sync_single_inode() panics when attempting to move the inode onto the
> > unused list (the last call to list_move). This is due to the poison
> > values that were previously loaded into the next and prev list pointers
> > by list_del().
>
> It's not clear what the actual bug is here. When you say that
> __sync_single_inode() panics over the list pointers, who was it that
> poisoned them? dispose_list()?
>
> Certainly isofs_fill_super() could trivially be rewritten to not do the
> iget()/iput() but we should be sure that that's really the bug. The inode
> lifetime management is rather messy, I'm afraid.
The pointers are poisoned by dispose_list(). The race condition is
between prune_icache() and generic_forget_inode().
When I suggested that a change to isofs_fill_super() might be
considered, I was speculating that isofs_fill_super() might be creating
an unexpected state by doing something unconventional in its usage of
iget() and iput(). This is something I had not investigated.
The problem is more likely in generic_forget_inode(). It releases the
inode_lock and then locks it again without doing anything to prevent the
inode from being stolen, and without checking to see if it has been
stolen. Likewise, write_inode_now() does not do any checks to see if the
inode has been stolen.
- Russ
Russ Weight <[email protected]> wrote:
>
> On Fri, 2005-03-18 at 14:17 -0800, Andrew Morton wrote:
> > Russ Weight <[email protected]> wrote:
> > >
> > > generic_forget_inode() is eventually called (within the context of
> > > iput), the inode is placed on the unused list, and the inode_lock is
> > > dropped.
> > >
> > > kswapd calls prune_icache(), locks the inode_lock, and pulls the same
> > > inode off of the unused list. Upon completion, prune_icache() calls
> > > dispose_list() for the inodes that it has collected.
> > >
> > > generic_forget_inode() calls write_inode_now(), which calls
> > > __writeback_single_inode() which calls __sync_single_inode().
> > > __sync_single_inode() panics when attempting to move the inode onto the
> > > unused list (the last call to list_move). This is due to the poison
> > > values that were previously loaded into the next and prev list pointers
> > > by list_del().
> >
> > It's not clear what the actual bug is here. When you say that
> > __sync_single_inode() panics over the list pointers, who was it that
> > poisoned them? dispose_list()?
> >
> > Certainly isofs_fill_super() could trivially be rewritten to not do the
> > iget()/iput() but we should be sure that that's really the bug. The inode
> > lifetime management is rather messy, I'm afraid.
>
> The pointers are poisoned by dispose_list(). The race condition is
> between prune_icache() and generic_forget_inode().
>
> When I suggested that a change to isofs_fill_super() might be
> considered, I was speculating that isofs_fill_super() might be creating
> an unexpected state by doing something unconventional in its usage of
> iget() and iput(). This is something I had not investigated.
>
> The problem is more likely in generic_forget_inode(). It releases the
> inode_lock and then locks it again without doing anything to prevent the
> inode from being stolen, and without checking to see if it has been
> stolen. Likewise, write_inode_now() does not do any checks to see if the
> inode has been stolen.
>
I dunno. This is really fiddly code and does like to blow up in your face.
I doubt if the problem is in such a well-trodden path as
generic_forget_inode().
Perhaps isofs is breaking the rules by running iget() prior to setting
MS_ACTIVE. (What rules, you ask? hah.)
Russ Weight <[email protected]> wrote:
>
> The problem is more likely in generic_forget_inode(). It releases the
Exactly. It's a continuation of the greased turkey bug :)
When we're writing the inode out, we shouldn't place it on the unused
list at all. Placing the inode on the unused list only makes sense
when we return from generic_forget_inode without actually destroying
the inode.
So we need something like this. I'm not sure about the nr_unused
counter though. Should we be incrementing it as we do now even when
we don't put the inode on the unused list?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
===== fs/inode.c 1.144 vs edited =====
--- 1.144/fs/inode.c 2005-03-05 17:41:15 +11:00
+++ edited/fs/inode.c 2005-03-19 11:11:56 +11:00
@@ -1037,12 +1037,14 @@
struct super_block *sb = inode->i_sb;
if (!hlist_unhashed(&inode->i_hash)) {
- if (!(inode->i_state & (I_DIRTY|I_LOCK)))
- list_move(&inode->i_list, &inode_unused);
inodes_stat.nr_unused++;
- spin_unlock(&inode_lock);
- if (!sb || (sb->s_flags & MS_ACTIVE))
+ if (!sb || (sb->s_flags & MS_ACTIVE)) {
+ if (!(inode->i_state & (I_DIRTY|I_LOCK)))
+ list_move(&inode->i_list, &inode_unused);
+ spin_unlock(&inode_lock);
return;
+ }
+ spin_unlock(&inode_lock);
write_inode_now(inode, 1);
spin_lock(&inode_lock);
inodes_stat.nr_unused--;
On Fri, 2005-03-18 at 16:11 -0800, Andrew Morton wrote:
> Russ Weight <[email protected]> wrote:
> >
> > On Fri, 2005-03-18 at 14:17 -0800, Andrew Morton wrote:
> > > Russ Weight <[email protected]> wrote:
> > > >
> > > > generic_forget_inode() is eventually called (within the context of
> > > > iput), the inode is placed on the unused list, and the inode_lock is
> > > > dropped.
> > > >
> > > > kswapd calls prune_icache(), locks the inode_lock, and pulls the same
> > > > inode off of the unused list. Upon completion, prune_icache() calls
> > > > dispose_list() for the inodes that it has collected.
> > > >
> > > > generic_forget_inode() calls write_inode_now(), which calls
> > > > __writeback_single_inode() which calls __sync_single_inode().
> > > > __sync_single_inode() panics when attempting to move the inode onto the
> > > > unused list (the last call to list_move). This is due to the poison
> > > > values that were previously loaded into the next and prev list pointers
> > > > by list_del().
> > >
> > > It's not clear what the actual bug is here. When you say that
> > > __sync_single_inode() panics over the list pointers, who was it that
> > > poisoned them? dispose_list()?
> > >
> > > Certainly isofs_fill_super() could trivially be rewritten to not do the
> > > iget()/iput() but we should be sure that that's really the bug. The inode
> > > lifetime management is rather messy, I'm afraid.
> >
> > The pointers are poisoned by dispose_list(). The race condition is
> > between prune_icache() and generic_forget_inode().
> >
> > When I suggested that a change to isofs_fill_super() might be
> > considered, I was speculating that isofs_fill_super() might be creating
> > an unexpected state by doing something unconventional in its usage of
> > iget() and iput(). This is something I had not investigated.
> >
> > The problem is more likely in generic_forget_inode(). It releases the
> > inode_lock and then locks it again without doing anything to prevent the
> > inode from being stolen, and without checking to see if it has been
> > stolen. Likewise, write_inode_now() does not do any checks to see if the
> > inode has been stolen.
> >
>
> I dunno. This is really fiddly code and does like to blow up in your face.
> I doubt if the problem is in such a well-trodden path as
> generic_forget_inode().
>
> Perhaps isofs is breaking the rules by running iget() prior to setting
> MS_ACTIVE. (What rules, you ask? hah.)
>
isofs is, in fact, running iget() prior to setting MS_ACTIVE. I have
tested a fix for this and I will post a patch shortly.