LinuxLists.cc - [PATC] small VFS change for JFFS2

2005-04-18 08:47:54

Subject: [PATC] small VFS change for JFFS2

Hello,

tracing out an JFFS2's SMP/PREEMPT bug I've realized that I need to make
a small change in VFS in order to fix the bug gracefully.

I refer to the fs/inode.c file from the 2.6.11.5 Linux kernel.

JFFS2 supports its own accounting of inodes by means of small 'struct
jffs2_inode_cache' objects. There is a field 'state' in these objects
which indicates whether the inode is built (i.e. it is in the Linux
inode cache at the moment) or not built (i.e., it isn't in the Linux
inode cache). This is needed to facilitate fast Garbage Collection,
which is the crucial part of the Flash File System.

JFFS2 assumes that the above mentioned 'state' field is always coherent
with the real state of the inode. The state is changed on read_inode()
and clear_inode() inode operation calls.

But there is a problem with this. When kswapd prunes the Linux inode
cache, there is an interval where clear_inode() was not yet called but
the inode is already removed from the inode hash (inode->i_hash). I mean
the following piece of code at fs/inode.c:467

inodes_stat.nr_unused -= nr_pruned;
spin_unlock(&inode_lock);
// <-- Here
dispose_list(&freeable);
up(&iprune_sem);

This is not a bug - VFS handles things by means of I_FREEING state. But,
for JFFS2 this means, that it may assume that the inode is built, while
it actually isn't (if, say, we are preempted at the above marked point
and the JFFS2 GC starts and picks a node belonging an inode being
pruned). This leads to nasty things in JFFS2.

One obvious thing to fix this JFFS2 problem is to acquire the iprune_sem
semaphore in JFFS2 GC, but for this we need to export it. So, please,
consider the following VFS patch (against Linux 2.6.11.5):

-------------------------------------------------------
diff -auNrp linux-2.6.11.5/fs/inode.c linux-2.6.11.5_vfs_fix/fs/inode.c
--- linux-2.6.11.5/fs/inode.c 2005-03-19 09:35:04.000000000 +0300
+++ linux-2.6.11.5_vfs_fix/fs/inode.c 2005-04-18 12:09:43.000000000
+0400
@@ -92,6 +92,8 @@ DEFINE_SPINLOCK(inode_lock);
*/
DECLARE_MUTEX(iprune_sem);

+EXPORT_SYMBOL(iprune_sem);
+
/*
* Statistics gathering..
*/
-------------------------------------------------------

P.S. Probably we may fix this without VFS changes (albeit I'm not sure),
but it will be more difficult and much uglier.

The following is the correspondent patch for JFFS2 (against MTD CVS
Head):

-------------------------------------------------------
diff -auNrp --exclude=CVS mtd-pristine/fs/jffs2/gc.c mtd/fs/jffs2/gc.c
--- mtd-pristine/fs/jffs2/gc.c 2005-04-11 13:13:29.000000000 +0400
+++ mtd/fs/jffs2/gc.c 2005-04-18 12:20:40.390920376 +0400
@@ -126,15 +126,17 @@ int jffs2_garbage_collect_pass(struct jf
struct jffs2_eraseblock *jeb;
struct jffs2_raw_node_ref *raw;
int ret = 0, inum, nlink;
+ int iprune_sem_up = 0;

if (down_interruptible(&c->alloc_sem))
return -EINTR;

for (;;) {
- spin_lock(&c->erase_completion_lock);
if (!c->unchecked_size)
break;

+ spin_lock(&c->erase_completion_lock);
+
/* We can't start doing GC yet. We haven't finished
checking
the node CRCs etc. Do it now. */

@@ -206,6 +208,17 @@ int jffs2_garbage_collect_pass(struct jf
return ret;
}

+ /*
+ * To be sure the ic->state's value stays consistent, we need to
+ * forbid kswapd to prune our inode. Otherwise, we may end up
+ * with the situation where the ic->state = INO_STATE_PRESENT
+ * while it isn't actually in the inode cache (since it was just
+ * pruned by kswapd, but clear_inode() wasn't yet called).
+ */
+ down(&iprune_sem);
+
+ spin_lock(&c->erase_completion_lock);
+
/* First, work out which block we're garbage-collecting */
jeb = c->gcblock;

@@ -215,6 +228,7 @@ int jffs2_garbage_collect_pass(struct jf
if (!jeb) {
D1 (printk(KERN_NOTICE "jffs2: Couldn't find erase block
to garbage collect!\n"));
spin_unlock(&c->erase_completion_lock);
+ up(&iprune_sem);
up(&c->alloc_sem);
return -EIO;
}
@@ -224,6 +238,7 @@ int jffs2_garbage_collect_pass(struct jf
printk(KERN_DEBUG "Nextblock at %08x, used_size %08x,
dirty_size %08x, wasted_size %08x, free_size %08x\n", c->n
extblock->offset, c->nextblock->used_size, c->nextblock->dirty_size, c-
>nextblock->wasted_size, c->nextblock->free_size));

if (!jeb->used_size) {
+ up(&iprune_sem);
up(&c->alloc_sem);
goto eraseit;
}
@@ -239,6 +254,7 @@ int jffs2_garbage_collect_pass(struct jf
jeb->offset, jeb->free_size, jeb-
>dirty_size, jeb->used_size);
jeb->gc_node = raw;
spin_unlock(&c->erase_completion_lock);
+ up(&iprune_sem);
up(&c->alloc_sem);
BUG();
}
@@ -253,6 +269,7 @@ int jffs2_garbage_collect_pass(struct jf
we don't grok that has JFFS2_NODETYPE_RWCOMPAT_COPY,
we should do so */
spin_unlock(&c->erase_completion_lock);
jffs2_mark_node_obsolete(c, raw);
+ up(&iprune_sem);
up(&c->alloc_sem);
goto eraseit_lock;
@@ -283,6 +300,8 @@ int jffs2_garbage_collect_pass(struct jf
We can just copy any pristine nodes, but have
to prevent anyone else from doing read_inode() while
we're at it, so we set the state accordingly */
+ iprune_sem_up += 1;
+ up(&iprune_sem);
if (ref_flags(raw) == REF_PRISTINE)
ic->state = INO_STATE_GC;
else {
@@ -305,6 +324,7 @@ int jffs2_garbage_collect_pass(struct jf
*/
printk(KERN_CRIT "Inode #%u already in state %d in
jffs2_garbage_collect_pass()!\n",
ic->ino, ic->state);
+ up(&iprune_sem);
up(&c->alloc_sem);
spin_unlock(&c->inocache_lock);
BUG();
@@ -316,6 +336,7 @@ int jffs2_garbage_collect_pass(struct jf
the alloc_sem() (for marking nodes invalid) so we
must
drop the alloc_sem before sleeping. */

+ up(&iprune_sem);
up(&c->alloc_sem);
D1(printk(KERN_DEBUG "jffs2_garbage_collect_pass()
waiting for ino #%u in state %d\n",
ic->ino, ic->state));
@@ -367,6 +388,8 @@ int jffs2_garbage_collect_pass(struct jf
spin_unlock(&c->inocache_lock);

f = jffs2_gc_fetch_inode(c, inum, nlink);
+ if (!iprune_sem_up)
+ up(&iprune_sem);
if (IS_ERR(f)) {
ret = PTR_ERR(f);
goto release_sem;
}
-------------------------------------------------------

Comments?

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.

2005-04-18 08:51:28

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATC] small VFS change for JFFS2

On Mon, Apr 18, 2005 at 12:47:11PM +0400, Artem B. Bityuckiy wrote:
> JFFS2 assumes that the above mentioned 'state' field is always coherent
> with the real state of the inode. The state is changed on read_inode()
> and clear_inode() inode operation calls.

> One obvious thing to fix this JFFS2 problem is to acquire the iprune_sem
> semaphore in JFFS2 GC, but for this we need to export it. So, please,
> consider the following VFS patch (against Linux 2.6.11.5):

No, exporting locks is a really bad idea. Please try to find a better
method to fix your problem that doesn't export random kernel symbols.

2005-04-18 08:58:54

by Artem Bityutskiy

[permalink] [raw]

Subject: Re: [PATC] small VFS change for JFFS2

On Mon, 2005-04-18 at 09:51 +0100, Christoph Hellwig wrote:
> No, exporting locks is a really bad idea. Please try to find a better
> method to fix your problem that doesn't export random kernel symbols.
>
In general it must be true. But this specific case I believe is
reasonable enough to export the mutext (as an exception).

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.

2005-04-18 10:53:08

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATC] small VFS change for JFFS2

On Mon, Apr 18, 2005 at 12:58:50PM +0400, Artem B. Bityuckiy wrote:
> On Mon, 2005-04-18 at 09:51 +0100, Christoph Hellwig wrote:
> > No, exporting locks is a really bad idea. Please try to find a better
> > method to fix your problem that doesn't export random kernel symbols.
> >
> In general it must be true. But this specific case I believe is
> reasonable enough to export the mutext (as an exception).

Umm, no. It's absolutely not a good reason. What jffs2 is doing right
now is to poke into VFS internals it shouldn't, and exporting more internals
to prevent it from doing so isn't making the situation any better.

The VFS already has a method for freeing an struct inode pointer, and that
is ->destroy_inode. You're probably better off updating your GC state from
that place.

2005-04-18 11:46:36

by Artem Bityutskiy

[permalink] [raw]

Subject: Re: [PATC] small VFS change for JFFS2

On Mon, 2005-04-18 at 11:53 +0100, Christoph Hellwig wrote:
> The VFS already has a method for freeing an struct inode pointer, and that
> is ->destroy_inode. You're probably better off updating your GC state from
> that place.
destroy_inode() does not help. JFFS2 already makes use of clear_inode()
which is in fact called even earlier (inode.c from 2.6.11.5, line 298):

static void dispose_list(struct list_head *head)
{
int nr_disposed = 0;

while (!list_empty(head)) {
struct inode *inode;

inode = list_entry(head->next, struct inode, i_list);
list_del(&inode->i_list);

if (inode->i_data.nrpages)
truncate_inode_pages(&inode->i_data, 0);
clear_inode(inode); /* <--------- here --------- */
destroy_inode(inode);
nr_disposed++;
}
spin_lock(&inode_lock);
inodes_stat.nr_inodes -= nr_disposed;
spin_unlock(&inode_lock);
}

I'll repeat my problem, please look tat the code and read my comments in
it (inode.c:443):

static void prune_icache(int nr_to_scan)
{
LIST_HEAD(freeable);
int nr_pruned = 0;
int nr_scanned;
unsigned long reap = 0;

down(&iprune_sem);
spin_lock(&inode_lock);
for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
struct inode *inode;

if (list_empty(&inode_unused))
break;

inode = list_entry(inode_unused.prev, struct inode,
i_list);

if (inode->i_state || atomic_read(&inode->i_count)) {
list_move(&inode->i_list, &inode_unused);
continue;
}
if (inode_has_buffers(inode) || inode->i_data.nrpages) {
__iget(inode);
spin_unlock(&inode_lock);
if (remove_inode_buffers(inode))
reap += invalidate_inode_pages(&inode-
>i_data);
iput(inode);
spin_lock(&inode_lock);

if (inode != list_entry(inode_unused.next,
struct inode, i_list))
continue; /* wrong inode or
list_empty */
if (!can_unuse(inode))
continue;
}
hlist_del_init(&inode->i_hash);
list_move(&inode->i_list, &freeable);

/* we have removed inode from the inode hash and from this moment
forth it is not in the inode cache */

inode->i_state |= I_FREEING;
nr_pruned++;
}
inodes_stat.nr_unused -= nr_pruned;
spin_unlock(&inode_lock);

/* we unlocked the spinlock an may be, say, preempted here.
now inodes are not in the inode cache, but
clear_inode()/destroy_inode() hasn't been called yet.
JFFS2 thinks the inode in the inode cache and makes wrong things */

dispose_list(&freeable);
up(&iprune_sem);

if (current_is_kswapd())
mod_page_state(kswapd_inodesteal, reap);
else
mod_page_state(pginodesteal, reap);
}

> Umm, no. It's absolutely not a good reason. What jffs2 is doing right
> now is to poke into VFS internals it shouldn't, and exporting more internals
> to prevent it from doing so isn't making the situation any better.

I fully agree with you that we ought to be very picky about exports.
But this mutex is special case - it protects from races with the "external"
kswapd. I suspect I'm not the last person who will ask
to export it. If you or David suggest a workaround, you're welcome.

Cheers,
Artem.

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.

2005-04-18 11:52:31

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATC] small VFS change for JFFS2

On Mon, Apr 18, 2005 at 03:46:21PM +0400, Artem B. Bityuckiy wrote:
> On Mon, 2005-04-18 at 11:53 +0100, Christoph Hellwig wrote:
> > The VFS already has a method for freeing an struct inode pointer, and that
> > is ->destroy_inode. You're probably better off updating your GC state from
> > that place.
> destroy_inode() does not help. JFFS2 already makes use of clear_inode()
> which is in fact called even earlier (inode.c from 2.6.11.5, line 298):

Oh, I thought the problem is that JFFS2 thought an inode was freed when
it still was in use. So you're problem is actually that it's no in the
hash anymore but you don't know yet?

Anyway, please explain in detail why you need all this information, what
errors you see, etc so we can find a way to fix it properly.

2005-04-18 12:31:17

by Artem Bityutskiy

[permalink] [raw]

Subject: Re: [PATC] small VFS change for JFFS2

On Mon, 2005-04-18 at 12:52 +0100, Christoph Hellwig wrote:
> Oh, I thought the problem is that JFFS2 thought an inode was freed when
> it still was in use. So you're problem is actually that it's no in the
> hash anymore but you don't know yet?
Yes, exactly. VFS developers may always say "it is your problem -
redesign JFFS2", but I think it is too late to redesign it.

> Anyway, please explain in detail why you need all this information, what
> errors you see, etc so we can find a way to fix it properly.
Well, I suspect I explained why I need the mutex. If people will find
the explanation vague, I'll make another attempt.

The error I see is:

Eep. Trying to read_inode #15601 when it's already in state 2!

I debugged this a lot before I've realized the reason. And I believe I
know JFFS2 very well to claim that redesigning it is very painful.

The erroneous code flow is like this:

kswapd: removes the inode 15601 from the inode hash (inode.c:478).
kswapd: is preempted at inode.c:485
JFFS2 writer: awakes, runs GC to reclaim some space.
JFFS2 writer: picks a JFFS2 node belonging to the inode 15601
JFFS2 writer: looks at the inode state, realizes it is in state PRESENT,
i.e. it is in the inode cache (which is wrong).
JFFS2 writer: runs iget() to acquire a pointer to the struct inode of
the inode 15601.
JFFS2 writer: iget() calls ->read_inode(), i.e., jffs2_read_inode().
JFFS2 writed: JFFS2 is surprised why read_inode() is called for the
already built inode 15601.

Or may be VFS is buggy, I'm not sure. May be it shouldn't remove inode
from the inode hash in that point (inode.c:487). It sets the I_FREENG
state to the inode being freed, and iget() may wait in find_inode_fast()
while the inode is actually destroyed (inode.c:562). The inode may be
removed from the inode hash later, in dispose_list() (inode.c:292).

Or may be this isn't a bug but a feature to make the inode_lock less
contended. Not sure, I'm not a VFS guru.

In that above described case the code flow would be:

kswapd: remove inode 15601 from the inode hash (inode.c:478).
kswapd: preempted at inode.s:485
JFFS2 writer: awakes, runs GC to reclaim some space.
JFFS2 writer: picks a JFFS2 node belonging to inode 15601
JFFS2 writer: looks at the inode state, realizes it is in state PRESENT,
i.e. it is in the inode cache (which is wrong).
JFFS2 writer: runs iget() to get the pointer to the struct inode of the
inode 15601.
JFFS2 writer goes sleep in find_inode_fast(), waiting for freeing
completion.
kswapd: calls dispose_list(), calls clear_inode() for 15601.
JFFS2 writed: read_inode() is called and all is OK since clear_inode()
was called before.

The latter scenario looks very logical for me.

Cheers, Artem.

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.

2005-04-18 12:47:39

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATC] small VFS change for JFFS2

On Mon, Apr 18, 2005 at 04:31:06PM +0400, Artem B. Bityuckiy wrote:
> On Mon, 2005-04-18 at 12:52 +0100, Christoph Hellwig wrote:
> > Oh, I thought the problem is that JFFS2 thought an inode was freed when
> > it still was in use. So you're problem is actually that it's no in the
> > hash anymore but you don't know yet?
> Yes, exactly. VFS developers may always say "it is your problem -
> redesign JFFS2", but I think it is too late to redesign it.

I don't think it's too late ever ;-)

> > Anyway, please explain in detail why you need all this information, what
> > errors you see, etc so we can find a way to fix it properly.
> Well, I suspect I explained why I need the mutex. If people will find
> the explanation vague, I'll make another attempt.
>
> The error I see is:
>
> Eep. Trying to read_inode #15601 when it's already in state 2!
>
> I debugged this a lot before I've realized the reason. And I believe I
> know JFFS2 very well to claim that redesigning it is very painful.
>
> The erroneous code flow is like this:
>
> kswapd: removes the inode 15601 from the inode hash (inode.c:478).
> kswapd: is preempted at inode.c:485
> JFFS2 writer: awakes, runs GC to reclaim some space.
> JFFS2 writer: picks a JFFS2 node belonging to the inode 15601
> JFFS2 writer: looks at the inode state, realizes it is in state PRESENT,
> i.e. it is in the inode cache (which is wrong).
> JFFS2 writer: runs iget() to acquire a pointer to the struct inode of
> the inode 15601.
> JFFS2 writer: iget() calls ->read_inode(), i.e., jffs2_read_inode().

Why doesn't __wait_on_freeing_inode get called? prune_icache sets I_FREEING
before it's dropping the inode lock.

Any, this sounds like you'd want to use ilookup because you don't want to
read the inode in the cache anyway, right?

> JFFS2 writed: JFFS2 is surprised why read_inode() is called for the
> already built inode 15601.
>
> Or may be VFS is buggy, I'm not sure. May be it shouldn't remove inode
> from the inode hash in that point (inode.c:487). It sets the I_FREENG
> state to the inode being freed, and iget() may wait in find_inode_fast()
> while the inode is actually destroyed (inode.c:562). The inode may be
> removed from the inode hash later, in dispose_list() (inode.c:292).
>
> Or may be this isn't a bug but a feature to make the inode_lock less
> contended. Not sure, I'm not a VFS guru.

Yes, it's a feature.

2005-04-18 12:49:37

by David Woodhouse

[permalink] [raw]

Subject: Re: [PATC] small VFS change for JFFS2

On Mon, 2005-04-18 at 16:31 +0400, Artem B. Bityuckiy wrote:
> Yes, exactly. VFS developers may always say "it is your problem -
> redesign JFFS2", but I think it is too late to redesign it.

Bear in mind that the reason I added the state tracking to the internal
jffs2_inode_cache state was specifically to deal with the fact that the
VFS can call read_inode() for an inode which was previously in core and
for which it hasn't yet called clear_inode().

I suspect that there's a way we could work around this problem in JFFS2,
and if not hch is probably right -- exporting the lock isn't good
enough; if we're going to change the VFS we should fix the actual
problem.

I'm confused by this though -- I thought we'd already _fixed_ this in
the VFS. What you describe shouldn't happen because I believe you're
missing a step:

> JFFS2 writer: looks at the inode state, realizes it is in state PRESENT,
> i.e. it is in the inode cache (which is wrong).
> JFFS2 writer: runs iget() to acquire a pointer to the struct inode of
> the inode 15601.

wait_for_freeing_inode() waits for the existing inode in state I_FREEING
to finish being cleared...

> JFFS2 writer: iget() calls ->read_inode(), i.e., jffs2_read_inode().
> JFFS2 writed: JFFS2 is surprised why read_inode() is called for the
> already built inode 15601.

... and this shouldn't happen.

--
dwmw2

2005-04-18 12:57:36

by Artem Bityutskiy

[permalink] [raw]

Subject: Re: [PATC] small VFS change for JFFS2

Christoph Hellwig wrote:
> Why doesn't __wait_on_freeing_inode get called? prune_icache sets I_FREEING
> before it's dropping the inode lock.
I suppose because the inode is *deleted* from i_hash. But
find_inode_fast looks for inode using *->i_hash*. Of course it will not
find anything and call read_inode() immediately. Did I miss something?

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.

2005-04-18 13:08:47

by David Woodhouse

[permalink] [raw]

Subject: Re: [PATC] small VFS change for JFFS2

On Mon, 2005-04-18 at 13:46 +0100, Christoph Hellwig wrote:
> Any, this sounds like you'd want to use ilookup because you don't want
> to read the inode in the cache anyway, right?

We use ilookup() in some circumstances -- if the inode has zero nlink
and hence we definitely don't want to pull it back again if it's gone.

But sometimes we really do mean to use iget() to bring it into core. And
it's in that case that I believe Artem has found the problem, because if
I understand correctly he's still seeing two consecutive calls to
read_inode() for the same inode, without a clear_inode() in between.

prune_icache() is removing the inode from i_hash at line 457 of inode.c,
then being preempted when it drops the inode_lock at line 464, which is
_before_ it calls dispose_list() to actually get rid of the inode(s) in
question. So when iget() is called, the VFS ends up calling read_inode()
again instead of waiting for the original inode to finish going away.

--
dwmw2

2005-04-18 13:17:12

by David Woodhouse

[permalink] [raw]

Subject: Re: [PATC] small VFS change for JFFS2

On Mon, 2005-04-18 at 13:46 +0100, Christoph Hellwig wrote:
> Why doesn't __wait_on_freeing_inode get called? prune_icache sets
> I_FREEING before it's dropping the inode lock.

Because prune_icache() _also_ removes the inode from the hash before
dropping the inode lock. It shouldn't -- the inode should only get
removed from the hash when it's actually been cleared. That's the real
bug -- and I agree that the fix isn't to expose internal locks to let
JFFS2 work around it.

prune_icache() (and probably invalidate_inodes() too) needs to leave the
inode on the hash list while it's being freed.

--
dwmw2

2005-04-18 13:38:26

by Anton Altaparmakov

[permalink] [raw]

Subject: ntfs ->iput use - was: Re: [PATC] small VFS change for JFFS2

Hi Christoph,

If you remember you complained about ntfs' "fishy use of iput"?

I think that David's explanation below is exactly the reason why I had
to do it IIRC... So if the vfs is fixed so the below can _never_ happen
anymore then I believe it would be safe to do as you and Al suggest and
stop using iput in a "fishy" way in ntfs...

Best regards,

Anton

On Mon, 2005-04-18 at 23:08 +1000, David Woodhouse wrote:
> On Mon, 2005-04-18 at 13:46 +0100, Christoph Hellwig wrote:
> > Any, this sounds like you'd want to use ilookup because you don't want
> > to read the inode in the cache anyway, right?
>
> We use ilookup() in some circumstances -- if the inode has zero nlink
> and hence we definitely don't want to pull it back again if it's gone.
>
> But sometimes we really do mean to use iget() to bring it into core. And
> it's in that case that I believe Artem has found the problem, because if
> I understand correctly he's still seeing two consecutive calls to
> read_inode() for the same inode, without a clear_inode() in between.
>
> prune_icache() is removing the inode from i_hash at line 457 of inode.c,
> then being preempted when it drops the inode_lock at line 464, which is
> _before_ it calls dispose_list() to actually get rid of the inode(s) in
> question. So when iget() is called, the VFS ends up calling read_inode()
> again instead of waiting for the original inode to finish going away.

--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2005-04-18 14:07:54

by Artem Bityutskiy

[permalink] [raw]

Subject: Re: [PATC] small VFS change for JFFS2

On Mon, 2005-04-18 at 23:16 +1000, David Woodhouse wrote:
> On Mon, 2005-04-18 at 13:46 +0100, Christoph Hellwig wrote:
> > Why doesn't __wait_on_freeing_inode get called? prune_icache sets
> > I_FREEING before it's dropping the inode lock.
>
> Because prune_icache() _also_ removes the inode from the hash before
> dropping the inode lock. It shouldn't -- the inode should only get
> removed from the hash when it's actually been cleared. That's the real
> bug -- and I agree that the fix isn't to expose internal locks to let
> JFFS2 work around it.
>
> prune_icache() (and probably invalidate_inodes() too) needs to leave the
> inode on the hash list while it's being freed.
>

Please, then consider the following patch (I didn't test it well,
though).

Signed-off-by: Artem B. Bityuckiy <[email protected]>

diff -auNrp linux-2.6.11.5/fs/inode.c linux-2.6.11.5_fixed/fs/inode.c
--- linux-2.6.11.5/fs/inode.c 2005-03-19 09:35:04.000000000 +0300
+++ linux-2.6.11.5_fixed/fs/inode.c 2005-04-18 17:54:16.000000000
+0400
@@ -284,6 +284,12 @@ static void dispose_list(struct list_hea
if (inode->i_data.nrpages)
truncate_inode_pages(&inode->i_data, 0);
clear_inode(inode);
+
+ spin_lock(&inode_lock);
+ hlist_del_init(&inode->i_hash);
+ list_del_init(&inode->i_sb_list);
+ spin_unlock(&inode_lock);
+
destroy_inode(inode);
nr_disposed++;
}
@@ -319,8 +325,6 @@ static int invalidate_list(struct list_h
inode = list_entry(tmp, struct inode, i_sb_list);
invalidate_inode_buffers(inode);
if (!atomic_read(&inode->i_count)) {
- hlist_del_init(&inode->i_hash);
- list_del(&inode->i_sb_list);
list_move(&inode->i_list, dispose);
inode->i_state |= I_FREEING;
count++;
@@ -455,8 +459,6 @@ static void prune_icache(int nr_to_scan)
if (!can_unuse(inode))
continue;
}
- hlist_del_init(&inode->i_hash);
- list_del_init(&inode->i_sb_list);
list_move(&inode->i_list, &freeable);
inode->i_state |= I_FREEING;
nr_pruned++;

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.