2005-04-27 13:16:02

by Artem Bityutskiy

[permalink] [raw]
Subject: [PATCH] VFS bugfix: two read_inode() calles without clear_inode() call between

Dear VFS developers,

is it possible to review/comment/apply the patch which fixes a severe
VFS bug which screws up JFFS2?

(the third attempt).

The problem description:
~~~~~~~~~~~~~~~~~~~~~~

prune_icache() removes inodes from the inode hash (inode->i_hash) and
drops the node_lock spinlock. If at that moment iget() is called, we end
up with the situation when VFS calls ->read_inode() twice for the same
inode without calling ->clear_inode() between. This happens despite of
the I_FREEING inode state because the inode is already removed from the
hash by the time find_inode_fast() is invoked.

The fix is: do not remove the inode from the hash too early.

The following patch fixes the problem. It was tested with JFFS2 (only)
and works perfectly.

Comments?



Signed-off-by: Artem B. Bityuckiy <[email protected]>


diff -auNrp linux-2.6.11.5/fs/inode.c linux-2.6.11.5_fixed/fs/inode.c
--- linux-2.6.11.5/fs/inode.c 2005-03-19 09:35:04.000000000 +0300
+++ linux-2.6.11.5_fixed/fs/inode.c 2005-04-18 17:54:16.000000000
+0400
@@ -284,6 +284,12 @@ static void dispose_list(struct list_hea
if (inode->i_data.nrpages)
truncate_inode_pages(&inode->i_data, 0);
clear_inode(inode);
+
+ spin_lock(&inode_lock);
+ hlist_del_init(&inode->i_hash);
+ list_del_init(&inode->i_sb_list);
+ spin_unlock(&inode_lock);
+
destroy_inode(inode);
nr_disposed++;
}
@@ -319,8 +325,6 @@ static int invalidate_list(struct list_h
inode = list_entry(tmp, struct inode, i_sb_list);
invalidate_inode_buffers(inode);
if (!atomic_read(&inode->i_count)) {
- hlist_del_init(&inode->i_hash);
- list_del(&inode->i_sb_list);
list_move(&inode->i_list, dispose);
inode->i_state |= I_FREEING;
count++;
@@ -455,8 +459,6 @@ static void prune_icache(int nr_to_scan)
if (!can_unuse(inode))
continue;
}
- hlist_del_init(&inode->i_hash);
- list_del_init(&inode->i_sb_list);
list_move(&inode->i_list, &freeable);
inode->i_state |= I_FREEING;
nr_pruned++;

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.


2005-04-27 13:51:54

by Jan Harkes

[permalink] [raw]
Subject: Re: [PATCH] VFS bugfix: two read_inode() calles without clear_inode() call between

On Wed, Apr 27, 2005 at 05:15:41PM +0400, Artem B. Bityuckiy wrote:
> Dear VFS developers,
>
> is it possible to review/comment/apply the patch which fixes a severe
> VFS bug which screws up JFFS2?
>
> (the third attempt).

I think this is a problem that hit Coda at some point. We used to have a
linked list that linked all the inodes together. However not long after
the RCU changes, I noticed that this linked list occasionally got
corrupted. I didn't want to dig too deep into all the dcache related
changes where I thought the problem came from, nobody else seemed to be
complaining and the linked list wasn't really necessary so I worked
around the problem by removing the linked list strucure.

But I think you found the actual cause and this patch looks good to me,
it probably should get merged.

Jan

2005-04-27 14:27:54

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] VFS bugfix: two read_inode() calles without clear_inode() call between

> The following patch fixes the problem. It was tested with JFFS2 (only)
> and works perfectly.

You should send it to Andrew Morton, so it gets some more testing.
FWIW it looks good to me too.

Miklos

2005-04-27 15:59:10

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] VFS bugfix: two read_inode() calles without clear_inode() call between

> Comments?
>

On second thought a wake_up_inode() seems to be missing in
dispose_list() just before destroy_inode().

Also I'm not sure delaying removal from i_sb_list is the right thing.
generic_delete_inode() does this before clear_inode().

Miklos

2005-04-27 16:19:15

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH] VFS bugfix: two read_inode() calles without clear_inode() call between

? ???, 27/04/2005 ? 17:57 +0200, Miklos Szeredi ?????:
> On second thought a wake_up_inode() seems to be missing in
> dispose_list() just before destroy_inode().
Good point, thanks.

> Also I'm not sure delaying removal from i_sb_list is the right thing.
> generic_delete_inode() does this before clear_inode().
As I can see, dispose_list() doesn't correlate with generic_delete_inode
() code path, so this mustn't be a problem.

What am I supposed to do next? I already sent the old patch with the
error to Andrew Morton. Should I notify him about this? Just resend?
Whatever?

Cheers,
Artem.

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.

2005-04-28 07:34:16

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH] VFS bugfix: two read_inode() calles without clear_inode() call between

> Why do you need to move it from prune_icache() to dispose_list()?
> For the hash list it's the right thing, but for sb_list it's not
> needed, is it?
Yes, it is not needed but harmless. I did it only because i_hash &
i_sb_list insertions/deletions always come in couple. So I decided move
them both, to be more consistent, to make code less complicated.

If you regard this hazardous I might split these removals. But IMHO, my
variant is a bit more pleasant.

> Retest, and resend.
Thanks.

I assume I don't need to notify Andrew about the inconsistency in the
old patch, or should I?

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.

2005-04-28 07:35:47

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] VFS bugfix: two read_inode() calles without clear_inode() call between

"Artem B. Bityuckiy" <[email protected]> wrote:
>
> I assume I don't need to notify Andrew about the inconsistency in the
> old patch, or should I?

Nope, just send the new patch. Please be sure to include a complete and
up-to-date explanation of what it does, and why - I haven't looked at this
yet.

2005-04-28 07:44:54

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] VFS bugfix: two read_inode() calles without clear_inode() call between

> > Why do you need to move it from prune_icache() to dispose_list()?
> > For the hash list it's the right thing, but for sb_list it's not
> > needed, is it?
> Yes, it is not needed but harmless. I did it only because i_hash &
> i_sb_list insertions/deletions always come in couple. So I decided move
> them both, to be more consistent, to make code less complicated.
>
> If you regard this hazardous I might split these removals. But IMHO, my
> variant is a bit more pleasant.

It's not just pleasentness. You should be _very_ careful with any
changes you make to this kind of code, and have a very clear
explanation why the change is needed, and why it won't do any trouble.

I didn't actually think about the sb_list stuff, but my feeling is you
should not move it unless there's a very clear reason to do so.

Miklos

2005-04-28 07:49:30

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH] VFS bugfix: two read_inode() calles without clear_inode() call between

> I didn't actually think about the sb_list stuff, but my feeling is you
> should not move it unless there's a very clear reason to do so.

Well, no problems. Probably safeness is of greater priority. I'll take
care not to touch it.

Thanks for review.

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.

2005-05-04 12:17:52

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH] VFS bugfix: two read_inode() calles without clear_inode() call between

Hello Andrew,

here you can find a new patch for the VFS bug which was discussed at
http://lkml.org/lkml/2005/4/27/84

I added wake_up_inode() invocation just as Miklos suggested.


Bug symptoms
~~~~~~~~~~~~
For the same inode VFS calls read_inode() twice and doesn't call
clear_inode() between the two read_inode() invocations.

Bug description
~~~~~~~~~~~~~~~
Suppose we have an inode which has zero reference count but is still in
the inode cache. Suppose kswapd invokes shrink_icache_memory() to free
some RAM. In prune_icache() inodes are removed from i_hash. prune_icache
() is then going to call clear_inode(), but drops the inode_lock
spinlock before this. If in this moment another task calls iget() for an
inode which was just removed from i_hash by prune_icache(), then iget()
invokes read_inode() for this inode, because it is *already removed*
from i_hash.

The end result is: we call iget(#N) then iput(#N); inode #N has zero
i_count now and is in the inode cache; kswapd starts. kswapd removes the
inode #N from i_hash ans is preempted; we call iget(#N) again;
read_inode() is invoked as the result; but we expect clear_inode()
before.

Fix
~~~~~~~
To fix the bug I remove inodes from i_hash later, when clear_inode() is
actually called. I remove them from i_hash under spinlock protection.
Since the i_state is set to I_FREEING, it is safe to do this. The others
will sleep waiting for the inode state change.

I also postpone removing inodes from i_sb_list. It is not compulsory to
do so but I do it for readability reasons. Inodes are added/removed to
the lists together everywhere in the code and there is no point to
change this rule. This is harmless because the only user of i_sb_list
which somehow may interfere with me (invalidate_list()) is excluded by
the iprune_sem mutex.

The same race is possible in invalidate_list() so I do the same for it.

The patch is against linux 2.6.11.5.
The patch was tested for JFFS2.

Please. apply/comment.

Cheers,
Artem.

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.


Attachments:
vfs-double_inode_read-2.diff (1.16 kB)

2005-05-04 20:04:38

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] VFS bugfix: two read_inode() calles without clear_inode() call between

"Artem B. Bityuckiy" <[email protected]> wrote:
>
> Bug symptoms
> ~~~~~~~~~~~~
> For the same inode VFS calls read_inode() twice and doesn't call
> clear_inode() between the two read_inode() invocations.
>
> Bug description
> ~~~~~~~~~~~~~~~
> Suppose we have an inode which has zero reference count but is still in
> the inode cache. Suppose kswapd invokes shrink_icache_memory() to free
> some RAM. In prune_icache() inodes are removed from i_hash. prune_icache
> () is then going to call clear_inode(), but drops the inode_lock
> spinlock before this. If in this moment another task calls iget() for an
> inode which was just removed from i_hash by prune_icache(), then iget()
> invokes read_inode() for this inode, because it is *already removed*
> from i_hash.

This sounds more like a bug in the iget() caller to me.

Question is: if the inode has zero refcount and is unhashed then how did
the caller get its sticky paws onto the inode* in the first place?

If the caller had saved a copy of the inode* in local storage then the
caller should have taken a ref against the inode.

If the caller had just looked up the inode via hastable lookup via
iget_whatever() then again the caller will have a ref on the inode.

So. Please tell us more about how the caller got into this situation.

2005-05-04 21:36:50

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH] VFS bugfix: two read_inode() calles without clear_inode() call between

On Wed, 2005-05-04 at 13:04 -0700, Andrew Morton wrote:
> This sounds more like a bug in the iget() caller to me.
>
> Question is: if the inode has zero refcount and is unhashed then how
> did the caller get its sticky paws onto the inode* in the first place?
>
> If the caller had saved a copy of the inode* in local storage then the
> caller should have taken a ref against the inode.
>
> If the caller had just looked up the inode via hastable lookup via
> iget_whatever() then again the caller will have a ref on the inode.
>
> So. Please tell us more about how the caller got into this situation.

I could explain in detail how JFFS2 garbage collection works, moving log
entries out of the way by calling iget() on the inode to which they
belong.... or I could just say "NFS".

--
dwmw2


2005-05-04 21:58:01

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] VFS bugfix: two read_inode() calles without clear_inode() call between

David Woodhouse <[email protected]> wrote:
>
> On Wed, 2005-05-04 at 13:04 -0700, Andrew Morton wrote:
> > This sounds more like a bug in the iget() caller to me.
> >
> > Question is: if the inode has zero refcount and is unhashed then how
> > did the caller get its sticky paws onto the inode* in the first place?
> >
> > If the caller had saved a copy of the inode* in local storage then the
> > caller should have taken a ref against the inode.
> >
> > If the caller had just looked up the inode via hastable lookup via
> > iget_whatever() then again the caller will have a ref on the inode.
> >
> > So. Please tell us more about how the caller got into this situation.
>
> I could explain in detail how JFFS2 garbage collection works, moving log
> entries out of the way by calling iget() on the inode to which they
> belong.... or I could just say "NFS".
>

That doesn't really settle the question of whether the callers are broken,
whether they are doing something which the VFS really should support and
what need to be done to the VFS to support it properly.

Looking at the proposed patch: what happens if an inode is on its way to
dispose_list() and someone tries to do an iget() on it? I don't think I_LOCK
is set, so __wait_on_freeing_inode() will no longer provide this guarantee:

/*
* If we try to find an inode in the inode hash while it is being deleted, we
* have to wait until the filesystem completes its deletion before reporting
* that it isn't found. This is because iget will immediately call
* ->read_inode, and we want to be sure that evidence of the deletion is found
* by ->read_inode.

Not only will we break the __wait_on_freeing_inode() guarantee, but we'll
break it in extremely rare circumstances.

And that's just from a quick peek. There may be other problems. Worried.

2005-05-05 09:10:56

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH] VFS bugfix: two read_inode() calles without clear_inode() call between

On Wed, 2005-05-04 at 14:58 -0700, Andrew Morton wrote:
> That doesn't really settle the question of whether the callers are broken,
> whether they are doing something which the VFS really should support and
> what need to be done to the VFS to support it properly.

Filesystems exported by NFS _will_ get iget() called for recently-
deleted inodes. JFFS2 and (I believe) NTFS also do internal things which
end up having the same effect.

The premise is simple: regardless of who calls iget() and when they do
it, the VFS should not call the filesystem's read_inode() method twice
consecutively for the same inode without ever calling clear_inode() or
delete_inode() in between.

That's what __wait_on_freeing_inode() was introduced for -- so we can
make sure the clear_inode() call actually happened before we call
read_inode() again for the same inode. Unfortunately there is still a
code path where we can get it wrong, and that's what Artem is fixing.

> Looking at the proposed patch: what happens if an inode is on its way to
> dispose_list() and someone tries to do an iget() on it? I don't think I_LOCK
> is set, so __wait_on_freeing_inode() will no longer provide this guarantee:

> /*
> * If we try to find an inode in the inode hash while it is being deleted, we
> * have to wait until the filesystem completes its deletion before reporting
> * that it isn't found. This is because iget will immediately call
> * ->read_inode, and we want to be sure that evidence of the deletion is found
> * by ->read_inode.

That comment isn't true any more. Look at what __wait_on_freeing_inode()
actually does, and observe the fact that all its callers actually loop
and start again after calling it.

The current implementation of __wait_on_freeing_inode() waits until it
_might_ have changed, not until it _has_ changed. That's why it's OK for
it just to be a yield() or a wait on a bit_waitqueue.

I'm not convinced I _like_ that implementation, mind you -- it's changed
since I last looked at it. But I don't see that there's anything
strictly broken about it.

--
dwmw2


2005-05-05 16:20:04

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] VFS bugfix: two read_inode() calles without clear_inode() call between

> That comment isn't true any more. Look at what __wait_on_freeing_inode()
> actually does, and observe the fact that all its callers actually loop
> and start again after calling it.
>
> The current implementation of __wait_on_freeing_inode() waits until it
> _might_ have changed, not until it _has_ changed. That's why it's OK for
> it just to be a yield() or a wait on a bit_waitqueue.
>
> I'm not convinced I _like_ that implementation, mind you -- it's changed
> since I last looked at it. But I don't see that there's anything
> strictly broken about it.

Using yield() to wait for a precisely defined event (clear_inode()
finishing) doesn't seem like a very good idea. Especially, since
Artem's patch will probably make it trigger more often.

How about this (totally untested) patch? Even if I_LOCK is not set
initially, wake_up_inode() should do the right thing and wake up the
waiting task after clear_inode(). It shouldn't cause spurious
wakeups, since there should be no other reference to the inode.

Miklos

--- inode.c~ 2005-05-02 11:24:49.000000000 +0200
+++ inode.c 2005-05-05 18:12:57.000000000 +0200
@@ -1264,18 +1264,6 @@ static void __wait_on_freeing_inode(stru
{
wait_queue_head_t *wq;
DEFINE_WAIT_BIT(wait, &inode->i_state, __I_LOCK);
-
- /*
- * I_FREEING and I_CLEAR are cleared in process context under
- * inode_lock, so we have to give the tasks who would clear them
- * a chance to run and acquire inode_lock.
- */
- if (!(inode->i_state & I_LOCK)) {
- spin_unlock(&inode_lock);
- yield();
- spin_lock(&inode_lock);
- return;
- }
wq = bit_waitqueue(&inode->i_state, __I_LOCK);
prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
spin_unlock(&inode_lock);

2005-05-06 11:09:05

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH] VFS bugfix: two read_inode() calles without clear_inode() call between

On Thu, 2005-05-05 at 18:18 +0200, Miklos Szeredi wrote:
> Using yield() to wait for a precisely defined event (clear_inode()
> finishing) doesn't seem like a very good idea. Especially, since
> Artem's patch will probably make it trigger more often.

Agreed. Even before Artem's patch, we're still effectively busy-waiting
for something which calls back into the file system's clear_inode method
and may well sleep and perform I/O.

> How about this (totally untested) patch? Even if I_LOCK is not set
> initially, wake_up_inode() should do the right thing and wake up the
> waiting task after clear_inode(). It shouldn't cause spurious
> wakeups, since there should be no other reference to the inode.

Since Artem introduced a wake_up_inode() in dispose_list(), your patch
seems reasonable.

--
dwmw2