2005-09-20 00:48:05

by John McCutchan

[permalink] [raw]
Subject: [patch] stop inotify from sending random DELETE_SELF event under load

Yo,

Below is a patch that fixes the random DELETE_SELF events when the
system is under load. The problem is that the DELETE_SELF event is sent
from dentry_iput, which is called in two code paths,

1) When a dentry is being deleted
2) When the dcache is being pruned.

We only want to send the event in case 1.

This has been spotted in the wild, and should be put into 2.6.14. It
fixes http://bugzilla.kernel.org/show_bug.cgi?id=5279

Yes, the patch makes me cringe to, but we need fsnotify_inodremove where
it is to avoid locking problems and a race.


Signed-off-by: John McCutchan <[email protected]>

Index: linux-2.6.13/fs/dcache.c
===================================================================
--- linux-2.6.13.orig/fs/dcache.c 2005-08-28 19:41:01.000000000 -0400
+++ linux-2.6.13/fs/dcache.c 2005-09-19 15:30:03.000000000 -0400
@@ -94,7 +94,7 @@
* d_iput() operation if defined.
* Called with dcache_lock and per dentry lock held, drops both.
*/
-static inline void dentry_iput(struct dentry * dentry)
+static inline void dentry_iput(struct dentry * dentry, int call_inoderemove)
{
struct inode *inode = dentry->d_inode;
if (inode) {
@@ -102,7 +102,8 @@
list_del_init(&dentry->d_alias);
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
- fsnotify_inoderemove(inode);
+ if (call_inoderemove)
+ fsnotify_inoderemove(inode);
if (dentry->d_op && dentry->d_op->d_iput)
dentry->d_op->d_iput(dentry, inode);
else
@@ -195,7 +196,7 @@
list_del(&dentry->d_child);
dentry_stat.nr_dentry--; /* For d_free, below */
/*drops the locks, at that point nobody can reach this dentry */
- dentry_iput(dentry);
+ dentry_iput(dentry, 0);
parent = dentry->d_parent;
d_free(dentry);
if (dentry == parent)
@@ -370,7 +371,7 @@
__d_drop(dentry);
list_del(&dentry->d_child);
dentry_stat.nr_dentry--; /* For d_free, below */
- dentry_iput(dentry);
+ dentry_iput(dentry, 0);
parent = dentry->d_parent;
d_free(dentry);
if (parent != dentry)
@@ -1175,7 +1176,7 @@
spin_lock(&dentry->d_lock);
isdir = S_ISDIR(dentry->d_inode->i_mode);
if (atomic_read(&dentry->d_count) == 1) {
- dentry_iput(dentry);
+ dentry_iput(dentry, 1);
fsnotify_nameremove(dentry, isdir);
return;
}


2005-09-20 01:37:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load



On Mon, 19 Sep 2005, John McCutchan wrote:
>
> Below is a patch that fixes the random DELETE_SELF events when the
> system is under load. The problem is that the DELETE_SELF event is sent
> from dentry_iput, which is called in two code paths,
>
> 1) When a dentry is being deleted
> 2) When the dcache is being pruned.

No no.

The problem is that you put the "fsnotify_inoderemove(inode);" in the
wrong place, and I and Al never noticed.

iput() doesn't have anything to do with delete at all, and adding a flag
to it would be wrong. The inode may stay around _after_ the unlink() for
as long as it has users (or much longer, if you have hardlinks ;).

You should probably move the "fsnotify_inoderemove(inode);" call into
generic_delete_inode() instead, just after "security_inode_delete()". No
new flags, just a new place.

(Oh, I think you need to add it to "hugetlbfs_delete_inode()" too).

There's still a potential problem there: some network filesystems seem to
use "generic_delete_inode()" as their "drop_inode" thing. Which may mean
that you get spurious delete messages when the reference is dropped. I
don't see how to avoid that, though - we fundamentally don't _know_ when
the inode actually gets deleted.

Al, do you have any comments? Anything stupid I missed?

Linus

2005-09-20 01:59:33

by John McCutchan

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Mon, 2005-09-19 at 18:37 -0700, Linus Torvalds wrote:
>
> On Mon, 19 Sep 2005, John McCutchan wrote:
> >
> > Below is a patch that fixes the random DELETE_SELF events when the
> > system is under load. The problem is that the DELETE_SELF event is sent
> > from dentry_iput, which is called in two code paths,
> >
> > 1) When a dentry is being deleted
> > 2) When the dcache is being pruned.
>
> No no.
>
> The problem is that you put the "fsnotify_inoderemove(inode);" in the
> wrong place, and I and Al never noticed.
>

To quote you:

Instead of the broken fsnotify_unlink/fsnotify_rmdir functions, you can
split this into two logically _different_ functions:

- fsnotify_nameremove(dentry) - called when the dentry goes away
- fsnotify_inoderemove(dentry) - called when the inode goes away

...

The fsnotify_inoderemove() is called from dentry_iput(), and that's the
one that specifies that an actual inode no longer exists.


;)

> iput() doesn't have anything to do with delete at all, and adding a flag
> to it would be wrong. The inode may stay around _after_ the unlink() for
> as long as it has users (or much longer, if you have hardlinks ;).
>
> You should probably move the "fsnotify_inoderemove(inode);" call into
> generic_delete_inode() instead, just after "security_inode_delete()". No
> new flags, just a new place.
>
> (Oh, I think you need to add it to "hugetlbfs_delete_inode()" too).
>
> There's still a potential problem there: some network filesystems seem to
> use "generic_delete_inode()" as their "drop_inode" thing. Which may mean
> that you get spurious delete messages when the reference is dropped. I
> don't see how to avoid that, though - we fundamentally don't _know_ when
> the inode actually gets deleted.
>


I don't think moving it to generic_delete_inode is the right place.
Anyways, generic_delete_inode is called when the final reference on the
inode is released, but inotify keeps a reference on the inode, so I
don't think it would work.

fsnotify_inoderemove should be called after the dentry for the file is
removed, not when the inode actually goes away. The behaviour inotify
users expect is:

$ watch /tmp/foo (wd = 0)
$ rm /tmp/foo
event sent: DELETE_SELF (wd = 0)

Inotify doesn't care if the inode for /tmp/foo is sticking around for
whatever reason. As far as inotify is concerned, the file is deleted.

--
John McCutchan <[email protected]>

2005-09-20 02:20:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load



On Mon, 19 Sep 2005, John McCutchan wrote:
>
> To quote you:

Yeah, sometimes I'm more right than other times. That wasn't one of them.

It's actually _almost_ right. The problem being that dentry_iput() is
called for non-delete events too.

However, your patch is _worse_. Your patch will cause it not to report the
delete at all, because what will happen is that if the delete() is done
while somebody else has a pointer to the dentry, then we won't call
"dentry_iput()" with a "delete" AT ALL. We will only call it later when
the _other_ person (who didn't do a delete) releases the dentry.

See? It's very very wrong to send a flag that depends on the call-chain,
because the call-chain is _not_ what determines whether the inode gets
deleted or not.

The only way to know whether it gets deleted or not is whan the actual
i_nlink goes down to 0, and the inode gets deleted. Ie exactly the
generic_delete_inode() case.

But if you keep a reference to the inode, that will never actually happen.
Hmm.

Who wants that inode delete event anyway? It's fundamentally harder than
removing a name, partly because of the delayed delete, partly because an
inode may be reachable multiple ways.

Maybe this patch instead? It's not going to be reliable on networked
filesystems, though. Nothing is.

Linus

---
diff --git a/fs/dcache.c b/fs/dcache.c
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -102,7 +102,8 @@ static inline void dentry_iput(struct de
list_del_init(&dentry->d_alias);
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
- fsnotify_inoderemove(inode);
+ if (!inode->i_nlink)
+ fsnotify_inoderemove(inode);
if (dentry->d_op && dentry->d_op->d_iput)
dentry->d_op->d_iput(dentry, inode);
else

2005-09-20 03:31:50

by Al Viro

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Mon, Sep 19, 2005 at 06:37:43PM -0700, Linus Torvalds wrote:
>
>
> On Mon, 19 Sep 2005, John McCutchan wrote:
> >
> > Below is a patch that fixes the random DELETE_SELF events when the
> > system is under load. The problem is that the DELETE_SELF event is sent
> > from dentry_iput, which is called in two code paths,
> >
> > 1) When a dentry is being deleted
> > 2) When the dcache is being pruned.
>
> No no.
>
> The problem is that you put the "fsnotify_inoderemove(inode);" in the
> wrong place, and I and Al never noticed.
>
> iput() doesn't have anything to do with delete at all, and adding a flag
> to it would be wrong. The inode may stay around _after_ the unlink() for
> as long as it has users (or much longer, if you have hardlinks ;).
>
> You should probably move the "fsnotify_inoderemove(inode);" call into
> generic_delete_inode() instead, just after "security_inode_delete()". No
> new flags, just a new place.
>
> (Oh, I think you need to add it to "hugetlbfs_delete_inode()" too).
>
> There's still a potential problem there: some network filesystems seem to
> use "generic_delete_inode()" as their "drop_inode" thing. Which may mean
> that you get spurious delete messages when the reference is dropped. I
> don't see how to avoid that, though - we fundamentally don't _know_ when
> the inode actually gets deleted.
>
> Al, do you have any comments? Anything stupid I missed?

One fundamentally stupid thing is exposing to userland events that
are none of its fscking business. Link removal - sure. Last link
removal - perhaps, but that's not obvious; in any case it should be
tied to notification of link removal. But inode getting freed or
last dentry going away is none of the userland concern.

IOW, let the notification of link removal check ->i_nlink and add
"that was the last one" event if it had hit zero. Or, better yet,
pass i_mode and i_nlink in the event itself and let the recepient
do obvious checks.

2005-09-20 03:33:56

by Al Viro

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Mon, Sep 19, 2005 at 10:00:41PM -0400, John McCutchan wrote:
> To quote you:
>
> Instead of the broken fsnotify_unlink/fsnotify_rmdir functions, you can
> split this into two logically _different_ functions:
>
> - fsnotify_nameremove(dentry) - called when the dentry goes away
> - fsnotify_inoderemove(dentry) - called when the inode goes away
>
> ...
>
> The fsnotify_inoderemove() is called from dentry_iput(), and that's the
> one that specifies that an actual inode no longer exists.
>
>
> ;)

That was wrong. Just have *one* function (on link removal) and stuff
i_mode and i_nlink of victim into event.

2005-09-20 03:48:54

by John McCutchan

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Tue, 2005-09-20 at 04:33 +0100, Al Viro wrote:
> On Mon, Sep 19, 2005 at 10:00:41PM -0400, John McCutchan wrote:
> > To quote you:
> >
> > Instead of the broken fsnotify_unlink/fsnotify_rmdir functions, you can
> > split this into two logically _different_ functions:
> >
> > - fsnotify_nameremove(dentry) - called when the dentry goes away
> > - fsnotify_inoderemove(dentry) - called when the inode goes away
> >
> > ...
> >
> > The fsnotify_inoderemove() is called from dentry_iput(), and that's the
> > one that specifies that an actual inode no longer exists.
> >
> >
> > ;)
>
> That was wrong. Just have *one* function (on link removal) and stuff
> i_mode and i_nlink of victim into event.

We have already locked down the ABI of the event structure. Hopefully
Linus's patch will fix this in a clean way.

--
John McCutchan <[email protected]>

2005-09-20 03:45:34

by John McCutchan

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Mon, 2005-09-19 at 19:20 -0700, Linus Torvalds wrote:
>
> On Mon, 19 Sep 2005, John McCutchan wrote:
> >
> > To quote you:
>
> Yeah, sometimes I'm more right than other times. That wasn't one of them.
>
> It's actually _almost_ right. The problem being that dentry_iput() is
> called for non-delete events too.
>
> However, your patch is _worse_. Your patch will cause it not to report the
> delete at all, because what will happen is that if the delete() is done
> while somebody else has a pointer to the dentry, then we won't call
> "dentry_iput()" with a "delete" AT ALL. We will only call it later when
> the _other_ person (who didn't do a delete) releases the dentry.
>
> See? It's very very wrong to send a flag that depends on the call-chain,
> because the call-chain is _not_ what determines whether the inode gets
> deleted or not.
>

Yeah, I see this now.

> The only way to know whether it gets deleted or not is whan the actual
> i_nlink goes down to 0, and the inode gets deleted. Ie exactly the
> generic_delete_inode() case.
>
> But if you keep a reference to the inode, that will never actually happen.
> Hmm.
>
> Who wants that inode delete event anyway? It's fundamentally harder than
> removing a name, partly because of the delayed delete, partly because an
> inode may be reachable multiple ways.
>

I think the name fsnotify_inoderemove is causing some confusion. We only
care that some name that is pointing to this inode has been deleted.
Remember, it was suggested as a replacement for fsnotify_unlink. We
don't care if the inode is actually going away or not.

> Maybe this patch instead? It's not going to be reliable on networked
> filesystems, though. Nothing is.
>

Thanks, I will have it tested.

--
John McCutchan <[email protected]>

2005-09-20 03:50:11

by John McCutchan

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Tue, 2005-09-20 at 04:31 +0100, Al Viro wrote:
> On Mon, Sep 19, 2005 at 06:37:43PM -0700, Linus Torvalds wrote:
> >
> >
> > On Mon, 19 Sep 2005, John McCutchan wrote:
> > >
> > > Below is a patch that fixes the random DELETE_SELF events when the
> > > system is under load. The problem is that the DELETE_SELF event is sent
> > > from dentry_iput, which is called in two code paths,
> > >
> > > 1) When a dentry is being deleted
> > > 2) When the dcache is being pruned.
> >
> > No no.
> >
> > The problem is that you put the "fsnotify_inoderemove(inode);" in the
> > wrong place, and I and Al never noticed.
> >
> > iput() doesn't have anything to do with delete at all, and adding a flag
> > to it would be wrong. The inode may stay around _after_ the unlink() for
> > as long as it has users (or much longer, if you have hardlinks ;).
> >
> > You should probably move the "fsnotify_inoderemove(inode);" call into
> > generic_delete_inode() instead, just after "security_inode_delete()". No
> > new flags, just a new place.
> >
> > (Oh, I think you need to add it to "hugetlbfs_delete_inode()" too).
> >
> > There's still a potential problem there: some network filesystems seem to
> > use "generic_delete_inode()" as their "drop_inode" thing. Which may mean
> > that you get spurious delete messages when the reference is dropped. I
> > don't see how to avoid that, though - we fundamentally don't _know_ when
> > the inode actually gets deleted.
> >
> > Al, do you have any comments? Anything stupid I missed?
>
> One fundamentally stupid thing is exposing to userland events that
> are none of its fscking business. Link removal - sure. Last link
> removal - perhaps, but that's not obvious; in any case it should be
> tied to notification of link removal. But inode getting freed or
> last dentry going away is none of the userland concern.

I just thought I should clarify exactly when we want to send the
DELETE_SELF event to user space:

A path which points to inode X has been deleted.

--
John McCutchan <[email protected]>

2005-09-20 04:03:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load



On Mon, 19 Sep 2005, John McCutchan wrote:
>
> I think the name fsnotify_inoderemove is causing some confusion. We only
> care that some name that is pointing to this inode has been deleted.
> Remember, it was suggested as a replacement for fsnotify_unlink. We
> don't care if the inode is actually going away or not.

Ahh.

Well, the problem is one of ordering. You could do it unconditionally at
the top of d_delete(). If that's ok, then good.

The problem with that is that the name will still be available for a while
afterwards - another process could look it up on another CPU.

And the _name_ won't be gone until after we've already dropped the inode.
Remember? You got oopses because you were trying to access an inode that
simply didn't exist any more..

That's where "dentry_iput()" comes in. It's after you've removed the name,
but before the inode is gone. However, then you do end up having the
problem that you can't tell a delete from a "drop the dcache entry" any
more.

One possibility is to mark the dentry deleted in d_flags. That would mean
something like this (against the just-pushed-put v2.6.14-rc2, which has
my previous hack).

Untested. Al?

Linus
---
diff --git a/fs/dcache.c b/fs/dcache.c
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -102,7 +102,7 @@ static inline void dentry_iput(struct de
list_del_init(&dentry->d_alias);
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
- if (!inode->i_nlink)
+ if (dentry->d_flags & DCACHE_DELETED)
fsnotify_inoderemove(inode);
if (dentry->d_op && dentry->d_op->d_iput)
dentry->d_op->d_iput(dentry, inode);
@@ -1166,6 +1166,7 @@ void d_delete(struct dentry * dentry)
*/
spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_DELETED;
isdir = S_ISDIR(dentry->d_inode->i_mode);
if (atomic_read(&dentry->d_count) == 1) {
dentry_iput(dentry);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -155,6 +155,7 @@ d_iput: no no no yes

#define DCACHE_REFERENCED 0x0008 /* Recently used, don't discard. */
#define DCACHE_UNHASHED 0x0010
+#define DCACHE_DELETED 0x0020

extern spinlock_t dcache_lock;

2005-09-20 04:24:58

by Al Viro

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Mon, Sep 19, 2005 at 09:03:36PM -0700, Linus Torvalds wrote:
> One possibility is to mark the dentry deleted in d_flags. That would mean
> something like this (against the just-pushed-put v2.6.14-rc2, which has
> my previous hack).
>
> Untested. Al?

Uhh... I still don't understand which behaviour do you want.

* removal of this link, postponed to indefinite future (until we
do not have any users of that dentry) [ new behaviour ]
* moment when the last link is gone _and_ nobody uses any dentries
pointing to object, with information taken from the last one still in use
[ old behaviour ]
* actual destruction of file [ but we won't have any names to report ]
* removal of this link, at the moment when it stops being accessible
[ none of the above, better done from vfs_...() ]

2005-09-20 04:26:13

by John McCutchan

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Mon, 2005-09-19 at 21:03 -0700, Linus Torvalds wrote:
>
> On Mon, 19 Sep 2005, John McCutchan wrote:
> >
> > I think the name fsnotify_inoderemove is causing some confusion. We only
> > care that some name that is pointing to this inode has been deleted.
> > Remember, it was suggested as a replacement for fsnotify_unlink. We
> > don't care if the inode is actually going away or not.
>
> Ahh.
>
> Well, the problem is one of ordering. You could do it unconditionally at
> the top of d_delete(). If that's ok, then good.
>
> The problem with that is that the name will still be available for a while
> afterwards - another process could look it up on another CPU.
>
> And the _name_ won't be gone until after we've already dropped the inode.
> Remember? You got oopses because you were trying to access an inode that
> simply didn't exist any more..
>
> That's where "dentry_iput()" comes in. It's after you've removed the name,
> but before the inode is gone. However, then you do end up having the
> problem that you can't tell a delete from a "drop the dcache entry" any
> more.

Yep, that sums it up. The new patch looks good, I will have it tested.

--
John McCutchan <[email protected]>

2005-09-20 04:30:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load



On Tue, 20 Sep 2005, Al Viro wrote:
>
> Uhh... I still don't understand which behaviour do you want.
>
> * removal of this link, postponed to indefinite future (until we
> do not have any users of that dentry) [ new behaviour ]
> * moment when the last link is gone _and_ nobody uses any dentries
> pointing to object, with information taken from the last one still in use
> [ old behaviour ]

Old behaviour (well, "old" is relative) is apparently the expected one.

But the old behaviour had a bug: we _also_ call dentry_iput() for
non-deletes.

Linus

2005-09-20 04:34:29

by John McCutchan

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Tue, 2005-09-20 at 05:24 +0100, Al Viro wrote:
> On Mon, Sep 19, 2005 at 09:03:36PM -0700, Linus Torvalds wrote:
> > One possibility is to mark the dentry deleted in d_flags. That would mean
> > something like this (against the just-pushed-put v2.6.14-rc2, which has
> > my previous hack).
> >
> > Untested. Al?
>
> Uhh... I still don't understand which behaviour do you want.


> * removal of this link, at the moment when it stops being accessible
> [ none of the above, better done from vfs_...() ]

That is the behaviour we want, how does Linus's second patch not
accomplish this?

--
John McCutchan <[email protected]>

2005-09-20 04:46:24

by Al Viro

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Tue, Sep 20, 2005 at 12:36:11AM -0400, John McCutchan wrote:
> On Tue, 2005-09-20 at 05:24 +0100, Al Viro wrote:
> > On Mon, Sep 19, 2005 at 09:03:36PM -0700, Linus Torvalds wrote:
> > > One possibility is to mark the dentry deleted in d_flags. That would mean
> > > something like this (against the just-pushed-put v2.6.14-rc2, which has
> > > my previous hack).
> > >
> > > Untested. Al?
> >
> > Uhh... I still don't understand which behaviour do you want.
>
>
> > * removal of this link, at the moment when it stops being accessible
> > [ none of the above, better done from vfs_...() ]
>
> That is the behaviour we want, how does Linus's second patch not
> accomplish this?

fd = open("foo", 0);
unlink("foo");
sleep for ten days
close(fd);

Linus' patch will send event on close(). Ten days since the moment
when any lookups on foo would bring you -ENOENT.

Could you please describe the semantics of your events?

2005-09-20 04:51:26

by John McCutchan

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Tue, 2005-09-20 at 05:46 +0100, Al Viro wrote:
> On Tue, Sep 20, 2005 at 12:36:11AM -0400, John McCutchan wrote:
> > On Tue, 2005-09-20 at 05:24 +0100, Al Viro wrote:
> > > On Mon, Sep 19, 2005 at 09:03:36PM -0700, Linus Torvalds wrote:
> > > > One possibility is to mark the dentry deleted in d_flags. That would mean
> > > > something like this (against the just-pushed-put v2.6.14-rc2, which has
> > > > my previous hack).
> > > >
> > > > Untested. Al?
> > >
> > > Uhh... I still don't understand which behaviour do you want.
> >
> >
> > > * removal of this link, at the moment when it stops being accessible
> > > [ none of the above, better done from vfs_...() ]
> >
> > That is the behaviour we want, how does Linus's second patch not
> > accomplish this?
>
> fd = open("foo", 0);
> unlink("foo");
> sleep for ten days
> close(fd);
>
> Linus' patch will send event on close(). Ten days since the moment
> when any lookups on foo would bring you -ENOENT.
>


Ahh, got it.

> Could you please describe the semantics of your events?

DELETE_SELF WD=X

The path you requested a watch on (inotify_add_watch(path,mask) returned
X) has been deleted.

--
John McCutchan <[email protected]>

2005-09-20 04:52:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load



On Tue, 20 Sep 2005, John McCutchan wrote:
>
> > * removal of this link, at the moment when it stops being accessible
> > [ none of the above, better done from vfs_...() ]
>
> That is the behaviour we want, how does Linus's second patch not
> accomplish this?

My latest patch will still wait for any other process that has that _path_
open to release it.

The reason? It looks "easy" to do it from d_delete(), but the thing is, at
the point where we've released the d_lock spinlock, the "struct inode" is
gone, gone, gone. And we don't want to do the notification _while_ we hold
the spinlocks either.

So we can either do it the easy way - _before_ we get any spinlocks (but
that means that processes will be notified before the name is actually
gone), or we have to wait until _after_ we've unhashed the dentry and
released the spinlocks.

But waiting until after that automatically means that the inode isn't
stable any more: it might be gone.

The "fsnotify_nameremove()" thing doesn't have this problem, because it
simply doesn't even care about the inode - it only cares about the dentry,
which is stable.

This is why movign the release to "dentry_iput()" helps us - it's the
point where we release the dentry spinlocks, and it's also where the inode
actually goes away. In other words, it's the _one_ point where we can
insert the notification outside of the dcache locks but before the inode
is gone.

It's a sligtly inconvenient place, though. And it does mean that if the
dentry was in use by something else when the delete happened, the "inode
is gone" notification will be delayed until the dentry is really free'd.
However, at least at that point it's really a per-path thing, and it won't
have any other global issues (ie hardlinks etc do not come into play with
my last patch).

If you want immediate notification when the name disappears, you'd better
listen to the "nameremove" thing.

I don't think we can reasonably do better than the last patch, but maybe
Al sees something I've missed.

Linus

2005-09-20 04:56:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load



On Tue, 20 Sep 2005, Al Viro wrote:
>
> Could you please describe the semantics of your events?

I think the good news is that we can make up the exact semantics. This
thing is used by things like file managers etc, and we've had very fuzzy
semantics on inotify, while the new dnotify is totally new.

So it's a give-and-take of "what are the semantics people want" and "what
are reasonable semantics from the kernel VFS standpoint". I suspect we're
slowly converging on something that works for both parties ;)

Linus

2005-09-20 04:58:39

by Al Viro

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Tue, Sep 20, 2005 at 12:53:12AM -0400, John McCutchan wrote:
> DELETE_SELF WD=X
>
> The path you requested a watch on (inotify_add_watch(path,mask) returned
> X) has been deleted.

Then why the devil do we have IN_DELETE and IN_DELETE_SELF generated
in different places? The only difference is in who receives the
event - you send IN_DELETE to watchers on parent and IN_DELETE_SELF
on watchers on victim. Event itself is the same, judging by your
description...

2005-09-20 05:04:37

by John McCutchan

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Tue, 2005-09-20 at 05:58 +0100, Al Viro wrote:
> On Tue, Sep 20, 2005 at 12:53:12AM -0400, John McCutchan wrote:
> > DELETE_SELF WD=X
> >
> > The path you requested a watch on (inotify_add_watch(path,mask) returned
> > X) has been deleted.
>
> Then why the devil do we have IN_DELETE and IN_DELETE_SELF generated
> in different places? The only difference is in who receives the
> event - you send IN_DELETE to watchers on parent and IN_DELETE_SELF
> on watchers on victim. Event itself is the same, judging by your
> description...

No, because in the case of IN_DELETE, the path represented by the WD
hasn't been deleted, it is "PATH(WD)/event->name" that has been. Also,
IN_DELETE_SELF marks the death of the WD, no further events will be sent
with the same WD [Except for the IN_IGNORE].

--
John McCutchan <[email protected]>

2005-09-20 05:17:32

by Al Viro

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Tue, Sep 20, 2005 at 01:06:23AM -0400, John McCutchan wrote:
> On Tue, 2005-09-20 at 05:58 +0100, Al Viro wrote:
> > On Tue, Sep 20, 2005 at 12:53:12AM -0400, John McCutchan wrote:
> > > DELETE_SELF WD=X
> > >
> > > The path you requested a watch on (inotify_add_watch(path,mask) returned
> > > X) has been deleted.
> >
> > Then why the devil do we have IN_DELETE and IN_DELETE_SELF generated
> > in different places? The only difference is in who receives the
> > event - you send IN_DELETE to watchers on parent and IN_DELETE_SELF
> > on watchers on victim. Event itself is the same, judging by your
> > description...
>
> No, because in the case of IN_DELETE, the path represented by the WD
> hasn't been deleted, it is "PATH(WD)/event->name" that has been.

That's OK - same thing described for different recepients, thus two
events with different contents and type being sent.

> Also,
> IN_DELETE_SELF marks the death of the WD, no further events will be sent
> with the same WD [Except for the IN_IGNORE].

Uh-oh... Now, _that_ is rather interesting - you are giving self-contradictory
descriptions of the semantics.

fd = open("foo", 0);
unlink("foo");
sleep for a day
fchmod(fd, 0400);
sleep for a day
close(fd);

Which events do we have here? Removal of path happens at unlink(); change
of attributes - a day later.

2005-09-20 08:33:31

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Mon, Sep 19, 2005 at 06:37:43PM -0700, Linus Torvalds wrote:
>
>
> On Mon, 19 Sep 2005, John McCutchan wrote:
> >
> > Below is a patch that fixes the random DELETE_SELF events when the
> > system is under load. The problem is that the DELETE_SELF event is sent
> > from dentry_iput, which is called in two code paths,
> >
> > 1) When a dentry is being deleted
> > 2) When the dcache is being pruned.
>
> No no.
>
> The problem is that you put the "fsnotify_inoderemove(inode);" in the
> wrong place, and I and Al never noticed.
>
> iput() doesn't have anything to do with delete at all, and adding a flag
> to it would be wrong. The inode may stay around _after_ the unlink() for
> as long as it has users (or much longer, if you have hardlinks ;).
>
> You should probably move the "fsnotify_inoderemove(inode);" call into
> generic_delete_inode() instead, just after "security_inode_delete()". No
> new flags, just a new place.
>
> (Oh, I think you need to add it to "hugetlbfs_delete_inode()" too).

I have a patch pending to kill hugetlbfs_delete_inode().

2005-09-20 12:34:34

by John McCutchan

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load


On 20-Sep-05, at 1:17 AM, Al Viro wrote:

> On Tue, Sep 20, 2005 at 01:06:23AM -0400, John McCutchan wrote:
>
>> On Tue, 2005-09-20 at 05:58 +0100, Al Viro wrote:
>>
>>> On Tue, Sep 20, 2005 at 12:53:12AM -0400, John McCutchan wrote:
>>>
>>>> DELETE_SELF WD=X
>>>>
>>>> The path you requested a watch on (inotify_add_watch(path,mask)
>>>> returned
>>>> X) has been deleted.
>>>>
>>>
>>> Then why the devil do we have IN_DELETE and IN_DELETE_SELF generated
>>> in different places? The only difference is in who receives the
>>> event - you send IN_DELETE to watchers on parent and IN_DELETE_SELF
>>> on watchers on victim. Event itself is the same, judging by your
>>> description...
>>>
>>
>> No, because in the case of IN_DELETE, the path represented by the WD
>> hasn't been deleted, it is "PATH(WD)/event->name" that has been.
>>
>
> That's OK - same thing described for different recepients, thus two
> events with different contents and type being sent.
>
>
>> Also,
>> IN_DELETE_SELF marks the death of the WD, no further events will
>> be sent
>> with the same WD [Except for the IN_IGNORE].
>>
>
> Uh-oh... Now, _that_ is rather interesting - you are giving self-
> contradictory
> descriptions of the semantics.
>

Where is the contradiction?

> fd = open("foo", 0);
> unlink("foo");
> sleep for a day
> fchmod(fd, 0400);
> sleep for a day
> close(fd);
>
> Which events do we have here? Removal of path happens at unlink();
> change
> of attributes - a day later.
>

[I'm assuming that fchmod continues to work even if the path has been
deleted.]

With Linus's latest patch:

IN_ATTRIB
IN_DELETE_SELF
IN_IGNORE

If we were able to get inoderemove called when the path removal happens,

IN_DELETE_SELF
IN_IGNORE

At this point, inotify would stop monitoring the inode, and we would
never see the fchmod.

John McCutchan
[email protected]



2005-09-20 16:39:03

by Al Viro

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Tue, Sep 20, 2005 at 08:34:20AM -0400, John McCutchan wrote:
> Where is the contradiction?
>
> >fd = open("foo", 0);
> >unlink("foo");
> >sleep for a day
> >fchmod(fd, 0400);
> >sleep for a day
> >close(fd);
> >
> >Which events do we have here? Removal of path happens at unlink();
> >change
> >of attributes - a day later.
> >
>
> [I'm assuming that fchmod continues to work even if the path has been
> deleted.]

Of course it does.

> With Linus's latest patch:
>
> IN_ATTRIB
> IN_DELETE_SELF
> IN_IGNORE

And IN_DELETE_SELF comes days after the path removal; moreover, it
comes after event that happened after the path removal. To make it
even funnier, have your code do fchmod() only if stat("foo", ...)
returns -ENOENT.

> If we were able to get inoderemove called when the path removal happens,
>
> IN_DELETE_SELF
> IN_IGNORE
>
> At this point, inotify would stop monitoring the inode, and we would
> never see the fchmod.

So if you have
ln a b
start watching b
rm a
chmod 400 b
you'll get no event on that chmod? Both with the current version and
one with your "if we were able" above...

I don't get it. Could you please describe what your code is _supposed_
to do? I'm not even talking about implementation - it's on the level
of "what do we want the watchers to see after the following operations".
Especially since you have fixed ABI - if you have userland clients
relying on the representation of individual events, surely they also
have to assume something about the sequence of events generated when
we do this and that to files and directories?

2005-09-20 17:44:25

by Ray Lee

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Tue, 2005-09-20 at 17:38 +0100, Al Viro wrote:
> I don't get it. Could you please describe what your code is _supposed_
> to do? I'm not even talking about implementation - it's on the level
> of "what do we want the watchers to see after the following operations".

I can't even talk to that level, but perhaps it'd help to know that some
(I think) are pinning their hopes on inotify as the foundation of a
userspace negative dentry cache (i.e., samba trying to prove a set of
filenames (case-insensitively) doesn't exist).

By that point of view:

> ln a b
> start watching b
> rm a
> chmod 400 b

...it's quite clearly important to continue to get b's events.

Continuing:

> >fd = open("foo", 0);
> >unlink("foo");
> >sleep for a day
> >fchmod(fd, 0400);
> >sleep for a day
> >close(fd);

...I'd say that providing the fchmod to userspace would be a good thing.
That fd may be available to multiple processes, and so those processes
could still be validly interested in events upon it. However, this is an
obscure case that could be handled by polling with fstat, so if it's
unreasonable for other reasons, toss the idea.

Ray

2005-09-20 18:12:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load



On Tue, 20 Sep 2005, Ray Lee wrote:
>
> I can't even talk to that level, but perhaps it'd help to know that some
> (I think) are pinning their hopes on inotify as the foundation of a
> userspace negative dentry cache (i.e., samba trying to prove a set of
> filenames (case-insensitively) doesn't exist).

Note that than you should use the _name_ caching part, ie the
fsnotify_nameremove() part of the equation. That part is unambiguous.

It's literally only the "inode" things (IN_DELETE_SELF) that are
questionable. And that's fundamentally because the "self" can live on for
_longer_ than the name that points to it.

I really think that the patch I sent out yesterday is as good as it gets.
If you want immediate notification, you should ask for notification about
name changes in a particular directory. IN_DELETE_SELF notification on a
file simple is _not_ going to be immediate.

Linus

2005-09-20 18:22:54

by Al Viro

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Tue, Sep 20, 2005 at 11:12:02AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 20 Sep 2005, Ray Lee wrote:
> >
> > I can't even talk to that level, but perhaps it'd help to know that some
> > (I think) are pinning their hopes on inotify as the foundation of a
> > userspace negative dentry cache (i.e., samba trying to prove a set of
> > filenames (case-insensitively) doesn't exist).
>
> Note that than you should use the _name_ caching part, ie the
> fsnotify_nameremove() part of the equation. That part is unambiguous.
>
> It's literally only the "inode" things (IN_DELETE_SELF) that are
> questionable. And that's fundamentally because the "self" can live on for
> _longer_ than the name that points to it.
>
> I really think that the patch I sent out yesterday is as good as it gets.
> If you want immediate notification, you should ask for notification about
> name changes in a particular directory. IN_DELETE_SELF notification on a
> file simple is _not_ going to be immediate.

But then it's too early. Note that with your patch we still get removal
of _any_ link to our inode (even though it's alive and well and we'd never
heard about the sodding link in the first place) terminating all events
on it. See example upthread - we have two links to the same inode;
the_only_name_we_know and ~luser/foo/bar/baz. We watch the_only_name_we_know.
Luser goes spring-cleaning and does rm ~luser/foo/bar/baz. Now we suddenly
get IN_DELETE_SELF on our watch *and* stop getting anything coming on that
sucker.

This is obviously broken - even if we reinstate the watch (after figuring out
that there had been no events on parent), we are already too late - we've
lost an unknown number of events _and_ had to do non-trivial cleanup in
the client (including redoing stat() and friends if we are going to compensate
for the lost events).

2005-09-20 18:26:59

by John McCutchan

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load


On 20-Sep-05, at 12:38 PM, Al Viro wrote:
>
> I don't get it. Could you please describe what your code is
> _supposed_
> to do? I'm not even talking about implementation - it's on the level
> of "what do we want the watchers to see after the following
> operations".
> Especially since you have fixed ABI - if you have userland clients
> relying on the representation of individual events, surely they also
> have to assume something about the sequence of events generated when
> we do this and that to files and directories?
>

Okay here are some cases to help you get a better idea,

p1: watch /tmp/foo
p2: rm /tmp/foo

p1 gets IN_DELETE_SELF

p1: watch /tmp/foo
p2: echo > /tmp/foo

p1 gets IN_OPEN
p1 gets IN_MODIFY
p1 gets IN_CLOSE_WRITE

p1: watch /tmp
p2: rm /tmp/foo

p1 gets IN_DELETE + "foo"

p1: watch /tmp/foo
p2: unmount /tmp/

p1 gets IN_UNMOUNT
p1 gets IN_IGNORE

p1: watch /tmp/foo
p2 cat /tmp/foo

p1 gets IN_OPEN
p1 gets IN_ACCESS
p1 gets IN_CLOSE_NOWRITE

p1: watch /tmp/foo
p2: mv /tmp/foo /tmp/bar

p1 gets IN_MOVED_FROM + "foo" & IN_MOVED_TO + "bar"

p1: watch /tmp/
p2: echo > /tmp/bar

p1 gets IN_CREATE + "bar"
p1 gets IN_OPEN + "bar"
p1 gets IN_MODIFY + "bar"
p1 gets IN_CLOSE_WRITE + "bar"


John McCutchan
[email protected]



2005-09-20 19:37:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load



On Tue, 20 Sep 2005, Al Viro wrote:
> >
> > I really think that the patch I sent out yesterday is as good as it gets.
> > If you want immediate notification, you should ask for notification about
> > name changes in a particular directory. IN_DELETE_SELF notification on a
> > file simple is _not_ going to be immediate.
>
> But then it's too early. Note that with your patch we still get removal
> of _any_ link to our inode (even though it's alive and well and we'd never
> heard about the sodding link in the first place) terminating all events
> on it.

Yes. What is in the current 2.6.14-rc2 tree doesn't do that. It considers
inodes "global". But it won't work reliably on networked filesystems, I
think.

Anyway, I do believe that IN_DELETE_SELF is stupid, but that you migth
re-arm it if you get it.

Linus

2005-09-20 19:40:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load



On Tue, 20 Sep 2005, John McCutchan wrote:
>
> Okay here are some cases to help you get a better idea,

The case that is _interesting_ is this one:

ln /tmp/foo /tmp/bar

p1: watch /tmp/foo

and then one or both of (different orders - four cases):

p2: rm /tmp/bar
p2: rm /tmp/foo

(along with "fd = open(/tmp/foo) + rm /tmp/foo + sleep + close(fd)" of
course, which Al already pointed out).

Linus

2005-09-20 22:53:07

by John McCutchan

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Tue, 2005-09-20 at 12:37 -0700, Linus Torvalds wrote:
>
> On Tue, 20 Sep 2005, Al Viro wrote:
> > >
> > > I really think that the patch I sent out yesterday is as good as it gets.
> > > If you want immediate notification, you should ask for notification about
> > > name changes in a particular directory. IN_DELETE_SELF notification on a
> > > file simple is _not_ going to be immediate.
> >
> > But then it's too early. Note that with your patch we still get removal
> > of _any_ link to our inode (even though it's alive and well and we'd never
> > heard about the sodding link in the first place) terminating all events
> > on it.
>
> Yes. What is in the current 2.6.14-rc2 tree doesn't do that. It considers
> inodes "global". But it won't work reliably on networked filesystems, I
> think.
>
> Anyway, I do believe that IN_DELETE_SELF is stupid, but that you migth
> re-arm it if you get it.

Is there some reason we can't just do this from vfs_unlink

inode = dentry->inode;
iget (inode);
d_delete (dentry);
fsnotify_inoderemove (inode);
iput (inode);

This would allow us to have immediate event notification, and avoid a
race with the inode going away, right?

I think the path below will make link handling as good as it can get, by
sending IN_DELETE_SELF every time inode->i_nlink goes down, and when
inode->i_nlink == 0, send the IN_IGNORE event. Also, it stuffs
inode->i_nlink into the cookie giving user space a clue about the status
of the inode.

Index: linux/include/linux/fsnotify.h
===================================================================
--- linux.orig/include/linux/fsnotify.h 2005-08-28 19:41:01.000000000 -0400
+++ linux/include/linux/fsnotify.h 2005-09-20 18:46:15.000000000 -0400
@@ -63,8 +63,9 @@
*/
static inline void fsnotify_inoderemove(struct inode *inode)
{
- inotify_inode_queue_event(inode, IN_DELETE_SELF, 0, NULL);
- inotify_inode_is_dead(inode);
+ inotify_inode_queue_event(inode, IN_DELETE_SELF, inode->i_nlink, NULL);
+ if (inode->i_nlink == 0)
+ inotify_inode_is_dead(inode);
}

/*

2005-09-21 00:33:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load



On Tue, 20 Sep 2005, John McCutchan wrote:
>
> Is there some reason we can't just do this from vfs_unlink
>
> inode = dentry->inode;
> iget (inode);
> d_delete (dentry);
> fsnotify_inoderemove (inode);
> iput (inode);

Mainly that it slows things down, and that it's wrong.

The thing is, I don't consider fsnotify_inoderemove() that important.

It is a fundamentally broken interface. We should document it as such. It
is _senseless_.

If you want immediate notification of a filename going away, then check
the directory. That is something with a _meaning_.

But the whole IN_DELETE_SELF is a STUPID INTERFACE.

I don't want to have stupid interfaces doing stupid things.

I'm perfectly willing to give an approximate answer if one is easy to
give. But there IS no "exact" answer, as shown by the fact that you didn't
even know what the semantics should be in the presense of links and
keeping a file open.

The file still _exists_ when it's open. You can read it, write it, extend
it, truncate it.. It's only the name that is gone. So I think delaying
the "IN_DELETE_SELF" until you can't do that any more is the RIGHT THING,
dammit.

All of the problems with the interface have come from expecting semantics
that simply aren't _valid_.

Live with the fact that files live on after the name is gone. Embrace it.
IT'S HOW THE UNIX WORLD WORKS. Arguing against it is like arguing against
gravity.

Linus

2005-09-21 00:52:33

by John McCutchan

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Tue, 2005-09-20 at 17:33 -0700, Linus Torvalds wrote:
>
> On Tue, 20 Sep 2005, John McCutchan wrote:
> >
> > Is there some reason we can't just do this from vfs_unlink
> >
> > inode = dentry->inode;
> > iget (inode);
> > d_delete (dentry);
> > fsnotify_inoderemove (inode);
> > iput (inode);
>
> Mainly that it slows things down, and that it's wrong.
>
> The thing is, I don't consider fsnotify_inoderemove() that important.
>
> It is a fundamentally broken interface. We should document it as such. It
> is _senseless_.
>
> If you want immediate notification of a filename going away, then check
> the directory. That is something with a _meaning_.
>
> But the whole IN_DELETE_SELF is a STUPID INTERFACE.
>
> I don't want to have stupid interfaces doing stupid things.
>
> I'm perfectly willing to give an approximate answer if one is easy to
> give. But there IS no "exact" answer, as shown by the fact that you didn't
> even know what the semantics should be in the presense of links and
> keeping a file open.
>
> The file still _exists_ when it's open. You can read it, write it, extend
> it, truncate it.. It's only the name that is gone. So I think delaying
> the "IN_DELETE_SELF" until you can't do that any more is the RIGHT THING,
> dammit.
>
> All of the problems with the interface have come from expecting semantics
> that simply aren't _valid_.
>
> Live with the fact that files live on after the name is gone. Embrace it.
> IT'S HOW THE UNIX WORLD WORKS. Arguing against it is like arguing against
> gravity.
>

Alright, at this point I was just throwing out ideas ;). Anyways, you've
convinced me! Now, what about my last patch and your last patch getting
included?

--
John McCutchan <[email protected]>

2005-09-21 01:02:00

by Al Viro

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Tue, Sep 20, 2005 at 06:53:34PM -0400, John McCutchan wrote:
> Is there some reason we can't just do this from vfs_unlink
>
> inode = dentry->inode;
> iget (inode);
> d_delete (dentry);
> fsnotify_inoderemove (inode);
> iput (inode);
>
> This would allow us to have immediate event notification, and avoid a
> race with the inode going away, right?

Playing with references to struct inode means playing dirty tricks
behind the filesystem's back. Doing that in a way that really changes
inode lifetime means asking for trouble. Combined with a dirty trick
*already* pulled by sys_unlink() to postpone the final iput until after
we unlock the parent, it means breakage (and aforementioned dirty trick
took some rather interesting logics to compensate for in the first place).

Moreover, your suggestion would do that to _everyone_, whether they use
inotify or not. NAK.

> static inline void fsnotify_inoderemove(struct inode *inode)
> {
> - inotify_inode_queue_event(inode, IN_DELETE_SELF, 0, NULL);
> - inotify_inode_is_dead(inode);
> + inotify_inode_queue_event(inode, IN_DELETE_SELF, inode->i_nlink, NULL);
> + if (inode->i_nlink == 0)
> + inotify_inode_is_dead(inode);
> }

Assumes that filesystem treats ->i_nlink on final iput() in usual way.
It doesn't have to.

BTW, what happens if one uses inotify on procfs? Or sysfs, for that matter?
Fundamental problem with that sucker is that you are playing games with
lifetime rules of inodes in a way that might be OK for some filesystems,
but violates a lot of assumptions made by other...

BTW^2, what guarantees that inotify_unmount_inodes() will not happen while we
are in inotify_release()? That would happily keep watch refcount bumped,
so it would outlive inotify_unmount_inodes(). Sure, it would be dropped.
And call iput() on a pinned inode that had outlived the umount(). Oops...

2005-09-21 01:40:42

by John McCutchan

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Wed, 2005-09-21 at 02:01 +0100, Al Viro wrote:
> On Tue, Sep 20, 2005 at 06:53:34PM -0400, John McCutchan wrote:
> > Is there some reason we can't just do this from vfs_unlink
> >
> > inode = dentry->inode;
> > iget (inode);
> > d_delete (dentry);
> > fsnotify_inoderemove (inode);
> > iput (inode);
> >
> > This would allow us to have immediate event notification, and avoid a
> > race with the inode going away, right?
>
> Playing with references to struct inode means playing dirty tricks
> behind the filesystem's back. Doing that in a way that really changes
> inode lifetime means asking for trouble. Combined with a dirty trick
> *already* pulled by sys_unlink() to postpone the final iput until after
> we unlock the parent, it means breakage (and aforementioned dirty trick
> took some rather interesting logics to compensate for in the first place).
>
> Moreover, your suggestion would do that to _everyone_, whether they use
> inotify or not. NAK.

Got it.

>
> > static inline void fsnotify_inoderemove(struct inode *inode)
> > {
> > - inotify_inode_queue_event(inode, IN_DELETE_SELF, 0, NULL);
> > - inotify_inode_is_dead(inode);
> > + inotify_inode_queue_event(inode, IN_DELETE_SELF, inode->i_nlink, NULL);
> > + if (inode->i_nlink == 0)
> > + inotify_inode_is_dead(inode);
> > }
>
> Assumes that filesystem treats ->i_nlink on final iput() in usual way.
> It doesn't have to.
>

I grepped all the filesystems, and they all seem to use
generic_drop_inode, except for hugetlbfs, which seems to have the same
logic of (!inode->i_nlink).

> BTW, what happens if one uses inotify on procfs? Or sysfs, for that matter?
> Fundamental problem with that sucker is that you are playing games with
> lifetime rules of inodes in a way that might be OK for some filesystems,
> but violates a lot of assumptions made by other...
>

Honestly, I don't know. And I don't think I know enough to say with any
certainty how either of them would work. Would a black list of
filesystems that don't want inotify on them be acceptable?

> BTW^2, what guarantees that inotify_unmount_inodes() will not happen while we
> are in inotify_release()? That would happily keep watch refcount bumped,
> so it would outlive inotify_unmount_inodes(). Sure, it would be dropped.
> And call iput() on a pinned inode that had outlived the umount(). Oops...

Good catch,

Index: linux/fs/inotify.c
===================================================================
--- linux.orig/fs/inotify.c 2005-08-31 15:41:11.000000000 -0400
+++ linux/fs/inotify.c 2005-09-20 21:18:35.000000000 -0400
@@ -756,6 +756,7 @@
* do not know the inode until we iterate to the watch. But we need to
* hold inode->inotify_sem before dev->sem. The following works.
*/
+ down(&iprune_sem);
while (1) {
struct inotify_watch *watch;
struct list_head *watches;
@@ -779,6 +780,7 @@
up(&inode->inotify_sem);
put_inotify_watch(watch);
}
+ up(&iprune_sem);

/* destroy all of the events on this device */
down(&dev->sem);


2005-09-21 02:36:04

by Al Viro

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Tue, Sep 20, 2005 at 09:41:47PM -0400, John McCutchan wrote:
> I grepped all the filesystems

... in the tree

> , and they all seem to use
> generic_drop_inode, except for hugetlbfs, which seems to have the same
> logic of (!inode->i_nlink).

I have no problems with killing ->drop_inode(), but that should be
a) done for in-tree filesystems
b) announced on fsdevel, so that out-of-tree folks could deal
with that
c) given at least one release to avoid screwing them.

Christoph, could you send the patch you've mentioned? I'd rather avoid
duplicating what you've done...

> > BTW, what happens if one uses inotify on procfs? Or sysfs, for that matter?
> > Fundamental problem with that sucker is that you are playing games with
> > lifetime rules of inodes in a way that might be OK for some filesystems,
> > but violates a lot of assumptions made by other...
> >
>
> Honestly, I don't know. And I don't think I know enough to say with any
> certainty how either of them would work. Would a black list of
> filesystems that don't want inotify on them be acceptable?

Maybe... Alternatively, it might be interesting to see if differences
between filesystem requirements can be handled by secondary method
table that would be set at ->get_sb() time (or merged into super_operations
if we end up with few of them).

I'd certainly love to deal with sys_unlink() kludge that way, BTW - two
new methods, with vfs_unlink() ending with

up(&dentry->d_inode->i_sem);

/* NOTE: fast path should be dealt with in real version, this
is just a demo */

if (error)
return ERR_PTR(error);
else if (!dir->i_op->post_unlink)
return generic_post_unlink(dentry);
else
return dir->i_op->post_unlink(dentry);
}

with callers doing

p = vfs_unlink(dir, dentry);
...
up(&dir->i_sem);
err = vfs_finish_unlink(dir, p);


vfs_finish_unlink(dir, p) (again, that would need to be optimized for fast path)

if (IS_ERR(p))
return PTR_ERR(p);
if (!dir->i_op->finish_unlink)
return generic_finish_unlink(dir, p);
else
return dir->i_op->finish_unlink(dir, p);

generic_post_unlink(dentry): grab a reference to dentry->d_inode, call
d_delete(), return inode

generic_finish_unlink(dir, p): iput(p); return 0;

NFS: don't mess with inode refcount at all *and* make sure we don't
d_delete() sillyrenamed ones.

sys_unlink(): no more messing with inodes behind fs back.

Price: change of public API (vfs_unlink() changes the type of return value
*and* calling conventions - we need to pass its return value to
vfs_finish_unlink() after unlocking the parent).

> > BTW^2, what guarantees that inotify_unmount_inodes() will not happen while we
> > are in inotify_release()? That would happily keep watch refcount bumped,
> > so it would outlive inotify_unmount_inodes(). Sure, it would be dropped.
> > And call iput() on a pinned inode that had outlived the umount(). Oops...
>
> Good catch,

[snip]

Why don't you simply put watches on a list anchored at superblock and
go through that list instead of messing with inode lists?

Speaking of disappearing ->d_inode problem... How about collecting
watches on a temporary list while holding dcache_lock and then hitting
them all with "it's going away" if ->i_nlink has hit zero? You don't
need inode to be around for the last part - you only use it to find watches
in question. And that can be done without any allocations...

2005-09-21 08:35:41

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Wed, Sep 21, 2005 at 03:36:01AM +0100, Al Viro wrote:
> On Tue, Sep 20, 2005 at 09:41:47PM -0400, John McCutchan wrote:
> > I grepped all the filesystems
>
> ... in the tree
>
> > , and they all seem to use
> > generic_drop_inode, except for hugetlbfs, which seems to have the same
> > logic of (!inode->i_nlink).
>
> I have no problems with killing ->drop_inode(), but that should be
> a) done for in-tree filesystems
> b) announced on fsdevel, so that out-of-tree folks could deal
> with that
> c) given at least one release to avoid screwing them.
>
> Christoph, could you send the patch you've mentioned? I'd rather avoid
> duplicating what you've done...

sure. Note that clusterfs folks (ocfs2 in particular) really want
->drop_inode because they need additional checks instead of just the
nlink one in there. While hugetlbfs should just go away ->drop_inode
makes some sense for them.

2005-09-21 09:16:21

by Joel Becker

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Wed, Sep 21, 2005 at 09:35:25AM +0100, Christoph Hellwig wrote:
> On Wed, Sep 21, 2005 at 03:36:01AM +0100, Al Viro wrote:
> > I have no problems with killing ->drop_inode(), but that should be
> > a) done for in-tree filesystems
> > b) announced on fsdevel, so that out-of-tree folks could deal
> > with that
> > c) given at least one release to avoid screwing them.
>
> sure. Note that clusterfs folks (ocfs2 in particular) really want
> ->drop_inode because they need additional checks instead of just the
> nlink one in there. While hugetlbfs should just go away ->drop_inode
> makes some sense for them.

My apologies for not having read the inotify thread, I'll go
look in the morning.
In ->drop_inode(), OCFS2 takes care of noticing that nlink has
been changed by a remote node. This is necessary for
generic_drop...delete operation to proceed.
If OCFS2 had to go back to the 2.4 method of checking i_count==1
in ->put_inode(), I'm not sure we're allowed to modify i_nlink there
unlocked, are we?
I also think we had some sort of race with inode_lock that
->drop_inode() avoids, but I'm not sure. Mark?

Joel

--

"For every complex problem there exists a solution that is brief,
concise, and totally wrong."
-Unknown

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2005-09-21 09:17:44

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Wed, Sep 21, 2005 at 02:15:24AM -0700, Joel Becker wrote:
> My apologies for not having read the inotify thread, I'll go
> look in the morning.
> In ->drop_inode(), OCFS2 takes care of noticing that nlink has
> been changed by a remote node. This is necessary for
> generic_drop...delete operation to proceed.
> If OCFS2 had to go back to the 2.4 method of checking i_count==1
> in ->put_inode(), I'm not sure we're allowed to modify i_nlink there
> unlocked, are we?
> I also think we had some sort of race with inode_lock that
> ->drop_inode() avoids, but I'm not sure. Mark?

The real fix would be to put an equivalent of OCFS2_INODE_MAYBE_ORPHANED
into struct inode. That way it could be shared by other clustered
filesystems aswell, and OCFS2 had no need to implement ->drop_inode.

I'm pretty sure I dicussed that with either Mark or Zab.

2005-09-21 14:47:06

by Joel Becker

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Wed, Sep 21, 2005 at 10:17:38AM +0100, Christoph Hellwig wrote:
> The real fix would be to put an equivalent of OCFS2_INODE_MAYBE_ORPHANED
> into struct inode. That way it could be shared by other clustered
> filesystems aswell, and OCFS2 had no need to implement ->drop_inode.

Or change the VFS patterns to allow lookup and validation of the
inode before choosing the generic_drop/generic_delete path.

Joel

--

Life's Little Instruction Book #197

"Don't forget, a person's greatest emotional need is to
feel appreciated."

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2005-09-21 18:10:05

by Mark Fasheh

[permalink] [raw]
Subject: Re: [patch] stop inotify from sending random DELETE_SELF event under load

On Wed, Sep 21, 2005 at 07:45:38AM -0700, Joel Becker wrote:
> On Wed, Sep 21, 2005 at 10:17:38AM +0100, Christoph Hellwig wrote:
> > The real fix would be to put an equivalent of OCFS2_INODE_MAYBE_ORPHANED
> > into struct inode. That way it could be shared by other clustered
> > filesystems aswell, and OCFS2 had no need to implement ->drop_inode.
> Or change the VFS patterns to allow lookup and validation of the
> inode before choosing the generic_drop/generic_delete path.
Right. *Only* setting something akin to OCFS2_INODE_MAYBE_ORPHANED on struct
inode would give us essentially what we have today assuming that
generic_drop_inode() punted to delete_inode() in that case (as well as nlink
== 0). Which is fine as long as we're ok with inodes who might *not*
actually orphaned going through delete_inode().

A more involved solution might be to give OCFS2 the chance to do it's
cluster querying at or around drop_inode() time on those which have been
marked as potentially orphaned.

This would give us the benefit of allowing those inodes which wound up not
being orphaned (due to remote node error or death) to go through
generic_forget_inode(), which if my reading is correct would at least allow
it to live longer in the system. I'm not so sure how important this is
however, considering the number of inodes which wind up not having been
orphaned is typically very small and the only downside for those few as far
as i can tell is a small loss of performance.

Btw, the other reason a local node might not completely wipe an inode in
OCFS2 land is if another node is still using it (if it has open file
descriptors), in which case what we do today is sync the inode, truncate
it's pages and then exit ocfs2_delete_inode(), thus allowing our local VFS
to destroy it secure in the knowledge that the actual wiping of the inode
will happen elsewhere in the cluster.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]