2006-01-16 22:34:33

by Olaf Hering

Subject: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

I get 'Busy inodes after umount' very often, even with recent kernels.
Usually it remains unnoticed for a while. A bit more info about what
superblock had problems would be helpful.

bdevname() doesn't seem to be the best way to pretty-print NFS mounts,
for example. Should it just print the major:minor pair?
Are there scripts or something that parse such kernel messages, should
the extra info go somewhere else?

fs/super.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)

Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c
+++ linux-2.6/fs/super.c
@@ -227,6 +227,7 @@ void generic_shutdown_super(struct super
{
struct dentry *root = sb->s_root;
struct super_operations *sop = sb->s_op;
+ char b[BDEVNAME_SIZE];

if (root) {
sb->s_root = NULL;
@@ -247,8 +248,9 @@ void generic_shutdown_super(struct super

/* Forget any remaining inodes */
if (invalidate_inodes(sb)) {
- printk("VFS: Busy inodes after unmount. "
- "Self-destruct in 5 seconds. Have a nice day...\n");
+ printk("VFS: (%s) Busy inodes after unmount. "
+ "Self-destruct in 5 seconds. Have a nice day...\n",
+ bdevname(sb->s_bdev, b));
}

unlock_kernel();
--
short story of a lazy sysadmin:
alias appserv=wotan


2006-01-16 23:24:06

by Kirill Korotaev

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

Olaf, can you please check if my patch for busy inodes from -mm tree
helps you?
Patch name is fix-of-dcache-race-leading-to-busy-inodes-on-umount.patch
http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.15/2.6.15-mm4/broken-out/fix-of-dcache-race-leading-to-busy-inodes-on-umount.patch

I can also provide you a more sophisticated debug patch for this if
required.

Kirill

> I get 'Busy inodes after umount' very often, even with recent kernels.
> Usually it remains unnoticed for a while. A bit more info about what
> superblock had problems would be helpful.
>
> bdevname() doesn't seem to be the best way to pretty-print NFS mounts,
> for example. Should it just print the major:minor pair?
> Are there scripts or something that parse such kernel messages, should
> the extra info go somewhere else?
>
> fs/super.c | 6 ++++--
> 1 files changed, 4 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/fs/super.c
> ===================================================================
> --- linux-2.6.orig/fs/super.c
> +++ linux-2.6/fs/super.c
> @@ -227,6 +227,7 @@ void generic_shutdown_super(struct super
> {
> struct dentry *root = sb->s_root;
> struct super_operations *sop = sb->s_op;
> + char b[BDEVNAME_SIZE];
>
> if (root) {
> sb->s_root = NULL;
> @@ -247,8 +248,9 @@ void generic_shutdown_super(struct super
>
> /* Forget any remaining inodes */
> if (invalidate_inodes(sb)) {
> - printk("VFS: Busy inodes after unmount. "
> - "Self-destruct in 5 seconds. Have a nice day...\n");
> + printk("VFS: (%s) Busy inodes after unmount. "
> + "Self-destruct in 5 seconds. Have a nice day...\n",
> + bdevname(sb->s_bdev, b));
> }
>
> unlock_kernel();

2006-01-16 23:30:00

by Olaf Hering

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

On Tue, Jan 17, Kirill Korotaev wrote:

> Olaf, can you please check if my patch for busy inodes from -mm tree
> helps you?

I can't reproduce it at will, that's the thing. It likely happens with NFS
mounts. [email protected] did some work recently. But I remember that even with
these changes (for a 2.6.13), the busy inodes did not disappear.

Merging your patch into our cvs will give it more testing; I will do
that tomorrow if no one disagrees.

--
short story of a lazy sysadmin:
alias appserv=wotan

2006-01-17 02:06:28

by Andrew Morton

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

Olaf Hering <[email protected]> wrote:
>
> On Tue, Jan 17, Kirill Korotaev wrote:
>
> > Olaf, can you please check if my patch for busy inodes from -mm tree
> > helps you?
>
> I can't reproduce it at will, that's the thing. It likely happens with NFS
> mounts. [email protected] did some work recently. But I remember that even with
> these changes (for a 2.6.13), the busy inodes did not disappear.
>
> Merging your patch into our cvs will give it more testing; I will do
> that tomorrow if no one disagrees.
>
>

The patch is certainly safe and stable. But it's so huge and complex and
ugly that I was hoping that a better fix would turn up. The bug itself
takes quite some ingenuity to hit.

2006-01-17 07:03:52

by Kirill Korotaev

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

> Olaf Hering <[email protected]> wrote:
>> On Tue, Jan 17, Kirill Korotaev wrote:
>>
>>> Olaf, can you please check if my patch for busy inodes from -mm tree
>>> helps you?
>> I can't reproduce it at will, that's the thing. It likely happens with NFS
>> mounts. [email protected] did some work recently. But I remember that even with
>> these changes (for a 2.6.13), the busy inodes did not disappear.
>>
>> Merging your patch into our cvs will give it more testing; I will do
>> that tomorrow if no one disagrees.
>>
>
> The patch is certainly safe and stable. But it's so huge and complex and
> ugly that I was hoping that a better fix would turn up. The bug itself
> takes quite some ingenuity to hit.
We have another idea how to reimplement it via refcounters instead of
lists. But I'm not sure when this will happen, due to lack of time :(

Kirill

2006-01-18 22:49:55

by Jan Blunck

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

On Tue, Jan 17, Kirill Korotaev wrote:

> Olaf, can you please check if my patch for busy inodes from -mm tree
> helps you?
> Patch name is fix-of-dcache-race-leading-to-busy-inodes-on-umount.patch
> http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.15/2.6.15-mm4/broken-out/fix-of-dcache-race-leading-to-busy-inodes-on-umount.patch

This patch is just wrong. It is hiding bugs in file systems. The problem is
that somewhere the reference counting on the vfsmount objects is wrong: the
file system is unmounted before the last dentry is dereferenced. Either you
didn't hold a reference to the proper vfsmount object at all, or you
dereference it too early. See Al Viro's patch series (search for "namei fixes")
for how to fix these issues.

Regards,
Jan

--
Jan Blunck [email protected]
SuSE LINUX AG - A Novell company
Maxfeldstr. 5 +49-911-74053-608
D-90409 Nürnberg http://www.suse.de

2006-01-18 23:11:50

by Andrew Morton

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

Jan Blunck <[email protected]> wrote:
>
> On Tue, Jan 17, Kirill Korotaev wrote:
>
> > Olaf, can you please check if my patch for busy inodes from -mm tree
> > helps you?
> > Patch name is fix-of-dcache-race-leading-to-busy-inodes-on-umount.patch
> > http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.15/2.6.15-mm4/broken-out/fix-of-dcache-race-leading-to-busy-inodes-on-umount.patch
>
> This patch is just wrong. It is hiding bugs in file systems. The problem is
> that somewhere the reference counting on the vfsmount objects is wrong: the
> file system is unmounted before the last dentry is dereferenced. Either you
> didn't hold a reference to the proper vfsmount object at all, or you
> dereference it too early. See Al Viro's patch series (search for "namei fixes")
> for how to fix these issues.
>

The only reason I've been carrying that patch is as a reminder that there's
a bug that we need to fix. It'd be good news if that bug had been fixed by
other means.

Kirill, do you know whether the bug is still present in 2.6.16-rc1?

2006-01-19 09:52:05

by Kirill Korotaev

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

>>Olaf, can you please check if my patch for busy inodes from -mm tree
>>helps you?
>>Patch name is fix-of-dcache-race-leading-to-busy-inodes-on-umount.patch
>>http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.15/2.6.15-mm4/broken-out/fix-of-dcache-race-leading-to-busy-inodes-on-umount.patch
>
>
> This patch is just wrong. It is hiding bugs in file systems. The problem is
> that somewhere the reference counting on the vfsmount objects is wrong: the
> file system is unmounted before the last dentry is dereferenced. Either you
> didn't hold a reference to the proper vfsmount object at all, or you
> dereference it too early. See Al Viro's patch series (search for "namei fixes")
> for how to fix these issues.

This patch has nothing to do with vfsmount references and doesn't hide
anything. It just adds a synchronization barrier between do_umount() and
shrink_dcache() since the latter can work with dentries/inodes without
holding locks.

So if you think there is something wrong with it, please, be more specific.

Kirill

2006-01-19 10:04:46

by Jan Blunck

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

On Thu, Jan 19, Kirill Korotaev wrote:

> This patch has nothing to do with vfsmount references and doesn't hide
> anything. It just adds syncronization barrier between do_umount() and
> shrink_dcache() since the latter can work with dentries/inodes without
> holding locks.
>
> So if you think there is something wrong with it, please, be more specific.
>

You can only unmount a file system if there are no references to the vfsmount
object anymore. Since shrink_dcache*() is called after checking the refcount of
the vfsmount while unmounting the file system, it isn't possible to hold a
reference to a dentry (and therefore to call dput()) after this point in
time. Therefore your reference counting on the vfsmount is wrong, which is the
root cause of your problem of busy inodes.

Regards,
Jan

--
Jan Blunck [email protected]
SuSE LINUX AG - A Novell company
Maxfeldstr. 5 +49-911-74053-608
D-90409 Nürnberg http://www.suse.de

2006-01-19 10:07:23

by Kirill Korotaev

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

>>On Tue, Jan 17, Kirill Korotaev wrote:
>>
>>
>>>Olaf, can you please check if my patch for busy inodes from -mm tree
>>>helps you?
>>>Patch name is fix-of-dcache-race-leading-to-busy-inodes-on-umount.patch
>>>http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.15/2.6.15-mm4/broken-out/fix-of-dcache-race-leading-to-busy-inodes-on-umount.patch
>>
>>This patch is just wrong. It is hiding bugs in file systems. The problem is
>>that somewhere the reference counting on the vfsmount objects is wrong: the
>>file system is unmounted before the last dentry is dereferenced. Either you
>>didn't hold a reference to the proper vfsmount object at all, or you
>>dereference it too early. See Al Viro's patch series (search for "namei fixes")
>>for how to fix these issues.
>
>
> The only reason I've been carrying that patch is as a reminder that there's
> a bug that we need to fix. It'd be good news if that bug had been fixed by
> other means.
>
> Kirill, do you know whether the bug is still present in 2.6.16-rc1?

It exists in 2.6.15, and I see no changes in 2.6.16-rc1 except for
cosmetics :(
I checked the git tree, dput() etc.; the bug is definitely still here -
nothing has changed in this area.

Sorry for the bad news.

The patch could probably be remade by introducing a "notlocked_refs"
counter on the dentry. If shrinker()/umount() see such a dentry with a
non-zero refcount, they can sleep as is done in the current patch. That would
be a little bit cleaner/simpler. What do you think?
If someone suggests any brilliant/helpful idea, I would be happy to
improve it.

Kirill

2006-01-19 10:25:08

by Kirill Korotaev

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

>>This patch has nothing to do with vfsmount references and doesn't hide
>>anything. It just adds a synchronization barrier between do_umount() and
>>shrink_dcache() since the latter can work with dentries/inodes without
>>holding locks.
>>
>>So if you think there is something wrong with it, please, be more specific.
>>
>
>
> You can only unmount a file system if there are no references to the vfsmount
> object anymore. Since shrink_dcache*() is called after checking the refcount of
> the vfsmount while unmounting the file system, it isn't possible to hold a
> reference to a dentry (and therefore to call dput()) after this point in
> time. Therefore your reference counting on the vfsmount is wrong, which is the
> root cause of your problem of busy inodes.

You didn't take shrink_dcache*() under memory pressure into account. It
runs whenever it runs, and when it calls dput() it detaches the dentry from
the whole tree and starts to work on the inode. do_umount() can
successfully shrink the other part of the tree, since the dentry in question
is detached, complain about a busy inode (it is really being put on
another CPU, but still busy) and destroy the super block.

Another scenario, from the patch comment:

CPU 1                                 CPU 2
~~~~~                                 ~~~~~
umount /dev/sda1
generic_shutdown_super                shrink_dcache_memory()
shrink_dcache_parent                  dput dentry
select_parent                         prune_one_dentry()
                                      <<<< child is dead, locks are released,
                                           but parent is still referenced!!! >>>>
skip dentry->parent,
since its d_count > 0

message: BUSY inodes after umount...
                                      <<< parent is left on dentry_unused list,
                                          referencing freed super block >>>


Kirill


2006-01-20 19:07:33

by Jan Blunck

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

On Thu, Jan 19, Kirill Korotaev wrote:

> CPU 1                                 CPU 2
> ~~~~~                                 ~~~~~
> umount /dev/sda1
> generic_shutdown_super                shrink_dcache_memory()
> shrink_dcache_parent                  dput dentry
> select_parent                         prune_one_dentry()
>                                       <<<< child is dead, locks are released,
>                                            but parent is still referenced!!! >>>>
> skip dentry->parent,
> since its d_count > 0
>
> message: BUSY inodes after umount...
>                                       <<< parent is left on dentry_unused list,
>                                           referencing freed super block >>>

I see. The problem is that the dcache_lock is given up before dereferencing the
parent. But your patch seems to be wrong anyway, IMHO. I'll post patches in a
separate thread.

Regards,
Jan

--
Jan Blunck [email protected]
SuSE LINUX AG - A Novell company
Maxfeldstr. 5 +49-911-74053-608
D-90409 Nürnberg http://www.suse.de

2006-01-23 08:13:35

by Kirill Korotaev

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

>>CPU 1                                 CPU 2
>>~~~~~                                 ~~~~~
>>umount /dev/sda1
>>generic_shutdown_super                shrink_dcache_memory()
>>shrink_dcache_parent                  dput dentry
>>select_parent                         prune_one_dentry()
>>                                      <<<< child is dead, locks are released,
>>                                           but parent is still referenced!!! >>>>
>>skip dentry->parent,
>>since its d_count > 0
>>
>>message: BUSY inodes after umount...
>>                                      <<< parent is left on dentry_unused list,
>>                                          referencing freed super block >>>
>
>
> I see. The problem is that the dcache_lock is given up before dereferencing the
> parent. But your patch seems to be wrong anyway, IMHO. I'll post patches in a
> separate thread.
Jan, I still have not heard a single comment about what's wrong with
it... I would really appreciate it if you provided one.

Kirill

2006-01-30 11:54:38

by Jan Blunck

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

On Mon, Jan 23, Kirill Korotaev wrote:

> Jan, I still have not heard a single comment about what's wrong with
> it... I would really appreciate it if you provided one.
>

Sorry for the delay. I had to fix a totally bogus patch (mine ;).

The problem with your patch is that it hides too-early mntput()s. Think about
the following situation:

mntput(path->mnt); // too early mntput()
dput(path->dentry);

Assuming that someone unmounts the file system in between this sequence, your
patch will wait for this dput() to finish before it proceeds with unmounting
the file system. I think this isn't what we want.

Regards,
Jan

--
Jan Blunck [email protected]
SuSE LINUX AG - A Novell company
Maxfeldstr. 5 +49-911-74053-608
D-90409 Nürnberg http://www.suse.de

2006-01-30 14:03:51

by Kirill Korotaev

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

Hello Jan,

>>Jan, I still have not heard a single comment about what's wrong with
>>it... I would really appreciate it if you provided one.
>>
>
>
> Sorry for the delay. I had to fix a totally bogus patch (mine ;).
>
> The problem with your patch is that it hides too-early mntput()s. Think about
> the following situation:
>
> mntput(path->mnt); // too early mntput()
> dput(path->dentry);
>
> Assuming that someone unmounts the file system in between this sequence, your
> patch will wait for this dput() to finish before it proceeds with unmounting
> the file system. I think this isn't what we want.
No, it won't wait for anything, because if the umount happened between
mntput()/dput(), the dentry is not in the s_dshrinkers list.
If the umount happens in parallel with dput() (where the shrinker operations
are), then it will behave OK - it will wait for dput() and then umount. That
was the intended behaviour!

Also, please note that such early mntput()s are bugs!!! Such
dentries can reference freed memory after the last mntput(). And I remember
some patches in 2.4.x/2.6.x which fixed this sequence everywhere.

Kirill

2006-01-30 14:21:40

by Jan Blunck

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

On Mon, Jan 30, Kirill Korotaev wrote:

> >
> > mntput(path->mnt); // too early mntput()
> > dput(path->dentry);
> >
> > Assuming that someone unmounts the file system in between this sequence,
> > your patch will wait for this dput() to finish before it proceeds with
> > unmounting the file system. I think this isn't what we want.
> No, it won't wait for anything, because if the umount happened between
> mntput()/dput(), the dentry is not in the s_dshrinkers list.
> If the umount happens in parallel with dput() (where the shrinker operations
> are), then it will behave OK - it will wait for dput() and then umount. That
> was the intended behaviour!

It should not wait.

>
> Also, please note that such early mntput()s are bugs!!! Such
> dentries can reference freed memory after the last mntput(). And I remember
> some patches in 2.4.x/2.6.x which fixed this sequence everywhere.

That's why I'm complaining ...

Regards,
Jan

--
Jan Blunck [email protected]
SuSE LINUX AG - A Novell company
Maxfeldstr. 5 +49-911-74053-608
D-90409 Nürnberg http://www.suse.de

2006-01-30 14:32:23

by Kirill Korotaev

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

Hello Jan,

>>>mntput(path->mnt); // too early mntput()
>>>dput(path->dentry);
>>>
>>>Assuming that someone unmounts the file system in between this sequence,
>>>your patch will wait for this dput() to finish before it proceeds with
>>>unmounting the file system. I think this isn't what we want.
>>
>>No, it won't wait for anything, because if the umount happened between
>>mntput()/dput(), the dentry is not in the s_dshrinkers list.
>>If the umount happens in parallel with dput() (where the shrinker operations
>>are), then it will behave OK - it will wait for dput() and then umount. That
>>was the intended behaviour!
>
>
> It should not wait.
Why?! It makes sure that dentries/inodes are gone _before_ the super block
is destroyed.

>>Also, please note that such early mntput()s are bugs!!! Such
>>dentries can reference freed memory after the last mntput(). And I remember
>>some patches in 2.4.x/2.6.x which fixed this sequence everywhere.
>
>
> That's why I'm complaining ...
About what?
My patch doesn't hide this bug, nor does it help with it in any way.

Kirill


2006-03-02 06:58:35

by NeilBrown

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super



Hi,
This mail relates to the thread with the same subject which can be
found at

http://lkml.org/lkml/2006/1/16/279

I would like to propose an alternate patch for the problem.

The core problem is that:
prune_one_dentry can hold a reference to a dentry without any
lock being held, and without any other reference to the
filesystem (if it is being called from shrink_dcache_memory).
It holds this reference while calling iput on an inode. This can
take an arbitrarily long time to complete, especially if NFS
needs to wait for some RPCs to complete or timeout.

shrink_dcache_parent skips over dentries which have a reference,
such as the one held by prune_one_dentry.

Thus umount can find that an inode is still in use (by its dentry,
which was skipped) and will complain. Worse, when the NFS request
on some inode finally completes, it might find the superblock
doesn't exist any more and... oops.

My proposed solution to the problem is never to expose the reference
held by prune_one_dentry. i.e. keep the spin_lock held.

This requires:
- Breaking dentry_iput into 2 pieces, one that happens while the
dcache locks are held, and one that happens unlocked.
- Also, dput needs a variant which can be called with the spinlocks
held.
- This also requires a suitable comment in the code.

It is possible that the dentry_iput call in dput might need to be
split into the locked/unlocked portions as well. That would
require collecting a list of inodes and dentries to be freed once
the lock is dropped, which would be ugly.
An alternative might be to skip the tail recursion when
dput_locked was called as I *think* it is just an optimisation.


The following patch addresses the first three points.

Comments? Please :-?

NeilBrown


Signed-off-by: Neil Brown <[email protected]>

### Diffstat output
./fs/dcache.c | 105 ++++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 81 insertions(+), 24 deletions(-)

diff ./fs/dcache.c~current~ ./fs/dcache.c
--- ./fs/dcache.c~current~ 2006-03-02 17:14:24.000000000 +1100
+++ ./fs/dcache.c 2006-03-02 17:55:08.000000000 +1100
@@ -94,24 +94,36 @@ static void d_free(struct dentry *dentry
* d_iput() operation if defined.
* Called with dcache_lock and per dentry lock held, drops both.
*/
-static void dentry_iput(struct dentry * dentry)
+static inline struct inode *dentry_iput_locked(struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;
if (inode) {
dentry->d_inode = NULL;
list_del_init(&dentry->d_alias);
- spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
- if (!inode->i_nlink)
- fsnotify_inoderemove(inode);
- if (dentry->d_op && dentry->d_op->d_iput)
- dentry->d_op->d_iput(dentry, inode);
- else
- iput(inode);
- } else {
- spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_lock);
}
+ return inode;
+}
+
+static inline void dentry_iput_unlocked(struct dentry *dentry,
+ struct inode *inode)
+{
+ if (!inode)
+ return;
+ if (!inode->i_nlink)
+ fsnotify_inoderemove(inode);
+ if (dentry->d_op && dentry->d_op->d_iput)
+ dentry->d_op->d_iput(dentry, inode);
+ else
+ iput(inode);
+}
+
+static void dentry_iput(struct dentry * dentry)
+{
+ struct inode *inode = dentry_iput_locked(dentry);
+
+ spin_unlock(&dentry->d_lock);
+ spin_unlock(&dcache_lock);
+ dentry_iput_unlocked(dentry, inode);
}

/*
@@ -143,18 +155,10 @@ static void dentry_iput(struct dentry *
* no dcache lock, please.
*/

-void dput(struct dentry *dentry)
+static void __dput_locked(struct dentry *dentry)
{
- if (!dentry)
- return;

repeat:
- if (atomic_read(&dentry->d_count) == 1)
- might_sleep();
- if (!atomic_dec_and_lock(&dentry->d_count, &dcache_lock))
- return;
-
- spin_lock(&dentry->d_lock);
if (atomic_read(&dentry->d_count)) {
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
@@ -202,10 +206,43 @@ kill_it: {
if (dentry == parent)
return;
dentry = parent;
+
+ if (atomic_read(&dentry->d_count) == 1)
+ might_sleep();
+ if (!atomic_dec_and_lock(&dentry->d_count, &dcache_lock))
+ return;
+
+ spin_lock(&dentry->d_lock);
goto repeat;
}
}

+void dput(struct dentry *dentry)
+{
+ if (!dentry)
+ return;
+ if (atomic_read(&dentry->d_count) == 1)
+ might_sleep();
+ if (!atomic_dec_and_lock(&dentry->d_count, &dcache_lock))
+ return;
+
+ spin_lock(&dentry->d_lock);
+
+ __dput_locked(dentry);
+}
+
+void dput_locked(struct dentry *dentry)
+{
+ if (!dentry)
+ return;
+ if (!atomic_dec_and_test(&dentry->d_count)) {
+ spin_unlock(&dentry->d_lock);
+ spin_unlock(&dcache_lock);
+ return;
+ }
+ __dput_locked(dentry);
+}
+
/**
* d_invalidate - invalidate a dentry
* @dentry: dentry to invalidate
@@ -361,19 +398,39 @@ restart:
* This requires that the LRU list has already been
* removed.
* Called with dcache_lock, drops it and then regains.
+ *
+ * There was a risk of this function, called from shrink_dcache_memory,
+ * racing with shrink_dcache_parent called from generic_shutdown_super.
+ * This function was holding a reference to the parent after the child
+ * has been removed, and this wasn't protected by any spinlock.
+ * shrink_dcache_parent would think the dentry was in use, and so it would
+ * not get discarded. This would result in a very unclean unmount.
+ * So we need to keep the spin_lock while ever we hold a reference to
+ * a dentry. This (hopefully) explains the two-stage
+ * dentry_iput, and the need for dput_locked.
+ * Note: the race was easiest to hit if iput was very slow, as
+ * it could be when tearing down a large address space, or waiting
+ * for pending network requests to return/timeout.
*/
static inline void prune_one_dentry(struct dentry * dentry)
{
struct dentry * parent;
+ struct inode * ino;

__d_drop(dentry);
list_del(&dentry->d_u.d_child);
dentry_stat.nr_dentry--; /* For d_free, below */
- dentry_iput(dentry);
+ ino = dentry_iput_locked(dentry);
parent = dentry->d_parent;
- d_free(dentry);
if (parent != dentry)
- dput(parent);
+ dput_locked(parent);
+ else {
+ spin_unlock(&dentry->d_lock);
+ spin_unlock(&dcache_lock);
+ }
+ dentry_iput_unlocked(dentry, ino);
+ d_free(dentry);
+
spin_lock(&dcache_lock);
}

2006-03-02 10:48:20

by Jan Blunck

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

On Thu, Mar 02, Neil Brown wrote:

> This requires:
> - Breaking dentry_iput into 2 pieces, one that happens while the
> dcache locks are held, and one that happens unlocked.
> - Also, dput needs a variant which can be called with the spinlocks
> held.
> - This also requires a suitable comment in the code.
>
> It is possible that the dentry_iput call in dput might need to be
> split into the locked/unlocked portions as well. That would
> require collecting a list of inodes and dentries to be freed once
> the lock is dropped, which would be ugly.
> An alternative might be to skip the tail recursion when
> dput_locked was called as I *think* it is just an optimisation.
>
>
> The following patch addressed the first three points.
>
> Comments? Please :-?
>

This looks very much like a fixed version of my patch from
http://lkml.org/lkml/2006/1/20/303. Therefore, in general I'm fine with it ;)

Comments below!

> +void dput_locked(struct dentry *dentry)
> +{
> + if (!dentry)
> + return;
> + if (!atomic_dec_and_test(&dentry->d_count)) {
> + spin_unlock(&dentry->d_lock);
> + spin_unlock(&dcache_lock);
> + return;
> + }
> + __dput_locked(dentry);
> +}
> +

A comment like in dentry_iput() would be fine here:
* Called with dcache_lock and per dentry lock held, drops both.

> static inline void prune_one_dentry(struct dentry * dentry)
> {
> struct dentry * parent;
> + struct inode * ino;
>
> __d_drop(dentry);
> list_del(&dentry->d_u.d_child);
> dentry_stat.nr_dentry--; /* For d_free, below */
> - dentry_iput(dentry);
> + ino = dentry_iput_locked(dentry);
> parent = dentry->d_parent;
> - d_free(dentry);
> if (parent != dentry)
> - dput(parent);
> + dput_locked(parent);
> + else {
> + spin_unlock(&dentry->d_lock);
> + spin_unlock(&dcache_lock);
> + }
> + dentry_iput_unlocked(dentry, ino);
> + d_free(dentry);
> +
> spin_lock(&dcache_lock);
> }
>

You missed getting the parent dentry's lock before calling dput_locked().

Regards,
Jan

--
Jan Blunck [email protected]
SuSE LINUX AG - A Novell company
Maxfeldstr. 5 +49-911-74053-608
D-90409 Nürnberg http://www.suse.de

2006-03-03 11:42:35

by Jan Blunck

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

On Thu, Mar 02, Neil Brown wrote:

> The core problem is that:
> prune_one_dentry can hold a reference to a dentry without any
> lock being held, and without any other reference to the
> filesystem (if it is being called from shrink_dcache_memory).
> It holds this reference while calling iput on an inode. This can
> take an arbitrarily long time to complete, especially if NFS
> needs to wait for some RPCs to complete or timeout.
>
> shrink_dcache_parent skips over dentries which have a reference,
> such as the one held by prune_one_dentry.
>
> Thus umount can find that an inode is still in use (by its dentry,
> which was skipped) and will complain. Worse, when the NFS request
> on some inode finally completes, it might find the superblock
> doesn't exist any more and... oops.
>
> My proposed solution to the problem is never to expose the reference
> held by prune_one_dentry. i.e. keep the spin_lock held.

This morning I wondered why I was using a list to drop the dentries' inodes
after dput() has visited all parents.

It is not enough to fix prune_one_dentry() to hold the lock until the parent
is dereferenced, since prune_one_dentry() calls __dput_locked(). And in
__dput_locked() we have the same problem again:

from __dput_locked():

list_del(&dentry->d_u.d_child);
dentry_stat.nr_dentry--; /* For d_free, below */
/*drops the locks, at that point nobody can reach this dentry*/
-> dentry_iput(dentry);
parent = dentry->d_parent;
d_free(dentry);
if (dentry == parent)
return;
dentry = parent;
+
+ if (atomic_read(&dentry->d_count) == 1)
+ might_sleep();
+ if (!atomic_dec_and_lock(&dentry->d_count, &dcache_lock))
+ return;
+
+ spin_lock(&dentry->d_lock);
goto repeat;
}
}

Between -> and the atomic_dec_and_lock(), the reference count on the parent is
wrong and no lock is held. I fixed that by using d_lru to keep track of all
dentries whose inodes still have to be dereferenced. This should happen after
all parents have been dereferenced.

Regards,
Jan

--
Jan Blunck [email protected]
SuSE LINUX AG - A Novell company
Maxfeldstr. 5 +49-911-74053-608
D-90409 Nürnberg http://www.suse.de

2006-03-06 06:10:17

by NeilBrown

Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

On Thursday March 2, [email protected] wrote:
>
>
> Hi,
> This mail relates to the thread with the same subject which can be
> found at
>
> http://lkml.org/lkml/2006/1/16/279
>
> I would like to propose an alternate patch for the problem.
....
>
> Comments? Please :-?

Somewhere in among the comments (thanks), I realised that I was only
closing half the race. I had tried to make sure there were no stray
references to any dentries, but there is still the inode which is
being iput, and that can cause problems.

The following patch takes a totally different approach, is based on an
idea from Jan Kara, and is much less intrusive.

We:
- keep track of "who" is calling prune_dcache, and when a filesystem
is being unmounted (s_root == NULL) we only allow the unmount thread
to prune dentries.
- keep track of how many dentries are in the process of having
dentry_iput called on them for pruning
- don't allow umount to proceed until that count hits zero
- bias the count this way and that to make sure we get a wake_up at
the right time
- reuse 's_wait_unfrozen' to wait on the iput to complete.

Again, I'm very keen on feedback. This race is very hard to trigger,
so code review is the only real way to evaluate that patch.

Thanks,
NeilBrown


Signed-off-by: Neil Brown <[email protected]>

### Diffstat output
./fs/dcache.c | 17 +++++++++++++----
./fs/super.c | 11 +++++++++++
./include/linux/fs.h | 2 ++
3 files changed, 26 insertions(+), 4 deletions(-)

diff ./fs/dcache.c~current~ ./fs/dcache.c
--- ./fs/dcache.c~current~ 2006-03-06 16:54:59.000000000 +1100
+++ ./fs/dcache.c 2006-03-06 16:55:33.000000000 +1100
@@ -366,6 +366,7 @@ static inline void prune_one_dentry(stru
{
struct dentry * parent;

+ dentry->d_sb->s_pending_iputs ++;
__d_drop(dentry);
list_del(&dentry->d_u.d_child);
dentry_stat.nr_dentry--; /* For d_free, below */
@@ -375,6 +376,9 @@ static inline void prune_one_dentry(stru
if (parent != dentry)
dput(parent);
spin_lock(&dcache_lock);
+ dentry->d_sb->s_pending_iputs --;
+ if (dentry->d_sb->s_pending_iputs < 0)
+ wake_up(&dentry->d_sb->s_wait_unfrozen);
}

/**
@@ -390,7 +394,7 @@ static inline void prune_one_dentry(stru
* all the dentries are in use.
*/

-static void prune_dcache(int count)
+static void prune_dcache(int count, struct dentry *parent)
{
spin_lock(&dcache_lock);
for (; count ; count--) {
@@ -407,6 +411,11 @@ static void prune_dcache(int count)
dentry_stat.nr_unused--;
dentry = list_entry(tmp, struct dentry, d_lru);

+ if (dentry->d_sb->s_root == NULL &&
+ (parent == NULL ||
+ parent->d_sb != dentry->d_sb))
+ continue;
+
spin_lock(&dentry->d_lock);
/*
* We found an inuse dentry which was not removed from
@@ -635,7 +644,7 @@ void shrink_dcache_parent(struct dentry
int found;

while ((found = select_parent(parent)) != 0)
- prune_dcache(found);
+ prune_dcache(found, parent);
}

/**
@@ -673,7 +682,7 @@ void shrink_dcache_anon(struct hlist_hea
}
}
spin_unlock(&dcache_lock);
- prune_dcache(found);
+ prune_dcache(found, NULL);
} while(found);
}

@@ -694,7 +703,7 @@ static int shrink_dcache_memory(int nr,
if (nr) {
if (!(gfp_mask & __GFP_FS))
return -1;
- prune_dcache(nr);
+ prune_dcache(nr, NULL);
}
return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
}

diff ./fs/super.c~current~ ./fs/super.c
--- ./fs/super.c~current~ 2006-03-06 16:54:59.000000000 +1100
+++ ./fs/super.c 2006-03-06 16:57:19.000000000 +1100
@@ -230,7 +230,18 @@ void generic_shutdown_super(struct super
struct super_operations *sop = sb->s_op;

if (root) {
+ spin_lock(&dcache_lock);
+ /* disable stray dputs */
sb->s_root = NULL;
+
+ /* trigger a wake_up */
+ sb->s_pending_iputs --;
+ spin_unlock(&dcache_lock);
+ wait_event(sb->s_wait_unfrozen,
+ sb->s_pending_iputs < 0);
+ /* avoid further wakeups */
+ sb->s_pending_iputs = 65000;
+
shrink_dcache_parent(root);
shrink_dcache_anon(&sb->s_anon);
dput(root);

diff ./include/linux/fs.h~current~ ./include/linux/fs.h
--- ./include/linux/fs.h~current~ 2006-03-06 16:54:59.000000000 +1100
+++ ./include/linux/fs.h 2006-03-06 12:49:55.000000000 +1100
@@ -833,6 +833,8 @@ struct super_block {
struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */
struct list_head s_files;

+ int s_pending_iputs;
+
struct block_device *s_bdev;
struct list_head s_instances;
struct quota_info s_dquot; /* Diskquota specific options */

2006-03-06 07:32:51

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

> Somewhere in among the comments (thanks), I realised that I was only
> closing half the race. I had tried to make sure there were no stray
> references to any dentries, but there is still the inode which is
> being iput which can cause problem.
>
> The following patch takes a totally different approach, is based on an
> idea from Jan Kara, and is much less intrusive.
>
> We:
> - keep track of "who" is calling prune_dcache, and when a filesystem
> is being unmounted (s_root == NULL) we only allow the unmount thread
> to prune dentries.
> - keep track of how many dentries are in the process of having
> dentry_iput called on them for pruning
> - don't allow umount to proceed until that count hits zero
> - bias the count this way and that to make sure we get a wake_up at
> the right time
> - reuse 's_wait_unfrozen' to wait on the iput to complete.
>
> Again, I'm very keen on feedback. This race is very hard to trigger,
> so code review is the only real way to evaluate that patch.
>
> Thanks,
> NeilBrown
>

The code changes look big, have you looked at
http://marc.theaimsgroup.com/?l=linux-kernel&m=113817279225962&w=2

Some top of the head feedback below. Will try and do a detailed review later.

>
> Signed-off-by: Neil Brown <[email protected]>
>
> ### Diffstat output
> ./fs/dcache.c | 17 +++++++++++++----
> ./fs/super.c | 11 +++++++++++
> ./include/linux/fs.h | 2 ++
> 3 files changed, 26 insertions(+), 4 deletions(-)
>
> diff ./fs/dcache.c~current~ ./fs/dcache.c
> --- ./fs/dcache.c~current~ 2006-03-06 16:54:59.000000000 +1100
> +++ ./fs/dcache.c 2006-03-06 16:55:33.000000000 +1100
> @@ -366,6 +366,7 @@ static inline void prune_one_dentry(stru
> {
> struct dentry * parent;
>
> + dentry->d_sb->s_pending_iputs ++;
> __d_drop(dentry);
> list_del(&dentry->d_u.d_child);
> dentry_stat.nr_dentry--; /* For d_free, below */
> @@ -375,6 +376,9 @@ static inline void prune_one_dentry(stru
> if (parent != dentry)
> dput(parent);
> spin_lock(&dcache_lock);
> + dentry->d_sb->s_pending_iputs --;
> + if (dentry->d_sb->s_pending_iputs < 0)
> + wake_up(&dentry->d_sb->s_wait_unfrozen);
> }
>
> /**
> @@ -390,7 +394,7 @@ static inline void prune_one_dentry(stru
> * all the dentries are in use.
> */
>
> -static void prune_dcache(int count)
> +static void prune_dcache(int count, struct dentry *parent)
> {
> spin_lock(&dcache_lock);
> for (; count ; count--) {
> @@ -407,6 +411,11 @@ static void prune_dcache(int count)
> dentry_stat.nr_unused--;
> dentry = list_entry(tmp, struct dentry, d_lru);
>
> + if (dentry->d_sb->s_root == NULL &&
> + (parent == NULL ||
> + parent->d_sb != dentry->d_sb))
> + continue;
> +
> spin_lock(&dentry->d_lock);
> /*
> * We found an inuse dentry which was not removed from
> @@ -635,7 +644,7 @@ void shrink_dcache_parent(struct dentry
> int found;
>
> while ((found = select_parent(parent)) != 0)
> - prune_dcache(found);
> + prune_dcache(found, parent);
> }
>
> /**
> @@ -673,7 +682,7 @@ void shrink_dcache_anon(struct hlist_hea
> }
> }
> spin_unlock(&dcache_lock);
> - prune_dcache(found);
> + prune_dcache(found, NULL);
> } while(found);
> }
>
> @@ -694,7 +703,7 @@ static int shrink_dcache_memory(int nr,
> if (nr) {
> if (!(gfp_mask & __GFP_FS))
> return -1;
> - prune_dcache(nr);
> + prune_dcache(nr, NULL);
> }
> return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
> }
>
> diff ./fs/super.c~current~ ./fs/super.c
> --- ./fs/super.c~current~ 2006-03-06 16:54:59.000000000 +1100
> +++ ./fs/super.c 2006-03-06 16:57:19.000000000 +1100
> @@ -230,7 +230,18 @@ void generic_shutdown_super(struct super
> struct super_operations *sop = sb->s_op;
>
> if (root) {
> + spin_lock(&dcache_lock);
> + /* disable stray dputs */
> sb->s_root = NULL;
> +
> + /* trigger a wake_up */
> + sb->s_pending_iputs --;
> + spin_unlock(&dcache_lock);
> + wait_event(sb->s_wait_unfrozen,
> + sb->s_pending_iputs < 0);
> + /* avoid further wakeups */
> + sb->s_pending_iputs = 65000;

This looks a bit ugly, what is 65000?

> +
> shrink_dcache_parent(root);
> shrink_dcache_anon(&sb->s_anon);
> dput(root);
>
> diff ./include/linux/fs.h~current~ ./include/linux/fs.h
> --- ./include/linux/fs.h~current~ 2006-03-06 16:54:59.000000000 +1100
> +++ ./include/linux/fs.h 2006-03-06 12:49:55.000000000 +1100
> @@ -833,6 +833,8 @@ struct super_block {
> struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */
> struct list_head s_files;
>
> + int s_pending_iputs;
> +
> struct block_device *s_bdev;
> struct list_head s_instances;
> struct quota_info s_dquot; /* Diskquota specific options */

2006-03-06 11:52:36

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

>>Comments? Please :-?
ok :) see inline.

> Somewhere in among the comments (thanks), I realised that I was only
> closing half the race. I had tried to make sure there were no stray
> references to any dentries, but there is still the inode which is
> being iput which can cause problem.
>
> The following patch takes a totally different approach, is based on an
> idea from Jan Kara, and is much less intrusive.
>
> We:
> - keep track of "who" is calling prune_dcache, and when a filesystem
> is being unmounted (s_root == NULL) we only allow the unmount thread
> to prune dentries.
> - keep track of how many dentries are in the process of having
> dentry_iput called on them for pruning
> - don't allow umount to proceed until that count hits zero
> - bias the count this way and that to make sure we get a wake_up at
> the right time
> - reuse 's_wait_unfrozen' to wait on the iput to complete.
>
> Again, I'm very keen on feedback. This race is very hard to trigger,
> so code review is the only real way to evaluate that patch.
hmm... It's not actually that hard. I suppose no one ever tried except us.
We have a test which does mounts/umounts and lots of other activity
which creates memory pressure. AFAIR, it takes us ~3 hours to trigger this
bug on SMP.

In general, your patch still does what mine does, so I will be happy if
either of them is committed mainstream. In future, please keep the
reference to the original authors; this also makes sure that I'm on CC
if something goes wrong.

Thanks,
Kirill

> Signed-off-by: Neil Brown <[email protected]>
>
> ### Diffstat output
> ./fs/dcache.c | 17 +++++++++++++----
> ./fs/super.c | 11 +++++++++++
> ./include/linux/fs.h | 2 ++
> 3 files changed, 26 insertions(+), 4 deletions(-)
>
> diff ./fs/dcache.c~current~ ./fs/dcache.c
> --- ./fs/dcache.c~current~ 2006-03-06 16:54:59.000000000 +1100
> +++ ./fs/dcache.c 2006-03-06 16:55:33.000000000 +1100
> @@ -366,6 +366,7 @@ static inline void prune_one_dentry(stru
> {
> struct dentry * parent;
>
> + dentry->d_sb->s_pending_iputs ++;
> __d_drop(dentry);
> list_del(&dentry->d_u.d_child);
> dentry_stat.nr_dentry--; /* For d_free, below */
> @@ -375,6 +376,9 @@ static inline void prune_one_dentry(stru
> if (parent != dentry)
> dput(parent);
> spin_lock(&dcache_lock);
> + dentry->d_sb->s_pending_iputs --;
> + if (dentry->d_sb->s_pending_iputs < 0)
> + wake_up(&dentry->d_sb->s_wait_unfrozen);
> }
>
> /**
> @@ -390,7 +394,7 @@ static inline void prune_one_dentry(stru
> * all the dentries are in use.
> */
>
> -static void prune_dcache(int count)
> +static void prune_dcache(int count, struct dentry *parent)
> {
> spin_lock(&dcache_lock);
> for (; count ; count--) {
> @@ -407,6 +411,11 @@ static void prune_dcache(int count)
> dentry_stat.nr_unused--;
> dentry = list_entry(tmp, struct dentry, d_lru);
>
> + if (dentry->d_sb->s_root == NULL &&
> + (parent == NULL ||
> + parent->d_sb != dentry->d_sb))
> + continue;
<<<<
- we select some dentries in select_parent() and then try to prune N
dentries. But these can be N *other* dentries :/ It's not a problem with
your code, but... adding a 'parent' arg to prune_dcache() makes me feel
the same as the 'found' arg: you don't actually use it the right way. You
don't prune only its children, you see? It is better to pass an 'sb' arg then.
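The suggested skip test with an 'sb' argument can be sketched as a stand-alone predicate. This is a simplified model, not the kernel structures: a dentry on an sb that is being unmounted (s_root == NULL) may only be pruned by the caller that passed that sb in, i.e. the unmount thread:

```c
#include <assert.h>
#include <stddef.h>

struct sb_m   { void *s_root; };       /* models super_block->s_root */
struct dent_m { struct sb_m *d_sb; };  /* models dentry->d_sb */

/* Returns 1 if this caller is allowed to prune the dentry. */
static int may_prune(const struct dent_m *dentry, const struct sb_m *sb)
{
    if (dentry->d_sb->s_root != NULL)
        return 1;               /* normal, mounted sb: anyone may prune */
    return sb == dentry->d_sb;  /* unmounting: only the umount caller */
}
```

Compared to passing 'parent', this makes the intent explicit: the test depends only on which superblock the caller is shrinking, not on the dentry tree position.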

> +
> spin_lock(&dentry->d_lock);
> /*
> * We found an inuse dentry which was not removed from
> @@ -635,7 +644,7 @@ void shrink_dcache_parent(struct dentry
> int found;
>
> while ((found = select_parent(parent)) != 0)
> - prune_dcache(found);
> + prune_dcache(found, parent);
> }
>
> /**
> @@ -673,7 +682,7 @@ void shrink_dcache_anon(struct hlist_hea
> }
> }
> spin_unlock(&dcache_lock);
> - prune_dcache(found);
> + prune_dcache(found, NULL);
> } while(found);
> }
>
> @@ -694,7 +703,7 @@ static int shrink_dcache_memory(int nr,
> if (nr) {
> if (!(gfp_mask & __GFP_FS))
> return -1;
> - prune_dcache(nr);
> + prune_dcache(nr, NULL);
> }
> return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
> }
>
> diff ./fs/super.c~current~ ./fs/super.c
> --- ./fs/super.c~current~ 2006-03-06 16:54:59.000000000 +1100
> +++ ./fs/super.c 2006-03-06 16:57:19.000000000 +1100
> @@ -230,7 +230,18 @@ void generic_shutdown_super(struct super
> struct super_operations *sop = sb->s_op;
>
> if (root) {
> + spin_lock(&dcache_lock);
> + /* disable stray dputs */
> sb->s_root = NULL;
> +
> + /* trigger a wake_up */
> + sb->s_pending_iputs --;
> + spin_unlock(&dcache_lock);
> + wait_event(sb->s_wait_unfrozen,
> + sb->s_pending_iputs < 0);
> + /* avoid further wakeups */
> + sb->s_pending_iputs = 65000;
<<< it's ugly... :( why don't you do the wait after the shrink calls and
dput(root) below?

> +
> shrink_dcache_parent(root);
> shrink_dcache_anon(&sb->s_anon);
> dput(root);
>
> diff ./include/linux/fs.h~current~ ./include/linux/fs.h
> --- ./include/linux/fs.h~current~ 2006-03-06 16:54:59.000000000 +1100
> +++ ./include/linux/fs.h 2006-03-06 12:49:55.000000000 +1100
> @@ -833,6 +833,8 @@ struct super_block {
> struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */
> struct list_head s_files;
>
> + int s_pending_iputs;
> +
> struct block_device *s_bdev;
> struct list_head s_instances;
> struct quota_info s_dquot; /* Diskquota specific options */
>


2006-03-06 11:56:06

by Jan Blunck

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

On Mon, Mar 06, Neil Brown wrote:

> On Thursday March 2, [email protected] wrote:
> >
> >
> > Hi,
> > This mail relates to the thread with the same subject which can be
> > found at
> >
> > http://lkml.org/lkml/2006/1/16/279
> >
> > I would like to propose an alternate patch for the problem.
> ....
> >
> > Comments? Please :-?
>
> Somewhere in among the comments (thanks), I realised that I was only
> closing half the race. I had tried to make sure there were no stray
> references to any dentries, but there is still the inode which is
> being iput which can cause problem.
>
> The following patch takes a totally different approach, is based on an
> idea from Jan Kara, and is much less intrusive.
>
> We:
> - keep track of "who" is calling prune_dcache, and when a filesystem
> is being unmounted (s_root == NULL) we only allow the unmount thread
> to prune dentries.
> - keep track of how many dentries are in the process of having
> dentry_iput called on them for pruning
> - don't allow umount to proceed until that count hits zero
> - bias the count this way and that to make sure we get a wake_up at
> the right time
> - reuse 's_wait_unfrozen' to wait on the iput to complete.
>
> Again, I'm very keen on feedback. This race is very hard to trigger,
> so code review is the only real way to evaluate that patch.
>

Just ask Olaf. AFAIK he was the one who triggered it frequently.

These are two different problems, which you address with this and your first
patch. This one prevents busy inodes on umount; the first one was to get
the reference counting on dentries right.

Neil, did you actually read my patch for this one?!
http://marc.theaimsgroup.com/?l=linux-kernel&m=114123870406751&w=2

> diff ./fs/super.c~current~ ./fs/super.c
> --- ./fs/super.c~current~ 2006-03-06 16:54:59.000000000 +1100
> +++ ./fs/super.c 2006-03-06 16:57:19.000000000 +1100
> @@ -230,7 +230,18 @@ void generic_shutdown_super(struct super
> struct super_operations *sop = sb->s_op;
>
> if (root) {
> + spin_lock(&dcache_lock);
> + /* disable stray dputs */
> sb->s_root = NULL;
> +
> + /* trigger a wake_up */
> + sb->s_pending_iputs --;
> + spin_unlock(&dcache_lock);
> + wait_event(sb->s_wait_unfrozen,
> + sb->s_pending_iputs < 0);
> + /* avoid further wakeups */
> + sb->s_pending_iputs = 65000;
> +
> shrink_dcache_parent(root);
> shrink_dcache_anon(&sb->s_anon);
> dput(root);
>

What I don't like is that you are serializing the work of the shrink_dcache_*
calls, although they could work in parallel on different processors.

Regards,
Jan

--
Jan Blunck [email protected]
SuSE LINUX AG - A Novell company
Maxfeldstr. 5 +49-911-74053-608
D-90409 Nürnberg http://www.suse.de

2006-03-07 01:59:29

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

On Monday March 6, [email protected] wrote:
> > Somewhere in among the comments (thanks), I realised that I was only
> > closing half the race. I had tried to make sure there were no stray
> > references to any dentries, but there is still the inode which is
> > being iput which can cause problem.
> >
> > The following patch takes a totally different approach, is based on an
> > idea from Jan Kara, and is much less intrusive.
> >
> > We:
> > - keep track of "who" is calling prune_dcache, and when a filesystem
> > is being unmounted (s_root == NULL) we only allow the unmount thread
> > to prune dentries.
> > - keep track of how many dentries are in the process of having
> > dentry_iput called on them for pruning
> > - don't allow umount to proceed until that count hits zero
> > - bias the count this way and that to make sure we get a wake_up at
> > the right time
> > - reuse 's_wait_unfrozen' to wait on the iput to complete.
> >
> > Again, I'm very keen on feedback. This race is very hard to trigger,
> > so code review is the only real way to evaluate that patch.
> >
> > Thanks,
> > NeilBrown
> >
>
> The code changes look big, have you looked at
> http://marc.theaimsgroup.com/?l=linux-kernel&m=113817279225962&w=2

No I haven't. I like it.
- Holding the semaphore shouldn't be a problem.
- calling down_read_trylock ought to be fast
- I *think* the unwanted calls to prune_dcache are always under
PF_MEMALLOC - they certainly seem to be.

And it is a nice small change.
Have you had any other feedback on this?


>
> Some top of the head feedback below. Will try and do a detailed review later.
>

> > + /* avoid further wakeups */
> > + sb->s_pending_iputs = 65000;
>
> This looks a bit ugly, what is 65000?

Just the first big number that came to my head... probably not needed.


NeilBrown

2006-03-07 02:02:15

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

On Monday March 6, [email protected] wrote:
>
> In general your patch is still does what mine do, so I will be happy if
> any of this is commited mainstream. In future, please, keep the
> reference to original authors, this will also make sure that I'm on CC
> if something goes wrong.

Sorry: which 'original author' did I miss ?

> >
> > + if (dentry->d_sb->s_root == NULL &&
> > + (parent == NULL ||
> > + parent->d_sb != dentry->d_sb))
> > + continue;
> <<<<
> - we select some dentries in select_parent and then try to prune N
> dentries. But this can be other N dentries :/ It's not the problem with
> your code, but... adding 'parent' arg to prune_dcache() makes me feel
> the same as with 'found' arg: you don't use it actually right way. You
> don't prune only its children, you see? It is better to pass 'sb' arg then.
>

That's a valid point. I should be passing 'sb'.

Thanks,
NeilBrown

2006-03-07 02:16:46

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

On Monday March 6, [email protected] wrote:
>
> This are two different problems which you adress with this and your first
> patch. This one is to prevent busy inodes on umouny, the first one was to get
> the reference counting on dentries right.

I think that solving the "busy inodes" problem is sufficient. The
reference count on dentries isn't *wrong* as someone is actually
holding a reference. It is just that generic_shutdown_super doesn't
expect anyone else to hold any references. Fixing the "busy inodes"
problem means that no-one else will be holding any references, so it
becomes a non-problem.
>
> Neil, did you actually read my patch for this one?!
> http://marc.theaimsgroup.com/?l=linux-kernel&m=114123870406751&w=2

No, I didn't :-( I obviously didn't do enough homework.

The significant differences seem to be:
- you test ->s_prunes inside the spinlock. I don't bother. Yours is
probably safer.
- You call wake_up every time through prune_one_dentry while I try to
limit the calls. As each call costs a function call and a spinlock,
maybe at least guard it with
if (waitqueue_active()) ...

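The waitqueue_active() guard suggested above is the usual "cheap check before expensive wakeup" pattern. A minimal userspace model (the counter of waiters and the bookkeeping are hypothetical stand-ins for the real waitqueue and its spinlock):

```c
struct wq_model {
    int waiters;    /* models whether the waitqueue has sleepers */
    int wakeups;    /* counts how many real wake_up calls were made */
};

/* Cheap, lock-free check: anyone sleeping on the queue? */
static int waitqueue_active_model(const struct wq_model *wq)
{
    return wq->waiters != 0;
}

/* Stands in for the real wake_up(): function call + spinlock + scan. */
static void wake_up_model(struct wq_model *wq)
{
    wq->wakeups++;
}

/* Guarded form: pay for the wakeup only when it can do anything. */
static void guarded_wake(struct wq_model *wq)
{
    if (waitqueue_active_model(wq))
        wake_up_model(wq);
}
```

In the prune_one_dentry() case this matters because the common path has no umount waiting at all, so the guard turns a per-dentry lock/unlock into a single read.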

>
> What I don't like, is that you are serializing the work of shrink_dcache_*
> although they could work in parallel on different processors.

I don't see how I am serialising anything. Multiple shrink_dcache_*
calls can still run. The only place that extra locking is done is in
generic_shutdown_super.

But what do you think of Balbir Singh's patch? I think it is less
intrusive and solves the problem just as well.

Thanks,
NeilBrown


2006-03-07 02:49:28

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

> No I haven't. I like it.
> - Holding the semaphore shouldn't be a problem.
> - calling down_read_trylock ought to be fast
> - I *think* the unwanted calls to prune_dcache are always under
> PF_MEMALLOC - they certainly seem to be.
>
> And it is a nice small change.
> Have you had any other feedback on this?
>
>

Thanks, I do not have any feedback on it, but I am certainly hungry for it :-)

> >
> > Some top of the head feedback below. Will try and do a detailed review later.
> >
>
> > > + /* avoid further wakeups */
> > > + sb->s_pending_iputs = 65000;
> >
> > This looks a bit ugly, what is 65000?
>
> Just the first big number that came to by head... probably not needed.
>

ok, I would rather use a const or a #define and hide it under a
meaningful name, with comments. And if it is not needed at all, nothing
beats avoiding the magic number entirely.

>
> NeilBrown
>

2006-03-07 06:18:04

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

>>>>+ /* avoid further wakeups */
>>>>+ sb->s_pending_iputs = 65000;
>>>
>>>This looks a bit ugly, what is 65000?
>>
>>Just the first big number that came to by head... probably not needed.
>>
>
>
> ok, I would rather use a const or a #define and hide it under a
> meaningful name, with comments. If it is not needed, then nothing like
> avoiding magic numbers.
It looks like this assignment is not needed at all if the "wait for prunes"
step is moved after dput(root), since no more dentries should exist after
that point and no wakeup can happen.

Thanks,
Kirill

2006-03-07 06:11:55

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

>>The code changes look big, have you looked at
>>http://marc.theaimsgroup.com/?l=linux-kernel&m=113817279225962&w=2
>
>
> No I haven't. I like it.
> - Holding the semaphore shouldn't be a problem.
> - calling down_read_trylock ought to be fast
> - I *think* the unwanted calls to prune_dcache are always under
> PF_MEMALLOC - they certainly seem to be.
No, it looks like it is not :(
Have you noticed my comment about the "count" argument to prune_dcache()?
For example, prune_dcache() is called from shrink_dcache_parent(), which
is called in many places, and not all of them have PF_MEMALLOC or the
s_umount semaphore held for write. But prune_dcache() doesn't care about
super blocks etc. It simply shrinks the N dentries which are found _first_.

So the condition:
+ if ((current->flags & PF_MEMALLOC) &&
+ !(ret = down_read_trylock(&s->s_umount))) {
is not always true when the race occurs, as PF_MEMALLOC is not always set.

> And it is a nice small change.
> Have you had any other feedback on this?
here it is :)

Thanks,
Kirill

2006-03-07 06:16:19

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

>>In general your patch is still does what mine do, so I will be happy if
>>any of this is commited mainstream. In future, please, keep the
>>reference to original authors, this will also make sure that I'm on CC
>>if something goes wrong.
>
>
> Sorry: which 'original author' did I miss ?
I mean, it is better to mention the original author
(http://marc.theaimsgroup.com/?l=linux-kernel&m=114123870406751&w=2) in the
patch description, as that makes sure he will be on CC if this patch
is discussed again later. My patch fixing this race has been in the -mm tree
for half a year already.

Thanks,
Kirill


2006-03-07 07:03:47

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

On Tue, Mar 07, 2006 at 09:16:22AM +0300, Kirill Korotaev wrote:
> >>The code changes look big, have you looked at
> >>http://marc.theaimsgroup.com/?l=linux-kernel&m=113817279225962&w=2
> >
> >
> >No I haven't. I like it.
> > - Holding the semaphore shouldn't be a problem.
> > - calling down_read_trylock ought to be fast
> > - I *think* the unwanted calls to prune_dcache are always under
> > PF_MEMALLOC - they certainly seem to be.
> No, it looks as it is not :(
> Have you noticed my comment about "count" argument to prune_dcache()?
> For example, prune_dcache() is called from shrink_dcache_parent() which
> is called in many places and not all of them have PF_MEMALLOC or
> s_umount semaphore for write. But prune_dcache() doesn't care for super
> blocks etc. It simply shrinks N dentries which are found _first_.
>
> So the condition:
> + if ((current->flags & PF_MEMALLOC) &&
> + !(ret = down_read_trylock(&s->s_umount))) {
> is not always true when the race occurs, as PF_MEMALLOC is not always set.

I understand your comment about shrink_dcache_parent() being called
from several places. prune_one_dentry() would eventually dput the parent,
but unmount would go ahead and unmount the filesystem before the
dput of the parent could happen.

Given that background, I thought our main concern was with respect to
unmount. The race was between shrink_dcache_parent() (called from unmount)
and shrink_dcache_memory() (called from the allocator), hence the fix
for the race condition.

I just noticed that 2.6.16-rc* now seems to have drop_slab(), where
PF_MEMALLOC is not set. So we can still race with my fix if
/proc/sys/vm/drop_caches is written to and an unmount is done in parallel.

A simple hack would be to set PF_MEMALLOC in drop_slab(), but I do not
think it is a good idea.
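The hack being dismissed above would look roughly like the following: save the flag bit, set it around the call, and restore the previous state so a caller that already had it set is not clobbered. This is a userspace sketch on a plain flags word; the bit value and function names are hypothetical, not the kernel's:

```c
#define PF_MEMALLOC_M 0x0800u   /* illustrative bit, not the kernel value */

static void noop_drop_slab(void) { /* stands in for drop_slab() */ }

/* Run fn() with the PF_MEMALLOC-style bit set, restoring the caller's
 * prior state afterwards. */
static void with_memalloc(unsigned int *flags, void (*fn)(void))
{
    unsigned int saved = *flags & PF_MEMALLOC_M; /* remember prior state */

    *flags |= PF_MEMALLOC_M;
    fn();
    *flags = (*flags & ~PF_MEMALLOC_M) | saved;  /* restore exactly */
}
```

Even done carefully like this, the objection stands: PF_MEMALLOC changes allocator behavior globally for the task, so borrowing it purely as a "pruning allowed" marker couples two unrelated meanings.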

>
> >And it is a nice small change.
> >Have you had any other feedback on this?
> here it is :)
>

Thanks for your detailed feedback

> Thanks,
> Kirill
>

Regards,
Balbir

2006-03-07 07:16:58

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

>>No, it looks as it is not :(
>>Have you noticed my comment about "count" argument to prune_dcache()?
>>For example, prune_dcache() is called from shrink_dcache_parent() which
>>is called in many places and not all of them have PF_MEMALLOC or
>>s_umount semaphore for write. But prune_dcache() doesn't care for super
>>blocks etc. It simply shrinks N dentries which are found _first_.
>>
>>So the condition:
>>+ if ((current->flags & PF_MEMALLOC) &&
>>+ !(ret = down_read_trylock(&s->s_umount))) {
>>is not always true when the race occurs, as PF_MEMALLOC is not always set.
>
>
> I understand your comment about shrink_dcache_parent() being called
> from several places. prune_one_dentry() would eventually dput the parent,
> but unmount would go ahead and unmount the filesystem before the
> dput of the parent could happen.
exactly.

> Given that background, I thought our main concern was with respect to
> unmount. The race was between shrink_dcache_parent() (called from unmount)
> and shrink_dcache_memory() (called from the allocator), hence the fix
> for the race condition.
A partial fix doesn't make much sense from my point of view.

> I just noticied that 2.6.16-rc* now seems to have drop_slab() where
> PF_MEMALLOC is not set. So, we can still race with my fix if there
> if /proc/sys/vm/drop_caches is written to and unmount is done in parallel.
>
> A simple hack would be to set PF_MEMALLOC in drop_slab(), but I do not
> think it is a good idea.
Yeah, playing with PF_MEMALLOC might not be such a good idea :/
And as it doesn't help in the other cases, it looks unpromising...

>>>Have you had any other feedback on this?
>>here it is :)
> Thanks for your detailed feedback
Sorry that I did it too late :/

Thanks,
Kirill


2006-03-07 11:06:10

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

> >Given that background, I thought our main concern was with respect to
> >unmount. The race was between shrink_dcache_parent() (called from unmount)
> >and shrink_dcache_memory() (called from the allocator), hence the fix
> >for the race condition.
> Partial fix doesn't make much sense from my point of view.
>

IMHO, it was not a partial fix. The drop_slab() addition changed the
assumptions this fix was built on.

> >I just noticied that 2.6.16-rc* now seems to have drop_slab() where
> >PF_MEMALLOC is not set. So, we can still race with my fix if there
> >if /proc/sys/vm/drop_caches is written to and unmount is done in parallel.
> >
> >A simple hack would be to set PF_MEMALLOC in drop_slab(), but I do not
> >think it is a good idea.
> Yeah, playing with PF_MEMALLOC can be not so good idea :/
> And as it doesn't help in other cases it looks unpromising...

Yes, agreed.

>
> >>>Have you had any other feedback on this?
> >>here it is :)
> >Thanks for your detailed feedback
> Sorry, that I did it too late :/
>

No problem

> Thanks,
> Kirill
>

Balbir

2006-03-07 23:22:00

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

On Tuesday March 7, [email protected] wrote:
> >>In general your patch is still does what mine do, so I will be happy if
> >>any of this is commited mainstream. In future, please, keep the
> >>reference to original authors, this will also make sure that I'm on CC
> >>if something goes wrong.
> >
> >
> > Sorry: which 'original author' did I miss ?
> I mean, it is better to mention original author
> (http://marc.theaimsgroup.com/?l=linux-kernel&m=114123870406751&w=2) in
> patch description, as it makes sure that he will be on CC if this patch
> will be discussed later again. My patch fixing this race was in -mm tree
> for half a year already.

Which patch is that? The race still seems to be present in -mm.

NeilBrown

2006-03-08 00:30:13

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

On Tuesday March 7, [email protected] wrote:
> >>The code changes look big, have you looked at
> >>http://marc.theaimsgroup.com/?l=linux-kernel&m=113817279225962&w=2
> >
> >
> > No I haven't. I like it.
> > - Holding the semaphore shouldn't be a problem.
> > - calling down_read_trylock ought to be fast
> > - I *think* the unwanted calls to prune_dcache are always under
> > PF_MEMALLOC - they certainly seem to be.
> No, it looks as it is not :(
> Have you noticed my comment about "count" argument to prune_dcache()?
> For example, prune_dcache() is called from shrink_dcache_parent() which
> is called in many places and not all of them have PF_MEMALLOC or
> s_umount semaphore for write. But prune_dcache() doesn't care for super
> blocks etc. It simply shrinks N dentries which are found _first_.
>
> So the condition:
> + if ((current->flags & PF_MEMALLOC) &&
> + !(ret = down_read_trylock(&s->s_umount))) {
> is not always true when the race occurs, as PF_MEMALLOC is not always set.
>
> > And it is a nice small change.
> > Have you had any other feedback on this?
> here it is :)

Thanks....

So: we seem to have two different approaches to solving this problem.

One is to stop any other thread from calling dentry_iput while the
umount is running generic_shutdown_super. This cannot be done with
the PF_MEMALLOC trick and so would require calls to prune_dcache to
state their intentions (e.g. pass a 'struct super_block *').
With this approach, generic_shutdown_super needs to wait for stray
pruning to finish after marking the superblock as being unmounted, and
before shrinking the dcache.

The other is to allow other threads to call dentry_iput at any time,
but to keep track of them, and to wait for all to finish after
pruning all the filesystem's dentries. This requires the extra
locking in prune_one_dentry to make sure we dput the parent before
releasing dcache_lock.

Of these two, I prefer the former, because fiddling with the locking
in prune_one_dentry is rather intrusive.

Also, the recent
nfs-permit-filesystem-to-override-root-dentry-on-mount.patch
patch in -mm means that generic_shutdown_super does not call
prune_dcache but has its own pruning code. This means there is no
longer a need to pass intentions to prune_dcache. Prune_dcache can
always ignore dentries with s->root==NULL, and shrink_dcache_sb only
works with its own dentries.

So: blending the better bits of various patches together, I've come up
with the following. It still needs a changelog entry, but does anyone
want to ACK it ???

Thanks,
NeilBrown

Signed-off-by: Neil Brown <[email protected]>

### Diffstat output
./fs/dcache.c | 32 ++++++++++++++++++++++++++++++--
./fs/super.c | 3 +++
./include/linux/dcache.h | 1 +
./include/linux/fs.h | 3 +++
4 files changed, 37 insertions(+), 2 deletions(-)

diff ./fs/dcache.c~current~ ./fs/dcache.c
--- ./fs/dcache.c~current~ 2006-03-08 10:42:56.000000000 +1100
+++ ./fs/dcache.c 2006-03-08 11:24:11.000000000 +1100
@@ -364,17 +364,43 @@ restart:
*/
static inline void prune_one_dentry(struct dentry * dentry)
{
+ struct super_block *sb = dentry->d_sb;
struct dentry * parent;

__d_drop(dentry);
list_del(&dentry->d_u.d_child);
dentry_stat.nr_dentry--; /* For d_free, below */
+ sb->s_prunes++;
dentry_iput(dentry);
parent = dentry->d_parent;
d_free(dentry);
if (parent != dentry)
dput(parent);
spin_lock(&dcache_lock);
+ sb->s_prunes--;
+ if (waitqueue_active(&sb->s_wait_prunes))
+ wake_up(&sb->s_wait_prunes);
+}
+
+/* As prune_one_dentry can hold an inode while calling
+ * dentry_iput, generic_shutdown_super needs to wait for any
+ * pending pruning to stop before doing its own dentry
+ * pruning.
+ */
+void wait_on_prunes(struct super_block *sb)
+{
+ DEFINE_WAIT(w);
+ spin_lock(&dcache_lock);
+ for (;;) {
+ prepare_to_wait(&sb->s_wait_prunes, &w, TASK_UNINTERRUPTIBLE);
+ if (sb->s_prunes == 0)
+ break;
+ spin_unlock(&dcache_lock);
+ schedule();
+ spin_lock(&dcache_lock);
+ }
+ spin_unlock(&dcache_lock);
+ finish_wait(&sb->s_wait_prunes, &w);
}

/**
@@ -417,8 +443,10 @@ static void prune_dcache(int count)
spin_unlock(&dentry->d_lock);
continue;
}
- /* If the dentry was recently referenced, don't free it. */
- if (dentry->d_flags & DCACHE_REFERENCED) {
+ /* If the dentry was recently referenced, or if the filesystem
+ * is being unmounted, don't free it. */
+ if ((dentry->d_flags & DCACHE_REFERENCED) ||
+ dentry->d_sb->s_root == NULL) {
dentry->d_flags &= ~DCACHE_REFERENCED;
list_add(&dentry->d_lru, &dentry_unused);
dentry_stat.nr_unused++;

diff ./fs/super.c~current~ ./fs/super.c
--- ./fs/super.c~current~ 2006-03-08 10:50:37.000000000 +1100
+++ ./fs/super.c 2006-03-08 11:02:12.000000000 +1100
@@ -81,6 +81,8 @@ static struct super_block *alloc_super(v
mutex_init(&s->s_dquot.dqio_mutex);
mutex_init(&s->s_dquot.dqonoff_mutex);
init_rwsem(&s->s_dquot.dqptr_sem);
+ s->s_prunes = 0;
+ init_waitqueue_head(&s->s_wait_prunes);
init_waitqueue_head(&s->s_wait_unfrozen);
s->s_maxbytes = MAX_NON_LFS;
s->dq_op = sb_dquot_ops;
@@ -231,6 +233,7 @@ void generic_shutdown_super(struct super

if (root) {
sb->s_root = NULL;
+ wait_on_prunes(sb);
shrink_dcache_sb(sb);
dput(root);
fsync_super(sb);

diff ./include/linux/dcache.h~current~ ./include/linux/dcache.h
--- ./include/linux/dcache.h~current~ 2006-03-08 11:07:50.000000000 +1100
+++ ./include/linux/dcache.h 2006-03-08 11:25:14.000000000 +1100
@@ -220,6 +220,7 @@ extern void shrink_dcache_sb(struct supe
extern void shrink_dcache_parent(struct dentry *);
extern void shrink_dcache_anon(struct hlist_head *);
extern int d_invalidate(struct dentry *);
+extern void wait_on_prunes(struct super_block *);

/* only used at mount-time */
extern struct dentry * d_alloc_root(struct inode *);

diff ./include/linux/fs.h~current~ ./include/linux/fs.h
--- ./include/linux/fs.h~current~ 2006-03-08 11:02:23.000000000 +1100
+++ ./include/linux/fs.h 2006-03-08 11:03:02.000000000 +1100
@@ -838,6 +838,9 @@ struct super_block {
struct list_head s_instances;
struct quota_info s_dquot; /* Diskquota specific options */

+ unsigned int s_prunes; /* protected by dcache_lock */
+ wait_queue_head_t s_wait_prunes;
+
int s_frozen;
wait_queue_head_t s_wait_unfrozen;



2006-03-08 02:17:43

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

> So: we seem to have two different approaches to solving this problem.
>
> One is to stop any other thread from calling dentry_iput while the
> umount is running generic_shutdown_super. This cannot be done with
> the PF_MEMALLOC trick and so would require calls to prune_dcache to
> state their intentions (e.g. pass a 'struct super_block *').
> With this approach, generic_shutdown_super needs to wait for stray
> pruning to finish after marking the superblock as being unmounted, and
> before shrinking the dcache.
>
> The other is to allow other threads to call dentry_iput at any time,
> but to keep track of them, and to wait for all to finish after
> pruning all the filesystem's dentries. This requires the extra
> locking in prune_one_dentry to make sure we dput the parent before
> releasing dcache_lock.
>
> Of these two, I prefer the former, because fiddling with the locking
> in prune_one_dentry is rather intrusive.
>
> Also, the recent
> nfs-permit-filesystem-to-override-root-dentry-on-mount.patch
> patch in -mm means that generic_shutdown_super does not call
> prune_dcache but has its own pruning code. This means there is no
> longer a need to pass intentions to prune_dcache. Prune_dcache can
> always ignore dentries with s->root==NULL, and shrink_dcache_sb only
> works with its own dentries.
>
> So: blending the better bits of various patches together, I've come up
> with the following. It still needs a changelog entry, but does anyone
> want to ACK it ???

I think this patch is much cleaner and more refined.

From yesterday's comments I am beginning to wonder if it is enough to solve
only the unmount race or should the fix be more generic to address the race
between the shrinker's call to shrink_dcache_memory() and shrink_dcache_parent().

From what I understand of the race, it occurs when the dput of the
parent fails to happen and, because the reference count is not 0,
shrink_dcache_parent() skips over those dentries. The race occurs for
the dentry and its ancestors above it.

I was wondering if the following would work (to solve the generic race)

Add a prune_mutex to the super-block. Hold on to it in prune_one_dentry()
until we hit a parent dentry that is a mount point (d_mounted > 0) or
the parent has a reference count > 1 or at the end of prune_one_dentry().
This should ensure that for each super block dentry counts are consistent.
Also get select_parent() to hold the super block's prune_mutex, so that it
sees a consistent view of the super block.
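For concreteness, a kernel-style pseudocode sketch of that proposal might look like the following; s_prune_mutex is a hypothetical field (it appears in no posted patch), and the locking details around dcache_lock are elided:

```c
/* Sketch only: hold a per-superblock mutex across the whole upward
 * walk, releasing it only once the chain stops at a mount point or at
 * a parent that somebody else still holds. */
static void prune_one_dentry(struct dentry *dentry)
{
	struct super_block *sb = dentry->d_sb;
	struct dentry *parent;

	mutex_lock(&sb->s_prune_mutex);		/* hypothetical field */
	for (;;) {
		__d_drop(dentry);
		list_del(&dentry->d_u.d_child);
		parent = dentry->d_parent;
		dentry_iput(dentry);
		d_free(dentry);
		if (parent == dentry || d_mountpoint(parent) ||
		    atomic_read(&parent->d_count) > 1)
			break;			/* chain ends here */
		dentry = parent;
	}
	mutex_unlock(&sb->s_prune_mutex);
	/* select_parent() would take s_prune_mutex too, so it always
	 * sees consistent reference counts for this superblock. */
}
```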

Oh! now that I think about it, I think your solution looks like an
elegant way to do the same thing. The only drawback is that it solves
only the unmount race, and there are some changes in generic_shutdown_super()
which I do not understand.

Please find some comments for your patch below

>
> Thanks,
> NeilBrown
>
> Signed-off-by: Neil Brown <[email protected]>
>
> ### Diffstat output
> ./fs/dcache.c | 32 ++++++++++++++++++++++++++++++--
> ./fs/super.c | 3 +++
> ./include/linux/dcache.h | 1 +
> ./include/linux/fs.h | 3 +++
> 4 files changed, 37 insertions(+), 2 deletions(-)
>
> diff ./fs/dcache.c~current~ ./fs/dcache.c
> --- ./fs/dcache.c~current~ 2006-03-08 10:42:56.000000000 +1100
> +++ ./fs/dcache.c 2006-03-08 11:24:11.000000000 +1100
> @@ -364,17 +364,43 @@ restart:
> */
> static inline void prune_one_dentry(struct dentry * dentry)
> {
> + struct super_block *sb = dentry->d_sb;
> struct dentry * parent;
>
> __d_drop(dentry);
> list_del(&dentry->d_u.d_child);
> dentry_stat.nr_dentry--; /* For d_free, below */
> + sb->s_prunes++;
> dentry_iput(dentry);
> parent = dentry->d_parent;
> d_free(dentry);
> if (parent != dentry)
> dput(parent);
> spin_lock(&dcache_lock);
> + sb->s_prunes--;
> + if (waitqueue_active(&sb->s_wait_prunes))
> + wake_up(&sb->s_wait_prunes);
> +}
> +
> +/* As prune_one_dentry can hold an inode while calling
> + * dentry_iput, generic_shutdown_super needs to wait for any
> + * pending pruning to stop before doing its own dentry
> + * pruning.
> + */
> +void wait_on_prunes(struct super_block *sb)
> +{
> + DEFINE_WAIT(w);
> + spin_lock(&dcache_lock);
> + for (;;) {
> + prepare_to_wait(&sb->s_wait_prunes, &w, TASK_UNINTERRUPTIBLE);
> + if (sb->s_prunes == 0)
> + break;
> + spin_unlock(&dcache_lock);
> + schedule();
> + spin_lock(&dcache_lock);
> + }
> + spin_unlock(&dcache_lock);
> + finish_wait(&sb->s_wait_prunes, &w);
> }
>
> /**
> @@ -417,8 +443,10 @@ static void prune_dcache(int count)
> spin_unlock(&dentry->d_lock);
> continue;
> }
> - /* If the dentry was recently referenced, don't free it. */
> - if (dentry->d_flags & DCACHE_REFERENCED) {
> + /* If the dentry was recently referenced, or if the filesystem
> + * is being unmounted, don't free it. */
> + if ((dentry->d_flags & DCACHE_REFERENCED) ||
> + dentry->d_sb->s_root == NULL) {
> dentry->d_flags &= ~DCACHE_REFERENCED;
> list_add(&dentry->d_lru, &dentry_unused);
> dentry_stat.nr_unused++;
>
> diff ./fs/super.c~current~ ./fs/super.c
> --- ./fs/super.c~current~ 2006-03-08 10:50:37.000000000 +1100
> +++ ./fs/super.c 2006-03-08 11:02:12.000000000 +1100
> @@ -81,6 +81,8 @@ static struct super_block *alloc_super(v
> mutex_init(&s->s_dquot.dqio_mutex);
> mutex_init(&s->s_dquot.dqonoff_mutex);
> init_rwsem(&s->s_dquot.dqptr_sem);
> + s->s_prunes = 0;
> + init_waitqueue_head(&s->s_wait_prunes);
> init_waitqueue_head(&s->s_wait_unfrozen);
> s->s_maxbytes = MAX_NON_LFS;
> s->dq_op = sb_dquot_ops;
> @@ -231,6 +233,7 @@ void generic_shutdown_super(struct super
>
> if (root) {
> sb->s_root = NULL;
> + wait_on_prunes(sb);
> shrink_dcache_sb(sb);

Hmm... in 2.6.16-rc5, I see

shrink_dcache_parent(root);
shrink_dcache_anon(&sb->s_anon);

without these calls, some dentries might not get moved to the LRU list.

Am I missing something here?

[snip]

Balbir

2006-03-08 02:40:44

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

On Wednesday March 8, [email protected] wrote:
> > So: blending the better bits of various patches together, I've come up
> > with the following. It still needs a changelog entry, but does anyone
> > want to ACK it ???
>
> I think this patch is much cleaner and more refined.

Thanks.

>
> From yesterday's comments I am beginning to wonder if it is enough to solve
> only the unmount race or should the fix be more generic to address the race
> between the shrinker's call to shrink_dcache_memory() and shrink_dcache_parent().
>
> From what I understand of the race, it occurs when the dput of the
> parent fails to happen and, because the reference count is not 0,
> shrink_dcache_parent() skips over those dentries. The race occurs for
> the dentry and its ancestors above it.

I think that in most cases the race doesn't matter: if
shrink_dcache_memory misses a dentry because someone else is holding a
temporary reference, it really doesn't matter.
Similarly most callers of shrink_dcache_parent are happy with a
best-effort.

Unmount is a special case because it wants to 'shrink' *all* of the
dentries for the filesystem. In that case, someone holding a
transient reference is a bad thing. In other cases it is, at best, a
minor inconvenience.

>
> I was wondering if the following would work (to solve the generic race)
>
> Add a prune_mutex to the super-block. Hold on to it in prune_one_dentry()
> until we hit a parent dentry that is a mount point (d_mounted > 0) or
> the parent has a reference count > 1 or at the end of prune_one_dentry().
> This should ensure that for each super block dentry counts are consistent.
> Also get select_parent() to hold the super block's prune_mutex, so that it
> sees a consistent view of the super block.
>
> Oh! now that I think about it, I think your solution looks like an
> elegant way to do the same thing. The only drawback is that it solves
> only the unmount race, and there are some changes in generic_shutdown_super()
> which I do not understand.
>

> > diff ./fs/super.c~current~ ./fs/super.c
> > --- ./fs/super.c~current~ 2006-03-08 10:50:37.000000000 +1100
> > +++ ./fs/super.c 2006-03-08 11:02:12.000000000 +1100
> > @@ -81,6 +81,8 @@ static struct super_block *alloc_super(v
> > mutex_init(&s->s_dquot.dqio_mutex);
> > mutex_init(&s->s_dquot.dqonoff_mutex);
> > init_rwsem(&s->s_dquot.dqptr_sem);
> > + s->s_prunes = 0;
> > + init_waitqueue_head(&s->s_wait_prunes);
> > init_waitqueue_head(&s->s_wait_unfrozen);
> > s->s_maxbytes = MAX_NON_LFS;
> > s->dq_op = sb_dquot_ops;
> > @@ -231,6 +233,7 @@ void generic_shutdown_super(struct super
> >
> > if (root) {
> > sb->s_root = NULL;
> > + wait_on_prunes(sb);
> > shrink_dcache_sb(sb);
>
> Hmm... in 2.6.16-rc5, I see
>
> shrink_dcache_parent(root);
> shrink_dcache_anon(&sb->s_anon);
>
> without these calls, some dentries might not get moved to the LRU list.
>
> Am I missing something here?

I should have been more explicit that the patch was against
2.6.16-rc5-mm2. This contains some dcache patches to allow nfs
filesystem to share superblocks, and one of the patches replaces the
calls to shrink_dcache_parent and shrink_dcache_anon with a single
call to a new function: shrink_dcache_sb.

Thanks for the feedback

NeilBrown

2006-03-08 03:05:14

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

> I think that in most cases the race doesn't matter: if
> shrink_dcache_memory misses a dentry because someone else is holding a
> temporary reference, it really doesn't matter.
> Similarly most callers of shrink_dcache_parent are happy with a
> best-effort.

I agree.

>
> I should have been more explicit that the patch was against
> 2.6.16-rc5-mm2. This contains some dcache patches to allow nfs
> filesystem to share superblocks, and one of the patches replaces the
> calls to shrink_dcache_parent and shrink_dcache_anon with a single
> call to a new function: shrink_dcache_sb.
>

shrink_dcache_parent() has been added back to generic_shutdown_super in
-mm3 (just checked). With that being the case, I have only one concern
with your patch:

wait_on_prunes() breaks out if sb->s_prunes == 0. What if shrink_dcache_parent()
now calls select_parent()? select_parent() could still find entries
with d_count > 0 and skip them, and shrink_dcache_memory() can still cause
the race condition to occur.

I think pushing wait_on_prunes() to after shrink_dcache_parent() will
most likely solve the race.
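If I read that suggestion correctly, the reordered shutdown path would look roughly like this (pseudocode against Neil's patch; whether shrink_dcache_sb alone suffices to mop up afterwards is my assumption, not something either mail states):

```c
/* Sketch of the suggested ordering in generic_shutdown_super(). */
if (root) {
	sb->s_root = NULL;
	shrink_dcache_parent(root);	/* best-effort pruning first */
	wait_on_prunes(sb);		/* then drain any prune that was
					 * in flight while we scanned and
					 * was skipped for d_count > 0 */
	shrink_dcache_sb(sb);		/* catch dentries released meanwhile */
	dput(root);
	fsync_super(sb);
}
```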


> Thanks for the feedback

You're welcome!

>
> NeilBrown

Balbir

--
I'm extremely grateful that hundreds of you have taken time to read these
patches, and to detect and report errors that you've found.
Your comments have helped me improve enormously. But I must confess that
I'm also disappointed to have had absolutely no feedback so far on several of
the patches on which I worked hardest when I was preparing these patches.
Could it be that (1) you've said nothing about them because I somehow managed
to get the details perfect? Or is it that (2) you shy away and are busy, hence
you are unable to spend more than a few minutes on any particular topic?
Although I do like to think that readers like to provide feedback, I fear that
hypothesis (1) is far less likely than hypothesis (2).

Adapted from Don Knuth's comments on feedback for his exercises

2006-03-08 11:01:39

by Jan Blunck

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

On Wed, Mar 08, Balbir Singh wrote:

>
> wait_on_prunes() breaks out if sb->s_prunes == 0. What if shrink_dcache_parent()
> now calls select_parent()? select_parent() could still find entries
> with d_count > 0 and skip them, and shrink_dcache_memory() can still cause
> the race condition to occur.
>
> I think pushing wait_on_prunes() to after shrink_dcache_parent() will
> most likely solve the race.
>

This is why I used to let shrink_dcache_parent() only return after an
unsuccessful select_parent() following a wait.

Regards,
Jan

--
Jan Blunck [email protected]
SuSE LINUX AG - A Novell company
Maxfeldstr. 5 +49-911-74053-608
D-90409 Nürnberg http://www.suse.de

2006-03-09 12:00:17

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super

--- linux-2.6.15.orig/fs/dcache.c 2006-01-03 06:21:10.000000000 +0300
+++ linux-2.6.15-025stab015-dcache-race/fs/dcache.c 2006-03-03 16:22:38.862706448 +0300
@@ -114,6 +114,75 @@ static inline void dentry_iput(struct de
}
}

+struct dcache_shrinker {
+ struct list_head list;
+ struct dentry *dentry;
+};
+
+DECLARE_WAIT_QUEUE_HEAD(dcache_shrinker_wq);
+
+/* called under dcache_lock */
+static void dcache_shrinker_add(struct dcache_shrinker *ds,
+ struct dentry *parent, struct dentry *dentry)
+{
+ struct super_block *sb;
+
+ sb = parent->d_sb;
+ ds->dentry = parent;
+ list_add(&ds->list, &sb->s_dshrinkers);
+}
+
+/* called under dcache_lock */
+static void dcache_shrinker_del(struct dcache_shrinker *ds)
+{
+ if (ds == NULL || list_empty(&ds->list))
+ return;
+
+ list_del_init(&ds->list);
+ wake_up_all(&dcache_shrinker_wq);
+}
+
+/* called under dcache_lock, drops inside */
+static void dcache_shrinker_wait(struct super_block *sb)
+{
+ DECLARE_WAITQUEUE(wq, current);
+
+ __set_current_state(TASK_UNINTERRUPTIBLE);
+ add_wait_queue(&dcache_shrinker_wq, &wq);
+ spin_unlock(&dcache_lock);
+
+ schedule();
+ remove_wait_queue(&dcache_shrinker_wq, &wq);
+ __set_current_state(TASK_RUNNING);
+}
+
+void dcache_shrinker_wait_sb(struct super_block *sb)
+{
+ /* the root dentry can be held in dput_recursive */
+ spin_lock(&dcache_lock);
+ while (!list_empty(&sb->s_dshrinkers)) {
+ dcache_shrinker_wait(sb);
+ spin_lock(&dcache_lock);
+ }
+ spin_unlock(&dcache_lock);
+}
+
+/* dcache_lock protects shrinker's list */
+static void shrink_dcache_racecheck(struct dentry *parent, int *racecheck)
+{
+ struct super_block *sb;
+ struct dcache_shrinker *ds;
+
+ sb = parent->d_sb;
+ list_for_each_entry(ds, &sb->s_dshrinkers, list) {
+ /* is one of dcache shrinkers working on the dentry? */
+ if (ds->dentry == parent) {
+ *racecheck = 1;
+ break;
+ }
+ }
+}
+
/*
* This is dput
*
@@ -132,8 +201,9 @@ static inline void dentry_iput(struct de
*/

/*
- * dput - release a dentry
- * @dentry: dentry to release
+ * dput_recursive - go upward through the dentry tree and release dentries
+ * @dentry: starting dentry
+ * @ds: shrinker to be added to active list (see shrink_dcache_parent)
*
* Release a dentry. This will drop the usage count and if appropriate
* call the dentry unlink method as well as removing it from the queues and
@@ -142,18 +212,15 @@ static inline void dentry_iput(struct de
*
* no dcache lock, please.
*/
-
-void dput(struct dentry *dentry)
+static void dput_recursive(struct dentry *dentry, struct dcache_shrinker *ds)
{
- if (!dentry)
- return;
-
-repeat:
if (atomic_read(&dentry->d_count) == 1)
might_sleep();
if (!atomic_dec_and_lock(&dentry->d_count, &dcache_lock))
return;
+ dcache_shrinker_del(ds);

+repeat:
spin_lock(&dentry->d_lock);
if (atomic_read(&dentry->d_count)) {
spin_unlock(&dentry->d_lock);
@@ -185,6 +252,7 @@ unhash_it:

kill_it: {
struct dentry *parent;
+ struct dcache_shrinker lds;

/* If dentry was on d_lru list
* delete it from there
@@ -194,18 +262,47 @@ kill_it: {
dentry_stat.nr_unused--;
}
list_del(&dentry->d_u.d_child);
+ parent = dentry->d_parent;
+ dcache_shrinker_add(&lds, parent, dentry);
dentry_stat.nr_dentry--; /* For d_free, below */
/*drops the locks, at that point nobody can reach this dentry */
dentry_iput(dentry);
- parent = dentry->d_parent;
d_free(dentry);
- if (dentry == parent)
+ if (unlikely(dentry == parent)) {
+ spin_lock(&dcache_lock);
+ dcache_shrinker_del(&lds);
+ spin_unlock(&dcache_lock);
return;
+ }
dentry = parent;
- goto repeat;
+ spin_lock(&dcache_lock);
+ dcache_shrinker_del(&lds);
+ if (atomic_dec_and_test(&dentry->d_count))
+ goto repeat;
+ spin_unlock(&dcache_lock);
}
}

+/*
+ * dput - release a dentry
+ * @dentry: dentry to release
+ *
+ * Release a dentry. This will drop the usage count and if appropriate
+ * call the dentry unlink method as well as removing it from the queues and
+ * releasing its resources. If the parent dentries were scheduled for release
+ * they too may now get deleted.
+ *
+ * no dcache lock, please.
+ */
+
+void dput(struct dentry *dentry)
+{
+ if (!dentry)
+ return;
+
+ dput_recursive(dentry, NULL);
+}
+
/**
* d_invalidate - invalidate a dentry
* @dentry: dentry to invalidate
@@ -362,19 +459,23 @@ restart:
* removed.
* Called with dcache_lock, drops it and then regains.
*/
-static inline void prune_one_dentry(struct dentry * dentry)
+static void prune_one_dentry(struct dentry * dentry)
{
struct dentry * parent;
+ struct dcache_shrinker ds;

__d_drop(dentry);
list_del(&dentry->d_u.d_child);
+ parent = dentry->d_parent;
+ dcache_shrinker_add(&ds, parent, dentry);
dentry_stat.nr_dentry--; /* For d_free, below */
dentry_iput(dentry);
parent = dentry->d_parent;
d_free(dentry);
if (parent != dentry)
- dput(parent);
+ dput_recursive(parent, &ds);
spin_lock(&dcache_lock);
+ dcache_shrinker_del(&ds);
}

/**
@@ -557,13 +658,12 @@ positive:
* drop the lock and return early due to latency
* constraints.
*/
-static int select_parent(struct dentry * parent)
+static int select_parent(struct dentry * parent, int * racecheck)
{
struct dentry *this_parent = parent;
struct list_head *next;
int found = 0;

- spin_lock(&dcache_lock);
repeat:
next = this_parent->d_subdirs.next;
resume:
@@ -605,6 +705,9 @@ dentry->d_parent->d_name.name, dentry->d
#endif
goto repeat;
}
+
+ if (!found && racecheck != NULL)
+ shrink_dcache_racecheck(dentry, racecheck);
}
/*
* All done at this level ... ascend and resume the search.
@@ -619,7 +722,6 @@ this_parent->d_parent->d_name.name, this
goto resume;
}
out:
- spin_unlock(&dcache_lock);
return found;
}

@@ -632,10 +734,66 @@ out:

void shrink_dcache_parent(struct dentry * parent)
{
- int found;
+ int found, r;

- while ((found = select_parent(parent)) != 0)
+ while (1) {
+ spin_lock(&dcache_lock);
+ found = select_parent(parent, NULL);
+ if (found)
+ goto found;
+
+ /*
+ * try again with a dput_recursive() race check.
+ * it returns quickly if everything was really shrunk
+ */
+ r = 0;
+ found = select_parent(parent, &r);
+ if (found)
+ goto found;
+ if (!r)
+ break;
+
+ /* drops the lock inside */
+ dcache_shrinker_wait(parent->d_sb);
+ continue;
+
+found:
+ spin_unlock(&dcache_lock);
prune_dcache(found);
+ }
+ spin_unlock(&dcache_lock);
+}
+
+/*
+ * Move any unused anon dentries to the end of the unused list.
+ * called under dcache_lock
+ */
+static int select_anon(struct hlist_head *head, int *racecheck)
+{
+ struct hlist_node *lp;
+ int found = 0;
+
+ hlist_for_each(lp, head) {
+ struct dentry *this = hlist_entry(lp, struct dentry, d_hash);
+ if (!list_empty(&this->d_lru)) {
+ dentry_stat.nr_unused--;
+ list_del_init(&this->d_lru);
+ }
+
+ /*
+ * move only zero ref count dentries to the end
+ * of the unused list for prune_dcache
+ */
+ if (!atomic_read(&this->d_count)) {
+ list_add_tail(&this->d_lru, &dentry_unused);
+ dentry_stat.nr_unused++;
+ found++;
+ }
+
+ if (!found && racecheck != NULL)
+ shrink_dcache_racecheck(this, racecheck);
+ }
+ return found;
}

/**
@@ -648,33 +806,36 @@ void shrink_dcache_parent(struct dentry
* done under dcache_lock.
*
*/
-void shrink_dcache_anon(struct hlist_head *head)
+void shrink_dcache_anon(struct super_block *sb)
{
- struct hlist_node *lp;
- int found;
- do {
- found = 0;
+ int found, r;
+
+ while (1) {
spin_lock(&dcache_lock);
- hlist_for_each(lp, head) {
- struct dentry *this = hlist_entry(lp, struct dentry, d_hash);
- if (!list_empty(&this->d_lru)) {
- dentry_stat.nr_unused--;
- list_del_init(&this->d_lru);
- }
+ found = select_anon(&sb->s_anon, NULL);
+ if (found)
+ goto found;

- /*
- * move only zero ref count dentries to the end
- * of the unused list for prune_dcache
- */
- if (!atomic_read(&this->d_count)) {
- list_add_tail(&this->d_lru, &dentry_unused);
- dentry_stat.nr_unused++;
- found++;
- }
- }
+ /*
+ * try again with a dput_recursive() race check.
+ * it returns quickly if everything was really shrunk
+ */
+ r = 0;
+ found = select_anon(&sb->s_anon, &r);
+ if (found)
+ goto found;
+ if (!r)
+ break;
+
+ /* drops the lock inside */
+ dcache_shrinker_wait(sb);
+ continue;
+
+found:
spin_unlock(&dcache_lock);
prune_dcache(found);
- } while(found);
+ }
+ spin_unlock(&dcache_lock);
}

/*
--- linux-2.6.15.orig/fs/super.c 2006-01-03 06:21:10.000000000 +0300
+++ linux-2.6.15-025stab015-dcache-race/fs/super.c 2006-03-03 16:22:38.841709640 +0300
@@ -69,6 +69,7 @@ static struct super_block *alloc_super(v
INIT_LIST_HEAD(&s->s_io);
INIT_LIST_HEAD(&s->s_files);
INIT_LIST_HEAD(&s->s_instances);
+ INIT_LIST_HEAD(&s->s_dshrinkers);
INIT_HLIST_HEAD(&s->s_anon);
INIT_LIST_HEAD(&s->s_inodes);
init_rwsem(&s->s_umount);
@@ -231,8 +232,9 @@ void generic_shutdown_super(struct super
if (root) {
sb->s_root = NULL;
shrink_dcache_parent(root);
- shrink_dcache_anon(&sb->s_anon);
+ shrink_dcache_anon(sb);
dput(root);
+ dcache_shrinker_wait_sb(sb);
fsync_super(sb);
lock_super(sb);
sb->s_flags &= ~MS_ACTIVE;
--- linux-2.6.15.orig/include/linux/dcache.h 2006-01-03 06:21:10.000000000 +0300
+++ linux-2.6.15-025stab015-dcache-race/include/linux/dcache.h 2006-03-03 16:22:38.843709336 +0300
@@ -209,7 +209,8 @@ extern struct dentry * d_alloc_anon(stru
extern struct dentry * d_splice_alias(struct inode *, struct dentry *);
extern void shrink_dcache_sb(struct super_block *);
extern void shrink_dcache_parent(struct dentry *);
-extern void shrink_dcache_anon(struct hlist_head *);
+extern void shrink_dcache_anon(struct super_block *);
+extern void dcache_shrinker_wait_sb(struct super_block *sb);
extern int d_invalidate(struct dentry *);

/* only used at mount-time */
--- linux-2.6.15.orig/include/linux/fs.h 2006-01-03 06:21:10.000000000 +0300
+++ linux-2.6.15-025stab015-dcache-race/include/linux/fs.h 2006-03-03 16:22:38.821712680 +0300
@@ -803,6 +803,7 @@ struct super_block {
struct list_head s_io; /* parked for writeback */
struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */
struct list_head s_files;
+ struct list_head s_dshrinkers; /* active dcache shrinkers */

struct block_device *s_bdev;
struct list_head s_instances;


Attachments:
diff-ms-dcache-race-20060303 (10.20 kB)