2006-03-09 16:58:36

by Jan Blunck

[permalink] [raw]
Subject: [PATCH] Fix shrink_dcache_parent() against shrink_dcache_memory() race (3rd updated patch)]

This is an updated version of the patch which adresses some issues that came
up during discussion.

The wait_on_prunes() code has moved to generic_shutdown_super() to avoid
hiding the mess in shrink_dcache_parent(). Recently updated the comments
to be more verbose about the goals of the modification.

Original patch description:

Kirill Korotaev <[email protected]> discovered a race between shrink_dcache_parent()
and shrink_dcache_memory() which leads to "Busy inodes after unmount".
When unmounting a file system shrink_dcache_parent() is racing against a
possible shrink_dcache_memory(). This might lead to the situation that
shrink_dcache_parent() is returning too early. In this situation the
super_block is destroyed before shrink_dcache_memory() could put the inode.

This patch fixes the problem through introducing a prunes counter which is
incremented when a dentry is pruned but the corresponding inoded isn't put
yet.When the prunes counter is not null, shrink_dcache_parent() is waiting and
restarting its work.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Kirill Korotaev <[email protected]>


Attachments:
(No filename) (1.08 kB)
umount-prune_one_dentry-fix.diff (4.53 kB)
Download all attachments

2006-03-09 17:03:13

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [PATCH] Fix shrink_dcache_parent() against shrink_dcache_memory() race (3rd updated patch)]

Andrew,

Acked-By: Kirill Korotaev <[email protected]>

> This is an updated version of the patch which adresses some issues that came
> up during discussion.
>
> The wait_on_prunes() code has moved to generic_shutdown_super() to avoid
> hiding the mess in shrink_dcache_parent(). Recently updated the comments
> to be more verbose about the goals of the modification.
>
> Original patch description:
>
> Kirill Korotaev <[email protected]> discovered a race between shrink_dcache_parent()
> and shrink_dcache_memory() which leads to "Busy inodes after unmount".
> When unmounting a file system shrink_dcache_parent() is racing against a
> possible shrink_dcache_memory(). This might lead to the situation that
> shrink_dcache_parent() is returning too early. In this situation the
> super_block is destroyed before shrink_dcache_memory() could put the inode.
>
> This patch fixes the problem through introducing a prunes counter which is
> incremented when a dentry is pruned but the corresponding inoded isn't put
> yet.When the prunes counter is not null, shrink_dcache_parent() is waiting and
> restarting its work.
>
> Signed-off-by: Jan Blunck <[email protected]>
> Signed-off-by: Kirill Korotaev <[email protected]>
>
>
> ------------------------------------------------------------------------
>
> ---
>
> fs/dcache.c | 9 +++++++
> fs/super.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/fs.h | 3 ++
> 3 files changed, 72 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/fs/dcache.c
> ===================================================================
> --- linux-2.6.orig/fs/dcache.c
> +++ linux-2.6/fs/dcache.c
> @@ -361,20 +361,29 @@ restart:
> * This requires that the LRU list has already been
> * removed.
> * Called with dcache_lock, drops it and then regains.
> + *
> + * Wakes up any pending waiters (see wait_on_prunes()) if the
> + * dentry's filesystem is being umounted.
> */
> static inline void prune_one_dentry(struct dentry * dentry)
> {
> + struct super_block *sb = dentry->d_sb;
> struct dentry * parent;
>
> __d_drop(dentry);
> list_del(&dentry->d_u.d_child);
> dentry_stat.nr_dentry--; /* For d_free, below */
> + sb->s_prunes++;
> dentry_iput(dentry);
> parent = dentry->d_parent;
> d_free(dentry);
> if (parent != dentry)
> dput(parent);
> spin_lock(&dcache_lock);
> + sb->s_prunes--;
> + /* FIXME: this might be avoided through checking for sb->s_root */
> + if (likely(!sb->s_prunes))
> + wake_up(&sb->s_wait_prunes);
> }
>
> /**
> Index: linux-2.6/fs/super.c
> ===================================================================
> --- linux-2.6.orig/fs/super.c
> +++ linux-2.6/fs/super.c
> @@ -80,6 +80,8 @@ static struct super_block *alloc_super(v
> sema_init(&s->s_dquot.dqio_sem, 1);
> sema_init(&s->s_dquot.dqonoff_sem, 1);
> init_rwsem(&s->s_dquot.dqptr_sem);
> + s->s_prunes = 0;
> + init_waitqueue_head(&s->s_wait_prunes);
> init_waitqueue_head(&s->s_wait_unfrozen);
> s->s_maxbytes = MAX_NON_LFS;
> s->dq_op = sb_dquot_ops;
> @@ -213,6 +215,51 @@ static int grab_super(struct super_block
> return 0;
> }
>
> +/*
> + * A special version of wait_event(!sb->s_prunes) which takes the dcache_lock
> + * when checking the condition and gives feedback if we have waited for
> + * remaining prunes.
> + *
> + * While we are waiting for pruned dentries to iput() their inode, the
> + * sb->s_prunes count is none zero. Since the s_prunes counter is modified
> + * by prune_one_dentry() under dcache_lock, either the reference count on the
> + * parent dentry is wrong (and therefore it isn't on the lru-list yet) and we
> + * are waiting because s_prunes != 0 or the reference count is correct (and the
> + * parent dentry might be found by select_parent()) and the s_prunes == 0. This
> + * is the reason for that we have to restart loop on select_parent() and
> + * prune_dcache() as long as we are waiting on pruned dentries here (see the
> + * usage of wait_on_prunes() in generic_shutdown_super()). This is not
> + * live-locking because we only call wait_on_prunes() if we are umounting the
> + * filesystem (sb->s_root == NULL), therefore noone is having a reference to
> + * the vfsmount structure anymore and hence shouldn't have a reference to a
> + * dentry.
> + */
> +static int wait_on_prunes(struct super_block *sb)
> +{
> + DEFINE_WAIT(wait);
> + int prunes_remaining = 0;
> +
> +#ifdef DCACHE_DEBUG
> + printk(KERN_DEBUG "%s: waiting for %d prunes\n", __FUNCTION__,
> + sb->s_prunes);
> +#endif
> +
> + spin_lock(&dcache_lock);
> + for (;;) {
> + prepare_to_wait(&sb->s_wait_prunes, &wait,
> + TASK_UNINTERRUPTIBLE);
> + if (!sb->s_prunes)
> + break;
> + spin_unlock(&dcache_lock);
> + schedule();
> + prunes_remaining = 1;
> + spin_lock(&dcache_lock);
> + }
> + spin_unlock(&dcache_lock);
> + finish_wait(&sb->s_wait_prunes, &wait);
> + return prunes_remaining;
> +}
> +
> /**
> * generic_shutdown_super - common helper for ->kill_sb()
> * @sb: superblock to kill
> @@ -230,8 +277,20 @@ void generic_shutdown_super(struct super
>
> if (root) {
> sb->s_root = NULL;
> - shrink_dcache_parent(root);
> shrink_dcache_anon(&sb->s_anon);
> +
> + /*
> + * If we have waited for a pruned dentry which parent wasn't
> + * unhashed yet, that parent will end up on the lru-list.
> + * Therefore, we need to run shrink_dcache_parent() again
> + * after we waited for pruned dentries in wait_on_prunes().
> + *
> + * NOTE: Also read the comment on wait_on_prunes() above.
> + */
> + do {
> + shrink_dcache_parent(root);
> + } while(wait_on_prunes(sb));
> +
> dput(root);
> fsync_super(sb);
> lock_super(sb);
> Index: linux-2.6/include/linux/fs.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fs.h
> +++ linux-2.6/include/linux/fs.h
> @@ -836,6 +836,9 @@ struct super_block {
> struct list_head s_instances;
> struct quota_info s_dquot; /* Diskquota specific options */
>
> + unsigned int s_prunes; /* protected by dcache_lock */
> + wait_queue_head_t s_wait_prunes;
> +
> int s_frozen;
> wait_queue_head_t s_wait_unfrozen;
>


2006-03-10 05:11:09

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] Fix shrink_dcache_parent() against shrink_dcache_memory() race (3rd updated patch)]

On Thursday March 9, [email protected] wrote:
> Andrew,
>
> Acked-By: Kirill Korotaev <[email protected]>

I'm afraid that I'm not convinced.

> > +static int wait_on_prunes(struct super_block *sb)
> > +{
> > + DEFINE_WAIT(wait);
> > + int prunes_remaining = 0;
> > +
> > +#ifdef DCACHE_DEBUG
> > + printk(KERN_DEBUG "%s: waiting for %d prunes\n", __FUNCTION__,
> > + sb->s_prunes);
> > +#endif
> > +
> > + spin_lock(&dcache_lock);
> > + for (;;) {
> > + prepare_to_wait(&sb->s_wait_prunes, &wait,
> > + TASK_UNINTERRUPTIBLE);
> > + if (!sb->s_prunes)
> > + break;
> > + spin_unlock(&dcache_lock);
> > + schedule();
> > + prunes_remaining = 1;
> > + spin_lock(&dcache_lock);
> > + }
> > + spin_unlock(&dcache_lock);
> > + finish_wait(&sb->s_wait_prunes, &wait);
> > + return prunes_remaining;
> > +}

I don't think that a return value from wait_on_prunes is meaningful.
All it tells us is whether a prune_one_dentry finished before or after
wait_on_prunes takes the spin_lock. This isn't very useful
information as it has no significance to upper levels.

So:

> > + do {
> > + shrink_dcache_parent(root);
> > + } while(wait_on_prunes(sb));
> > +

Suppose shrink_dcache_parent misses on dentry because the inode was being
iput. This iput completes immediately that
shrink_dcache_parent completes. It decrements ->s_prunes and when
wait_on_prunes takes dcache_lock, ->s_prunes is zero so the loop
terminates, and the remaining dentries - the parents of the dentry
what was undergoing iput - don't get put.

I really think that we need to stop prune_one_dentry from being called
on dentries for a filesystem that is being unmounted. With that code
currently in -git, that means passing a 'struct super_block *' into
prune_dcache so that it ignores any filesystem with ->s_root==NULL
unless that filesystem is the filesystem that was passed.

NeilBrown

2006-03-10 08:24:08

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [PATCH] Fix shrink_dcache_parent() against shrink_dcache_memory() race (3rd updated patch)]

Neil,

> On Thursday March 9, [email protected] wrote:
>
>>Andrew,
>>
>>Acked-By: Kirill Korotaev <[email protected]>
>
>
> I'm afraid that I'm not convinced.
>
>
>>>+static int wait_on_prunes(struct super_block *sb)
>>>+{
>>>+ DEFINE_WAIT(wait);
>>>+ int prunes_remaining = 0;
>>>+
>>>+#ifdef DCACHE_DEBUG
>>>+ printk(KERN_DEBUG "%s: waiting for %d prunes\n", __FUNCTION__,
>>>+ sb->s_prunes);
>>>+#endif
>>>+
>>>+ spin_lock(&dcache_lock);
>>>+ for (;;) {
>>>+ prepare_to_wait(&sb->s_wait_prunes, &wait,
>>>+ TASK_UNINTERRUPTIBLE);
>>>+ if (!sb->s_prunes)
>>>+ break;
>>>+ spin_unlock(&dcache_lock);
>>>+ schedule();
>>>+ prunes_remaining = 1;
>>>+ spin_lock(&dcache_lock);
>>>+ }
>>>+ spin_unlock(&dcache_lock);
>>>+ finish_wait(&sb->s_wait_prunes, &wait);
>>>+ return prunes_remaining;
>>>+}
>
>
> I don't think that a return value from wait_on_prunes is meaningful.
> All it tells us is whether a prune_one_dentry finished before or after
> wait_on_prunes takes the spin_lock. This isn't very useful
> information as it has no significance to upper levels.
>
> So:
>
>
>>>+ do {
>>>+ shrink_dcache_parent(root);
>>>+ } while(wait_on_prunes(sb));
>>>+
>
>
> Suppose shrink_dcache_parent misses on dentry because the inode was being
> iput. This iput completes immediately that
> shrink_dcache_parent completes. It decrements ->s_prunes and when
> wait_on_prunes takes dcache_lock, ->s_prunes is zero so the loop
> terminates, and the remaining dentries - the parents of the dentry
> what was undergoing iput - don't get put.
you are right... :/
And this is actually why we originally inserted check for race
in select_parent() under dcache_lock... I just lost my memory :(

> I really think that we need to stop prune_one_dentry from being called
> on dentries for a filesystem that is being unmounted. With that code
> currently in -git, that means passing a 'struct super_block *' into
> prune_dcache so that it ignores any filesystem with ->s_root==NULL
> unless that filesystem is the filesystem that was passed.
Can try...

Thanks,
Kirill

2006-03-10 10:59:54

by Jan Blunck

[permalink] [raw]
Subject: Re: [PATCH] Fix shrink_dcache_parent() against shrink_dcache_memory() race (3rd updated patch)]

On Fri, Mar 10, Kirill Korotaev wrote:

> >I really think that we need to stop prune_one_dentry from being called
> >on dentries for a filesystem that is being unmounted. With that code
> >currently in -git, that means passing a 'struct super_block *' into
> >prune_dcache so that it ignores any filesystem with ->s_root==NULL
> >unless that filesystem is the filesystem that was passed.
> Can try...
>

Can not ... because of down_read(s_umount) before checking s_root :(

So what do we do now?

1. always get the reference counting right outside of dcache_lock

2. hack around with different paths for prune_dcache() when called from
shrink_dcache_memory() and shrink_dcache_parent()

I think that we should go for the first.

Regards,
Jan

--
Jan Blunck [email protected]
SuSE LINUX AG - A Novell company
Maxfeldstr. 5 +49-911-74053-608
D-90409 N?rnberg http://www.suse.de

2006-03-10 11:18:33

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [PATCH] Fix shrink_dcache_parent() against shrink_dcache_memory() race (3rd updated patch)]

>>>I really think that we need to stop prune_one_dentry from being called
>>>on dentries for a filesystem that is being unmounted. With that code
>>>currently in -git, that means passing a 'struct super_block *' into
>>>prune_dcache so that it ignores any filesystem with ->s_root==NULL
>>>unless that filesystem is the filesystem that was passed.
>>
>>Can try...
>
> Can not ... because of down_read(s_umount) before checking s_root :(

> So what do we do now?
>
> 1. always get the reference counting right outside of dcache_lock
>
> 2. hack around with different paths for prune_dcache() when called from
> shrink_dcache_memory() and shrink_dcache_parent()
3. keep the existing patch from me :))))

> I think that we should go for the first.
just an idea which came to my mind:
can't we fix it the following way:
1. fix select_parent() when called from generic_shutdown_super() to loop
until _all_ dentries are shrinked (not only those, with d_count = 1);
this guarentees that no dentries are left.
2. no dentries are left, but iput() can be in progress.
So can't we simply make invalidate_inodes() to be in a loop with
schedule() until no busy inodes are left?!

unregister_netdevice() for example, loops until netdev counter drops to
zero. Why can't we do it the same way? Any objections?

Thanks,
Kirill

2006-03-10 12:01:57

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] Fix shrink_dcache_parent() against shrink_dcache_memory() race (3rd updated patch)]

On Friday March 10, [email protected] wrote:
> On Fri, Mar 10, Kirill Korotaev wrote:
>
> > >I really think that we need to stop prune_one_dentry from being called
> > >on dentries for a filesystem that is being unmounted. With that code
> > >currently in -git, that means passing a 'struct super_block *' into
> > >prune_dcache so that it ignores any filesystem with ->s_root==NULL
> > >unless that filesystem is the filesystem that was passed.
> > Can try...
> >
>
> Can not ... because of down_read(s_umount) before checking s_root :(

Sorry, I don't follow your logic. Could you elaborate a bit please?

>
> So what do we do now?
>
> 1. always get the reference counting right outside of dcache_lock

This is possible, but I think it is a very intrusive patch.

>
> 2. hack around with different paths for prune_dcache() when called from
> shrink_dcache_memory() and shrink_dcache_parent()

I don't think the paths are very different.

The following patch is against 2.6.16-rc5-git14, is based on yours,
and avoids calling prune_one_dentry at inconvenient times.

Differences:
- use waitqueue_active to decide whether to call wake_up
- pass a 'struct super_block *' to prune_dcache when appropriate,
and do not try to prune dentries for any other filesystem
which has ->s_root == NULL
- pass sb to shink_dcache_anon instead of &sb->s_anon.
- Don't return a value from wait_on_prunes
- Call wait_on_prunes *before* shrink_dcache_{anon,parent}

I am fairly sure that this will do the right thing.

NeilBrown



Signed-off-by: Neil Brown <[email protected]>

### Diffstat output
./fs/dcache.c | 25 ++++++++++++++++++-------
./fs/super.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++-
./include/linux/dcache.h | 2 +-
./include/linux/fs.h | 3 +++
4 files changed, 68 insertions(+), 9 deletions(-)

diff ./fs/dcache.c~current~ ./fs/dcache.c
--- ./fs/dcache.c~current~ 2006-03-10 22:24:03.000000000 +1100
+++ ./fs/dcache.c 2006-03-10 22:39:07.000000000 +1100
@@ -361,20 +361,28 @@ restart:
* This requires that the LRU list has already been
* removed.
* Called with dcache_lock, drops it and then regains.
+ *
+ * Wakes up any pending waiters (see wait_on_prunes()) if the
+ * dentry's filesystem is being umounted.
*/
static inline void prune_one_dentry(struct dentry * dentry)
{
+ struct super_block *sb = dentry->d_sb;
struct dentry * parent;

__d_drop(dentry);
list_del(&dentry->d_u.d_child);
dentry_stat.nr_dentry--; /* For d_free, below */
+ sb->s_prunes++;
dentry_iput(dentry);
parent = dentry->d_parent;
d_free(dentry);
if (parent != dentry)
dput(parent);
spin_lock(&dcache_lock);
+ sb->s_prunes--;
+ if (waitqueue_active(&sb->s_wait_prunes))
+ wake_up(&sb->s_wait_prunes);
}

/**
@@ -390,7 +398,7 @@ static inline void prune_one_dentry(stru
* all the dentries are in use.
*/

-static void prune_dcache(int count)
+static void prune_dcache(int count, struct super_block *sb)
{
spin_lock(&dcache_lock);
for (; count ; count--) {
@@ -417,8 +425,10 @@ static void prune_dcache(int count)
spin_unlock(&dentry->d_lock);
continue;
}
- /* If the dentry was recently referenced, don't free it. */
- if (dentry->d_flags & DCACHE_REFERENCED) {
+ /* If the dentry was recently referenced, or is for
+ * a unmounting filesystem, don't free it. */
+ if ((dentry->d_flags & DCACHE_REFERENCED) ||
+ (dentry->d_sb != sb && dentry->d_sb->s_root == NULL)) {
dentry->d_flags &= ~DCACHE_REFERENCED;
list_add(&dentry->d_lru, &dentry_unused);
dentry_stat.nr_unused++;
@@ -635,7 +645,7 @@ void shrink_dcache_parent(struct dentry
int found;

while ((found = select_parent(parent)) != 0)
- prune_dcache(found);
+ prune_dcache(found, parent->d_sb);
}

/**
@@ -648,8 +658,9 @@ void shrink_dcache_parent(struct dentry
* done under dcache_lock.
*
*/
-void shrink_dcache_anon(struct hlist_head *head)
+void shrink_dcache_anon(struct super_block *sb)
{
+ struct hlist_head = &sb->s_anon;
struct hlist_node *lp;
int found;
do {
@@ -673,7 +684,7 @@ void shrink_dcache_anon(struct hlist_hea
}
}
spin_unlock(&dcache_lock);
- prune_dcache(found);
+ prune_dcache(found, sb);
} while(found);
}

@@ -694,7 +705,7 @@ static int shrink_dcache_memory(int nr,
if (nr) {
if (!(gfp_mask & __GFP_FS))
return -1;
- prune_dcache(nr);
+ prune_dcache(nr, NULL);
}
return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
}

diff ./fs/super.c~current~ ./fs/super.c
--- ./fs/super.c~current~ 2006-03-10 22:24:03.000000000 +1100
+++ ./fs/super.c 2006-03-10 22:37:14.000000000 +1100
@@ -80,6 +80,8 @@ static struct super_block *alloc_super(v
sema_init(&s->s_dquot.dqio_sem, 1);
sema_init(&s->s_dquot.dqonoff_sem, 1);
init_rwsem(&s->s_dquot.dqptr_sem);
+ s->s_prunes = 0;
+ init_waitqueue_head(&s->s_wait_prunes);
init_waitqueue_head(&s->s_wait_unfrozen);
s->s_maxbytes = MAX_NON_LFS;
s->dq_op = sb_dquot_ops;
@@ -213,6 +215,40 @@ static int grab_super(struct super_block
return 0;
}

+/*
+ * A special version of wait_event(!sb->s_prunes) which takes the dcache_lock
+ * when checking the condition.
+ *
+ * While we are waiting for pruned dentries to iput() their inode, the
+ * sb->s_prunes count is non-zero. Since the s_prunes counter is modified
+ * by prune_one_dentry() under dcache_lock, either the reference count on the
+ * parent dentry is wrong (and therefore it isn't on the lru-list yet) and we
+ * are waiting because s_prunes != 0 or the reference count is correct (and the
+ * parent dentry might be found by select_parent()) and the s_prunes == 0.
+ */
+static void wait_on_prunes(struct super_block *sb)
+{
+ DEFINE_WAIT(wait);
+
+#ifdef DCACHE_DEBUG
+ printk(KERN_DEBUG "%s: waiting for %d prunes\n", __FUNCTION__,
+ sb->s_prunes);
+#endif
+
+ spin_lock(&dcache_lock);
+ for (;;) {
+ prepare_to_wait(&sb->s_wait_prunes, &wait,
+ TASK_UNINTERRUPTIBLE);
+ if (!sb->s_prunes)
+ break;
+ spin_unlock(&dcache_lock);
+ schedule();
+ spin_lock(&dcache_lock);
+ }
+ spin_unlock(&dcache_lock);
+ finish_wait(&sb->s_wait_prunes, &wait);
+}
+
/**
* generic_shutdown_super - common helper for ->kill_sb()
* @sb: superblock to kill
@@ -230,8 +266,17 @@ void generic_shutdown_super(struct super

if (root) {
sb->s_root = NULL;
+
+ wait_on_prunes(sb);
+ /* At this point no dentries in this filesystem will be
+ * in the process of being pruned by shrink_dcache_memory
+ * (or anyone else), so no dentries should have external
+ * references, so shrink_dcache_anon and
+ * shrink_dcache_parent should find and free them all
+ */
+ shrink_dcache_anon(sb);
shrink_dcache_parent(root);
- shrink_dcache_anon(&sb->s_anon);
+
dput(root);
fsync_super(sb);
lock_super(sb);

diff ./include/linux/dcache.h~current~ ./include/linux/dcache.h
--- ./include/linux/dcache.h~current~ 2006-03-10 22:37:37.000000000 +1100
+++ ./include/linux/dcache.h 2006-03-10 22:37:55.000000000 +1100
@@ -215,7 +215,7 @@ extern struct dentry * d_alloc_anon(stru
extern struct dentry * d_splice_alias(struct inode *, struct dentry *);
extern void shrink_dcache_sb(struct super_block *);
extern void shrink_dcache_parent(struct dentry *);
-extern void shrink_dcache_anon(struct hlist_head *);
+extern void shrink_dcache_anon(struct super_block *);
extern int d_invalidate(struct dentry *);

/* only used at mount-time */

diff ./include/linux/fs.h~current~ ./include/linux/fs.h
--- ./include/linux/fs.h~current~ 2006-03-10 22:24:03.000000000 +1100
+++ ./include/linux/fs.h 2006-03-10 22:24:03.000000000 +1100
@@ -836,6 +836,9 @@ struct super_block {
struct list_head s_instances;
struct quota_info s_dquot; /* Diskquota specific options */

+ unsigned int s_prunes; /* protected by dcache_lock */
+ wait_queue_head_t s_wait_prunes;
+
int s_frozen;
wait_queue_head_t s_wait_unfrozen;

2006-03-10 12:02:01

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] Fix shrink_dcache_parent() against shrink_dcache_memory() race (3rd updated patch)]

On Friday March 10, [email protected] wrote:
> just an idea which came to my mind:
> can't we fix it the following way:
> 1. fix select_parent() when called from generic_shutdown_super() to loop
> until _all_ dentries are shrinked (not only those, with d_count = 1);
> this guarentees that no dentries are left.

You couldn't just busy-wait. You would need to have some wait_queue
to wait on, and someone to wake it up.


> 2. no dentries are left, but iput() can be in progress.
> So can't we simply make invalidate_inodes() to be in a loop with
> schedule() until no busy inodes are left?!

Again, you need to be more predictable than just calling schedule().

>
> unregister_netdevice() for example, loops until netdev counter drops to
> zero. Why can't we do it the same way? Any objections?

I'd really like to understand why the code was the way it is before
making that sort of change. It could just be that there was nothing
obvious to wait on, so maybe just adding a suitable wait_queue would
do it...
But then the current patch essentially adds a wait_queue and waits on
it where appropriate. So would you approach be that much cleaner?
Maybe...

NeilBrown

2006-03-10 12:02:35

by Jan Blunck

[permalink] [raw]
Subject: Re: [PATCH] Fix shrink_dcache_parent() against shrink_dcache_memory() race (3rd updated patch)]

On Fri, Mar 10, Kirill Korotaev wrote:

> >I think that we should go for the first.
> just an idea which came to my mind:
> can't we fix it the following way:
> 1. fix select_parent() when called from generic_shutdown_super() to loop
> until _all_ dentries are shrinked (not only those, with d_count = 1);
> this guarentees that no dentries are left.
> 2. no dentries are left, but iput() can be in progress.
> So can't we simply make invalidate_inodes() to be in a loop with
> schedule() until no busy inodes are left?!

But this hides the places where dput() is called after mntput() and locks up
when somebody forgets to call dput() at all. And I think we should printk() in
that situations instead of waiting for them.

Regards,
Jan

--
Jan Blunck [email protected]
SuSE LINUX AG - A Novell company
Maxfeldstr. 5 +49-911-74053-608
D-90409 N?rnberg http://www.suse.de

2006-03-10 12:15:52

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [PATCH] Fix shrink_dcache_parent() against shrink_dcache_memory() race (3rd updated patch)]

>>>I think that we should go for the first.
>>
>>just an idea which came to my mind:
>>can't we fix it the following way:
>>1. fix select_parent() when called from generic_shutdown_super() to loop
>>until _all_ dentries are shrinked (not only those, with d_count = 1);
>>this guarentees that no dentries are left.
>>2. no dentries are left, but iput() can be in progress.
>>So can't we simply make invalidate_inodes() to be in a loop with
>>schedule() until no busy inodes are left?!
>
>
> But this hides the places where dput() is called after mntput() and locks up
> when somebody forgets to call dput() at all. And I think we should printk() in
> that situations instead of waiting for them.
I think both: printk and wait.
From my own personal experience with it: it is better to wait in a loop
and be able to collect information, then to crash.

Thanks,
Kirill

2006-03-10 12:31:56

by Jan Blunck

[permalink] [raw]
Subject: Re: [PATCH] Fix shrink_dcache_parent() against shrink_dcache_memory() race (3rd updated patch)]

On Fri, Mar 10, Neil Brown wrote:

> -static void prune_dcache(int count)
> +static void prune_dcache(int count, struct super_block *sb)
> {
> spin_lock(&dcache_lock);
> for (; count ; count--) {
> @@ -417,8 +425,10 @@ static void prune_dcache(int count)
> spin_unlock(&dentry->d_lock);
> continue;
> }
> - /* If the dentry was recently referenced, don't free it. */
> - if (dentry->d_flags & DCACHE_REFERENCED) {
> + /* If the dentry was recently referenced, or is for
> + * a unmounting filesystem, don't free it. */
> + if ((dentry->d_flags & DCACHE_REFERENCED) ||
> + (dentry->d_sb != sb && dentry->d_sb->s_root == NULL)) {
> dentry->d_flags &= ~DCACHE_REFERENCED;
> list_add(&dentry->d_lru, &dentry_unused);
> dentry_stat.nr_unused++;

You have to down_read the rw-semaphore sb->s_umount since sb->s_root is
protected by it :(

Regards,
Jan

--
Jan Blunck [email protected]
SuSE LINUX AG - A Novell company
Maxfeldstr. 5 +49-911-74053-608
D-90409 N?rnberg http://www.suse.de

2006-03-10 17:17:54

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Fix shrink_dcache_parent() against shrink_dcache_memory() race (3rd updated patch)]

On Fri, Mar 10, 2006 at 01:31:53PM +0100, Jan Blunck wrote:
> On Fri, Mar 10, Neil Brown wrote:
>
> > -static void prune_dcache(int count)
> > +static void prune_dcache(int count, struct super_block *sb)
> > {
> > spin_lock(&dcache_lock);
> > for (; count ; count--) {
> > @@ -417,8 +425,10 @@ static void prune_dcache(int count)
> > spin_unlock(&dentry->d_lock);
> > continue;
> > }
> > - /* If the dentry was recently referenced, don't free it. */
> > - if (dentry->d_flags & DCACHE_REFERENCED) {
> > + /* If the dentry was recently referenced, or is for
> > + * a unmounting filesystem, don't free it. */
> > + if ((dentry->d_flags & DCACHE_REFERENCED) ||
> > + (dentry->d_sb != sb && dentry->d_sb->s_root == NULL)) {
> > dentry->d_flags &= ~DCACHE_REFERENCED;
> > list_add(&dentry->d_lru, &dentry_unused);
> > dentry_stat.nr_unused++;
>
> You have to down_read the rw-semaphore sb->s_umount since sb->s_root is
> protected by it :(

<snip>

Please do not beat me up for suggesting this. I was wondering if it
makes sense to add a PF_SHRINKER flag and set it in shrink_slab().
Use my solution, instead of PF_MEMALLOC check against PF_SHRINKER.

I think PF_SHRINKER might be a good idea, it might help us detect
races between the shrinker and other subsystems too - not only dcache.

It might be well worth adding in. If there is sufficient interest I can
send create and send out a patch.

Balbir

2006-03-12 21:58:36

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] Fix shrink_dcache_parent() against shrink_dcache_memory() race (3rd updated patch)]

On Friday March 10, [email protected] wrote:
> On Fri, Mar 10, Neil Brown wrote:
>
> > -static void prune_dcache(int count)
> > +static void prune_dcache(int count, struct super_block *sb)
> > {
> > spin_lock(&dcache_lock);
> > for (; count ; count--) {
> > @@ -417,8 +425,10 @@ static void prune_dcache(int count)
> > spin_unlock(&dentry->d_lock);
> > continue;
> > }
> > - /* If the dentry was recently referenced, don't free it. */
> > - if (dentry->d_flags & DCACHE_REFERENCED) {
> > + /* If the dentry was recently referenced, or is for
> > + * a unmounting filesystem, don't free it. */
> > + if ((dentry->d_flags & DCACHE_REFERENCED) ||
> > + (dentry->d_sb != sb && dentry->d_sb->s_root == NULL)) {
> > dentry->d_flags &= ~DCACHE_REFERENCED;
> > list_add(&dentry->d_lru, &dentry_unused);
> > dentry_stat.nr_unused++;
>
> You have to down_read the rw-semaphore sb->s_umount since sb->s_root is
> protected by it :(

No you don't.
sb->s_root is set precisely twice for any filesystem.
Once when the filesystem is mounted, typically in the fill_super
function, and once in generic_shutdown_super where it is set to NULL.

There is no need to lock against the first change as the sb is not
globally visible until after s_root is set. So for the present
purpose we only need to worry about the second change.

For this, the current usage of dcache_lock is enough to remove any
possible race. generic_shutdown_super sets s_root to NULL and then
takes dcache_lock (via wait_for_prunes) before it cares if anyone has
noticed the NULL or not. prune_dcache holds dcache_lock while testing
for NULL. So there is no room for a race.

s_umount is sometimes taken before testing s_root. This is always
because the sb has just been found on the list of all superblocks, and
so the thread doesn't hold a proper reference to it. In our case
we are holding a dentry which in turn holds a real reference to the
superblock.

So I stand by my patch - s_root can be safely tested here.

NeilBrown

2006-03-12 23:03:33

by Jan Blunck

[permalink] [raw]
Subject: Re: [PATCH] Fix shrink_dcache_parent() against shrink_dcache_memory() race (3rd updated patch)]

On Mon, Mar 13, Neil Brown wrote:

> No you don't.

Err, you are right. But this brings a new idea to my mind: why don't we use
s_umount to prevent umounting while we prune one dentry? Something like:

if (down_read_trylock()) {
if (s_root)
prune_one_dentry()
up_read()
}
// else just skip it

So maybe our prunes counter isn't the only way to go. Comments?

Regards,
Jan

--
Jan Blunck [email protected]
SuSE LINUX AG - A Novell company
Maxfeldstr. 5 +49-911-74053-608
D-90409 N?rnberg http://www.suse.de

2006-03-13 05:46:47

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] Fix shrink_dcache_parent() against shrink_dcache_memory() race (3rd updated patch)]

On Monday March 13, [email protected] wrote:
> On Mon, Mar 13, Neil Brown wrote:
>
> > No you don't.
>
> Err, you are right. But this brings a new idea to my mind: why don't we use
> s_umount to prevent umounting while we prune one dentry? Something like:
>
> if (down_read_trylock()) {
> if (s_root)
> prune_one_dentry()
> up_read()
> }
> // else just skip it
>
> So maybe our prunes counter isn't the only way to go. Comments?

Hmmm.... yes, I think that could work. Patch below against
2.6.16-rc6-mm1.

I cannot see down_read_trylock being slower than dcache_lock, so I
suspect this shouldn't impact performance.

Does this meet with everyone's approval?

NeilBrown




Signed-off-by: Neil Brown <[email protected]>

### Diffstat output
./fs/dcache.c | 40 ++++++++++++++++++++++++++++------------
1 file changed, 28 insertions(+), 12 deletions(-)

diff ./fs/dcache.c~current~ ./fs/dcache.c
--- ./fs/dcache.c~current~ 2006-03-13 16:19:29.000000000 +1100
+++ ./fs/dcache.c 2006-03-13 16:43:59.000000000 +1100
@@ -383,6 +383,8 @@ static inline void prune_one_dentry(stru
/**
* prune_dcache - shrink the dcache
* @count: number of entries to try and free
+ * @sb: if given, ignore dentries for other superblocks
+ * which are being unmounted.
*
* Shrink the dcache. This is done when we need
* more memory, or simply when we need to unmount
@@ -393,7 +395,7 @@ static inline void prune_one_dentry(stru
* all the dentries are in use.
*/

-static void prune_dcache(int count)
+static void prune_dcache(int count, struct super_block *sb)
{
spin_lock(&dcache_lock);
for (; count ; count--) {
@@ -420,15 +422,27 @@ static void prune_dcache(int count)
spin_unlock(&dentry->d_lock);
continue;
}
- /* If the dentry was recently referenced, don't free it. */
- if (dentry->d_flags & DCACHE_REFERENCED) {
- dentry->d_flags &= ~DCACHE_REFERENCED;
- list_add(&dentry->d_lru, &dentry_unused);
- dentry_stat.nr_unused++;
- spin_unlock(&dentry->d_lock);
- continue;
+ if (!(dentry->d_flags & DCACHE_REFERENCED) &&
+ (!sb || dentry->d_sb == sb)) {
+ if (sb) {
+ prune_one_dentry(dentry);
+ continue;
+ }
+ /* Need to avoid race with generic_shutdown_super */
+ if (down_read_trylock(&dentry->d_sb->s_umount) &&
+ dentry->d_sb->s_root != NULL) {
+ prune_one_dentry(dentry);
+ up_read(&dentry->d_sb->s_umount);
+ continue;
+ }
}
- prune_one_dentry(dentry);
+ /* The dentry was recently referenced, or the filesystem
+ * is being unmounted, so don't free it. */
+
+ dentry->d_flags &= ~DCACHE_REFERENCED;
+ list_add(&dentry->d_lru, &dentry_unused);
+ dentry_stat.nr_unused++;
+ spin_unlock(&dentry->d_lock);
}
spin_unlock(&dcache_lock);
}
@@ -639,9 +653,10 @@ void shrink_dcache_parent(struct dentry
int found;

while ((found = select_parent(parent)) != 0)
- prune_dcache(found);
+ prune_dcache(found, parent->d_sb);
}

+#if 0
/**
* shrink_dcache_anon - further prune the cache
* @head: head of d_hash list of dentries to prune
@@ -677,9 +692,10 @@ void shrink_dcache_anon(struct hlist_hea
}
}
spin_unlock(&dcache_lock);
- prune_dcache(found);
+ prune_dcache(found, NULL);
} while(found);
}
+#endif

/*
* Scan `nr' dentries and return the number which remain.
@@ -698,7 +714,7 @@ static int shrink_dcache_memory(int nr,
if (nr) {
if (!(gfp_mask & __GFP_FS))
return -1;
- prune_dcache(nr);
+ prune_dcache(nr, NULL);
}
return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
}

2006-03-13 15:52:26

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] Fix shrink_dcache_parent() against shrink_dcache_memory() race (3rd updated patch)]

> Hmmm.... yes, I think that could work. Patch below against
> 2.6.16-rc6-mm1.
>
> I cannot see down_read_trylock being slower than dcache_lock, so I
> suspect this shouldn't impact performance.
>
> Does this meet with everyone's approval?
>

I like the umount logic too, it is very similar to my solution of
down_read_trylock() on the umount semaphore, except that it does not
have the PF_XXXXXXX magic involved.

<snip>

> Signed-off-by: Neil Brown <[email protected]>
>
> ### Diffstat output
> ./fs/dcache.c | 40 ++++++++++++++++++++++++++++------------
> 1 file changed, 28 insertions(+), 12 deletions(-)
>
> diff ./fs/dcache.c~current~ ./fs/dcache.c
> --- ./fs/dcache.c~current~ 2006-03-13 16:19:29.000000000 +1100
> +++ ./fs/dcache.c 2006-03-13 16:43:59.000000000 +1100
> @@ -383,6 +383,8 @@ static inline void prune_one_dentry(stru
> /**
> * prune_dcache - shrink the dcache
> * @count: number of entries to try and free
> + * @sb: if given, ignore dentries for other superblocks
> + * which are being unmounted.
> *
> * Shrink the dcache. This is done when we need
> * more memory, or simply when we need to unmount
> @@ -393,7 +395,7 @@ static inline void prune_one_dentry(stru
> * all the dentries are in use.
> */
>
> -static void prune_dcache(int count)
> +static void prune_dcache(int count, struct super_block *sb)
> {
> spin_lock(&dcache_lock);
> for (; count ; count--) {
> @@ -420,15 +422,27 @@ static void prune_dcache(int count)
> spin_unlock(&dentry->d_lock);
> continue;
> }
> - /* If the dentry was recently referenced, don't free it. */
> - if (dentry->d_flags & DCACHE_REFERENCED) {
> - dentry->d_flags &= ~DCACHE_REFERENCED;
> - list_add(&dentry->d_lru, &dentry_unused);
> - dentry_stat.nr_unused++;
> - spin_unlock(&dentry->d_lock);
> - continue;
> + if (!(dentry->d_flags & DCACHE_REFERENCED) &&
> + (!sb || dentry->d_sb == sb)) {

Comments for the if condition please! It is a bit complex and I had to
read it a few times to understand what the condition achieves

> + if (sb) {
> + prune_one_dentry(dentry);
> + continue;
> + }
> + /* Need to avoid race with generic_shutdown_super */
> + if (down_read_trylock(&dentry->d_sb->s_umount) &&
> + dentry->d_sb->s_root != NULL) {
> + prune_one_dentry(dentry);
> + up_read(&dentry->d_sb->s_umount);
> + continue;
> + }

There is a BUG here. What if down_ready_trylock() succeeds and
dentry->d_sb->s_root == NULL? We will be missing an up_read().

<snip>

Balbir