2008-03-25 18:12:37

by Jan Kara

Subject: [PATCH] vfs: Fix lock inversion in drop_pagecache_sb()

Fix longstanding lock inversion in drop_pagecache_sb by dropping inode_lock
before calling __invalidate_mapping_pages(). We just have to make sure
inode won't go away from under us by keeping reference to it and putting
the reference only after we have safely resumed the scan of the inode
list. A bit tricky but not too bad...

Signed-off-by: Jan Kara <[email protected]>
CC: Fengguang Wu <[email protected]>
CC: David Chinner <[email protected]>

---
fs/drop_caches.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 59375ef..f5aae26 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -14,15 +14,21 @@ int sysctl_drop_caches;

 static void drop_pagecache_sb(struct super_block *sb)
 {
-	struct inode *inode;
+	struct inode *inode, *toput_inode = NULL;

 	spin_lock(&inode_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (inode->i_state & (I_FREEING|I_WILL_FREE))
 			continue;
+		__iget(inode);
+		spin_unlock(&inode_lock);
 		__invalidate_mapping_pages(inode->i_mapping, 0, -1, true);
+		iput(toput_inode);
+		toput_inode = inode;
+		spin_lock(&inode_lock);
 	}
 	spin_unlock(&inode_lock);
+	iput(toput_inode);
 }

 void drop_pagecache(void)
--
1.5.2.4
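
For readers who want to try the reference hand-off outside the kernel, here is a minimal userspace sketch of the same idea. It is an illustration only, not the kernel code: struct node, list_lock, node_get_locked(), node_put(), slow_work(), walk_list() and demo() are all hypothetical names, a pthread mutex stands in for inode_lock, and a plain counter stands in for the inode reference count. The point it tries to show is that the previous node is released only after the walker has pinned the current one, so the list cursor always rests on a node that cannot go away.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these names are kernel APIs. */
struct node {
	struct node *next;
	int refcount;	/* protected by list_lock */
	int pages;	/* stand-in for cached pages */
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Take a reference; caller must hold list_lock (role of __iget()). */
static void node_get_locked(struct node *n)
{
	n->refcount++;
}

/* Drop a reference; takes list_lock itself, so call it unlocked
 * (role of iput()). NULL is accepted and ignored. */
static void node_put(struct node *n)
{
	if (!n)
		return;
	pthread_mutex_lock(&list_lock);
	n->refcount--;
	pthread_mutex_unlock(&list_lock);
}

/* Work that must not run under list_lock (role of
 * __invalidate_mapping_pages()). Returns the number of pages dropped. */
static int slow_work(struct node *n)
{
	int freed = n->pages;

	n->pages = 0;
	return freed;
}

static int walk_list(struct node *head)
{
	struct node *n, *toput = NULL;
	int total = 0;

	pthread_mutex_lock(&list_lock);
	for (n = head; n; n = n->next) {
		node_get_locked(n);		/* pin the current node */
		pthread_mutex_unlock(&list_lock);
		total += slow_work(n);		/* runs without the lock */
		node_put(toput);		/* previous node no longer needed */
		toput = n;
		pthread_mutex_lock(&list_lock);	/* n is still pinned here */
	}
	pthread_mutex_unlock(&list_lock);
	node_put(toput);			/* release the last pinned node */
	return total;
}

/* Single-threaded demo: returns the pages dropped, or -1 if a
 * temporary reference leaked. */
static int demo(void)
{
	struct node c = { NULL, 1, 3 };
	struct node b = { &c, 1, 0 };
	struct node a = { &b, 1, 5 };
	int freed = walk_list(&a);

	if (a.refcount != 1 || b.refcount != 1 || c.refcount != 1)
		return -1;
	return freed;
}
```

The sketch deliberately puts the previous node only after the current one is pinned, mirroring the order of iput(toput_inode) in the patch. It does not model removal of nodes from the list or freeing at refcount zero, so it says nothing about the invalidate_list() race.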


2008-03-25 19:54:43

by Andrew Morton

Subject: Re: [PATCH] vfs: Fix lock inversion in drop_pagecache_sb()

On Tue, 25 Mar 2008 19:12:27 +0100
Jan Kara <[email protected]> wrote:

> Fix longstanding lock inversion in drop_pagecache_sb by dropping inode_lock
> before calling __invalidate_mapping_pages(). We just have to make sure
> inode won't go away from under us by keeping reference to it and putting
> the reference only after we have safely resumed the scan of the inode
> list. A bit tricky but not too bad...
>
> Signed-off-by: Jan Kara <[email protected]>
> CC: Fengguang Wu <[email protected]>
> CC: David Chinner <[email protected]>
>
> ---
> fs/drop_caches.c | 8 +++++++-
> 1 files changed, 7 insertions(+), 1 deletions(-)
>
> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> index 59375ef..f5aae26 100644
> --- a/fs/drop_caches.c
> +++ b/fs/drop_caches.c
> @@ -14,15 +14,21 @@ int sysctl_drop_caches;
>
> static void drop_pagecache_sb(struct super_block *sb)
> {
> - struct inode *inode;
> + struct inode *inode, *toput_inode = NULL;
>
> spin_lock(&inode_lock);
> list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> if (inode->i_state & (I_FREEING|I_WILL_FREE))
> continue;

OT: it might be worth having an `if (mapping->nrpages==0) continue' here.

> + __iget(inode);
> + spin_unlock(&inode_lock);
> __invalidate_mapping_pages(inode->i_mapping, 0, -1, true);
> + iput(toput_inode);
> + toput_inode = inode;
> + spin_lock(&inode_lock);
> }
> spin_unlock(&inode_lock);
> + iput(toput_inode);
> }
>
> void drop_pagecache(void)

hrm. So we have a random ref on an inode without holding inode_lock. If
we race with invalidate_list() we end up with an inode stuck on s_inodes
and "Self-destruct in 5 seconds. Have a nice day...", don't we?

2008-03-25 22:02:37

by Trond Myklebust

Subject: Re: [PATCH] vfs: Fix lock inversion in drop_pagecache_sb()


On Tue, 2008-03-25 at 12:53 -0700, Andrew Morton wrote:
> On Tue, 25 Mar 2008 19:12:27 +0100
> Jan Kara <[email protected]> wrote:
>
> > Fix longstanding lock inversion in drop_pagecache_sb by dropping inode_lock
> > before calling __invalidate_mapping_pages(). We just have to make sure
> > inode won't go away from under us by keeping reference to it and putting
> > the reference only after we have safely resumed the scan of the inode
> > list. A bit tricky but not too bad...
> >
> > Signed-off-by: Jan Kara <[email protected]>
> > CC: Fengguang Wu <[email protected]>
> > CC: David Chinner <[email protected]>
> >
> > ---
> > fs/drop_caches.c | 8 +++++++-
> > 1 files changed, 7 insertions(+), 1 deletions(-)
> >
> > diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> > index 59375ef..f5aae26 100644
> > --- a/fs/drop_caches.c
> > +++ b/fs/drop_caches.c
> > @@ -14,15 +14,21 @@ int sysctl_drop_caches;
> >
> > static void drop_pagecache_sb(struct super_block *sb)
> > {
> > - struct inode *inode;
> > + struct inode *inode, *toput_inode = NULL;
> >
> > spin_lock(&inode_lock);
> > list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > if (inode->i_state & (I_FREEING|I_WILL_FREE))
> > continue;
>
> OT: it might be worth having an `if (mapping->nrpages==0) continue' here.
>
> > + __iget(inode);
> > + spin_unlock(&inode_lock);
> > __invalidate_mapping_pages(inode->i_mapping, 0, -1, true);
> > + iput(toput_inode);
> > + toput_inode = inode;
> > + spin_lock(&inode_lock);
> > }
> > spin_unlock(&inode_lock);
> > + iput(toput_inode);
> > }
> >
> > void drop_pagecache(void)
>
> hrm. So we have a random ref on an inode without holding inode_lock. If
> we race with invalidate_list() we end up with an inode stuck on s_inodes
> and "Self-destruct in 5 seconds. Have a nice day...", don't we?

Calling drop_pagecache_sb() without having a reference to 'sb'? Surely
not...

Trond

2008-03-26 00:44:49

by Wu Fengguang

Subject: Re: [PATCH] vfs: Fix lock inversion in drop_pagecache_sb()

On Tue, Mar 25, 2008 at 07:12:27PM +0100, Jan Kara wrote:
> Fix longstanding lock inversion in drop_pagecache_sb by dropping inode_lock
> before calling __invalidate_mapping_pages(). We just have to make sure
> inode won't go away from under us by keeping reference to it and putting
> the reference only after we have safely resumed the scan of the inode
> list. A bit tricky but not too bad...

Reviewed-by: Fengguang Wu <[email protected]>

It's a handy trick for iterating through the list_head :-)
I have used it in my filecache code, and it works nicely.

Fengguang

> Signed-off-by: Jan Kara <[email protected]>
> CC: Fengguang Wu <[email protected]>
> CC: David Chinner <[email protected]>
>
> ---
> fs/drop_caches.c | 8 +++++++-
> 1 files changed, 7 insertions(+), 1 deletions(-)
>
> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> index 59375ef..f5aae26 100644
> --- a/fs/drop_caches.c
> +++ b/fs/drop_caches.c
> @@ -14,15 +14,21 @@ int sysctl_drop_caches;
>
> static void drop_pagecache_sb(struct super_block *sb)
> {
> - struct inode *inode;
> + struct inode *inode, *toput_inode = NULL;
>
> spin_lock(&inode_lock);
> list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> if (inode->i_state & (I_FREEING|I_WILL_FREE))
> continue;
> + __iget(inode);
> + spin_unlock(&inode_lock);
> __invalidate_mapping_pages(inode->i_mapping, 0, -1, true);
> + iput(toput_inode);
> + toput_inode = inode;
> + spin_lock(&inode_lock);
> }
> spin_unlock(&inode_lock);
> + iput(toput_inode);
> }
>
> void drop_pagecache(void)
> --
> 1.5.2.4
>

2008-03-26 01:28:34

by Wu Fengguang

Subject: Re: [PATCH] vfs: Fix lock inversion in drop_pagecache_sb()

On Tue, Mar 25, 2008 at 12:53:54PM -0700, Andrew Morton wrote:
> On Tue, 25 Mar 2008 19:12:27 +0100
> Jan Kara <[email protected]> wrote:
>
> > Fix longstanding lock inversion in drop_pagecache_sb by dropping inode_lock
> > before calling __invalidate_mapping_pages(). We just have to make sure
> > inode won't go away from under us by keeping reference to it and putting
> > the reference only after we have safely resumed the scan of the inode
> > list. A bit tricky but not too bad...
> >
> > Signed-off-by: Jan Kara <[email protected]>
> > CC: Fengguang Wu <[email protected]>
> > CC: David Chinner <[email protected]>
> >
> > ---
> > fs/drop_caches.c | 8 +++++++-
> > 1 files changed, 7 insertions(+), 1 deletions(-)
> >
> > diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> > index 59375ef..f5aae26 100644
> > --- a/fs/drop_caches.c
> > +++ b/fs/drop_caches.c
> > @@ -14,15 +14,21 @@ int sysctl_drop_caches;
> >
> > static void drop_pagecache_sb(struct super_block *sb)
> > {
> > - struct inode *inode;
> > + struct inode *inode, *toput_inode = NULL;
> >
> > spin_lock(&inode_lock);
> > list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > if (inode->i_state & (I_FREEING|I_WILL_FREE))
> > continue;
>
> OT: it might be worth having an `if (mapping->nrpages==0) continue' here.

Good catch!

There are 25k open inodes on my desktop, and merely 10% of them have cached pages:

% cat /proc/sys/fs/inode-state
25395 129 0 0 0 0 0
# wc -l /proc/filecache
2542 /proc/filecache

+ if (!inode->i_mapping || !inode->i_mapping->nrpages)
+ continue;

2008-03-26 09:31:49

by Jan Kara

Subject: Re: [PATCH] vfs: Fix lock inversion in drop_pagecache_sb()

On Tue 25-03-08 12:53:54, Andrew Morton wrote:
> On Tue, 25 Mar 2008 19:12:27 +0100
> Jan Kara <[email protected]> wrote:
>
> > Fix longstanding lock inversion in drop_pagecache_sb by dropping inode_lock
> > before calling __invalidate_mapping_pages(). We just have to make sure
> > inode won't go away from under us by keeping reference to it and putting
> > the reference only after we have safely resumed the scan of the inode
> > list. A bit tricky but not too bad...
> >
> > Signed-off-by: Jan Kara <[email protected]>
> > CC: Fengguang Wu <[email protected]>
> > CC: David Chinner <[email protected]>
> >
> > ---
> > fs/drop_caches.c | 8 +++++++-
> > 1 files changed, 7 insertions(+), 1 deletions(-)
> >
> > diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> > index 59375ef..f5aae26 100644
> > --- a/fs/drop_caches.c
> > +++ b/fs/drop_caches.c
> > @@ -14,15 +14,21 @@ int sysctl_drop_caches;
> >
> > static void drop_pagecache_sb(struct super_block *sb)
> > {
> > - struct inode *inode;
> > + struct inode *inode, *toput_inode = NULL;
> >
> > spin_lock(&inode_lock);
> > list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > if (inode->i_state & (I_FREEING|I_WILL_FREE))
> > continue;
>
> OT: it might be worth having an `if (mapping->nrpages==0) continue' here.
Good idea. I'll send a patch in a minute.

> > + __iget(inode);
> > + spin_unlock(&inode_lock);
> > __invalidate_mapping_pages(inode->i_mapping, 0, -1, true);
> > + iput(toput_inode);
> > + toput_inode = inode;
> > + spin_lock(&inode_lock);
> > }
> > spin_unlock(&inode_lock);
> > + iput(toput_inode);
> > }
> >
> > void drop_pagecache(void)
>
> hrm. So we have a random ref on an inode without holding inode_lock. If
> we race with invalidate_list() we end up with an inode stuck on s_inodes
> and "Self-destruct in 5 seconds. Have a nice day...", don't we?
We hold s_umount for reading, so we should be safe against someone trying
to umount. We could possibly race with invalidate_list() called from
check_disk_change(), but removing media without unmounting is bad behavior
anyway. So I think we are fine.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-03-26 09:33:31

by Jan Kara

Subject: [PATCH] vfs: Skip inodes without pages to free in drop_pagecache_sb()


Signed-off-by: Jan Kara <[email protected]>
CC: Fengguang Wu <[email protected]>

---
fs/drop_caches.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index f5aae26..7327a42 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -20,6 +20,8 @@ static void drop_pagecache_sb(struct super_block *sb)
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (inode->i_state & (I_FREEING|I_WILL_FREE))
 			continue;
+		if (inode->i_mapping->nrpages == 0)
+			continue;
 		__iget(inode);
 		spin_unlock(&inode_lock);
 		__invalidate_mapping_pages(inode->i_mapping, 0, -1, true);
--
1.5.2.4