With -next 20110113 I'm experiencing lockups shortly after userspace
starts when booting with my root filesystem on NFS, log below. I can
boot into /bin/sh but running real userspace triggers this very easily.
This was introduced sometime this week.
I've not bisected or otherwise investigated much yet, but I do notice
code added recently by Nick in "fs: rcu-walk for path lookup" showing up
in the backtrace so including him in the CCs.
Tail of the console log:
INIT: version 2.86 booting
[ 15.250000] BUG: spinlock recursion on CPU#0, rc/1151
[ 15.250000] lock: c74ad678, .magic: dead4ead, .owner: rc/1151, .owner_cpu: 0
[ 15.260000] [<c0035118>] (unwind_backtrace+0x0/0xe4) from [<c0190454>] (do_raw_spin_lock+0x48/0x148)
[ 15.260000] [<c0190454>] (do_raw_spin_lock+0x48/0x148) from [<c00c89f0>] (nameidata_dentry_drop_rcu+0x84/0x180)
[ 15.270000] [<c00c89f0>] (nameidata_dentry_drop_rcu+0x84/0x180) from [<c00c8b1c>] (d_revalidate+0x30/0x58)
[ 15.280000] [<c00c8b1c>] (d_revalidate+0x30/0x58) from [<c00cb82c>] (link_path_walk+0xad8/0xb00)
[ 15.290000] [<c00cb82c>] (link_path_walk+0xad8/0xb00) from [<c00cba4c>] (do_path_lookup+0x44/0xcc)
[ 15.300000] [<c00cba4c>] (do_path_lookup+0x44/0xcc) from [<c00cc3f4>] (user_path_at+0x58/0x90)
[ 15.310000] [<c00cc3f4>] (user_path_at+0x58/0x90) from [<c00c4210>] (vfs_fstatat+0x28/0x54)
[ 15.320000] [<c00c4210>] (vfs_fstatat+0x28/0x54) from [<c00c431c>] (sys_stat64+0x14/0x34)
[ 15.330000] [<c00c431c>] (sys_stat64+0x14/0x34) from [<c002f4e0>] (ret_fast_syscall+0x0/0x30)
[ 25.830000] BUG: spinlock lockup on CPU#0, rc/1151, c74ad678
[ 25.830000] [<c0035118>] (unwind_backtrace+0x0/0xe4) from [<c0190518>] (do_raw_spin_lock+0x10c/0x148)
[ 25.840000] [<c0190518>] (do_raw_spin_lock+0x10c/0x148) from [<c00c89f0>] (nameidata_dentry_drop_rcu+0x84/0x180)
[ 25.850000] [<c00c89f0>] (nameidata_dentry_drop_rcu+0x84/0x180) from [<c00c8b1c>] (d_revalidate+0x30/0x58)
[ 25.860000] [<c00c8b1c>] (d_revalidate+0x30/0x58) from [<c00cb82c>] (link_path_walk+0xad8/0xb00)
[ 25.870000] [<c00cb82c>] (link_path_walk+0xad8/0xb00) from [<c00cba4c>] (do_path_lookup+0x44/0xcc)
[ 25.870000] [<c00cba4c>] (do_path_lookup+0x44/0xcc) from [<c00cc3f4>] (user_path_at+0x58/0x90)
[ 25.880000] [<c00cc3f4>] (user_path_at+0x58/0x90) from [<c00c4210>] (vfs_fstatat+0x28/0x54)
[ 25.890000] [<c00c4210>] (vfs_fstatat+0x28/0x54) from [<c00c431c>] (sys_stat64+0x14/0x34)
[ 25.900000] [<c00c431c>] (sys_stat64+0x14/0x34) from [<c002f4e0>] (ret_fast_syscall+0x0/0x30)
On Wed, Jan 19, 2011 at 5:43 PM, J. R. Okajima <[email protected]> wrote:
>
> Hi,
>
> Nick Piggin:
>> Thanks for your help, can you see how I've fixed it in my vfs-scale
>> tree? What do you think?
>
> Your fix is great. I have no objection at all.
> Other than the fix, here are more generic questions about vfs-scale work.
> I am happy if you reply when you have time.
Thanks for reviewing.
> - getcwd(2) needs d_lock?
> ?It acquires rename_lock and then tests whether the pwd is removed by
> ?d_unhashed(). If a race condition between vfs_rename_dir() which may
> ?unhash/rehash the dentry happens, then getcwd() may return the wrong
> ?result due to unprotected d_unhashed() call, I am afraid. rename_lock
> ?doesn't help this case.
We have the lock in write mode there, so it should exclude that
particular race. But I need to take another look at this code I
think, I'm not sure it's completely right, so I would appreciate reviews.
A while back I had some extra checks in there and would restart
the entire reverse walk in case of races... but need to think about
it.
> - what is the right order of dget() and mntget()?
> ?If I remember correctly, someone said "mntget() first and then
> ?dget(). when putting, do in reverse" in the discussion when
> ?path_{get,put}() were born. So it is called "the right order" in the
> ?commit log.
> ?It was many years ago. Is it still true? And should rcu-walk follow it
> ?too? The current implementation doesn't seem to care about this order.
Well dget and mntget is not a problem, because we can only do
mntget while already guaranteeing a reference on the mount, and
only dget when already guaranteeing a ref on the dentry (and mount).
But dput must happen before mntput so you don't have dentry ref
without mnt ref. Can you point out where rcu-walk does this wrongly?
> - d_move() and rename_lock
> ?This may be out of rcu-walk work, but rename_lock in d_move() looks
> ?outstanding since it surely kills concurrency. It is a pity that two
> ?unrelated but concurrent d_move-s are serialized when we run rename(2)
> ?on two different filesystems. Even if all of dentries, parents and
> ?hash buckets are different from each other, d_move() never run
> ?concurrently.
Yes I have a patch for that. I made a small hash table of rename locks.
This makes independent same-dir renames scalable. However that was
not the main motivation of the patch. On a really big POWER7 system,
the lookup path goes into a strange bimodal behaviour in the presence
of a relatively small amount of rename activity and sometimes starves
and throughput crashes. Breaking up rename_lock solves that too.
I'll wait until things settle down a bit more and perhaps have a chance
to get more numbers before submitting it (although I can show you when
I get back).
Thanks,
Nick
> -----Original Message-----
> From: J. R. Okajima [mailto:[email protected]]
> Sent: Thursday, January 13, 2011 6:52 PM
> To: Mark Brown
> Cc: Trond Myklebust; Santosh Shilimkar; Nick Piggin; linux-
> [email protected]; [email protected]; linux-
> [email protected]
> Subject: Re: NFS root lockups with -next 20110113
>
>
> Mark Brown:
> > With -next 20110113 I'm experiencing lockups shortly after
> userspace
> > starts when booting with my root filesystem on NFS, log below. I
> can
> > boot into /bin/sh but running real userspace triggers this very
> easily.
> > This was introduced sometime this week.
> >
> > I've not bisected or otherwise investigated much yet, but I do
> notice
> > code added recently by Nick in "fs: rcu-walk for path lookup"
> showing up
> > in the backtrace so including him in the CCs.
>
> This and a report from Santosh Shilimkar look like the same problem
> which I reported, and I am testing this patch. If you can, please
> try it
> too.
> Note: Of course this is not offcial fix.
>
It works. My board booted with NFS with below patch
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 5bb7588..51d052f 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -480,6 +480,7 @@ static int nameidata_dentry_drop_rcu(struct
> nameidata *nd, struct dentry *dentry
> {
> struct fs_struct *fs = current->fs;
> struct dentry *parent = nd->path.dentry;
> + int isroot;
>
> BUG_ON(!(nd->flags & LOOKUP_RCU));
> if (nd->root.mnt) {
> @@ -489,18 +490,22 @@ static int nameidata_dentry_drop_rcu(struct
> nameidata *nd, struct dentry *dentry
> goto err_root;
> }
> spin_lock(&parent->d_lock);
> - spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
> - if (!__d_rcu_to_refcount(dentry, nd->seq))
> - goto err;
> + isroot = IS_ROOT(dentry);
> + if (!isroot) {
> + spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
> + if (!__d_rcu_to_refcount(dentry, nd->seq))
> + goto err;
> + }
> /*
> * If the sequence check on the child dentry passed, then the
> child has
> * not been removed from its parent. This means the parent
> dentry must
> * be valid and able to take a reference at this point.
> */
> - BUG_ON(!IS_ROOT(dentry) && dentry->d_parent != parent);
> + BUG_ON(!isroot && dentry->d_parent != parent);
> BUG_ON(!parent->d_count);
> parent->d_count++;
> - spin_unlock(&dentry->d_lock);
> + if (!isroot)
> + spin_unlock(&dentry->d_lock);
> spin_unlock(&parent->d_lock);
> if (nd->root.mnt) {
> path_get(&nd->root);
> @@ -513,7 +518,8 @@ static int nameidata_dentry_drop_rcu(struct
> nameidata *nd, struct dentry *dentry
> nd->flags &= ~LOOKUP_RCU;
> return 0;
> err:
> - spin_unlock(&dentry->d_lock);
> + if (!isroot)
> + spin_unlock(&dentry->d_lock);
> spin_unlock(&parent->d_lock);
> err_root:
> if (nd->root.mnt)
On Fri, Jan 14, 2011 at 12:45 AM, J. R. Okajima <[email protected]> wrote:
>
> Santosh Shilimkar:
>> It works. My board booted with NFS with below patch
>
> ----------------------------------------
>
> Mark Brown;
>> Seems to do the trick:
>>
>> Tested-by: Mark Brown <[email protected]>
>
> Thank you so quick tests.
> When I post the patch to be mainlined, I will add Tested-by for both of
> you. But Nick Piggin may take another approach.
Thanks for your help, can you see how I've fixed it in my vfs-scale
tree? What do you think?
Thanks,
Nick
Hi,
Nick Piggin:
> Thanks for your help, can you see how I've fixed it in my vfs-scale
> tree? What do you think?
Your fix is great. I have no objection at all.
Other than the fix, here are more generic questions about vfs-scale work.
I am happy if you reply when you have time.
- getcwd(2) needs d_lock?
It acquires rename_lock and then tests whether the pwd is removed by
d_unhashed(). If a race condition between vfs_rename_dir() which may
unhash/rehash the dentry happens, then getcwd() may return the wrong
result due to unprotected d_unhashed() call, I am afraid. rename_lock
doesn't help this case.
- what is the right order of dget() and mntget()?
If I remember correctly, someone said "mntget() first and then
dget(). when putting, do in reverse" in the discussion when
path_{get,put}() were born. So it is called "the right order" in the
commit log.
It was many years ago. Is it still true? And should rcu-walk follow it
too? The current implementation doesn't seem to care about this order.
- d_move() and rename_lock
This may be out of rcu-walk work, but rename_lock in d_move() looks
outstanding since it surely kills concurrency. It is a pity that two
unrelated but concurrent d_move-s are serialized when we run rename(2)
on two different filesystems. Even if all of dentries, parents and
hash buckets are different from each other, d_move() never run
concurrently.
J. R. Okajima
Mark Brown:
> With -next 20110113 I'm experiencing lockups shortly after userspace
> starts when booting with my root filesystem on NFS, log below. I can
> boot into /bin/sh but running real userspace triggers this very easily.
> This was introduced sometime this week.
>
> I've not bisected or otherwise investigated much yet, but I do notice
> code added recently by Nick in "fs: rcu-walk for path lookup" showing up
> in the backtrace so including him in the CCs.
This and a report from Santosh Shilimkar look like the same problem
which I reported, and I am testing this patch. If you can, please try it
too.
Note: Of course this is not offcial fix.
J. R. Okajima
diff --git a/fs/namei.c b/fs/namei.c
index 5bb7588..51d052f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -480,6 +480,7 @@ static int nameidata_dentry_drop_rcu(struct nameidata *nd, struct dentry *dentry
{
struct fs_struct *fs = current->fs;
struct dentry *parent = nd->path.dentry;
+ int isroot;
BUG_ON(!(nd->flags & LOOKUP_RCU));
if (nd->root.mnt) {
@@ -489,18 +490,22 @@ static int nameidata_dentry_drop_rcu(struct nameidata *nd, struct dentry *dentry
goto err_root;
}
spin_lock(&parent->d_lock);
- spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
- if (!__d_rcu_to_refcount(dentry, nd->seq))
- goto err;
+ isroot = IS_ROOT(dentry);
+ if (!isroot) {
+ spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
+ if (!__d_rcu_to_refcount(dentry, nd->seq))
+ goto err;
+ }
/*
* If the sequence check on the child dentry passed, then the child has
* not been removed from its parent. This means the parent dentry must
* be valid and able to take a reference at this point.
*/
- BUG_ON(!IS_ROOT(dentry) && dentry->d_parent != parent);
+ BUG_ON(!isroot && dentry->d_parent != parent);
BUG_ON(!parent->d_count);
parent->d_count++;
- spin_unlock(&dentry->d_lock);
+ if (!isroot)
+ spin_unlock(&dentry->d_lock);
spin_unlock(&parent->d_lock);
if (nd->root.mnt) {
path_get(&nd->root);
@@ -513,7 +518,8 @@ static int nameidata_dentry_drop_rcu(struct nameidata *nd, struct dentry *dentry
nd->flags &= ~LOOKUP_RCU;
return 0;
err:
- spin_unlock(&dentry->d_lock);
+ if (!isroot)
+ spin_unlock(&dentry->d_lock);
spin_unlock(&parent->d_lock);
err_root:
if (nd->root.mnt)
Nick Piggin:
> Thanks for your help, can you see how I've fixed it in my vfs-scale
> tree? What do you think?
I will.
It may take some time.
J. R. Okajima
Santosh Shilimkar:
> It works. My board booted with NFS with below patch
----------------------------------------
Mark Brown;
> Seems to do the trick:
>
> Tested-by: Mark Brown <[email protected]>
Thank you so quick tests.
When I post the patch to be mainlined, I will add Tested-by for both of
you. But Nick Piggin may take another approach.
J. R. Okajima
On Thu, Jan 13, 2011 at 10:22:07PM +0900, J. R. Okajima wrote:
> This and a report from Santosh Shilimkar look like the same problem
> which I reported, and I am testing this patch. If you can, please try it
> too.
> Note: Of course this is not offcial fix.
Seems to do the trick:
Tested-by: Mark Brown <broonie-yzvPICuk2AABXRvFM2YniueM+bqZidxxQQ4Iyu8u01E@public.gmane.org>
(+ 'linux-arm' since same problem was getting discussed
in another thread)
> -----Original Message-----
> From: Santosh Shilimkar [mailto:[email protected]]
> Sent: Thursday, January 13, 2011 6:59 PM
> To: 'J. R. Okajima'; 'Mark Brown'
> Cc: 'Trond Myklebust'; 'Nick Piggin'; '[email protected]';
> '[email protected]'; '[email protected]'
> Subject: RE: NFS root lockups with -next 20110113
>
> > -----Original Message-----
> > From: J. R. Okajima [mailto:hooanon05-/[email protected]]
> > Sent: Thursday, January 13, 2011 6:52 PM
> > To: Mark Brown
> > Cc: Trond Myklebust; Santosh Shilimkar; Nick Piggin; linux-
> > [email protected]; [email protected]; linux-
> > [email protected]
> > Subject: Re: NFS root lockups with -next 20110113
> >
> >
> > Mark Brown:
> > > With -next 20110113 I'm experiencing lockups shortly after
> > userspace
> > > starts when booting with my root filesystem on NFS, log below.
> I
> > can
> > > boot into /bin/sh but running real userspace triggers this very
> > easily.
> > > This was introduced sometime this week.
> > >
> > > I've not bisected or otherwise investigated much yet, but I do
> > notice
> > > code added recently by Nick in "fs: rcu-walk for path lookup"
> > showing up
> > > in the backtrace so including him in the CCs.
> >
> > This and a report from Santosh Shilimkar look like the same
> problem
> > which I reported, and I am testing this patch. If you can, please
> > try it
> > too.
> > Note: Of course this is not offcial fix.
> >
>
> It works. My board booted with NFS with below patch
>
>
> >
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 5bb7588..51d052f 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -480,6 +480,7 @@ static int nameidata_dentry_drop_rcu(struct
> > nameidata *nd, struct dentry *dentry
> > {
> > struct fs_struct *fs = current->fs;
> > struct dentry *parent = nd->path.dentry;
> > + int isroot;
> >
> > BUG_ON(!(nd->flags & LOOKUP_RCU));
> > if (nd->root.mnt) {
> > @@ -489,18 +490,22 @@ static int nameidata_dentry_drop_rcu(struct
> > nameidata *nd, struct dentry *dentry
> > goto err_root;
> > }
> > spin_lock(&parent->d_lock);
> > - spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
> > - if (!__d_rcu_to_refcount(dentry, nd->seq))
> > - goto err;
> > + isroot = IS_ROOT(dentry);
> > + if (!isroot) {
> > + spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
> > + if (!__d_rcu_to_refcount(dentry, nd->seq))
> > + goto err;
> > + }
> > /*
> > * If the sequence check on the child dentry passed, then the
> > child has
> > * not been removed from its parent. This means the parent
> > dentry must
> > * be valid and able to take a reference at this point.
> > */
> > - BUG_ON(!IS_ROOT(dentry) && dentry->d_parent != parent);
> > + BUG_ON(!isroot && dentry->d_parent != parent);
> > BUG_ON(!parent->d_count);
> > parent->d_count++;
> > - spin_unlock(&dentry->d_lock);
> > + if (!isroot)
> > + spin_unlock(&dentry->d_lock);
> > spin_unlock(&parent->d_lock);
> > if (nd->root.mnt) {
> > path_get(&nd->root);
> > @@ -513,7 +518,8 @@ static int nameidata_dentry_drop_rcu(struct
> > nameidata *nd, struct dentry *dentry
> > nd->flags &= ~LOOKUP_RCU;
> > return 0;
> > err:
> > - spin_unlock(&dentry->d_lock);
> > + if (!isroot)
> > + spin_unlock(&dentry->d_lock);
> > spin_unlock(&parent->d_lock);
> > err_root:
> > if (nd->root.mnt)
On Wed, 2011-01-19 at 15:43 +0900, J. R. Okajima wrote:
> Hi,
>
> Nick Piggin:
> > Thanks for your help, can you see how I've fixed it in my vfs-scale
> > tree? What do you think?
>
> Your fix is great. I have no objection at all.
> Other than the fix, here are more generic questions about vfs-scale work.
> I am happy if you reply when you have time.
>
> - getcwd(2) needs d_lock?
> It acquires rename_lock and then tests whether the pwd is removed by
> d_unhashed(). If a race condition between vfs_rename_dir() which may
> unhash/rehash the dentry happens, then getcwd() may return the wrong
> result due to unprotected d_unhashed() call, I am afraid. rename_lock
> doesn't help this case.
>
> - what is the right order of dget() and mntget()?
> If I remember correctly, someone said "mntget() first and then
> dget(). when putting, do in reverse" in the discussion when
> path_{get,put}() were born. So it is called "the right order" in the
> commit log.
> It was many years ago. Is it still true? And should rcu-walk follow it
> too? The current implementation doesn't seem to care about this order.
I didn't spot that, where did you see this?
I'm not sure about the get but I fairly sure the dput() has to be before
the mntput() because the shrink_dcache_*() cleanup routines object to
dentrys that have a reference count of more than one.
Ian
Ian Kent:
> > - what is the right order of dget() and mntget()?
> > If I remember correctly, someone said "mntget() first and then
> > dget(). when putting, do in reverse" in the discussion when
> > path_{get,put}() were born. So it is called "the right order" in the
> > commit log.
> > It was many years ago. Is it still true? And should rcu-walk follow it
> > too? The current implementation doesn't seem to care about this order.
>
> I didn't spot that, where did you see this?
>
> I'm not sure about the get but I fairly sure the dput() has to be before
> the mntput() because the shrink_dcache_*() cleanup routines object to
> dentrys that have a reference count of more than one.
For dget - mntget, there are several such code. For example,
nameidata_dentry_drop_rcu()
{
struct dentry *parent = nd->path.dentry;
:::
parent->d_count++;
spin_unlock(&dentry->d_lock);
spin_unlock(&parent->d_lock);
:::
mntget(nd->path.mnt);
:::
But I am not sure the "get" order is a problem.
Nick Piggin also replied and said dget and mntget is not a problem, and
I replied if I found such "put" order, I would write again.
J. R. Okajima