Hi,
I've been trying to track down the cause of the following messages which
appear in my logs each time I start up dlm_controld:
Dec 1 12:53:17 men-an-tol kernel:
=============================================
Dec 1 12:53:17 men-an-tol kernel: [ INFO: possible recursive locking
detected ]
Dec 1 12:53:17 men-an-tol kernel: 2.6.28-rc5 #179
Dec 1 12:53:17 men-an-tol kernel:
---------------------------------------------
Dec 1 12:53:17 men-an-tol kernel: dlm_controld/2455 is trying to
acquire lock:
Dec 1 12:53:17 men-an-tol kernel:
(&sb->s_type->i_mutex_key#11/2){--..}, at: [<
ffffffffa0294d76>] configfs_attach_group+0x4a/0x183 [configfs]
Dec 1 12:53:17 men-an-tol kernel:
Dec 1 12:53:17 men-an-tol kernel: but task is already holding lock:
Dec 1 12:53:17 men-an-tol kernel:
(&sb->s_type->i_mutex_key#11/2){--..}, at: [<
ffffffffa0294d76>] configfs_attach_group+0x4a/0x183 [configfs]
Dec 1 12:53:17 men-an-tol kernel:
Dec 1 12:53:17 men-an-tol kernel: other info that might help us debug
this:
Dec 1 12:53:17 men-an-tol kernel: 2 locks held by dlm_controld/2455:
Dec 1 12:53:17 men-an-tol kernel: #0:
(&sb->s_type->i_mutex_key#10/1){--..}, a
t: [<ffffffff802b5e11>] lookup_create+0x26/0x94
Dec 1 12:53:17 men-an-tol kernel: #1:
(&sb->s_type->i_mutex_key#11/2){--..}, a
t: [<ffffffffa0294d76>] configfs_attach_group+0x4a/0x183 [configfs]
Dec 1 12:53:17 men-an-tol kernel:
which seems to be caused by the mkdir which dlm_controld does in
configfs. Looking at the stack trace, this didn't make much sense until
I stuck noinline in front of several functions in configfs, whereupon I
get:
[<ffffffff8025808e>] __lock_acquire+0xdce/0x14f5
[<ffffffff80254894>] ? get_lock_stats+0x34/0x5c
[<ffffffff802548ca>] ? put_lock_stats+0xe/0x27
[<ffffffff802549c3>] ? lock_release_holdtime+0xe0/0xe5
[<ffffffff8025880a>] lock_acquire+0x55/0x71
[<ffffffffa024add0>] ? configfs_attach_group+0x40/0x89 [configfs]
[<ffffffff8050563c>] mutex_lock_nested+0xf9/0x2c5
[<ffffffffa024add0>] ? configfs_attach_group+0x40/0x89 [configfs]
[<ffffffffa024add0>] ? configfs_attach_group+0x40/0x89 [configfs]
[<ffffffffa024ac7c>] ? configfs_attach_item+0xed/0x201 [configfs]
[<ffffffffa024add0>] configfs_attach_group+0x40/0x89 [configfs] <- second call
[<ffffffffa024aec5>] create_default_group+0xac/0xe3 [configfs]
[<ffffffffa024af24>] populate_groups+0x28/0x52 [configfs]
[<ffffffffa024add8>] configfs_attach_group+0x48/0x89 [configfs] <- first call
[<ffffffffa024b222>] configfs_mkdir+0x2d4/0x3bf [configfs]
[<ffffffff802b66ef>] vfs_mkdir+0xb0/0x121
[<ffffffff802b8afa>] sys_mkdirat+0xa2/0xf5
[<ffffffff8020b4fc>] ? sysret_check+0x27/0x62
[<ffffffff802569ec>] ? trace_hardirqs_on_caller+0xf0/0x114
[<ffffffff8026cf14>] ? audit_syscall_entry+0x126/0x15a
[<ffffffff802b8b60>] sys_mkdir+0x13/0x15
[<ffffffff8020b4cb>] system_call_fastpath+0x16/0x1b
so it looks like configfs_attach_group is being called recursively in
this case, and I think thats the cause of the warning messages. Also I
spotted a couple of other things... from configfs_attach_item() the
inode mutex which is being locked just uses a plain old mutex_lock()
call whereas in configfs_attach_group() which calls
configfs_attach_item() there is an annotated I_MUTEX_CHILD call. I would
have expected them both to be the same since I presume that the parent
is common (locked by the VFS if I've understood whats going on here).
The second thing is that configfs_attach_item() calls populate_attrs()
which calls through to configfs_add_file(), so in order words it seems
to also be called from the context of the mkdir call. In that case the
inode mutex is locked with I_MUTEX_NORMAL annotation.
So I'm a bit confused as to why lockdep doesn't flag up those issues too
since they appear to occur before the one which produced the above
message, or maybe I've misunderstood how the annotation works.
Any ideas what is going wrong here? I think it must just be an
annotation issue since it appears that configfs works perfectly ok
otherwise, but it would be nice to get to the bottom of it,
Steve.
Hi,
On 11/12/08 14:20 +0000, Steven Whitehouse wrote:
> Hi,
>
> I've been trying to track down the cause of the following messages which
> appear in my logs each time I start up dlm_controld:
>
> Dec 1 12:53:17 men-an-tol kernel:
> =============================================
> Dec 1 12:53:17 men-an-tol kernel: [ INFO: possible recursive locking
> detected ]
> Dec 1 12:53:17 men-an-tol kernel: 2.6.28-rc5 #179
> Dec 1 12:53:17 men-an-tol kernel:
> ---------------------------------------------
> Dec 1 12:53:17 men-an-tol kernel: dlm_controld/2455 is trying to
> acquire lock:
> Dec 1 12:53:17 men-an-tol kernel:
> (&sb->s_type->i_mutex_key#11/2){--..}, at: [<
> ffffffffa0294d76>] configfs_attach_group+0x4a/0x183 [configfs]
> Dec 1 12:53:17 men-an-tol kernel:
> Dec 1 12:53:17 men-an-tol kernel: but task is already holding lock:
> Dec 1 12:53:17 men-an-tol kernel:
> (&sb->s_type->i_mutex_key#11/2){--..}, at: [<
> ffffffffa0294d76>] configfs_attach_group+0x4a/0x183 [configfs]
> Dec 1 12:53:17 men-an-tol kernel:
> Dec 1 12:53:17 men-an-tol kernel: other info that might help us debug
> this:
> Dec 1 12:53:17 men-an-tol kernel: 2 locks held by dlm_controld/2455:
> Dec 1 12:53:17 men-an-tol kernel: #0:
> (&sb->s_type->i_mutex_key#10/1){--..}, a
> t: [<ffffffff802b5e11>] lookup_create+0x26/0x94
> Dec 1 12:53:17 men-an-tol kernel: #1:
> (&sb->s_type->i_mutex_key#11/2){--..}, a
> t: [<ffffffffa0294d76>] configfs_attach_group+0x4a/0x183 [configfs]
> Dec 1 12:53:17 men-an-tol kernel:
>
> which seems to be caused by the mkdir which dlm_controld does in
> configfs. Looking at the stack trace, this didn't make much sense until
> I stuck noinline in front of several functions in configfs, whereupon I
> get:
> [<ffffffff8025808e>] __lock_acquire+0xdce/0x14f5
> [<ffffffff80254894>] ? get_lock_stats+0x34/0x5c
> [<ffffffff802548ca>] ? put_lock_stats+0xe/0x27
> [<ffffffff802549c3>] ? lock_release_holdtime+0xe0/0xe5
> [<ffffffff8025880a>] lock_acquire+0x55/0x71
> [<ffffffffa024add0>] ? configfs_attach_group+0x40/0x89 [configfs]
> [<ffffffff8050563c>] mutex_lock_nested+0xf9/0x2c5
> [<ffffffffa024add0>] ? configfs_attach_group+0x40/0x89 [configfs]
> [<ffffffffa024add0>] ? configfs_attach_group+0x40/0x89 [configfs]
> [<ffffffffa024ac7c>] ? configfs_attach_item+0xed/0x201 [configfs]
> [<ffffffffa024add0>] configfs_attach_group+0x40/0x89 [configfs] <- second call
> [<ffffffffa024aec5>] create_default_group+0xac/0xe3 [configfs]
> [<ffffffffa024af24>] populate_groups+0x28/0x52 [configfs]
> [<ffffffffa024add8>] configfs_attach_group+0x48/0x89 [configfs] <- first call
> [<ffffffffa024b222>] configfs_mkdir+0x2d4/0x3bf [configfs]
> [<ffffffff802b66ef>] vfs_mkdir+0xb0/0x121
> [<ffffffff802b8afa>] sys_mkdirat+0xa2/0xf5
> [<ffffffff8020b4fc>] ? sysret_check+0x27/0x62
> [<ffffffff802569ec>] ? trace_hardirqs_on_caller+0xf0/0x114
> [<ffffffff8026cf14>] ? audit_syscall_entry+0x126/0x15a
> [<ffffffff802b8b60>] sys_mkdir+0x13/0x15
> [<ffffffff8020b4cb>] system_call_fastpath+0x16/0x1b
These warnings are known issues. This results from a lack of lockdep annotations
in configfs. I must admit that I started to send patches for that a few months
ago, and then could not find time to finish this work.
The problem is a bit harder than just playing with I_MUTEX_CHILD, I_MUTEX_PARENT
and I_MUTEX_NORMAL, since configfs recursively locks variable numbers
(this can go to as many as the depth of the whole configfs tree) of
config_group inodes during operations like mkdir(), rmdir(), and depend_item().
I was working on two kinds of solutions:
1) insert lockdep_off()/lockdep_on() at the places of recursion,
2) separate default groups inode mutex classes according to their depth under
the created group they belong to.
People tend to reject any proposition like 1), but IIRC Joel was tending to
accept it.
Solution 2) does not work for depend_item(). This needs to rework the locking
scheme of depend_item() by removing the variable lock recursion depth, and I
think that it's doable thanks to the configfs_dirent_lock.
Joel, what do you think about this?
Anyway, I still hope to find time for this :)
Louis
>
> so it looks like configfs_attach_group is being called recursively in
> this case, and I think thats the cause of the warning messages. Also I
> spotted a couple of other things... from configfs_attach_item() the
> inode mutex which is being locked just uses a plain old mutex_lock()
> call whereas in configfs_attach_group() which calls
> configfs_attach_item() there is an annotated I_MUTEX_CHILD call. I would
> have expected them both to be the same since I presume that the parent
> is common (locked by the VFS if I've understood whats going on here).
>
> The second thing is that configfs_attach_item() calls populate_attrs()
> which calls through to configfs_add_file(), so in order words it seems
> to also be called from the context of the mkdir call. In that case the
> inode mutex is locked with I_MUTEX_NORMAL annotation.
>
> So I'm a bit confused as to why lockdep doesn't flag up those issues too
> since they appear to occur before the one which produced the above
> message, or maybe I've misunderstood how the annotation works.
>
> Any ideas what is going wrong here? I think it must just be an
> annotation issue since it appears that configfs works perfectly ok
> otherwise, but it would be nice to get to the bottom of it,
>
> Steve.
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
On Thu, Dec 11, 2008 at 03:44:41PM +0100, Louis Rilling wrote:
> These warnings are known issues. This results from a lack of lockdep annotations
> in configfs. I must admit that I started to send patches for that a few months
> ago, and then could not find time to finish this work.
>
> The problem is a bit harder than just playing with I_MUTEX_CHILD, I_MUTEX_PARENT
> and I_MUTEX_NORMAL, since configfs recursively locks variable numbers
> (this can go to as many as the depth of the whole configfs tree) of
> config_group inodes during operations like mkdir(), rmdir(), and depend_item().
>
> I was working on two kinds of solutions:
> 1) insert lockdep_off()/lockdep_on() at the places of recursion,
> 2) separate default groups inode mutex classes according to their depth under
> the created group they belong to.
>
> People tend to reject any proposition like 1), but IIRC Joel was tending to
> accept it.
>
> Solution 2) does not work for depend_item(). This needs to rework the locking
> scheme of depend_item() by removing the variable lock recursion depth, and I
> think that it's doable thanks to the configfs_dirent_lock.
> Joel, what do you think about this?
I've been waiting for your patch for (1). I am wary of the (2)
approach. Not because it wouldn't work for mkdir(2) - I think it would.
But rmdir(2) has the same recursive locking, with far more importance
(live objects), and would print the same error.
Joel
--
Life's Little Instruction Book #267
"Lie on your back and look at the stars."
Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On 11/12/08 9:34 -0800, Joel Becker wrote:
> On Thu, Dec 11, 2008 at 03:44:41PM +0100, Louis Rilling wrote:
> > These warnings are known issues. This results from a lack of lockdep annotations
> > in configfs. I must admit that I started to send patches for that a few months
> > ago, and then could not find time to finish this work.
> >
> > The problem is a bit harder than just playing with I_MUTEX_CHILD, I_MUTEX_PARENT
> > and I_MUTEX_NORMAL, since configfs recursively locks variable numbers
> > (this can go to as many as the depth of the whole configfs tree) of
> > config_group inodes during operations like mkdir(), rmdir(), and depend_item().
> >
> > I was working on two kinds of solutions:
> > 1) insert lockdep_off()/lockdep_on() at the places of recursion,
> > 2) separate default groups inode mutex classes according to their depth under
> > the created group they belong to.
> >
> > People tend to reject any proposition like 1), but IIRC Joel was tending to
> > accept it.
> >
> > Solution 2) does not work for depend_item(). This needs to rework the locking
> > scheme of depend_item() by removing the variable lock recursion depth, and I
> > think that it's doable thanks to the configfs_dirent_lock.
> > Joel, what do you think about this?
>
> I've been waiting for your patch for (1). I am wary of the (2)
> approach. Not because it wouldn't work for mkdir(2) - I think it would.
> But rmdir(2) has the same recursive locking, with far more importance
> (live objects), and would print the same error.
??
(2) is fine for both mkdir() and rmdir(), since they both lock inodes'
mutexes along paths, and only paths. The problem is depend_item().
Anyway, I'll post patches based on (1) and later I'll propose patches based
on (2).
Louis
--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
When attaching default groups (subdirs) of a new group (in mkdir() or
in configfs_register()), configfs recursively takes inode's mutexes
along the path from the parent of the new group to the default
subdirs. This is needed to ensure that the VFS will not race with
operations on these sub-dirs. This is safe for the following reasons:
- the VFS allows one to lock first an inode and second one of its
children (The lock subclasses for this pattern are respectively
I_MUTEX_PARENT and I_MUTEX_CHILD);
- from this rule any inode path can be recursively locked in
descending order as long as it stays under a single mountpoint and
does not follow symlinks.
Unfortunately lockdep does not know (yet?) how to handle such
recursion.
I've tried to use Peter Zijlstra's lock_set_subclass() helper to
upgrade i_mutexes from I_MUTEX_CHILD to I_MUTEX_PARENT when we know
that we might recursively lock some of their descendant, but this
usage does not seem to fit the purpose of lock_set_subclass() because
it leads to several i_mutex locked with subclass I_MUTEX_PARENT by
the same task.
>From inside configfs it is not possible to serialize those recursive
locking with a top-level one, because mkdir() and rmdir() are already
called with inodes locked by the VFS. So using some
mutex_lock_nest_lock() is not an option.
I am proposing two solutions:
1) one that wraps recursive mutex_lock()s with
lockdep_off()/lockdep_on().
2) (as suggested earlier by Peter Zijlstra) one that puts the
i_mutexes recursively locked in different classes based on their
depth from the top-level config_group created. This
induces an arbitrary limit (MAX_LOCK_DEPTH - 2 == 46) on the
nesting of configfs default groups whenever lockdep is activated
but this limit looks reasonably high. Unfortunately, this alos
isolates VFS operations on configfs default groups from the others
and thus lowers the chances to detect locking issues.
This patch implements solution 1).
Solution 2) looks better from lockdep's point of view, but fails with
configfs_depend_item(). This needs to rework the locking
scheme of configfs_depend_item() by removing the variable lock recursion
depth, and I think that it's doable thanks to the configfs_dirent_lock.
For now, let's stick to solution 1).
Signed-off-by: Louis Rilling <[email protected]>
---
fs/configfs/dir.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 59 insertions(+), 0 deletions(-)
diff --git a/fs/configfs/dir.c b/fs/configfs/dir.c
index 8e93341..9c23583 100644
--- a/fs/configfs/dir.c
+++ b/fs/configfs/dir.c
@@ -553,12 +553,24 @@ static void detach_groups(struct config_group *group)
child = sd->s_dentry;
+ /*
+ * Note: we hide this from lockdep since we have no way
+ * to teach lockdep about recursive
+ * I_MUTEX_PARENT -> I_MUTEX_CHILD patterns along a path
+ * in an inode tree, which are valid as soon as
+ * I_MUTEX_PARENT -> I_MUTEX_CHILD is valid from a
+ * parent inode to one of its children.
+ */
+ lockdep_off();
mutex_lock(&child->d_inode->i_mutex);
+ lockdep_on();
configfs_detach_group(sd->s_element);
child->d_inode->i_flags |= S_DEAD;
+ lockdep_off();
mutex_unlock(&child->d_inode->i_mutex);
+ lockdep_on();
d_delete(child);
dput(child);
@@ -748,11 +760,22 @@ static int configfs_attach_item(struct config_item *parent_item,
* We are going to remove an inode and its dentry but
* the VFS may already have hit and used them. Thus,
* we must lock them as rmdir() would.
+ *
+ * Note: we hide this from lockdep since we have no way
+ * to teach lockdep about recursive
+ * I_MUTEX_PARENT -> I_MUTEX_CHILD patterns along a path
+ * in an inode tree, which are valid as soon as
+ * I_MUTEX_PARENT -> I_MUTEX_CHILD is valid from a
+ * parent inode to one of its children.
*/
+ lockdep_off();
mutex_lock(&dentry->d_inode->i_mutex);
+ lockdep_on();
configfs_remove_dir(item);
dentry->d_inode->i_flags |= S_DEAD;
+ lockdep_off();
mutex_unlock(&dentry->d_inode->i_mutex);
+ lockdep_on();
d_delete(dentry);
}
}
@@ -787,14 +810,25 @@ static int configfs_attach_group(struct config_item *parent_item,
*
* We must also lock the inode to remove it safely in case of
* error, as rmdir() would.
+ *
+ * Note: we hide this from lockdep since we have no way
+ * to teach lockdep about recursive
+ * I_MUTEX_PARENT -> I_MUTEX_CHILD patterns along a path
+ * in an inode tree, which are valid as soon as
+ * I_MUTEX_PARENT -> I_MUTEX_CHILD is valid from a
+ * parent inode to one of its children.
*/
+ lockdep_off();
mutex_lock_nested(&dentry->d_inode->i_mutex, I_MUTEX_CHILD);
+ lockdep_on();
ret = populate_groups(to_config_group(item));
if (ret) {
configfs_detach_item(item);
dentry->d_inode->i_flags |= S_DEAD;
}
+ lockdep_off();
mutex_unlock(&dentry->d_inode->i_mutex);
+ lockdep_on();
if (ret)
d_delete(dentry);
}
@@ -956,7 +990,17 @@ static int configfs_depend_prep(struct dentry *origin,
BUG_ON(!origin || !sd);
/* Lock this guy on the way down */
+ /*
+ * Note: we hide this from lockdep since we have no way
+ * to teach lockdep about recursive
+ * I_MUTEX_PARENT -> I_MUTEX_CHILD patterns along a path
+ * in an inode tree, which are valid as soon as
+ * I_MUTEX_PARENT -> I_MUTEX_CHILD is valid from a
+ * parent inode to one of its children.
+ */
+ lockdep_off();
mutex_lock(&sd->s_dentry->d_inode->i_mutex);
+ lockdep_on();
if (sd->s_element == target) /* Boo-yah */
goto out;
@@ -970,7 +1014,9 @@ static int configfs_depend_prep(struct dentry *origin,
}
/* We looped all our children and didn't find target */
+ lockdep_off();
mutex_unlock(&sd->s_dentry->d_inode->i_mutex);
+ lockdep_on();
ret = -ENOENT;
out:
@@ -990,11 +1036,16 @@ static void configfs_depend_rollback(struct dentry *origin,
struct dentry *dentry = item->ci_dentry;
while (dentry != origin) {
+ /* See comments in configfs_depend_prep() */
+ lockdep_off();
mutex_unlock(&dentry->d_inode->i_mutex);
+ lockdep_on();
dentry = dentry->d_parent;
}
+ lockdep_off();
mutex_unlock(&origin->d_inode->i_mutex);
+ lockdep_on();
}
int configfs_depend_item(struct configfs_subsystem *subsys,
@@ -1329,8 +1380,16 @@ static int configfs_rmdir(struct inode *dir, struct dentry *dentry)
}
/* Wait until the racing operation terminates */
+ /*
+ * Note: we hide this from lockdep since we are locked
+ * with subclass I_MUTEX_NORMAL from vfs_rmdir() (why
+ * not I_MUTEX_CHILD?), and I_MUTEX_XATTR or
+ * I_MUTEX_QUOTA are not relevant for the locked inode.
+ */
+ lockdep_off();
mutex_lock(wait_mutex);
mutex_unlock(wait_mutex);
+ lockdep_on();
}
} while (ret == -EAGAIN);
--
1.5.6.5
On Thu, Dec 18, 2008 at 01:28:28PM +0100, Peter Zijlstra wrote:
> In fact, both (configfs) mkdir and rmdir seem to synchronize on
> su_mutex..
>
> mkdir B/C/bar
>
> C.i_mutex
> su_mutex
>
> vs
>
> rmdir foo
>
> parent(foo).i_mutex
> foo.i_mutex
> su_mutex
>
>
> once holding the rmdir su_mutex you can check foo's user-content, since
> any mkdir will be blocked. All you have to do is then re-validate in
> mkdir's su_mutex that !IS_DEADDIR(C).
We explicitly do not take any i_mutex locks after taking
su_mutex. That's an ABBA risk. su_mutex protects the hierarchy of
config_items. i_mutex protects the vfs view thereof.
If you look in mkdir, we take su_mutex, get a new item from the
client subsystem, then drop su_mutex. After that, we go about building
our filesystem structure, using i_mutex where appropriate. More
importantly is rmdir(2), where we use i_mutex in
configfs_detach_group(), but are not holding su_sem. Only when
configfs_detach_group() has successfully returned and we have torn down
the filesystem structure do we take su_mutex and tear down the
config_item structure.
In fact, we're part of the way there. Check out that
USET_DROPPING flag we set in detach_prep() while scanning for user
objects. That flags us racing mkdir(2). When we are done with
detach_prep(), we know that mkdir(2) calls racing behind us will do
nothing until we safely lock them out with the locking in
detach_group(). All mkdir(2) calls will have exited by the time we get
the mutex, and no new mkdir(2) call can start because we have the mutex.
Now look in detach_groups(). We drop the groups children before
marking them DEAD. Louis' plan, I think, is to perhaps mark a group
DEAD, disconnect it from the vfs, and then operate on its children. In
this fashion, perhaps we can unlock the trailing lock like a normal VFS
operation.
This will require some serious auditing, however, because now
vfs functions can get into the vfs objects behind us. And more vfs
changes affect us. Whereas the current locking relies on the vfs's
parent->child lock ordering only, something that isn't likely to be
changed.
Joel
--
"You cannot bring about prosperity by discouraging thrift. You cannot
strengthen the weak by weakening the strong. You cannot help the wage
earner by pulling down the wage payer. You cannot further the
brotherhood of man by encouraging class hatred. You cannot help the
poor by destroying the rich. You cannot build character and courage by
taking away a man's initiative and independence. You cannot help men
permanently by doing for them what they could and should do for
themselves."
- Abraham Lincoln
Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On 18/12/08 14:58 -0800, Joel Becker wrote:
> On Thu, Dec 18, 2008 at 01:28:28PM +0100, Peter Zijlstra wrote:
> > In fact, both (configfs) mkdir and rmdir seem to synchronize on
> > su_mutex..
> >
> > mkdir B/C/bar
> >
> > C.i_mutex
> > su_mutex
> >
> > vs
> >
> > rmdir foo
> >
> > parent(foo).i_mutex
> > foo.i_mutex
> > su_mutex
> >
> >
> > once holding the rmdir su_mutex you can check foo's user-content, since
> > any mkdir will be blocked. All you have to do is then re-validate in
> > mkdir's su_mutex that !IS_DEADDIR(C).
>
> We explicitly do not take any i_mutex locks after taking
> su_mutex. That's an ABBA risk. su_mutex protects the hierarchy of
> config_items. i_mutex protects the vfs view thereof.
> If you look in mkdir, we take su_mutex, get a new item from the
> client subsystem, then drop su_mutex. After that, we go about building
> our filesystem structure, using i_mutex where appropriate. More
> importantly is rmdir(2), where we use i_mutex in
> configfs_detach_group(), but are not holding su_sem. Only when
> configfs_detach_group() has successfully returned and we have torn down
> the filesystem structure do we take su_mutex and tear down the
> config_item structure.
> In fact, we're part of the way there. Check out that
> USET_DROPPING flag we set in detach_prep() while scanning for user
> objects. That flags us racing mkdir(2). When we are done with
> detach_prep(), we know that mkdir(2) calls racing behind us will do
> nothing until we safely lock them out with the locking in
> detach_group(). All mkdir(2) calls will have exited by the time we get
> the mutex, and no new mkdir(2) call can start because we have the mutex.
> Now look in detach_groups(). We drop the groups children before
> marking them DEAD. Louis' plan, I think, is to perhaps mark a group
> DEAD, disconnect it from the vfs, and then operate on its children. In
> this fashion, perhaps we can unlock the trailing lock like a normal VFS
> operation.
> This will require some serious auditing, however, because now
> vfs functions can get into the vfs objects behind us. And more vfs
> changes affect us. Whereas the current locking relies on the vfs's
> parent->child lock ordering only, something that isn't likely to be
> changed.
I've thought about such plan, but I'm not comfortable enough with the VFS to
tell how it could be done precisely, and whether it is safe to remove a whole
tree from the dcache by just unlinking its root. In particular, how could we
deal with racing operations under default groups? Should we setup a link from
any default group to its youngest non-default group ancestor? As Steven
suggested, looking at unmount might be interesting, but not today as far as I
am concerned.
Louis
--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
On Fri, 12 Dec 2008 16:29:11 +0100
Louis Rilling <[email protected]> wrote:
> When attaching default groups (subdirs) of a new group (in mkdir() or
> in configfs_register()), configfs recursively takes inode's mutexes
> along the path from the parent of the new group to the default
> subdirs. This is needed to ensure that the VFS will not race with
> operations on these sub-dirs. This is safe for the following reasons:
>
> - the VFS allows one to lock first an inode and second one of its
> children (The lock subclasses for this pattern are respectively
> I_MUTEX_PARENT and I_MUTEX_CHILD);
> - from this rule any inode path can be recursively locked in
> descending order as long as it stays under a single mountpoint and
> does not follow symlinks.
>
> Unfortunately lockdep does not know (yet?) how to handle such
> recursion.
>
> I've tried to use Peter Zijlstra's lock_set_subclass() helper to
> upgrade i_mutexes from I_MUTEX_CHILD to I_MUTEX_PARENT when we know
> that we might recursively lock some of their descendant, but this
> usage does not seem to fit the purpose of lock_set_subclass() because
> it leads to several i_mutex locked with subclass I_MUTEX_PARENT by
> the same task.
>
> >From inside configfs it is not possible to serialize those recursive
> locking with a top-level one, because mkdir() and rmdir() are already
> called with inodes locked by the VFS. So using some
> mutex_lock_nest_lock() is not an option.
>
> I am proposing two solutions:
> 1) one that wraps recursive mutex_lock()s with
> lockdep_off()/lockdep_on().
> 2) (as suggested earlier by Peter Zijlstra) one that puts the
> i_mutexes recursively locked in different classes based on their
> depth from the top-level config_group created. This
> induces an arbitrary limit (MAX_LOCK_DEPTH - 2 == 46) on the
> nesting of configfs default groups whenever lockdep is activated
> but this limit looks reasonably high. Unfortunately, this alos
> isolates VFS operations on configfs default groups from the others
> and thus lowers the chances to detect locking issues.
>
> This patch implements solution 1).
>
> Solution 2) looks better from lockdep's point of view, but fails with
> configfs_depend_item(). This needs to rework the locking
> scheme of configfs_depend_item() by removing the variable lock recursion
> depth, and I think that it's doable thanks to the configfs_dirent_lock.
> For now, let's stick to solution 1).
>
> Signed-off-by: Louis Rilling <[email protected]>
> ---
> fs/configfs/dir.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 59 insertions(+), 0 deletions(-)
>
> diff --git a/fs/configfs/dir.c b/fs/configfs/dir.c
> index 8e93341..9c23583 100644
> --- a/fs/configfs/dir.c
> +++ b/fs/configfs/dir.c
> @@ -553,12 +553,24 @@ static void detach_groups(struct config_group *group)
>
> child = sd->s_dentry;
>
> + /*
> + * Note: we hide this from lockdep since we have no way
> + * to teach lockdep about recursive
> + * I_MUTEX_PARENT -> I_MUTEX_CHILD patterns along a path
> + * in an inode tree, which are valid as soon as
> + * I_MUTEX_PARENT -> I_MUTEX_CHILD is valid from a
> + * parent inode to one of its children.
> + */
> + lockdep_off();
> mutex_lock(&child->d_inode->i_mutex);
> + lockdep_on();
>
> [etc]
>
Oh dear, what an unpleasant patch.
Peter, can this be saved?
On Wed, Dec 17, 2008 at 01:40:20PM -0800, Andrew Morton wrote:
> On Fri, 12 Dec 2008 16:29:11 +0100
> Louis Rilling <[email protected]> wrote:
>
> > When attaching default groups (subdirs) of a new group (in mkdir() or
> > in configfs_register()), configfs recursively takes inode's mutexes
> > along the path from the parent of the new group to the default
> > subdirs. This is needed to ensure that the VFS will not race with
> > operations on these sub-dirs. This is safe for the following reasons:
> >
> > - the VFS allows one to lock first an inode and second one of its
> > children (The lock subclasses for this pattern are respectively
> > I_MUTEX_PARENT and I_MUTEX_CHILD);
> > - from this rule any inode path can be recursively locked in
> > descending order as long as it stays under a single mountpoint and
> > does not follow symlinks.
> >
> > Unfortunately lockdep does not know (yet?) how to handle such
> > recursion.
> >
> > I've tried to use Peter Zijlstra's lock_set_subclass() helper to
> > upgrade i_mutexes from I_MUTEX_CHILD to I_MUTEX_PARENT when we know
> > that we might recursively lock some of their descendant, but this
> > usage does not seem to fit the purpose of lock_set_subclass() because
> > it leads to several i_mutex locked with subclass I_MUTEX_PARENT by
> > the same task.
> >
> > >From inside configfs it is not possible to serialize those recursive
> > locking with a top-level one, because mkdir() and rmdir() are already
> > called with inodes locked by the VFS. So using some
> > mutex_lock_nest_lock() is not an option.
> >
> > I am proposing two solutions:
> > 1) one that wraps recursive mutex_lock()s with
> > lockdep_off()/lockdep_on().
> > 2) (as suggested earlier by Peter Zijlstra) one that puts the
> > i_mutexes recursively locked in different classes based on their
> > depth from the top-level config_group created. This
> > induces an arbitrary limit (MAX_LOCK_DEPTH - 2 == 46) on the
> > nesting of configfs default groups whenever lockdep is activated
> > but this limit looks reasonably high. Unfortunately, this alos
> > isolates VFS operations on configfs default groups from the others
> > and thus lowers the chances to detect locking issues.
> >
> > This patch implements solution 1).
> >
> > Solution 2) looks better from lockdep's point of view, but fails with
> > configfs_depend_item(). This needs to rework the locking
> > scheme of configfs_depend_item() by removing the variable lock recursion
> > depth, and I think that it's doable thanks to the configfs_dirent_lock.
> > For now, let's stick to solution 1).
> >
> > Signed-off-by: Louis Rilling <[email protected]>
> > ---
> > fs/configfs/dir.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 files changed, 59 insertions(+), 0 deletions(-)
> >
> > diff --git a/fs/configfs/dir.c b/fs/configfs/dir.c
> > index 8e93341..9c23583 100644
> > --- a/fs/configfs/dir.c
> > +++ b/fs/configfs/dir.c
> > @@ -553,12 +553,24 @@ static void detach_groups(struct config_group *group)
> >
> > child = sd->s_dentry;
> >
> > + /*
> > + * Note: we hide this from lockdep since we have no way
> > + * to teach lockdep about recursive
> > + * I_MUTEX_PARENT -> I_MUTEX_CHILD patterns along a path
> > + * in an inode tree, which are valid as soon as
> > + * I_MUTEX_PARENT -> I_MUTEX_CHILD is valid from a
> > + * parent inode to one of its children.
> > + */
> > + lockdep_off();
> > mutex_lock(&child->d_inode->i_mutex);
> > + lockdep_on();
> >
> > [etc]
> >
>
> Oh dear, what an unpleasant patch.
>
> Peter, can this be saved?
I'd love to see it work within lockdep, but it seems rather
hard, so that's why I recommended Louis cook up this version. I see you
picked it up in -mm. Do you want me to push it through ocfs2.git?
Joel
--
Life's Little Instruction Book #157
"Take time to smell the roses."
Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Wed, 17 Dec 2008 14:03:10 -0800
Joel Becker <[email protected]> wrote:
> > >
> > > [etc]
> > >
> >
> > Oh dear, what an unpleasant patch.
> >
> > Peter, can this be saved?
>
> I'd love to see it work within lockdep, but it seems rather
> hard, so that's why I recommended Louis cook up this version. I see you
> picked it up in -mm. Do you want me to push it through ocfs2.git?
Please do. If Peter refuses to save us.
On Wed, 2008-12-17 at 13:40 -0800, Andrew Morton wrote:
> On Fri, 12 Dec 2008 16:29:11 +0100
> Louis Rilling <[email protected]> wrote:
>
> > When attaching default groups (subdirs) of a new group (in mkdir() or
> > in configfs_register()), configfs recursively takes inode's mutexes
> > along the path from the parent of the new group to the default
> > subdirs. This is needed to ensure that the VFS will not race with
> > operations on these sub-dirs. This is safe for the following reasons:
> >
> > - the VFS allows one to lock first an inode and second one of its
> > children (The lock subclasses for this pattern are respectively
> > I_MUTEX_PARENT and I_MUTEX_CHILD);
> > - from this rule any inode path can be recursively locked in
> > descending order as long as it stays under a single mountpoint and
> > does not follow symlinks.
> >
> > Unfortunately lockdep does not know (yet?) how to handle such
> > recursion.
> >
> > I've tried to use Peter Zijlstra's lock_set_subclass() helper to
> > upgrade i_mutexes from I_MUTEX_CHILD to I_MUTEX_PARENT when we know
> > that we might recursively lock some of their descendant, but this
> > usage does not seem to fit the purpose of lock_set_subclass() because
> > it leads to several i_mutex locked with subclass I_MUTEX_PARENT by
> > the same task.
> >
> > >From inside configfs it is not possible to serialize those recursive
> > locking with a top-level one, because mkdir() and rmdir() are already
> > called with inodes locked by the VFS. So using some
> > mutex_lock_nest_lock() is not an option.
> >
> > I am proposing two solutions:
> > 1) one that wraps recursive mutex_lock()s with
> > lockdep_off()/lockdep_on().
> > 2) (as suggested earlier by Peter Zijlstra) one that puts the
> > i_mutexes recursively locked in different classes based on their
> > depth from the top-level config_group created. This
> > induces an arbitrary limit (MAX_LOCK_DEPTH - 2 == 46) on the
> > nesting of configfs default groups whenever lockdep is activated
> > but this limit looks reasonably high. Unfortunately, this alos
> > isolates VFS operations on configfs default groups from the others
> > and thus lowers the chances to detect locking issues.
> >
> > This patch implements solution 1).
> >
> > Solution 2) looks better from lockdep's point of view, but fails with
> > configfs_depend_item(). This needs to rework the locking
> > scheme of configfs_depend_item() by removing the variable lock recursion
> > depth, and I think that it's doable thanks to the configfs_dirent_lock.
> > For now, let's stick to solution 1).
> >
> > Signed-off-by: Louis Rilling <[email protected]>
> > ---
> > fs/configfs/dir.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 files changed, 59 insertions(+), 0 deletions(-)
> >
> > diff --git a/fs/configfs/dir.c b/fs/configfs/dir.c
> > index 8e93341..9c23583 100644
> > --- a/fs/configfs/dir.c
> > +++ b/fs/configfs/dir.c
> > @@ -553,12 +553,24 @@ static void detach_groups(struct config_group *group)
> >
> > child = sd->s_dentry;
> >
> > + /*
> > + * Note: we hide this from lockdep since we have no way
> > + * to teach lockdep about recursive
> > + * I_MUTEX_PARENT -> I_MUTEX_CHILD patterns along a path
> > + * in an inode tree, which are valid as soon as
> > + * I_MUTEX_PARENT -> I_MUTEX_CHILD is valid from a
> > + * parent inode to one of its children.
> > + */
> > + lockdep_off();
> > mutex_lock(&child->d_inode->i_mutex);
> > + lockdep_on();
> >
> > [etc]
> >
>
> Oh dear, what an unpleasant patch.
>
> Peter, can this be saved?
I'm still not sure why configfs is the only virtual filesystem that
suffers this - something in its design is weird (and not so wonderfull).
Also, while I usually applaud fine grained locking, is configfs really
in need of that?, I mean, its not like its meant to create and remove
directories all day every day at breakneck speed, right? AFAIK you just
stomp some data in to configure some kernel stuff and then let it sit
for the duration of whatever that kernel thing does while it does it.
See I'm not even clear on what configfs is.. and why its better than
sysfs for example.. - /me goes read configfs.txt
Right, so basically we avoid syscalls by making vfs ops do stuff..
lovely - still not seeing it though - regular filesystems seems to cope
just fine and they get to create arbitrary tree structures too.
Yes lockdep has 2 major weaknesses - 1) it doesn't support arbitrary
lock depths (and I've tried, twice, to fix that, but its a _hard_
problem) and 2) it can't deal with arbitrary recursion.
The thing is, in practise it turns out that reworking code to not run
into these issues often makes the code better - if only for the fact
that taking locks is expensive and doing less is better, and holding
locks stifles concurrency, so holding less it better (yes, I said
_often_, there likely are counter cases but I don't believe configfs is
one of them).
Anyway - I'm against just turning lockdep off, that will make folks
complacent and let the stuff rot to bits inside - and I for one will
refuse to run anything using it (but since that only seems to be
dlm/ocfs and I'm of the believe that centralized cluster stuff sucks
rocks anyway that won't be a problem).
On Thu, Dec 18, 2008 at 08:26:48AM +0100, Peter Zijlstra wrote:
> On Wed, 2008-12-17 at 13:40 -0800, Andrew Morton wrote:
> > On Fri, 12 Dec 2008 16:29:11 +0100
> > Louis Rilling <[email protected]> wrote:
> > > >From inside configfs it is not possible to serialize those recursive
> > > locking with a top-level one, because mkdir() and rmdir() are already
> > > called with inodes locked by the VFS. So using some
> > > mutex_lock_nest_lock() is not an option.
<snip>
> >
> > Oh dear, what an unpleasant patch.
> >
> > Peter, can this be saved?
>
> I'm still not sure why configfs is the only virtual filesystem that
> suffers this - something in its design is weird (and not so wonderfull).
>
> Also, while I usually applaud fine grained locking, is configfs really
> in need of that?, I mean, its not like its meant to create and remove
> directories all day every day at breakneck speed, right? AFAIK you just
> stomp some data in to configure some kernel stuff and then let it sit
> for the duration of whatever that kernel thing does while it does it.
It's not about breakneck speed, it's about living in the VFS.
But I think you get that later.
> See I'm not even clear on what configfs is.. and why its better than
> sysfs for example.. - /me goes read configfs.txt
>
> Right, so basically we avoid syscalls by making vfs ops do stuff..
> lovely - still not seeing it though - regular filesystems seems to cope
> just fine and they get to create arbitrary tree structures too.
It's about the default_groups and how they build up and tear
down small bits of tree.
A simple creation of a config_item, a mkdir(2), is a normal VFS
lock set and doesn't make lockdep unhappy. But if the new config_item
has a default_group or two, they need locking too. Not so much on
mkdir(2), but on rmdir(2).
> Yes lockdep has 2 major weaknesses - 1) it doesn't support arbitrary
> lock depths (and I've tried, twice, to fix that, but its a _hard_
> problem) and 2) it can't deal with arbitrary recursion.
I know it's hard, or I'd have sent you patches :-) In fact,
Louis tried to use the subclass bits to make this work to a depth of N
(where N was probably deep enough in practice). However, this creates
subclasses that don't get seen by the regular VFS locking - and the big
deal here is making sure configfs's use of i_mutex meshes with the VFS.
That is, his code made the warnings go away, but removed much of
lockdep's ability to see when we got the locking wrong.
> The thing is, in practise it turns out that reworking code to not run
> into these issues often makes the code better - if only for the fact
> that taking locks is expensive and doing less is better, and holding
> locks stifles concurrency, so holding less it better (yes, I said
> _often_, there likely are counter cases but I don't believe configfs is
> one of them).
This isn't about concurrency or speed. This is about safety
while configfs is attaching or (especially) detaching config_items from
the filesystem view it presents. When the VFS travels down a path, it
unlocks the trailing directory. We can't do that when tearing down
default groups, because we need to lock that small hunk and tear it out
atomically.
> Anyway - I'm against just turning lockdep off, that will make folks
> complacent and let the stuff rot to bits inside - and I for one will
> refuse to run anything using it (but since that only seems to be
> dlm/ocfs and I'm of the believe that centralized cluster stuff sucks
> rocks anyway that won't be a problem).
Oh, be nice :-)
You are absolutely right that turning off lockdep leaves the
possibility of complacency and bitrot. That's precisely why I didn't
like Louis' subclass solution - again, bitrot might go unnoticed.
Now, I know that I will be paying attention to the locking and
going over it with a fine-toothed comb. But I'd much rather have an
actual working lockdep analysis. Whether that means we find a way for
lockdep to describe what's happening here, or we find another way to
keep folks out of the tree we're removing, I don't care.
Joel
--
Life's Little Instruction Book #109
"Know how to drive a stick shift."
Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On 18/12/08 1:27 -0800, Joel Becker wrote:
>
> I know it's hard, or I'd have sent you patches :-) In fact,
> Louis tried to use the subclass bits to make this work to a depth of N
> (where N was probably deep enough in practice). However, this creates
> subclasses that don't get seen by the regular VFS locking - and the big
> deal here is making sure configfs's use of i_mutex meshes with the VFS.
> That is, his code made the warnings go away, but removed much of
> lockdep's ability to see when we got the locking wrong.
>
> > The thing is, in practise it turns out that reworking code to not run
> > into these issues often makes the code better - if only for the fact
> > that taking locks is expensive and doing less is better, and holding
> > locks stifles concurrency, so holding less it better (yes, I said
> > _often_, there likely are counter cases but I don't believe configfs is
> > one of them).
>
> This isn't about concurrency or speed. This is about safety
> while configfs is attaching or (especially) detaching config_items from
> the filesystem view it presents. When the VFS travels down a path, it
> unlocks the trailing directory. We can't do that when tearing down
> default groups, because we need to lock that small hunk and tear it out
> atomically.
>
> > Anyway - I'm against just turning lockdep off, that will make folks
> > complacent and let the stuff rot to bits inside - and I for one will
> > refuse to run anything using it (but since that only seems to be
> > dlm/ocfs and I'm of the believe that centralized cluster stuff sucks
> > rocks anyway that won't be a problem).
>
> Oh, be nice :-)
> You are absolutely right that turning off lockdep leaves the
> possibility of complacency and bitrot. That's precisely why I didn't
> like Louis' subclass solution - again, bitrot might go unnoticed.
> Now, I know that I will be paying attention to the locking and
> going over it with a fine-toothed comb. But I'd much rather have an
> actual working lockdep analysis. Whether that means we find a way for
> lockdep to describe what's happening here, or we find another way to
> keep folks out of the tree we're removing, I don't care.
Perhaps I didn't explain myself well. Quoting my original post:
<quote>
I am proposing two solutions:
1) one that wraps recursive mutex_lock()s with
lockdep_off()/lockdep_on().
2) (as suggested earlier by Peter Zijlstra) one that puts the
i_mutexes recursively locked in different classes based on their
depth from the top-level config_group created. This
induces an arbitrary limit (MAX_LOCK_DEPTH - 2 == 46) on the
nesting of configfs default groups whenever lockdep is activated
but this limit looks reasonably high. Unfortunately, this alos
isolates VFS operations on configfs default groups from the others
and thus lowers the chances to detect locking issues.
This patch implements solution 1).
Solution 2) looks better from lockdep's point of view, but fails with
configfs_depend_item(). This needs to rework the locking
scheme of configfs_depend_item() by removing the variable lock recursion
depth, and I think that it's doable thanks to the configfs_dirent_lock.
For now, let's stick to solution 1).
</quote>
Solution 2) does not play with i_mutex sub-classes as I proposed earlier, but
instead put default_groups' i_mutex in separate classes (actually one class per
default group depth). This is not worse than putting each run queue lock in a
separate class, as it used to be.
For instance, if a created group A has default groups A/B, A/D, and A/B/C, A's
i_mutex class will be the regular i_mutex class used everywhere else in the VFS,
A/B and A/D will have default_group_class[0], and A/B/C will have
default_group_class[1].
Of course those default_group classes will not benefit from locking schemes seen
by lockdep outside configfs, but they still will interact nicely with the VFS.
Moreover, a default group depth limit of 46 (MAX_LOCK_DEPTH - 2) looks rather
reasonable, doesn't it?
To me the real drawback of this solution is that it needs to rework locking in
configfs_depend_item(). Peter says it is preferable, I know how it could be
done, but as any code rework this may bring new bugs, and I realize that I'm
spending time to explain this while 1) I don't have much time to just explain
what could be done, 2) I'd prefer having time to code what I am explaining.
Let's see if I can show you something today.
Louis
--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
Hi,
I have to say that when I brought this subject up, I didn't realise
quite how tricky this was going to be, however...
On Thu, 2008-12-18 at 01:27 -0800, Joel Becker wrote:
> On Thu, Dec 18, 2008 at 08:26:48AM +0100, Peter Zijlstra wrote:
> > On Wed, 2008-12-17 at 13:40 -0800, Andrew Morton wrote:
> > > On Fri, 12 Dec 2008 16:29:11 +0100
> > > Louis Rilling <[email protected]> wrote:
> > > > >From inside configfs it is not possible to serialize those recursive
> > > > locking with a top-level one, because mkdir() and rmdir() are already
> > > > called with inodes locked by the VFS. So using some
> > > > mutex_lock_nest_lock() is not an option.
>
> <snip>
>
<more snipping>
> > Right, so basically we avoid syscalls by making vfs ops do stuff..
> > lovely - still not seeing it though - regular filesystems seems to cope
> > just fine and they get to create arbitrary tree structures too.
>
> It's about the default_groups and how they build up and tear
> down small bits of tree.
> A simple creation of a config_item, a mkdir(2), is a normal VFS
> lock set and doesn't make lockdep unhappy. But if the new config_item
> has a default_group or two, they need locking too. Not so much on
> mkdir(2), but on rmdir(2).
>
So if I've understood this correctly, the dentries created upon mkdir
live until such time as they are removed at some later date, presumably
with rmdir?
When creating the tree, would it be possible to build it starting from
the bottom and working towards the point of attachment and then to not
actually attach it until the last moment? That way it would not be
visible from userland until it was linked into the existing dir and that
solves the locking issue for mkdir I think.
As you say, rmdir seems the harder problem, but again, is it possible to
separate the unlink operation from the destruction of the tree by
keeping the tree, after its been unlinked, until the last userland
reference has gone away? At least I assume that is why the locking is
there. I may be a bit off the mark, but it seems like this is quite
similar to how the VFS does umount, so maybe there are some hints in
that code which may help us solve this issue?
Steve.
On 18/12/08 11:26 +0000, Steven Whitehouse wrote:
> On Thu, 2008-12-18 at 01:27 -0800, Joel Becker wrote:
> >
> > It's about the default_groups and how they build up and tear
> > down small bits of tree.
> > A simple creation of a config_item, a mkdir(2), is a normal VFS
> > lock set and doesn't make lockdep unhappy. But if the new config_item
> > has a default_group or two, they need locking too. Not so much on
> > mkdir(2), but on rmdir(2).
> >
> So if I've understood this correctly, the dentries created upon mkdir
> live until such time as they are removed at some later date, presumably
> with rmdir?
>
> When creating the tree, would it be possible to build it starting from
> the bottom and working towards the point of attachment and then to not
> actually attach it until the last moment? That way it would not be
> visible from userland until it was linked into the existing dir and that
> solves the locking issue for mkdir I think.
>
> As you say, rmdir seems the harder problem, but again, is it possible to
> separate the unlink operation from the destruction of the tree by
> keeping the tree, after its been unlinked, until the last userland
> reference has gone away? At least I assume that is why the locking is
> there. I may be a bit off the mark, but it seems like this is quite
> similar to how the VFS does umount, so maybe there are some hints in
> that code which may help us solve this issue?
I second this kind of rework and even think that it is doable. This would avoid
exposing things to userspace before they are created for sure. I even had to
include dirty hacks in configfs to avoid exposing too much of such temporary
things, and I'm definitely not proud of this.
Unfortunately I don't have time to rework configfs this way at all.
Louis
--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
On Thu, 2008-12-18 at 01:27 -0800, Joel Becker wrote:
> It's about the default_groups and how they build up and tear
> down small bits of tree.
> A simple creation of a config_item, a mkdir(2), is a normal VFS
> lock set and doesn't make lockdep unhappy. But if the new config_item
> has a default_group or two, they need locking too. Not so much on
> mkdir(2), but on rmdir(2).
Hohumm,..
So the problem is that mkdir() doesn't just create a single entity but a
whole tree:
configfs:/my_subsystem/$ mkdir foo
might result in:
foo/
foo/A/
foo/B/
foo/B/C/
which on rmdir foo you'd have to tear down, but only if its that exact
tree and not when say A has any user created directories.
VFS mkdir A/blah only synchronizes on A.i_mutex and checks S_DEAD to
avoid races with rmdir A - which would lock first parent(A).i_mutex and
then A.i_mutex before detaching A and marking it S_DEAD.
So what you're now doing is locking the full foo/ subtree in order to
check there is no user content and block mkdir/creat from generating any
- which is where the trouble comes from, right?
Like said on IRC, the whole populated thing made me think of
mount/umount (steven whitehouse seems to have had a similar notion).
You basically want to synchronize any user mkdir/creat against foo
instead of just the new parent so that rmdir foo can tell if there is
any such content without having to lock the whole subtree.
That would mean them locking both foo and the new parent (when they're
not one and the same). Trouble seems to be that vfs_mkdir() and such
already have their new parent locked, which means you cannot go about
locking foo anymore. But that would have resulted in a 3 deep
lock-chain.
(and I don't see any filesystem hooks in user_path_parent() -- which is
probably a good thing)
Bugger..
On Thu, 2008-12-18 at 12:56 +0100, Peter Zijlstra wrote:
> On Thu, 2008-12-18 at 01:27 -0800, Joel Becker wrote:
>
> > It's about the default_groups and how they build up and tear
> > down small bits of tree.
> > A simple creation of a config_item, a mkdir(2), is a normal VFS
> > lock set and doesn't make lockdep unhappy. But if the new config_item
> > has a default_group or two, they need locking too. Not so much on
> > mkdir(2), but on rmdir(2).
>
> Hohumm,..
>
> So the problem is that mkdir() doesn't just create a single entity but a
> whole tree:
>
> configfs:/my_subsystem/$ mkdir foo
>
> might result in:
>
> foo/
> foo/A/
> foo/B/
> foo/B/C/
>
> which on rmdir foo you'd have to tear down, but only if its that exact
> tree and not when say A has any user created directories.
>
> VFS mkdir A/blah only synchronizes on A.i_mutex and checks S_DEAD to
> avoid races with rmdir A - which would lock first parent(A).i_mutex and
> then A.i_mutex before detaching A and marking it S_DEAD.
>
> So what you're now doing is locking the full foo/ subtree in order to
> check there is no user content and block mkdir/creat from generating any
> - which is where the trouble comes from, right?
>
> Like said on IRC, the whole populated thing made me think of
> mount/umount (steven whitehouse seems to have had a similar notion).
>
> You basically want to synchronize any user mkdir/creat against foo
> instead of just the new parent so that rmdir foo can tell if there is
> any such content without having to lock the whole subtree.
>
> That would mean them locking both foo and the new parent (when they're
> not one and the same). Trouble seems to be that vfs_mkdir() and such
> already have their new parent locked, which means you cannot go about
> locking foo anymore. But that would have resulted in a 3 deep
> lock-chain.
>
> (and I don't see any filesystem hooks in user_path_parent() -- which is
> probably a good thing)
>
>
> Bugger..
In fact, both (configfs) mkdir and rmdir seem to synchronize on
su_mutex..
mkdir B/C/bar
C.i_mutex
su_mutex
vs
rmdir foo
parent(foo).i_mutex
foo.i_mutex
su_mutex
once holding the rmdir su_mutex you can check foo's user-content, since
any mkdir will be blocked. All you have to do is then re-validate in
mkdir's su_mutex that !IS_DEADDIR(C).
Does that sound plausible, or am I missing something obvious.. ?
Hi,
Here is a patchset implementing a more lockdep-friendly solution to make lockdep
happy with configfs.
The first patch could probably be cleaner, but first let's see if the
approach is ok.
I don't have a good setup to test the second patch beyond compilation, but I
guess that Joel has one :)
So, does it look better?
Louis
Louis Rilling (2):
configfs: Silence lockdep on mkdir() and rmdir()
configfs: Rework configfs_depend_item() locking and make lockdep happy
fs/configfs/configfs_internal.h | 3 +
fs/configfs/dir.c | 131 +++++++++++++++++++++++----------------
fs/configfs/inode.c | 28 ++++++++
3 files changed, 108 insertions(+), 54 deletions(-)
--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
When attaching default groups (subdirs) of a new group (in mkdir() or
in configfs_register()), configfs recursively takes inode's mutexes
along the path from the parent of the new group to the default
subdirs. This is needed to ensure that the VFS will not race with
operations on these sub-dirs. This is safe for the following reasons:
- the VFS allows one to lock first an inode and second one of its
children (The lock subclasses for this pattern are respectively
I_MUTEX_PARENT and I_MUTEX_CHILD);
- from this rule any inode path can be recursively locked in
descending order as long as it stays under a single mountpoint and
does not follow symlinks.
Unfortunately lockdep does not know (yet?) how to handle such
recursion.
I've tried to use Peter Zijlstra's lock_set_subclass() helper to
upgrade i_mutexes from I_MUTEX_CHILD to I_MUTEX_PARENT when we know
that we might recursively lock some of their descendant, but this
usage does not seem to fit the purpose of lock_set_subclass() because
it leads to several i_mutex locked with subclass I_MUTEX_PARENT by
the same task.
>From inside configfs it is not possible to serialize those recursive
locking with a top-level one, because mkdir() and rmdir() are already
called with inodes locked by the VFS. So using some
mutex_lock_nest_lock() is not an option.
I am proposing two solutions:
1) one that wraps recursive mutex_lock()s with
lockdep_off()/lockdep_on().
2) (as suggested earlier by Peter Zijlstra) one that puts the
i_mutexes recursively locked in different classes based on their
depth from the top-level config_group created. This
induces an arbitrary limit (MAX_LOCK_DEPTH - 2 == 46) on the
nesting of configfs default groups whenever lockdep is activated
but this limit looks reasonably high. Unfortunately, this also
isolates VFS operations on configfs default groups from the others
and thus lowers the chances to detect locking issues.
Nobody likes solution 1), what I can understand.
This patch implements solution 2). However lockdep is still not happy with
configfs_depend_item(). Next patch reworks the locking of
configfs_depend_item() and finally makes lockdep happy.
Signed-off-by: Louis Rilling <[email protected]>
---
fs/configfs/configfs_internal.h | 3 +++
fs/configfs/dir.c | 39 +++++++++++++++++++++++++++++++++++++++
fs/configfs/inode.c | 28 ++++++++++++++++++++++++++++
3 files changed, 70 insertions(+), 0 deletions(-)
diff --git a/fs/configfs/configfs_internal.h b/fs/configfs/configfs_internal.h
index 762d287..da6061a 100644
--- a/fs/configfs/configfs_internal.h
+++ b/fs/configfs/configfs_internal.h
@@ -39,6 +39,9 @@ struct configfs_dirent {
umode_t s_mode;
struct dentry * s_dentry;
struct iattr * s_iattr;
+#ifdef CONFIG_LOCKDEP
+ int s_depth;
+#endif
};
#define CONFIGFS_ROOT 0x0001
diff --git a/fs/configfs/dir.c b/fs/configfs/dir.c
index 8e93341..f21be74 100644
--- a/fs/configfs/dir.c
+++ b/fs/configfs/dir.c
@@ -94,6 +94,9 @@ static struct configfs_dirent *configfs_new_dirent(struct configfs_dirent * pare
INIT_LIST_HEAD(&sd->s_links);
INIT_LIST_HEAD(&sd->s_children);
sd->s_element = element;
+#ifdef CONFIG_LOCKDEP
+ sd->s_depth = -1;
+#endif
spin_lock(&configfs_dirent_lock);
if (parent_sd->s_type & CONFIGFS_USET_DROPPING) {
spin_unlock(&configfs_dirent_lock);
@@ -176,6 +179,17 @@ static int init_symlink(struct inode * inode)
return 0;
}
+#ifdef CONFIG_LOCKDEP
+static void configfs_set_dir_dirent_depth(struct configfs_dirent *parent_sd,
+ struct configfs_dirent *sd)
+{
+ int parent_depth = parent_sd->s_depth;
+
+ if (parent_depth >= 0)
+ sd->s_depth = parent_depth + 1;
+}
+#endif
+
static int create_dir(struct config_item * k, struct dentry * p,
struct dentry * d)
{
@@ -187,6 +201,9 @@ static int create_dir(struct config_item * k, struct dentry * p,
error = configfs_make_dirent(p->d_fsdata, d, k, mode,
CONFIGFS_DIR | CONFIGFS_USET_CREATING);
if (!error) {
+#ifdef CONFIG_LOCKDEP
+ configfs_set_dir_dirent_depth(p->d_fsdata, d->d_fsdata);
+#endif
error = configfs_create(d, mode, init_dir);
if (!error) {
inc_nlink(p->d_inode);
@@ -789,11 +806,33 @@ static int configfs_attach_group(struct config_item *parent_item,
* error, as rmdir() would.
*/
mutex_lock_nested(&dentry->d_inode->i_mutex, I_MUTEX_CHILD);
+#ifdef CONFIG_LOCKDEP
+ /*
+ * item's i_mutex class is already setup, so s_depth is now only
+ * used to set new sub-directories s_depth, which is always done
+ * with item's i_mutex locked.
+ */
+ /*
+ * sd->s_depth == -1 iff we are a non default group.
+ * else (we are a default group) sd->s_depth > 0 (see
+ * create_dir()).
+ */
+ if (sd->s_depth == -1)
+ /*
+ * We are a non default group and we are going to create
+ * default groups.
+ */
+ sd->s_depth = 0;
+#endif
ret = populate_groups(to_config_group(item));
if (ret) {
configfs_detach_item(item);
dentry->d_inode->i_flags |= S_DEAD;
}
+#ifdef CONFIG_LOCKDEP
+ /* We will not create default groups anymore. */
+ sd->s_depth = -1;
+#endif
mutex_unlock(&dentry->d_inode->i_mutex);
if (ret)
d_delete(dentry);
diff --git a/fs/configfs/inode.c b/fs/configfs/inode.c
index 4803ccc..5e66b40 100644
--- a/fs/configfs/inode.c
+++ b/fs/configfs/inode.c
@@ -33,10 +33,15 @@
#include <linux/backing-dev.h>
#include <linux/capability.h>
#include <linux/sched.h>
+#include <linux/lockdep.h>
#include <linux/configfs.h>
#include "configfs_internal.h"
+#ifdef CONFIG_LOCKDEP
+static struct lock_class_key default_group_class[MAX_LOCK_DEPTH];
+#endif
+
extern struct super_block * configfs_sb;
static const struct address_space_operations configfs_aops = {
@@ -153,6 +158,26 @@ struct inode * configfs_new_inode(mode_t mode, struct configfs_dirent * sd)
return inode;
}
+#ifdef CONFIG_LOCKDEP
+static void configfs_set_inode_lock_class(struct configfs_dirent *sd,
+ struct inode *inode)
+{
+ if (sd->s_depth > 0) {
+ if (sd->s_depth <= ARRAY_SIZE(default_group_class)) {
+ lockdep_set_class(&inode->i_mutex,
+ &default_group_class[sd->s_depth - 1]);
+ } else {
+ /*
+ * In practice the maximum level of locking depth is
+ * already reached. Just inform about possible reasons.
+ */
+ printk("configfs: Too many levels of inodes for the locking correctness validator.\n");
+ printk("Spurious warnings may appear.\n");
+ }
+ }
+}
+#endif
+
int configfs_create(struct dentry * dentry, int mode, int (*init)(struct inode *))
{
int error = 0;
@@ -165,6 +190,9 @@ int configfs_create(struct dentry * dentry, int mode, int (*init)(struct inode *
struct inode *p_inode = dentry->d_parent->d_inode;
p_inode->i_mtime = p_inode->i_ctime = CURRENT_TIME;
}
+#ifdef CONFIG_LOCKDEP
+ configfs_set_inode_lock_class(sd, inode);
+#endif
goto Proceed;
}
else
--
1.5.6.5
configfs_depend_item() recursively locks all inodes mutex from configfs root to
the target item, which makes lockdep unhappy. The purpose of this recursive
locking is to ensure that the item tree can be safely parsed and that the target
item, if found, is not about to leave.
This patch reworks configfs_depend_item() locking using configfs_dirent_lock.
Since configfs_dirent_lock protects all changes to the configfs_dirent tree, and
protects tagging of items to be removed, this lock can be used instead of the
inodes mutex lock chain.
This needs that the check for dependents be done atomically with
CONFIGFS_USET_DROPPING tagging.
Now lockdep looks happy with configfs.
Signed-off-by: Louis Rilling <[email protected]>
---
fs/configfs/dir.c | 92 ++++++++++++++++++++++-------------------------------
1 files changed, 38 insertions(+), 54 deletions(-)
diff --git a/fs/configfs/dir.c b/fs/configfs/dir.c
index f21be74..4dea906 100644
--- a/fs/configfs/dir.c
+++ b/fs/configfs/dir.c
@@ -955,11 +955,11 @@ static int configfs_dump(struct configfs_dirent *sd, int level)
* Note, btw, that this can be called at *any* time, even when a configfs
* subsystem isn't registered, or when configfs is loading or unloading.
* Just like configfs_register_subsystem(). So we take the same
- * precautions. We pin the filesystem. We lock each i_mutex _in_order_
- * on our way down the tree. If we can find the target item in the
+ * precautions. We pin the filesystem. We lock configfs_dirent_lock.
+ * If we can find the target item in the
* configfs tree, it must be part of the subsystem tree as well, so we
- * do not need the subsystem semaphore. Holding the i_mutex chain locks
- * out mkdir() and rmdir(), who might be racing us.
+ * do not need the subsystem semaphore. Holding configfs_dirent_lock helps
+ * locking out mkdir() and rmdir(), who might be racing us.
*/
/*
@@ -972,17 +972,17 @@ static int configfs_dump(struct configfs_dirent *sd, int level)
* do that so we can unlock it if we find nothing.
*
* Here we do a depth-first search of the dentry hierarchy looking for
- * our object. We take i_mutex on each step of the way down. IT IS
- * ESSENTIAL THAT i_mutex LOCKING IS ORDERED. If we come back up a branch,
- * we'll drop the i_mutex.
+ * our object.
+ * We deliberately ignore items tagged as dropping since they are virtually
+ * dead, as well as items in the middle of attachment since they virtually
+ * do not exist yet. This completes the locking out of racing mkdir() and
+ * rmdir().
*
- * If the target is not found, -ENOENT is bubbled up and we have released
- * all locks. If the target was found, the locks will be cleared by
- * configfs_depend_rollback().
+ * If the target is not found, -ENOENT is bubbled up.
*
* This adds a requirement that all config_items be unique!
*
- * This is recursive because the locking traversal is tricky. There isn't
+ * This is recursive. There isn't
* much on the stack, though, so folks that need this function - be careful
* about your stack! Patches will be accepted to make it iterative.
*/
@@ -994,13 +994,13 @@ static int configfs_depend_prep(struct dentry *origin,
BUG_ON(!origin || !sd);
- /* Lock this guy on the way down */
- mutex_lock(&sd->s_dentry->d_inode->i_mutex);
if (sd->s_element == target) /* Boo-yah */
goto out;
list_for_each_entry(child_sd, &sd->s_children, s_sibling) {
- if (child_sd->s_type & CONFIGFS_DIR) {
+ if (child_sd->s_type & CONFIGFS_DIR &&
+ !child_sd->s_type & CONFIGFS_USET_DROPPING &&
+ !child_sd->s_type & CONFIGFS_USET_CREATING) {
ret = configfs_depend_prep(child_sd->s_dentry,
target);
if (!ret)
@@ -1009,33 +1009,12 @@ static int configfs_depend_prep(struct dentry *origin,
}
/* We looped all our children and didn't find target */
- mutex_unlock(&sd->s_dentry->d_inode->i_mutex);
ret = -ENOENT;
out:
return ret;
}
-/*
- * This is ONLY called if configfs_depend_prep() did its job. So we can
- * trust the entire path from item back up to origin.
- *
- * We walk backwards from item, unlocking each i_mutex. We finish by
- * unlocking origin.
- */
-static void configfs_depend_rollback(struct dentry *origin,
- struct config_item *item)
-{
- struct dentry *dentry = item->ci_dentry;
-
- while (dentry != origin) {
- mutex_unlock(&dentry->d_inode->i_mutex);
- dentry = dentry->d_parent;
- }
-
- mutex_unlock(&origin->d_inode->i_mutex);
-}
-
int configfs_depend_item(struct configfs_subsystem *subsys,
struct config_item *target)
{
@@ -1076,17 +1055,21 @@ int configfs_depend_item(struct configfs_subsystem *subsys,
/* Ok, now we can trust subsys/s_item */
- /* Scan the tree, locking i_mutex recursively, return 0 if found */
+ spin_lock(&configfs_dirent_lock);
+ /* Scan the tree, protected by configfs_dirent_lock, return 0 if found */
ret = configfs_depend_prep(subsys_sd->s_dentry, target);
if (ret)
- goto out_unlock_fs;
+ goto out_unlock_dirent_lock;
- /* We hold all i_mutexes from the subsystem down to the target */
+ /*
+ * We are sure that the item is not about to be removed by rmdir(), and
+ * not in the middle of attachment by mkdir().
+ */
p = target->ci_dentry->d_fsdata;
p->s_dependent_count += 1;
- configfs_depend_rollback(subsys_sd->s_dentry, target);
-
+out_unlock_dirent_lock:
+ spin_unlock(&configfs_dirent_lock);
out_unlock_fs:
mutex_unlock(&configfs_sb->s_root->d_inode->i_mutex);
@@ -1111,10 +1094,10 @@ void configfs_undepend_item(struct configfs_subsystem *subsys,
struct configfs_dirent *sd;
/*
- * Since we can trust everything is pinned, we just need i_mutex
- * on the item.
+ * Since we can trust everything is pinned, we just need
+ * configfs_dirent_lock.
*/
- mutex_lock(&target->ci_dentry->d_inode->i_mutex);
+ spin_lock(&configfs_dirent_lock);
sd = target->ci_dentry->d_fsdata;
BUG_ON(sd->s_dependent_count < 1);
@@ -1125,7 +1108,7 @@ void configfs_undepend_item(struct configfs_subsystem *subsys,
* After this unlock, we cannot trust the item to stay alive!
* DO NOT REFERENCE item after this unlock.
*/
- mutex_unlock(&target->ci_dentry->d_inode->i_mutex);
+ spin_unlock(&configfs_dirent_lock);
}
EXPORT_SYMBOL(configfs_undepend_item);
@@ -1325,13 +1308,6 @@ static int configfs_rmdir(struct inode *dir, struct dentry *dentry)
if (sd->s_type & CONFIGFS_USET_DEFAULT)
return -EPERM;
- /*
- * Here's where we check for dependents. We're protected by
- * i_mutex.
- */
- if (sd->s_dependent_count)
- return -EBUSY;
-
/* Get a working ref until we have the child */
parent_item = configfs_get_config_item(dentry->d_parent);
subsys = to_config_group(parent_item)->cg_subsys;
@@ -1355,9 +1331,17 @@ static int configfs_rmdir(struct inode *dir, struct dentry *dentry)
mutex_lock(&configfs_symlink_mutex);
spin_lock(&configfs_dirent_lock);
- ret = configfs_detach_prep(dentry, &wait_mutex);
- if (ret)
- configfs_detach_rollback(dentry);
+ /*
+ * Here's where we check for dependents. We're protected by
+ * configfs_dirent_lock.
+ * If no dependent, atomically tag the item as dropping.
+ */
+ ret = sd->s_dependent_count ? -EBUSY : 0;
+ if (!ret) {
+ ret = configfs_detach_prep(dentry, &wait_mutex);
+ if (ret)
+ configfs_detach_rollback(dentry);
+ }
spin_unlock(&configfs_dirent_lock);
mutex_unlock(&configfs_symlink_mutex);
--
1.5.6.5
On 18/12/08 19:00 +0100, Louis Rilling wrote:
>
> Hi,
>
> Here is a patchset implementing a more lockdep-friendly solution to make lockdep
> happy with configfs.
>
> The first patch could probably be cleaner, but first let's see if the
> approach is ok.
> I don't have a good setup to test the second patch beyond compilation, but I
> guess that Joel has one :)
>
> So, does it look better?
Joel,
Are you going to take one of these approaches to make lockdep and configfs
cohabit?
Thanks,
Louis
--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
On Thu, 2008-12-18 at 14:58 -0800, Joel Becker wrote:
> On Thu, Dec 18, 2008 at 01:28:28PM +0100, Peter Zijlstra wrote:
> > In fact, both (configfs) mkdir and rmdir seem to synchronize on
> > su_mutex..
> >
> > mkdir B/C/bar
> >
> > C.i_mutex
> > su_mutex
> >
> > vs
> >
> > rmdir foo
> >
> > parent(foo).i_mutex
> > foo.i_mutex
> > su_mutex
> >
> >
> > once holding the rmdir su_mutex you can check foo's user-content, since
> > any mkdir will be blocked. All you have to do is then re-validate in
> > mkdir's su_mutex that !IS_DEADDIR(C).
>
> We explicitly do not take any i_mutex locks after taking
> su_mutex. That's an ABBA risk. su_mutex protects the hierarchy of
> config_items. i_mutex protects the vfs view thereof.
I don't think I was suggesting that. All you need is to serialize any
mkdir/creat against the rmdir of the youngest non-default group, and you
can do that by holding su_mutex.
In rmdir, you already own all the i_mutex instances you need to uncouple
the whole tree, all you need to do is validate that its indeed empty --
you don't need i_mutex's for that, because you're holding su_mutex, and
any concurrent mkdir/creat will be blocking on that.
If you find it empty, just mark everybody DEAD, drop su_mutex and
decouple. All concurrent mkdir/creat thingies that were blocking will
now bail because their parent is found DEAD.
> If you look in mkdir, we take su_mutex, get a new item from the
> client subsystem, then drop su_mutex.
All you need to do before dropping su_mutex again is checking
IS_DEADDIR(), if so, you just fail the whole mkdir() no extra i_mutex's
needed.
> After that, we go about building
> our filesystem structure, using i_mutex where appropriate.
Sure, but its ok to grow the default groups non-atomically, right? mkdir
will only need to check that everything is empty in as far as it has
been linked, and ensure the not yet linked entries won't be.
> More
> importantly is rmdir(2), where we use i_mutex in
> configfs_detach_group(), but are not holding su_sem. Only when
> configfs_detach_group() has successfully returned and we have torn down
> the filesystem structure do we take su_mutex and tear down the
> config_item structure.
The only thing that matters is that you can hold su_mutex inside
i_mutex.
configfs_rmdir( "foo" )
{
/* we hold i_mutex for foo and its parent */
mutex_lock(&subsys->su_mutex);
if (default_tree_empty())
mark_default_tree_dead();
else
ret = -EBUSY;
mutex_unlock(&subsys->su_mutex);
if (ret)
return ret;
/* do actual unlink foo */
}
configfs_mkdir( "B/A/bar" )
{
/* we hold i_mutex for A */
mutex_lock(&subsys->su_mutex);
if (IS_DEADDIR(A))
ret = -EINVAL; /* or whatever */
/* increase A's use count, so default_tree_empty() will fail. *
inc_A_or_subsys_use_count();
mutex_unlock(&subsys->su_mutex);
if (ret)
return ret;
/* do actual mkdir */
}
Surely something along these lines ought to work?
Hi Peter,
Thank you for continuing this discussion :)
On 26/01/09 13:30 +0100, Peter Zijlstra wrote:
> On Thu, 2008-12-18 at 14:58 -0800, Joel Becker wrote:
> > On Thu, Dec 18, 2008 at 01:28:28PM +0100, Peter Zijlstra wrote:
> > > In fact, both (configfs) mkdir and rmdir seem to synchronize on
> > > su_mutex..
> > >
> > > mkdir B/C/bar
> > >
> > > C.i_mutex
> > > su_mutex
> > >
> > > vs
> > >
> > > rmdir foo
> > >
> > > parent(foo).i_mutex
> > > foo.i_mutex
> > > su_mutex
> > >
> > >
> > > once holding the rmdir su_mutex you can check foo's user-content, since
> > > any mkdir will be blocked. All you have to do is then re-validate in
> > > mkdir's su_mutex that !IS_DEADDIR(C).
> >
> > We explicitly do not take any i_mutex locks after taking
> > su_mutex. That's an ABBA risk. su_mutex protects the hierarchy of
> > config_items. i_mutex protects the vfs view thereof.
>
> I don't think I was suggesting that. All you need is to serialize any
> mkdir/creat against the rmdir of the youngest non-default group, and you
> can do that by holding su_mutex.
>
> In rmdir, you already own all the i_mutex instances you need to uncouple
> the whole tree, all you need to do is validate that its indeed empty --
> you don't need i_mutex's for that, because you're holding su_mutex, and
> any concurrent mkdir/creat will be blocking on that.
>
> If you find it empty, just mark everybody DEAD, drop su_mutex and
> decouple. All concurrent mkdir/creat thingies that were blocking will
> now bail because their parent is found DEAD.
>
> > If you look in mkdir, we take su_mutex, get a new item from the
> > client subsystem, then drop su_mutex.
>
> All you need to do before dropping su_mutex again is checking
> IS_DEADDIR(), if so, you just fail the whole mkdir() no extra i_mutex's
> needed.
>
> > After that, we go about building
> > our filesystem structure, using i_mutex where appropriate.
>
> Sure, but its ok to grow the default groups non-atomically, right? mkdir
> will only need to check that everything is empty in as far as it has
> been linked, and ensure the not yet linked entries won't be.
>
> > More
> > importantly is rmdir(2), where we use i_mutex in
> > configfs_detach_group(), but are not holding su_sem. Only when
> > configfs_detach_group() has successfully returned and we have torn down
> > the filesystem structure do we take su_mutex and tear down the
> > config_item structure.
>
> The only thing that matters is that you can hold su_mutex inside
> i_mutex.
>
>
> configfs_rmdir( "foo" )
> {
> /* we hold i_mutex for foo and its parent */
>
> mutex_lock(&subsys->su_mutex);
> if (default_tree_empty())
> mark_default_tree_dead();
> else
> ret = -EBUSY;
> mutex_unlock(&subsys->su_mutex);
>
> if (ret)
> return ret;
>
> /* do actual unlink foo */
> }
>
>
> configfs_mkdir( "B/A/bar" )
> {
> /* we hold i_mutex for A */
>
> mutex_lock(&subsys->su_mutex);
> if (IS_DEADDIR(A))
> ret = -EINVAL; /* or whatever */
>
> /* increase A's use count, so default_tree_empty() will fail. *
> inc_A_or_subsys_use_count();
> mutex_unlock(&subsys->su_mutex);
> if (ret)
> return ret;
>
> /* do actual mkdir */
> }
>
>
> Surely something along these lines ought to work?
I may have missed something important here, but how does your suggestion remove
the need for recursively locking inodes?
In configfs_rmdir(), we do not need to lock default groups inodes to prevent
racing mkdir() under them. This race is already dealt with earlier in
configfs_detach_prep(), which tries to set the CONFIGS_USET_DROPPING flag on the
group to remove and on all its hierarchy of default groups, protected by
configfs_dirent_lock. The matching logic in configfs_mkdir() may look a bit
burried but is simple: configfs_attach_item()/configfs_attach_group() eventually
calls configfs_new_dirent(), which fails whenever the parent is tagged with
CONFIGFS_USET_DROPPING. Again, no inode locking is required for this logic.
However configfs_rmdir() and configfs_mkdir() (recursively) lock inodes because
this is how the VFS works when removing/adding entries under a directory which
has already lived in the dcache.
Thanks,
Louis
--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
On Mon, 2009-01-26 at 14:24 +0100, Louis Rilling wrote:
> However configfs_rmdir() and configfs_mkdir() (recursively) lock inodes because
> this is how the VFS works when removing/adding entries under a directory which
> has already lived in the dcache.
Ok, so then I'm not understanding things correctly.
Its not a locking correctness thing, but simply not being able to do it
from the vfs calls because those assume locks held?
Can't you simply punt the work to a worklet once you've created/removed
the non-default group, which can be done from within the vfs callback ?
On 26/01/09 14:41 +0100, Peter Zijlstra wrote:
> On Mon, 2009-01-26 at 14:24 +0100, Louis Rilling wrote:
>
> > However configfs_rmdir() and configfs_mkdir() (recursively) lock inodes because
> > this is how the VFS works when removing/adding entries under a directory which
> > has already lived in the dcache.
>
> Ok, so then I'm not understanding things correctly.
You may understand the VFS better than I do actually.
>
> Its not a locking correctness thing, but simply not being able to do it
> from the vfs calls because those assume locks held?
>
> Can't you simply punt the work to a worklet once you've created/removed
> the non-default group, which can be done from within the vfs callback ?
I'm not sure to understand your suggestion. Is this:
1) for mkdir(), create the non-default group, but without its default groups,
and defer their creation to a worker which won't have constraints on locks held
by any caller;
2) for rmdir(), unlink the non-default group, but without unlinking its default
groups, and defer the recursive work to a lock-free context?
For mkdir(), this may work. Maybe a bit confusing for userspace, since mkdir(A)
returns as soon as A is created, but A may be populated later and userspace may
rely on A being populated as soon as it is created (current behavior). As a
configfs user, this makes my life harder...
For rmdir(), is this safe to unlink a non-empty directory, and to empty it
afterwards? This looks like going back to the unmount problem.
Thanks,
Louis
--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
On Mon, 2009-01-26 at 15:00 +0100, Louis Rilling wrote:
> > Its not a locking correctness thing, but simply not being able to do it
> > from the vfs calls because those assume locks held?
> >
> > Can't you simply punt the work to a worklet once you've created/removed
> > the non-default group, which can be done from within the vfs callback ?
>
> I'm not sure to understand your suggestion. Is this:
> 1) for mkdir(), create the non-default group, but without its default groups,
> and defer their creation to a worker which won't have constraints on locks held
> by any caller;
> 2) for rmdir(), unlink the non-default group, but without unlinking its default
> groups, and defer the recursive work to a lock-free context?
>
> For mkdir(), this may work. Maybe a bit confusing for userspace, since mkdir(A)
> returns as soon as A is created, but A may be populated later and userspace may
> rely on A being populated as soon as it is created (current behavior). As a
> configfs user, this makes my life harder...
Right, so that is the whole crux of the matter?
Initially I understood the whole recursive locking issue to be about
having to serialize mkdir vs rmdir so that you would know the default
groups to be empty etc.
You could create the subtree before you link it in. i_op->mkdir() only
has the parent i_mutex held, so you should be able to create your inode,
and all default groups (some of who will have the non-default group as
parent, but that's ok, as we don't have that locked yet).
Once you've constructed this, you could connect the non-default group to
its parent.
Also, you don't _need_ to have any i_mutex's locked here, because non of
these inodes are reachable.
> For rmdir(), is this safe to unlink a non-empty directory, and to empty it
> afterwards? This looks like going back to the unmount problem.
Dunno :-), I think it should be safe. The only guarantee you need is
that there are no refs to inodes in the decoupled sub-tree (other than
your own of course.)
So you'd only need to punt the rmdir cleanup to eventd or something.
On 26/01/09 15:19 +0100, Peter Zijlstra wrote:
> On Mon, 2009-01-26 at 15:00 +0100, Louis Rilling wrote:
>
> > > Its not a locking correctness thing, but simply not being able to do it
> > > from the vfs calls because those assume locks held?
> > >
> > > Can't you simply punt the work to a worklet once you've created/removed
> > > the non-default group, which can be done from within the vfs callback ?
> >
> > I'm not sure to understand your suggestion. Is this:
> > 1) for mkdir(), create the non-default group, but without its default groups,
> > and defer their creation to a worker which won't have constraints on locks held
> > by any caller;
> > 2) for rmdir(), unlink the non-default group, but without unlinking its default
> > groups, and defer the recursive work to a lock-free context?
> >
> > For mkdir(), this may work. Maybe a bit confusing for userspace, since mkdir(A)
> > returns as soon as A is created, but A may be populated later and userspace may
> > rely on A being populated as soon as it is created (current behavior). As a
> > configfs user, this makes my life harder...
>
> Right, so that is the whole crux of the matter?
Probably not. I'm not the maintainer of configfs, but I guess that Joel is a bit
reluctant to deeply rework parts of something that actually works (conflicts
with lockdep excepted).
>
> Initially I understood the whole recursive locking issue to be about
> having to serialize mkdir vs rmdir so that you would know the default
> groups to be empty etc.
>
> You could create the subtree before you link it in. i_op->mkdir() only
> has the parent i_mutex held, so you should be able to create your inode,
> and all default groups (some of who will have the non-default group as
> parent, but that's ok, as we don't have that locked yet).
>
> Once you've constructed this, you could connect the non-default group to
> its parent.
>
> Also, you don't _need_ to have any i_mutex's locked here, because non of
> these inodes are reachable.
True. I already suggested this to Joel (while fixing a race condition), but this
raises other issues (see http://marc.info/?l=linux-kernel&m=121438776626316&w=2
for a previous discussion on this).
>
> > For rmdir(), is this safe to unlink a non-empty directory, and to empty it
> > afterwards? This looks like going back to the unmount problem.
>
> Dunno :-), I think it should be safe. The only guarantee you need is
> that there are no refs to inodes in the decoupled sub-tree (other than
> your own of course.)
>
> So you'd only need to punt the rmdir cleanup to eventd or something.
May be. Anyway I can't investigate this right now, and that's why I'm asking
Joel if he is going to accept one of the temporary solutions that I provided
(Note that my second solution
http://marc.info/?l=linux-kernel&m=122962334723834&w=2 does not turn off
lockdep!). Of course it's better if someone can just do this rework :)
Thanks,
Louis
--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
Thank you both for keeping up on this. I owe Louis some review that I'm
going to try to get to.
On Mon, Jan 26, 2009 at 03:55:36PM +0100, Louis Rilling wrote:
> On 26/01/09 15:19 +0100, Peter Zijlstra wrote:
> > On Mon, 2009-01-26 at 15:00 +0100, Louis Rilling wrote:
> >
> > > > Its not a locking correctness thing, but simply not being able to do it
> > > > from the vfs calls because those assume locks held?
> > > >
> > > > Can't you simply punt the work to a worklet once you've created/removed
> > > > the non-default group, which can be done from within the vfs callback ?
> > >
> > > I'm not sure to understand your suggestion. Is this:
> > > 1) for mkdir(), create the non-default group, but without its default groups,
> > > and defer their creation to a worker which won't have constraints on locks held
> > > by any caller;
> > > 2) for rmdir(), unlink the non-default group, but without unlinking its default
> > > groups, and defer the recursive work to a lock-free context?
> > >
> > > For mkdir(), this may work. Maybe a bit confusing for userspace, since mkdir(A)
> > > returns as soon as A is created, but A may be populated later and userspace may
> > > rely on A being populated as soon as it is created (current behavior). As a
> > > configfs user, this makes my life harder...
> >
> > Right, so that is the whole crux of the matter?
The "appearance" of the entire default group hierarchy should be
atomic. When mkdir(2) returns, it is all there. This does make our
lives a little difficult inside configfs, but it makes everyone else's
lives much easier.
> Probably not. I'm not the maintainer of configfs, but I guess that Joel is a bit
> reluctant to deeply rework parts of something that actually works (conflicts
> with lockdep excepted).
That is true - it works, and safely, and the lockdep conflict is
the only real known issue. I'm not wary of changing it, I'm only wary
of breaking it. That is, I want to understand what is being changed and
be sure that we're keeping the correctness we have so far. I don't want
to change it and "hope" we got it right, you know? This makes me
cautious, but don't think I'm against change. As I stated, having
lockdep work for us is a good thing.
More replies to the other mails coming.
Joel
--
You can use a screwdriver to screw in screws or to clean your ears,
however, the latter needs real skill, determination and a lack of fear
of injuring yourself. It is much the same with JavaScript.
- Chris Heilmann
Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Mon, Jan 26, 2009 at 01:30:09PM +0100, Peter Zijlstra wrote:
> I don't think I was suggesting that. All you need is to serialize any
> mkdir/creat against the rmdir of the youngest non-default group, and you
> can do that by holding su_mutex.
>
> In rmdir, you already own all the i_mutex instances you need to uncouple
> the whole tree, all you need to do is validate that its indeed empty --
> you don't need i_mutex's for that, because you're holding su_mutex, and
> any concurrent mkdir/creat will be blocking on that.
>
> If you find it empty, just mark everybody DEAD, drop su_mutex and
> decouple. All concurrent mkdir/creat thingies that were blocking will
> now bail because their parent is found DEAD.
Right, that's what I was talking about when I said:
> > Now look in detach_groups(). We drop the groups children before
> > marking them DEAD. Louis' plan, I think, is to perhaps mark a group
> > DEAD, disconnect it from the vfs, and then operate on its children. In
> > this fashion, perhaps we can unlock the trailing lock like a normal VFS
> > operation.
> > This will require some serious auditing, however, because now
> > vfs functions can get into the vfs objects behind us. And more vfs
> > changes affect us. Whereas the current locking relies on the vfs's
> > parent->child lock ordering only, something that isn't likely to be
> > changed.
That is, the vfs has already walked past this directory,
dropped i_mutex, and is in a child default group holding its i_mutex.
It wants to mkdir(2) down there. You're saying that, if mkdir(2) holds
su_mutex higher up, it can check S_DEAD and compare with us, and that's
exactly the scheme I mentioned in the first of the quoted paragraphs
(Louis proposed it a while back). Thus, though we don't hold i_mutex on that
child, it will eventually either a) have gotten su_mutex first, and will
cause rmdir to ENOTEMPTY or b) have gotten su_mutex second, will see
S_DEAD, and return -ENOENT. As far as preventing mkdir(2), I don't see
why it wouldn't work.
The issue is in that second quoted paragraph. We know that the
VFS can look up our children if we're not holding i_mutex. In fact,
cached_lookup() can find them without i_mutex. Now, we know that
mkdir(2) and rmdir(2) will block at su_mutex. But what about all other
file operations, both on the child directories *and* the attribute
files? For attribute files, we prevent access at creation time with a
flag. We can trust the flag because we hold i_mutex. This might hold
anyway because we're holding that toplevel i_mutex. At teardown time,
though, we know they can't be found because of i_mutex. Now we don't
have that protection for processes that are farther down the tree.
But the bigger issue is just the plain regular operations on our
directories. An example is ->readdir(). We currently lock out
->readdir() during rmdir(2) with our holding of i_mutex. However, if we
are not holding i_mutex, vfs_readdir() can call into our ->readdir()
right as we're tearing things down. We may not have gotten to S_DEAD
yet in the rmdir(2) path, and the two will race. We can't take su_mutex
in ->readdir(), because that sort of solution effectively says "we have
to take su_mutex for all operations", and we end up serializing all
operations on a subsustem.
Now, I can think of a way to make ->readdir() safe without
su_mutex. But what other operation is next? How do I know I have them
all? How do I notice when someone adds a new operation or code path to
the generic code that I have to protect against? With i_mutex, I *know*
that everyone has agreed that is the gatekeeper. Without it...
*That's* the big worry. That's what I'm worried about being
sure of. I'd love to hear a solution that we know will work, or at
least move towards one.
Joel
PS: And we haven't even talked about configfs_depend_item() yet.
--
Life's Little Instruction Book #252
"Take good care of those you love."
Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Mon, Jan 26, 2009 at 12:51:52PM +0100, Louis Rilling wrote:
> Are you going to take one of these approaches to make lockdep and configfs
> cohabit?
I'm going to look at your patches now and do some review - sorry
it took me so long.
I do think our discussion with Peter elsewhere may have even
better fruit, so I'm going to hold of on including these just yet.
Joel
--
Life's Little Instruction Book #510
"Count your blessings."
Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Thu, Dec 18, 2008 at 07:00:17PM +0100, Louis Rilling wrote:
> >From inside configfs it is not possible to serialize those recursive
> locking with a top-level one, because mkdir() and rmdir() are already
> called with inodes locked by the VFS. So using some
> mutex_lock_nest_lock() is not an option.
>
> I am proposing two solutions:
> 1) one that wraps recursive mutex_lock()s with
> lockdep_off()/lockdep_on().
> 2) (as suggested earlier by Peter Zijlstra) one that puts the
> i_mutexes recursively locked in different classes based on their
> depth from the top-level config_group created. This
> induces an arbitrary limit (MAX_LOCK_DEPTH - 2 == 46) on the
> nesting of configfs default groups whenever lockdep is activated
> but this limit looks reasonably high. Unfortunately, this also
> isolates VFS operations on configfs default groups from the others
> and thus lowers the chances to detect locking issues.
>
> Nobody likes solution 1), what I can understand.
Me too :-P
> This patch implements solution 2). However lockdep is still not happy with
> configfs_depend_item(). Next patch reworks the locking of
> configfs_depend_item() and finally makes lockdep happy.
<snip>
> #define CONFIGFS_ROOT 0x0001
> diff --git a/fs/configfs/dir.c b/fs/configfs/dir.c
> index 8e93341..f21be74 100644
> --- a/fs/configfs/dir.c
> +++ b/fs/configfs/dir.c
> @@ -94,6 +94,9 @@ static struct configfs_dirent *configfs_new_dirent(struct configfs_dirent * pare
> INIT_LIST_HEAD(&sd->s_links);
> INIT_LIST_HEAD(&sd->s_children);
> sd->s_element = element;
> +#ifdef CONFIG_LOCKDEP
> + sd->s_depth = -1;
> +#endif
> spin_lock(&configfs_dirent_lock);
> if (parent_sd->s_type & CONFIGFS_USET_DROPPING) {
> spin_unlock(&configfs_dirent_lock);
> @@ -176,6 +179,17 @@ static int init_symlink(struct inode * inode)
> return 0;
> }
>
> +#ifdef CONFIG_LOCKDEP
> +static void configfs_set_dir_dirent_depth(struct configfs_dirent *parent_sd,
> + struct configfs_dirent *sd)
> +{
> + int parent_depth = parent_sd->s_depth;
> +
> + if (parent_depth >= 0)
> + sd->s_depth = parent_depth + 1;
> +}
> +#endif
> +
> static int create_dir(struct config_item * k, struct dentry * p,
> struct dentry * d)
> {
> @@ -187,6 +201,9 @@ static int create_dir(struct config_item * k, struct dentry * p,
> error = configfs_make_dirent(p->d_fsdata, d, k, mode,
> CONFIGFS_DIR | CONFIGFS_USET_CREATING);
> if (!error) {
> +#ifdef CONFIG_LOCKDEP
> + configfs_set_dir_dirent_depth(p->d_fsdata, d->d_fsdata);
> +#endif
> error = configfs_create(d, mode, init_dir);
> if (!error) {
> inc_nlink(p->d_inode);
Can you change this to provide non-lockdep versions of
functions? We don't want "#ifdef CONFIG_LOCKDEP" everywhere. What we
want is the code to call functions unconditionally, and the functions to
do nothing if lockdep is not enabled. Like:
#ifdef CONFIG_LOCKDEP
static inline void configfs_init_dir_dirent_depth(dirent)
{
dirent->s_depth = -1;
}
static void configfs_set_dir_dirent_depth(struct configfs_dirent *parent_sd,
struct configfs_dirent *sd)
{
int parent_depth = parent_sd->s_depth;
if (parent_depth >= 0)
sd->s_depth = parent_depth + 1;
}
#else
static inline void configfs_init_dir_dirent_depth(dirent)
{
}
static void configfs_set_dir_dirent_depth(struct configfs_dirent *parent_sd,
struct configfs_dirent *sd)
{
}
#endif
This makes the callsites much nicer to read:
@@ -94,6 +94,9 @@ static struct configfs_dirent *configfs_new_dirent(struct configfs_dirent * pare
INIT_LIST_HEAD(&sd->s_links);
INIT_LIST_HEAD(&sd->s_children);
sd->s_element = element;
+ configfs_init_dir_dirent_depth(sd);
spin_lock(&configfs_dirent_lock);
if (parent_sd->s_type & CONFIGFS_USET_DROPPING) {
spin_unlock(&configfs_dirent_lock);
@@ -187,6 +201,7 @@ static int create_dir(struct config_item * k, struct dentry * p,
error = configfs_make_dirent(p->d_fsdata, d, k, mode,
CONFIGFS_DIR | CONFIGFS_USET_CREATING);
if (!error) {
+ configfs_set_dir_dirent_depth(p->d_fsdata, d->d_fsdata);
error = configfs_create(d, mode, init_dir);
if (!error) {
inc_nlink(p->d_inode);
Otherwise, this patch seems pretty straightforward.
Joel
--
Life's Little Instruction Book #30
"Never buy a house without a fireplace."
Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Thu, Dec 18, 2008 at 07:00:18PM +0100, Louis Rilling wrote:
> configfs_depend_item() recursively locks all inodes mutex from configfs root to
> the target item, which makes lockdep unhappy. The purpose of this recursive
> locking is to ensure that the item tree can be safely parsed and that the target
> item, if found, is not about to leave.
>
> This patch reworks configfs_depend_item() locking using configfs_dirent_lock.
> Since configfs_dirent_lock protects all changes to the configfs_dirent tree, and
> protects tagging of items to be removed, this lock can be used instead of the
> inodes mutex lock chain.
> This needs that the check for dependents be done atomically with
> CONFIGFS_USET_DROPPING tagging.
>
> Now lockdep looks happy with configfs.
This looks almost, but not quite right.
In the create path, we do configfs_new_dirent() before we set
sd->s_type. But configfs_new_dirent() attaches sd->s_sibling. So, in
aonther thread, configfs_depend_prep() can traverse this s_sibling
without CONFIGFS_USET_CREATING being set. This turns out to be safe
because CONFIGFS_DIR is also not set - but boy I'd like a comment about
that.
What if we're in mkdir(2) in one thread and another thread is
trying to pin the parent directory? That is, we are inside
configfs_mkdir(parent, new_dentry, mode). The other thread is doing
configfs_depend_item(subsys, parent). With this patch, the other thread
will not take parent->i_mutex. It will happily determine that
parent is part of the tree and bump its s_dependent with no locking. Is
this OK?
If it is - isn't this patch good without any other reason? That
is, aside from the issues of lockdep, isn't it better for
configfs_depend_item() to never have to worry about the VFS locks other
than the configfs root?
Joel
--
The zen have a saying:
"When you learn how to listen, ANYONE can be your teacher."
Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On 27/01/09 20:13 -0800, Joel Becker wrote:
> On Thu, Dec 18, 2008 at 07:00:18PM +0100, Louis Rilling wrote:
> > configfs_depend_item() recursively locks all inodes mutex from configfs root to
> > the target item, which makes lockdep unhappy. The purpose of this recursive
> > locking is to ensure that the item tree can be safely parsed and that the target
> > item, if found, is not about to leave.
> >
> > This patch reworks configfs_depend_item() locking using configfs_dirent_lock.
> > Since configfs_dirent_lock protects all changes to the configfs_dirent tree, and
> > protects tagging of items to be removed, this lock can be used instead of the
> > inodes mutex lock chain.
> > This needs that the check for dependents be done atomically with
> > CONFIGFS_USET_DROPPING tagging.
> >
> > Now lockdep looks happy with configfs.
>
> This looks almost, but not quite right.
> In the create path, we do configfs_new_dirent() before we set
> sd->s_type. But configfs_new_dirent() attaches sd->s_sibling. So, in
> aonther thread, configfs_depend_prep() can traverse this s_sibling
> without CONFIGFS_USET_CREATING being set. This turns out to be safe
> because CONFIGFS_DIR is also not set - but boy I'd like a comment about
> that.
Definitely agreed. I should have written this comment instead of letting you
notice this.
> What if we're in mkdir(2) in one thread and another thread is
> trying to pin the parent directory? That is, we are inside
> configfs_mkdir(parent, new_dentry, mode). The other thread is doing
> configfs_depend_item(subsys, parent). With this patch, the other thread
> will not take parent->i_mutex. It will happily determine that
> parent is part of the tree and bump its s_dependent with no locking. Is
> this OK?
Yes this is the expected impact. It is OK because
1) under a same critical section of configfs_dirent_lock, depend_item()
checks that CONFIGFS_USET_DROPPING is not set and bumps s_dependent;
2) under a same critical section of configfs_dirent_lock, configfs_rmdir()
checks the s_dependent count and tries to set CONFIGFS_USET_DROPPING.
> If it is - isn't this patch good without any other reason? That
> is, aside from the issues of lockdep, isn't it better for
> configfs_depend_item() to never have to worry about the VFS locks other
> than the configfs root?
Yes, this patch may look like an improvement, independently from lockdep. I
think that locking is simpler this way, and this also removes the need for
configfs_depend_rollback(). Moreover this moves towards the management of
configfs_dirents protected by configfs_dirent_lock only. In the end, it's up to
you to judge if this is a good direction ;)
Thanks,
Louis
--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
On 27/01/09 19:55 -0800, Joel Becker wrote:
> On Thu, Dec 18, 2008 at 07:00:17PM +0100, Louis Rilling wrote:
[...]
>
> > #define CONFIGFS_ROOT 0x0001
> > diff --git a/fs/configfs/dir.c b/fs/configfs/dir.c
> > index 8e93341..f21be74 100644
> > --- a/fs/configfs/dir.c
> > +++ b/fs/configfs/dir.c
> > @@ -94,6 +94,9 @@ static struct configfs_dirent *configfs_new_dirent(struct configfs_dirent * pare
> > INIT_LIST_HEAD(&sd->s_links);
> > INIT_LIST_HEAD(&sd->s_children);
> > sd->s_element = element;
> > +#ifdef CONFIG_LOCKDEP
> > + sd->s_depth = -1;
> > +#endif
> > spin_lock(&configfs_dirent_lock);
> > if (parent_sd->s_type & CONFIGFS_USET_DROPPING) {
> > spin_unlock(&configfs_dirent_lock);
> > @@ -176,6 +179,17 @@ static int init_symlink(struct inode * inode)
> > return 0;
> > }
> >
> > +#ifdef CONFIG_LOCKDEP
> > +static void configfs_set_dir_dirent_depth(struct configfs_dirent *parent_sd,
> > + struct configfs_dirent *sd)
> > +{
> > + int parent_depth = parent_sd->s_depth;
> > +
> > + if (parent_depth >= 0)
> > + sd->s_depth = parent_depth + 1;
> > +}
> > +#endif
> > +
> > static int create_dir(struct config_item * k, struct dentry * p,
> > struct dentry * d)
> > {
> > @@ -187,6 +201,9 @@ static int create_dir(struct config_item * k, struct dentry * p,
> > error = configfs_make_dirent(p->d_fsdata, d, k, mode,
> > CONFIGFS_DIR | CONFIGFS_USET_CREATING);
> > if (!error) {
> > +#ifdef CONFIG_LOCKDEP
> > + configfs_set_dir_dirent_depth(p->d_fsdata, d->d_fsdata);
> > +#endif
> > error = configfs_create(d, mode, init_dir);
> > if (!error) {
> > inc_nlink(p->d_inode);
>
> Can you change this to provide non-lockdep versions of
> functions? We don't want "#ifdef CONFIG_LOCKDEP" everywhere. What we
> want is the code to call functions unconditionally, and the functions to
> do nothing if lockdep is not enabled. Like:
>
> #ifdef CONFIG_LOCKDEP
> static inline void configfs_init_dir_dirent_depth(dirent)
> {
> dirent->s_depth = -1;
> }
>
> static void configfs_set_dir_dirent_depth(struct configfs_dirent *parent_sd,
> struct configfs_dirent *sd)
> {
> int parent_depth = parent_sd->s_depth;
>
> if (parent_depth >= 0)
> sd->s_depth = parent_depth + 1;
> }
> #else
> static inline void configfs_init_dir_dirent_depth(dirent)
> {
> }
>
> static void configfs_set_dir_dirent_depth(struct configfs_dirent *parent_sd,
> struct configfs_dirent *sd)
> {
> }
> #endif
>
> This makes the callsites much nicer to read:
>
> @@ -94,6 +94,9 @@ static struct configfs_dirent *configfs_new_dirent(struct configfs_dirent * pare
> INIT_LIST_HEAD(&sd->s_links);
> INIT_LIST_HEAD(&sd->s_children);
> sd->s_element = element;
> + configfs_init_dir_dirent_depth(sd);
> spin_lock(&configfs_dirent_lock);
> if (parent_sd->s_type & CONFIGFS_USET_DROPPING) {
> spin_unlock(&configfs_dirent_lock);
> @@ -187,6 +201,7 @@ static int create_dir(struct config_item * k, struct dentry * p,
> error = configfs_make_dirent(p->d_fsdata, d, k, mode,
> CONFIGFS_DIR | CONFIGFS_USET_CREATING);
> if (!error) {
> + configfs_set_dir_dirent_depth(p->d_fsdata, d->d_fsdata);
> error = configfs_create(d, mode, init_dir);
> if (!error) {
> inc_nlink(p->d_inode);
Sure, that's why I said that this could be cleaner ;)
>
> Otherwise, this patch seems pretty straightforward.
Cleaner patch coming.
Thanks,
Louis
--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes