A couple comments.
First, put a BUG_ON() where you have BAD BAD BAD - we shouldn't
be creating a depth we can't delete.
> @@ -392,6 +437,10 @@ static int configfs_detach_prep(struct d
> * deep nesting of default_groups
> */
> ret = configfs_detach_prep(sd->s_dentry);
> + /* Update parent's lock_level so that remaining
> + * sibling children keep on globally increasing
> + * lock_level */
> + copy_dirent_lock_level(sd, parent_sd);
> if (!ret)
> continue;
> } else
I'm not sure I get this hunk. If our parent was 1 and we are 2,
we are copying 2 to our parent so the parent can only have other
children at 3?
Joel
--
Life's Little Instruction Book #267
"Lie on your back and look at the stars."
Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Mon, Jun 02, 2008 at 04:07:21PM -0700, Joel Becker wrote:
> A couple comments.
> First, put a BUG_ON() where you have BAD BAD BAD - we shouldn't
> be creating a depth we can't delete.
I think that the best way to avoid this is to use the same numbering scheme
while attaching default groups.
This would change the body of populate_groups() like this:
- if (group->default_groups) {
+ /* lock_level starts at zero for the non default group.
+ * Set it even if we do not take the lock, so that we can use the same
+ * numbering scheme as for destruction time, and can prevent overload at
+ * destruction time. */
+ lock_level = set_dirent_lock_level(parent_sd, sd);
+ if (lock_level < 0) {
+ /* Too many default groups */
+ ret = lock_level;
+ } else if (group->default_groups) {
/*
* FYI, we're faking mkdir here
* I'm not sure we need this semaphore, as we're called
* from our parent's mkdir. That holds our parent's
* i_mutex, so afaik lookup cannot continue through our
* parent to find us, let alone mess with our tree.
* That said, taking our i_mutex is closer to mkdir
* emulation, and shouldn't hurt.
*/
- /* lock_level starts at zero for the non default group */
- lock_level = set_dirent_lock_level(parent_sd, sd);
- if (lock_level < 0) {
- /* Too deeply nested default groups */
- ret = lock_level;
- } else {
mutex_lock_nested(&dentry->d_inode->i_mutex,
I_MUTEX_CHILD + lock_level);
for (i = 0; group->default_groups[i]; i++) {
new_group = group->default_groups[i];
ret = create_default_group(group, new_group);
if (ret)
break;
}
mutex_unlock(&dentry->d_inode->i_mutex);
- /* Reset for future sub-group creations */
- reset_dirent_lock_level(sd);
- }
}
+ if (lock_level > 0)
+ /* Update parent lock_level to keep it increasing, but not
+ * if group is the one actually created/registered by the
+ * user/subsystem */
+ copy_dirent_lock_level(sd, parent_sd);
+ /* Reset for future sub-group creations */
+ reset_dirent_lock_level(sd);
>
> > @@ -392,6 +437,10 @@ static int configfs_detach_prep(struct d
> > * deep nesting of default_groups
> > */
> > ret = configfs_detach_prep(sd->s_dentry);
> > + /* Update parent's lock_level so that remaining
> > + * sibling children keep on globally increasing
> > + * lock_level */
> > + copy_dirent_lock_level(sd, parent_sd);
> > if (!ret)
> > continue;
> > } else
>
> I'm not sure I get this hunk. If our parent was 1 and we are 2,
> we are copying 2 to our parent so the parent can only have other
> children at 3?
Exactly.
Louis
--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
On Tue, Jun 03, 2008 at 06:00:34PM +0200, Louis Rilling wrote:
> On Mon, Jun 02, 2008 at 04:07:21PM -0700, Joel Becker wrote:
> > A couple comments.
> > First, put a BUG_ON() where you have BAD BAD BAD - we shouldn't
> > be creating a depth we can't delete.
>
> I think that the best way to avoid this is to use the same numbering scheme
> while attaching default groups.
If I'm reading this right, when we come back up from one child
chain, we update the parent to be the same as the child - this is, I
assume, to allow all the locks to be held at once. IOW, you are trying
to have all locks in the default groups have unique lock levels,
regardless of their depth.
This is obviously limiting on the number of default groups for
one item - it's a total cap, not a depth cap. But I have another
concern. We lock a particular default_group with level N, then its
child default_group with level N+1. But how does that integrate with
VFS locking of the same mutexes?
Say we have a group G. It has one default group D1. D1 has a
default group itself, D2. So, when we populate the groups, we lock G at
MUTEX_CHILD, D1 at MUTEX_CHILD+1, and D2 at MUTEX_CHILD+2. However,
when the VFS navigates the tree (eg, lookup() or someone attempting an
rmdir() of D2's non-default child), it will lock with _PARENT and
_CHILD, not with our subclasses.
Am I right about this? We won't be using the same classes as
the VFS, and thus won't be able to see about interactions between the
VFS locking and our locking? I'd love to be wrong :-)
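To make the mismatch concrete, here is a toy model (the enum mirrors the kernel's inode lock classes; both helper functions are purely illustrative): configfs would tag each default group's i_mutex with I_MUTEX_CHILD plus its depth, while the VFS always locks a child directory with plain I_MUTEX_CHILD, so the two schemes only coincide at depth zero and lockdep cannot correlate them below that.

```c
#include <assert.h>

/* Mirrors the kernel's enum of inode i_mutex lock classes */
enum inode_lock_class { I_MUTEX_NORMAL, I_MUTEX_PARENT, I_MUTEX_CHILD };

/* Subclass configfs would use for a default group at a given depth
 * under the item being populated (hypothetical helper) */
static int configfs_subclass(int default_group_depth)
{
	return I_MUTEX_CHILD + default_group_depth;
}

/* Subclass the VFS uses for any child directory it descends into */
static int vfs_child_subclass(void)
{
	return I_MUTEX_CHILD;
}
```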
Joel
--
"The real reason GNU ls is 8-bit-clean is so that they can
start using ISO-8859-1 option characters."
- Christopher Davis ([email protected])
Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Fri, Jun 06, 2008 at 04:01:54PM -0700, Joel Becker wrote:
> On Tue, Jun 03, 2008 at 06:00:34PM +0200, Louis Rilling wrote:
> > On Mon, Jun 02, 2008 at 04:07:21PM -0700, Joel Becker wrote:
> > > A couple comments.
> > > First, put a BUG_ON() where you have BAD BAD BAD - we shouldn't
> > > be creating a depth we can't delete.
> >
> > I think that the best way to avoid this is to use the same numbering scheme
> > while attaching default groups.
>
> If I'm reading this right, when we come back up from one child
> chain, we update the parent to be the same as the child - this is, i
> assume, to allow all the locks to be held at once. IOW, you are trying
> to have all locks in the default groups have unique lock levels,
> regardless of their depth.
Exactly, otherwise lockdep will issue a warning as soon as one tries to remove
a config group having default groups at the same depth, because it will see two
mutexes locked with the same sub-class.
> This is obviously limiting on the number of default groups for
> one item - it's a total cap, not a depth cap. But I have another
> concern. We lock a particular default_group with level N, then its
> child default_group with level N+1. But how does that integrate with
> VFS locking of the same mutexes?
> Say we have an group G. It has one default group D1. D1 has a
> default group itself, D2. So, when we populate the groups, we lock G at
> MUTEX_CHILD, D1 at MUTEX_CHILD+1, and D2 at MUTEX_CHILD+2. However,
> when the VFS navigates the tree (eg, lookup() or someone attempting an
> rmdir() of D2's non-default child), it will lock with _PARENT and
> _CHILD, not with our subclasses.
> Am I right about this? We won't be using the same classes as
> the VFS, and thus won't be able to see about interactions between the
> VFS locking and our locking? I'd love to be wrong :-)
You are perfectly right, unfortunately. This is the reason why I proposed
another way that temporarily disables lockdep, and lets us prove the correctness
manually (actually, this manual solution still lets lockdep verify that the
assumption about I_MUTEX_PARENT -> I_MUTEX_CHILD nesting is correct).
A real solution without disabling lockdep and that would integrate with the VFS
should make lockdep aware of lock trees (like i_mutex locks inside a same
filesystem), or more generally lock graphs, and let lockdep verify that locks of
a tree are always taken while respecting a same order. IOW, if we are able to
consistently tag the nodes of a tree with unique numbers (consistently means
that the resulting order on the nodes is never changed when adding or removing
nodes), lockdep should check that locks of the tree are always taken in
ascending tag order.
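The check such a lock-tree-aware lockdep would have to perform can be sketched as follows (the function name is hypothetical): given the tags of the tree's locks in acquisition order, verify that they are strictly ascending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical check: tags[] holds the consistent node tags of a lock
 * tree in the order the locks were taken; they must strictly increase
 * for the acquisition order to be valid. */
static bool taken_in_ascending_tag_order(const int *tags, size_t n)
{
	size_t i;

	for (i = 1; i < n; i++)
		if (tags[i] <= tags[i - 1])
			return false;
	return true;
}
```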
This seems unfortunately hard (impossible?) to achieve with reasonable
constraints: lockdep should not need to add links between the locks (this would
make addition and removal of nodes error prone), and lockdep should not need to
renumber all the nodes of a tree when adding a new node.
In conclusion, I still suggest temporarily disabling lockdep, which will have
the advantage of letting people use lockdep (for other areas) while using
configfs, because lockdep simply cannot help us with configfs hierarchical
locking right now.
Louis
--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
Hi,
Following an intuition, I just found a deadlock resulting from the whole default
groups tree locking in configfs_detach_prep().
I can reproduce the bug with the attached patch (which just enlarges an existing
window in VFS lock_rename()) and the following procedure, assuming that configfs
is mounted under /config, and ocfs2 is loaded with cluster support:
# mkdir /config/cluster/foo
# cd /config/cluster/foo
# ln -s /bin/mv ~/test_deadlock
# ~/test_deadlock heartbeat/dead_threshold node/bar
and in another shell, right after having launched test_deadlock:
# rmdir /config/cluster/foo
First, lockdep warns as usual (see below), and after two minutes (standard task
deadlock parameters), we get the deadlock alerts:
<log>
=============================================
[ INFO: possible recursive locking detected ]
2.6.26-rc5 #13
---------------------------------------------
rmdir/3997 is trying to acquire lock:
(&sb->s_type->i_mutex_key#11){--..}, at: [<ffffffff802d2131>] configfs_detach_prep+0x58/0xaa
but task is already holding lock:
(&sb->s_type->i_mutex_key#11){--..}, at: [<ffffffff80296070>] vfs_rmdir+0x49/0xac
other info that might help us debug this:
2 locks held by rmdir/3997:
#0: (&sb->s_type->i_mutex_key#3/1){--..}, at: [<ffffffff80297c77>] do_rmdir+0x82/0x108
#1: (&sb->s_type->i_mutex_key#11){--..}, at: [<ffffffff80296070>] vfs_rmdir+0x49/0xac
stack backtrace:
Pid: 3997, comm: rmdir Not tainted 2.6.26-rc5 #13
Call Trace:
[<ffffffff8024aa65>] __lock_acquire+0x8d2/0xc78
[<ffffffff802495ec>] find_usage_backwards+0x9d/0xbe
[<ffffffff802d2131>] configfs_detach_prep+0x58/0xaa
[<ffffffff8024b1de>] lock_acquire+0x51/0x6c
[<ffffffff802d2131>] configfs_detach_prep+0x58/0xaa
[<ffffffff80247dad>] debug_mutex_lock_common+0x16/0x23
[<ffffffff805d63a4>] mutex_lock_nested+0xcd/0x23b
[<ffffffff802d2131>] configfs_detach_prep+0x58/0xaa
[<ffffffff802d327b>] configfs_rmdir+0xb8/0x1c3
[<ffffffff80296092>] vfs_rmdir+0x6b/0xac
[<ffffffff80297cac>] do_rmdir+0xb7/0x108
[<ffffffff80249d1e>] trace_hardirqs_on+0xef/0x113
[<ffffffff805d74c4>] trace_hardirqs_on_thunk+0x35/0x3a
[<ffffffff8020b0cb>] system_call_after_swapgs+0x7b/0x80
INFO: task test_deadlock:3996 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
test_deadlock D 0000000000000001 0 3996 3980
ffff81007cc93d78 0000000000000046 ffff81007cc93d40 ffffffff808ed280
ffffffff808ed280 ffff81007cc93d28 ffffffff808ed280 ffffffff808ed280
ffffffff808ed280 ffffffff808ea120 ffffffff808ed280 ffff81007cdcaa10
Call Trace:
[<ffffffff802955e3>] lock_rename+0x11e/0x126
[<ffffffff805d641e>] mutex_lock_nested+0x147/0x23b
[<ffffffff802955e3>] lock_rename+0x11e/0x126
[<ffffffff80297838>] sys_renameat+0xd7/0x21c
[<ffffffff805d74c4>] trace_hardirqs_on_thunk+0x35/0x3a
[<ffffffff80249d1e>] trace_hardirqs_on+0xef/0x113
[<ffffffff805d74c4>] trace_hardirqs_on_thunk+0x35/0x3a
[<ffffffff8020b0cb>] system_call_after_swapgs+0x7b/0x80
INFO: lockdep is turned off.
INFO: task rmdir:3997 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rmdir D 0000000000000000 0 3997 3986
ffff81007cdb9dd8 0000000000000046 0000000000000000 ffffffff808ed280
ffffffff808ed280 ffff81007cdb9d88 ffffffff808ed280 ffffffff808ed280
ffffffff808ed280 ffffffff808ea120 ffffffff808ed280 ffff81007cde0a50
Call Trace:
[<ffffffff802d2131>] configfs_detach_prep+0x58/0xaa
[<ffffffff805d641e>] mutex_lock_nested+0x147/0x23b
[<ffffffff802d2131>] configfs_detach_prep+0x58/0xaa
[<ffffffff802d327b>] configfs_rmdir+0xb8/0x1c3
[<ffffffff80296092>] vfs_rmdir+0x6b/0xac
[<ffffffff80297cac>] do_rmdir+0xb7/0x108
[<ffffffff80249d1e>] trace_hardirqs_on+0xef/0x113
[<ffffffff805d74c4>] trace_hardirqs_on_thunk+0x35/0x3a
[<ffffffff8020b0cb>] system_call_after_swapgs+0x7b/0x80
INFO: lockdep is turned off.
</log>
The issue here is that the VFS locks the i_mutex of the source and target
directories of the rename in source -> target order (because neither is an ancestor
of the other), while configfs_detach_prep() takes them in default group
order (or reverse order, I'm not sure), following the order specified by the
groups' creator.
The VFS protects itself against deadlocks of two concurrent renames with
swapped source and target directories with i_sb->s_vfs_rename_mutex. Perhaps
configfs should use the same lock before calling configfs_detach_prep()?
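The conflict is a classic ABBA inversion, which can be modeled in a few lines of user-space C (orders_can_deadlock is a made-up helper; lock ids are assumed to be small integers): if any pair of locks appears in opposite order in the two acquisition sequences, the two paths can deadlock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the ABBA conflict above: two code paths each take the
 * same set of i_mutexes (identified by small integer ids < 16) but in
 * possibly different orders a[] and b[]. If b reverses any pair that a
 * orders, the paths can deadlock. Purely illustrative, not configfs
 * code. */
static bool orders_can_deadlock(const int *a, const int *b, size_t n)
{
	int pos_b[16];		/* position of each lock id in sequence b */
	size_t i, j;

	for (i = 0; i < n; i++)
		pos_b[b[i]] = (int)i;
	for (i = 0; i < n; i++)
		for (j = i + 1; j < n; j++)
			/* a takes a[i] before a[j]; does b reverse that? */
			if (pos_b[a[i]] > pos_b[a[j]])
				return true;
	return false;
}
```

In the reproducer, lock_rename() orders the two sibling directories as source -> target while configfs_detach_prep() follows the default-group order, so the pair is inverted.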
Or maybe configfs should instead find an alternative to locking the whole
default groups tree? I strongly advocate for the latter, since this could also
solve our issues with lockdep ;)
Louis
On Mon, Jun 09, 2008 at 01:03:53PM +0200, Louis Rilling wrote:
> On Fri, Jun 06, 2008 at 04:01:54PM -0700, Joel Becker wrote:
> > On Tue, Jun 03, 2008 at 06:00:34PM +0200, Louis Rilling wrote:
> > > On Mon, Jun 02, 2008 at 04:07:21PM -0700, Joel Becker wrote:
> > > > A couple comments.
> > > > First, put a BUG_ON() where you have BAD BAD BAD - we shouldn't
> > > > be creating a depth we can't delete.
> > >
> > > I think that the best way to avoid this is to use the same numbering scheme
> > > while attaching default groups.
> >
> > If I'm reading this right, when we come back up from one child
> > chain, we update the parent to be the same as the child - this is, i
> > assume, to allow all the locks to be held at once. IOW, you are trying
> > to have all locks in the default groups have unique lock levels,
> > regardless of their depth.
>
> Exactly, otherwise lockdep will issue a warning as soon as one tries to remove
> a config group having default groups at the same depth, because it will see two
> mutexes locked with same sub-class.
>
> > This is obviously limiting on the number of default groups for
> > one item - it's a total cap, not a depth cap. But I have another
> > concern. We lock a particular default_group with level N, then its
> > child default_group with level N+1. But how does that integrate with
> > VFS locking of the same mutexes?
> > Say we have an group G. It has one default group D1. D1 has a
> > default group itself, D2. So, when we populate the groups, we lock G at
> > MUTEX_CHILD, D1 at MUTEX_CHILD+1, and D2 at MUTEX_CHILD+2. However,
> > when the VFS navigates the tree (eg, lookup() or someone attempting an
> > rmdir() of D2's non-default child), it will lock with _PARENT and
> > _CHILD, not with our subclasses.
> > Am I right about this? We won't be using the same classes as
> > the VFS, and thus won't be able to see about interactions between the
> > VFS locking and our locking? I'd love to be wrong :-)
>
> You are perfectly right, unfortunately. This is the reason why I proposed
> another way that temporarily disables lockdep, and let us prove the correctness
> manually (actually, this manual solution still lets lockdep verify that the
> assumption about I_MUTEX_PARENT -> I_MUTEX_CHILD nesting is correct).
>
> A real solution without disabling lockdep and that would integrate with the VFS
> should make lockdep aware of lock trees (like i_mutex locks inside a same
> filesystem), or more generally lock graphs, and let lockdep verify that locks of
> a tree are always taken while respecting a same order. IOW, if we are able to
> consistently tag the nodes of a tree with unique numbers (consistently means
> that the resulting order on the nodes is never changed when adding or removing
> nodes), lockdep should check that locks of the tree are always taken in
> ascending tag order.
> This seems unfortunately hard (impossible?) to achieve with reasonable
> constraints: lockdep should not need to add links between the locks (this would
> make addition and removal of nodes error prone), and lockdep should not need to
> renumber all the nodes of a tree when adding a new node.
>
> As a conclusion, I still suggest to temporarily disable lockdep, which will have
> the advantage of letting people use lockdep (for other areas) while using
> configfs, because lockdep simply cannot help us with configfs hierarchical
> locking right now.
>
> Louis
>
> --
> Dr Louis Rilling Kerlabs
> Skype: louis.rilling Batiment Germanium
> Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
> http://www.kerlabs.com/ 35700 Rennes
--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
On Mon, Jun 09, 2008 at 02:54:43PM +0200, Louis Rilling wrote:
> Following an intuition, I just found a deadlock resulting from the whole default
> groups tree locking in configfs_detach_prep().
Ugh, thanks for catching this :-(
> The issue here is that the VFS locks the i_mutex of the source and target
> directories of the rename in source -> target order (because none is ascendent
> of the other one), while configfs_detach_prep() takes them in default group
> order (or reverse order, I'm not sure), following the order specified by the
> groups' creator.
What actual targets are you renaming? Sibling default groups?
> The VFS protects itself against deadlocks of two concurrent renames with
> interverted source and target directories with i_sb->s_vfs_rename_mutex. Perhaps
> configfs should use the same lock before calling configfs_detach_prep()?
> Or maybe configfs would better find an alternative to locking the whole
> default groups tree? I strongly advocate for the latter, since this could also
> solve our issues with lockdep ;)
I think the former actually works nicely. We are playing with
the subtree, and want to keep all operations out of it. Except, of
course, that we come into rmdir() with our parent i_mutex taken, so that
violates the ordering of the rename locks, right?
I'm not against the latter AT ALL. I just haven't come up with
it yet - we can't remove parts of the tree, it must be all or none.
Hence, we lock them all speculatively.
Joel
--
Life's Little Instruction Book #15
"Own a great stereo system."
Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Mon, Jun 09, 2008 at 06:58:00PM -0700, Joel Becker wrote:
> On Mon, Jun 09, 2008 at 02:54:43PM +0200, Louis Rilling wrote:
> > Following an intuition, I just found a deadlock resulting from the whole default
> > groups tree locking in configfs_detach_prep().
>
> Ugh, thanks for catching this :-(
>
> > The issue here is that the VFS locks the i_mutex of the source and target
> > directories of the rename in source -> target order (because none is ascendent
> > of the other one), while configfs_detach_prep() takes them in default group
> > order (or reverse order, I'm not sure), following the order specified by the
> > groups' creator.
>
> What actual targets are you renaming? Sibling default groups?
Actually the operation tries to rename a file in the source default group
"foo/heartbeat/" to a new entry "bar" in a sibling default group "foo/node/" of the
source default group. The operation itself is silly with respect to configfs
semantics, but the VFS cannot know that before locking the source and target
directories...
>
> > The VFS protects itself against deadlocks of two concurrent renames with
> > interverted source and target directories with i_sb->s_vfs_rename_mutex. Perhaps
> > configfs should use the same lock before calling configfs_detach_prep()?
> > Or maybe configfs would better find an alternative to locking the whole
> > default groups tree? I strongly advocate for the latter, since this could also
> > solve our issues with lockdep ;)
>
> I think the former actually works nicely. We are playing with
> the subtree, and want to keep all operations out of it. Except, of
> course, that we come into rmdir() with our parent i_mutex taken, so that
> violates the ordering of the rename locks, right?
Right.
I suggested using i_sb->s_vfs_rename_mutex, but we cannot do this from
inside configfs_rmdir(), because locking an i_mutex (as the VFS does
before calling configfs_rmdir()) before s_vfs_rename_mutex will also
deadlock with lock_rename().
> I'm not against the latter AT ALL. I just haven't come up with
> it yet - we can't remove parts of the tree, it must be all or none.
> Hence, we lock them all speculatively.
I'm slowly thinking about a solution, but I don't know the VFS well enough
yet, especially regarding dentry invalidation and locking. Would it be possible
to start an rmdir() by moving the group to some unreachable place? We could
probably use the mutex of the configfs subsystem and enlarge its scope to
protect against concurrent mkdir(), check for user-created items under the
group (and descendent default groups) to remove, invalidate all dentries and
actually remove the group. Do you think that something is feasible in this way?
Louis
--
Dr Louis Rilling Kerlabs - IRISA
Skype: louis.rilling Campus Universitaire de Beaulieu
Phone: (+33|0) 2 99 84 71 52 Avenue du General Leclerc
Fax: (+33|0) 2 99 84 71 71 35042 Rennes CEDEX - France
http://www.kerlabs.com/
On Tue, Jun 10, 2008 at 12:14:27PM +0200, Louis Rilling wrote:
> On Mon, Jun 09, 2008 at 06:58:00PM -0700, Joel Becker wrote:
> > I'm not against the latter AT ALL. I just haven't come up with
> > it yet - we can't remove parts of the tree, it must be all or none.
> > Hence, we lock them all speculatively.
>
> I'm slowly thinking about a solution, but I don't know the VFS enough
> yet, especially regarding dentry invalidation and locking. Would it be possible
> to start an rmdir() by moving the group to some unreachable place? We could
> probably use the mutex of the configfs subsystem and enlarge its scope to
> protect against concurrent mkdir(), check for user-created items under the
> group (and descendent default groups) to remove, invalidate all dentries and
> actually remove the group. Do you think that something is feasible in this way?
Nope, because you may have live objects below you - the rmdir
should fail, and nothing should change. Sure, you could put it back,
but in the middle there is a period where another process tries to look
at the tree and gets ENOENT. That's not right.
But blocking lookup another way might work. If we keep that
rename process out of looking up its targets (blocked on a lock we hold)
it might work.
Note, btw, that the create side (populate_groups) is safe,
because we hold the creating parent's i_mutex throughout the entire
process.
Hey, can we use d_revalidate? Here's the issue. rename, when
going to look up the objects it wants to lock, is getting them out of
cached_lookup - there dcache locking is all that protects it. I was
first thinking we could take the dentry locks to block this out. But
rather, why not fail d_revalidate and force a locked lookup? So, when
we go to lock one of these groups for detaching, we also set a flag on
the configfs_dirent. We add a configfs_d_revalidate function that
returns based on that flag - if set, revalidation is needed. Thus, when
another process comes in to look at the object we've already locked, it
blocks waiting to find it.
See, in do_rename, it does do_path_lookup() before actually
calling lock_rename(). It would block there waiting for our speculative
removal. We'd either fail rmdir, and those lookups would succeed, or
we'd succeed rmdir, and the lookup fails.
The only concern is, can the reverse happen? We get past the
lookups in do_rename(), and then another process comes into rmdir() -
will they deadlock there?
Joel
--
Life's Little Instruction Book #267
"Lie on your back and look at the stars."
Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Tue, Jun 10, 2008 at 10:36:54AM -0700, Joel Becker wrote:
> On Tue, Jun 10, 2008 at 12:14:27PM +0200, Louis Rilling wrote:
> > On Mon, Jun 09, 2008 at 06:58:00PM -0700, Joel Becker wrote:
> > > I'm not against the latter AT ALL. I just haven't come up with
> > > it yet - we can't remove parts of the tree, it must be all or none.
> > > Hence, we lock them all speculatively.
> >
> > I'm slowly thinking about a solution, but I don't know the VFS enough
> > yet, especially regarding dentry invalidation and locking. Would it be possible
> > to start an rmdir() by moving the group to some unreachable place? We could
> > probably use the mutex of the configfs subsystem and enlarge its scope to
> > protect against concurrent mkdir(), check for user-created items under the
> > group (and descendent default groups) to remove, invalidate all dentries and
> > actually remove the group. Do you think that something is feasible in this way?
>
> Nope, because you may have live objects below you - the rmdir
> should fail, and nothing should change. Sure, you could put it back,
> but in the middle there is a period where another process tries to look
> at the tree and gets ENOENT. That's not right.
Sure. I was thinking more of moving the to-be-removed group only once we were
sure that nobody would use it anymore (thanks to the subsys mutex enlarging its
scope for instance). Anyway, see below.
> But blocking lookup another way might work. If we keep that
> rename process out of looking up its targets (blocked on a lock we hold)
> it might work.
> Note, btw, that the create side (populate_groups) is safe,
> because we hold the creating parent's i_mutex throughout the entire
> process.
Yes, mkdir() is completely ok. In fact, I am able to formally prove its locking
correctness (from lockdep's point of view) provided that I_MUTEX_PARENT ->
I_MUTEX_CHILD is correct. I'll come back to lockdep annotations once the
rmdir() bug is fixed.
> Hey, can we use d_revalidate? Here's the issue. rename, when
> going to lookup the objects it wants to lock, is getting them out of
> cached_lookup - there dcache locking is all that protects it. I was
> first thinking we could take the dentry locks to block this out. But
> rather, why not fail d_revalidate and force a locked lookup? So, when
> we go to lock one of these groups for detaching, we also set a flag on
> the configfs_dirent. We add a configfs_d_revalidate function that
> returns based on that flag - if set, revalidation is needed. Thus, when
> another process comes in to look at the object we've already locked, it
> blocks waiting to find it.
> See, in do_rename, it does do_path_lookup() before actually
> calling lock_rename(). It would block there waiting for our speculative
> removal. We'd either fail rmdir, and those lookups would succeed, or
> we'd succeed rmdir, and the lookup fails.
> The only concern is, can the reverse happen? We get past the
> lookups in do_rename(), and then another process comes into rmdir() -
> will they deadlock there?
No, we cannot make d_revalidate() temporarily fail, because it will either make
do_lookup() fail (return value < 0), or will tell do_revalidate() to call
d_invalidate() (return value == 0), which will either definitely invalidate the
dentry stored in item->ci_dentry (unlikely because its use count should be > 1,
right?), or return success to do_lookup().
The good news is that we can make d_revalidate() block on whatever we use to
protect rmdir() against racing operations. I'll try to send a patch leveraging
the subsystem's mutexes later in the day.
Louis
--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
On Wed, Jun 11, 2008 at 04:10:10PM +0200, Louis Rilling wrote:
> On Tue, Jun 10, 2008 at 10:36:54AM -0700, Joel Becker wrote:
> > Hey, can we use d_revalidate? Here's the issue. rename, when
> > going to lookup the objects it wants to lock, is getting them out of
> > cached_lookup - there dcache locking is all that protects it. I was
> > first thinking we could take the dentry locks to block this out. But
> > rather, why not fail d_revalidate and force a locked lookup? So, when
> > we go to lock one of these groups for detaching, we also set a flag on
> > the configfs_dirent. We add a configfs_d_revalidate function that
> > returns based on that flag - if set, revalidation is needed. Thus, when
> > another process comes in to look at the object we've already locked, it
> > blocks waiting to find it.
> > See, in do_rename, it does do_path_lookup() before actually
> > calling lock_rename(). It would block there waiting for our speculative
> > removal. We'd either fail rmdir, and those lookups would succeed, or
> > we'd succeed rmdir, and the lookup fails.
> > The only concern is, can the reverse happen? We get past the
> > lookups in do_rename(), and then another process comes into rmdir() -
> > will they deadlock there?
>
> No we cannot make d_revalidate() temporarily fail, because it will either make
> do_lookup() fail (return value < 0), or will tell do_revalidate() to call
> d_invalidate() (return value == 0), which will either definitely invalidate the
> dentry stored in item->ci_dentry (unlikely because its use count should be > 1,
> right?), or return success to do_lookup().
>
> The good news is that we can make d_revalidate() block on whatever we use to
> protect rmdir() against racing operations. I'll try to send a patch levaraging
> the subsystem's mutexes later in the day.
I spoke too soon. Doing things in d_revalidate() (or even in
configfs_lookup()) is just too early: after they are called and before
lock_rename(), nothing can prevent rmdir() from being called.
I'm afraid that we need a way to get rid of the whole tree locking in
configfs_detach_prep(). Imagine that we protect all calls to configfs_rmdir()
and configfs_mkdir() with a global mutex (say configfs_dir_mutex). Is it still
needed to take default group's i_mutex in configfs_detach_prep() to check for
the validity of the rmdir()?
After this check, we could set a REMOVING flag in configfs_dirent of all
default groups (could actually be done by configfs_detach_prep(), making
configfs_detach_rollback() resetting the flag), release configfs_dir_mutex, and
then call configfs_detach_group(), with detach_groups() taking default group's
i_mutex right before calling configfs_detach_group() recursively (this would
lead to the same path locking as in populate_groups() or
configfs_depend_prep()).
On mkdir() side, once configfs_dir_mutex is acquired, we first check the
REMOVING flag, and proceed if it is not set.
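Concretely, the idea can be sketched in a few lines (a single-threaded user-space model; all names - struct cfs_dirent, cfs_detach_prep, cfs_mkdir - are hypothetical, and in the kernel configfs_dir_mutex would serialize both paths while the flag plays the role of CONFIGFS_USET_DROPPING):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct cfs_dirent {
	bool removing;			/* role of CONFIGFS_USET_DROPPING */
	struct cfs_dirent *default_groups[4];
	size_t nr_default_groups;
};

/* rmdir side: mark the whole default-group subtree as going away,
 * instead of taking every child's i_mutex */
static void cfs_detach_prep(struct cfs_dirent *sd)
{
	size_t i;

	sd->removing = true;
	for (i = 0; i < sd->nr_default_groups; i++)
		cfs_detach_prep(sd->default_groups[i]);
}

/* rollback if the rmdir must fail, e.g. user-created children exist */
static void cfs_detach_rollback(struct cfs_dirent *sd)
{
	size_t i;

	sd->removing = false;
	for (i = 0; i < sd->nr_default_groups; i++)
		cfs_detach_rollback(sd->default_groups[i]);
}

/* mkdir side: refuse to create under a group being torn down */
static int cfs_mkdir(struct cfs_dirent *parent)
{
	return parent->removing ? -ENOENT : 0;
}
```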
Just tell me if you think that it's feasible, and I'll send you a patch if it's
ok.
Louis
--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
On Wed, Jun 11, 2008 at 07:30:47PM +0200, Louis Rilling wrote:
> I've spoken too fast. Doing things in d_revalidate() (or even in
> configfs_lookup()) is just too early: after they are called and before
> lock_rename(), nothing can prevent rmdir() from being called.
Yeah, that's what I was saying.
> I'm afraid that we need a way to get rid of the whole tree locking in
> configfs_detach_prep(). Imagine that we protect all calls to configfs_rmdir()
> and configfs_mkdir() with a global mutex (say configfs_dir_mutex). Is it still
> needed to take default group's i_mutex in configfs_detach_prep() to check for
> the validity of the rmdir()?
That's not a bad idea. I was basically sure we need the tree
locking to prevent other VFS operations from happening, but the more I
think about it, the only reason we do the prep locking is to check and
protect the ENOTEMPTY case. I've bounced it off Mark too, and he
agrees.
> After this check, we could set a REMOVING flag in configfs_dirent of all
> default groups (could actually be done by configfs_detach_prep(), making
> configfs_detach_rollback() resetting the flag), release configfs_dir_mutex, and
> then call configfs_detach_group(), with detach_groups() taking default group's
> i_mutex right before calling configfs_detach_group() recursively (this would
> lead to the same path locking as in populate_groups() or
> configfs_depend_prep()).
We already set the flag - CONFIGFS_USET_DROPPING. The only
change is that you don't take i_mutex anymore. For the static mutex,
the name 'configfs_dir_mutex' is fine.
> On mkdir() side, once configfs_dir_mutex is acquired, we first check the
> REMOVING flag, and proceed if it is not set.
And if it is set, we return -ENOENT.
> Just tell me if you think that it's feasible, and I'll send you a patch if it's
Have at! Thanks.
Joel
--
"Vote early and vote often."
- Al Capone
Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127