There's a simple and effective way to prevent unlink(2) and rename(2)
from operating on any file or directory by simply mounting something
on it. In any mount instance in any namespace.
Was this considered in the unprivileged mount design?
The solution is also theoretically simple: mounts in unpriv namespaces
are marked "volatile" and are dissolved on an unlink type operation.
Such volatile mounts would be useful in general too.
Thanks,
Miklos
On 08/14/2013 10:42 AM, Miklos Szeredi wrote:
> There's a simple and effective way to prevent unlink(2) and rename(2)
> from operating on any file or directory by simply mounting something
> on it. In any mount instance in any namespace.
>
> Was this considered in the unprivileged mount design?
>
> The solution is also theoretically simple: mounts in unpriv namespaces
> are marked "volatile" and are dissolved on an unlink type operation.
I'd actually prefer the reverse: unprivileged mounts don't prevent
unlink and rename. If the dentry goes away, then the mount could still
exist, sans underlying file. (This is already supported on network
filesystems.)
--Andy
Miklos Szeredi <[email protected]> writes:
> There's a simple and effective way to prevent unlink(2) and rename(2)
> from operating on any file or directory by simply mounting something
> on it. In any mount instance in any namespace.
>
> Was this considered in the unprivileged mount design?
The focus was on not fooling privileged applications and some of the
secondary effects that really should have been thought about in more
depth were like this one were overlooked.
> The solution is also theoretically simple: mounts in unpriv namespaces
> are marked "volatile" and are dissolved on an unlink type operation.
>
> Such volatile mounts would be useful in general too.
Agreed.
This is a problem that is a general pain with mount namespaces in
general.
I think the real technical hurdle is finding the mounts t in some random
mount namespace. Once we can do that relatively efficiently the rest
becomes simple.
Eric
Andy Lutomirski <[email protected]> writes:
> On 08/14/2013 10:42 AM, Miklos Szeredi wrote:
>> There's a simple and effective way to prevent unlink(2) and rename(2)
>> from operating on any file or directory by simply mounting something
>> on it. In any mount instance in any namespace.
>>
>> Was this considered in the unprivileged mount design?
>>
>> The solution is also theoretically simple: mounts in unpriv namespaces
>> are marked "volatile" and are dissolved on an unlink type operation.
>
> I'd actually prefer the reverse: unprivileged mounts don't prevent
> unlink and rename. If the dentry goes away, then the mount could still
> exist, sans underlying file. (This is already supported on network
> filesystems.)
Of course we do this in network filesystems by pretending the
rename/unlink did not actually happen. The vfs insists that it be lied
to instead of mirroring what actually happened.
Again all of this is a question about efficient data structures and not
really one of semantics. Can either semantic be implemented in such a
way that it does not slow down the vfs?
Eric
On Wed, Aug 14, 2013 at 12:53 PM, Eric W. Biederman
<[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>
>> On 08/14/2013 10:42 AM, Miklos Szeredi wrote:
>>> There's a simple and effective way to prevent unlink(2) and rename(2)
>>> from operating on any file or directory by simply mounting something
>>> on it. In any mount instance in any namespace.
>>>
>>> Was this considered in the unprivileged mount design?
>>>
>>> The solution is also theoretically simple: mounts in unpriv namespaces
>>> are marked "volatile" and are dissolved on an unlink type operation.
>>
>> I'd actually prefer the reverse: unprivileged mounts don't prevent
>> unlink and rename. If the dentry goes away, then the mount could still
>> exist, sans underlying file. (This is already supported on network
>> filesystems.)
>
> Of course we do this in network filesystems by pretending the
> rename/unlink did not actually happen. The vfs insists that it be lied
> to instead of mirroring what actually happened.
>
> Again all of this is a question about efficient data structures and not
> really one of semantics. Can either semantic be implemented in such a
> way that it does not slow down the vfs?
Given that vfs_unlink has:
if (d_mountpoint(dentry))
error = -EBUSY;
I think it's just a matter of changing / deleting that code.
--Andy
Andy Lutomirski <[email protected]> writes:
> On Wed, Aug 14, 2013 at 12:53 PM, Eric W. Biederman
> <[email protected]> wrote:
>> Andy Lutomirski <[email protected]> writes:
>>
>>> On 08/14/2013 10:42 AM, Miklos Szeredi wrote:
>>>> There's a simple and effective way to prevent unlink(2) and rename(2)
>>>> from operating on any file or directory by simply mounting something
>>>> on it. In any mount instance in any namespace.
>>>>
>>>> Was this considered in the unprivileged mount design?
>>>>
>>>> The solution is also theoretically simple: mounts in unpriv namespaces
>>>> are marked "volatile" and are dissolved on an unlink type operation.
>>>
>>> I'd actually prefer the reverse: unprivileged mounts don't prevent
>>> unlink and rename. If the dentry goes away, then the mount could still
>>> exist, sans underlying file. (This is already supported on network
>>> filesystems.)
>>
>> Of course we do this in network filesystems by pretending the
>> rename/unlink did not actually happen. The vfs insists that it be lied
>> to instead of mirroring what actually happened.
>>
>> Again all of this is a question about efficient data structures and not
>> really one of semantics. Can either semantic be implemented in such a
>> way that it does not slow down the vfs?
>
> Given that vfs_unlink has:
>
> if (d_mountpoint(dentry))
> error = -EBUSY;
>
> I think it's just a matter of changing / deleting that code.
Deleting the code is completely unacceptable as it generates mounts that
can never be unmounted.
Changing this code is what we were discussing. My point is that an
efficient replacement is not immediately obvious, and a solution that
degrades the performance of the fast path of the vfs to make this case
work better is not likely to be acceptable.
Eric
On Wed, Aug 14, 2013 at 9:32 PM, Eric W. Biederman
<[email protected]> wrote:
>> The solution is also theoretically simple: mounts in unpriv namespaces
>> are marked "volatile" and are dissolved on an unlink type operation.
>>
>> Such volatile mounts would be useful in general too.
>
> Agreed.
>
> This is a problem that is a general pain with mount namespaces in
> general.
>
> I think the real technical hurdle is finding the mounts t in some random
> mount namespace. Once we can do that relatively efficiently the rest
> becomes simple.
We already have a "struct mountpoint" hashed on the dentry. Chaining
mounts on that mountpoint would be trivial. And we need a
MNT_VOLATILE flag and that's it. If we fear that traversing the list
of mounts on the dentry to check for non-volatile ones then we could
also add a separate volatile counter to struct mountpoint and a
matching flag to the dentry. But I don't think that's really
necessary.
Thanks,
Miklos
Miklos Szeredi <[email protected]> writes:
> On Wed, Aug 14, 2013 at 9:32 PM, Eric W. Biederman
> <[email protected]> wrote:
>
>>> The solution is also theoretically simple: mounts in unpriv namespaces
>>> are marked "volatile" and are dissolved on an unlink type operation.
>>>
>>> Such volatile mounts would be useful in general too.
>>
>> Agreed.
>>
>> This is a problem that is a general pain with mount namespaces in
>> general.
>>
>> I think the real technical hurdle is finding the mounts t in some random
>> mount namespace. Once we can do that relatively efficiently the rest
>> becomes simple.
>
> We already have a "struct mountpoint" hashed on the dentry. Chaining
> mounts on that mountpoint would be trivial. And we need a
> MNT_VOLATILE flag and that's it. If we fear that traversing the list
> of mounts on the dentry to check for non-volatile ones then we could
> also add a separate volatile counter to struct mountpoint and a
> matching flag to the dentry. But I don't think that's really
> necessary.
*Blink* I had overlooked "struct mountpoint". That indeed makes things
easier.
I agree we can chain "struct mount" on "struct mountpoint" and then we
would have an efficient implementation, that does not impact the vfs
fast path.
After that it becomes a question of permissions and semantics.
I am in the process of adopting the rule that something that is not
visible at the time we copy a set of mounts should not become visible
in the child mount namespace. Grr. This has been a busy month and
despite having been reviewed I haven't gotten around to pushing that
patch to linux-next.
But MNT_VOLATILE by definition can not reveal anything becasue the
underlying mount point is removed, so that all of that weirdness is in
propogating mounts between mount namespaces is not relevant here.
This is however the propogation of an unmount between mount namespaces.
In general we don't even need the MNT_VOLATILE flag we just need the
appropriate permission checks. However we do need something like
MNT_VOLATILE to prevent surprises. MNT_VOLATILE would be used to
prevent things like:
# mount --bind / /mnt
# rmdir /mnt/usr
I think the root user would be rather annoyed if that worked, so it does
appear we need something like MNT_VOLATILE.
Part of me does prefer the semantics Andy has suggested where instead of
unmounting things we have something like a skeleton of the mount tree
unioned with dcaches of the filesystems themselves. With "struct
mountpoint" we are amazing close to that already.
A mount skeleton would allow us to always remove and rename directories
and files without really caring, about what mounts were present.
Probably with just a quick lookup to see if we need to set
DCACHE_MOUNTED.
The big practical problem I can see with MNT_VOLATILE is mount points in
shared directories like /tmp but without the sticky set. At which point
it would be possible to delete another users mount points. Perhaps we
need restrictions on where a user can mount.
Are there any practical limitations we can add to mount that will
ensure MNT_VOLATILE won't impact other users? inode_capable looks like
a good canidate. Which effectively boils down to /proc/<pid>/uid_map.
I will have to see if that semantic is workable, but at first glance it
looks promising.
So in summary I am thinking.
- Restrict on which files/directories an unprivileged user can mount.
That immediately prevents most weirdness.
- Allow unlink and rmdir and rename to cause recursive lazy unmounts if
the caller has the appropriate permissions.
Perhaps the permissions needed include MNT_VOLATILE.
Any other suggestions?
Eric
On Wed, Aug 14, 2013 at 11:45 PM, Eric W. Biederman
<[email protected]> wrote:
> Miklos Szeredi <[email protected]> writes:
>
>> On Wed, Aug 14, 2013 at 9:32 PM, Eric W. Biederman
>> <[email protected]> wrote:
>>
>>>> The solution is also theoretically simple: mounts in unpriv namespaces
>>>> are marked "volatile" and are dissolved on an unlink type operation.
>>>>
>>>> Such volatile mounts would be useful in general too.
>>>
>>> Agreed.
>>>
>>> This is a problem that is a general pain with mount namespaces in
>>> general.
>>>
>>> I think the real technical hurdle is finding the mounts t in some random
>>> mount namespace. Once we can do that relatively efficiently the rest
>>> becomes simple.
>>
>> We already have a "struct mountpoint" hashed on the dentry. Chaining
>> mounts on that mountpoint would be trivial. And we need a
>> MNT_VOLATILE flag and that's it. If we fear that traversing the list
>> of mounts on the dentry to check for non-volatile ones then we could
>> also add a separate volatile counter to struct mountpoint and a
>> matching flag to the dentry. But I don't think that's really
>> necessary.
>
> *Blink* I had overlooked "struct mountpoint". That indeed makes things
> easier.
>
> I agree we can chain "struct mount" on "struct mountpoint" and then we
> would have an efficient implementation, that does not impact the vfs
> fast path.
>
> After that it becomes a question of permissions and semantics.
>
> I am in the process of adopting the rule that something that is not
> visible at the time we copy a set of mounts should not become visible
> in the child mount namespace. Grr. This has been a busy month and
> despite having been reviewed I haven't gotten around to pushing that
> patch to linux-next.
>
> But MNT_VOLATILE by definition can not reveal anything becasue the
> underlying mount point is removed, so that all of that weirdness is in
> propogating mounts between mount namespaces is not relevant here.
>
> This is however the propogation of an unmount between mount namespaces.
>
> In general we don't even need the MNT_VOLATILE flag we just need the
> appropriate permission checks. However we do need something like
> MNT_VOLATILE to prevent surprises. MNT_VOLATILE would be used to
> prevent things like:
>
> # mount --bind / /mnt
> # rmdir /mnt/usr
>
> I think the root user would be rather annoyed if that worked, so it does
> appear we need something like MNT_VOLATILE.
>
> Part of me does prefer the semantics Andy has suggested where instead of
> unmounting things we have something like a skeleton of the mount tree
> unioned with dcaches of the filesystems themselves. With "struct
> mountpoint" we are amazing close to that already.
Two possible nasty cases:
1. mount whatever /tmp/foo/bar; rmdir /tmp/foo/bar; rmdir /tmp/foo
Presumably ls /tmp shouldn't show foo. Should cd /tmp/foo/bar work?
What about umount /tmp/foo/bar? What about cd /tmp/foo?
2. mount whatever /tmp/foo; rmdir /tmp/foo; mkdir /tmp/foo
Ugh.
--Andy
Andy Lutomirski <[email protected]> writes:
> On Wed, Aug 14, 2013 at 11:45 PM, Eric W. Biederman
> <[email protected]> wrote:
>> Miklos Szeredi <[email protected]> writes:
>>
>> Part of me does prefer the semantics Andy has suggested where instead of
>> unmounting things we have something like a skeleton of the mount tree
>> unioned with dcaches of the filesystems themselves. With "struct
>> mountpoint" we are amazing close to that already.
>
> Two possible nasty cases:
>
> 1. mount whatever /tmp/foo/bar; rmdir /tmp/foo/bar; rmdir /tmp/foo
>
> Presumably ls /tmp shouldn't show foo. Should cd /tmp/foo/bar work?
> What about umount /tmp/foo/bar? What about cd /tmp/foo?
You have to have two mount namespaces or at least two different paths to
to the same filesystem to make this work. rdir /tmp/foo/bar where
/tmp/foo/bar is a mountpoint in your mount namespace will not work
because you are trying to remove a root directory.
So the semantics I would expect to see if it was implementable is
/tmp/foo and /tmp/foo/bar would continue to exist on the paths where
/tmp/foo/bar was a mount point and would disappear as soon as it was
unmounted.
> 2. mount whatever /tmp/foo; rmdir /tmp/foo; mkdir /tmp/foo
>
> Ugh.
Likewise. I would expect to see the new /tmp/foo slide under the old
/tmp/foo mountpoint.
Essentially my expectation would be that the mount points would float
over the filesystems. Semantically I like it, and have played with the
idea before. Implementation wise shrug I didn't realize any of this was
close to being practically implementatable until today.
Eric
On Thu, Aug 15, 2013 at 8:45 AM, Eric W. Biederman
<[email protected]> wrote:
> Part of me does prefer the semantics Andy has suggested where instead of
> unmounting things we have something like a skeleton of the mount tree
> unioned with dcaches of the filesystems themselves. With "struct
> mountpoint" we are amazing close to that already.
>
> A mount skeleton would allow us to always remove and rename directories
> and files without really caring, about what mounts were present.
> Probably with just a quick lookup to see if we need to set
> DCACHE_MOUNTED.
Yes, we could have a separate dentry tree just for anchoring mounts
and we could make a union with the real dentry tree. But implementing
that in a low overhead manner is not trivial. Anchoring mounts on
real dentries is *rather* convenient. And yes, we only actually need
the skeleton dentries when the real ones disappear, but that doesn't
make the implementation any simpler.
> The big practical problem I can see with MNT_VOLATILE is mount points in
> shared directories like /tmp but without the sticky set. At which point
> it would be possible to delete another users mount points. Perhaps we
> need restrictions on where a user can mount.
I think if user X mounts something on /tmp/foo owned by user Y and
user Y removes /tmp/foo then that shouldn't really surprise user X. I
don't see this an issue.
Thanks,
Miklos
On 08/14/2013 12:42:19 PM, Miklos Szeredi wrote:
> There's a simple and effective way to prevent unlink(2) and rename(2)
> from operating on any file or directory by simply mounting something
> on it. In any mount instance in any namespace.
>
> Was this considered in the unprivileged mount design?
>
> The solution is also theoretically simple: mounts in unpriv namespaces
> are marked "volatile" and are dissolved on an unlink type operation.
>
> Such volatile mounts would be useful in general too.
Would that "anonymous inode" thing that wandered by recently help,
letting umount move the mount to one side so you could keep the mount
point as the root of your per-process hierarchy but not have it glued
to other people's namespaces?
Rob-
This patchset is an attempt to address two problems:
1) Not all modifications to the filesystems happen through the vfs and
since the vfs can not cope with a mount point being unlinked or
renamed filesystems whose modifications that do not come through the
vfs are required to lie.
2) Through an oversight it is now possible for one unprivileged user to
mount something on another unprivileged users dentry and make it
impossible for the other user to unlink or rename that dentry.
It is now technically possible to easily lift the restriction on
unlinking and renaming files with mount points on them, with a
corresponding reduction in complexity of the vfs semantics and a small
code side reduction.
After thinking about it removing the restrictions on mount points
appears safe, because it is just plain dumb to have a mount point
in a directory that is not restricted to root only modifications.
This is a change in user visible semantics, so I want to be very careful
about this. Are there any reasons to not make this change?
All of this happens under a full pile of vfs locks so this shouldn't
affect the vfs scalabilitiy work that is on-going.
Eric W. Biederman (3):
vfs: Keep a list of mounts on a mount point
vfs: Add a function to lazily unmount all mounts from any dentry.
vfs: Lazily remove mounts on unlinked files and directories.
fs/afs/dir.c | 3 +-
fs/dcache.c | 80 ++++++++++++++++++++----------------------------
fs/fuse/dir.c | 3 +-
fs/gfs2/dentry.c | 4 +--
fs/mount.h | 3 ++
fs/namei.c | 31 ++++++------------
fs/namespace.c | 29 +++++++++++++++++
fs/nfs/dir.c | 5 +--
fs/sysfs/dir.c | 9 +-----
include/linux/dcache.h | 3 +-
10 files changed, 83 insertions(+), 87 deletions(-)
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/mount.h | 2 ++
fs/namespace.c | 5 +++++
2 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/fs/mount.h b/fs/mount.h
index 64a858143ff9..e4342b8dfab1 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -21,6 +21,7 @@ struct mnt_pcp {
struct mountpoint {
struct list_head m_hash;
struct dentry *m_dentry;
+ struct list_head m_list;
int m_count;
};
@@ -47,6 +48,7 @@ struct mount {
struct mount *mnt_master; /* slave is on master->mnt_slave_list */
struct mnt_namespace *mnt_ns; /* containing namespace */
struct mountpoint *mnt_mp; /* where is it mounted */
+ struct list_head mnt_mp_list; /* list mounts with the same mountpoint */
#ifdef CONFIG_FSNOTIFY
struct hlist_head mnt_fsnotify_marks;
__u32 mnt_fsnotify_mask;
diff --git a/fs/namespace.c b/fs/namespace.c
index da5c49483430..d092964fe7f9 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -197,6 +197,7 @@ static struct mount *alloc_vfsmnt(const char *name)
INIT_LIST_HEAD(&mnt->mnt_share);
INIT_LIST_HEAD(&mnt->mnt_slave_list);
INIT_LIST_HEAD(&mnt->mnt_slave);
+ INIT_LIST_HEAD(&mnt->mnt_mp_list);
#ifdef CONFIG_FSNOTIFY
INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
#endif
@@ -636,6 +637,7 @@ static struct mountpoint *new_mountpoint(struct dentry *dentry)
mp->m_dentry = dentry;
mp->m_count = 1;
list_add(&mp->m_hash, chain);
+ INIT_LIST_HEAD(&mp->m_list);
return mp;
}
@@ -691,6 +693,7 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
list_del_init(&mnt->mnt_hash);
put_mountpoint(mnt->mnt_mp);
mnt->mnt_mp = NULL;
+ list_del_init(&mnt->mnt_mp_list);
}
/*
@@ -705,6 +708,7 @@ void mnt_set_mountpoint(struct mount *mnt,
child_mnt->mnt_mountpoint = dget(mp->m_dentry);
child_mnt->mnt_parent = mnt;
child_mnt->mnt_mp = mp;
+ list_add_tail(&child_mnt->mnt_mp_list, &mp->m_list);
}
/*
@@ -1193,6 +1197,7 @@ void umount_tree(struct mount *mnt, int propagate)
p->mnt_parent->mnt_ghosts++;
put_mountpoint(p->mnt_mp);
p->mnt_mp = NULL;
+ list_del_init(&mnt->mnt_mp_list);
}
change_mnt_propagation(p, MS_PRIVATE);
}
--
1.7.5.4
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/mount.h | 1 +
fs/namespace.c | 24 ++++++++++++++++++++++++
2 files changed, 25 insertions(+), 0 deletions(-)
diff --git a/fs/mount.h b/fs/mount.h
index e4342b8dfab1..7a6a2bb3f290 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -79,6 +79,7 @@ static inline int is_mounted(struct vfsmount *mnt)
}
extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *, int);
+extern void detach_mounts(struct dentry *dentry);
static inline void get_mnt_ns(struct mnt_namespace *ns)
{
diff --git a/fs/namespace.c b/fs/namespace.c
index d092964fe7f9..8eaee0c14fdb 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1294,6 +1294,30 @@ static int do_umount(struct mount *mnt, int flags)
return retval;
}
+void detach_mounts(struct dentry *dentry)
+{
+ struct mount *mnt, *next;
+ struct mountpoint *mp;
+
+ namespace_lock();
+ if (!d_mountpoint(dentry)) {
+ namespace_unlock();
+ return;
+ }
+ mp = new_mountpoint(dentry);
+ if (IS_ERR(mp)) {
+ return;
+ }
+ br_write_lock(&vfsmount_lock);
+ list_for_each_entry_safe(mnt, next, &mp->m_list, mnt_mp_list) {
+ if (!list_empty(&mnt->mnt_list))
+ umount_tree(mnt, 1);
+ }
+ br_write_unlock(&vfsmount_lock);
+ put_mountpoint(mp);
+ namespace_unlock();
+}
+
/*
* Is the caller allowed to modify his namespace?
*/
--
1.7.5.4
With the introduction of mount namespaces and bind mounts it because
possible to access files and directories that in other locations in
were used as mount points. Especially with mount namespaces has
become very confusing why rm -rf somedir return -EBUSY because some
directory is mounted somewhere else. With the addition of user
namespaces allowing unprivileged mounts this condition has gone from
annoying to allowing a DOS attack on more privileged users.
The simplest approach appears to be to remove the -EBUSY message,
allow unlink and rename, and lazily unmount the mount point.
In most cases this is less surprising as this is an implementation
of the normal unix behavior of allowing unlinking of files.
The change implemented in this patch allows the following to succeed:
The vfs does not currently follow paths up to the final component for
the rename and unlink system calls making the boldest version of this
idea the simplest to implement. Which should it simple to spot problems
with this idea.
While different from our historical behavior this change does not look
like it will break anything, or introduce any security
vulnerabilities. In a quick survey of all of the common mount points
on linux systems I found mount points in directories owned and
modifiable by root, and fuse fuse mounts in directories owned by the
``mounter'' of the fuse filesystem. In both of these cases relying on
the permissions of the directory does not practically change the user
who is allowed to unmount the filesystem.
Attempting to anticipate cases I have not witnessed I observe that
every directory in a trusted path to a file must limit modification
such that no one else may modify that directory. For files trusted by
suid root executables root most own and be the only user capable of
modifying the directory and all parent directories for the files to be
safe. Therefore for mount points part of a trusted path only root
should be able to unlink any directory or file on that path. Which
means after this change for a secured path only root can unmount
directories.
For mount points part of a path we can not trust we should not care if
the just disappear, as that is just another kind of arbitrary
manipulation.
So I conclude that the existing conditions will ensure that the permissions
on directories will be sufficiently limited that the new unmount on unlink
behavior will not cause problems.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/afs/dir.c | 3 +-
fs/dcache.c | 80 ++++++++++++++++++++----------------------------
fs/fuse/dir.c | 3 +-
fs/gfs2/dentry.c | 4 +--
fs/namei.c | 31 ++++++------------
fs/nfs/dir.c | 5 +--
fs/sysfs/dir.c | 9 +-----
include/linux/dcache.h | 3 +-
8 files changed, 51 insertions(+), 87 deletions(-)
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 646337dc5201..7fb69d45f1b9 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -686,8 +686,7 @@ not_found:
out_bad:
/* don't unhash if we have submounts */
- if (check_submounts_and_drop(dentry) != 0)
- goto out_skip;
+ shrink_submounts_and_drop(dentry);
_debug("dropping dentry %s/%s",
parent->d_name.name, dentry->d_name.name);
diff --git a/fs/dcache.c b/fs/dcache.c
index 41000305d716..1e9bf96b0132 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1373,7 +1373,7 @@ int d_set_mounted(struct dentry *dentry)
int ret = -ENOENT;
write_seqlock(&rename_lock);
for (p = dentry->d_parent; !IS_ROOT(p); p = p->d_parent) {
- /* Need exclusion wrt. check_submounts_and_drop() */
+ /* Need exclusion wrt. shrink_submounts_and_drop() */
spin_lock(&p->d_lock);
if (unlikely(d_unhashed(p))) {
spin_unlock(&p->d_lock);
@@ -1478,70 +1478,56 @@ void shrink_dcache_parent(struct dentry *parent)
}
EXPORT_SYMBOL(shrink_dcache_parent);
-static enum d_walk_ret check_and_collect(void *_data, struct dentry *dentry)
+struct detach_data {
+ struct dentry *found;
+};
+static enum d_walk_ret do_detach_submounts(void *ptr, struct dentry *dentry)
{
- struct select_data *data = _data;
-
- if (d_mountpoint(dentry)) {
- data->found = -EBUSY;
- return D_WALK_QUIT;
- }
-
- return select_collect(_data, dentry);
-}
+ struct detach_data *data = ptr;
-static void check_and_drop(void *_data)
-{
- struct select_data *data = _data;
+ if (d_mountpoint(dentry))
+ data->found = dentry;
- if (d_mountpoint(data->start))
- data->found = -EBUSY;
- if (!data->found)
- __d_drop(data->start);
+ return data->found ? D_WALK_QUIT : D_WALK_CONTINUE;
}
/**
- * check_submounts_and_drop - prune dcache, check for submounts and drop
+ * detach_submounts - check for submounts and detach them.
*
- * All done as a single atomic operation relative to has_unlinked_ancestor().
- * Returns 0 if successfully unhashed @parent. If there were submounts then
- * return -EBUSY.
+ * @dentry: dentry to find mount points under.
*
- * @dentry: dentry to prune and drop
+ * If dentry or any of it's children is a mount point detach those mounts.
*/
-int check_submounts_and_drop(struct dentry *dentry)
+void detach_submounts(struct dentry *dentry)
{
- int ret = 0;
-
- /* Negative dentries can be dropped without further checks */
- if (!dentry->d_inode) {
- d_drop(dentry);
- goto out;
- }
-
+ struct detach_data data;
for (;;) {
- struct select_data data;
-
- INIT_LIST_HEAD(&data.dispose);
- data.start = dentry;
- data.found = 0;
+ data.found = NULL;
+ d_walk(dentry, &data, do_detach_submounts, NULL);
- d_walk(dentry, &data, check_and_collect, check_and_drop);
- ret = data.found;
-
- if (!list_empty(&data.dispose))
- shrink_dentry_list(&data.dispose);
-
- if (ret <= 0)
+ if (!data.found)
break;
+ detach_mounts(data.found);
cond_resched();
}
+ detach_mounts(dentry);
+}
-out:
- return ret;
+/**
+ * shrink_submounts_and_drop - detach submounts, prune dcache, and drop
+ *
+ * All done as a single atomic operation reletaive to d_set_mounted().
+ *
+ * @dentry: dentry to detach, prune and drop
+ */
+void shrink_submounts_and_drop(struct dentry *dentry)
+{
+ d_drop(dentry);
+ detach_submounts(dentry);
+ shrink_dcache_parent(dentry);
}
-EXPORT_SYMBOL(check_submounts_and_drop);
+EXPORT_SYMBOL(shrink_submounts_and_drop);
/**
* __d_alloc - allocate a dcache entry
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 62b43b577bfc..b1cd7b79a325 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -259,8 +259,7 @@ out:
invalid:
ret = 0;
- if (check_submounts_and_drop(entry) != 0)
- ret = 1;
+ shrink_submounts_and_drop(entry);
goto out;
}
diff --git a/fs/gfs2/dentry.c b/fs/gfs2/dentry.c
index d3a5d4e29ba5..2ecc2b873829 100644
--- a/fs/gfs2/dentry.c
+++ b/fs/gfs2/dentry.c
@@ -93,9 +93,7 @@ invalid_gunlock:
if (!had_lock)
gfs2_glock_dq_uninit(&d_gh);
invalid:
- if (check_submounts_and_drop(dentry) != 0)
- goto valid;
-
+ shrink_submounts_and_drop(dentry);
dput(parent);
return 0;
diff --git a/fs/namei.c b/fs/namei.c
index 645268f23eb6..b18b017c946b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3560,10 +3560,6 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
dget(dentry);
mutex_lock(&dentry->d_inode->i_mutex);
- error = -EBUSY;
- if (d_mountpoint(dentry))
- goto out;
-
error = security_inode_rmdir(dir, dentry);
if (error)
goto out;
@@ -3575,6 +3571,7 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
dentry->d_inode->i_flags |= S_DEAD;
dont_mount(dentry);
+ detach_mounts(dentry);
out:
mutex_unlock(&dentry->d_inode->i_mutex);
@@ -3657,14 +3654,12 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry)
return -EPERM;
mutex_lock(&dentry->d_inode->i_mutex);
- if (d_mountpoint(dentry))
- error = -EBUSY;
- else {
- error = security_inode_unlink(dir, dentry);
+ error = security_inode_unlink(dir, dentry);
+ if (!error) {
+ error = dir->i_op->unlink(dir, dentry);
if (!error) {
- error = dir->i_op->unlink(dir, dentry);
- if (!error)
- dont_mount(dentry);
+ dont_mount(dentry);
+ detach_mounts(dentry);
}
}
mutex_unlock(&dentry->d_inode->i_mutex);
@@ -3988,10 +3983,6 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
if (target)
mutex_lock(&target->i_mutex);
- error = -EBUSY;
- if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
- goto out;
-
error = -EMLINK;
if (max_links && !target && new_dir != old_dir &&
new_dir->i_nlink >= max_links)
@@ -4006,6 +3997,7 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
if (target) {
target->i_flags |= S_DEAD;
dont_mount(new_dentry);
+ detach_mounts(new_dentry);
}
out:
if (target)
@@ -4031,16 +4023,15 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
if (target)
mutex_lock(&target->i_mutex);
- error = -EBUSY;
- if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
- goto out;
-
error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
if (error)
goto out;
- if (target)
+ if (target) {
dont_mount(new_dentry);
+ detach_mounts(new_dentry);
+ }
+ detach_mounts(old_dentry);
if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE))
d_move(old_dentry, new_dentry);
out:
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 854a8f05a610..e8e35acd8850 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1142,10 +1142,7 @@ out_zap_parent:
if (dentry->d_flags & DCACHE_DISCONNECTED)
goto out_valid;
}
- /* If we have submounts, don't unhash ! */
- if (check_submounts_and_drop(dentry) != 0)
- goto out_valid;
-
+ shrink_submounts_and_drop(dentry);
dput(parent);
dfprintk(LOOKUPCACHE, "NFS: %s(%s/%s) is invalid\n",
__func__, dentry->d_parent->d_name.name,
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 4d83cedb9fcb..477c66d4e2a8 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -327,7 +327,6 @@ static int sysfs_dentry_revalidate(struct dentry *dentry, unsigned int flags)
}
mutex_unlock(&sysfs_mutex);
-out_valid:
return 1;
out_bad:
/* Remove the dentry from the dcache hashes.
@@ -341,13 +340,7 @@ out_bad:
* to the dcache hashes.
*/
mutex_unlock(&sysfs_mutex);
-
- /* If we have submounts we must allow the vfs caches
- * to lie about the state of the filesystem to prevent
- * leaks and other nasty things.
- */
- if (check_submounts_and_drop(dentry) != 0)
- goto out_valid;
+ shrink_submounts_and_drop(dentry);
return 0;
}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 59066e0b4ff1..17948b49f3d5 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -254,7 +254,8 @@ extern void d_prune_aliases(struct inode *);
/* test whether we have any submounts in a subdir tree */
extern int have_submounts(struct dentry *);
-extern int check_submounts_and_drop(struct dentry *);
+extern void detach_submounts(struct dentry *dentry);
+extern void shrink_submounts_and_drop(struct dentry *);
/*
* This adds the entry to the hash queues.
--
1.7.5.4
On Fri, Oct 4, 2013 at 3:41 PM, Eric W. Biederman <[email protected]> wrote:
>
> After thinking about it removing the restrictions on mount points
> appears safe, because it is just plain dumb to have a mount point
> in a directory that is not restricted to root only modifications.
>
> This is a change in user visible semantics, so I want to be very careful
> about this. Are there any reasons to not make this change?
At least one worry: people are very used to 'rmdir()' not removing
empty directories, and I've written code myself that just does an
'rmdir()' without worrying about it. I think git has code like "remove
file, then try to remove directory file is in, and recurse until it
fails or you hit the top of tree". And it all depends on knowing that
rmdir() is harmless, and returns the appropriate error when the
directory isn't empty.
And you're now changing that. If it's a mount-point, the rmdir just
succeeds, afaik.
Does anybody care? I dunno. But it looks like a _big_ semantic change.
That said, I like the _concept_ of being able to remove a mount-point
and the mount just goes away. But I do think that for sanity sake, it
should have something like "if one of the mounts is in the current
namespace, return -EBUSY". IOW, the patch-series would make the VFS
layer _able_ to remove mount-points, but a normal rmdir() when
something is mounted in that namespace would fail in order to give
legacy behavior.
Hmm?
Linus
Linus Torvalds <[email protected]> writes:
> On Fri, Oct 4, 2013 at 3:41 PM, Eric W. Biederman <[email protected]> wrote:
>>
>> After thinking about it removing the restrictions on mount points
>> appears safe, because it is just plain dumb to have a mount point
>> in a directory that is not restricted to root only modifications.
>>
>> This is a change in user visible semantics, so I want to be very careful
>> about this. Are there any reasons to not make this change?
>
> At least one worry: people are very used to 'rmdir()' not removing
> empty directories, and I've written code myself that just does an
> 'rmdir()' without worrying about it. I think git has code like "remove
> file, then try to remove directory file is in, and recurse until it
> fails or you hit the top of tree". And it all depends on knowing that
> rmdir() is harmless, and returns the appropriate error when the
> directory isn't empty.
>
> And you're now changing that. If it's a mount-point, the rmdir just
> succeeds, afaik.
>
> Does anybody care? I dunno. But it looks like a _big_ semantic change.
Which is definitley why I am asking and being careful.
> That said, I like the _concept_ of being able to remove a mount-point
> and the mount just goes away. But I do think that for sanity sake, it
> should have something like "if one of the mounts is in the current
> namespace, return -EBUSY". IOW, the patch-series would make the VFS
> layer _able_ to remove mount-points, but a normal rmdir() when
> something is mounted in that namespace would fail in order to give
> legacy behavior.
>
> Hmm?
In principle I have no problems tweaking rmdir to check for that case.
At the same time the real reason that this is safe is that mount points
are almost always part of trusted paths to important files and you just
don't mess with those paths.
So tweaking rmdir to fail would be more about making stupid mistakes
like running "rm -rf /" fail than it would be about security or
correctness.
My intuition agrees with yours that it is less surprising if rmdir was
forbidden in mount namespaces where it has a mount. Root would still be
able to do crazy things like:
# unshare --mount /bin/bash
# umount -l /sys
# mv /proc /sys
# exit
#
# ls /proc
ls: cannot access /proc No such file or directory
#
# ls /sys
1 2 3 4 acpi/ asound/ buddyinfo bus/ cmdline config.gz consoles cpuinfo
crypto devices diskstats dma dir/ driver/ execdomains fb filesystems
fs/ interrupts iomem ioports irq/ kallsyms kcore key-users kmsg
kpagecount kpageflags loadavg locks mdstat meminfo misc modules mounts/
mtrr pagetypeinfo partitions schedstat scsi/ self/ slabinfo softirqs
stat swaps sys/ sysrq-trigger sysvipc/ timer_list tty/ uptime version
vmallocinfo vmstat zoneinfo
#
I just noticed that Al's latest vfs changes posted yesterday mean I need
to rebase and possibly respin these patches, as all of the locking and
interesting bits of the dcache have changed. I don't think the
conflicts would be fun to resolve during the merge window.
Eric
[email protected] (Eric W. Biederman) writes:
> I just noticed that Al's latest vfs changes posted yesterday mean I need
> to rebase and possibly respin these patches, as all of the locking and
> interesting bits of the dcache have changed. I don't think the
> conflicts would be fun to resolve during the merge window.
On second look the only conflict is the rename of the vfsmount_lock to
lock_mount_hash, and a few bits of context. So nothing interesting.
Eric
Programs have been known to test for empty directories by attempting
to remove them. To keep from violating the principle of least
surprise don't let directories the caller can see with someting
mounted on them be deleted.
With a little luck this may prevent commands stupid commands
like rm -rf from eating your system.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/namei.c | 21 +++++++++++++++++++++
1 files changed, 21 insertions(+), 0 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index b18b017c946b..b9cae480ac27 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3547,6 +3547,20 @@ void dentry_unhash(struct dentry *dentry)
spin_unlock(&dentry->d_lock);
}
+static bool covered(struct vfsmount *mnt, struct dentry *dentry)
+{
+ /* test to see if a dentry is covered with a mount in
+ * the current mount namespace.
+ */
+ bool is_covered;
+
+ rcu_read_lock();
+ is_covered = d_mountpoint(dentry) && __lookup_mnt(mnt, dentry, 1);
+ rcu_read_unlock();
+
+ return is_covered;
+}
+
int vfs_rmdir(struct inode *dir, struct dentry *dentry)
{
int error = may_delete(dir, dentry, 1);
@@ -3619,6 +3633,9 @@ retry:
error = -ENOENT;
goto exit3;
}
+ error = -EBUSY;
+ if (covered(nd.path.mnt, dentry))
+ goto exit3;
error = security_path_rmdir(&nd.path, dentry);
if (error)
goto exit3;
@@ -4155,6 +4172,10 @@ retry:
error = -ENOTEMPTY;
if (new_dentry == trap)
goto exit5;
+ error = -EBUSY;
+ if (new_dentry->d_inode && S_ISDIR(new_dentry->d_inode->i_mode) &&
+ covered(newnd.path.mnt, new_dentry))
+ goto exit5;
error = security_path_rename(&oldnd.path, old_dentry,
&newnd.path, new_dentry);
--
1.7.5.4
Quoting Eric W. Biederman ([email protected]):
>
> Programs have been known to test for empty directories by attempting
> to remove them. To keep from violating the principle of least
> surprise don't let directories the caller can see with someting
> mounted on them be deleted.
>
> With a little luck this may prevent commands stupid commands
> like rm -rf from eating your system.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
Thanks. I've not had a chance to look at the implementation details
(hope to do so this weekend) but ack to the change itself being
needed. AFAIK this was the last uncomfortable piece about unprivileged
CLONE_NEWUSER.
> ---
> fs/namei.c | 21 +++++++++++++++++++++
> 1 files changed, 21 insertions(+), 0 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index b18b017c946b..b9cae480ac27 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3547,6 +3547,20 @@ void dentry_unhash(struct dentry *dentry)
> spin_unlock(&dentry->d_lock);
> }
>
> +static bool covered(struct vfsmount *mnt, struct dentry *dentry)
> +{
> + /* test to see if a dentry is covered with a mount in
> + * the current mount namespace.
> + */
> + bool is_covered;
> +
> + rcu_read_lock();
> + is_covered = d_mountpoint(dentry) && __lookup_mnt(mnt, dentry, 1);
> + rcu_read_unlock();
> +
> + return is_covered;
> +}
> +
> int vfs_rmdir(struct inode *dir, struct dentry *dentry)
> {
> int error = may_delete(dir, dentry, 1);
> @@ -3619,6 +3633,9 @@ retry:
> error = -ENOENT;
> goto exit3;
> }
> + error = -EBUSY;
> + if (covered(nd.path.mnt, dentry))
> + goto exit3;
> error = security_path_rmdir(&nd.path, dentry);
> if (error)
> goto exit3;
> @@ -4155,6 +4172,10 @@ retry:
> error = -ENOTEMPTY;
> if (new_dentry == trap)
> goto exit5;
> + error = -EBUSY;
> + if (new_dentry->d_inode && S_ISDIR(new_dentry->d_inode->i_mode) &&
> + covered(newnd.path.mnt, new_dentry))
> + goto exit5;
>
> error = security_path_rename(&oldnd.path, old_dentry,
> &newnd.path, new_dentry);
> --
> 1.7.5.4
On 10/04/2013 05:41:25 PM, Eric W. Biederman wrote:
>
> This patchset is an attempt to address two problems:
> 1) Not all modifications to the filesystems happen through the vfs and
> since the vfs can not cope with a mount point being unlinked or
> renamed filesystems whose modifications that do not come through
> the
> vfs are required to lie.
>
> 2) Through an oversight it is now possible for one unprivileged user
> to
> mount something on another unprivileged users dentry and make it
> impossible for the other user to unlink or rename that dentry.
>
> It is now technically possible to easily lift the restriction on
> unlinking and renaming files with mount points on them, with a
> corresponding reduction in complexity of the vfs semantics and a small
> code side reduction.
A todo item I've had _forever_ is fixing chroot() to not be broken so
that you can trivially break out of a chroot via:
chdir("/");
mkdir("sub");
chroot("sub");
chdir("./../../../../../../../..");
(Because chroot() affects where "/" points but NOT where "." points to,
and chdir does an == check with the dentry "/" points at to know when
to stop, so if you move "/" under "." you can back up to the actual
root of the tree.)
The above is why lxc uses pivot_root() instead of chroot().
These days, we have multiple mount trees so there's no reason chroot()
can't trim the process local mount tree (creating a new bind mount if
necessary). Except my todo list runneth over and I haven't had a chance
to dig in and see what would be involved. (Last time I brought this up
people were wondering why chroot() didn't just move "." to the new "/"
if it wasn't under it. I had no idea, still don't.)
Rob-
On Sat, Oct 5, 2013 at 4:07 PM, Rob Landley <[email protected]> wrote:
>
> A todo item I've had _forever_ is fixing chroot() to not be broken so that
> you can trivially break out of a chroot via:
What drugs are you on?
Your example is moronic, and against all _documented_ uses of chroot.
> chdir("/");
> mkdir("sub");
> chroot("sub");
> chdir("./../../../../../../../..");
After you do a chroot(), you need to chdir *into* the root. The reason
chroot() itself doesn't do that is simple: you may still be doing
various setup stuff.
But your example is just stupid. Yes, chroot'ed environments can
generally be escaped, but your example escape is simply because you
didn't use chroot() correctly.
So learn this pattern: every time you use chroot, add a simple
chdir("/");
immediately after the chroot call.
Then, if you decide that you want to do some setup in between the two
(like the interface allows), that's fine, but always start off with
that "chroot+chdir" pattern.
(Similarly, if it turns out that you want to chdir somewhere else,
like "/home/user" after the chroot, then you can obviously remove the
now superfluous chdir("/"), but you always conceptually start off with
that chroot/chdir pair)
Linus
On Sat, Oct 05, 2013 at 06:07:42PM -0500, Rob Landley wrote:
> A todo item I've had _forever_ is fixing chroot() to not be broken
> so that you can trivially break out of a chroot via:
>
> chdir("/");
> mkdir("sub");
> chroot("sub");
> chdir("./../../../../../../../..");
>
> (Because chroot() affects where "/" points but NOT where "." points
> to, and chdir does an == check with the dentry "/" points at to know
> when to stop, so if you move "/" under "." you can back up to the
> actual root of the tree.)
>
> The above is why lxc uses pivot_root() instead of chroot().
>
> These days, we have multiple mount trees so there's no reason
> chroot() can't trim the process local mount tree (creating a new
> bind mount if necessary). Except my todo list runneth over and I
> haven't had a chance to dig in and see what would be involved. (Last
> time I brought this up people were wondering why chroot() didn't
> just move "." to the new "/" if it wasn't under it. I had no idea,
> still don't.)
1) RTFUNIXFAQ. chroot() never has been root-proof.
2) your "fix" isn't - it will lead to mounts done by chrooted process
not affecting other processes in the same namespace.
On Sat, Oct 5, 2013 at 4:17 PM, Linus Torvalds
<[email protected]> wrote:
>
> So learn this pattern: every time you use chroot, add a simple
>
> chdir("/");
>
> immediately after the chroot call.
.. btw, also make sure that you close all non-essential file
descriptors. Having any open directory file descriptors pointing to
outside the chroot is also a classic escape.
Even then, escaping chroot is usually fairly easy. Making a
escape-proof chroot is really quite hard. Basically impossible if you
allow root.
Linus
On Sat, Oct 05, 2013 at 04:17:55PM -0700, Linus Torvalds wrote:
> On Sat, Oct 5, 2013 at 4:07 PM, Rob Landley <[email protected]> wrote:
> >
> > A todo item I've had _forever_ is fixing chroot() to not be broken so that
> > you can trivially break out of a chroot via:
>
> What drugs are you on?
>
> Your example is moronic, and against all _documented_ uses of chroot.
... and even fixing that example (chroot *can* be escaped, if you can
call chroot(2) - mkdir /foo, chdir /, chroot /foo, chroot .. and you are out)
the whole thing is idiocy - chroot() is not and has never been root-proof
and anybody expecting it to be has failed to read any number of FAQs out
there.
On 10/04/2013 07:03:23 PM, Eric W. Biederman wrote:
> Linus Torvalds <[email protected]> writes:
>
> > On Fri, Oct 4, 2013 at 3:41 PM, Eric W. Biederman
> <[email protected]> wrote:
> >>
> >> After thinking about it removing the restrictions on mount points
> >> appears safe, because it is just plain dumb to have a mount point
> >> in a directory that is not restricted to root only modifications.
> >>
> >> This is a change in user visible semantics, so I want to be very
> careful
> >> about this. Are there any reasons to not make this change?
> >
> > At least one worry: people are very used to 'rmdir()' not removing
> > empty directories, and I've written code myself that just does an
> > 'rmdir()' without worrying about it. I think git has code like
> "remove
> > file, then try to remove directory file is in, and recurse until it
> > fails or you hit the top of tree". And it all depends on knowing
> that
> > rmdir() is harmless, and returns the appropriate error when the
> > directory isn't empty.
> >
> > And you're now changing that. If it's a mount-point, the rmdir just
> > succeeds, afaik.
> >
> > Does anybody care? I dunno. But it looks like a _big_ semantic
> change.
>
> Which is definitley why I am asking and being careful.
>
> > That said, I like the _concept_ of being able to remove a
> mount-point
> > and the mount just goes away. But I do think that for sanity sake,
> it
> > should have something like "if one of the mounts is in the current
> > namespace, return -EBUSY". IOW, the patch-series would make the VFS
> > layer _able_ to remove mount-points, but a normal rmdir() when
> > something is mounted in that namespace would fail in order to give
> > legacy behavior.
> >
> > Hmm?
>
> In principle I have no problems tweaking rmdir to check for that case.
>
> At the same time the real reason that this is safe is that mount
> points
> are almost always part of trusted paths to important files and you
> just
> don't mess with those paths.
>
> So tweaking rmdir to fail would be more about making stupid mistakes
> like running "rm -rf /" fail than it would be about security or
> correctness.
If you do an rm -rf descending through a mount point, it's going to
delete the contents of the mount point before _trying_ to unlink the
mount point, which may be bad for the thing you mounted. Then there's
the fun corner case of "the directory wasn't empty before it was
mounted on, and umounting revealed the overmounted files, so the unlink
fails for that reason even after magic umount".
Doing rmdir on a non-empty directory won't delete it if it _isn't_ a
mount point, and presumably we require write access to directories to
mount on them, so in what ways is this different than another user
mucking about with my files asynchronously?
Oh, attached is a dumb "zapchroot" script I've been using for years to
unlink all mount points under a given directory, taking advantage of
the fact that mount points are appended to the end of the list so if
you unlink from the end to the front you should get the sub-mounts
before the parent mounts (modulo mount --move not reordering the list,
but that's uncommon).
Recently I noticed some kernels where chroot does _not_ trim the paths
so that the paths you see in /proc/mounts are relevant to the current
chroot but instead have all sorts of crap you can't access with no way
to know what it's talking about. That was sad, I need to go figure out
if that was distro breakage or vanilla breakage...
Rob
On 10/05/2013 06:17:55 PM, Linus Torvalds wrote:
> On Sat, Oct 5, 2013 at 4:07 PM, Rob Landley <[email protected]> wrote:
> >
> > A todo item I've had _forever_ is fixing chroot() to not be broken
> so that
> > you can trivially break out of a chroot via:
>
> What drugs are you on?
Enough caffeine to count as a plural.
> Your example is moronic, and against all _documented_ uses of chroot.
Yes. That's why it's an exploit.
My example allows someone who can chroot again to break out of a
previous chroot, which is the reason that lxc doesn't use chroot.
> > chdir("/");
> > mkdir("sub");
> > chroot("sub");
> > chdir("./../../../../../../../..");
>
> After you do a chroot(), you need to chdir *into* the root. The reason
> chroot() itself doesn't do that is simple: you may still be doing
> various setup stuff.
Yes, I asked you about this years ago before I implemented chroot and
switch_root for busybox and toybox, I know how to use it _right_. What
I'm saying is it doesn't prevent something running in the chroot from
doing it _wrong_, which is why lxc can't use chroot to implement
containers and instead uses pivot_root, which is a HORRIBLE system call
I'd like to move away from.
I want to implement a lightweight container command for toybox, but
when I dug into lxc and tried to swap their pivot_root for a simple
chroot, I got corrected.
> But your example is just stupid. Yes, chroot'ed environments can
> generally be escaped, but your example escape is simply because you
> didn't use chroot() correctly.
The above is code you run _in_ the chroot. After the first user used
chroot incorrectly.
The problem is that there's only one "/" symlink. If you move it down
under your current "." then nothing stops you from following .. all the
way up the tree. The fact you can do this after the fact, from within
an existing chroot, is the problem.
> So learn this pattern: every time you use chroot, add a simple
>
> chdir("/");
>
> immediately after the chroot call.
I do. But I can't stop other people from NOT doing that afterwards.
Rob-
On 10/05/2013 06:19:15 PM, Al Viro wrote:
> On Sat, Oct 05, 2013 at 06:07:42PM -0500, Rob Landley wrote:
> > A todo item I've had _forever_ is fixing chroot() to not be broken
> > so that you can trivially break out of a chroot via:
> >
> > chdir("/");
> > mkdir("sub");
> > chroot("sub");
> > chdir("./../../../../../../../..");
> >
> > (Because chroot() affects where "/" points but NOT where "." points
> > to, and chdir does an == check with the dentry "/" points at to know
> > when to stop, so if you move "/" under "." you can back up to the
> > actual root of the tree.)
> >
> > The above is why lxc uses pivot_root() instead of chroot().
> >
> > These days, we have multiple mount trees so there's no reason
> > chroot() can't trim the process local mount tree (creating a new
> > bind mount if necessary). Except my todo list runneth over and I
> > haven't had a chance to dig in and see what would be involved. (Last
> > time I brought this up people were wondering why chroot() didn't
> > just move "." to the new "/" if it wasn't under it. I had no idea,
> > still don't.)
>
> 1) RTFUNIXFAQ. chroot() never has been root-proof.
>
> 2) your "fix" isn't - it will lead to mounts done by chrooted process
> not affecting other processes in the same namespace.
So if I write a lightweight container setup command, I need to use
pivot_root just like lxc does?
Rob-
On 10/05/2013 06:22:15 PM, Linus Torvalds wrote:
> On Sat, Oct 5, 2013 at 4:17 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > So learn this pattern: every time you use chroot, add a simple
> >
> > chdir("/");
> >
> > immediately after the chroot call.
>
> .. btw, also make sure that you close all non-essential file
> descriptors. Having any open directory file descriptors pointing to
> outside the chroot is also a classic escape.
>
> Even then, escaping chroot is usually fairly easy. Making a
> escape-proof chroot is really quite hard. Basically impossible if you
> allow root.
Which is why containers have all sorts of extra plumbing. But that
extra plumbing is currently built on pivot_root(), not on chroot(). And
I'd dismissed pivot_root() as residue from initramfs with all that
kernel thread reparenting, so it seemed like the wrong tool for the
job, but obviously I'll take your and Al's word it's not...
Rob-
On 10/05/2013 06:24:51 PM, Al Viro wrote:
> On Sat, Oct 05, 2013 at 04:17:55PM -0700, Linus Torvalds wrote:
> > On Sat, Oct 5, 2013 at 4:07 PM, Rob Landley <[email protected]> wrote:
> > >
> > > A todo item I've had _forever_ is fixing chroot() to not be
> broken so that
> > > you can trivially break out of a chroot via:
> >
> > What drugs are you on?
> >
> > Your example is moronic, and against all _documented_ uses of
> chroot.
>
> ... and even fixing that example (chroot *can* be escaped, if you can
> call chroot(2) - mkdir /foo, chdir /, chroot /foo, chroot .. and you
> are out)
Um, wasn't that my example?
> the whole thing is idiocy - chroot() is not and has never been
> root-proof
> and anybody expecting it to be has failed to read any number of FAQs
> out
> there.
Which is why I was proposing modifying it so lxc and containers didn't
have to use pivot_root() instead, with all the logic to rehome existing
processes and the weird "this path is relative the old root, this path
is relative to the new root" syntax from initial ramdisks.
Rob-
On Sat, Oct 5, 2013 at 5:18 PM, Rob Landley <[email protected]> wrote:
>
> Which is why containers have all sorts of extra plumbing. But that extra
> plumbing is currently built on pivot_root(), not on chroot(). And I'd
> dismissed pivot_root() as residue from initramfs with all that kernel thread
> reparenting, so it seemed like the wrong tool for the job, but obviously
> I'll take your and Al's word it's not...
Yeah, chroot() really doesn't cut it if you allow root access - and
thus internal chroot() calls - as you noticed (I didn't realize that
your example was meant to be run _inside_ a chroot).
There are generally other ways to escape chroot too if you're root.
It's just too hard to plug. That doesn't make chroot() useless - it
just means that the uses are elsewhere (it's useful for various
non-security issues like development environments, but it can also be
useful as one small _part_ of some bigger model, like a VM etc).
pivot_root() does end up being a "better chroot than chroot" if you're
looking for containment. It may not be a pretty system call, but it
does avoid at least the most obvious gotchas with chroot()..
Linus
Rob Landley <[email protected]> writes:
> On 10/04/2013 07:03:23 PM, Eric W. Biederman wrote:
>>
>> In principle I have no problems tweaking rmdir to check for that case.
>>
>> At the same time the real reason that this is safe is that mount
>> points
>> are almost always part of trusted paths to important files and you
>> just
>> don't mess with those paths.
>>
>> So tweaking rmdir to fail would be more about making stupid mistakes
>> like running "rm -rf /" fail than it would be about security or
>> correctness.
>
> If you do an rm -rf descending through a mount point, it's going to
> delete the contents of the mount point before _trying_ to unlink the
> mount point, which may be bad for the thing you mounted. Then there's
> the fun corner case of "the directory wasn't empty before it was
> mounted on, and umounting revealed the overmounted files, so the unlink
> fails for that reason even after magic umount".
Yes. Adding the restrictions I added in 4/3 are really just about
preserving peoples experience that rmdir on a directory ls shows files
in fails. Just to be a little less surprising.
> Doing rmdir on a non-empty directory won't delete it if it _isn't_ a
> mount point, and presumably we require write access to directories to
> mount on them, so in what ways is this different than another user
> mucking about with my files asynchronously?
We don't require write access to files or directories before we mount on
them.
The permissions we require to mount is that the directory or file is
visible aka execute permissions all of the way to /, and that the
mounter has CAP_SYS_ADMIN in the user namespace that created the mount
namespace. Basically that the mounter is locally root.
In practice this means that if someone mounts something on your
directory it happens in another mount namespace and you don't care.
You don't even know about it unless you call rmdir and today that
returns -EBUSY.
The permission checks could be extended to ensure that the uid of the
file or directory is mapped in to the mounter's user namespace. However
that would arguably require extending the vfs -EBUSY handling to chown
or chmod. And even without that we are still left with the mess that
is the current magic insertion of -EBUSY because we have to deal with
mount point somehow.
So what I am proposing is removing the -EBUSY handling making it up to
whoever creates a mount to put the mount on path they trust if they
don't want the path to go away.
So for the owner of the directory who doesn't see the mount it is
nothing like a user scribbling into the directory.
With these patches it becomes almost exactly like scribling in someone
elses directory for the creator of the mount.
Further even without user namespaces and with root creating all of the
mounts we remove a current challenge in the VFS that requires
filesystems to lie about their directory structure.
Unless there is someone crazy enough to put mount points in directories
that are world writable I think my current patchset is a very nice
cleanup.
I have meandered all over the field. Did I answer your question?
I wasn't exactly certain what you were asking.
FYI: As for chroot you can not create a user namespace if you are
chrooted, so I suggest you file chroot under that old thing that was
an early poor approximation of mount namespaces that we don't need
anymore.
Eric
Quoting Eric W. Biederman ([email protected]):
>
> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
> fs/mount.h | 1 +
> fs/namespace.c | 24 ++++++++++++++++++++++++
> 2 files changed, 25 insertions(+), 0 deletions(-)
>
> diff --git a/fs/mount.h b/fs/mount.h
> index e4342b8dfab1..7a6a2bb3f290 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -79,6 +79,7 @@ static inline int is_mounted(struct vfsmount *mnt)
> }
>
> extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *, int);
> +extern void detach_mounts(struct dentry *dentry);
>
> static inline void get_mnt_ns(struct mnt_namespace *ns)
> {
> diff --git a/fs/namespace.c b/fs/namespace.c
> index d092964fe7f9..8eaee0c14fdb 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1294,6 +1294,30 @@ static int do_umount(struct mount *mnt, int flags)
> return retval;
> }
>
> +void detach_mounts(struct dentry *dentry)
> +{
> + struct mount *mnt, *next;
> + struct mountpoint *mp;
> +
> + namespace_lock();
> + if (!d_mountpoint(dentry)) {
> + namespace_unlock();
> + return;
> + }
> + mp = new_mountpoint(dentry);
> + if (IS_ERR(mp)) {
namespace_unlock();
> + return;
> + }
> + br_write_lock(&vfsmount_lock);
> + list_for_each_entry_safe(mnt, next, &mp->m_list, mnt_mp_list) {
> + if (!list_empty(&mnt->mnt_list))
> + umount_tree(mnt, 1);
> + }
> + br_write_unlock(&vfsmount_lock);
> + put_mountpoint(mp);
> + namespace_unlock();
> +}
> +
> /*
> * Is the caller allowed to modify his namespace?
> */
> --
> 1.7.5.4
Quoting Eric W. Biederman ([email protected]):
>
> Programs have been known to test for empty directories by attempting
> to remove them. To keep from violating the principle of least
> surprise don't let directories the caller can see with someting
> mounted on them be deleted.
Do you think we should do the same thing for over-mounted file at
vfs_unlink()?
> With a little luck this may prevent commands stupid commands
> like rm -rf from eating your system.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/namei.c | 21 +++++++++++++++++++++
> 1 files changed, 21 insertions(+), 0 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index b18b017c946b..b9cae480ac27 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3547,6 +3547,20 @@ void dentry_unhash(struct dentry *dentry)
> spin_unlock(&dentry->d_lock);
> }
>
> +static bool covered(struct vfsmount *mnt, struct dentry *dentry)
> +{
> + /* test to see if a dentry is covered with a mount in
> + * the current mount namespace.
> + */
> + bool is_covered;
> +
> + rcu_read_lock();
> + is_covered = d_mountpoint(dentry) && __lookup_mnt(mnt, dentry, 1);
> + rcu_read_unlock();
> +
> + return is_covered;
> +}
> +
> int vfs_rmdir(struct inode *dir, struct dentry *dentry)
> {
> int error = may_delete(dir, dentry, 1);
> @@ -3619,6 +3633,9 @@ retry:
> error = -ENOENT;
> goto exit3;
> }
> + error = -EBUSY;
> + if (covered(nd.path.mnt, dentry))
> + goto exit3;
> error = security_path_rmdir(&nd.path, dentry);
> if (error)
> goto exit3;
> @@ -4155,6 +4172,10 @@ retry:
> error = -ENOTEMPTY;
> if (new_dentry == trap)
> goto exit5;
> + error = -EBUSY;
> + if (new_dentry->d_inode && S_ISDIR(new_dentry->d_inode->i_mode) &&
> + covered(newnd.path.mnt, new_dentry))
> + goto exit5;
>
> error = security_path_rename(&oldnd.path, old_dentry,
> &newnd.path, new_dentry);
> --
> 1.7.5.4
"Serge E. Hallyn" <[email protected]> writes:
> Quoting Eric W. Biederman ([email protected]):
>>
>> Signed-off-by: Eric W. Biederman <[email protected]>
>> ---
>> fs/mount.h | 1 +
>> fs/namespace.c | 24 ++++++++++++++++++++++++
>> 2 files changed, 25 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/mount.h b/fs/mount.h
>> index e4342b8dfab1..7a6a2bb3f290 100644
>> --- a/fs/mount.h
>> +++ b/fs/mount.h
>> @@ -79,6 +79,7 @@ static inline int is_mounted(struct vfsmount *mnt)
>> }
>>
>> extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *, int);
>> +extern void detach_mounts(struct dentry *dentry);
>>
>> static inline void get_mnt_ns(struct mnt_namespace *ns)
>> {
>> diff --git a/fs/namespace.c b/fs/namespace.c
>> index d092964fe7f9..8eaee0c14fdb 100644
>> --- a/fs/namespace.c
>> +++ b/fs/namespace.c
>> @@ -1294,6 +1294,30 @@ static int do_umount(struct mount *mnt, int flags)
>> return retval;
>> }
>>
>> +void detach_mounts(struct dentry *dentry)
>> +{
>> + struct mount *mnt, *next;
>> + struct mountpoint *mp;
>> +
>> + namespace_lock();
>> + if (!d_mountpoint(dentry)) {
>> + namespace_unlock();
>> + return;
>> + }
>> + mp = new_mountpoint(dentry);
>> + if (IS_ERR(mp)) {
>
> namespace_unlock();
Good catch. Thank you.
Eric
"Serge E. Hallyn" <[email protected]> writes:
> Quoting Eric W. Biederman ([email protected]):
>>
>> Programs have been known to test for empty directories by attempting
>> to remove them. To keep from violating the principle of least
>> surprise don't let directories the caller can see with someting
>> mounted on them be deleted.
>
> Do you think we should do the same thing for over-mounted file at
> vfs_unlink()?
We easily could.
The point of the patch is to just preserve the directory is empty don't
allow rmdir to succeed semantics, and as typically we can see something
in the directory because of the mount it doesn't make sense for rmdir to
succeed.
unlink doesn't have any occassions when the permissions are sufficient
to remove a directory where it will fail. So I don't see the point of
doing this for anything except directories.
Except for possibly the oddball rmdir semantics mentioned I don't think
this patch should be part of anyone's correctness analysis.
It is easiest to see that this series of changes is semantically safe if
we are safe to run unprivileged code in a mount namespace where root has
locally unmounted every mount point.
We do have the restriction that in a user namespace we can't unmount
anything root was mounted outside the user namespace. Which combined
with the above patch would be roughly equivalent to todays mount
restrictions for the common case. Unfortunately being only roughly
equivalent the analysis gets very complicated, and complicated reasoning
usually means invalid reasoning.
So if we can feel safe just depending on the parent directory
permissions (which are not hidden by a mount) protecting our mount
points, I feel much better about this patchset.
But if you can articulate some reasons why it would be better and less
surprising for unlink to fail I am willing to listen.
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> fs/namei.c | 21 +++++++++++++++++++++
>> 1 files changed, 21 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/namei.c b/fs/namei.c
>> index b18b017c946b..b9cae480ac27 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -3547,6 +3547,20 @@ void dentry_unhash(struct dentry *dentry)
>> spin_unlock(&dentry->d_lock);
>> }
>>
>> +static bool covered(struct vfsmount *mnt, struct dentry *dentry)
>> +{
>> + /* test to see if a dentry is covered with a mount in
>> + * the current mount namespace.
>> + */
>> + bool is_covered;
>> +
>> + rcu_read_lock();
>> + is_covered = d_mountpoint(dentry) && __lookup_mnt(mnt, dentry, 1);
>> + rcu_read_unlock();
>> +
>> + return is_covered;
>> +}
>> +
>> int vfs_rmdir(struct inode *dir, struct dentry *dentry)
>> {
>> int error = may_delete(dir, dentry, 1);
>> @@ -3619,6 +3633,9 @@ retry:
>> error = -ENOENT;
>> goto exit3;
>> }
>> + error = -EBUSY;
>> + if (covered(nd.path.mnt, dentry))
>> + goto exit3;
>> error = security_path_rmdir(&nd.path, dentry);
>> if (error)
>> goto exit3;
>> @@ -4155,6 +4172,10 @@ retry:
>> error = -ENOTEMPTY;
>> if (new_dentry == trap)
>> goto exit5;
>> + error = -EBUSY;
>> + if (new_dentry->d_inode && S_ISDIR(new_dentry->d_inode->i_mode) &&
>> + covered(newnd.path.mnt, new_dentry))
>> + goto exit5;
>>
>> error = security_path_rename(&oldnd.path, old_dentry,
>> &newnd.path, new_dentry);
>> --
>> 1.7.5.4
On Mon, Oct 7, 2013 at 7:55 AM, Eric W. Biederman <[email protected]> wrote:
> "Serge E. Hallyn" <[email protected]> writes:
>
>> Quoting Eric W. Biederman ([email protected]):
>>>
>>> Programs have been known to test for empty directories by attempting
>>> to remove them. To keep from violating the principle of least
>>> surprise don't let directories the caller can see with someting
>>> mounted on them be deleted.
>>
>> Do you think we should do the same thing for over-mounted file at
>> vfs_unlink()?
>
> We easily could.
>
> The point of the patch is to just preserve the directory is empty don't
> allow rmdir to succeed semantics, and as typically we can see something
> in the directory because of the mount it doesn't make sense for rmdir to
> succeed.
>
> unlink doesn't have any occassions when the permissions are sufficient
> to remove a directory where it will fail. So I don't see the point of
> doing this for anything except directories.
>
> Except for possibly the oddball rmdir semantics mentioned I don't think
> this patch should be part of anyone's correctness analysis.
>
>
>
> It is easiest to see that this series of changes is semantically safe if
> we are safe to run unprivileged code in a mount namespace where root has
> locally unmounted every mount point.
>
> We do have the restriction that in a user namespace we can't unmount
> anything root was mounted outside the user namespace. Which combined
> with the above patch would be roughly equivalent to todays mount
> restrictions for the common case. Unfortunately being only roughly
> equivalent the analysis gets very complicated, and complicated reasoning
> usually means invalid reasoning.
>
>
> So if we can feel safe just depending on the parent directory
> permissions (which are not hidden by a mount) protecting our mount
> points, I feel much better about this patchset.
>
I feel safe about this (in the safe-against-attack sense, not
necessary the safe-against-confused-users sense) because the whole
point of preventing unmount of things that were mounted by root is to
prevent untrusted users from seeing what's under the mount. But rmdir
and unlink don't let you see what was under the mount because they
remove it.
(Hmm. I wonder if rename on a mountpoint could work. I've wanted
this on occasion.)
--Andy
Andy Lutomirski <[email protected]> writes:
> On Mon, Oct 7, 2013 at 7:55 AM, Eric W. Biederman <[email protected]> wrote:
>> "Serge E. Hallyn" <[email protected]> writes:
>>
>>> Quoting Eric W. Biederman ([email protected]):
>>>>
>>>> Programs have been known to test for empty directories by attempting
>>>> to remove them. To keep from violating the principle of least
>>>> surprise don't let directories the caller can see with someting
>>>> mounted on them be deleted.
>>>
>>> Do you think we should do the same thing for over-mounted file at
>>> vfs_unlink()?
>>
>> We easily could.
>>
>> The point of the patch is to just preserve the directory is empty don't
>> allow rmdir to succeed semantics, and as typically we can see something
>> in the directory because of the mount it doesn't make sense for rmdir to
>> succeed.
>>
>> unlink doesn't have any occassions when the permissions are sufficient
>> to remove a directory where it will fail. So I don't see the point of
>> doing this for anything except directories.
>>
>> Except for possibly the oddball rmdir semantics mentioned I don't think
>> this patch should be part of anyone's correctness analysis.
>>
>>
>>
>> It is easiest to see that this series of changes is semantically safe if
>> we are safe to run unprivileged code in a mount namespace where root has
>> locally unmounted every mount point.
>>
>> We do have the restriction that in a user namespace we can't unmount
>> anything root was mounted outside the user namespace. Which combined
>> with the above patch would be roughly equivalent to todays mount
>> restrictions for the common case. Unfortunately being only roughly
>> equivalent the analysis gets very complicated, and complicated reasoning
>> usually means invalid reasoning.
>>
>>
>> So if we can feel safe just depending on the parent directory
>> permissions (which are not hidden by a mount) protecting our mount
>> points, I feel much better about this patchset.
>>
>
> I feel safe about this (in the safe-against-attack sense, not
> necessary the safe-against-confused-users sense) because the whole
> point of preventing unmount of things that were mounted by root is to
> prevent untrusted users from seeing what's under the mount. But rmdir
> and unlink don't let you see what was under the mount because they
> remove it.
My premise is that the directory permissions of the directory that
contains the mountpoint are already restrictive enough that it safe to
rely on them for permission to unlink and thus unmount.
I believe my premise is correct however I am human and make mistakes.
As far as I can tell a world writable / or /dev has a lot of other potential
issues and similarly with other directories where we put mount points,
so people just won't create those directories with bad permissions.
I am reitterating just so that what I depend upon is clear and if anyone
knows a counter example they can tell me.
I think by preventing rmdir (when the mount is in the same mount
namespace) and allowing rename and unlink (that do not unlink a
directory) we present a minimal surprise to users.
> (Hmm. I wonder if rename on a mountpoint could work. I've wanted
> this on occasion.)
With this set of patches rename on a mountpoint is supported. That will
break /etc/mtab but in the cases where it really matters renaming a
mount point will break other things. Which makes in most cases renaming
a mount point a case of don't do that then, while still being well
defined for the cases where it is not a problem.
Eric
On Sat, Oct 05, 2013 at 06:42:44PM -0500, Rob Landley wrote:
> Oh, attached is a dumb "zapchroot" script I've been using for years to
> unlink all mount points under a given directory, taking advantage of the
> fact that mount points are appended to the end of the list so if you unlink
> from the end to the front you should get the sub-mounts before the parent
> mounts (modulo mount --move not reordering the list, but that's uncommon).
util-linux umount supports --recursive, it uses /proc/self/mountinfo
to compose the hierarchy. The important is that the mountinfo file
contains Id and Parent_Id relations, so you don't rely on the order
only.
> Recently I noticed some kernels where chroot does _not_ trim the paths so
> that the paths you see in /proc/mounts are relevant to the current chroot
> but instead have all sorts of crap you can't access with no way to know what
> it's talking about. That was sad, I need to go figure out if that was distro
> breakage or vanilla breakage...
hmm..
Karel
--
Karel Zak <[email protected]>
http://karelzak.blogspot.com
On 06.10.2013 23:55, Eric W. Biederman wrote:
> "Serge E. Hallyn" <[email protected]> writes:
>
> So if we can feel safe just depending on the parent directory
> permissions (which are not hidden by a mount) protecting our mount
> points, I feel much better about this patchset.
As far as i can tell, the permissions of the host-directory of a
mount-point are hidden, at least for user-space.
(Ignoring (bind-)mounting the parent-mount somewhere else)
As root:
$ mkdir /tmp/test
$ ls -ld /tmp/test
drwxr-xr-x 2 root root 40 Oct 8 12:33 /tmp/test
$ mount tmpfs -t tmpfs /tmp/test
$ ls -ld /tmp/test
drwxrwxrwt 2 root root 40 Oct 8 12:33 /tmp/test
$ chown nobody.users /tmp/test
$ ls -ld /tmp/test
drwxrwxrwt 2 nobody users 40 Oct 8 12:33 /tmp/test
$ umount /tmp/test
$ ls -ld /tmp/test
drwxr-xr-x 2 root root 40 Oct 8 12:33 /tmp/test
So if the kernel would check the host-directory-permissions for allowing
umounting by rmdir it follows that a "plain user" doesn't have any
possibility to know beforehand if rmdir/umount would be possible.
--
Matthias
Matthias Schniedermeyer <[email protected]> writes:
> On 06.10.2013 23:55, Eric W. Biederman wrote:
>> "Serge E. Hallyn" <[email protected]> writes:
>>
>> So if we can feel safe just depending on the parent directory
>> permissions (which are not hidden by a mount) protecting our mount
>> points, I feel much better about this patchset.
>
> As far as i can tell, the permissions of the host-directory of a
> mount-point are hidden, at least for user-space.
> (Ignoring (bind-)mounting the parent-mount somewhere else)
>
> As root:
> $ mkdir /tmp/test
> $ ls -ld /tmp/test
> drwxr-xr-x 2 root root 40 Oct 8 12:33 /tmp/test
>
> $ mount tmpfs -t tmpfs /tmp/test
> $ ls -ld /tmp/test
> drwxrwxrwt 2 root root 40 Oct 8 12:33 /tmp/test
>
> $ chown nobody.users /tmp/test
> $ ls -ld /tmp/test
> drwxrwxrwt 2 nobody users 40 Oct 8 12:33 /tmp/test
>
> $ umount /tmp/test
> $ ls -ld /tmp/test
> drwxr-xr-x 2 root root 40 Oct 8 12:33 /tmp/test
>
>
> So if the kernel would check the host-directory-permissions for allowing
> umounting by rmdir it follows that a "plain user" doesn't have any
> possibility to know beforehand if rmdir/umount would be possible.
Except the directory that is relevant to unlink/rmdir in your example is /tmp
not /tmp/test.
But thanks for the eyeball.
Eric
On Fri, Oct 04, 2013 at 03:42:37PM -0700, Eric W. Biederman wrote:
>
> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
> fs/mount.h | 2 ++
> fs/namespace.c | 5 +++++
> 2 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/fs/mount.h b/fs/mount.h
> index 64a858143ff9..e4342b8dfab1 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -21,6 +21,7 @@ struct mnt_pcp {
> struct mountpoint {
> struct list_head m_hash;
> struct dentry *m_dentry;
> + struct list_head m_list;
> int m_count;
> };
>
> @@ -47,6 +48,7 @@ struct mount {
> struct mount *mnt_master; /* slave is on master->mnt_slave_list */
> struct mnt_namespace *mnt_ns; /* containing namespace */
> struct mountpoint *mnt_mp; /* where is it mounted */
> + struct list_head mnt_mp_list; /* list mounts with the same mountpoint */
> #ifdef CONFIG_FSNOTIFY
> struct hlist_head mnt_fsnotify_marks;
> __u32 mnt_fsnotify_mask;
> diff --git a/fs/namespace.c b/fs/namespace.c
> index da5c49483430..d092964fe7f9 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -197,6 +197,7 @@ static struct mount *alloc_vfsmnt(const char *name)
> INIT_LIST_HEAD(&mnt->mnt_share);
> INIT_LIST_HEAD(&mnt->mnt_slave_list);
> INIT_LIST_HEAD(&mnt->mnt_slave);
> + INIT_LIST_HEAD(&mnt->mnt_mp_list);
> #ifdef CONFIG_FSNOTIFY
> INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
> #endif
> @@ -636,6 +637,7 @@ static struct mountpoint *new_mountpoint(struct dentry *dentry)
> mp->m_dentry = dentry;
> mp->m_count = 1;
> list_add(&mp->m_hash, chain);
> + INIT_LIST_HEAD(&mp->m_list);
> return mp;
> }
>
> @@ -691,6 +693,7 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
> list_del_init(&mnt->mnt_hash);
> put_mountpoint(mnt->mnt_mp);
> mnt->mnt_mp = NULL;
> + list_del_init(&mnt->mnt_mp_list);
Should be done *before* put_mountpoint(), for obvious reasons.
And a BUG_ON(!list_empty(&mp->m_list)) in put_mountpoint() for good measure (and
no, WARN_ON() is not better here, since use-after-free is definitely worse than
a BUG).
> }
>
> /*
> @@ -705,6 +708,7 @@ void mnt_set_mountpoint(struct mount *mnt,
> child_mnt->mnt_mountpoint = dget(mp->m_dentry);
> child_mnt->mnt_parent = mnt;
> child_mnt->mnt_mp = mp;
> + list_add_tail(&child_mnt->mnt_mp_list, &mp->m_list);
> }
>
> /*
> @@ -1193,6 +1197,7 @@ void umount_tree(struct mount *mnt, int propagate)
> p->mnt_parent->mnt_ghosts++;
> put_mountpoint(p->mnt_mp);
> p->mnt_mp = NULL;
> + list_del_init(&mnt->mnt_mp_list);
Ditto.
> }
> change_mnt_propagation(p, MS_PRIVATE);
> }
> --
> 1.7.5.4
>
On Fri, Oct 04, 2013 at 03:43:18PM -0700, Eric W. Biederman wrote:
>
> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
> fs/mount.h | 1 +
> fs/namespace.c | 24 ++++++++++++++++++++++++
> 2 files changed, 25 insertions(+), 0 deletions(-)
>
> diff --git a/fs/mount.h b/fs/mount.h
> index e4342b8dfab1..7a6a2bb3f290 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -79,6 +79,7 @@ static inline int is_mounted(struct vfsmount *mnt)
> }
>
> extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *, int);
> +extern void detach_mounts(struct dentry *dentry);
>
> static inline void get_mnt_ns(struct mnt_namespace *ns)
> {
> diff --git a/fs/namespace.c b/fs/namespace.c
> index d092964fe7f9..8eaee0c14fdb 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1294,6 +1294,30 @@ static int do_umount(struct mount *mnt, int flags)
> return retval;
> }
>
> +void detach_mounts(struct dentry *dentry)
> +{
> + struct mount *mnt, *next;
> + struct mountpoint *mp;
> +
> + namespace_lock();
> + if (!d_mountpoint(dentry)) {
> + namespace_unlock();
> + return;
> + }
> + mp = new_mountpoint(dentry);
> + if (IS_ERR(mp)) {
> + return;
> + }
> + br_write_lock(&vfsmount_lock);
> + list_for_each_entry_safe(mnt, next, &mp->m_list, mnt_mp_list) {
I don't think list_for_each_entry_safe is actually safe enough here. Because
propageted umounts will likely remove other mounts from the same mountpoint as
well. Doing it with "while (!list_empty(&mp->m_list))" should be better, and
AFAICS it will always make progress.
> + if (!list_empty(&mnt->mnt_list))
> + umount_tree(mnt, 1);
> + }
> + br_write_unlock(&vfsmount_lock);
> + put_mountpoint(mp);
> + namespace_unlock();
> +}
> +
> /*
> * Is the caller allowed to modify his namespace?
> */
> --
> 1.7.5.4
>
On Fri, Oct 04, 2013 at 03:43:56PM -0700, Eric W. Biederman wrote:
>
> With the introduction of mount namespaces and bind mounts it because
> possible to access files and directories that in other locations in
> were used as mount points. Especially with mount namespaces has
> become very confusing why rm -rf somedir return -EBUSY because some
> directory is mounted somewhere else. With the addition of user
> namespaces allowing unprivileged mounts this condition has gone from
> annoying to allowing a DOS attack on more privileged users.
>
> The simplest approach appears to be to remove the -EBUSY message,
> allow unlink and rename, and lazily unmount the mount point.
>
> In most cases this is less surprising as this is an implementation
> of the normal unix behavior of allowing unlinking of files.
>
> The change implemented in this patch allows the following to succeed:
>
> The vfs does not currently follow paths up to the final component for
> the rename and unlink system calls making the boldest version of this
> idea the simplest to implement. Which should it simple to spot problems
> with this idea.
>
> While different from our historical behavior this change does not look
> like it will break anything, or introduce any security
> vulnerabilities. In a quick survey of all of the common mount points
> on linux systems I found mount points in directories owned and
> modifiable by root, and fuse fuse mounts in directories owned by the
> ``mounter'' of the fuse filesystem. In both of these cases relying on
> the permissions of the directory does not practically change the user
> who is allowed to unmount the filesystem.
>
> Attempting to anticipate cases I have not witnessed I observe that
> every directory in a trusted path to a file must limit modification
> such that no one else may modify that directory. For files trusted by
> suid root executables root most own and be the only user capable of
> modifying the directory and all parent directories for the files to be
> safe. Therefore for mount points part of a trusted path only root
> should be able to unlink any directory or file on that path. Which
> means after this change for a secured path only root can unmount
> directories.
>
> For mount points part of a path we can not trust we should not care if
> the just disappear, as that is just another kind of arbitrary
> manipulation.
>
> So I conclude that the existing conditions will ensure that the permissions
> on directories will be sufficiently limited that the new unmount on unlink
> behavior will not cause problems.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/afs/dir.c | 3 +-
> fs/dcache.c | 80 ++++++++++++++++++++----------------------------
> fs/fuse/dir.c | 3 +-
> fs/gfs2/dentry.c | 4 +--
> fs/namei.c | 31 ++++++------------
> fs/nfs/dir.c | 5 +--
> fs/sysfs/dir.c | 9 +-----
> include/linux/dcache.h | 3 +-
> 8 files changed, 51 insertions(+), 87 deletions(-)
>
> diff --git a/fs/afs/dir.c b/fs/afs/dir.c
> index 646337dc5201..7fb69d45f1b9 100644
> --- a/fs/afs/dir.c
> +++ b/fs/afs/dir.c
> @@ -686,8 +686,7 @@ not_found:
>
> out_bad:
> /* don't unhash if we have submounts */
> - if (check_submounts_and_drop(dentry) != 0)
> - goto out_skip;
> + shrink_submounts_and_drop(dentry);
>
> _debug("dropping dentry %s/%s",
> parent->d_name.name, dentry->d_name.name);
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 41000305d716..1e9bf96b0132 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1373,7 +1373,7 @@ int d_set_mounted(struct dentry *dentry)
> int ret = -ENOENT;
> write_seqlock(&rename_lock);
> for (p = dentry->d_parent; !IS_ROOT(p); p = p->d_parent) {
> - /* Need exclusion wrt. check_submounts_and_drop() */
> + /* Need exclusion wrt. shrink_submounts_and_drop() */
> spin_lock(&p->d_lock);
> if (unlikely(d_unhashed(p))) {
> spin_unlock(&p->d_lock);
> @@ -1478,70 +1478,56 @@ void shrink_dcache_parent(struct dentry *parent)
> }
> EXPORT_SYMBOL(shrink_dcache_parent);
>
> -static enum d_walk_ret check_and_collect(void *_data, struct dentry *dentry)
> +struct detach_data {
> + struct dentry *found;
> +};
> +static enum d_walk_ret do_detach_submounts(void *ptr, struct dentry *dentry)
> {
> - struct select_data *data = _data;
> -
> - if (d_mountpoint(dentry)) {
> - data->found = -EBUSY;
> - return D_WALK_QUIT;
> - }
> -
> - return select_collect(_data, dentry);
> -}
> + struct detach_data *data = ptr;
>
> -static void check_and_drop(void *_data)
> -{
> - struct select_data *data = _data;
> + if (d_mountpoint(dentry))
> + data->found = dentry;
>
> - if (d_mountpoint(data->start))
> - data->found = -EBUSY;
> - if (!data->found)
> - __d_drop(data->start);
> + return data->found ? D_WALK_QUIT : D_WALK_CONTINUE;
> }
>
> /**
> - * check_submounts_and_drop - prune dcache, check for submounts and drop
> + * detach_submounts - check for submounts and detach them.
> *
> - * All done as a single atomic operation relative to has_unlinked_ancestor().
> - * Returns 0 if successfully unhashed @parent. If there were submounts then
> - * return -EBUSY.
> + * @dentry: dentry to find mount points under.
> *
> - * @dentry: dentry to prune and drop
> + * If dentry or any of it's children is a mount point detach those mounts.
> */
> -int check_submounts_and_drop(struct dentry *dentry)
> +void detach_submounts(struct dentry *dentry)
> {
> - int ret = 0;
> -
> - /* Negative dentries can be dropped without further checks */
> - if (!dentry->d_inode) {
> - d_drop(dentry);
> - goto out;
> - }
> -
> + struct detach_data data;
> for (;;) {
> - struct select_data data;
> -
> - INIT_LIST_HEAD(&data.dispose);
> - data.start = dentry;
> - data.found = 0;
> + data.found = NULL;
> + d_walk(dentry, &data, do_detach_submounts, NULL);
>
> - d_walk(dentry, &data, check_and_collect, check_and_drop);
> - ret = data.found;
> -
> - if (!list_empty(&data.dispose))
> - shrink_dentry_list(&data.dispose);
> -
> - if (ret <= 0)
> + if (!data.found)
> break;
>
> + detach_mounts(data.found);
> cond_resched();
> }
> + detach_mounts(dentry);
> +}
>
> -out:
> - return ret;
> +/**
> + * shrink_submounts_and_drop - detach submounts, prune dcache, and drop
> + *
> + * All done as a single atomic operation reletaive to d_set_mounted().
> + *
> + * @dentry: dentry to detach, prune and drop
> + */
> +void shrink_submounts_and_drop(struct dentry *dentry)
> +{
> + d_drop(dentry);
> + detach_submounts(dentry);
And here, between detach_submounts() and shrink_dcache_parent() a new mount can
be added.
It's not accidental that check_submounts_and_drop() did the check and the drop
together, protected by rename_lock and d_lock.
> + shrink_dcache_parent(dentry);
> }
> -EXPORT_SYMBOL(check_submounts_and_drop);
> +EXPORT_SYMBOL(shrink_submounts_and_drop);
>
> /**
> * __d_alloc - allocate a dcache entry
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 62b43b577bfc..b1cd7b79a325 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -259,8 +259,7 @@ out:
>
> invalid:
> ret = 0;
> - if (check_submounts_and_drop(entry) != 0)
> - ret = 1;
> + shrink_submounts_and_drop(entry);
> goto out;
> }
>
> diff --git a/fs/gfs2/dentry.c b/fs/gfs2/dentry.c
> index d3a5d4e29ba5..2ecc2b873829 100644
> --- a/fs/gfs2/dentry.c
> +++ b/fs/gfs2/dentry.c
> @@ -93,9 +93,7 @@ invalid_gunlock:
> if (!had_lock)
> gfs2_glock_dq_uninit(&d_gh);
> invalid:
> - if (check_submounts_and_drop(dentry) != 0)
> - goto valid;
> -
> + shrink_submounts_and_drop(dentry);
> dput(parent);
> return 0;
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 645268f23eb6..b18b017c946b 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3560,10 +3560,6 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
> dget(dentry);
> mutex_lock(&dentry->d_inode->i_mutex);
>
> - error = -EBUSY;
> - if (d_mountpoint(dentry))
> - goto out;
> -
> error = security_inode_rmdir(dir, dentry);
> if (error)
> goto out;
> @@ -3575,6 +3571,7 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
>
> dentry->d_inode->i_flags |= S_DEAD;
> dont_mount(dentry);
> + detach_mounts(dentry);
>
> out:
> mutex_unlock(&dentry->d_inode->i_mutex);
> @@ -3657,14 +3654,12 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry)
> return -EPERM;
>
> mutex_lock(&dentry->d_inode->i_mutex);
> - if (d_mountpoint(dentry))
> - error = -EBUSY;
> - else {
> - error = security_inode_unlink(dir, dentry);
> + error = security_inode_unlink(dir, dentry);
> + if (!error) {
> + error = dir->i_op->unlink(dir, dentry);
> if (!error) {
> - error = dir->i_op->unlink(dir, dentry);
> - if (!error)
> - dont_mount(dentry);
> + dont_mount(dentry);
> + detach_mounts(dentry);
> }
> }
> mutex_unlock(&dentry->d_inode->i_mutex);
> @@ -3988,10 +3983,6 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
> if (target)
> mutex_lock(&target->i_mutex);
>
> - error = -EBUSY;
> - if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
> - goto out;
> -
I know of at least one app that relied at some point on a mountpoint (directory
or non-directory) not being movable: fusermount uses this to ensure that
unprivileged userspace didn't try replacing a fuse mount with a symlink to trick
fusermount into umounting an arbitrary path. The code that relied on this was
replaced by UMOUNT_NOFOLLOW on kernels where it is supported. But in theory
there may exist a running binary without UMOUNT_NOFOLLOW and relying on EBUSY.
And there may be other such horrid hacks out there.
> error = -EMLINK;
> if (max_links && !target && new_dir != old_dir &&
> new_dir->i_nlink >= max_links)
> @@ -4006,6 +3997,7 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
> if (target) {
> target->i_flags |= S_DEAD;
> dont_mount(new_dentry);
> + detach_mounts(new_dentry);
> }
> out:
> if (target)
> @@ -4031,16 +4023,15 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> if (target)
> mutex_lock(&target->i_mutex);
>
> - error = -EBUSY;
> - if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
> - goto out;
> -
> error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
> if (error)
> goto out;
>
> - if (target)
> + if (target) {
> dont_mount(new_dentry);
> + detach_mounts(new_dentry);
> + }
> + detach_mounts(old_dentry);
Why exactly? "Moved file changes contents" is not the least surprising result,
IMO. And why the difference between rename-dir and rename-other in this regard?
> if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE))
> d_move(old_dentry, new_dentry);
> out:
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 854a8f05a610..e8e35acd8850 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -1142,10 +1142,7 @@ out_zap_parent:
> if (dentry->d_flags & DCACHE_DISCONNECTED)
> goto out_valid;
> }
> - /* If we have submounts, don't unhash ! */
> - if (check_submounts_and_drop(dentry) != 0)
> - goto out_valid;
> -
> + shrink_submounts_and_drop(dentry);
> dput(parent);
> dfprintk(LOOKUPCACHE, "NFS: %s(%s/%s) is invalid\n",
> __func__, dentry->d_parent->d_name.name,
> diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
> index 4d83cedb9fcb..477c66d4e2a8 100644
> --- a/fs/sysfs/dir.c
> +++ b/fs/sysfs/dir.c
> @@ -327,7 +327,6 @@ static int sysfs_dentry_revalidate(struct dentry *dentry, unsigned int flags)
> }
>
> mutex_unlock(&sysfs_mutex);
> -out_valid:
> return 1;
> out_bad:
> /* Remove the dentry from the dcache hashes.
> @@ -341,13 +340,7 @@ out_bad:
> * to the dcache hashes.
> */
> mutex_unlock(&sysfs_mutex);
> -
> - /* If we have submounts we must allow the vfs caches
> - * to lie about the state of the filesystem to prevent
> - * leaks and other nasty things.
> - */
> - if (check_submounts_and_drop(dentry) != 0)
> - goto out_valid;
> + shrink_submounts_and_drop(dentry);
>
> return 0;
> }
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 59066e0b4ff1..17948b49f3d5 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -254,7 +254,8 @@ extern void d_prune_aliases(struct inode *);
>
> /* test whether we have any submounts in a subdir tree */
> extern int have_submounts(struct dentry *);
> -extern int check_submounts_and_drop(struct dentry *);
> +extern void detach_submounts(struct dentry *dentry);
> +extern void shrink_submounts_and_drop(struct dentry *);
>
> /*
> * This adds the entry to the hash queues.
> --
> 1.7.5.4
>
On Mon, Oct 07, 2013 at 03:25:17PM -0700, Eric W. Biederman wrote:
> Andy Lutomirski <[email protected]> writes:
>
> > On Mon, Oct 7, 2013 at 7:55 AM, Eric W. Biederman <[email protected]> wrote:
> >> "Serge E. Hallyn" <[email protected]> writes:
> >>
> >>> Quoting Eric W. Biederman ([email protected]):
> >>>>
> >>>> Programs have been known to test for empty directories by attempting
> >>>> to remove them. To keep from violating the principle of least
> >>>> surprise don't let directories the caller can see with someting
> >>>> mounted on them be deleted.
> >>>
> >>> Do you think we should do the same thing for over-mounted file at
> >>> vfs_unlink()?
> >>
> >> We easily could.
And for vfs_rename()?
I think the risks of changing behavior outweigh the benefits. How many times
did you have to remove or rename a mounted file or directory? It's very rarely
needed.
Thanks,
Miklos
On Tue, Oct 8, 2013 at 9:06 AM, Miklos Szeredi <[email protected]> wrote:
> On Mon, Oct 07, 2013 at 03:25:17PM -0700, Eric W. Biederman wrote:
>> Andy Lutomirski <[email protected]> writes:
>>
>> > On Mon, Oct 7, 2013 at 7:55 AM, Eric W. Biederman <[email protected]> wrote:
>> >> "Serge E. Hallyn" <[email protected]> writes:
>> >>
>> >>> Quoting Eric W. Biederman ([email protected]):
>> >>>>
>> >>>> Programs have been known to test for empty directories by attempting
>> >>>> to remove them. To keep from violating the principle of least
>> >>>> surprise don't let directories the caller can see with someting
>> >>>> mounted on them be deleted.
>> >>>
>> >>> Do you think we should do the same thing for over-mounted file at
>> >>> vfs_unlink()?
>> >>
>> >> We easily could.
>
> And for vfs_rename()?
>
> I think the risks of changing behavior outweigh the benefits. How many times
> did you have to remove or rename a mounted file or directory? It's very rarely
> needed.
I do this every time I reinstall a system while running that system.
Admittedly, mount --move works, but that's a really unpleasant
interface.
When rename2 gets added, there could be a flag RENAME_MOVE_MOUNT to opt in.
--Andy
>
> Thanks,
> Miklos
--
Andy Lutomirski
AMA Capital Management, LLC
On Tue, Oct 08, 2013 at 09:06:29AM -0700, Andy Lutomirski wrote:
> > I think the risks of changing behavior outweigh the benefits. How many
> > times did you have to remove or rename a mounted file or directory? It's
> > very rarely needed.
>
> I do this every time I reinstall a system while running that system.
> Admittedly, mount --move works, but that's a really unpleasant
> interface.
>
> When rename2 gets added, there could be a flag RENAME_MOVE_MOUNT to opt in.
Good point. Opting in would be the safest for both unlinkat() and
renameat2().
Thanks,
Miklos
Miklos Szeredi <[email protected]> writes:
>> @@ -691,6 +693,7 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
>> list_del_init(&mnt->mnt_hash);
>> put_mountpoint(mnt->mnt_mp);
>> mnt->mnt_mp = NULL;
>> + list_del_init(&mnt->mnt_mp_list);
>
> Should be done *before* put_mountpoint(), for obvious reasons.
>
> And a BUG_ON(!list_empty(&mp->m_list)) in put_mountpoint() for good measure (and
> no, WARN_ON() is not better here, since use-after-free is definitely worse than
> a BUG).
Good point. I will take a look at making this change shortly and repost
the patches.
>> }
>>
>> /*
>> @@ -705,6 +708,7 @@ void mnt_set_mountpoint(struct mount *mnt,
>> child_mnt->mnt_mountpoint = dget(mp->m_dentry);
>> child_mnt->mnt_parent = mnt;
>> child_mnt->mnt_mp = mp;
>> + list_add_tail(&child_mnt->mnt_mp_list, &mp->m_list);
>> }
>>
>> /*
>> @@ -1193,6 +1197,7 @@ void umount_tree(struct mount *mnt, int propagate)
>> p->mnt_parent->mnt_ghosts++;
>> put_mountpoint(p->mnt_mp);
>> p->mnt_mp = NULL;
>> + list_del_init(&mnt->mnt_mp_list);
>
> Ditto.
>
>> }
>> change_mnt_propagation(p, MS_PRIVATE);
>> }
>> --
>> 1.7.5.4
>>
Miklos Szeredi <[email protected]> writes:
> On Fri, Oct 04, 2013 at 03:43:18PM -0700, Eric W. Biederman wrote:
>>
>> +void detach_mounts(struct dentry *dentry)
>> +{
>> + struct mount *mnt, *next;
>> + struct mountpoint *mp;
>> +
>> + namespace_lock();
>> + if (!d_mountpoint(dentry)) {
>> + namespace_unlock();
>> + return;
>> + }
>> + mp = new_mountpoint(dentry);
>> + if (IS_ERR(mp)) {
>> + return;
>> + }
>> + br_write_lock(&vfsmount_lock);
>> + list_for_each_entry_safe(mnt, next, &mp->m_list, mnt_mp_list) {
>
> I don't think list_for_each_entry_safe is actually safe enough here. Because
> propageted umounts will likely remove other mounts from the same mountpoint as
> well. Doing it with "while (!list_empty(&mp->m_list))" should be better, and
> AFAICS it will always make progress.
Thanks I will take a look. But on the surface that logic sounds
correct.
Eric
Miklos Szeredi <[email protected]> writes:
> On Tue, Oct 08, 2013 at 09:06:29AM -0700, Andy Lutomirski wrote:
>
>> > I think the risks of changing behavior outweigh the benefits. How many
>> > times did you have to remove or rename a mounted file or directory? It's
>> > very rarely needed.
>>
>> I do this every time I reinstall a system while running that system.
>> Admittedly, mount --move works, but that's a really unpleasant
>> interface.
>>
>> When rename2 gets added, there could be a flag RENAME_MOVE_MOUNT to opt in.
>
> Good point. Opting in would be the safest for both unlinkat() and
> renameat2().
Opting in to allowing mounts to be unlinked and renamed? Or opting in
to not renaming/unlinking a mount point?
I can see opting in to denying the rename/unlink if that is someting
someone wants for some reason. With an opt in we can even use the
previous d_mountpoint check and not stomp someone's else's weird set of
mounts in some other mount namespace if they are on the local machine
and that is desired.
The rmdir non-empty dir semantics justify blocking rmdir.
If we are going to fix the VFS deficiency we have to let these changes
happen in other mount namespaces. To make that safe it has to be
sufficient to rely on the directory permissions and the conditions that
ensure that the directory permissions are sufficient.
So I find it far safer to allow as much as possible even in the local
mount namespace so we can actually see if there are problems with
relying on the directory permissions.
Eric
Miklos Szeredi <[email protected]> writes:
> On Fri, Oct 04, 2013 at 03:43:56PM -0700, Eric W. Biederman wrote:
>> +/**
>> + * shrink_submounts_and_drop - detach submounts, prune dcache, and drop
>> + *
>> + * All done as a single atomic operation reletaive to d_set_mounted().
>> + *
>> + * @dentry: dentry to detach, prune and drop
>> + */
>> +void shrink_submounts_and_drop(struct dentry *dentry)
>> +{
>> + d_drop(dentry);
>> + detach_submounts(dentry);
>
> And here, between detach_submounts() and shrink_dcache_parent() a new mount can
> be added.
Actually it is not possible to add a new mount betwee detach_submounts
and shrink_dcache_parent d_set_mountpoint will see the unhashed
dentry as an unhashed parent and refuse to add any new mount points.
> It's not accidental that check_submounts_and_drop() did the check and the drop
> together, protected by rename_lock and d_lock.
When I looked the locking of detach_mount prevented me from using the
same technique as check_submounts_and_drop. But happily
d_set_mountpoint now checks for unhashed parents and prevents carnage at
that point.
>> + shrink_dcache_parent(dentry);
>> }
>> -EXPORT_SYMBOL(check_submounts_and_drop);
>> +EXPORT_SYMBOL(shrink_submounts_and_drop);
>>
>> @@ -3988,10 +3983,6 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
>> if (target)
>> mutex_lock(&target->i_mutex);
>>
>> - error = -EBUSY;
>> - if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
>> - goto out;
>> -
>
> I know of at least one app that relied at some point on a mountpoint (directory
> or non-directory) not being movable: fusermount uses this to ensure that
> unprivileged userspace didn't try replacing a fuse mount with a symlink to trick
> fusermount into umounting an arbitrary path. The code that relied on this was
> replaced by UMOUNT_NOFOLLOW on kernels where it is supported. But in theory
> there may exist a running binary without UMOUNT_NOFOLLOW and relying on EBUSY.
>
> And there may be other such horrid hacks out there.
That is certainly worth thinking about.
In practice if I rename a parent directory I can rename mount points
today. So I don't know that it was ever fully safe to rely on rename
prevention.
We may be able to keep this rule just for the local mount namespace if
that would help.
If we are going to remove lying to the VFS removing the d_mountpoint
restriction of rename is necessary.
It gets scary to rely on the checks only applying locally, because of
things like mount --bind /some/path /some/other/path, and because of
cases like sysfs and nfs where the mutation path is not through the
local vfs and simply bypasses all of these changes.
So I will think about this case some more but I don't think I can make
this change and remain hack compatile with old kernels.
I hope we don't find any horrible hacks that we absolutely must support
because at best that puts everything back to the drawing board.
But I will go through and read the old fusermount code before I get too
much farther just so I understand what I am potentially breaking.
>> error = -EMLINK;
>> if (max_links && !target && new_dir != old_dir &&
>> new_dir->i_nlink >= max_links)
>> @@ -4006,6 +3997,7 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
>> if (target) {
>> target->i_flags |= S_DEAD;
>> dont_mount(new_dentry);
>> + detach_mounts(new_dentry);
>> }
>> out:
>> if (target)
>> @@ -4031,16 +4023,15 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
>> if (target)
>> mutex_lock(&target->i_mutex);
>>
>> - error = -EBUSY;
>> - if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
>> - goto out;
>> -
>> error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
>> if (error)
>> goto out;
>>
>> - if (target)
>> + if (target) {
>> dont_mount(new_dentry);
>> + detach_mounts(new_dentry);
>> + }
>> + detach_mounts(old_dentry);
>
> Why exactly? "Moved file changes contents" is not the least surprising result,
> IMO. And why the difference between rename-dir and rename-other in
> this regard?
Because detach_mounts(old_dentry) is a bug.
detach_mounts(new_dentry) is needed so that it is safe to put new_dentry
a few lines
Eric
[email protected] (Eric W. Biederman) writes:
> But I will go through and read the old fusermount code before I get too
> much farther just so I understand what I am potentially breaking.
Grr.
So I have just read the fusermount umount code and the hack that it uses
before there was UMOUNT_NOFOLLOW support in the vm.
If I walk this path of lazy unmounts and detaching directories, anyone
with a new kernel and an old copy of fusermount and a nfs mounted home
directory will be able to exploit the fusermount umount symlink race.
Unless we can declare that old fusermount binaries are buggy beyond
supporting this patchset as it exists is dead.
Eric
On Wed, Oct 9, 2013 at 12:12 PM, Eric W. Biederman
<[email protected]> wrote:
> [email protected] (Eric W. Biederman) writes:
>
>> But I will go through and read the old fusermount code before I get too
>> much farther just so I understand what I am potentially breaking.
>
> Grr.
>
> So I have just read the fusermount umount code and the hack that it uses
> before there was UMOUNT_NOFOLLOW support in the vm.
>
> If I walk this path of lazy unmounts and detaching directories, anyone
> with a new kernel and an old copy of fusermount and a nfs mounted home
> directory will be able to exploit the fusermount umount symlink race.
>
> Unless we can declare that old fusermount binaries are buggy beyond
> supporting this patchset as it exists is dead.
What's the hack that it does?
--Andy
Andy Lutomirski <[email protected]> writes:
> On Wed, Oct 9, 2013 at 12:12 PM, Eric W. Biederman
> <[email protected]> wrote:
>> [email protected] (Eric W. Biederman) writes:
>>
>>> But I will go through and read the old fusermount code before I get too
>>> much farther just so I understand what I am potentially breaking.
>>
>> Grr.
>>
>> So I have just read the fusermount umount code and the hack that it uses
>> before there was UMOUNT_NOFOLLOW support in the vm.
>>
>> If I walk this path of lazy unmounts and detaching directories, anyone
>> with a new kernel and an old copy of fusermount and a nfs mounted home
>> directory will be able to exploit the fusermount umount symlink race.
>>
>> Unless we can declare that old fusermount binaries are buggy beyond
>> supporting this patchset as it exists is dead.
>
> What's the hack that it does?
The problem is that "umount -l /some/user/supplied/path" can unmount
anything.
Even after checking mtab etc there are races, and replacing "path" with
a symlink will still allow you to unmount anything.
So when not usering UMOUNT_NOFOLLOW fusermount will do:
cd "/some/user/supplied/"
clone(CLONE_NEWNS)
mount --make-rprivate /
COUNT=$(cat /proc/mounts | wc -l)
mount --rbind . /
Look at new /proc/mounts entries past $COUNT and see if there is an
entry for "path"
if $found
exit 0
else
exit 1
wait
if ($? == 0) umount -l "path"
As everything happens in the directory just below the mount point
we are only concerned about things that happen in that directory.
This does mean there is a window between checking that path is
a valid fuse mount point and when that mountpoint is unmounted.
So if someone can unlink/rename the mount point and replace it
with a symlink during that small window it is possible to unmount
anything.
Today the d_mountpoint tests and the requirement distributed filesystems
(aka nfs) lie to the vfs prevent those issues.
The UMOUNT_NOFOLLOW code which appears in fuse 2.9.0 is new enough that
distro's like CentOS6 don't ship it yet.
Eric
On 10/08/2013 03:03:03 AM, Karel Zak wrote:
> On Sat, Oct 05, 2013 at 06:42:44PM -0500, Rob Landley wrote:
> > Oh, attached is a dumb "zapchroot" script I've been using for years
> to
> > unlink all mount points under a given directory, taking advantage
> of the
> > fact that mount points are appended to the end of the list so if
> you unlink
> > from the end to the front you should get the sub-mounts before the
> parent
> > mounts (modulo mount --move not reordering the list, but that's
> uncommon).
>
> util-linux umount supports --recursive, it uses /proc/self/mountinfo
> to compose the hierarchy. The important is that the mountinfo file
> contains Id and Parent_Id relations, so you don't rely on the order
> only.
Ah, that's what happened.
For some reason /proc/self/mounts stopped adjusting itself for chroot a
while back, apparently because containers use switch_root instead so
chroot is generally deprecated or something? This made /proc/mounts
completely useless in a chroot because the paths it showed were not the
ones you actually had to use to umount anything.
Instead they added a new way to get this info, in a new format where
you have to parse the 5th field out of each line to get the mount point
(to make it friendly to scripts and tools).
Good to know.
Rob-
On Tue, Oct 8, 2013 at 10:50 PM, Eric W. Biederman
<[email protected]> wrote:
> If we are going to fix the VFS deficiency we have to let these changes
> happen in other mount namespaces. To make that safe it has to be
> sufficient to rely on the directory permissions and the conditions that
> ensure that the directory permissions are sufficient.
Yes.
> So I find it far safer to allow as much as possible even in the local
> mount namespace so we can actually see if there are problems with
> relying on the directory permissions.
But it's not safer in terms of breaking legacy environments, which are
single namespace, usually.
I see it this way: the current behavior (blocking unlink/rename if
mounted) may not be nice, but it's expected. But once we have
multiple namespaces it becomes problematic. Just like network
filesystems where we cannot prevent unlink on another host just
because it's mounted on this. So we have to have some way of dealing
with this.
Is it worth changing behavor in the "current namespace" case too? It
has advantages, but it's not absolutely necessary and there are
unknown dangers. But if we disable it for rmdir() then what's the
point in enabling it for unlink()? Directory mounts far outnumber
non-directory ones, so the additional testing is minimal, but the
holes left are probably just as real.
And there's rename(). We have a real security hole in fusermount by
allowing it in the local namespace. I don't think changing the
behavior of rename() in the local namespace is important enough to
risk such problems. And as Andy pointed out, we can just have an
option to turn it on and that's that.
Thanks,
Miklos
Miklos Szeredi <[email protected]> writes:
> On Tue, Oct 8, 2013 at 10:50 PM, Eric W. Biederman
> <[email protected]> wrote:
>
>> If we are going to fix the VFS deficiency we have to let these changes
>> happen in other mount namespaces. To make that safe it has to be
>> sufficient to rely on the directory permissions and the conditions that
>> ensure that the directory permissions are sufficient.
>
> Yes.
>
>> So I find it far safer to allow as much as possible even in the local
>> mount namespace so we can actually see if there are problems with
>> relying on the directory permissions.
>
> But it's not safer in terms of breaking legacy environments, which are
> single namespace, usually.
>
> I see it this way: the current behavior (blocking unlink/rename if
> mounted) may not be nice, but it's expected. But once we have
> multiple namespaces it becomes problematic. Just like network
> filesystems where we cannot prevent unlink on another host just
> because it's mounted on this. So we have to have some way of dealing
> with this.
>
> Is it worth changing behavor in the "current namespace" case too? It
> has advantages, but it's not absolutely necessary and there are
> unknown dangers. But if we disable it for rmdir() then what's the
> point in enabling it for unlink()? Directory mounts far outnumber
> non-directory ones, so the additional testing is minimal, but the
> holes left are probably just as real.
>
> And there's rename(). We have a real security hole in fusermount by
> allowing it in the local namespace. I don't think changing the
> behavior of rename() in the local namespace is important enough to
> risk such problems. And as Andy pointed out, we can just have an
> option to turn it on and that's that.
I have been weighing the pros and the cons of this.
At this point the most practical path I can see is to block unlink,
rename, and rmdir if there is a mount in the local namespace.
At the very least that makes very limited additions to what applications
can depend on from the vfs, and even more importantly that can be ready
by 3.13.
Furthermore it removes the nasty need for filesystems to lie to the vfs,
and removes the nasty DOS that non-local mountpoints are.
Miklos if you as the fuse maintainer aren't worried about network
filesystems, and multiple namespaces I won't worry either. Especially
since modern versions of fuse aren't affected.
I have a implementable looking plan B if we really need it of creating
ghost dentries for mounts to ride on if their underlying dentry gets
renamed or unlinked. And I don't see this removing the possibility of
implementing that plan B. But ick. I think deciding we need plan B
would just enough more work that it would kill getting any change made
in a timely fashion. So I will take my little blue pill and ignore the
possible slings and arrows of outrageous fortune, and see what happens.
Eric
On Thu, Oct 10, 2013 at 1:43 PM, Eric W. Biederman
<[email protected]> wrote:
> I have been weighing the pros and the cons of this.
>
> At this point the most practical path I can see is to block unlink,
> rename, and rmdir if there is a mount in the local namespace.
>
> At the very least that makes very limited additions to what applications
> can depend on from the vfs, and even more importantly that can be ready
> by 3.13.
>
> Furthermore it removes the nasty need for filesystems to lie to the vfs,
> and removes the nasty DOS that non-local mountpoints are.
Agreed 100%.
> Miklos if you as the fuse maintainer aren't worried about network
> filesystems, and multiple namespaces I won't worry either. Especially
> since modern versions of fuse aren't affected.
I think the above conditions (local mount blocks unlink/rename) are
enough to prevent most of the problems, of which there aren't many in
any case.
Thanks,
Miklos
Miklos Szeredi <[email protected]> writes:
> On Thu, Oct 10, 2013 at 1:43 PM, Eric W. Biederman
>> Miklos if you as the fuse maintainer aren't worried about network
>> filesystems, and multiple namespaces I won't worry either. Especially
>> since modern versions of fuse aren't affected.
>
> I think the above conditions (local mount blocks unlink/rename) are
> enough to prevent most of the problems, of which there aren't many in
> any case.
Dumb question.
What prevents someone setting up a race between the fusermount
permission checks and replacing the destination with a symlink, perhaps
to /etc/shadow?
Do we need a MS_NOFOLLOW?
Eric
[email protected] (Eric W. Biederman) writes:
> Miklos Szeredi <[email protected]> writes:
>
>> On Thu, Oct 10, 2013 at 1:43 PM, Eric W. Biederman
>>> Miklos if you as the fuse maintainer aren't worried about network
>>> filesystems, and multiple namespaces I won't worry either. Especially
>>> since modern versions of fuse aren't affected.
>>
>> I think the above conditions (local mount blocks unlink/rename) are
>> enough to prevent most of the problems, of which there aren't many in
>> any case.
>
> Dumb question.
>
> What prevents someone setting up a race between the fusermount
> permission checks and replacing the destination with a symlink, perhaps
> to /etc/shadow?
>
> Do we need a MS_NOFOLLOW?
Doh! mount(".",...) works just fine..
My apologies for the silly question.
Eric
This patchset is an addresses two problems:
1) Not all modifications to the filesystems happen through the vfs and
since the vfs can not cope with a mount point being unlinked or
renamed filesystems whose modifications that do not come through the
vfs are required to lie.
2) Through an oversight it is now possible for one unprivileged user to
mount something on another unprivileged users dentry and make it
impossible for the other user to unlink or rename that dentry.
It is now technically possible to easily lift the restriction on
unlinking and renaming files with mount points on them, with a
corresponding reduction in complexity of the vfs semantics.
After review it seems that there are no objections to this approach as
long as we retain the -EBUSY semantics for rmdir, unlink, and rename of
mount points in the current mount namespace. The first patch in this
series now adds those local mount namespace restrictions.
All of the review comments should now be addressed and folded in, and
I have take a careful look and it appears what I have is now correct
and complete. So I am posting this for one last round of review.
Al if you want to take this through the vfs tree, point me at a branch
and I will give you versions of these patches that apply cleanly there.
Otherwise I will push these patches to my userns tree as soon as all of
these patches pass review.
Eric W. Biederman (4):
vfs: Don't allow overwriting mounts in the current mount namespace
vfs: Keep a list of mounts on a mount point
vfs: Add a function to lazily unmount all mounts from any dentry. v3
vfs: Lazily remove mounts on unlinked files and directories. v2
fs/afs/dir.c | 3 +-
fs/dcache.c | 80 ++++++++++++++++++++----------------------------
fs/fuse/dir.c | 3 +-
fs/gfs2/dentry.c | 4 +--
fs/mount.h | 3 ++
fs/namei.c | 55 +++++++++++++++++++++------------
fs/namespace.c | 30 ++++++++++++++++++
fs/nfs/dir.c | 5 +--
fs/sysfs/dir.c | 9 +-----
include/linux/dcache.h | 3 +-
10 files changed, 108 insertions(+), 87 deletions(-)
In preparation for allowing mountpoints to be renamed and unlinked
in remote filesystems and in other mount namespaces test if on a dentry
there is a mount in the local mount namespace before allowing it to
be renamed or unlinked.
The primary motivation here are old versions of fusermount unmount
which is not safe if the a path can be renamed or unlinked while it is
verifying the mount is safe to unmount. More recent versions are simpler
and safer by simply using UMOUNT_NOFOLLOW when unmounting a mount
in a directory owned by an arbitrary user.
Miklos Szeredi <[email protected]> reports this is approach is good
enough to remove concerns about new kernels mixed with old versions
of fusermount.
A secondary motivation for restrictions here is that it removing empty
directories that have non-empty mount points on them appears to
violate the rule that rmdir can not remove empty directories. As
Linus Torvalds pointed out this is useful for programs (like git) that
test if a directory is empty with rmdir.
Therefore this patch arranges to enforce the existing mount point
semantics for local mount namespace.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/namei.c | 25 +++++++++++++++++++++++++
1 files changed, 25 insertions(+), 0 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 645268f23eb6..df7bd6e9c7b6 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3547,6 +3547,20 @@ void dentry_unhash(struct dentry *dentry)
spin_unlock(&dentry->d_lock);
}
+static bool covered(struct vfsmount *mnt, struct dentry *dentry)
+{
+ /* test to see if a dentry is covered with a mount in
+ * the current mount namespace.
+ */
+ bool is_covered;
+
+ rcu_read_lock();
+ is_covered = d_mountpoint(dentry) && __lookup_mnt(mnt, dentry, 1);
+ rcu_read_unlock();
+
+ return is_covered;
+}
+
int vfs_rmdir(struct inode *dir, struct dentry *dentry)
{
int error = may_delete(dir, dentry, 1);
@@ -3622,6 +3636,9 @@ retry:
error = -ENOENT;
goto exit3;
}
+ error = -EBUSY;
+ if (covered(nd.path.mnt, dentry))
+ goto exit3;
error = security_path_rmdir(&nd.path, dentry);
if (error)
goto exit3;
@@ -3716,6 +3733,9 @@ retry:
inode = dentry->d_inode;
if (!inode)
goto slashes;
+ error = -EBUSY;
+ if (covered(nd.path.mnt, dentry))
+ goto exit2;
ihold(inode);
error = security_path_unlink(&nd.path, dentry);
if (error)
@@ -4164,6 +4184,11 @@ retry:
error = -ENOTEMPTY;
if (new_dentry == trap)
goto exit5;
+ error = -EBUSY;
+ if (covered(oldnd.path.mnt, old_dentry))
+ goto exit5;
+ if (covered(newnd.path.mnt, new_dentry))
+ goto exit5;
error = security_path_rename(&oldnd.path, old_dentry,
&newnd.path, new_dentry);
--
1.7.5.4
To spot any possible problems call BUG if a mountpoint
is put when it's list of mounts is not empty.
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/mount.h | 2 ++
fs/namespace.c | 6 ++++++
2 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/fs/mount.h b/fs/mount.h
index 64a858143ff9..e4342b8dfab1 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -21,6 +21,7 @@ struct mnt_pcp {
struct mountpoint {
struct list_head m_hash;
struct dentry *m_dentry;
+ struct list_head m_list;
int m_count;
};
@@ -47,6 +48,7 @@ struct mount {
struct mount *mnt_master; /* slave is on master->mnt_slave_list */
struct mnt_namespace *mnt_ns; /* containing namespace */
struct mountpoint *mnt_mp; /* where is it mounted */
+ struct list_head mnt_mp_list; /* list mounts with the same mountpoint */
#ifdef CONFIG_FSNOTIFY
struct hlist_head mnt_fsnotify_marks;
__u32 mnt_fsnotify_mask;
diff --git a/fs/namespace.c b/fs/namespace.c
index da5c49483430..e4fe22c23e00 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -197,6 +197,7 @@ static struct mount *alloc_vfsmnt(const char *name)
INIT_LIST_HEAD(&mnt->mnt_share);
INIT_LIST_HEAD(&mnt->mnt_slave_list);
INIT_LIST_HEAD(&mnt->mnt_slave);
+ INIT_LIST_HEAD(&mnt->mnt_mp_list);
#ifdef CONFIG_FSNOTIFY
INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
#endif
@@ -636,6 +637,7 @@ static struct mountpoint *new_mountpoint(struct dentry *dentry)
mp->m_dentry = dentry;
mp->m_count = 1;
list_add(&mp->m_hash, chain);
+ INIT_LIST_HEAD(&mp->m_list);
return mp;
}
@@ -643,6 +645,7 @@ static void put_mountpoint(struct mountpoint *mp)
{
if (!--mp->m_count) {
struct dentry *dentry = mp->m_dentry;
+ BUG_ON(!list_empty(&mp->m_list));
spin_lock(&dentry->d_lock);
dentry->d_flags &= ~DCACHE_MOUNTED;
spin_unlock(&dentry->d_lock);
@@ -689,6 +692,7 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
mnt->mnt_mountpoint = mnt->mnt.mnt_root;
list_del_init(&mnt->mnt_child);
list_del_init(&mnt->mnt_hash);
+ list_del_init(&mnt->mnt_mp_list);
put_mountpoint(mnt->mnt_mp);
mnt->mnt_mp = NULL;
}
@@ -705,6 +709,7 @@ void mnt_set_mountpoint(struct mount *mnt,
child_mnt->mnt_mountpoint = dget(mp->m_dentry);
child_mnt->mnt_parent = mnt;
child_mnt->mnt_mp = mp;
+ list_add_tail(&child_mnt->mnt_mp_list, &mp->m_list);
}
/*
@@ -1191,6 +1196,7 @@ void umount_tree(struct mount *mnt, int propagate)
list_del_init(&p->mnt_child);
if (mnt_has_parent(p)) {
p->mnt_parent->mnt_ghosts++;
+ list_del_init(&p->mnt_mp_list);
put_mountpoint(p->mnt_mp);
p->mnt_mp = NULL;
}
--
1.7.5.4
v2: Always drop the lock when exiting early.
v3: Make detach_mounts robust about freeing several
mounts on the same mountpoint at one time, and remove
the unneeded mnt_list list test.
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/mount.h | 1 +
fs/namespace.c | 24 ++++++++++++++++++++++++
2 files changed, 25 insertions(+), 0 deletions(-)
diff --git a/fs/mount.h b/fs/mount.h
index e4342b8dfab1..7a6a2bb3f290 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -79,6 +79,7 @@ static inline int is_mounted(struct vfsmount *mnt)
}
extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *, int);
+extern void detach_mounts(struct dentry *dentry);
static inline void get_mnt_ns(struct mnt_namespace *ns)
{
diff --git a/fs/namespace.c b/fs/namespace.c
index e4fe22c23e00..78f7c5c9e673 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1295,6 +1295,30 @@ static int do_umount(struct mount *mnt, int flags)
return retval;
}
+void detach_mounts(struct dentry *dentry)
+{
+ struct mountpoint *mp;
+ struct mount *mnt;
+
+ namespace_lock();
+ if (!d_mountpoint(dentry))
+ goto out_unlock;
+
+ mp = new_mountpoint(dentry);
+ if (IS_ERR(mp))
+ goto out_unlock;
+
+ br_write_lock(&vfsmount_lock);
+ while (!list_empty(&mp->m_list)) {
+ mnt = list_first_entry(&mp->m_list, struct mount, mnt_mp_list);
+ umount_tree(mnt, 1);
+ }
+ br_write_unlock(&vfsmount_lock);
+ put_mountpoint(mp);
+out_unlock:
+ namespace_unlock();
+}
+
/*
* Is the caller allowed to modify his namespace?
*/
--
1.7.5.4
With the introduction of mount namespaces and bind mounts it became
possible to access files and directories that on some paths are mount
points but are not mount points on other paths. It is very confusing
when rm -rf somedir returns -EBUSY simply because somedir is mounted
somewhere else. With the addition of user namespaces allowing
unprivileged mounts this condition has gone from annoying to allowing
a DOS attack on other users in the system.
The possibility for mischief is removed by updating the vfs to support
rename, unlink and rmdir on a dentry that is a mountpoint and by
lazily unmounting mountpoints on deleted dentries.
In particular this change allows rename, unlink and rmdir system calls
on a dentry without a mountpoint in the current mount namespace to
succeed, and it allows rename, unlink, and rmdir performed on a
distributed filesystem to update the vfs cache even if when there is a
mount in some namespace on the original dentry.
There are two common patterns of maintaining mounts: Mounts on trusted
paths with the parent directory of the mount point and all ancestory
directories up to / owned by root and modifiable only by root
(i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu,
cpuacct, ...}, /usr, /usr/local). Mounts on unprivileged directories
maintained by fusermount.
In the case of mounts in trusted directories owned by root and
modifiable only by root the current parent directory permissions are
sufficient to ensure a mount point on a trusted path is not removed
or renamed by anyone other than root, even if there is a context
where the there are no mount points to prevent this.
In the case of mounts in directories owned by less privileged users
races with users modifying the path of a mount point are already a
danger. fusermount already uses a combination of chdir,
/proc/<pid>/fd/NNN, and UMOUNT_NOFOLLOW to prevent these races. The
removable of global rename, unlink, and rmdir protection really adds
nothing new to consider only a widening of the attack window, and
fusermount is already safe against unprivileged users modifying the
directory simultaneously.
In principle for perfect userspace programs returning -EBUSY for
unlink, rmdir, and rename of dentires that have mounts in the local
namespace is actually unnecessary. Unfortunately not all userspace
programs are perfect so retaining -EBUSY for unlink, rmdir and rename
of dentries that have mounts in the current mount namespace plays an
important role of maintaining consistency with historical behavior and
making imperfect userspace applications hard to exploit.
v2: Remove spurious old_dentry.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/afs/dir.c | 3 +-
fs/dcache.c | 80 ++++++++++++++++++++----------------------------
fs/fuse/dir.c | 3 +-
fs/gfs2/dentry.c | 4 +--
fs/namei.c | 30 ++++++------------
fs/nfs/dir.c | 5 +--
fs/sysfs/dir.c | 9 +-----
include/linux/dcache.h | 3 +-
8 files changed, 50 insertions(+), 87 deletions(-)
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 646337dc5201..7fb69d45f1b9 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -686,8 +686,7 @@ not_found:
out_bad:
/* don't unhash if we have submounts */
- if (check_submounts_and_drop(dentry) != 0)
- goto out_skip;
+ shrink_submounts_and_drop(dentry);
_debug("dropping dentry %s/%s",
parent->d_name.name, dentry->d_name.name);
diff --git a/fs/dcache.c b/fs/dcache.c
index 41000305d716..1e9bf96b0132 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1373,7 +1373,7 @@ int d_set_mounted(struct dentry *dentry)
int ret = -ENOENT;
write_seqlock(&rename_lock);
for (p = dentry->d_parent; !IS_ROOT(p); p = p->d_parent) {
- /* Need exclusion wrt. check_submounts_and_drop() */
+ /* Need exclusion wrt. shrink_submounts_and_drop() */
spin_lock(&p->d_lock);
if (unlikely(d_unhashed(p))) {
spin_unlock(&p->d_lock);
@@ -1478,70 +1478,56 @@ void shrink_dcache_parent(struct dentry *parent)
}
EXPORT_SYMBOL(shrink_dcache_parent);
-static enum d_walk_ret check_and_collect(void *_data, struct dentry *dentry)
+struct detach_data {
+ struct dentry *found;
+};
+static enum d_walk_ret do_detach_submounts(void *ptr, struct dentry *dentry)
{
- struct select_data *data = _data;
-
- if (d_mountpoint(dentry)) {
- data->found = -EBUSY;
- return D_WALK_QUIT;
- }
-
- return select_collect(_data, dentry);
-}
+ struct detach_data *data = ptr;
-static void check_and_drop(void *_data)
-{
- struct select_data *data = _data;
+ if (d_mountpoint(dentry))
+ data->found = dentry;
- if (d_mountpoint(data->start))
- data->found = -EBUSY;
- if (!data->found)
- __d_drop(data->start);
+ return data->found ? D_WALK_QUIT : D_WALK_CONTINUE;
}
/**
- * check_submounts_and_drop - prune dcache, check for submounts and drop
+ * detach_submounts - check for submounts and detach them.
*
- * All done as a single atomic operation relative to has_unlinked_ancestor().
- * Returns 0 if successfully unhashed @parent. If there were submounts then
- * return -EBUSY.
+ * @dentry: dentry to find mount points under.
*
- * @dentry: dentry to prune and drop
+ * If dentry or any of it's children is a mount point detach those mounts.
*/
-int check_submounts_and_drop(struct dentry *dentry)
+void detach_submounts(struct dentry *dentry)
{
- int ret = 0;
-
- /* Negative dentries can be dropped without further checks */
- if (!dentry->d_inode) {
- d_drop(dentry);
- goto out;
- }
-
+ struct detach_data data;
for (;;) {
- struct select_data data;
-
- INIT_LIST_HEAD(&data.dispose);
- data.start = dentry;
- data.found = 0;
+ data.found = NULL;
+ d_walk(dentry, &data, do_detach_submounts, NULL);
- d_walk(dentry, &data, check_and_collect, check_and_drop);
- ret = data.found;
-
- if (!list_empty(&data.dispose))
- shrink_dentry_list(&data.dispose);
-
- if (ret <= 0)
+ if (!data.found)
break;
+ detach_mounts(data.found);
cond_resched();
}
+ detach_mounts(dentry);
+}
-out:
- return ret;
+/**
+ * shrink_submounts_and_drop - detach submounts, prune dcache, and drop
+ *
+ * All done as a single atomic operation reletaive to d_set_mounted().
+ *
+ * @dentry: dentry to detach, prune and drop
+ */
+void shrink_submounts_and_drop(struct dentry *dentry)
+{
+ d_drop(dentry);
+ detach_submounts(dentry);
+ shrink_dcache_parent(dentry);
}
-EXPORT_SYMBOL(check_submounts_and_drop);
+EXPORT_SYMBOL(shrink_submounts_and_drop);
/**
* __d_alloc - allocate a dcache entry
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 62b43b577bfc..b1cd7b79a325 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -259,8 +259,7 @@ out:
invalid:
ret = 0;
- if (check_submounts_and_drop(entry) != 0)
- ret = 1;
+ shrink_submounts_and_drop(entry);
goto out;
}
diff --git a/fs/gfs2/dentry.c b/fs/gfs2/dentry.c
index d3a5d4e29ba5..2ecc2b873829 100644
--- a/fs/gfs2/dentry.c
+++ b/fs/gfs2/dentry.c
@@ -93,9 +93,7 @@ invalid_gunlock:
if (!had_lock)
gfs2_glock_dq_uninit(&d_gh);
invalid:
- if (check_submounts_and_drop(dentry) != 0)
- goto valid;
-
+ shrink_submounts_and_drop(dentry);
dput(parent);
return 0;
diff --git a/fs/namei.c b/fs/namei.c
index df7bd6e9c7b6..a12c1d31d4c8 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3574,10 +3574,6 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
dget(dentry);
mutex_lock(&dentry->d_inode->i_mutex);
- error = -EBUSY;
- if (d_mountpoint(dentry))
- goto out;
-
error = security_inode_rmdir(dir, dentry);
if (error)
goto out;
@@ -3589,6 +3585,7 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
dentry->d_inode->i_flags |= S_DEAD;
dont_mount(dentry);
+ detach_mounts(dentry);
out:
mutex_unlock(&dentry->d_inode->i_mutex);
@@ -3674,14 +3671,12 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry)
return -EPERM;
mutex_lock(&dentry->d_inode->i_mutex);
- if (d_mountpoint(dentry))
- error = -EBUSY;
- else {
- error = security_inode_unlink(dir, dentry);
+ error = security_inode_unlink(dir, dentry);
+ if (!error) {
+ error = dir->i_op->unlink(dir, dentry);
if (!error) {
- error = dir->i_op->unlink(dir, dentry);
- if (!error)
- dont_mount(dentry);
+ dont_mount(dentry);
+ detach_mounts(dentry);
}
}
mutex_unlock(&dentry->d_inode->i_mutex);
@@ -4008,10 +4003,6 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
if (target)
mutex_lock(&target->i_mutex);
- error = -EBUSY;
- if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
- goto out;
-
error = -EMLINK;
if (max_links && !target && new_dir != old_dir &&
new_dir->i_nlink >= max_links)
@@ -4026,6 +4017,7 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
if (target) {
target->i_flags |= S_DEAD;
dont_mount(new_dentry);
+ detach_mounts(new_dentry);
}
out:
if (target)
@@ -4051,16 +4043,14 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
if (target)
mutex_lock(&target->i_mutex);
- error = -EBUSY;
- if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
- goto out;
-
error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
if (error)
goto out;
- if (target)
+ if (target) {
dont_mount(new_dentry);
+ detach_mounts(new_dentry);
+ }
if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE))
d_move(old_dentry, new_dentry);
out:
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 854a8f05a610..e8e35acd8850 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1142,10 +1142,7 @@ out_zap_parent:
if (dentry->d_flags & DCACHE_DISCONNECTED)
goto out_valid;
}
- /* If we have submounts, don't unhash ! */
- if (check_submounts_and_drop(dentry) != 0)
- goto out_valid;
-
+ shrink_submounts_and_drop(dentry);
dput(parent);
dfprintk(LOOKUPCACHE, "NFS: %s(%s/%s) is invalid\n",
__func__, dentry->d_parent->d_name.name,
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 4d83cedb9fcb..477c66d4e2a8 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -327,7 +327,6 @@ static int sysfs_dentry_revalidate(struct dentry *dentry, unsigned int flags)
}
mutex_unlock(&sysfs_mutex);
-out_valid:
return 1;
out_bad:
/* Remove the dentry from the dcache hashes.
@@ -341,13 +340,7 @@ out_bad:
* to the dcache hashes.
*/
mutex_unlock(&sysfs_mutex);
-
- /* If we have submounts we must allow the vfs caches
- * to lie about the state of the filesystem to prevent
- * leaks and other nasty things.
- */
- if (check_submounts_and_drop(dentry) != 0)
- goto out_valid;
+ shrink_submounts_and_drop(dentry);
return 0;
}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 59066e0b4ff1..17948b49f3d5 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -254,7 +254,8 @@ extern void d_prune_aliases(struct inode *);
/* test whether we have any submounts in a subdir tree */
extern int have_submounts(struct dentry *);
-extern int check_submounts_and_drop(struct dentry *);
+extern void detach_submounts(struct dentry *dentry);
+extern void shrink_submounts_and_drop(struct dentry *);
/*
* This adds the entry to the hash queues.
--
1.7.5.4
Quoting Eric W. Biederman ([email protected]):
>
> In preparation for allowing mountpoints to be renamed and unlinked
> in remote filesystems and in other mount namespaces test if on a dentry
> there is a mount in the local mount namespace before allowing it to
> be renamed or unlinked.
>
> The primary motivation here are old versions of fusermount unmount
> which is not safe if the a path can be renamed or unlinked while it is
> verifying the mount is safe to unmount. More recent versions are simpler
> and safer by simply using UMOUNT_NOFOLLOW when unmounting a mount
> in a directory owned by an arbitrary user.
>
> Miklos Szeredi <[email protected]> reports this is approach is good
> enough to remove concerns about new kernels mixed with old versions
> of fusermount.
>
> A secondary motivation for restrictions here is that it removing empty
> directories that have non-empty mount points on them appears to
> violate the rule that rmdir can not remove empty directories. As
> Linus Torvalds pointed out this is useful for programs (like git) that
> test if a directory is empty with rmdir.
>
> Therefore this patch arranges to enforce the existing mount point
> semantics for local mount namespace.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
> ---
> fs/namei.c | 25 +++++++++++++++++++++++++
> 1 files changed, 25 insertions(+), 0 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 645268f23eb6..df7bd6e9c7b6 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3547,6 +3547,20 @@ void dentry_unhash(struct dentry *dentry)
> spin_unlock(&dentry->d_lock);
> }
>
> +static bool covered(struct vfsmount *mnt, struct dentry *dentry)
> +{
> + /* test to see if a dentry is covered with a mount in
> + * the current mount namespace.
> + */
> + bool is_covered;
> +
> + rcu_read_lock();
> + is_covered = d_mountpoint(dentry) && __lookup_mnt(mnt, dentry, 1);
> + rcu_read_unlock();
> +
> + return is_covered;
> +}
> +
> int vfs_rmdir(struct inode *dir, struct dentry *dentry)
> {
> int error = may_delete(dir, dentry, 1);
> @@ -3622,6 +3636,9 @@ retry:
> error = -ENOENT;
> goto exit3;
> }
> + error = -EBUSY;
> + if (covered(nd.path.mnt, dentry))
> + goto exit3;
> error = security_path_rmdir(&nd.path, dentry);
> if (error)
> goto exit3;
> @@ -3716,6 +3733,9 @@ retry:
> inode = dentry->d_inode;
> if (!inode)
> goto slashes;
> + error = -EBUSY;
> + if (covered(nd.path.mnt, dentry))
> + goto exit2;
> ihold(inode);
> error = security_path_unlink(&nd.path, dentry);
> if (error)
> @@ -4164,6 +4184,11 @@ retry:
> error = -ENOTEMPTY;
> if (new_dentry == trap)
> goto exit5;
> + error = -EBUSY;
> + if (covered(oldnd.path.mnt, old_dentry))
> + goto exit5;
> + if (covered(newnd.path.mnt, new_dentry))
> + goto exit5;
>
> error = security_path_rename(&oldnd.path, old_dentry,
> &newnd.path, new_dentry);
> --
> 1.7.5.4
Quoting Eric W. Biederman ([email protected]):
>
> To spot any possible problems call BUG if a mountpoint
> is put when it's list of mounts is not empty.
>
> Signed-off-by: Eric W. Biederman <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
> ---
> fs/mount.h | 2 ++
> fs/namespace.c | 6 ++++++
> 2 files changed, 8 insertions(+), 0 deletions(-)
>
> diff --git a/fs/mount.h b/fs/mount.h
> index 64a858143ff9..e4342b8dfab1 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -21,6 +21,7 @@ struct mnt_pcp {
> struct mountpoint {
> struct list_head m_hash;
> struct dentry *m_dentry;
> + struct list_head m_list;
> int m_count;
> };
>
> @@ -47,6 +48,7 @@ struct mount {
> struct mount *mnt_master; /* slave is on master->mnt_slave_list */
> struct mnt_namespace *mnt_ns; /* containing namespace */
> struct mountpoint *mnt_mp; /* where is it mounted */
> + struct list_head mnt_mp_list; /* list mounts with the same mountpoint */
> #ifdef CONFIG_FSNOTIFY
> struct hlist_head mnt_fsnotify_marks;
> __u32 mnt_fsnotify_mask;
> diff --git a/fs/namespace.c b/fs/namespace.c
> index da5c49483430..e4fe22c23e00 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -197,6 +197,7 @@ static struct mount *alloc_vfsmnt(const char *name)
> INIT_LIST_HEAD(&mnt->mnt_share);
> INIT_LIST_HEAD(&mnt->mnt_slave_list);
> INIT_LIST_HEAD(&mnt->mnt_slave);
> + INIT_LIST_HEAD(&mnt->mnt_mp_list);
> #ifdef CONFIG_FSNOTIFY
> INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
> #endif
> @@ -636,6 +637,7 @@ static struct mountpoint *new_mountpoint(struct dentry *dentry)
> mp->m_dentry = dentry;
> mp->m_count = 1;
> list_add(&mp->m_hash, chain);
> + INIT_LIST_HEAD(&mp->m_list);
> return mp;
> }
>
> @@ -643,6 +645,7 @@ static void put_mountpoint(struct mountpoint *mp)
> {
> if (!--mp->m_count) {
> struct dentry *dentry = mp->m_dentry;
> + BUG_ON(!list_empty(&mp->m_list));
> spin_lock(&dentry->d_lock);
> dentry->d_flags &= ~DCACHE_MOUNTED;
> spin_unlock(&dentry->d_lock);
> @@ -689,6 +692,7 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
> mnt->mnt_mountpoint = mnt->mnt.mnt_root;
> list_del_init(&mnt->mnt_child);
> list_del_init(&mnt->mnt_hash);
> + list_del_init(&mnt->mnt_mp_list);
> put_mountpoint(mnt->mnt_mp);
> mnt->mnt_mp = NULL;
> }
> @@ -705,6 +709,7 @@ void mnt_set_mountpoint(struct mount *mnt,
> child_mnt->mnt_mountpoint = dget(mp->m_dentry);
> child_mnt->mnt_parent = mnt;
> child_mnt->mnt_mp = mp;
> + list_add_tail(&child_mnt->mnt_mp_list, &mp->m_list);
> }
>
> /*
> @@ -1191,6 +1196,7 @@ void umount_tree(struct mount *mnt, int propagate)
> list_del_init(&p->mnt_child);
> if (mnt_has_parent(p)) {
> p->mnt_parent->mnt_ghosts++;
> + list_del_init(&p->mnt_mp_list);
> put_mountpoint(p->mnt_mp);
> p->mnt_mp = NULL;
> }
> --
> 1.7.5.4
Quoting Eric W. Biederman ([email protected]):
>
> v2: Always drop the lock when exiting early.
> v3: Make detach_mounts robust about freeing several
> mounts on the same mountpoint at one time, and remove
> the unneeded mnt_list list test.
>
> Signed-off-by: Eric W. Biederman <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
> ---
> fs/mount.h | 1 +
> fs/namespace.c | 24 ++++++++++++++++++++++++
> 2 files changed, 25 insertions(+), 0 deletions(-)
>
> diff --git a/fs/mount.h b/fs/mount.h
> index e4342b8dfab1..7a6a2bb3f290 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -79,6 +79,7 @@ static inline int is_mounted(struct vfsmount *mnt)
> }
>
> extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *, int);
> +extern void detach_mounts(struct dentry *dentry);
>
> static inline void get_mnt_ns(struct mnt_namespace *ns)
> {
> diff --git a/fs/namespace.c b/fs/namespace.c
> index e4fe22c23e00..78f7c5c9e673 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1295,6 +1295,30 @@ static int do_umount(struct mount *mnt, int flags)
> return retval;
> }
>
> +void detach_mounts(struct dentry *dentry)
> +{
> + struct mountpoint *mp;
> + struct mount *mnt;
> +
> + namespace_lock();
> + if (!d_mountpoint(dentry))
> + goto out_unlock;
> +
> + mp = new_mountpoint(dentry);
> + if (IS_ERR(mp))
> + goto out_unlock;
> +
> + br_write_lock(&vfsmount_lock);
> + while (!list_empty(&mp->m_list)) {
> + mnt = list_first_entry(&mp->m_list, struct mount, mnt_mp_list);
> + umount_tree(mnt, 1);
> + }
> + br_write_unlock(&vfsmount_lock);
> + put_mountpoint(mp);
> +out_unlock:
> + namespace_unlock();
> +}
> +
> /*
> * Is the caller allowed to modify his namespace?
> */
> --
> 1.7.5.4
Quoting Eric W. Biederman ([email protected]):
>
> With the introduction of mount namespaces and bind mounts it became
> possible to access files and directories that on some paths are mount
> points but are not mount points on other paths. It is very confusing
> when rm -rf somedir returns -EBUSY simply because somedir is mounted
> somewhere else. With the addition of user namespaces allowing
> unprivileged mounts this condition has gone from annoying to allowing
> a DOS attack on other users in the system.
>
> The possibility for mischief is removed by updating the vfs to support
> rename, unlink and rmdir on a dentry that is a mountpoint and by
> lazily unmounting mountpoints on deleted dentries.
>
> In particular this change allows rename, unlink and rmdir system calls
> on a dentry without a mountpoint in the current mount namespace to
> succeed, and it allows rename, unlink, and rmdir performed on a
> distributed filesystem to update the vfs cache even if when there is a
> mount in some namespace on the original dentry.
>
> There are two common patterns of maintaining mounts: Mounts on trusted
> paths with the parent directory of the mount point and all ancestory
> directories up to / owned by root and modifiable only by root
> (i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu,
> cpuacct, ...}, /usr, /usr/local). Mounts on unprivileged directories
> maintained by fusermount.
>
> In the case of mounts in trusted directories owned by root and
> modifiable only by root the current parent directory permissions are
> sufficient to ensure a mount point on a trusted path is not removed
> or renamed by anyone other than root, even if there is a context
> where the there are no mount points to prevent this.
>
> In the case of mounts in directories owned by less privileged users
> races with users modifying the path of a mount point are already a
> danger. fusermount already uses a combination of chdir,
> /proc/<pid>/fd/NNN, and UMOUNT_NOFOLLOW to prevent these races. The
> removable of global rename, unlink, and rmdir protection really adds
> nothing new to consider only a widening of the attack window, and
> fusermount is already safe against unprivileged users modifying the
> directory simultaneously.
>
> In principle for perfect userspace programs returning -EBUSY for
> unlink, rmdir, and rename of dentires that have mounts in the local
> namespace is actually unnecessary. Unfortunately not all userspace
> programs are perfect so retaining -EBUSY for unlink, rmdir and rename
> of dentries that have mounts in the current mount namespace plays an
> important role of maintaining consistency with historical behavior and
> making imperfect userspace applications hard to exploit.
>
> v2: Remove spurious old_dentry.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
> ---
> fs/afs/dir.c | 3 +-
> fs/dcache.c | 80 ++++++++++++++++++++----------------------------
> fs/fuse/dir.c | 3 +-
> fs/gfs2/dentry.c | 4 +--
> fs/namei.c | 30 ++++++------------
> fs/nfs/dir.c | 5 +--
> fs/sysfs/dir.c | 9 +-----
> include/linux/dcache.h | 3 +-
> 8 files changed, 50 insertions(+), 87 deletions(-)
>
> diff --git a/fs/afs/dir.c b/fs/afs/dir.c
> index 646337dc5201..7fb69d45f1b9 100644
> --- a/fs/afs/dir.c
> +++ b/fs/afs/dir.c
> @@ -686,8 +686,7 @@ not_found:
>
> out_bad:
> /* don't unhash if we have submounts */
> - if (check_submounts_and_drop(dentry) != 0)
> - goto out_skip;
> + shrink_submounts_and_drop(dentry);
>
> _debug("dropping dentry %s/%s",
> parent->d_name.name, dentry->d_name.name);
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 41000305d716..1e9bf96b0132 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1373,7 +1373,7 @@ int d_set_mounted(struct dentry *dentry)
> int ret = -ENOENT;
> write_seqlock(&rename_lock);
> for (p = dentry->d_parent; !IS_ROOT(p); p = p->d_parent) {
> - /* Need exclusion wrt. check_submounts_and_drop() */
> + /* Need exclusion wrt. shrink_submounts_and_drop() */
> spin_lock(&p->d_lock);
> if (unlikely(d_unhashed(p))) {
> spin_unlock(&p->d_lock);
> @@ -1478,70 +1478,56 @@ void shrink_dcache_parent(struct dentry *parent)
> }
> EXPORT_SYMBOL(shrink_dcache_parent);
>
> -static enum d_walk_ret check_and_collect(void *_data, struct dentry *dentry)
> +struct detach_data {
> + struct dentry *found;
> +};
> +static enum d_walk_ret do_detach_submounts(void *ptr, struct dentry *dentry)
> {
> - struct select_data *data = _data;
> -
> - if (d_mountpoint(dentry)) {
> - data->found = -EBUSY;
> - return D_WALK_QUIT;
> - }
> -
> - return select_collect(_data, dentry);
> -}
> + struct detach_data *data = ptr;
>
> -static void check_and_drop(void *_data)
> -{
> - struct select_data *data = _data;
> + if (d_mountpoint(dentry))
> + data->found = dentry;
>
> - if (d_mountpoint(data->start))
> - data->found = -EBUSY;
> - if (!data->found)
> - __d_drop(data->start);
> + return data->found ? D_WALK_QUIT : D_WALK_CONTINUE;
> }
>
> /**
> - * check_submounts_and_drop - prune dcache, check for submounts and drop
> + * detach_submounts - check for submounts and detach them.
> *
> - * All done as a single atomic operation relative to has_unlinked_ancestor().
> - * Returns 0 if successfully unhashed @parent. If there were submounts then
> - * return -EBUSY.
> + * @dentry: dentry to find mount points under.
> *
> - * @dentry: dentry to prune and drop
> + * If dentry or any of it's children is a mount point detach those mounts.
> */
> -int check_submounts_and_drop(struct dentry *dentry)
> +void detach_submounts(struct dentry *dentry)
> {
> - int ret = 0;
> -
> - /* Negative dentries can be dropped without further checks */
> - if (!dentry->d_inode) {
> - d_drop(dentry);
> - goto out;
> - }
> -
> + struct detach_data data;
> for (;;) {
> - struct select_data data;
> -
> - INIT_LIST_HEAD(&data.dispose);
> - data.start = dentry;
> - data.found = 0;
> + data.found = NULL;
> + d_walk(dentry, &data, do_detach_submounts, NULL);
>
> - d_walk(dentry, &data, check_and_collect, check_and_drop);
> - ret = data.found;
> -
> - if (!list_empty(&data.dispose))
> - shrink_dentry_list(&data.dispose);
> -
> - if (ret <= 0)
> + if (!data.found)
> break;
>
> + detach_mounts(data.found);
> cond_resched();
> }
> + detach_mounts(dentry);
> +}
>
> -out:
> - return ret;
> +/**
> + * shrink_submounts_and_drop - detach submounts, prune dcache, and drop
> + *
> + * All done as a single atomic operation reletaive to d_set_mounted().
> + *
> + * @dentry: dentry to detach, prune and drop
> + */
> +void shrink_submounts_and_drop(struct dentry *dentry)
> +{
> + d_drop(dentry);
> + detach_submounts(dentry);
> + shrink_dcache_parent(dentry);
> }
> -EXPORT_SYMBOL(check_submounts_and_drop);
> +EXPORT_SYMBOL(shrink_submounts_and_drop);
>
> /**
> * __d_alloc - allocate a dcache entry
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 62b43b577bfc..b1cd7b79a325 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -259,8 +259,7 @@ out:
>
> invalid:
> ret = 0;
> - if (check_submounts_and_drop(entry) != 0)
> - ret = 1;
> + shrink_submounts_and_drop(entry);
> goto out;
> }
>
> diff --git a/fs/gfs2/dentry.c b/fs/gfs2/dentry.c
> index d3a5d4e29ba5..2ecc2b873829 100644
> --- a/fs/gfs2/dentry.c
> +++ b/fs/gfs2/dentry.c
> @@ -93,9 +93,7 @@ invalid_gunlock:
> if (!had_lock)
> gfs2_glock_dq_uninit(&d_gh);
> invalid:
> - if (check_submounts_and_drop(dentry) != 0)
> - goto valid;
> -
> + shrink_submounts_and_drop(dentry);
> dput(parent);
> return 0;
>
> diff --git a/fs/namei.c b/fs/namei.c
> index df7bd6e9c7b6..a12c1d31d4c8 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3574,10 +3574,6 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
> dget(dentry);
> mutex_lock(&dentry->d_inode->i_mutex);
>
> - error = -EBUSY;
> - if (d_mountpoint(dentry))
> - goto out;
> -
> error = security_inode_rmdir(dir, dentry);
> if (error)
> goto out;
> @@ -3589,6 +3585,7 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
>
> dentry->d_inode->i_flags |= S_DEAD;
> dont_mount(dentry);
> + detach_mounts(dentry);
>
> out:
> mutex_unlock(&dentry->d_inode->i_mutex);
> @@ -3674,14 +3671,12 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry)
> return -EPERM;
>
> mutex_lock(&dentry->d_inode->i_mutex);
> - if (d_mountpoint(dentry))
> - error = -EBUSY;
> - else {
> - error = security_inode_unlink(dir, dentry);
> + error = security_inode_unlink(dir, dentry);
> + if (!error) {
> + error = dir->i_op->unlink(dir, dentry);
> if (!error) {
> - error = dir->i_op->unlink(dir, dentry);
> - if (!error)
> - dont_mount(dentry);
> + dont_mount(dentry);
> + detach_mounts(dentry);
> }
> }
> mutex_unlock(&dentry->d_inode->i_mutex);
> @@ -4008,10 +4003,6 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
> if (target)
> mutex_lock(&target->i_mutex);
>
> - error = -EBUSY;
> - if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
> - goto out;
> -
> error = -EMLINK;
> if (max_links && !target && new_dir != old_dir &&
> new_dir->i_nlink >= max_links)
> @@ -4026,6 +4017,7 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
> if (target) {
> target->i_flags |= S_DEAD;
> dont_mount(new_dentry);
> + detach_mounts(new_dentry);
> }
> out:
> if (target)
> @@ -4051,16 +4043,14 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> if (target)
> mutex_lock(&target->i_mutex);
>
> - error = -EBUSY;
> - if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
> - goto out;
> -
> error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
> if (error)
> goto out;
>
> - if (target)
> + if (target) {
> dont_mount(new_dentry);
> + detach_mounts(new_dentry);
> + }
> if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE))
> d_move(old_dentry, new_dentry);
> out:
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 854a8f05a610..e8e35acd8850 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -1142,10 +1142,7 @@ out_zap_parent:
> if (dentry->d_flags & DCACHE_DISCONNECTED)
> goto out_valid;
> }
> - /* If we have submounts, don't unhash ! */
> - if (check_submounts_and_drop(dentry) != 0)
> - goto out_valid;
> -
> + shrink_submounts_and_drop(dentry);
> dput(parent);
> dfprintk(LOOKUPCACHE, "NFS: %s(%s/%s) is invalid\n",
> __func__, dentry->d_parent->d_name.name,
> diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
> index 4d83cedb9fcb..477c66d4e2a8 100644
> --- a/fs/sysfs/dir.c
> +++ b/fs/sysfs/dir.c
> @@ -327,7 +327,6 @@ static int sysfs_dentry_revalidate(struct dentry *dentry, unsigned int flags)
> }
>
> mutex_unlock(&sysfs_mutex);
> -out_valid:
> return 1;
> out_bad:
> /* Remove the dentry from the dcache hashes.
> @@ -341,13 +340,7 @@ out_bad:
> * to the dcache hashes.
> */
> mutex_unlock(&sysfs_mutex);
> -
> - /* If we have submounts we must allow the vfs caches
> - * to lie about the state of the filesystem to prevent
> - * leaks and other nasty things.
> - */
> - if (check_submounts_and_drop(dentry) != 0)
> - goto out_valid;
> + shrink_submounts_and_drop(dentry);
>
> return 0;
> }
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 59066e0b4ff1..17948b49f3d5 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -254,7 +254,8 @@ extern void d_prune_aliases(struct inode *);
>
> /* test whether we have any submounts in a subdir tree */
> extern int have_submounts(struct dentry *);
> -extern int check_submounts_and_drop(struct dentry *);
> +extern void detach_submounts(struct dentry *dentry);
> +extern void shrink_submounts_and_drop(struct dentry *);
>
> /*
> * This adds the entry to the hash queues.
> --
> 1.7.5.4
The last round of this patchset the semantics of dropping mounts on
unlink were agreed upon so long as we preserve the current semantics in
a single mount namespace.
There are two big changes with this version of the patchset.
1) The test for a mount in the current mount namespace has been
rewritten so that it is actually correct and is now a drop in
replacement for d_mountpoint. Removing the need for code changes
that could cause unitended regressions.
2) This work has been comprehensivly extend to d_revalidate and
d_invalidate allowing some long standing bugs to be fixed at the end
of this patch series.
In net this should result in a kernel that is more comprehensible and
maintainble as well as fixing some possible mount leaks, and removing
the possibility of DOS attacks from evily placed mount points, in other
mount namespaces.
Eric W. Biederman (11):
vfs: Document the effect of d_revalidate on d_find_alias
vfs: More precise tests in d_invalidate
vfs: Don't allow overwriting mounts in the current mount namespace
vfs: Keep a list of mounts on a mount point
vfs: Add a function to lazily unmount all mounts from any dentry.
vfs: Lazily remove mounts on unlinked files and directories.
vfs: Remove unnecessary calls of check_submounts_and_drop
vfs: Merge check_submounts_and_drop and d_invalidate
vfs: Make d_invalidate return void
vfs: Remove d_drop calls from d_revalidate implementations
proc: Update proc_flush_task_mnt to use d_invalidate
fs/afs/dir.c | 5 --
fs/btrfs/ioctl.c | 5 +--
fs/ceph/dir.c | 1 -
fs/cifs/readdir.c | 6 +--
fs/dcache.c | 140 +++++++++++++++++-------------------------------
fs/fuse/dir.c | 7 +--
fs/gfs2/dentry.c | 3 -
fs/kernfs/dir.c | 11 ----
fs/mount.h | 13 +++++
fs/namei.c | 28 ++++++----
fs/namespace.c | 80 +++++++++++++++++++++++++++
fs/nfs/dir.c | 7 +--
fs/proc/base.c | 10 +---
fs/proc/fd.c | 2 -
include/linux/dcache.h | 3 +-
15 files changed, 167 insertions(+), 154 deletions(-)
Eric
d_drop or check_submounts_and_drop called from d_revalidate can result
in renamed directories with child dentries being unhashed. These
renamed and drop directory dentries can be rehashed after
d_materialise_unique uses d_find_alias to find them.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/dcache.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 265e0ce9769c..c71c86f7780e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -726,7 +726,8 @@ EXPORT_SYMBOL(dget_parent);
* acquire the reference to alias and return it. Otherwise return NULL.
* Notice that if inode is a directory there can be only one alias and
* it can be unhashed only if it has no children, or if it is the root
- * of a filesystem.
+ * of a filesystem, or if the directory was renamed and d_revalidate
+ * was the first vfs operation to notice.
*
* If the inode has an IS_ROOT, DCACHE_DISCONNECTED alias, then prefer
* any other hashed alias over that one unless @want_discon is set,
--
1.7.5.4
The current comments in d_invalidate about what and why it is doing
what it is doing are wildly off-base. Which is not surprising as
the comments date back to last minute bug fix of the 2.2 kernel.
The big fat lie of a comment said: If it's a directory, we can't drop
it for fear of somebody re-populating it with children (even though
dropping it would make it unreachable from that root, we still might
repopulate it if it was a working directory or similar).
The truth is that for remote filesystems the failure of d_revalidate
and d_weak_revalidate prevents us from populating or otherwise
inappropriately using a directory. For local filesystems and for
local directory removals the setting of S_DEAD prevents us from
populating or otherwise inappropriate using a directory.
The current rules are:
- To prevent mount point leaks dentries that are mount points or that
have childrent that are mount points may not be be unhashed.
- All dentries may be unhashed.
- Directories may be rehashed with d_materialise_unique
check_submounts_and_drop implements this already for well maintained
remote filesystems so implement the current rules in d_invalidate
by just calling check_submounts_and_drop.
The one difference between d_invalidate and check_submounts_and_drop
is that d_invalidate must respect it when a d_revalidate method has
earlier called d_drop so preserve the d_unhashed check in
d_invalidate.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/dcache.c | 38 ++++----------------------------------
1 files changed, 4 insertions(+), 34 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index c71c86f7780e..b0add629f5fe 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -613,9 +613,8 @@ EXPORT_SYMBOL(dput);
* @dentry: dentry to invalidate
*
* Try to invalidate the dentry if it turns out to be
- * possible. If there are other dentries that can be
- * reached through this one we can't delete it and we
- * return -EBUSY. On success we return 0.
+ * possible. If there are reasons not to delete it
+ * return -EBUSY. On success return 0.
*
* no dcache lock.
*/
@@ -630,38 +629,9 @@ int d_invalidate(struct dentry * dentry)
spin_unlock(&dentry->d_lock);
return 0;
}
- /*
- * Check whether to do a partial shrink_dcache
- * to get rid of unused child entries.
- */
- if (!list_empty(&dentry->d_subdirs)) {
- spin_unlock(&dentry->d_lock);
- shrink_dcache_parent(dentry);
- spin_lock(&dentry->d_lock);
- }
-
- /*
- * Somebody else still using it?
- *
- * If it's a directory, we can't drop it
- * for fear of somebody re-populating it
- * with children (even though dropping it
- * would make it unreachable from the root,
- * we might still populate it if it was a
- * working directory or similar).
- * We also need to leave mountpoints alone,
- * directory or not.
- */
- if (dentry->d_lockref.count > 1 && dentry->d_inode) {
- if (S_ISDIR(dentry->d_inode->i_mode) || d_mountpoint(dentry)) {
- spin_unlock(&dentry->d_lock);
- return -EBUSY;
- }
- }
-
- __d_drop(dentry);
spin_unlock(&dentry->d_lock);
- return 0;
+
+ return check_submounts_and_drop(dentry);
}
EXPORT_SYMBOL(d_invalidate);
--
1.7.5.4
In preparation for allowing mountpoints to be renamed and unlinked
in remote filesystems and in other mount namespaces test if on a dentry
there is a mount in the local mount namespace before allowing it to
be renamed or unlinked.
The primary motivation here are old versions of fusermount unmount
which is not safe if the a path can be renamed or unlinked while it is
verifying the mount is safe to unmount. More recent versions are simpler
and safer by simply using UMOUNT_NOFOLLOW when unmounting a mount
in a directory owned by an arbitrary user.
Miklos Szeredi <[email protected]> reports this is approach is good
enough to remove concerns about new kernels mixed with old versions
of fusermount.
A secondary motivation for restrictions here is that it removing empty
directories that have non-empty mount points on them appears to
violate the rule that rmdir can not remove empty directories. As
Linus Torvalds pointed out this is useful for programs (like git) that
test if a directory is empty with rmdir.
Therefore this patch arranges to enforce the existing mount point
semantics for local mount namespace.
v2: Rewrote the test to be a drop in replacement for d_mountpoint
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/mount.h | 9 +++++++++
fs/namei.c | 8 +++++++-
fs/namespace.c | 35 +++++++++++++++++++++++++++++++++++
3 files changed, 51 insertions(+), 1 deletions(-)
diff --git a/fs/mount.h b/fs/mount.h
index a17458ca6f29..17d913d86add 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -109,3 +109,12 @@ struct proc_mounts {
#define proc_mounts(p) (container_of((p), struct proc_mounts, m))
extern const struct seq_operations mounts_op;
+
+extern int __is_local_mountpoint(struct dentry *dentry);
+static inline int is_local_mountpoint(struct dentry *dentry)
+{
+ if (!d_mountpoint(dentry))
+ return 0;
+
+ return __is_local_mountpoint(dentry);
+}
diff --git a/fs/namei.c b/fs/namei.c
index d580df2e6804..4e6fe16ef488 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3507,6 +3507,8 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
mutex_lock(&dentry->d_inode->i_mutex);
error = -EBUSY;
+ if (is_local_mountpoint(dentry))
+ goto out;
if (d_mountpoint(dentry))
goto out;
@@ -3622,7 +3624,7 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry, struct inode **delegate
return -EPERM;
mutex_lock(&target->i_mutex);
- if (d_mountpoint(dentry))
+ if (is_local_mountpoint(dentry) || d_mountpoint(dentry))
error = -EBUSY;
else {
error = security_inode_unlink(dir, dentry);
@@ -4001,6 +4003,8 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
mutex_lock(&target->i_mutex);
error = -EBUSY;
+ if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
+ goto out;
if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
goto out;
@@ -4045,6 +4049,8 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
lock_two_nondirectories(source, target);
error = -EBUSY;
+ if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
+ goto out;
if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
goto out;
diff --git a/fs/namespace.c b/fs/namespace.c
index 22e536705c45..6dc23614d44d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -631,6 +631,41 @@ struct vfsmount *lookup_mnt(struct path *path)
return m;
}
+/*
+ * __is_local_mountpoint - Test to see if dentry is a mountpoint in the
+ * current mount namespace.
+ *
+ * The common case is dentries are not mountpoints at all and that
+ * test is handled inline. For the slow case when we are actually
+ * dealing with a mountpoint of some kind, walk through all of the
+ * mounts in the current mount namespace and test to see if the dentry
+ * is a mountpoint.
+ *
+ * The mount_hashtable is not usable in the context because we
+ * need to identify all mounts that may be in the current mount
+ * namespace not just a mount that happens to have some specified
+ * parent mount.
+ */
+int __is_local_mountpoint(struct dentry *dentry)
+{
+ struct mnt_namespace *ns = current->nsproxy->mnt_ns;
+ struct mount *mnt;
+ int is_covered = 0;
+
+ if (!d_mountpoint(dentry))
+ goto out;
+
+ down_read(&namespace_sem);
+ list_for_each_entry(mnt, &ns->list, mnt_list) {
+ is_covered = (mnt->mnt_mountpoint == dentry);
+ if (is_covered)
+ break;
+ }
+ up_read(&namespace_sem);
+out:
+ return is_covered;
+}
+
static struct mountpoint *new_mountpoint(struct dentry *dentry)
{
struct list_head *chain = mountpoint_hashtable + hash(NULL, dentry);
--
1.7.5.4
To spot any possible problems call BUG if a mountpoint
is put when it's list of mounts is not empty.
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/mount.h | 2 ++
fs/namespace.c | 6 ++++++
2 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/fs/mount.h b/fs/mount.h
index 17d913d86add..50a72d46e7a6 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -21,6 +21,7 @@ struct mnt_pcp {
struct mountpoint {
struct list_head m_hash;
struct dentry *m_dentry;
+ struct list_head m_list;
int m_count;
};
@@ -48,6 +49,7 @@ struct mount {
struct mount *mnt_master; /* slave is on master->mnt_slave_list */
struct mnt_namespace *mnt_ns; /* containing namespace */
struct mountpoint *mnt_mp; /* where is it mounted */
+ struct list_head mnt_mp_list; /* list mounts with the same mountpoint */
#ifdef CONFIG_FSNOTIFY
struct hlist_head mnt_fsnotify_marks;
__u32 mnt_fsnotify_mask;
diff --git a/fs/namespace.c b/fs/namespace.c
index 6dc23614d44d..33db9e95bd5c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -195,6 +195,7 @@ static struct mount *alloc_vfsmnt(const char *name)
INIT_LIST_HEAD(&mnt->mnt_share);
INIT_LIST_HEAD(&mnt->mnt_slave_list);
INIT_LIST_HEAD(&mnt->mnt_slave);
+ INIT_LIST_HEAD(&mnt->mnt_mp_list);
#ifdef CONFIG_FSNOTIFY
INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
#endif
@@ -695,6 +696,7 @@ static struct mountpoint *new_mountpoint(struct dentry *dentry)
mp->m_dentry = dentry;
mp->m_count = 1;
list_add(&mp->m_hash, chain);
+ INIT_LIST_HEAD(&mp->m_list);
return mp;
}
@@ -702,6 +704,7 @@ static void put_mountpoint(struct mountpoint *mp)
{
if (!--mp->m_count) {
struct dentry *dentry = mp->m_dentry;
+ BUG_ON(!list_empty(&mp->m_list));
spin_lock(&dentry->d_lock);
dentry->d_flags &= ~DCACHE_MOUNTED;
spin_unlock(&dentry->d_lock);
@@ -748,6 +751,7 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
mnt->mnt_mountpoint = mnt->mnt.mnt_root;
list_del_init(&mnt->mnt_child);
list_del_init(&mnt->mnt_hash);
+ list_del_init(&mnt->mnt_mp_list);
put_mountpoint(mnt->mnt_mp);
mnt->mnt_mp = NULL;
}
@@ -764,6 +768,7 @@ void mnt_set_mountpoint(struct mount *mnt,
child_mnt->mnt_mountpoint = dget(mp->m_dentry);
child_mnt->mnt_parent = mnt;
child_mnt->mnt_mp = mp;
+ list_add_tail(&child_mnt->mnt_mp_list, &mp->m_list);
}
/*
@@ -1246,6 +1251,7 @@ void umount_tree(struct mount *mnt, int how)
p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
list_del_init(&p->mnt_child);
if (mnt_has_parent(p)) {
+ list_del_init(&p->mnt_mp_list);
put_mountpoint(p->mnt_mp);
/* move the reference to mountpoint into ->mnt_ex_mountpoint */
p->mnt_ex_mountpoint.dentry = p->mnt_mountpoint;
--
1.7.5.4
v2: Always drop the lock when exiting early.
v3: Make detach_mounts robust about freeing several
mounts on the same mountpoint at one time, and remove
the unneeded mnt_list list test.
v4: Document the purpose of detach_mounts and why new_mountpoint is
safe to call.
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/mount.h | 2 ++
fs/namespace.c | 39 +++++++++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+), 0 deletions(-)
diff --git a/fs/mount.h b/fs/mount.h
index 50a72d46e7a6..2b470f34e665 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -84,6 +84,8 @@ extern struct mount *__lookup_mnt_last(struct vfsmount *, struct dentry *);
extern bool legitimize_mnt(struct vfsmount *, unsigned);
+extern void detach_mounts(struct dentry *dentry);
+
static inline void get_mnt_ns(struct mnt_namespace *ns)
{
atomic_inc(&ns->count);
diff --git a/fs/namespace.c b/fs/namespace.c
index 33db9e95bd5c..7abbf722ce18 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1359,6 +1359,45 @@ static int do_umount(struct mount *mnt, int flags)
return retval;
}
+/*
+ * detach_mounts - lazily unmount all mounts on the specified dentry
+ *
+ * During unlink, rmdir, and d_drop it is possible to loose the path
+ * to an existing mountpoint, and wind up leaking the mount.
+ * detach_mounts allows lazily unmounting those mounts instead of
+ * leaking them.
+ *
+ * The caller may hold dentry->d_inode->i_mutex.
+ */
+void detach_mounts(struct dentry *dentry)
+{
+ struct mountpoint *mp;
+ struct mount *mnt;
+
+ namespace_lock();
+ if (!d_mountpoint(dentry))
+ goto out_unlock;
+
+ /*
+ * The namespace lock and d_mountpoint being set guarantees
+ * that new_mountpoint will just be a lookup of the existing
+ * mountpoint structure.
+ */
+ mp = new_mountpoint(dentry);
+ if (IS_ERR(mp))
+ goto out_unlock;
+
+ lock_mount_hash();
+ while (!list_empty(&mp->m_list)) {
+ mnt = list_first_entry(&mp->m_list, struct mount, mnt_mp_list);
+ umount_tree(mnt, 1);
+ }
+ unlock_mount_hash();
+ put_mountpoint(mp);
+out_unlock:
+ namespace_unlock();
+}
+
/*
* Is the caller allowed to modify his namespace?
*/
--
1.7.5.4
With the introduction of mount namespaces and bind mounts it became
possible to access files and directories that on some paths are mount
points but are not mount points on other paths. It is very confusing
when rm -rf somedir returns -EBUSY simply because somedir is mounted
somewhere else. With the addition of user namespaces allowing
unprivileged mounts this condition has gone from annoying to allowing
a DOS attack on other users in the system.
The possibility for mischief is removed by updating the vfs to support
rename, unlink and rmdir on a dentry that is a mountpoint and by
lazily unmounting mountpoints on deleted dentries.
In particular this change allows rename, unlink and rmdir system calls
on a dentry without a mountpoint in the current mount namespace to
succeed, and it allows rename, unlink, and rmdir performed on a
distributed filesystem to update the vfs cache even if when there is a
mount in some namespace on the original dentry.
There are two common patterns of maintaining mounts: Mounts on trusted
paths with the parent directory of the mount point and all ancestory
directories up to / owned by root and modifiable only by root
(i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu,
cpuacct, ...}, /usr, /usr/local). Mounts on unprivileged directories
maintained by fusermount.
In the case of mounts in trusted directories owned by root and
modifiable only by root the current parent directory permissions are
sufficient to ensure a mount point on a trusted path is not removed
or renamed by anyone other than root, even if there is a context
where the there are no mount points to prevent this.
In the case of mounts in directories owned by less privileged users
races with users modifying the path of a mount point are already a
danger. fusermount already uses a combination of chdir,
/proc/<pid>/fd/NNN, and UMOUNT_NOFOLLOW to prevent these races. The
removable of global rename, unlink, and rmdir protection really adds
nothing new to consider only a widening of the attack window, and
fusermount is already safe against unprivileged users modifying the
directory simultaneously.
In principle for perfect userspace programs returning -EBUSY for
unlink, rmdir, and rename of dentires that have mounts in the local
namespace is actually unnecessary. Unfortunately not all userspace
programs are perfect so retaining -EBUSY for unlink, rmdir and rename
of dentries that have mounts in the current mount namespace plays an
important role of maintaining consistency with historical behavior and
making imperfect userspace applications hard to exploit.
v2: Remove spurious old_dentry.
v3: Optimized shrink_submounts_and_drop
Removed unsued afs label
v4: Simplified the changes to check_submounts_and_drop
Do not rename check_submounts_and_drop shrink_submounts_and_drop
Document what why we need atomicity in check_submounts_and_drop
Rely on the parent inode mutex to make d_revalidate and d_invalidate
an atomic unit.
v5: Refcount the mountpoint to detach in case of simultaneous
renames.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/dcache.c | 60 ++++++++++++++++++++++++++++++++--------------------------
fs/namei.c | 18 ++++++++--------
2 files changed, 42 insertions(+), 36 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index b0add629f5fe..27585b1dd6f1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1374,36 +1374,39 @@ void shrink_dcache_for_umount(struct super_block *sb)
}
}
-static enum d_walk_ret check_and_collect(void *_data, struct dentry *dentry)
+struct detach_data {
+ struct select_data select;
+ struct dentry *mountpoint;
+};
+static enum d_walk_ret detach_and_collect(void *_data, struct dentry *dentry)
{
- struct select_data *data = _data;
+ struct detach_data *data = _data;
if (d_mountpoint(dentry)) {
- data->found = -EBUSY;
+ __dget_dlock(dentry);
+ data->mountpoint = dentry;
return D_WALK_QUIT;
}
- return select_collect(_data, dentry);
+ return select_collect(&data->select, dentry);
}
static void check_and_drop(void *_data)
{
- struct select_data *data = _data;
+ struct detach_data *data = _data;
- if (d_mountpoint(data->start))
- data->found = -EBUSY;
- if (!data->found)
- __d_drop(data->start);
+ if (!data->mountpoint && !data->select.found)
+ __d_drop(data->select.start);
}
/**
- * check_submounts_and_drop - prune dcache, check for submounts and drop
+ * check_submounts_and_drop - detach submounts, prune dcache, and drop
*
- * All done as a single atomic operation relative to has_unlinked_ancestor().
- * Returns 0 if successfully unhashed @parent. If there were submounts then
- * return -EBUSY.
+ * The final d_drop is done as an atomic operation relative to
+ * rename_lock ensuring there are no races with d_set_mounted. This
+ * ensures there are no unhashed dentries on the path to a mountpoint.
*
- * @dentry: dentry to prune and drop
+ * @dentry: dentry to detach, prune and drop
*/
int check_submounts_and_drop(struct dentry *dentry)
{
@@ -1416,19 +1419,24 @@ int check_submounts_and_drop(struct dentry *dentry)
}
for (;;) {
- struct select_data data;
+ struct detach_data data;
- INIT_LIST_HEAD(&data.dispose);
- data.start = dentry;
- data.found = 0;
+ data.mountpoint = NULL;
+ INIT_LIST_HEAD(&data.select.dispose);
+ data.select.start = dentry;
+ data.select.found = 0;
- d_walk(dentry, &data, check_and_collect, check_and_drop);
- ret = data.found;
+ d_walk(dentry, &data, detach_and_collect, check_and_drop);
- if (!list_empty(&data.dispose))
- shrink_dentry_list(&data.dispose);
+ if (data.select.found)
+ shrink_dentry_list(&data.select.dispose);
- if (ret <= 0)
+ if (data.mountpoint) {
+ detach_mounts(data.mountpoint);
+ dput(data.mountpoint);
+ }
+
+ if (!data.mountpoint && !data.select.found)
break;
cond_resched();
@@ -2640,10 +2648,8 @@ static struct dentry *__d_unalias(struct inode *inode,
goto out_err;
m2 = &alias->d_parent->d_inode->i_mutex;
out_unalias:
- if (likely(!d_mountpoint(alias))) {
- __d_move(alias, dentry);
- ret = alias;
- }
+ __d_move(alias, dentry);
+ ret = alias;
out_err:
spin_unlock(&inode->i_lock);
if (m2)
diff --git a/fs/namei.c b/fs/namei.c
index 4e6fe16ef488..3fca30cd448b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3509,8 +3509,6 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
error = -EBUSY;
if (is_local_mountpoint(dentry))
goto out;
- if (d_mountpoint(dentry))
- goto out;
error = security_inode_rmdir(dir, dentry);
if (error)
@@ -3523,6 +3521,7 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
dentry->d_inode->i_flags |= S_DEAD;
dont_mount(dentry);
+ detach_mounts(dentry);
out:
mutex_unlock(&dentry->d_inode->i_mutex);
@@ -3624,7 +3623,7 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry, struct inode **delegate
return -EPERM;
mutex_lock(&target->i_mutex);
- if (is_local_mountpoint(dentry) || d_mountpoint(dentry))
+ if (is_local_mountpoint(dentry))
error = -EBUSY;
else {
error = security_inode_unlink(dir, dentry);
@@ -3633,8 +3632,10 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry, struct inode **delegate
if (error)
goto out;
error = dir->i_op->unlink(dir, dentry);
- if (!error)
+ if (!error) {
dont_mount(dentry);
+ detach_mounts(dentry);
+ }
}
}
out:
@@ -4005,8 +4006,6 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
error = -EBUSY;
if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
goto out;
- if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
- goto out;
error = -EMLINK;
if (max_links && !target && new_dir != old_dir &&
@@ -4022,6 +4021,7 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
if (target) {
target->i_flags |= S_DEAD;
dont_mount(new_dentry);
+ detach_mounts(new_dentry);
}
out:
if (target)
@@ -4051,8 +4051,6 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
error = -EBUSY;
if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
goto out;
- if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
- goto out;
error = try_break_deleg(source, delegated_inode);
if (error)
@@ -4066,8 +4064,10 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
if (error)
goto out;
- if (target)
+ if (target) {
dont_mount(new_dentry);
+ detach_mounts(new_dentry);
+ }
if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE))
d_move(old_dentry, new_dentry);
out:
--
1.7.5.4
Now that check_submounts_and_drop can not fail and is called from
d_invalidate there is no longer a need to call check_submounts_and_drom
from filesystem d_revalidate methods so remove it.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/afs/dir.c | 5 -----
fs/fuse/dir.c | 3 ---
fs/gfs2/dentry.c | 3 ---
fs/kernfs/dir.c | 11 -----------
fs/nfs/dir.c | 4 ----
5 files changed, 0 insertions(+), 26 deletions(-)
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 529300327f45..a1645b88fe8a 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -669,7 +669,6 @@ static int afs_d_revalidate(struct dentry *dentry, unsigned int flags)
out_valid:
dentry->d_fsdata = dir_version;
-out_skip:
dput(parent);
key_put(key);
_leave(" = 1 [valid]");
@@ -682,10 +681,6 @@ not_found:
spin_unlock(&dentry->d_lock);
out_bad:
- /* don't unhash if we have submounts */
- if (check_submounts_and_drop(dentry) != 0)
- goto out_skip;
-
_debug("dropping dentry %s/%s",
parent->d_name.name, dentry->d_name.name);
dput(parent);
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 1d1292c581c3..5192d0a04d20 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -273,9 +273,6 @@ out:
invalid:
ret = 0;
-
- if (!(flags & LOOKUP_RCU) && check_submounts_and_drop(entry) != 0)
- ret = 1;
goto out;
}
diff --git a/fs/gfs2/dentry.c b/fs/gfs2/dentry.c
index d3a5d4e29ba5..589f4ea9381c 100644
--- a/fs/gfs2/dentry.c
+++ b/fs/gfs2/dentry.c
@@ -93,9 +93,6 @@ invalid_gunlock:
if (!had_lock)
gfs2_glock_dq_uninit(&d_gh);
invalid:
- if (check_submounts_and_drop(dentry) != 0)
- goto valid;
-
dput(parent);
return 0;
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 5104cf5d25c5..43d5ab905238 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -296,21 +296,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
goto out_bad;
mutex_unlock(&kernfs_mutex);
-out_valid:
return 1;
out_bad:
mutex_unlock(&kernfs_mutex);
out_bad_unlocked:
- /*
- * @dentry doesn't match the underlying kernfs node, drop the
- * dentry and force lookup. If we have submounts we must allow the
- * vfs caches to lie about the state of the filesystem to prevent
- * leaks and other nasty things, so use check_submounts_and_drop()
- * instead of d_drop().
- */
- if (check_submounts_and_drop(dentry) != 0)
- goto out_valid;
-
return 0;
}
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index be38b573495a..a8a2f7f86c45 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1143,10 +1143,6 @@ out_zap_parent:
if (IS_ROOT(dentry))
goto out_valid;
}
- /* If we have submounts, don't unhash ! */
- if (check_submounts_and_drop(dentry) != 0)
- goto out_valid;
-
dput(parent);
dfprintk(LOOKUPCACHE, "NFS: %s(%pd2) is invalid\n",
__func__, dentry);
--
1.7.5.4
Now that d_invalidate is the only caller of check_submounts_and_drop,
expand check_submounts_and_drop inline in d_invalidate.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/dcache.c | 55 +++++++++++++++++++----------------------------
include/linux/dcache.h | 1 -
2 files changed, 22 insertions(+), 34 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 27585b1dd6f1..5b41205cbf33 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -608,32 +608,6 @@ kill_it:
}
EXPORT_SYMBOL(dput);
-/**
- * d_invalidate - invalidate a dentry
- * @dentry: dentry to invalidate
- *
- * Try to invalidate the dentry if it turns out to be
- * possible. If there are reasons not to delete it
- * return -EBUSY. On success return 0.
- *
- * no dcache lock.
- */
-
-int d_invalidate(struct dentry * dentry)
-{
- /*
- * If it's already been dropped, return OK.
- */
- spin_lock(&dentry->d_lock);
- if (d_unhashed(dentry)) {
- spin_unlock(&dentry->d_lock);
- return 0;
- }
- spin_unlock(&dentry->d_lock);
-
- return check_submounts_and_drop(dentry);
-}
-EXPORT_SYMBOL(d_invalidate);
/* This must be called with d_lock held */
static inline void __dget_dlock(struct dentry *dentry)
@@ -1175,7 +1149,7 @@ EXPORT_SYMBOL(have_submounts);
* reachable (e.g. NFS can unhash a directory dentry and then the complete
* subtree can become unreachable).
*
- * Only one of check_submounts_and_drop() and d_set_mounted() must succeed. For
+ * Only one of d_invalidate() and d_set_mounted() must succeed. For
* this reason take rename_lock and d_lock on dentry and ancestors.
*/
int d_set_mounted(struct dentry *dentry)
@@ -1184,7 +1158,7 @@ int d_set_mounted(struct dentry *dentry)
int ret = -ENOENT;
write_seqlock(&rename_lock);
for (p = dentry->d_parent; !IS_ROOT(p); p = p->d_parent) {
- /* Need exclusion wrt. check_submounts_and_drop() */
+ /* Need exclusion wrt. d_invalidate() */
spin_lock(&p->d_lock);
if (unlikely(d_unhashed(p))) {
spin_unlock(&p->d_lock);
@@ -1400,18 +1374,33 @@ static void check_and_drop(void *_data)
}
/**
- * check_submounts_and_drop - detach submounts, prune dcache, and drop
+ * d_invalidate - detach submounts, prune dcache, and drop
+ * @dentry: dentry to invalidate (aka detach, prune and drop)
+ *
+ * Try to invalidate the dentry if it turns out to be
+ * possible. If there are reasons not to delete it
+ * return -EBUSY. On success return 0.
+ *
+ * no dcache lock.
*
* The final d_drop is done as an atomic operation relative to
* rename_lock ensuring there are no races with d_set_mounted. This
* ensures there are no unhashed dentries on the path to a mountpoint.
- *
- * @dentry: dentry to detach, prune and drop
*/
-int check_submounts_and_drop(struct dentry *dentry)
+int d_invalidate(struct dentry *dentry)
{
int ret = 0;
+ /*
+ * If it's already been dropped, return OK.
+ */
+ spin_lock(&dentry->d_lock);
+ if (d_unhashed(dentry)) {
+ spin_unlock(&dentry->d_lock);
+ return 0;
+ }
+ spin_unlock(&dentry->d_lock);
+
/* Negative dentries can be dropped without further checks */
if (!dentry->d_inode) {
d_drop(dentry);
@@ -1445,7 +1434,7 @@ int check_submounts_and_drop(struct dentry *dentry)
out:
return ret;
}
-EXPORT_SYMBOL(check_submounts_and_drop);
+EXPORT_SYMBOL(d_invalidate);
/**
* __d_alloc - allocate a dcache entry
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index bf72e9ac6de0..ae77222c3e86 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -265,7 +265,6 @@ extern void d_prune_aliases(struct inode *);
/* test whether we have any submounts in a subdir tree */
extern int have_submounts(struct dentry *);
-extern int check_submounts_and_drop(struct dentry *);
/*
* This adds the entry to the hash queues.
--
1.7.5.4
Now that d_invalidate can no longer fail, stop returning a useless
return code. For the few callers that checked the return code update
remove the handling of d_invalidate failure.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/btrfs/ioctl.c | 5 +----
fs/cifs/readdir.c | 6 +-----
fs/dcache.c | 12 +++---------
fs/fuse/dir.c | 4 +---
fs/namei.c | 10 +++++-----
fs/nfs/dir.c | 3 +--
include/linux/dcache.h | 2 +-
7 files changed, 13 insertions(+), 29 deletions(-)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b0134892dc70..349848bd54e2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2229,9 +2229,7 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file,
}
mutex_lock(&inode->i_mutex);
- err = d_invalidate(dentry);
- if (err)
- goto out_unlock;
+ d_invalidate(dentry);
down_write(&root->fs_info->subvol_sem);
@@ -2316,7 +2314,6 @@ out_release:
btrfs_subvolume_release_metadata(root, &block_rsv, qgroup_reserved);
out_up_write:
up_write(&root->fs_info->subvol_sem);
-out_unlock:
mutex_unlock(&inode->i_mutex);
if (!err) {
shrink_dcache_sb(root->fs_info->sb);
diff --git a/fs/cifs/readdir.c b/fs/cifs/readdir.c
index b15862e0f68c..d0e9d0169b37 100644
--- a/fs/cifs/readdir.c
+++ b/fs/cifs/readdir.c
@@ -87,8 +87,6 @@ cifs_prime_dcache(struct dentry *parent, struct qstr *name,
return;
if (dentry) {
- int err;
-
inode = dentry->d_inode;
if (inode) {
/*
@@ -105,10 +103,8 @@ cifs_prime_dcache(struct dentry *parent, struct qstr *name,
goto out;
}
}
- err = d_invalidate(dentry);
+ d_invalidate(dentry);
dput(dentry);
- if (err)
- return;
}
/*
diff --git a/fs/dcache.c b/fs/dcache.c
index 5b41205cbf33..5b78bd98649c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1377,27 +1377,21 @@ static void check_and_drop(void *_data)
* d_invalidate - detach submounts, prune dcache, and drop
* @dentry: dentry to invalidate (aka detach, prune and drop)
*
- * Try to invalidate the dentry if it turns out to be
- * possible. If there are reasons not to delete it
- * return -EBUSY. On success return 0.
- *
* no dcache lock.
*
* The final d_drop is done as an atomic operation relative to
* rename_lock ensuring there are no races with d_set_mounted. This
* ensures there are no unhashed dentries on the path to a mountpoint.
*/
-int d_invalidate(struct dentry *dentry)
+void d_invalidate(struct dentry *dentry)
{
- int ret = 0;
-
/*
* If it's already been dropped, return OK.
*/
spin_lock(&dentry->d_lock);
if (d_unhashed(dentry)) {
spin_unlock(&dentry->d_lock);
- return 0;
+ return;
}
spin_unlock(&dentry->d_lock);
@@ -1432,7 +1426,7 @@ int d_invalidate(struct dentry *dentry)
}
out:
- return ret;
+ return;
}
EXPORT_SYMBOL(d_invalidate);
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 5192d0a04d20..6e920fadb45d 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1225,9 +1225,7 @@ static int fuse_direntplus_link(struct file *file,
d_drop(dentry);
} else if (get_node_id(inode) != o->nodeid ||
((o->attr.mode ^ inode->i_mode) & S_IFMT)) {
- err = d_invalidate(dentry);
- if (err)
- goto out;
+ d_invalidate(dentry);
} else if (is_bad_inode(inode)) {
err = -EIO;
goto out;
diff --git a/fs/namei.c b/fs/namei.c
index 3fca30cd448b..384fcc6a5606 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1259,7 +1259,8 @@ static struct dentry *lookup_dcache(struct qstr *name, struct dentry *dir,
if (error < 0) {
dput(dentry);
return ERR_PTR(error);
- } else if (!d_invalidate(dentry)) {
+ } else {
+ d_invalidate(dentry);
dput(dentry);
dentry = NULL;
}
@@ -1391,10 +1392,9 @@ unlazy:
dput(dentry);
return status;
}
- if (!d_invalidate(dentry)) {
- dput(dentry);
- goto need_lookup;
- }
+ d_invalidate(dentry);
+ dput(dentry);
+ goto need_lookup;
}
path->mnt = mnt;
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index a8a2f7f86c45..a2b2b6d17c1f 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -464,8 +464,7 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
nfs_setsecurity(dentry->d_inode, entry->fattr, entry->label);
goto out;
} else {
- if (d_invalidate(dentry) != 0)
- goto out;
+ d_invalidate(dentry);
dput(dentry);
}
}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index ae77222c3e86..95ddb0bdadb0 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -250,7 +250,7 @@ extern struct dentry * d_obtain_alias(struct inode *);
extern void shrink_dcache_sb(struct super_block *);
extern void shrink_dcache_parent(struct dentry *);
extern void shrink_dcache_for_umount(struct super_block *);
-extern int d_invalidate(struct dentry *);
+extern void d_invalidate(struct dentry *);
/* only used at mount-time */
extern struct dentry * d_make_root(struct inode *);
--
1.7.5.4
Now that d_invalidate always succeeds it is not longer necessary or
desirable to hard code d_drop calls into filesystem specific
d_revalidate implementations.
Remove the unnecessary d_drop calls and rely on d_invalidate
to drop the dentries. Using d_invalidate ensures that paths
to mount points will not be dropped.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/ceph/dir.c | 1 -
fs/proc/base.c | 4 ----
fs/proc/fd.c | 2 --
3 files changed, 0 insertions(+), 7 deletions(-)
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 6da4df84ba30..ecd2402739e2 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1052,7 +1052,6 @@ static int ceph_d_revalidate(struct dentry *dentry, unsigned int flags)
ceph_dentry_lru_touch(dentry);
} else {
ceph_dir_clear_complete(dir);
- d_drop(dentry);
}
iput(dir);
return valid;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 51507065263b..aa62b37f1303 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1654,7 +1654,6 @@ int pid_revalidate(struct dentry *dentry, unsigned int flags)
put_task_struct(task);
return 1;
}
- d_drop(dentry);
return 0;
}
@@ -1791,9 +1790,6 @@ out:
put_task_struct(task);
out_notask:
- if (status <= 0)
- d_drop(dentry);
-
return status;
}
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 985ea881b5bc..7e4f04ccab31 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -127,8 +127,6 @@ static int tid_fd_revalidate(struct dentry *dentry, unsigned int flags)
}
put_task_struct(task);
}
-
- d_drop(dentry);
return 0;
}
--
1.7.5.4
Now that d_invalidate always succeeds and flushes mount points use
it in stead of a combination of shrink_dcache_parent and d_drop
in proc_flush_task_mnt. This removes the danger of a mount point
under /proc/<pid>/... becoming unreachable after the d_drop.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/proc/base.c | 6 ++----
1 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index aa62b37f1303..40d97e576d32 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2700,8 +2700,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
/* no ->d_hash() rejects on procfs */
dentry = d_hash_and_lookup(mnt->mnt_root, &name);
if (dentry) {
- shrink_dcache_parent(dentry);
- d_drop(dentry);
+ d_invalidate(dentry);
dput(dentry);
}
@@ -2721,8 +2720,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
name.len = snprintf(buf, sizeof(buf), "%d", pid);
dentry = d_hash_and_lookup(dir, &name);
if (dentry) {
- shrink_dcache_parent(dentry);
- d_drop(dentry);
+ d_invalidate(dentry);
dput(dentry);
}
--
1.7.5.4
On Sat, Feb 15, 2014 at 1:36 PM, Eric W. Biederman
<[email protected]> wrote:
>
> The one difference between d_invalidate and check_submounts_and_drop
> is that d_invalidate must respect it when a d_revalidate method has
> earlier called d_drop so preserve the d_unhashed check in
> d_invalidate.
What?
There's another *huge* difference, namely locking. Your change has
huge detrimental effects wrt d_lock - not only do you drop it for the
local dentry (to then re-take it in check_submounts_and_drop()), but
the whole check_submounts_and_drop thing walks the parent chain and
locks each parent with the renamelock held for writing.
So NAK on this patch, if for no other reason that you are not talking
about the *huge* locking changes at all, and apparently completely
ignored them.
It's possible that you can argue that performance doesn't matter, and
that all the extra huge locking overhead is immaterial because this is
not commonly called, but no way in hell can I accept a patch with a
commit message that basically lies by omission about what it is doing.
This kills the whole series for me. I'm not going to bother trying to
review the rest of the patches when the second patch already makes me
go "Eric is trying to hide things". Because the locking changes aren't
obvious in the patch without knowing what else is going on, and they
certainly aren't mentioned in the commit message.
Linus
On Sat, Feb 15, 2014 at 2:51 PM, Linus Torvalds
<[email protected]> wrote:
>
> the whole check_submounts_and_drop thing walks the parent chain and
> locks each parent with the renamelock held for writing.
Oops, my bad about the write lock, brainfart due to grepping and
reading the wrong context...
check_submounts_and_drop() doesn't do the parent walk with the rename
lock held for writing, it just holds it for reading.
But it does do that very complex "walk parents and check all siblings"
and locks them, so the rest of the commentary was correct.
Linus
Linus Torvalds <[email protected]> writes:
> On Sat, Feb 15, 2014 at 2:51 PM, Linus Torvalds
> <[email protected]> wrote:
>>
>> the whole check_submounts_and_drop thing walks the parent chain and
>> locks each parent with the renamelock held for writing.
>
> Oops, my bad about the write lock, brainfart due to grepping and
> reading the wrong context...
>
> check_submounts_and_drop() doesn't do the parent walk with the rename
> lock held for writing, it just holds it for reading.
>
> But it does do that very complex "walk parents and check all siblings"
> and locks them, so the rest of the commentary was correct.
Except that today d_invalidate drops the dcache lock and
calls shrink_dcache_parent. Which gets you into exactly the same
complex "walk parents and check all siblings" code.
The only difference between the shrink_dcache_parent and
check_submounts_and_drop (not counting the final drop)
is that check_submounts_and_drop aborts when it encounters a dentry
with d_mountpoint set.
So no I am not trying to hide something. I called out that I changed
this logic in particular and this particular patch all I am doing is
killing the enforcing of 2.2 era logic. Further I front loaded this
change so I bisect could point it's fingers at this before any other
substantial changes were made if this is indeed a problem.
Beyond that check_submounts_and_drop is what well maintained distributed
filesystems are calling from d_revalidate.
Now I would not be surprised if this change to d_invalidate is a
challenge to get your head around. It took me a while of reading the
code to realize (a) how the code makes some degree of sense today,
and (b) that the change is semantically safe.
But when shrink_dcache_parent and check_submounts_and_drop are
effectiely the same function I can't possibly see how you can argue how
the locking has changed or that I am trying to hide things.
Eric
[email protected] (Eric W. Biederman) writes:
> But when shrink_dcache_parent and check_submounts_and_drop are
> effectiely the same function I can't possibly see how you can argue how
> the locking has changed or that I am trying to hide things.
And in particular the only locking change that I can see at all is that
d_walk takes read_seqbegin_or_lock before checking the if the d_subdirs
list is empty, which is just an extra cache line read.
Which in practical terms appears like I have removed unnecessary special
cases in favor less code. Which I think if anything should perform
better just because there is less code to run, and what is happening is
less obfuscated.
Eric
On Sat, Feb 15, 2014 at 3:23 PM, Eric W. Biederman
<[email protected]> wrote:
>
> Except that today d_invalidate drops the dcache lock and
> calls shrink_dcache_parent. Which gets you into exactly the same
> complex "walk parents and check all siblings" code.
Hmm. It only does that for directories that have sub-entries, though.
I think you may care just about directories (because that's what your
series is about), but d_invalidate() is used for other cases too,
notably d_revalidate() (ie things like stale NFS lookups of normal
files).
That said, I'll have to think about this more. If d_subdir is empty, I
guess d_walk() will be fairly cheap. It's very different, but maybe
not as disastrous as I thought.
Linus
Linus Torvalds <[email protected]> writes:
> On Sat, Feb 15, 2014 at 3:23 PM, Eric W. Biederman
> <[email protected]> wrote:
>>
>> Except that today d_invalidate drops the dcache lock and
>> calls shrink_dcache_parent. Which gets you into exactly the same
>> complex "walk parents and check all siblings" code.
>
> Hmm. It only does that for directories that have sub-entries, though.
>
> I think you may care just about directories (because that's what your
> series is about), but d_invalidate() is used for other cases too,
> notably d_revalidate() (ie things like stale NFS lookups of normal
> files).
>
> That said, I'll have to think about this more. If d_subdir is empty, I
> guess d_walk() will be fairly cheap. It's very different, but maybe
> not as disastrous as I thought.
Thank you for taking the time to look.
It is also very true that I have been mostly focusing on the semantics
and correctness rather than on the performance of these changes.
I did keep a weather eye on the performance impact of these changes
though and as long as we can stand a slow down in the rare case where
mount points are present my changes should be fine.
Eric
[email protected] (Eric W. Biederman) writes:
> v2: Always drop the lock when exiting early.
> v3: Make detach_mounts robust about freeing several
> mounts on the same mountpoint at one time, and remove
> the unneeded mnt_list list test.
> v4: Document the purpose of detach_mounts and why new_mountpoint is
> safe to call.
>
> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
> fs/mount.h | 2 ++
> fs/namespace.c | 39 +++++++++++++++++++++++++++++++++++++++
> 2 files changed, 41 insertions(+), 0 deletions(-)
>
> diff --git a/fs/mount.h b/fs/mount.h
> index 50a72d46e7a6..2b470f34e665 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -84,6 +84,8 @@ extern struct mount *__lookup_mnt_last(struct vfsmount *, struct dentry *);
>
> extern bool legitimize_mnt(struct vfsmount *, unsigned);
>
> +extern void detach_mounts(struct dentry *dentry);
After Linus's rant about performance I took a second look, and lo and
behold there is a significant optimization oppportunity missed here.
This should almost certainly be:
extern void __detach_mounts(struct dentry *dentry);
static inline void detach_mounts(struct dentry *dentry)
{
if (unlikely(d_mountpoint(dentry)))
return;
__detach_mounts(dentry);
}
Just so I don't take the namespace lock when not necessary.
Although wether this should be inline or at the start of detach_mounts
and out of line I don't know. Sigh. I will have to play with this
detail a bit more.
Eric
> static inline void get_mnt_ns(struct mnt_namespace *ns)
> {
> atomic_inc(&ns->count);
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 33db9e95bd5c..7abbf722ce18 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1359,6 +1359,45 @@ static int do_umount(struct mount *mnt, int flags)
> return retval;
> }
>
> +/*
> + * detach_mounts - lazily unmount all mounts on the specified dentry
> + *
> + * During unlink, rmdir, and d_drop it is possible to loose the path
> + * to an existing mountpoint, and wind up leaking the mount.
> + * detach_mounts allows lazily unmounting those mounts instead of
> + * leaking them.
> + *
> + * The caller may hold dentry->d_inode->i_mutex.
> + */
> +void detach_mounts(struct dentry *dentry)
> +{
> + struct mountpoint *mp;
> + struct mount *mnt;
> +
> + namespace_lock();
> + if (!d_mountpoint(dentry))
> + goto out_unlock;
> +
> + /*
> + * The namespace lock and d_mountpoint being set guarantees
> + * that new_mountpoint will just be a lookup of the existing
> + * mountpoint structure.
> + */
> + mp = new_mountpoint(dentry);
> + if (IS_ERR(mp))
> + goto out_unlock;
> +
> + lock_mount_hash();
> + while (!list_empty(&mp->m_list)) {
> + mnt = list_first_entry(&mp->m_list, struct mount, mnt_mp_list);
> + umount_tree(mnt, 1);
> + }
> + unlock_mount_hash();
> + put_mountpoint(mp);
> +out_unlock:
> + namespace_unlock();
> +}
> +
> /*
> * Is the caller allowed to modify his namespace?
> */
On Sat, Feb 15, 2014 at 01:36:38PM -0800, Eric W. Biederman wrote:
>
> In preparation for allowing mountpoints to be renamed and unlinked
> in remote filesystems and in other mount namespaces test if on a dentry
> there is a mount in the local mount namespace before allowing it to
> be renamed or unlinked.
>
> The primary motivation here are old versions of fusermount unmount
> which is not safe if the a path can be renamed or unlinked while it is
> verifying the mount is safe to unmount. More recent versions are simpler
> and safer by simply using UMOUNT_NOFOLLOW when unmounting a mount
> in a directory owned by an arbitrary user.
>
> Miklos Szeredi <[email protected]> reports this is approach is good
> enough to remove concerns about new kernels mixed with old versions
> of fusermount.
>
> A secondary motivation for restrictions here is that it removing empty
> directories that have non-empty mount points on them appears to
> violate the rule that rmdir can not remove empty directories. As
> Linus Torvalds pointed out this is useful for programs (like git) that
> test if a directory is empty with rmdir.
>
> Therefore this patch arranges to enforce the existing mount point
> semantics for local mount namespace.
>
> v2: Rewrote the test to be a drop in replacement for d_mountpoint
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/mount.h | 9 +++++++++
> fs/namei.c | 8 +++++++-
> fs/namespace.c | 35 +++++++++++++++++++++++++++++++++++
> 3 files changed, 51 insertions(+), 1 deletions(-)
>
> diff --git a/fs/mount.h b/fs/mount.h
> index a17458ca6f29..17d913d86add 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -109,3 +109,12 @@ struct proc_mounts {
> #define proc_mounts(p) (container_of((p), struct proc_mounts, m))
>
> extern const struct seq_operations mounts_op;
> +
> +extern int __is_local_mountpoint(struct dentry *dentry);
> +static inline int is_local_mountpoint(struct dentry *dentry)
> +{
> + if (!d_mountpoint(dentry))
> + return 0;
> +
> + return __is_local_mountpoint(dentry);
> +}
> diff --git a/fs/namei.c b/fs/namei.c
> index d580df2e6804..4e6fe16ef488 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3507,6 +3507,8 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
> mutex_lock(&dentry->d_inode->i_mutex);
>
> error = -EBUSY;
> + if (is_local_mountpoint(dentry))
> + goto out;
> if (d_mountpoint(dentry))
> goto out;
>
> @@ -3622,7 +3624,7 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry, struct inode **delegate
> return -EPERM;
>
> mutex_lock(&target->i_mutex);
> - if (d_mountpoint(dentry))
> + if (is_local_mountpoint(dentry) || d_mountpoint(dentry))
> error = -EBUSY;
> else {
> error = security_inode_unlink(dir, dentry);
> @@ -4001,6 +4003,8 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
> mutex_lock(&target->i_mutex);
>
> error = -EBUSY;
> + if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
> + goto out;
> if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
> goto out;
>
> @@ -4045,6 +4049,8 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> lock_two_nondirectories(source, target);
>
> error = -EBUSY;
> + if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
> + goto out;
> if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
> goto out;
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 22e536705c45..6dc23614d44d 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -631,6 +631,41 @@ struct vfsmount *lookup_mnt(struct path *path)
> return m;
> }
>
> +/*
> + * __is_local_mountpoint - Test to see if dentry is a mountpoint in the
> + * current mount namespace.
> + *
> + * The common case is dentries are not mountpoints at all and that
> + * test is handled inline. For the slow case when we are actually
> + * dealing with a mountpoint of some kind, walk through all of the
> + * mounts in the current mount namespace and test to see if the dentry
> + * is a mountpoint.
> + *
> + * The mount_hashtable is not usable in the context because we
> + * need to identify all mounts that may be in the current mount
> + * namespace not just a mount that happens to have some specified
> + * parent mount.
> + */
> +int __is_local_mountpoint(struct dentry *dentry)
Minor nit: return value of any is_* function is either true or false, so why not
declare it bool?
> +{
> + struct mnt_namespace *ns = current->nsproxy->mnt_ns;
> + struct mount *mnt;
> + int is_covered = 0;
And the same for this var.
> +
> + if (!d_mountpoint(dentry))
> + goto out;
> +
> + down_read(&namespace_sem);
> + list_for_each_entry(mnt, &ns->list, mnt_list) {
> + is_covered = (mnt->mnt_mountpoint == dentry);
> + if (is_covered)
> + break;
> + }
> + up_read(&namespace_sem);
> +out:
> + return is_covered;
> +}
> +
> static struct mountpoint *new_mountpoint(struct dentry *dentry)
> {
> struct list_head *chain = mountpoint_hashtable + hash(NULL, dentry);
> --
> 1.7.5.4
>
On Sat, Feb 15, 2014 at 01:37:26PM -0800, Eric W. Biederman wrote:
>
> v2: Always drop the lock when exiting early.
> v3: Make detach_mounts robust about freeing several
> mounts on the same mountpoint at one time, and remove
> the unneeded mnt_list list test.
> v4: Document the purpose of detach_mounts and why new_mountpoint is
> safe to call.
>
> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
> fs/mount.h | 2 ++
> fs/namespace.c | 39 +++++++++++++++++++++++++++++++++++++++
> 2 files changed, 41 insertions(+), 0 deletions(-)
>
> diff --git a/fs/mount.h b/fs/mount.h
> index 50a72d46e7a6..2b470f34e665 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -84,6 +84,8 @@ extern struct mount *__lookup_mnt_last(struct vfsmount *, struct dentry *);
>
> extern bool legitimize_mnt(struct vfsmount *, unsigned);
>
> +extern void detach_mounts(struct dentry *dentry);
> +
> static inline void get_mnt_ns(struct mnt_namespace *ns)
> {
> atomic_inc(&ns->count);
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 33db9e95bd5c..7abbf722ce18 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1359,6 +1359,45 @@ static int do_umount(struct mount *mnt, int flags)
> return retval;
> }
>
> +/*
> + * detach_mounts - lazily unmount all mounts on the specified dentry
> + *
> + * During unlink, rmdir, and d_drop it is possible to loose the path
> + * to an existing mountpoint, and wind up leaking the mount.
> + * detach_mounts allows lazily unmounting those mounts instead of
> + * leaking them.
> + *
> + * The caller may hold dentry->d_inode->i_mutex.
> + */
> +void detach_mounts(struct dentry *dentry)
> +{
> + struct mountpoint *mp;
> + struct mount *mnt;
> +
> + namespace_lock();
> + if (!d_mountpoint(dentry))
> + goto out_unlock;
> +
> + /*
> + * The namespace lock and d_mountpoint being set guarantees
> + * that new_mountpoint will just be a lookup of the existing
> + * mountpoint structure.
> + */
> + mp = new_mountpoint(dentry);
Howabout a get_mountpoint(dentry) helper, that returns NULL if it turns out to
be not a mountpoint? And, as an added bonus, you can drop the comment above as
well.
> + if (IS_ERR(mp))
> + goto out_unlock;
> +
> + lock_mount_hash();
> + while (!list_empty(&mp->m_list)) {
> + mnt = list_first_entry(&mp->m_list, struct mount, mnt_mp_list);
> + umount_tree(mnt, 1);
> + }
> + unlock_mount_hash();
> + put_mountpoint(mp);
> +out_unlock:
> + namespace_unlock();
> +}
> +
> /*
> * Is the caller allowed to modify his namespace?
> */
> --
> 1.7.5.4
>
On Sat, Feb 15, 2014 at 01:39:22PM -0800, Eric W. Biederman wrote:
>
> Now that d_invalidate is the only caller of check_submounts_and_drop,
> expand check_submounts_and_drop inline in d_invalidate.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/dcache.c | 55 +++++++++++++++++++----------------------------
> include/linux/dcache.h | 1 -
> 2 files changed, 22 insertions(+), 34 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 27585b1dd6f1..5b41205cbf33 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -608,32 +608,6 @@ kill_it:
> }
> EXPORT_SYMBOL(dput);
>
> -/**
> - * d_invalidate - invalidate a dentry
> - * @dentry: dentry to invalidate
> - *
> - * Try to invalidate the dentry if it turns out to be
> - * possible. If there are reasons not to delete it
> - * return -EBUSY. On success return 0.
> - *
> - * no dcache lock.
> - */
> -
> -int d_invalidate(struct dentry * dentry)
> -{
> - /*
> - * If it's already been dropped, return OK.
> - */
> - spin_lock(&dentry->d_lock);
> - if (d_unhashed(dentry)) {
> - spin_unlock(&dentry->d_lock);
> - return 0;
> - }
> - spin_unlock(&dentry->d_lock);
> -
> - return check_submounts_and_drop(dentry);
> -}
> -EXPORT_SYMBOL(d_invalidate);
>
> /* This must be called with d_lock held */
> static inline void __dget_dlock(struct dentry *dentry)
> @@ -1175,7 +1149,7 @@ EXPORT_SYMBOL(have_submounts);
> * reachable (e.g. NFS can unhash a directory dentry and then the complete
> * subtree can become unreachable).
> *
> - * Only one of check_submounts_and_drop() and d_set_mounted() must succeed. For
> + * Only one of d_invalidate() and d_set_mounted() must succeed. For
> * this reason take rename_lock and d_lock on dentry and ancestors.
> */
> int d_set_mounted(struct dentry *dentry)
> @@ -1184,7 +1158,7 @@ int d_set_mounted(struct dentry *dentry)
> int ret = -ENOENT;
> write_seqlock(&rename_lock);
> for (p = dentry->d_parent; !IS_ROOT(p); p = p->d_parent) {
> - /* Need exclusion wrt. check_submounts_and_drop() */
> + /* Need exclusion wrt. d_invalidate() */
> spin_lock(&p->d_lock);
> if (unlikely(d_unhashed(p))) {
> spin_unlock(&p->d_lock);
> @@ -1400,18 +1374,33 @@ static void check_and_drop(void *_data)
> }
>
> /**
> - * check_submounts_and_drop - detach submounts, prune dcache, and drop
> + * d_invalidate - detach submounts, prune dcache, and drop
> + * @dentry: dentry to invalidate (aka detach, prune and drop)
> + *
> + * Try to invalidate the dentry if it turns out to be
> + * possible. If there are reasons not to delete it
> + * return -EBUSY. On success return 0.
> + *
> + * no dcache lock.
> *
> * The final d_drop is done as an atomic operation relative to
> * rename_lock ensuring there are no races with d_set_mounted. This
> * ensures there are no unhashed dentries on the path to a mountpoint.
> - *
> - * @dentry: dentry to detach, prune and drop
> */
> -int check_submounts_and_drop(struct dentry *dentry)
> +int d_invalidate(struct dentry *dentry)
> {
> int ret = 0;
>
> + /*
> + * If it's already been dropped, return OK.
> + */
> + spin_lock(&dentry->d_lock);
> + if (d_unhashed(dentry)) {
> + spin_unlock(&dentry->d_lock);
> + return 0;
> + }
> + spin_unlock(&dentry->d_lock);
> +
> /* Negative dentries can be dropped without further checks */
> if (!dentry->d_inode) {
> d_drop(dentry);
You can optimize this by including the negative check within the above d_locked
region and calling __d_drop() instead.
> @@ -1445,7 +1434,7 @@ int check_submounts_and_drop(struct dentry *dentry)
> out:
> return ret;
> }
> -EXPORT_SYMBOL(check_submounts_and_drop);
> +EXPORT_SYMBOL(d_invalidate);
>
> /**
> * __d_alloc - allocate a dcache entry
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index bf72e9ac6de0..ae77222c3e86 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -265,7 +265,6 @@ extern void d_prune_aliases(struct inode *);
>
> /* test whether we have any submounts in a subdir tree */
> extern int have_submounts(struct dentry *);
> -extern int check_submounts_and_drop(struct dentry *);
>
> /*
> * This adds the entry to the hash queues.
> --
> 1.7.5.4
>
On Sat, Feb 15, 2014 at 01:34:43PM -0800, Eric W. Biederman wrote:
>
> The last round of this patchset the semantics of dropping mounts on
> unlink were agreed upon so long as we preserve the current semantics in
> a single mount namespace.
>
> There are two big changes with this version of the patchset.
> 1) The test for a mount in the current mount namespace has been
> rewritten so that it is actually correct and is now a drop in
> replacement for d_mountpoint. Removing the need for code changes
> that could cause unitended regressions.
>
> 2) This work has been comprehensivly extend to d_revalidate and
> d_invalidate allowing some long standing bugs to be fixed at the end
> of this patch series.
>
> In net this should result in a kernel that is more comprehensible and
> maintainble as well as fixing some possible mount leaks, and removing
> the possibility of DOS attacks from evily placed mount points, in other
> mount namespaces.
This patchset looks good to me. You can add
Reviewed-by: Miklos Szeredi <[email protected]>
Thanks,
Miklos
>
> Eric W. Biederman (11):
> vfs: Document the effect of d_revalidate on d_find_alias
> vfs: More precise tests in d_invalidate
> vfs: Don't allow overwriting mounts in the current mount namespace
> vfs: Keep a list of mounts on a mount point
> vfs: Add a function to lazily unmount all mounts from any dentry.
> vfs: Lazily remove mounts on unlinked files and directories.
> vfs: Remove unnecessary calls of check_submounts_and_drop
> vfs: Merge check_submounts_and_drop and d_invalidate
> vfs: Make d_invalidate return void
> vfs: Remove d_drop calls from d_revalidate implementations
> proc: Update proc_flush_task_mnt to use d_invalidate
>
> fs/afs/dir.c | 5 --
> fs/btrfs/ioctl.c | 5 +--
> fs/ceph/dir.c | 1 -
> fs/cifs/readdir.c | 6 +--
> fs/dcache.c | 140 +++++++++++++++++-------------------------------
> fs/fuse/dir.c | 7 +--
> fs/gfs2/dentry.c | 3 -
> fs/kernfs/dir.c | 11 ----
> fs/mount.h | 13 +++++
> fs/namei.c | 28 ++++++----
> fs/namespace.c | 80 +++++++++++++++++++++++++++
> fs/nfs/dir.c | 7 +--
> fs/proc/base.c | 10 +---
> fs/proc/fd.c | 2 -
> include/linux/dcache.h | 3 +-
> 15 files changed, 167 insertions(+), 154 deletions(-)
>
> Eric
>> diff --git a/fs/namespace.c b/fs/namespace.c
>> index 22e536705c45..6dc23614d44d 100644
>> --- a/fs/namespace.c
>> +++ b/fs/namespace.c
>> @@ -631,6 +631,41 @@ struct vfsmount *lookup_mnt(struct path *path)
>> return m;
>> }
>>
>> +/*
>> + * __is_local_mountpoint - Test to see if dentry is a mountpoint in the
>> + * current mount namespace.
>> + *
>> + * The common case is dentries are not mountpoints at all and that
>> + * test is handled inline. For the slow case when we are actually
>> + * dealing with a mountpoint of some kind, walk through all of the
>> + * mounts in the current mount namespace and test to see if the dentry
>> + * is a mountpoint.
>> + *
>> + * The mount_hashtable is not usable in the context because we
>> + * need to identify all mounts that may be in the current mount
>> + * namespace not just a mount that happens to have some specified
>> + * parent mount.
>> + */
>> +int __is_local_mountpoint(struct dentry *dentry)
>
> Minor nit: return value of any is_* function is either true or false, so why not
> declare it bool?
Because I am working on the core of the kernel and C compilers do weird
things with bool variables (storing them in bytes...). I expected a
type that the C compiler does not do weird things with would be more
readily received on a path whose performance people are interested in.
>> +{
>> + struct mnt_namespace *ns = current->nsproxy->mnt_ns;
>> + struct mount *mnt;
>> + int is_covered = 0;
>
> And the same for this var.
>
>> +
>> + if (!d_mountpoint(dentry))
>> + goto out;
>> +
>> + down_read(&namespace_sem);
>> + list_for_each_entry(mnt, &ns->list, mnt_list) {
>> + is_covered = (mnt->mnt_mountpoint == dentry);
>> + if (is_covered)
>> + break;
>> + }
>> + up_read(&namespace_sem);
>> +out:
>> + return is_covered;
>> +}
>> +
>> static struct mountpoint *new_mountpoint(struct dentry *dentry)
>> {
>> struct list_head *chain = mountpoint_hashtable + hash(NULL, dentry);
>> --
>> 1.7.5.4
Eric
On Tue, Feb 18, 2014 at 1:28 PM, Eric W. Biederman
<[email protected]> wrote:
>
>>
>> Minor nit: return value of any is_* function is either true or false, so why not
>> declare it bool?
>
> Because I am working on the core of the kernel and C compilers do weird
> things with bool variables (storing them in bytes...). I expected a
> type that the C compiler does not do weird things with would be more
> readily received on a path whose performance people are interested in.
Yeah, I have to say that I'm not a huge fan of "bool". It has some odd
properties, especially in memory (ie as a structure member).
For this kind of function return value it actually tends to work very
well, and in fact often generates slightly better code than "int". So
I don't _hate_ bool, and we've certainly had a lot more use creep in
lately, but I also don't really see "bool" as much of an upside.
Linus
On Tue, 18 Feb 2014 14:20:29 -0800 Linus Torvalds <[email protected]> wrote:
>
> For this kind of function return value it actually tends to work very
> well, and in fact often generates slightly better code than "int". So
> I don't _hate_ bool, and we've certainly had a lot more use creep in
> lately, but I also don't really see "bool" as much of an upside.
And in function declarations, it makes it very obvious that the function
is not one of our "return 0 or -ERROR" ones.
--
Cheers,
Stephen Rothwell [email protected]
Stephen Rothwell <[email protected]> writes:
> On Tue, 18 Feb 2014 14:20:29 -0800 Linus Torvalds <[email protected]> wrote:
>>
>> For this kind of function return value it actually tends to work very
>> well, and in fact often generates slightly better code than "int". So
>> I don't _hate_ bool, and we've certainly had a lot more use creep in
>> lately, but I also don't really see "bool" as much of an upside.
>
> And in function declarations, it makes it very obvious that the function
> is not one of our "return 0 or -ERROR" ones.
So I played with this instance in particular. The only difference winds
up being by the bool version uses byte instructions on %al and %bl
instead of their 32 bit interger equivalents on %eax and %ebx.
I also benchmarked the difference and on the most sensitive test I could
find. will-it-scale/unlink2 (aka create,close,unlink each process in a
separate directory). There were no measurable performance differences.
So for purposes of better documentation I have changed the function, and
I will repost my patches shortly.
Eric
Miklos Szeredi <[email protected]> writes:
> On Sat, Feb 15, 2014 at 01:39:22PM -0800, Eric W. Biederman wrote:
>>
>> Now that d_invalidate is the only caller of check_submounts_and_drop,
>> expand check_submounts_and_drop inline in d_invalidate.
>>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> fs/dcache.c | 55 +++++++++++++++++++----------------------------
>> include/linux/dcache.h | 1 -
>> 2 files changed, 22 insertions(+), 34 deletions(-)
>>
>> diff --git a/fs/dcache.c b/fs/dcache.c
>> index 27585b1dd6f1..5b41205cbf33 100644
>> --- a/fs/dcache.c
>> +++ b/fs/dcache.c
>> -int check_submounts_and_drop(struct dentry *dentry)
>> +int d_invalidate(struct dentry *dentry)
>> {
>> int ret = 0;
>>
>> + /*
>> + * If it's already been dropped, return OK.
>> + */
>> + spin_lock(&dentry->d_lock);
>> + if (d_unhashed(dentry)) {
>> + spin_unlock(&dentry->d_lock);
>> + return 0;
>> + }
>> + spin_unlock(&dentry->d_lock);
>> +
>> /* Negative dentries can be dropped without further checks */
>> if (!dentry->d_inode) {
>> d_drop(dentry);
>
>
> You can optimize this by including the negative check within the above d_locked
> region and calling __d_drop() instead.
For this patch just moving the code and not changing it is the corret
thing to do because it helps with review and understanding the code.
There are two ways I could see going with optimizing the preamble.
Simply dropping the d_lock from around the d_unhashed test as a pointer
dereference should be atomic, and the test is racy against
d_materialise_unique. (We don't always hold the parent
directories inode mutex when d_invalidate is called). So the d_lock
buys us very little. Alternatively we could move the work into the
d_walk callbacks.
That kind of optimization deserves it's own patch that can be reviewed
independently.
Eric
Miklos Szeredi <[email protected]> writes:
> On Sat, Feb 15, 2014 at 01:37:26PM -0800, Eric W. Biederman wrote:
>>
>> v2: Always drop the lock when exiting early.
>> v3: Make detach_mounts robust about freeing several
>> mounts on the same mountpoint at one time, and remove
>> the unneeded mnt_list list test.
>> v4: Document the purpose of detach_mounts and why new_mountpoint is
>> safe to call.
>>
>> Signed-off-by: Eric W. Biederman <[email protected]>
>> ---
>> fs/mount.h | 2 ++
>> fs/namespace.c | 39 +++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 41 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/mount.h b/fs/mount.h
>> index 50a72d46e7a6..2b470f34e665 100644
>> --- a/fs/mount.h
>> +++ b/fs/mount.h
>> @@ -84,6 +84,8 @@ extern struct mount *__lookup_mnt_last(struct vfsmount *, struct dentry *);
>>
>> extern bool legitimize_mnt(struct vfsmount *, unsigned);
>>
>> +extern void detach_mounts(struct dentry *dentry);
>> +
>> static inline void get_mnt_ns(struct mnt_namespace *ns)
>> {
>> atomic_inc(&ns->count);
>> diff --git a/fs/namespace.c b/fs/namespace.c
>> index 33db9e95bd5c..7abbf722ce18 100644
>> --- a/fs/namespace.c
>> +++ b/fs/namespace.c
>> @@ -1359,6 +1359,45 @@ static int do_umount(struct mount *mnt, int flags)
>> return retval;
>> }
>>
>> +/*
>> + * detach_mounts - lazily unmount all mounts on the specified dentry
>> + *
>> + * During unlink, rmdir, and d_drop it is possible to loose the path
>> + * to an existing mountpoint, and wind up leaking the mount.
>> + * detach_mounts allows lazily unmounting those mounts instead of
>> + * leaking them.
>> + *
>> + * The caller may hold dentry->d_inode->i_mutex.
>> + */
>> +void detach_mounts(struct dentry *dentry)
>> +{
>> + struct mountpoint *mp;
>> + struct mount *mnt;
>> +
>> + namespace_lock();
>> + if (!d_mountpoint(dentry))
>> + goto out_unlock;
>> +
>> + /*
>> + * The namespace lock and d_mountpoint being set guarantees
>> + * that new_mountpoint will just be a lookup of the existing
>> + * mountpoint structure.
>> + */
>> + mp = new_mountpoint(dentry);
>
> Howabout a get_mountpoint(dentry) helper, that returns NULL if it turns out to
> be not a mountpoint? And, as an added bonus, you can drop the comment above as
> well.
I hate to admit it but that is a nice change. Especially as it allows
removing the d_mounpoint check inside of namespace_lock. I still need a
cheak d_mounpoint check outside of namespace lock but inside it can go.
The first time I looked at doing that I missed something and the change
looked too awkward to be worth implementing :(
Eric
After an interesting set of review comments this patchset is back with
very modest changes.
I spent quite a bit of timing trying to see if I could understand and
possibly reproduce the poor issues reported by Fengguang Wu. Ultimately
I wound up running dbench and will-it-scale/unlink2 on a ramfs
filesystem with a fixed clock speed so I could get enough jitter out of
the benchmark numbers to that I could make some sense of the results.
On my bitty box I could not find any problems with dbench,
will-it-scale/unlink2 I measured a small loss of performance because
detach_mounts was silly and took the lock before testing d_mountpoint.
With that fixed I can not see any performance differences between my
code and unmodified 3.14-rcX. Nor can I see any reason to suspect
there might be. The core vfs changes are quite modest.
Now I did not benchark d_invalidate, but check_submounts_and_drop is
what any sane distributed filesystem is doing today, and as discussed in
the review the actual changes are very modest even in d_invalidate.
In the first round of this patchset the semantics of dropping mounts on
unlink were agreed to so long as we preserve the current semantics in
a single mount namespace.
The are really only two changes in this version of the patchset.
1) is_local_mountpoint now returns a bool.
2) detach_mounts was simplified/cleanup and made more performant.
In net this should result in a kernel that is more comprehensible and
maintainble as well as fixing some possible mount leaks, and removing
the possibility of DOS attacks from evily placed mount points, in other
mount namespaces.
For people who prefer git trees these changes are available at:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git vfs-detach-mounts3
Eric W. Biederman (12):
vfs: Document the effect of d_revalidate on d_find_alias
vfs: More precise tests in d_invalidate
vfs: Don't allow overwriting mounts in the current mount namespace
vfs: Keep a list of mounts on a mount point
vfs: factor out lookup_mountpoint from new_mountpoint
vfs: Add a function to lazily unmount all mounts from any dentry.
vfs: Lazily remove mounts on unlinked files and directories.
vfs: Remove unnecessary calls of check_submounts_and_drop
vfs: Merge check_submounts_and_drop and d_invalidate
vfs: Make d_invalidate return void
vfs: Remove d_drop calls from d_revalidate implementations
proc: Update proc_flush_task_mnt to use d_invalidate
fs/afs/dir.c | 5 --
fs/btrfs/ioctl.c | 5 +--
fs/ceph/dir.c | 1 -
fs/cifs/readdir.c | 6 +--
fs/dcache.c | 140 +++++++++++++++++-------------------------------
fs/fuse/dir.c | 7 +--
fs/gfs2/dentry.c | 3 -
fs/kernfs/dir.c | 11 ----
fs/mount.h | 20 +++++++
fs/namei.c | 28 ++++++----
fs/namespace.c | 87 +++++++++++++++++++++++++++++-
fs/nfs/dir.c | 7 +--
fs/proc/base.c | 10 +---
fs/proc/fd.c | 2 -
include/linux/dcache.h | 3 +-
15 files changed, 178 insertions(+), 157 deletions(-)
Eric
d_drop or check_submounts_and_drop called from d_revalidate can result
in renamed directories with child dentries being unhashed. These
renamed and drop directory dentries can be rehashed after
d_materialise_unique uses d_find_alias to find them.
Reviewed-by: Miklos Szeredi <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/dcache.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 265e0ce9769c..c71c86f7780e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -726,7 +726,8 @@ EXPORT_SYMBOL(dget_parent);
* acquire the reference to alias and return it. Otherwise return NULL.
* Notice that if inode is a directory there can be only one alias and
* it can be unhashed only if it has no children, or if it is the root
- * of a filesystem.
+ * of a filesystem, or if the directory was renamed and d_revalidate
+ * was the first vfs operation to notice.
*
* If the inode has an IS_ROOT, DCACHE_DISCONNECTED alias, then prefer
* any other hashed alias over that one unless @want_discon is set,
--
1.7.5.4
The current comments in d_invalidate about what and why it is doing
what it is doing are wildly off-base. Which is not surprising as
the comments date back to last minute bug fix of the 2.2 kernel.
The big fat lie of a comment said: If it's a directory, we can't drop
it for fear of somebody re-populating it with children (even though
dropping it would make it unreachable from that root, we still might
repopulate it if it was a working directory or similar).
The truth is that for remote filesystems the failure of d_revalidate
and d_weak_revalidate prevents us from populating or otherwise
inappropriately using a directory. For local filesystems and for
local directory removals the setting of S_DEAD prevents us from
populating or otherwise inappropriate using a directory.
The current rules are:
- To prevent mount point leaks dentries that are mount points or that
have childrent that are mount points may not be be unhashed.
- All dentries may be unhashed.
- Directories may be rehashed with d_materialise_unique
check_submounts_and_drop implements this already for well maintained
remote filesystems so implement the current rules in d_invalidate
by just calling check_submounts_and_drop.
The one difference between d_invalidate and check_submounts_and_drop
is that d_invalidate must respect it when a d_revalidate method has
earlier called d_drop so preserve the d_unhashed check in
d_invalidate.
Reviewed-by: Miklos Szeredi <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/dcache.c | 38 ++++----------------------------------
1 files changed, 4 insertions(+), 34 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index c71c86f7780e..b0add629f5fe 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -613,9 +613,8 @@ EXPORT_SYMBOL(dput);
* @dentry: dentry to invalidate
*
* Try to invalidate the dentry if it turns out to be
- * possible. If there are other dentries that can be
- * reached through this one we can't delete it and we
- * return -EBUSY. On success we return 0.
+ * possible. If there are reasons not to delete it
+ * return -EBUSY. On success return 0.
*
* no dcache lock.
*/
@@ -630,38 +629,9 @@ int d_invalidate(struct dentry * dentry)
spin_unlock(&dentry->d_lock);
return 0;
}
- /*
- * Check whether to do a partial shrink_dcache
- * to get rid of unused child entries.
- */
- if (!list_empty(&dentry->d_subdirs)) {
- spin_unlock(&dentry->d_lock);
- shrink_dcache_parent(dentry);
- spin_lock(&dentry->d_lock);
- }
-
- /*
- * Somebody else still using it?
- *
- * If it's a directory, we can't drop it
- * for fear of somebody re-populating it
- * with children (even though dropping it
- * would make it unreachable from the root,
- * we might still populate it if it was a
- * working directory or similar).
- * We also need to leave mountpoints alone,
- * directory or not.
- */
- if (dentry->d_lockref.count > 1 && dentry->d_inode) {
- if (S_ISDIR(dentry->d_inode->i_mode) || d_mountpoint(dentry)) {
- spin_unlock(&dentry->d_lock);
- return -EBUSY;
- }
- }
-
- __d_drop(dentry);
spin_unlock(&dentry->d_lock);
- return 0;
+
+ return check_submounts_and_drop(dentry);
}
EXPORT_SYMBOL(d_invalidate);
--
1.7.5.4
In preparation for allowing mountpoints to be renamed and unlinked
in remote filesystems and in other mount namespaces test if on a dentry
there is a mount in the local mount namespace before allowing it to
be renamed or unlinked.
The primary motivation here are old versions of fusermount unmount
which is not safe if the a path can be renamed or unlinked while it is
verifying the mount is safe to unmount. More recent versions are simpler
and safer by simply using UMOUNT_NOFOLLOW when unmounting a mount
in a directory owned by an arbitrary user.
Miklos Szeredi <[email protected]> reports this is approach is good
enough to remove concerns about new kernels mixed with old versions
of fusermount.
A secondary motivation for restrictions here is that it removing empty
directories that have non-empty mount points on them appears to
violate the rule that rmdir can not remove empty directories. As
Linus Torvalds pointed out this is useful for programs (like git) that
test if a directory is empty with rmdir.
Therefore this patch arranges to enforce the existing mount point
semantics for local mount namespace.
v2: Rewrote the test to be a drop in replacement for d_mountpoint
v3: Use bool instead of int as the return type of is_local_mountpoint
Reviewed-by: Miklos Szeredi <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/mount.h | 9 +++++++++
fs/namei.c | 8 +++++++-
fs/namespace.c | 35 +++++++++++++++++++++++++++++++++++
3 files changed, 51 insertions(+), 1 deletions(-)
diff --git a/fs/mount.h b/fs/mount.h
index a17458ca6f29..18511835dc28 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -109,3 +109,12 @@ struct proc_mounts {
#define proc_mounts(p) (container_of((p), struct proc_mounts, m))
extern const struct seq_operations mounts_op;
+
+extern bool __is_local_mountpoint(struct dentry *dentry);
+static inline bool is_local_mountpoint(struct dentry *dentry)
+{
+ if (!d_mountpoint(dentry))
+ return false;
+
+ return __is_local_mountpoint(dentry);
+}
diff --git a/fs/namei.c b/fs/namei.c
index d580df2e6804..4e6fe16ef488 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3507,6 +3507,8 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
mutex_lock(&dentry->d_inode->i_mutex);
error = -EBUSY;
+ if (is_local_mountpoint(dentry))
+ goto out;
if (d_mountpoint(dentry))
goto out;
@@ -3622,7 +3624,7 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry, struct inode **delegate
return -EPERM;
mutex_lock(&target->i_mutex);
- if (d_mountpoint(dentry))
+ if (is_local_mountpoint(dentry) || d_mountpoint(dentry))
error = -EBUSY;
else {
error = security_inode_unlink(dir, dentry);
@@ -4001,6 +4003,8 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
mutex_lock(&target->i_mutex);
error = -EBUSY;
+ if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
+ goto out;
if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
goto out;
@@ -4045,6 +4049,8 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
lock_two_nondirectories(source, target);
error = -EBUSY;
+ if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
+ goto out;
if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
goto out;
diff --git a/fs/namespace.c b/fs/namespace.c
index 22e536705c45..abe2f56efaea 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -631,6 +631,41 @@ struct vfsmount *lookup_mnt(struct path *path)
return m;
}
+/*
+ * __is_local_mountpoint - Test to see if dentry is a mountpoint in the
+ * current mount namespace.
+ *
+ * The common case is dentries are not mountpoints at all and that
+ * test is handled inline. For the slow case when we are actually
+ * dealing with a mountpoint of some kind, walk through all of the
+ * mounts in the current mount namespace and test to see if the dentry
+ * is a mountpoint.
+ *
+ * The mount_hashtable is not usable in the context because we
+ * need to identify all mounts that may be in the current mount
+ * namespace not just a mount that happens to have some specified
+ * parent mount.
+ */
+bool __is_local_mountpoint(struct dentry *dentry)
+{
+ struct mnt_namespace *ns = current->nsproxy->mnt_ns;
+ struct mount *mnt;
+ bool is_covered = false;
+
+ if (!d_mountpoint(dentry))
+ goto out;
+
+ down_read(&namespace_sem);
+ list_for_each_entry(mnt, &ns->list, mnt_list) {
+ is_covered = (mnt->mnt_mountpoint == dentry);
+ if (is_covered)
+ break;
+ }
+ up_read(&namespace_sem);
+out:
+ return is_covered;
+}
+
static struct mountpoint *new_mountpoint(struct dentry *dentry)
{
struct list_head *chain = mountpoint_hashtable + hash(NULL, dentry);
--
1.7.5.4
To spot any possible problems call BUG if a mountpoint
is put when it's list of mounts is not empty.
Reviewed-by: Miklos Szeredi <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/mount.h | 2 ++
fs/namespace.c | 6 ++++++
2 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/fs/mount.h b/fs/mount.h
index 18511835dc28..5375da0dc570 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -21,6 +21,7 @@ struct mnt_pcp {
struct mountpoint {
struct list_head m_hash;
struct dentry *m_dentry;
+ struct list_head m_list;
int m_count;
};
@@ -48,6 +49,7 @@ struct mount {
struct mount *mnt_master; /* slave is on master->mnt_slave_list */
struct mnt_namespace *mnt_ns; /* containing namespace */
struct mountpoint *mnt_mp; /* where is it mounted */
+ struct list_head mnt_mp_list; /* list mounts with the same mountpoint */
#ifdef CONFIG_FSNOTIFY
struct hlist_head mnt_fsnotify_marks;
__u32 mnt_fsnotify_mask;
diff --git a/fs/namespace.c b/fs/namespace.c
index abe2f56efaea..32d380eabdac 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -195,6 +195,7 @@ static struct mount *alloc_vfsmnt(const char *name)
INIT_LIST_HEAD(&mnt->mnt_share);
INIT_LIST_HEAD(&mnt->mnt_slave_list);
INIT_LIST_HEAD(&mnt->mnt_slave);
+ INIT_LIST_HEAD(&mnt->mnt_mp_list);
#ifdef CONFIG_FSNOTIFY
INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
#endif
@@ -695,6 +696,7 @@ static struct mountpoint *new_mountpoint(struct dentry *dentry)
mp->m_dentry = dentry;
mp->m_count = 1;
list_add(&mp->m_hash, chain);
+ INIT_LIST_HEAD(&mp->m_list);
return mp;
}
@@ -702,6 +704,7 @@ static void put_mountpoint(struct mountpoint *mp)
{
if (!--mp->m_count) {
struct dentry *dentry = mp->m_dentry;
+ BUG_ON(!list_empty(&mp->m_list));
spin_lock(&dentry->d_lock);
dentry->d_flags &= ~DCACHE_MOUNTED;
spin_unlock(&dentry->d_lock);
@@ -748,6 +751,7 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
mnt->mnt_mountpoint = mnt->mnt.mnt_root;
list_del_init(&mnt->mnt_child);
list_del_init(&mnt->mnt_hash);
+ list_del_init(&mnt->mnt_mp_list);
put_mountpoint(mnt->mnt_mp);
mnt->mnt_mp = NULL;
}
@@ -764,6 +768,7 @@ void mnt_set_mountpoint(struct mount *mnt,
child_mnt->mnt_mountpoint = dget(mp->m_dentry);
child_mnt->mnt_parent = mnt;
child_mnt->mnt_mp = mp;
+ list_add_tail(&child_mnt->mnt_mp_list, &mp->m_list);
}
/*
@@ -1246,6 +1251,7 @@ void umount_tree(struct mount *mnt, int how)
p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
list_del_init(&p->mnt_child);
if (mnt_has_parent(p)) {
+ list_del_init(&p->mnt_mp_list);
put_mountpoint(p->mnt_mp);
/* move the reference to mountpoint into ->mnt_ex_mountpoint */
p->mnt_ex_mountpoint.dentry = p->mnt_mountpoint;
--
1.7.5.4
I am shortly going to add a new user of struct mountpoint that
needs to look up existing entries but does not want to create
a struct mountpoint if one does not exist. Therefore to keep
the code simple and easy to read split out lookup_mountpoint
from new_mountpoint.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/namespace.c | 15 ++++++++++++---
1 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 32d380eabdac..f2f03082fe70 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -667,11 +667,10 @@ out:
return is_covered;
}
-static struct mountpoint *new_mountpoint(struct dentry *dentry)
+static struct mountpoint *lookup_mountpoint(struct dentry *dentry)
{
struct list_head *chain = mountpoint_hashtable + hash(NULL, dentry);
struct mountpoint *mp;
- int ret;
list_for_each_entry(mp, chain, m_hash) {
if (mp->m_dentry == dentry) {
@@ -682,6 +681,14 @@ static struct mountpoint *new_mountpoint(struct dentry *dentry)
return mp;
}
}
+ return NULL;
+}
+
+static struct mountpoint *new_mountpoint(struct dentry *dentry)
+{
+ struct list_head *chain = mountpoint_hashtable + hash(NULL, dentry);
+ struct mountpoint *mp;
+ int ret;
mp = kmalloc(sizeof(struct mountpoint), GFP_KERNEL);
if (!mp)
@@ -1702,7 +1709,9 @@ retry:
namespace_lock();
mnt = lookup_mnt(path);
if (likely(!mnt)) {
- struct mountpoint *mp = new_mountpoint(dentry);
+ struct mountpoint *mp = lookup_mountpoint(dentry);
+ if (!mp)
+ mp = new_mountpoint(dentry);
if (IS_ERR(mp)) {
namespace_unlock();
mutex_unlock(&dentry->d_inode->i_mutex);
--
1.7.5.4
The new function detach_mounts comes in two pieces. The first piece
is a static inline test of d_mounpoint that returns immediately
without taking any locks if d_mounpoint is not set. In the common
case when mountpoints are absent this allows the vfs to continue
running with it's same cacheline foot print.
The second piece of detach_mounts __detach_mounts actually does the
work and it assumes that a mountpoint is present so it is slow and
takes namespace_sem for write, and then locks the mount hash (aka
mount_lock) after a struct mountpoint has been found.
With those two locks held each entry on the list of mounts on a
mountpoint is selected and lazily unmounted until all of the mount
have been lazily unmounted.
v7: Wrote a proper change description and removed the changelog
documenting deleted wrong turns.
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/mount.h | 9 +++++++++
fs/namespace.c | 31 +++++++++++++++++++++++++++++++
2 files changed, 40 insertions(+), 0 deletions(-)
diff --git a/fs/mount.h b/fs/mount.h
index 5375da0dc570..c5e717542bbc 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -84,6 +84,15 @@ extern struct mount *__lookup_mnt_last(struct vfsmount *, struct dentry *);
extern bool legitimize_mnt(struct vfsmount *, unsigned);
+extern void __detach_mounts(struct dentry *dentry);
+
+static inline void detach_mounts(struct dentry *dentry)
+{
+ if (!d_mountpoint(dentry))
+ return;
+ __detach_mounts(dentry);
+}
+
static inline void get_mnt_ns(struct mnt_namespace *ns)
{
atomic_inc(&ns->count);
diff --git a/fs/namespace.c b/fs/namespace.c
index f2f03082fe70..52f4174e294c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1366,6 +1366,37 @@ static int do_umount(struct mount *mnt, int flags)
return retval;
}
+/*
+ * __detach_mounts - lazily unmount all mounts on the specified dentry
+ *
+ * During unlink, rmdir, and d_drop it is possible to loose the path
+ * to an existing mountpoint, and wind up leaking the mount.
+ * detach_mounts allows lazily unmounting those mounts instead of
+ * leaking them.
+ *
+ * The caller may hold dentry->d_inode->i_mutex.
+ */
+void __detach_mounts(struct dentry *dentry)
+{
+ struct mountpoint *mp;
+ struct mount *mnt;
+
+ namespace_lock();
+ mp = lookup_mountpoint(dentry);
+ if (!mp)
+ goto out_unlock;
+
+ lock_mount_hash();
+ while (!list_empty(&mp->m_list)) {
+ mnt = list_first_entry(&mp->m_list, struct mount, mnt_mp_list);
+ umount_tree(mnt, 2);
+ }
+ unlock_mount_hash();
+ put_mountpoint(mp);
+out_unlock:
+ namespace_unlock();
+}
+
/*
* Is the caller allowed to modify his namespace?
*/
--
1.7.5.4
With the introduction of mount namespaces and bind mounts it became
possible to access files and directories that on some paths are mount
points but are not mount points on other paths. It is very confusing
when rm -rf somedir returns -EBUSY simply because somedir is mounted
somewhere else. With the addition of user namespaces allowing
unprivileged mounts this condition has gone from annoying to allowing
a DOS attack on other users in the system.
The possibility for mischief is removed by updating the vfs to support
rename, unlink and rmdir on a dentry that is a mountpoint and by
lazily unmounting mountpoints on deleted dentries.
In particular this change allows rename, unlink and rmdir system calls
on a dentry without a mountpoint in the current mount namespace to
succeed, and it allows rename, unlink, and rmdir performed on a
distributed filesystem to update the vfs cache even if when there is a
mount in some namespace on the original dentry.
There are two common patterns of maintaining mounts: Mounts on trusted
paths with the parent directory of the mount point and all ancestory
directories up to / owned by root and modifiable only by root
(i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu,
cpuacct, ...}, /usr, /usr/local). Mounts on unprivileged directories
maintained by fusermount.
In the case of mounts in trusted directories owned by root and
modifiable only by root the current parent directory permissions are
sufficient to ensure a mount point on a trusted path is not removed
or renamed by anyone other than root, even if there is a context
where the there are no mount points to prevent this.
In the case of mounts in directories owned by less privileged users
races with users modifying the path of a mount point are already a
danger. fusermount already uses a combination of chdir,
/proc/<pid>/fd/NNN, and UMOUNT_NOFOLLOW to prevent these races. The
removable of global rename, unlink, and rmdir protection really adds
nothing new to consider only a widening of the attack window, and
fusermount is already safe against unprivileged users modifying the
directory simultaneously.
In principle for perfect userspace programs returning -EBUSY for
unlink, rmdir, and rename of dentires that have mounts in the local
namespace is actually unnecessary. Unfortunately not all userspace
programs are perfect so retaining -EBUSY for unlink, rmdir and rename
of dentries that have mounts in the current mount namespace plays an
important role of maintaining consistency with historical behavior and
making imperfect userspace applications hard to exploit.
v2: Remove spurious old_dentry.
v3: Optimized shrink_submounts_and_drop
Removed unsued afs label
v4: Simplified the changes to check_submounts_and_drop
Do not rename check_submounts_and_drop shrink_submounts_and_drop
Document what why we need atomicity in check_submounts_and_drop
Rely on the parent inode mutex to make d_revalidate and d_invalidate
an atomic unit.
v5: Refcount the mountpoint to detach in case of simultaneous
renames.
Reviewed-by: Miklos Szeredi <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/dcache.c | 60 ++++++++++++++++++++++++++++++++--------------------------
fs/namei.c | 18 ++++++++--------
2 files changed, 42 insertions(+), 36 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index b0add629f5fe..27585b1dd6f1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1374,36 +1374,39 @@ void shrink_dcache_for_umount(struct super_block *sb)
}
}
-static enum d_walk_ret check_and_collect(void *_data, struct dentry *dentry)
+struct detach_data {
+ struct select_data select;
+ struct dentry *mountpoint;
+};
+static enum d_walk_ret detach_and_collect(void *_data, struct dentry *dentry)
{
- struct select_data *data = _data;
+ struct detach_data *data = _data;
if (d_mountpoint(dentry)) {
- data->found = -EBUSY;
+ __dget_dlock(dentry);
+ data->mountpoint = dentry;
return D_WALK_QUIT;
}
- return select_collect(_data, dentry);
+ return select_collect(&data->select, dentry);
}
static void check_and_drop(void *_data)
{
- struct select_data *data = _data;
+ struct detach_data *data = _data;
- if (d_mountpoint(data->start))
- data->found = -EBUSY;
- if (!data->found)
- __d_drop(data->start);
+ if (!data->mountpoint && !data->select.found)
+ __d_drop(data->select.start);
}
/**
- * check_submounts_and_drop - prune dcache, check for submounts and drop
+ * check_submounts_and_drop - detach submounts, prune dcache, and drop
*
- * All done as a single atomic operation relative to has_unlinked_ancestor().
- * Returns 0 if successfully unhashed @parent. If there were submounts then
- * return -EBUSY.
+ * The final d_drop is done as an atomic operation relative to
+ * rename_lock ensuring there are no races with d_set_mounted. This
+ * ensures there are no unhashed dentries on the path to a mountpoint.
*
- * @dentry: dentry to prune and drop
+ * @dentry: dentry to detach, prune and drop
*/
int check_submounts_and_drop(struct dentry *dentry)
{
@@ -1416,19 +1419,24 @@ int check_submounts_and_drop(struct dentry *dentry)
}
for (;;) {
- struct select_data data;
+ struct detach_data data;
- INIT_LIST_HEAD(&data.dispose);
- data.start = dentry;
- data.found = 0;
+ data.mountpoint = NULL;
+ INIT_LIST_HEAD(&data.select.dispose);
+ data.select.start = dentry;
+ data.select.found = 0;
- d_walk(dentry, &data, check_and_collect, check_and_drop);
- ret = data.found;
+ d_walk(dentry, &data, detach_and_collect, check_and_drop);
- if (!list_empty(&data.dispose))
- shrink_dentry_list(&data.dispose);
+ if (data.select.found)
+ shrink_dentry_list(&data.select.dispose);
- if (ret <= 0)
+ if (data.mountpoint) {
+ detach_mounts(data.mountpoint);
+ dput(data.mountpoint);
+ }
+
+ if (!data.mountpoint && !data.select.found)
break;
cond_resched();
@@ -2640,10 +2648,8 @@ static struct dentry *__d_unalias(struct inode *inode,
goto out_err;
m2 = &alias->d_parent->d_inode->i_mutex;
out_unalias:
- if (likely(!d_mountpoint(alias))) {
- __d_move(alias, dentry);
- ret = alias;
- }
+ __d_move(alias, dentry);
+ ret = alias;
out_err:
spin_unlock(&inode->i_lock);
if (m2)
diff --git a/fs/namei.c b/fs/namei.c
index 4e6fe16ef488..3fca30cd448b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3509,8 +3509,6 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
error = -EBUSY;
if (is_local_mountpoint(dentry))
goto out;
- if (d_mountpoint(dentry))
- goto out;
error = security_inode_rmdir(dir, dentry);
if (error)
@@ -3523,6 +3521,7 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
dentry->d_inode->i_flags |= S_DEAD;
dont_mount(dentry);
+ detach_mounts(dentry);
out:
mutex_unlock(&dentry->d_inode->i_mutex);
@@ -3624,7 +3623,7 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry, struct inode **delegate
return -EPERM;
mutex_lock(&target->i_mutex);
- if (is_local_mountpoint(dentry) || d_mountpoint(dentry))
+ if (is_local_mountpoint(dentry))
error = -EBUSY;
else {
error = security_inode_unlink(dir, dentry);
@@ -3633,8 +3632,10 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry, struct inode **delegate
if (error)
goto out;
error = dir->i_op->unlink(dir, dentry);
- if (!error)
+ if (!error) {
dont_mount(dentry);
+ detach_mounts(dentry);
+ }
}
}
out:
@@ -4005,8 +4006,6 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
error = -EBUSY;
if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
goto out;
- if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
- goto out;
error = -EMLINK;
if (max_links && !target && new_dir != old_dir &&
@@ -4022,6 +4021,7 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
if (target) {
target->i_flags |= S_DEAD;
dont_mount(new_dentry);
+ detach_mounts(new_dentry);
}
out:
if (target)
@@ -4051,8 +4051,6 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
error = -EBUSY;
if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
goto out;
- if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
- goto out;
error = try_break_deleg(source, delegated_inode);
if (error)
@@ -4066,8 +4064,10 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
if (error)
goto out;
- if (target)
+ if (target) {
dont_mount(new_dentry);
+ detach_mounts(new_dentry);
+ }
if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE))
d_move(old_dentry, new_dentry);
out:
--
1.7.5.4
Now that check_submounts_and_drop can not fail and is called from
d_invalidate there is no longer a need to call check_submounts_and_drom
from filesystem d_revalidate methods so remove it.
Reviewed-by: Miklos Szeredi <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/afs/dir.c | 5 -----
fs/fuse/dir.c | 3 ---
fs/gfs2/dentry.c | 3 ---
fs/kernfs/dir.c | 11 -----------
fs/nfs/dir.c | 4 ----
5 files changed, 0 insertions(+), 26 deletions(-)
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 529300327f45..a1645b88fe8a 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -669,7 +669,6 @@ static int afs_d_revalidate(struct dentry *dentry, unsigned int flags)
out_valid:
dentry->d_fsdata = dir_version;
-out_skip:
dput(parent);
key_put(key);
_leave(" = 1 [valid]");
@@ -682,10 +681,6 @@ not_found:
spin_unlock(&dentry->d_lock);
out_bad:
- /* don't unhash if we have submounts */
- if (check_submounts_and_drop(dentry) != 0)
- goto out_skip;
-
_debug("dropping dentry %s/%s",
parent->d_name.name, dentry->d_name.name);
dput(parent);
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 1d1292c581c3..5192d0a04d20 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -273,9 +273,6 @@ out:
invalid:
ret = 0;
-
- if (!(flags & LOOKUP_RCU) && check_submounts_and_drop(entry) != 0)
- ret = 1;
goto out;
}
diff --git a/fs/gfs2/dentry.c b/fs/gfs2/dentry.c
index d3a5d4e29ba5..589f4ea9381c 100644
--- a/fs/gfs2/dentry.c
+++ b/fs/gfs2/dentry.c
@@ -93,9 +93,6 @@ invalid_gunlock:
if (!had_lock)
gfs2_glock_dq_uninit(&d_gh);
invalid:
- if (check_submounts_and_drop(dentry) != 0)
- goto valid;
-
dput(parent);
return 0;
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 5104cf5d25c5..43d5ab905238 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -296,21 +296,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
goto out_bad;
mutex_unlock(&kernfs_mutex);
-out_valid:
return 1;
out_bad:
mutex_unlock(&kernfs_mutex);
out_bad_unlocked:
- /*
- * @dentry doesn't match the underlying kernfs node, drop the
- * dentry and force lookup. If we have submounts we must allow the
- * vfs caches to lie about the state of the filesystem to prevent
- * leaks and other nasty things, so use check_submounts_and_drop()
- * instead of d_drop().
- */
- if (check_submounts_and_drop(dentry) != 0)
- goto out_valid;
-
return 0;
}
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index be38b573495a..a8a2f7f86c45 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1143,10 +1143,6 @@ out_zap_parent:
if (IS_ROOT(dentry))
goto out_valid;
}
- /* If we have submounts, don't unhash ! */
- if (check_submounts_and_drop(dentry) != 0)
- goto out_valid;
-
dput(parent);
dfprintk(LOOKUPCACHE, "NFS: %s(%pd2) is invalid\n",
__func__, dentry);
--
1.7.5.4
Now that d_invalidate is the only caller of check_submounts_and_drop,
expand check_submounts_and_drop inline in d_invalidate.
Reviewed-by: Miklos Szeredi <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/dcache.c | 55 +++++++++++++++++++----------------------------
include/linux/dcache.h | 1 -
2 files changed, 22 insertions(+), 34 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 27585b1dd6f1..5b41205cbf33 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -608,32 +608,6 @@ kill_it:
}
EXPORT_SYMBOL(dput);
-/**
- * d_invalidate - invalidate a dentry
- * @dentry: dentry to invalidate
- *
- * Try to invalidate the dentry if it turns out to be
- * possible. If there are reasons not to delete it
- * return -EBUSY. On success return 0.
- *
- * no dcache lock.
- */
-
-int d_invalidate(struct dentry * dentry)
-{
- /*
- * If it's already been dropped, return OK.
- */
- spin_lock(&dentry->d_lock);
- if (d_unhashed(dentry)) {
- spin_unlock(&dentry->d_lock);
- return 0;
- }
- spin_unlock(&dentry->d_lock);
-
- return check_submounts_and_drop(dentry);
-}
-EXPORT_SYMBOL(d_invalidate);
/* This must be called with d_lock held */
static inline void __dget_dlock(struct dentry *dentry)
@@ -1175,7 +1149,7 @@ EXPORT_SYMBOL(have_submounts);
* reachable (e.g. NFS can unhash a directory dentry and then the complete
* subtree can become unreachable).
*
- * Only one of check_submounts_and_drop() and d_set_mounted() must succeed. For
+ * Only one of d_invalidate() and d_set_mounted() must succeed. For
* this reason take rename_lock and d_lock on dentry and ancestors.
*/
int d_set_mounted(struct dentry *dentry)
@@ -1184,7 +1158,7 @@ int d_set_mounted(struct dentry *dentry)
int ret = -ENOENT;
write_seqlock(&rename_lock);
for (p = dentry->d_parent; !IS_ROOT(p); p = p->d_parent) {
- /* Need exclusion wrt. check_submounts_and_drop() */
+ /* Need exclusion wrt. d_invalidate() */
spin_lock(&p->d_lock);
if (unlikely(d_unhashed(p))) {
spin_unlock(&p->d_lock);
@@ -1400,18 +1374,33 @@ static void check_and_drop(void *_data)
}
/**
- * check_submounts_and_drop - detach submounts, prune dcache, and drop
+ * d_invalidate - detach submounts, prune dcache, and drop
+ * @dentry: dentry to invalidate (aka detach, prune and drop)
+ *
+ * Try to invalidate the dentry if it turns out to be
+ * possible. If there are reasons not to delete it
+ * return -EBUSY. On success return 0.
+ *
+ * no dcache lock.
*
* The final d_drop is done as an atomic operation relative to
* rename_lock ensuring there are no races with d_set_mounted. This
* ensures there are no unhashed dentries on the path to a mountpoint.
- *
- * @dentry: dentry to detach, prune and drop
*/
-int check_submounts_and_drop(struct dentry *dentry)
+int d_invalidate(struct dentry *dentry)
{
int ret = 0;
+ /*
+ * If it's already been dropped, return OK.
+ */
+ spin_lock(&dentry->d_lock);
+ if (d_unhashed(dentry)) {
+ spin_unlock(&dentry->d_lock);
+ return 0;
+ }
+ spin_unlock(&dentry->d_lock);
+
/* Negative dentries can be dropped without further checks */
if (!dentry->d_inode) {
d_drop(dentry);
@@ -1445,7 +1434,7 @@ int check_submounts_and_drop(struct dentry *dentry)
out:
return ret;
}
-EXPORT_SYMBOL(check_submounts_and_drop);
+EXPORT_SYMBOL(d_invalidate);
/**
* __d_alloc - allocate a dcache entry
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index bf72e9ac6de0..ae77222c3e86 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -265,7 +265,6 @@ extern void d_prune_aliases(struct inode *);
/* test whether we have any submounts in a subdir tree */
extern int have_submounts(struct dentry *);
-extern int check_submounts_and_drop(struct dentry *);
/*
* This adds the entry to the hash queues.
--
1.7.5.4
Now that d_invalidate can no longer fail, stop returning a useless
return code. For the few callers that checked the return code update
remove the handling of d_invalidate failure.
Reviewed-by: Miklos Szeredi <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/btrfs/ioctl.c | 5 +----
fs/cifs/readdir.c | 6 +-----
fs/dcache.c | 12 +++---------
fs/fuse/dir.c | 4 +---
fs/namei.c | 10 +++++-----
fs/nfs/dir.c | 3 +--
include/linux/dcache.h | 2 +-
7 files changed, 13 insertions(+), 29 deletions(-)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b0134892dc70..349848bd54e2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2229,9 +2229,7 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file,
}
mutex_lock(&inode->i_mutex);
- err = d_invalidate(dentry);
- if (err)
- goto out_unlock;
+ d_invalidate(dentry);
down_write(&root->fs_info->subvol_sem);
@@ -2316,7 +2314,6 @@ out_release:
btrfs_subvolume_release_metadata(root, &block_rsv, qgroup_reserved);
out_up_write:
up_write(&root->fs_info->subvol_sem);
-out_unlock:
mutex_unlock(&inode->i_mutex);
if (!err) {
shrink_dcache_sb(root->fs_info->sb);
diff --git a/fs/cifs/readdir.c b/fs/cifs/readdir.c
index b15862e0f68c..d0e9d0169b37 100644
--- a/fs/cifs/readdir.c
+++ b/fs/cifs/readdir.c
@@ -87,8 +87,6 @@ cifs_prime_dcache(struct dentry *parent, struct qstr *name,
return;
if (dentry) {
- int err;
-
inode = dentry->d_inode;
if (inode) {
/*
@@ -105,10 +103,8 @@ cifs_prime_dcache(struct dentry *parent, struct qstr *name,
goto out;
}
}
- err = d_invalidate(dentry);
+ d_invalidate(dentry);
dput(dentry);
- if (err)
- return;
}
/*
diff --git a/fs/dcache.c b/fs/dcache.c
index 5b41205cbf33..5b78bd98649c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1377,27 +1377,21 @@ static void check_and_drop(void *_data)
* d_invalidate - detach submounts, prune dcache, and drop
* @dentry: dentry to invalidate (aka detach, prune and drop)
*
- * Try to invalidate the dentry if it turns out to be
- * possible. If there are reasons not to delete it
- * return -EBUSY. On success return 0.
- *
* no dcache lock.
*
* The final d_drop is done as an atomic operation relative to
* rename_lock ensuring there are no races with d_set_mounted. This
* ensures there are no unhashed dentries on the path to a mountpoint.
*/
-int d_invalidate(struct dentry *dentry)
+void d_invalidate(struct dentry *dentry)
{
- int ret = 0;
-
/*
* If it's already been dropped, return OK.
*/
spin_lock(&dentry->d_lock);
if (d_unhashed(dentry)) {
spin_unlock(&dentry->d_lock);
- return 0;
+ return;
}
spin_unlock(&dentry->d_lock);
@@ -1432,7 +1426,7 @@ int d_invalidate(struct dentry *dentry)
}
out:
- return ret;
+ return;
}
EXPORT_SYMBOL(d_invalidate);
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 5192d0a04d20..6e920fadb45d 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1225,9 +1225,7 @@ static int fuse_direntplus_link(struct file *file,
d_drop(dentry);
} else if (get_node_id(inode) != o->nodeid ||
((o->attr.mode ^ inode->i_mode) & S_IFMT)) {
- err = d_invalidate(dentry);
- if (err)
- goto out;
+ d_invalidate(dentry);
} else if (is_bad_inode(inode)) {
err = -EIO;
goto out;
diff --git a/fs/namei.c b/fs/namei.c
index 3fca30cd448b..384fcc6a5606 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1259,7 +1259,8 @@ static struct dentry *lookup_dcache(struct qstr *name, struct dentry *dir,
if (error < 0) {
dput(dentry);
return ERR_PTR(error);
- } else if (!d_invalidate(dentry)) {
+ } else {
+ d_invalidate(dentry);
dput(dentry);
dentry = NULL;
}
@@ -1391,10 +1392,9 @@ unlazy:
dput(dentry);
return status;
}
- if (!d_invalidate(dentry)) {
- dput(dentry);
- goto need_lookup;
- }
+ d_invalidate(dentry);
+ dput(dentry);
+ goto need_lookup;
}
path->mnt = mnt;
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index a8a2f7f86c45..a2b2b6d17c1f 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -464,8 +464,7 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
nfs_setsecurity(dentry->d_inode, entry->fattr, entry->label);
goto out;
} else {
- if (d_invalidate(dentry) != 0)
- goto out;
+ d_invalidate(dentry);
dput(dentry);
}
}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index ae77222c3e86..95ddb0bdadb0 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -250,7 +250,7 @@ extern struct dentry * d_obtain_alias(struct inode *);
extern void shrink_dcache_sb(struct super_block *);
extern void shrink_dcache_parent(struct dentry *);
extern void shrink_dcache_for_umount(struct super_block *);
-extern int d_invalidate(struct dentry *);
+extern void d_invalidate(struct dentry *);
/* only used at mount-time */
extern struct dentry * d_make_root(struct inode *);
--
1.7.5.4
Now that d_invalidate always succeeds it is not longer necessary or
desirable to hard code d_drop calls into filesystem specific
d_revalidate implementations.
Remove the unnecessary d_drop calls and rely on d_invalidate
to drop the dentries. Using d_invalidate ensures that paths
to mount points will not be dropped.
Reviewed-by: Miklos Szeredi <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/ceph/dir.c | 1 -
fs/proc/base.c | 4 ----
fs/proc/fd.c | 2 --
3 files changed, 0 insertions(+), 7 deletions(-)
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 6da4df84ba30..ecd2402739e2 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1052,7 +1052,6 @@ static int ceph_d_revalidate(struct dentry *dentry, unsigned int flags)
ceph_dentry_lru_touch(dentry);
} else {
ceph_dir_clear_complete(dir);
- d_drop(dentry);
}
iput(dir);
return valid;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 51507065263b..aa62b37f1303 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1654,7 +1654,6 @@ int pid_revalidate(struct dentry *dentry, unsigned int flags)
put_task_struct(task);
return 1;
}
- d_drop(dentry);
return 0;
}
@@ -1791,9 +1790,6 @@ out:
put_task_struct(task);
out_notask:
- if (status <= 0)
- d_drop(dentry);
-
return status;
}
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 985ea881b5bc..7e4f04ccab31 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -127,8 +127,6 @@ static int tid_fd_revalidate(struct dentry *dentry, unsigned int flags)
}
put_task_struct(task);
}
-
- d_drop(dentry);
return 0;
}
--
1.7.5.4
Now that d_invalidate always succeeds and flushes mount points use
it in stead of a combination of shrink_dcache_parent and d_drop
in proc_flush_task_mnt. This removes the danger of a mount point
under /proc/<pid>/... becoming unreachable after the d_drop.
Reviewed-by: Miklos Szeredi <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/proc/base.c | 6 ++----
1 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index aa62b37f1303..40d97e576d32 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2700,8 +2700,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
/* no ->d_hash() rejects on procfs */
dentry = d_hash_and_lookup(mnt->mnt_root, &name);
if (dentry) {
- shrink_dcache_parent(dentry);
- d_drop(dentry);
+ d_invalidate(dentry);
dput(dentry);
}
@@ -2721,8 +2720,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
name.len = snprintf(buf, sizeof(buf), "%d", pid);
dentry = d_hash_and_lookup(dir, &name);
if (dentry) {
- shrink_dcache_parent(dentry);
- d_drop(dentry);
+ d_invalidate(dentry);
dput(dentry);
}
--
1.7.5.4
On Mon, Feb 24, 2014 at 04:01:29PM -0800, Eric W. Biederman wrote:
> Miklos Szeredi <[email protected]> writes:
>
> > On Sat, Feb 15, 2014 at 01:39:22PM -0800, Eric W. Biederman wrote:
> >>
> >> Now that d_invalidate is the only caller of check_submounts_and_drop,
> >> expand check_submounts_and_drop inline in d_invalidate.
> >>
> >> Signed-off-by: "Eric W. Biederman" <[email protected]>
> >> ---
> >> fs/dcache.c | 55 +++++++++++++++++++----------------------------
> >> include/linux/dcache.h | 1 -
> >> 2 files changed, 22 insertions(+), 34 deletions(-)
> >>
> >> diff --git a/fs/dcache.c b/fs/dcache.c
> >> index 27585b1dd6f1..5b41205cbf33 100644
> >> --- a/fs/dcache.c
> >> +++ b/fs/dcache.c
>
> >> -int check_submounts_and_drop(struct dentry *dentry)
> >> +int d_invalidate(struct dentry *dentry)
> >> {
> >> int ret = 0;
> >>
> >> + /*
> >> + * If it's already been dropped, return OK.
> >> + */
> >> + spin_lock(&dentry->d_lock);
> >> + if (d_unhashed(dentry)) {
> >> + spin_unlock(&dentry->d_lock);
> >> + return 0;
> >> + }
> >> + spin_unlock(&dentry->d_lock);
> >> +
> >> /* Negative dentries can be dropped without further checks */
> >> if (!dentry->d_inode) {
> >> d_drop(dentry);
> >
> >
> > You can optimize this by including the negative check within the above d_locked
> > region and calling __d_drop() instead.
>
> For this patch just moving the code and not changing it is the corret
> thing to do because it helps with review and understanding the code.
>
> There are two ways I could see going with optimizing the preamble.
> Simply dropping the d_lock from around the d_unhashed test as a pointer
> dereference should be atomic, and the test is racy against
> d_materialise_unique.
Could you explain? What's the race, and what are the consequences?
--b.
> (We don't always hold the parent
> directories inode mutex when d_invalidate is called). So the d_lock
> buys us very little. Alternatively we could move the work into the
> d_walk callbacks.
>
> That kind of optimization deserves it's own patch that can be reviewed
> independently.
>
> Eric
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
"J. Bruce Fields" <[email protected]> writes:
> On Mon, Feb 24, 2014 at 04:01:29PM -0800, Eric W. Biederman wrote:
>> Miklos Szeredi <[email protected]> writes:
>>
>> >
>> > You can optimize this by including the negative check within the above d_locked
>> > region and calling __d_drop() instead.
>>
>> For this patch just moving the code and not changing it is the corret
>> thing to do because it helps with review and understanding the code.
>>
>> There are two ways I could see going with optimizing the preamble.
>> Simply dropping the d_lock from around the d_unhashed test as a pointer
>> dereference should be atomic, and the test is racy against
>> d_materialise_unique.
>
> Could you explain? What's the race, and what are the consequences?
>> (We don't always hold the parent directories inode mutex when d_invalidate is called).
d_unhashed is not a permanent condition because of d_materialise_unique,
and d_splice_alias.
d_invalidate can be called on an unhashed dentry in one of two ways
(either d_revalidate dropped the dentry or another routine that drops
the dentry beat the current invocation of d_invalidate to the job).
There are 3 places d_revalidate is called.
Once on the rcu path with with the appropriate flag set.
Once without out the parent i_mutex held, just off of the rcu path,
on that path d_invalidate is when d_revalidate fails.
Once during lookup with the parent directory i_mutex held.
Because the parent direcories i_mutex is not always held accross
d_revalidate and the following d_invalidate it happens that d_invalidate
is not always an atomic operation.
At worst the race results in a dentry that is dropped when it could be
hashed, that we will resurrect next time someone attempts to look it
up and d_materialise_unique/d_splice_alias is called.
None of that really matters for optimizing d_invalidate, but it is part
of the background in which d_invalidate lives. All that is significant
in d_invalidate is knowing that d_materialise_unique, and possibly
d_splice_alias may run concurrently with d_invalidate. It is unlikely
and essentially harmless.
After my patchset (because I removed all of the d_drop's from
.d_revalidate) the only race that should remain is between two parallel
calls of d_invalidate. Which probably means we can remove the test for
d_unhashed altogether.
Right now I just want to make this first big step and make certain the
code is solid. After that optimization is easy.
Eric
On Tue, Feb 25, 2014 at 02:03:36PM -0800, Eric W. Biederman wrote:
> "J. Bruce Fields" <[email protected]> writes:
>
> > On Mon, Feb 24, 2014 at 04:01:29PM -0800, Eric W. Biederman wrote:
> >> Miklos Szeredi <[email protected]> writes:
> >>
> >> >
> >> > You can optimize this by including the negative check within the above d_locked
> >> > region and calling __d_drop() instead.
> >>
> >> For this patch just moving the code and not changing it is the corret
> >> thing to do because it helps with review and understanding the code.
> >>
> >> There are two ways I could see going with optimizing the preamble.
> >> Simply dropping the d_lock from around the d_unhashed test as a pointer
> >> dereference should be atomic, and the test is racy against
> >> d_materialise_unique.
> >
> > Could you explain? What's the race, and what are the consequences?
Actually I was just confused as to whether the above was "is racy" was
claiming the existance of some bug.
I believe I should have read the above as more like "the test is already
racy against d_materialise_unique, but it's a harmless race, and
dropping the d_lock wouldn't make it any worse".
> >> (We don't always hold the parent directories inode mutex when d_invalidate is called).
>
> d_unhashed is not a permanent condition because of d_materialise_unique,
> and d_splice_alias.
>
> d_invalidate can be called on an unhashed dentry in one of two ways
> (either d_revalidate dropped the dentry or another routine that drops
> the dentry beat the current invocation of d_invalidate to the job).
>
>
> There are 3 places d_revalidate is called.
>
> Once on the rcu path with with the appropriate flag set.
>
> Once without out the parent i_mutex held, just off of the rcu path,
> on that path d_invalidate is when d_revalidate fails.
>
> Once during lookup with the parent directory i_mutex held.
>
>
> Because the parent direcories i_mutex is not always held accross
> d_revalidate and the following d_invalidate it happens that d_invalidate
> is not always an atomic operation.
>
>
> At worst the race results in a dentry that is dropped when it could be
> hashed,
Because somebody not holding the i_mutex calls d_invalidate based on old
information and unhashes something that
d_materialise_unique/d_splice_alias just hashed?
> that we will resurrect next time someone attempts to look it
> up and d_materialise_unique/d_splice_alias is called.
OK.
> None of that really matters for optimizing d_invalidate, but it is part
> of the background in which d_invalidate lives. All that is significant
> in d_invalidate is knowing that d_materialise_unique, and possibly
> d_splice_alias may run concurrently with d_invalidate. It is unlikely
> and essentially harmless.
>
>
> After my patchset (because I removed all of the d_drop's from
> .d_revalidate) the only race that should remain is between two parallel
> calls of d_invalidate. Which probably means we can remove the test for
> d_unhashed altogether.
>
> Right now I just want to make this first big step and make certain the
> code is solid. After that optimization is easy.
Thanks for the explanation!
--b.
"J. Bruce Fields" <[email protected]> writes:
> On Tue, Feb 25, 2014 at 02:03:36PM -0800, Eric W. Biederman wrote:
>> "J. Bruce Fields" <[email protected]> writes:
>>
>> > On Mon, Feb 24, 2014 at 04:01:29PM -0800, Eric W. Biederman wrote:
>> >> Miklos Szeredi <[email protected]> writes:
>> >>
>> >> >
>> >> > You can optimize this by including the negative check within the above d_locked
>> >> > region and calling __d_drop() instead.
>> >>
>> >> For this patch just moving the code and not changing it is the corret
>> >> thing to do because it helps with review and understanding the code.
>> >>
>> >> There are two ways I could see going with optimizing the preamble.
>> >> Simply dropping the d_lock from around the d_unhashed test as a pointer
>> >> dereference should be atomic, and the test is racy against
>> >> d_materialise_unique.
>> >
>> > Could you explain? What's the race, and what are the consequences?
>
> Actually I was just confused as to whether the above was "is racy" was
> claiming the existance of some bug.
>
> I believe I should have read the above as more like "the test is already
> racy against d_materialise_unique, but it's a harmless race, and
> dropping the d_lock wouldn't make it any worse".
>
>> >> (We don't always hold the parent directories inode mutex when d_invalidate is called).
>>
>> d_unhashed is not a permanent condition because of d_materialise_unique,
>> and d_splice_alias.
>>
>> d_invalidate can be called on an unhashed dentry in one of two ways
>> (either d_revalidate dropped the dentry or another routine that drops
>> the dentry beat the current invocation of d_invalidate to the job).
>>
>>
>> There are 3 places d_revalidate is called.
>>
>> Once on the rcu path with with the appropriate flag set.
>>
>> Once without out the parent i_mutex held, just off of the rcu path,
>> on that path d_invalidate is when d_revalidate fails.
>>
>> Once during lookup with the parent directory i_mutex held.
>>
>>
>> Because the parent direcories i_mutex is not always held accross
>> d_revalidate and the following d_invalidate it happens that d_invalidate
>> is not always an atomic operation.
>>
>>
>> At worst the race results in a dentry that is dropped when it could be
>> hashed,
>
> Because somebody not holding the i_mutex calls d_invalidate based on old
> information and unhashes something that
> d_materialise_unique/d_splice_alias just hashed?
More likely today somebody not holding i_mutex and not in rcu context
calls d_revalidate. d_revalidate drops the dentry and before we
d_invalidate d_materialise_unique/d_splice_alias rehashes it.
After my changes it looks like it takes 3 processes two instances
of d_invalidate and a instance of d_materialise_unique/d_spliace_alias
to trigger this case.
In either case the window is very small and the outcome is effectively
harmless. So I don't see this as a problem.
>> that we will resurrect next time someone attempts to look it
>> up and d_materialise_unique/d_splice_alias is called.
>
> OK.
>
>> None of that really matters for optimizing d_invalidate, but it is part
>> of the background in which d_invalidate lives. All that is significant
>> in d_invalidate is knowing that d_materialise_unique, and possibly
>> d_splice_alias may run concurrently with d_invalidate. It is unlikely
>> and essentially harmless.
>>
>>
>> After my patchset (because I removed all of the d_drop's from
>> .d_revalidate) the only race that should remain is between two parallel
>> calls of d_invalidate. Which probably means we can remove the test for
>> d_unhashed altogether.
>>
>> Right now I just want to make this first big step and make certain the
>> code is solid. After that optimization is easy.
>
> Thanks for the explanation!
Welcome.
Eric
Linus,
Please pull the for-linus branch from the git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus
HEAD: 0d7d90f86f83f29a442b37c78172870f8ee28c58 proc: Update proc_flush_task_mnt to use d_invalidate
My apologies for sending this pull request late. I thought these
changes were going to come through Al's tree that doesn't look like it
is going to happen, and it is long past time for these changes to be
merged.
This set of changes has been reviewed and been sitting idle for the last
6 weeks. In that time the vfs has slightly shifted under me the new
version of rename and the mount hash list becoming a hlist. None of
those changes has caused changed the code in ways to invalidate these
changes, but small conflicts do result and I have attached my conflict
resolution at the end of this email in case it helps.
To recap these changes allow a file or a directory that is a mount point
in one mount namespace to be unlinked/rmdired elsewhere where it is not
a mount point (either a remote filesystem or another mount namespace).
As has been agreed during review semantics when only a single mount
namespace exists remain unchanged.
This removes a long standing need to lie to the vfs when a mount point
has been removed behind it's back. This also removes a DOS attack where
an unprivileged user could prevent root from renaming or deleting files
and directories by using them as mountpoints in another mount namespace.
This change also fixes a few cases where because we were not lying to
the vfs we could leak mount points.
When renaming or unlinking directory entries that are not mountpoints
no additional locks are taken so no performance differences can result,
and my benchmark reflected that.
To verify that nothing significant from the time this code was
written until now, I have performed a test merge. I successfully
performed an allyesconfig build (skipping a broken wireless driver in
staging) and tested to make certain that the code still functions
as expected.
Eric W. Biederman (12):
vfs: Document the effect of d_revalidate on d_find_alias
vfs: More precise tests in d_invalidate
vfs: Don't allow overwriting mounts in the current mount namespace
vfs: Keep a list of mounts on a mount point
vfs: factor out lookup_mountpoint from new_mountpoint
vfs: Add a function to lazily unmount all mounts from any dentry.
vfs: Lazily remove mounts on unlinked files and directories.
vfs: Remove unnecessary calls of check_submounts_and_drop
vfs: Merge check_submounts_and_drop and d_invalidate
vfs: Make d_invalidate return void
vfs: Remove d_drop calls from d_revalidate implementations
proc: Update proc_flush_task_mnt to use d_invalidate
fs/afs/dir.c | 5 --
fs/btrfs/ioctl.c | 5 +--
fs/ceph/dir.c | 1 -
fs/cifs/readdir.c | 6 +--
fs/dcache.c | 140 +++++++++++++++++-------------------------------
fs/fuse/dir.c | 7 +--
fs/gfs2/dentry.c | 3 -
fs/kernfs/dir.c | 11 ----
fs/mount.h | 20 +++++++
fs/namei.c | 28 ++++++----
fs/namespace.c | 87 +++++++++++++++++++++++++++++-
fs/nfs/dir.c | 7 +--
fs/proc/base.c | 10 +---
fs/proc/fd.c | 2 -
include/linux/dcache.h | 3 +-
15 files changed, 178 insertions(+), 157 deletions(-)
Eric
---
My merge conflict resolution.
__d_move gained an argument I took it out of a conditional.
m_hash went from a struct list to a struct hlist changing nearby code,
and affecting my factoring out of lookup_mountpoint from new_mountpoint.
There was a major refactoring of rename. As long as d_mountpoint
becomes is_local_mounptoint and detach_mount is added after dont_mount
all is well.
dcache.c | 6 ++----
mount.h | 1 +
namei.c | 3 ++-
namespace.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++--
4 files changed, 51 insertions(+), 7 deletions(-)
diff --cc fs/dcache.c
index 66cba5a8a346,5b78bd98649c..d4a1e55d65a9
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@@ -2701,10 -2631,8 +2663,8 @@@ static struct dentry *__d_unalias(struc
goto out_err;
m2 = &alias->d_parent->d_inode->i_mutex;
out_unalias:
- if (likely(!d_mountpoint(alias))) {
- __d_move(alias, dentry, false);
- ret = alias;
- }
- __d_move(alias, dentry);
++ __d_move(alias, dentry, false);
+ ret = alias;
out_err:
spin_unlock(&inode->i_lock);
if (m2)
diff --cc fs/mount.h
index b29e42f05f34,c5e717542bbc..aa3c0aa473df
--- a/fs/mount.h
+++ b/fs/mount.h
@@@ -19,8 -19,9 +19,9 @@@ struct mnt_pcp
};
struct mountpoint {
- struct list_head m_hash;
+ struct hlist_node m_hash;
struct dentry *m_dentry;
+ struct list_head m_list;
int m_count;
};
diff --cc fs/namei.c
index 88339f59efb5,384fcc6a5606..0e438b395e28
--- a/fs/namei.c
+++ b/fs/namei.c
@@@ -4082,33 -4045,17 +4085,33 @@@ int vfs_rename(struct inode *old_dir, s
if (error)
return error;
+ old_name = fsnotify_oldname_init(old_dentry->d_name.name);
dget(new_dentry);
- lock_two_nondirectories(source, target);
+ if (!is_dir || (flags & RENAME_EXCHANGE))
+ lock_two_nondirectories(source, target);
+ else if (target)
+ mutex_lock(&target->i_mutex);
error = -EBUSY;
- if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
+ if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
goto out;
- error = try_break_deleg(source, delegated_inode);
- if (error)
- goto out;
- if (target) {
+ if (max_links && new_dir != old_dir) {
+ error = -EMLINK;
+ if (is_dir && !new_is_dir && new_dir->i_nlink >= max_links)
+ goto out;
+ if ((flags & RENAME_EXCHANGE) && !is_dir && new_is_dir &&
+ old_dir->i_nlink >= max_links)
+ goto out;
+ }
+ if (is_dir && !(flags & RENAME_EXCHANGE) && target)
+ shrink_dcache_parent(new_dentry);
+ if (!is_dir) {
+ error = try_break_deleg(source, delegated_inode);
+ if (error)
+ goto out;
+ }
+ if (target && !new_is_dir) {
error = try_break_deleg(target, delegated_inode);
if (error)
goto out;
@@@ -4123,31 -4064,73 +4126,32 @@@
if (error)
goto out;
- if (target) {
+ if (!(flags & RENAME_EXCHANGE) && target) {
+ if (is_dir)
+ target->i_flags |= S_DEAD;
dont_mount(new_dentry);
+ detach_mounts(new_dentry);
}
- if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE))
- d_move(old_dentry, new_dentry);
+ if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE)) {
+ if (!(flags & RENAME_EXCHANGE))
+ d_move(old_dentry, new_dentry);
+ else
+ d_exchange(old_dentry, new_dentry);
+ }
out:
- unlock_two_nondirectories(source, target);
+ if (!is_dir || (flags & RENAME_EXCHANGE))
+ unlock_two_nondirectories(source, target);
+ else if (target)
+ mutex_unlock(&target->i_mutex);
dput(new_dentry);
- return error;
-}
-
-/**
- * vfs_rename - rename a filesystem object
- * @old_dir: parent of source
- * @old_dentry: source
- * @new_dir: parent of destination
- * @new_dentry: destination
- * @delegated_inode: returns an inode needing a delegation break
- *
- * The caller must hold multiple mutexes--see lock_rename()).
- *
- * If vfs_rename discovers a delegation in need of breaking at either
- * the source or destination, it will return -EWOULDBLOCK and return a
- * reference to the inode in delegated_inode. The caller should then
- * break the delegation and retry. Because breaking a delegation may
- * take a long time, the caller should drop all locks before doing
- * so.
- *
- * Alternatively, a caller may pass NULL for delegated_inode. This may
- * be appropriate for callers that expect the underlying filesystem not
- * to be NFS exported.
- */
-int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
- struct inode *new_dir, struct dentry *new_dentry,
- struct inode **delegated_inode)
-{
- int error;
- int is_dir = d_is_directory(old_dentry) || d_is_autodir(old_dentry);
- const unsigned char *old_name;
-
- if (old_dentry->d_inode == new_dentry->d_inode)
- return 0;
-
- error = may_delete(old_dir, old_dentry, is_dir);
- if (error)
- return error;
-
- if (!new_dentry->d_inode)
- error = may_create(new_dir, new_dentry);
- else
- error = may_delete(new_dir, new_dentry, is_dir);
- if (error)
- return error;
-
- if (!old_dir->i_op->rename)
- return -EPERM;
-
- old_name = fsnotify_oldname_init(old_dentry->d_name.name);
-
- if (is_dir)
- error = vfs_rename_dir(old_dir,old_dentry,new_dir,new_dentry);
- else
- error = vfs_rename_other(old_dir,old_dentry,new_dir,new_dentry,delegated_inode);
- if (!error)
+ if (!error) {
fsnotify_move(old_dir, new_dir, old_name, is_dir,
- new_dentry->d_inode, old_dentry);
+ !(flags & RENAME_EXCHANGE) ? target : NULL, old_dentry);
+ if (flags & RENAME_EXCHANGE) {
+ fsnotify_move(new_dir, old_dir, old_dentry->d_name.name,
+ new_is_dir, NULL, new_dentry);
+ }
+ }
fsnotify_oldname_free(old_name);
return error;
diff --cc fs/namespace.c
index 2ffc5a2905d4,52f4174e294c..c809205f30df
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@@ -665,13 -632,47 +666,47 @@@ struct vfsmount *lookup_mnt(struct pat
return m;
}
- static struct mountpoint *new_mountpoint(struct dentry *dentry)
+ /*
+ * __is_local_mountpoint - Test to see if dentry is a mountpoint in the
+ * current mount namespace.
+ *
+ * The common case is dentries are not mountpoints at all and that
+ * test is handled inline. For the slow case when we are actually
+ * dealing with a mountpoint of some kind, walk through all of the
+ * mounts in the current mount namespace and test to see if the dentry
+ * is a mountpoint.
+ *
+ * The mount_hashtable is not usable in the context because we
+ * need to identify all mounts that may be in the current mount
+ * namespace not just a mount that happens to have some specified
+ * parent mount.
+ */
+ bool __is_local_mountpoint(struct dentry *dentry)
+ {
+ struct mnt_namespace *ns = current->nsproxy->mnt_ns;
+ struct mount *mnt;
+ bool is_covered = false;
+
+ if (!d_mountpoint(dentry))
+ goto out;
+
+ down_read(&namespace_sem);
+ list_for_each_entry(mnt, &ns->list, mnt_list) {
+ is_covered = (mnt->mnt_mountpoint == dentry);
+ if (is_covered)
+ break;
+ }
+ up_read(&namespace_sem);
+ out:
+ return is_covered;
+ }
+
+ static struct mountpoint *lookup_mountpoint(struct dentry *dentry)
{
- struct list_head *chain = mountpoint_hashtable + hash(NULL, dentry);
+ struct hlist_head *chain = mp_hash(dentry);
struct mountpoint *mp;
- int ret;
- list_for_each_entry(mp, chain, m_hash) {
+ hlist_for_each_entry(mp, chain, m_hash) {
if (mp->m_dentry == dentry) {
/* might be worth a WARN_ON() */
if (d_unlinked(dentry))
@@@ -680,6 -681,14 +715,14 @@@
return mp;
}
}
+ return NULL;
+ }
+
+ static struct mountpoint *new_mountpoint(struct dentry *dentry)
+ {
- struct list_head *chain = mountpoint_hashtable + hash(NULL, dentry);
++ struct hlist_head *chain = mp_hash(dentry);
+ struct mountpoint *mp;
+ int ret;
mp = kmalloc(sizeof(struct mountpoint), GFP_KERNEL);
if (!mp)
@@@ -693,7 -702,8 +736,8 @@@
mp->m_dentry = dentry;
mp->m_count = 1;
- list_add(&mp->m_hash, chain);
+ hlist_add_head(&mp->m_hash, chain);
+ INIT_LIST_HEAD(&mp->m_list);
return mp;
}
@@@ -746,7 -757,8 +791,8 @@@ static void detach_mnt(struct mount *mn
mnt->mnt_parent = mnt;
mnt->mnt_mountpoint = mnt->mnt.mnt_root;
list_del_init(&mnt->mnt_child);
- list_del_init(&mnt->mnt_hash);
+ hlist_del_init_rcu(&mnt->mnt_hash);
+ list_del_init(&mnt->mnt_mp_list);
put_mountpoint(mnt->mnt_mp);
mnt->mnt_mp = NULL;
}
On Tue, Apr 08, 2014 at 05:21:32PM -0700, Eric W. Biederman wrote:
> This set of changes has been reviewed and been sitting idle for the last
> 6 weeks. In that time the vfs has slightly shifted under me the new
> version of rename and the mount hash list becoming a hlist. None of
> those changes has caused changed the code in ways to invalidate these
> changes, but small conflicts do result and I have attached my conflict
> resolution at the end of this email in case it helps.
>
> To recap these changes allow a file or a directory that is a mount point
> in one mount namespace to be unlinked/rmdired elsewhere where it is not
> a mount point (either a remote filesystem or another mount namespace).
> As has been agreed during review semantics when only a single mount
> namespace exists remain unchanged.
>
> This removes a long standing need to lie to the vfs when a mount point
> has been removed behind it's back. This also removes a DOS attack where
> an unprivileged user could prevent root from renaming or deleting files
> and directories by using them as mountpoints in another mount namespace.
>
> This change also fixes a few cases where because we were not lying to
> the vfs we could leak mount points.
>
> When renaming or unlinking directory entries that are not mountpoints
> no additional locks are taken so no performance differences can result,
> and my benchmark reflected that.
It also means that d_invalidate() now might trigger fs shutdown. Which
has bloody huge stack footprint, for obvious reasons. And d_invalidate()
can be called with pretty deep stack - walk into wrong dentry while
resolving a deeply nested symlink and there you go...
On Wed, Apr 09, 2014 at 03:30:27AM +0100, Al Viro wrote:
> > When renaming or unlinking directory entries that are not mountpoints
> > no additional locks are taken so no performance differences can result,
> > and my benchmark reflected that.
>
> It also means that d_invalidate() now might trigger fs shutdown. Which
> has bloody huge stack footprint, for obvious reasons. And d_invalidate()
> can be called with pretty deep stack - walk into wrong dentry while
> resolving a deeply nested symlink and there you go...
PS: I thought I actually replied with that point back a month or so ago,
but having checked sent-mail... Looks like I had not. My deep apologies.
FWIW, I think that overall this thing is a good idea, provided that we can
live with semantics changes. The implementation is too optimistic, though -
at the very least, we want this work done upon namespace_unlock() held
back until we are not too deep in stack. task_work_add() fodder, perhaps?
Al Viro <[email protected]> writes:
> On Wed, Apr 09, 2014 at 03:30:27AM +0100, Al Viro wrote:
>
>> > When renaming or unlinking directory entries that are not mountpoints
>> > no additional locks are taken so no performance differences can result,
>> > and my benchmark reflected that.
>>
>> It also means that d_invalidate() now might trigger fs shutdown. Which
>> has bloody huge stack footprint, for obvious reasons. And d_invalidate()
>> can be called with pretty deep stack - walk into wrong dentry while
>> resolving a deeply nested symlink and there you go...
>
> PS: I thought I actually replied with that point back a month or so ago,
> but having checked sent-mail... Looks like I had not. My deep
> apologies.
If I had been aware of the concern I would have addressed this a month
ago. Oh well that is water under the bridge now.
> FWIW, I think that overall this thing is a good idea, provided that we can
> live with semantics changes. The implementation is too optimistic, though -
> at the very least, we want this work done upon namespace_unlock() held
> back until we are not too deep in stack. task_work_add() fodder, perhaps?
Given where we are at in the merge window I need to make the argument
that it makes sense to pull this code now and address the deep
stack concerns before 3.15-final.
As I understand it your concern with d_invalidate is that there will be
a remote filesystem like nfs, and on a directory of that filesystem we
have mounted some other filesystem. Someone on the nfs server deletes
the directory that is our mountpint. Sometime later we traverse the
path that leads to the mountpoint in question and call d_invalidate.
The d_invalidate call drops performs a lazy unmount drops the last
filesystem reference and the filesystem umount potentially overflows
the kernel stack.
I would like to argue that because it is stupid to mount something on a
directory that someone else can delete this is not a particularly
likely scenario. Unfortunately an evil user can mount filesystems
tagged with FS_USERNS_MOUNT and with care can trigger this at will. :(
So if we can overflow the stack this is problem that must be fixed.
The good news is that because it stupid to have a setup where someone
else can delete your mountpoint this won't really be a problem except in
setups with evil users which means merging these changes now, and fixing
this final issue in a few days when a tested patch is available should
not cause any problems.
Looking at the code there are two points where we could make this lazy,
walking the unmounted list in namespace unlock, and simply calling
detach_mount (As dropping a dentry and holid a reference is perfectly
legitimate). I don't see any fundamental difficulties just some
painstaking detail work, to make certain the resulting implemention is
correct.
Linus would you please merge my branch. In it's current state the code
fixes a lot of semantic and implementation bugs. With only the
potential issue of overflowing that stack on a code path that most
people should never trigger. Having the code merged for 3.15 will free
me to focus on the technical stack depth concerns and allow me to stop
worry about all of the other potential issues the patchset touches on.
Eric
p.s. Now I am off to sleep before I propose a patch to deal with the
potentail of very deep stacks. I think I can reuse mnt_rcu in struct
mount for the work struct to pass to task_work_add but I need to look
closer at all of the details involved.
Al Viro <[email protected]> writes:
> On Wed, Apr 09, 2014 at 03:30:27AM +0100, Al Viro wrote:
>
>> > When renaming or unlinking directory entries that are not mountpoints
>> > no additional locks are taken so no performance differences can result,
>> > and my benchmark reflected that.
>>
>> It also means that d_invalidate() now might trigger fs shutdown. Which
>> has bloody huge stack footprint, for obvious reasons. And d_invalidate()
>> can be called with pretty deep stack - walk into wrong dentry while
>> resolving a deeply nested symlink and there you go...
>
> PS: I thought I actually replied with that point back a month or so ago,
> but having checked sent-mail... Looks like I had not. My deep apologies.
>
> FWIW, I think that overall this thing is a good idea, provided that we can
> live with semantics changes. The implementation is too optimistic, though -
> at the very least, we want this work done upon namespace_unlock() held
> back until we are not too deep in stack. task_work_add() fodder,
> perhaps?
Hmm.
Just to confirm what I am dealing with I have proceeded to measure the
amount of stack used by these operations.
For resolving a deeply nested symlink that hits the limit of 8 nested
symlinks, I find 4688 bytes left on the stack. Which means we use
roughly 3504 bytes of stack when stating a deeply nested symlink.
For umount I had a little trouble measuring as typically the work done
by umount was not the largest stack consumer, but I found for a small
ext4 filesystem after the umount operation was complete there were
5152 bytes left on the stack, or umount used roughly 3040 bytes.
3504 + 3040 = 6544 bytes of stack used or 1684 bytes of stack left
unused. Which certainly isn't a lot of margin but it is not overflowing
the kernel stack either.
Is there a case that see where umount uses a lot more kernel stack? Is
your concern an architecture other than x86_64 with different
limitations?
I am quite happy to change my code to avoid stack overflow but I want to
make certain I understand where the stack usage is coming from so that I
actually fix them issue.
Eric
On Wed, Apr 09, 2014 at 10:32:14AM -0700, Eric W. Biederman wrote:
> For resolving a deeply nested symlink that hits the limit of 8 nested
> symlinks, I find 4688 bytes left on the stack. Which means we use
> roughly 3504 bytes of stack when stating a deeply nested symlink.
>
> For umount I had a little trouble measuring as typically the work done
> by umount was not the largest stack consumer, but I found for a small
> ext4 filesystem after the umount operation was complete there were
> 5152 bytes left on the stack, or umount used roughly 3040 bytes.
A bit less - we have a non-empty stack footprint from sys_umount() itself.
> 3504 + 3040 = 6544 bytes of stack used or 1684 bytes of stack left
> unused. Which certainly isn't a lot of margin but it is not overflowing
> the kernel stack either.
>
> Is there a case that see where umount uses a lot more kernel stack? Is
> your concern an architecture other than x86_64 with different
> limitations?
For starters, put that ext4 on top of dm-raid or dm-multipath. That alone
will very likely push you over the top.
Keep in mind, BTW, that you do not have full 8K to play with - there's
struct thread_info that should not be stepped upon. Not particulary large
(IIRC, restart_block is the largest piece in amd64 one), but it eats about
100 bytes.
I'd probably use renameat(2) in testing - i.e. trigger the shite when
resolving a deeply nested symlink in renameat() arguments. That brings
extra struct nameidata into the game, i.e. extra 152 bytes chewed off the
stack.
On Wed, Apr 09, 2014 at 06:53:23PM +0100, Al Viro wrote:
> For starters, put that ext4 on top of dm-raid or dm-multipath. That alone
> will very likely push you over the top.
>
> Keep in mind, BTW, that you do not have full 8K to play with - there's
> struct thread_info that should not be stepped upon. Not particulary large
> (IIRC, restart_block is the largest piece in amd64 one), but it eats about
> 100 bytes.
>
> I'd probably use renameat(2) in testing - i.e. trigger the shite when
> resolving a deeply nested symlink in renameat() arguments. That brings
> extra struct nameidata into the game, i.e. extra 152 bytes chewed off the
> stack.
Come to think of that, some extra nastiness could be had by mixing it with
execve(). You can have up to 4 levels of #! resolution there, each eating
up at least 128 bytes (more, actually). Compiler _might_ turn that
tail call of search_binary_handler() into a jump, but it's not guaranteed
at all.
FWIW, it probably makes sense to turn load_script() into
static int load_script(struct linux_binprm *bprm)
{
int err = __load_script(bprm);
if (err)
return err;
return search_binary_handler(bprm);
}
regardless of that issue; we don't need interp[] after the call of
open_exec(), so it makes sense to reduce the footprint in mutual
recursion loop.
For extra pain, consider s/ext4/xfs/, possibly with iscsi thrown under the
bus^Wdm-multipath.
The thing is, we are already too close to stack overflow limit. Adding
several kilobytes more is not survivable, and since you are taking
somebody in a userns DoSing the system into consideration, you can't
say "it takes malicious root to set up, so it's not serious" - the
DoS you mentioned requires the same thing...
BTW, another thing to test would be this:
mount nfs on /mnt
mount a filesystem on /mnt/path that can be invalidated
cd to /mnt/path/foo
bind /mnt on /mnt/path/foo/bar
shoot /mnt/path (on server)
stat bar/path/foo
That should rip the fs you are in out of the tree; it should work, but
it's definitely a case worth testing.
On Wed, Apr 09, 2014 at 10:32:14AM -0700, Eric W. Biederman wrote:
> Al Viro <[email protected]> writes:
>
> > On Wed, Apr 09, 2014 at 03:30:27AM +0100, Al Viro wrote:
> >
> >> > When renaming or unlinking directory entries that are not mountpoints
> >> > no additional locks are taken so no performance differences can result,
> >> > and my benchmark reflected that.
> >>
> >> It also means that d_invalidate() now might trigger fs shutdown. Which
> >> has bloody huge stack footprint, for obvious reasons. And d_invalidate()
> >> can be called with pretty deep stack - walk into wrong dentry while
> >> resolving a deeply nested symlink and there you go...
> >
> > PS: I thought I actually replied with that point back a month or so ago,
> > but having checked sent-mail... Looks like I had not. My deep apologies.
> >
> > FWIW, I think that overall this thing is a good idea, provided that we can
> > live with semantics changes. The implementation is too optimistic, though -
> > at the very least, we want this work done upon namespace_unlock() held
> > back until we are not too deep in stack. task_work_add() fodder,
> > perhaps?
>
> Hmm.
>
> Just to confirm what I am dealing with I have proceeded to measure the
> amount of stack used by these operations.
>
> For resolving a deeply nested symlink that hits the limit of 8 nested
> symlinks, I find 4688 bytes left on the stack. Which means we use
> roughly 3504 bytes of stack when stating a deeply nested symlink.
>
> For umount I had a little trouble measuring as typically the work done
> by umount was not the largest stack consumer, but I found for a small
> ext4 filesystem after the umount operation was complete there were
> 5152 bytes left on the stack, or umount used roughly 3040 bytes.
Try XFS, or make sure that the unmount path that you measure does
something that requires memory allocation and triggers memory
reclaim.
> 3504 + 3040 = 6544 bytes of stack used or 1684 bytes of stack left
> unused. Which certainly isn't a lot of margin but it is not overflowing
> the kernel stack either.
>
> Is there a case that see where umount uses a lot more kernel stack? Is
> your concern an architecture other than x86_64 with different
> limitations?
Anything that enters the block layer IO path can consume upwards of
4-5K of stack because memory allocation occurs right at the bottom
of the IO stack and memory allocation is extremely stack heavy
(think 2.5-3k of stack for a typical GFP_NOIO context allocation
when there is no memory available).
Even scheduling requires you have around 1.5k of stack space
available for the scheduler to do it's stuff so at 1684 bytes of
stack left you're borderline for triggering stack overflow issues if
there's a sleeping lock at that deep leaf function...
Cheers,
Dave.
--
Dave Chinner
[email protected]
Al Viro <[email protected]> writes:
> On Wed, Apr 09, 2014 at 06:53:23PM +0100, Al Viro wrote:
>
>> For starters, put that ext4 on top of dm-raid or dm-multipath. That alone
>> will very likely push you over the top.
>>
>> Keep in mind, BTW, that you do not have full 8K to play with - there's
>> struct thread_info that should not be stepped upon. Not particulary large
>> (IIRC, restart_block is the largest piece in amd64 one), but it eats about
>> 100 bytes.
>>
>> I'd probably use renameat(2) in testing - i.e. trigger the shite when
>> resolving a deeply nested symlink in renameat() arguments. That brings
>> extra struct nameidata into the game, i.e. extra 152 bytes chewed off the
>> stack.
>
> Come to think of that, some extra nastiness could be had by mixing it with
> execve(). You can have up to 4 levels of #! resolution there, each eating
> up at least 128 bytes (more, actually). Compiler _might_ turn that
> tail call of search_binary_handler() into a jump, but it's not guaranteed
> at all.
>
> FWIW, it probably makes sense to turn load_script() into
> static int load_script(struct linux_binprm *bprm)
> {
> int err = __load_script(bprm);
> if (err)
> return err;
> return search_binary_handler(bprm);
> }
>
> regardless of that issue; we don't need interp[] after the call of
> open_exec(), so it makes sense to reduce the footprint in mutual
> recursion loop.
>
> For extra pain, consider s/ext4/xfs/, possibly with iscsi thrown under the
> bus^Wdm-multipath.
>
> The thing is, we are already too close to stack overflow limit. Adding
> several kilobytes more is not survivable, and since you are taking
> somebody in a userns DoSing the system into consideration, you can't
> say "it takes malicious root to set up, so it's not serious" - the
> DoS you mentioned requires the same thing...
Thank you for the comments this makes it clear that the problem is with
mntput (and the filesystem I/O that can be triggered) not particularly
with detach_mounts. As I read the code all of these nasty cases we are
concern with today can be triggered with pathput/mntput already with an
appropriate race against umount, which means my detach_mounts code
doesn't introduce a stack space usage regression, but seems to be the
messenger that we have such problems in the VFS.
There is a also a big difference between what can be triggered using
filesystems we allow unprivileged users to mount, devpts, proc, ramfs,
sysfs, mqueue, shmem (none of which have backing store) and filesystems
with a long I/O path.
So it looks like it still requires global root to trigger this, although
I still think it is serious.
One of the more interesting aspects of this userns work is running into
code that semantically should be safe for unprivileged users to use but
as we haven't historically allowed unprivileged users to use the code
there are silly assumptions and untested code paths.
> BTW, another thing to test would be this:
> mount nfs on /mnt
> mount a filesystem on /mnt/path that can be invalidated
> cd to /mnt/path/foo
> bind /mnt on /mnt/path/foo/bar
> shoot /mnt/path (on server)
> stat bar/path/foo
> That should rip the fs you are in out of the tree; it should work, but
> it's definitely a case worth testing.
Agreed that is a case worth test. I wasn't looking at that case in
particular as that is not the worst case stack usage or even an
approximation of it.
Eric
mntput as part of pathput is called from all over the vfs sometimes as
in the case of symlink chasing from some rather deep call chains.
During filesystem unmount with the right set of races those innocuous
little mntput calls that take very little stack space can become calls
become mosters calling deactivate_super that can take up 3k or more of
stack space as syncrhonous filesystem I/O is performed, through
multiple levels of the I/O stack.
Avoid deactivate_super being called from a deep stack by converting
mntput to use task_work_add when the mnt_count goes to 0. The
filesystem is still unmounted synchronously preserving the semantics
that system calls like umount require.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
This patch has only seen light testing so far but it emperically it
appears to solve the stack depth problem. A simple umount of ext4 went
from having 5162 stack bytes untouched to having 5568 stack bytes
untouched. Freeing up at least 416 bytes of stack in that simple case.
fs/mount.h | 2 +-
fs/namespace.c | 24 ++++++++++++++++++------
2 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/fs/mount.h b/fs/mount.h
index aa3c0aa473df..4e78ca90467f 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -30,7 +30,7 @@ struct mount {
struct mount *mnt_parent;
struct dentry *mnt_mountpoint;
struct vfsmount mnt;
- struct rcu_head mnt_rcu;
+ struct callback_head mnt_callback;
#ifdef CONFIG_SMP
struct mnt_pcp __percpu *mnt_pcp;
#else
diff --git a/fs/namespace.c b/fs/namespace.c
index c809205f30df..686afe9942bc 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -24,6 +24,7 @@
#include <linux/proc_ns.h>
#include <linux/magic.h>
#include <linux/bootmem.h>
+#include <linux/task_work.h>
#include "pnode.h"
#include "internal.h"
@@ -981,7 +982,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
static void delayed_free(struct rcu_head *head)
{
- struct mount *mnt = container_of(head, struct mount, mnt_rcu);
+ struct mount *mnt = container_of(head, struct mount, mnt_callback);
kfree(mnt->mnt_devname);
#ifdef CONFIG_SMP
free_percpu(mnt->mnt_pcp);
@@ -989,6 +990,17 @@ static void delayed_free(struct rcu_head *head)
kmem_cache_free(mnt_cache, mnt);
}
+static void mntput_delayed(struct callback_head *head)
+{
+ struct mount *mnt = container_of(head, struct mount, mnt_callback);
+
+ fsnotify_vfsmount_delete(&mnt->mnt);
+ dput(mnt->mnt.mnt_root);
+ deactivate_super(mnt->mnt.mnt_sb);
+ mnt_free_id(mnt);
+ call_rcu(&mnt->mnt_callback, delayed_free);
+}
+
static void mntput_no_expire(struct mount *mnt)
{
put_again:
@@ -1034,11 +1046,11 @@ put_again:
* so mnt_get_writers() below is safe.
*/
WARN_ON(mnt_get_writers(mnt));
- fsnotify_vfsmount_delete(&mnt->mnt);
- dput(mnt->mnt.mnt_root);
- deactivate_super(mnt->mnt.mnt_sb);
- mnt_free_id(mnt);
- call_rcu(&mnt->mnt_rcu, delayed_free);
+ /* The stack may be deep here so perform this where the stack
+ * is guaranteed to be shallow.
+ */
+ init_task_work(&mnt->mnt_callback, mntput_delayed);
+ WARN_ON(task_work_add(current, &mnt->mnt_callback, true) != 0);
}
void mntput(struct vfsmount *mnt)
--
1.9.1
On Wed, Apr 09, 2014 at 03:58:25PM -0700, Eric W. Biederman wrote:
>
> mntput as part of pathput is called from all over the vfs sometimes as
> in the case of symlink chasing from some rather deep call chains.
> During filesystem unmount with the right set of races those innocuous
> little mntput calls that take very little stack space can become calls
> become mosters calling deactivate_super that can take up 3k or more of
> stack space as syncrhonous filesystem I/O is performed, through
> multiple levels of the I/O stack.
>
> Avoid deactivate_super being called from a deep stack by converting
> mntput to use task_work_add when the mnt_count goes to 0. The
> filesystem is still unmounted synchronously preserving the semantics
> that system calls like umount require.
Careful. For one thing, you've just introduced a massive leak in knfsd
and any other kernel thread that might do mntput(). task_work_add()
makes no sense there - there is no userland to return to. For another,
in things like cleanup of failing modprobe we might end up delaying fs
shutdown too much. So it's not that simple, unfortunately.
I agree that fs shutdown is better dealt with on mostly empty stack, of
course - moreover, done right that has a potential to make mntput()
safe in atomic contexts (there's also acct_auto_close_mnt() to deal
with; that might take some work to get right, but I think it's not
fatal).
Al Viro <[email protected]> writes:
> On Wed, Apr 09, 2014 at 03:58:25PM -0700, Eric W. Biederman wrote:
>>
>> mntput as part of pathput is called from all over the vfs sometimes as
>> in the case of symlink chasing from some rather deep call chains.
>> During filesystem unmount with the right set of races those innocuous
>> little mntput calls that take very little stack space can become calls
>> become mosters calling deactivate_super that can take up 3k or more of
>> stack space as syncrhonous filesystem I/O is performed, through
>> multiple levels of the I/O stack.
>>
>> Avoid deactivate_super being called from a deep stack by converting
>> mntput to use task_work_add when the mnt_count goes to 0. The
>> filesystem is still unmounted synchronously preserving the semantics
>> that system calls like umount require.
>
> Careful. For one thing, you've just introduced a massive leak in knfsd
> and any other kernel thread that might do mntput(). task_work_add()
> makes no sense there - there is no userland to return to. For another,
> in things like cleanup of failing modprobe we might end up delaying fs
> shutdown too much. So it's not that simple, unfortunately.
Unfortunately.
> I agree that fs shutdown is better dealt with on mostly empty stack, of
> course - moreover, done right that has a potential to make mntput()
> safe in atomic contexts (there's also acct_auto_close_mnt() to deal
> with; that might take some work to get right, but I think it's not
> fatal).
I am slowly digging into this.
With this patch I was was able to do an A/B comparison of what the stack
cost on my unmounting my minimal ext4 filesystem from d_invalidate
called with in a context with maximum symlink recursion depth, without
and without a changed mntput.
I used sysfs instead of nfs to mount my minimal ext4 filesystem on as I
was a lazy bum and didn't have a nfs server setup handy.
With just my detach_mounts branched merged into 3.15-rc0
I saw 4880 stack bytes left before calling detach_mounts from d_invalidate
I saw 3904 stack bytes left after calling detach_mounts from d_invalidate
Which means in practice unmounting my mininal ext4 filesystem image only
used 976 additional bytes of stack.
With the same kernel plus my change to mntput
I saw 4880 stack bytes left before calling detach_mounts from d_invalidate
I saw 4880 stack bytes left after calling detach_mounts from d_invalidate
Which at least confirms that a change to mntput is enough to make deep
stacks safe.
With 3904 bytes of headroom from ext4 I may have to measure some of the
nastier cases just to be certain there actually is a problem here.
Eric
Al Viro <[email protected]> writes:
> On Wed, Apr 09, 2014 at 03:58:25PM -0700, Eric W. Biederman wrote:
>>
>> mntput as part of pathput is called from all over the vfs sometimes as
>> in the case of symlink chasing from some rather deep call chains.
>> During filesystem unmount with the right set of races those innocuous
>> little mntput calls that take very little stack space can become calls
>> become mosters calling deactivate_super that can take up 3k or more of
>> stack space as syncrhonous filesystem I/O is performed, through
>> multiple levels of the I/O stack.
>>
>> Avoid deactivate_super being called from a deep stack by converting
>> mntput to use task_work_add when the mnt_count goes to 0. The
>> filesystem is still unmounted synchronously preserving the semantics
>> that system calls like umount require.
>
> Careful. For one thing, you've just introduced a massive leak in knfsd
> and any other kernel thread that might do mntput(). task_work_add()
> makes no sense there - there is no userland to return to. For another,
> in things like cleanup of failing modprobe we might end up delaying fs
> shutdown too much. So it's not that simple, unfortunately.
Can you explain which scenario you are thinking about with respect to a
failed modprobe?
I don't see any problems on the mount path, and copy_module_from_fd
already uses struct file which does this.
So I am curious to know which scenario you were thinking of. Outside of
mount (where we demand load the fs module) I can't think of a scenario
with modprobe where we would require immediate free semantics.
Unmout certainly needs to have the filesystem freed by the time we reach
userspace (in the normal case) but other than that I am drawing a blank.
> I agree that fs shutdown is better dealt with on mostly empty stack, of
> course - moreover, done right that has a potential to make mntput()
> safe in atomic contexts (there's also acct_auto_close_mnt() to deal
> with; that might take some work to get right, but I think it's not
> fatal).
I just about have a working version, and acct_auto_close_mnt wasn't
particularly difficult. Just one more thing that had to be done.
Eric
On Sat, Apr 12, 2014 at 03:15:39PM -0700, Eric W. Biederman wrote:
> Can you explain which scenario you are thinking about with respect to a
> failed modprobe?
Completely made up example:
static struct file_system_type foofs = {
.mount = mount_foo,
.kill_sb = kill_foo,
};
static struct vfsmount *mnt;
static __init int foo_init(void)
{
int err;
err = init_some();
if (err < 0)
return err;
mnt = kern_mount(&foofs);
if (IS_ERR(mnt)) {
uninit_some();
return PTR_ERR(mnt);
}
err = init_some_more();
if (err < 0) {
kern_umount(mnt);
uninit_some();
return err;
}
printk(KERN_INFO "loaded foo");
return 0;
}
Now, think what happens if init_some_more() in the above fails. With the
current mntput() semantics, everything works. After making mntput() (from
kern_umount()) delayed until the return to userland, we end up with attempt
to call kill_foo() after the memory where it code sits gets freed. For that
matter, by that point we are not even guaranteed to reach it, since it
comes as mnt->mnt_sb->s_type->kill_sb() and s_type points to freed memory.
I'm not saying that we have something that would closely resemble this
example, but it's not hard to vary it in a lot of ways, keeping the same
problem. Basically, you need to audit all paths leading from failure
exits in some module_init() to mntput() and figure out if delaying the
effect of that mntput() would be safe there (== doesn't get delayed past
the point where we destroy something needed for that fs shutdown).
It's not *that* horrible, since not too many modules out there are
declaring any fs types, but it needs to be done. In theory, you could
also fall prey to something like this:
type = get_fs_type("proc");
ns = kmalloc(...);
/* fill *ns */
mnt = kern_mount_data(type, p);
...
if (error) {
kern_unmount(mnt);
kfree(p);
put_filesystem(type);
}
possibly with get_fs_type() replaced with some other way to get that
pointer to fs type (defined elsewhere). E.g. for procfs it could
be, say, task_active_pid_ns(current)->proc_mnt->mnt_sb->s_type, etc.
Again, it's not impossible to audit (there's not a lot of places where
struct file_system_type * is ever stored, there are few instances of
struct file_system_type, all statically allocated, etc.), but it's
a non-trivial amount of work. And I honestly don't know if we have
any such places right now. Moreover, unless you feel like repeating
that kind of audit every merge window, we'll need a some way of dealing
with such situations. Something like flush_pending_mntput(fs_type), for
example, documented as barrier to be used in such places might do, but
if you can think of something more fool-proof...
On Sun, Apr 13, 2014 at 06:39:56AM +0100, Al Viro wrote:
> type = get_fs_type("proc");
> ns = kmalloc(...);
> /* fill *ns */
> mnt = kern_mount_data(type, p);
s/p/ns/
> ...
> if (error) {
> kern_unmount(mnt);
> kfree(p);
ditto.
> put_filesystem(type);
> }
BTW, miklos seems to have a buggered gmail setup - every posting to that
thread triggers a bounce that appears to come from some kind of spam filters
there...
Al Viro <[email protected]> writes:
> On Sat, Apr 12, 2014 at 03:15:39PM -0700, Eric W. Biederman wrote:
>
>> Can you explain which scenario you are thinking about with respect to a
>> failed modprobe?
>
> Completely made up example:
>
> static struct file_system_type foofs = {
> .mount = mount_foo,
> .kill_sb = kill_foo,
> };
>
> static struct vfsmount *mnt;
>
> static __init int foo_init(void)
> {
> int err;
> err = init_some();
> if (err < 0)
> return err;
> mnt = kern_mount(&foofs);
> if (IS_ERR(mnt)) {
> uninit_some();
> return PTR_ERR(mnt);
> }
> err = init_some_more();
> if (err < 0) {
> kern_umount(mnt);
> uninit_some();
> return err;
> }
> printk(KERN_INFO "loaded foo");
> return 0;
> }
>
> Now, think what happens if init_some_more() in the above fails. With the
> current mntput() semantics, everything works. After making mntput() (from
> kern_umount()) delayed until the return to userland, we end up with attempt
> to call kill_foo() after the memory where it code sits gets freed. For that
> matter, by that point we are not even guaranteed to reach it, since it
> comes as mnt->mnt_sb->s_type->kill_sb() and s_type points to freed memory.
>
> I'm not saying that we have something that would closely resemble this
> example, but it's not hard to vary it in a lot of ways, keeping the same
> problem. Basically, you need to audit all paths leading from failure
> exits in some module_init() to mntput() and figure out if delaying the
> effect of that mntput() would be safe there (== doesn't get delayed past
> the point where we destroy something needed for that fs shutdown).
>
> It's not *that* horrible, since not too many modules out there are
> declaring any fs types, but it needs to be done. In theory, you could
> also fall prey to something like this:
> type = get_fs_type("proc");
> ns = kmalloc(...);
> /* fill *ns */
> mnt = kern_mount_data(type, p);
> ...
> if (error) {
> kern_unmount(mnt);
> kfree(p);
> put_filesystem(type);
> }
> possibly with get_fs_type() replaced with some other way to get that
> pointer to fs type (defined elsewhere). E.g. for procfs it could
> be, say, task_active_pid_ns(current)->proc_mnt->mnt_sb->s_type, etc.
>
> Again, it's not impossible to audit (there's not a lot of places where
> struct file_system_type * is ever stored, there are few instances of
> struct file_system_type, all statically allocated, etc.), but it's
> a non-trivial amount of work. And I honestly don't know if we have
> any such places right now. Moreover, unless you feel like repeating
> that kind of audit every merge window, we'll need a some way of dealing
> with such situations. Something like flush_pending_mntput(fs_type), for
> example, documented as barrier to be used in such places might do, but
> if you can think of something more fool-proof...
I performed a quick audit and I don't see that case happening in the
current code.
Further I believe we already have a pretty good protection against the
class of bug you are talking about.
deactivate_locked_super calls put_filesystem which calls module_put. So
we have a reference to the module the filesystem was loaded in so the
code and static data is not going away until the we free the super block
and decrement the module reference count.
So while I think it is possible for kill_sb, or put_super to reference
so global module data that we have cleaned up I think it is a pretty
unlikely situation.
kern_mount or kern_mount_data seems to be the very last thing modules
that reference it perform. I will take a closer look before I send the
final version of this but I think you are worrying about something that
just doesn't happen in practice and isn't likely too.
Eric
[email protected] (Eric W. Biederman) writes:
> Al Viro <[email protected]> writes:
>
>> On Sat, Apr 12, 2014 at 03:15:39PM -0700, Eric W. Biederman wrote:
>>
>>> Can you explain which scenario you are thinking about with respect to a
>>> failed modprobe?
>>
>> Again, it's not impossible to audit (there's not a lot of places where
>> struct file_system_type * is ever stored, there are few instances of
>> struct file_system_type, all statically allocated, etc.), but it's
>> a non-trivial amount of work. And I honestly don't know if we have
>> any such places right now. Moreover, unless you feel like repeating
>> that kind of audit every merge window, we'll need a some way of dealing
>> with such situations. Something like flush_pending_mntput(fs_type), for
>> example, documented as barrier to be used in such places might do, but
>> if you can think of something more fool-proof...
>
> I performed a quick audit and I don't see that case happening in the
> current code.
Sigh. I was wrong. Almost this exact case happens in
btrfs_init_test_fs, and of course I was silly when I thought the module
ref count would be useful for something before init_module succeeds.
Still I suspect I was on the right track. We do have the get_fs_type,
get_filesystem and put_filesystem. Which ought to be enough to allow
us to convert unregister_filesystem into an appropriate barrier.
Something to look at after I have slept.
Eric
On Sun, Apr 13, 2014 at 12:51:56AM -0700, Eric W. Biederman wrote:
> btrfs_init_test_fs, and of course I was silly when I thought the module
> ref count would be useful for something before init_module succeeds.
>
> Still I suspect I was on the right track. We do have the get_fs_type,
> get_filesystem and put_filesystem. Which ought to be enough to allow
> us to convert unregister_filesystem into an appropriate barrier.
Umm... I don't think so - register_filesystem and unregister_filesystem
are only about the fs being visible in fs type list. You do *not* have
to register the sucker at all in order to be able to do kern_mount().
So the call of unregister_filesystem() is not guaranteed to happen at
all - as the matter of fact, I think we ought to stop registering any
mount_pseudo()-based ones (pipefs, etc.) and the only reason for _not_
doing that is that some scripts might be poking in /proc/filesystems.
We definitely do not want that for anything *new* that isn't mountable
from userland...
BTW, could somebody explain this: in fs/ext4/super.c we have
#define IS_EXT2_SB(sb) ((sb)->s_bdev->bd_holder == &ext2_fs_type)
What's wrong with simple ((sb)->s_type == &ext2_fs_type)? What am
I missing there? Ted?
FWIW, I have found one bug of similar sort, but it's already
present with the current semantics: init_hugetlb_fs() has
for_each_hstate(h) {
char buf[50];
unsigned ps_kb = 1U << (h->order + PAGE_SHIFT - 10);
snprintf(buf, sizeof(buf), "pagesize=%uK", ps_kb);
hugetlbfs_vfsmount[i] = kern_mount_data(&hugetlbfs_fs_type,
buf);
if (IS_ERR(hugetlbfs_vfsmount[i])) {
pr_err("hugetlb: Cannot mount internal hugetlbfs for "
"page size %uK", ps_kb);
error = PTR_ERR(hugetlbfs_vfsmount[i]);
hugetlbfs_vfsmount[i] = NULL;
}
i++;
}
/* Non default hstates are optional */
if (!IS_ERR_OR_NULL(hugetlbfs_vfsmount[default_hstate_idx]))
return 0;
followed by
kmem_cache_destroy(hugetlbfs_inode_cachep);
If some of those kern_mount_data() succeed, but not the one we really
want to have, we'll end up with kmem_cache_destroy() on non-empy cache...
Granted, it's very unlikely to happen, but it's still a bug. BTW, that
IS_ERR_OR_NULL() there looks like cargo-culting - it's ERR_PTR() is
explicitly turned into NULL a few lines above...
Another thing: does anybody seriously expect the system to survive with
every pipe(2) failing (and oopsing, actually)? If not, we probably should
just make init_pipe_fs() panic on failure... The same goes for sock_init(),
IMO...
Remove the extra reference added to mnt_count by the mnt_pinned code,
and stop jumping to the top of mntput_no_expire.
When fput synchronously called mntput that extra reference count and
the loop ensured that after acct_auto_close_mnt closed all of the
pinned references mntput_no_expire would then proceed actually
unmount the filesystem.
As fput is asynchronous today there is no point in looping.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/namespace.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 9d0eb6b4e7d5..2abd5a2416d0 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -985,7 +985,6 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
static void mntput_no_expire(struct mount *mnt)
{
-put_again:
rcu_read_lock();
mnt_add_count(mnt, -1);
if (likely(mnt->mnt_ns)) { /* shouldn't be the last one */
@@ -999,12 +998,12 @@ put_again:
return;
}
if (unlikely(mnt->mnt_pinned)) {
- mnt_add_count(mnt, mnt->mnt_pinned + 1);
+ mnt_add_count(mnt, mnt->mnt_pinned);
mnt->mnt_pinned = 0;
rcu_read_unlock();
unlock_mount_hash();
acct_auto_close_mnt(&mnt->mnt);
- goto put_again;
+ return;
}
if (unlikely(mnt->mnt.mnt_flags & MNT_DOOMED)) {
rcu_read_unlock();
--
1.9.1
The autoclose of BSD accounting writes a record to the BSD accounting
file. When mntput is called from moderaly deep within the stack the
(3KiB or so) generating I/O can be problemenatic as some I/O paths
require nearly 5KiB of stack on their own.
Avoid the possibility of stack overflow by moving the close of the BSD
accounting file into a work queue.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/acct.c | 25 ++++++++++++++++++++-----
1 file changed, 20 insertions(+), 5 deletions(-)
diff --git a/kernel/acct.c b/kernel/acct.c
index 8d6e145138bb..1853dd4a1d01 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -93,6 +93,7 @@ struct bsd_acct_struct {
static DEFINE_SPINLOCK(acct_lock);
static LIST_HEAD(acct_list);
+static LIST_HEAD(acct_close_list);
/*
* Check the amount of free space and suspend/resume accordingly.
@@ -280,6 +281,20 @@ SYSCALL_DEFINE1(acct, const char __user *, name)
return error;
}
+static void acct_close_mnts(struct work_struct *unused)
+{
+ struct bsd_acct_struct *acct;
+
+ spin_lock(&acct_lock);
+restart:
+ list_for_each_entry(acct, &acct_close_list, list) {
+ acct_file_reopen(acct, NULL, NULL);
+ goto restart;
+ }
+ spin_unlock(&acct_lock);
+}
+static DECLARE_WORK(acct_close_work, acct_close_mnts);
+
/**
* acct_auto_close - turn off a filesystem's accounting if it is on
* @m: vfsmount being shut down
@@ -289,15 +304,15 @@ SYSCALL_DEFINE1(acct, const char __user *, name)
*/
void acct_auto_close_mnt(struct vfsmount *m)
{
- struct bsd_acct_struct *acct;
+ struct bsd_acct_struct *acct, *tmp;
spin_lock(&acct_lock);
-restart:
- list_for_each_entry(acct, &acct_list, list)
+ list_for_each_entry_safe(acct, tmp, &acct_list, list) {
if (acct->file && acct->file->f_path.mnt == m) {
- acct_file_reopen(acct, NULL, NULL);
- goto restart;
+ list_move_tail(&acct->list, &acct_close_list);
+ schedule_work(&acct_close_work);
}
+ }
spin_unlock(&acct_lock);
}
--
1.9.1
mntput as part of path_put is called from all over the vfs sometimes
as in the case of symlink chasing from some rather deep call chains.
During lazy filesystem unmount with the right set of races those
innocuous little mntput calls (that take very little stack space) can
call deactivate_super and wind up taking 3k of stack space or more
(David Chinner reports 5k for xfs).
Avoid deactivate_super being called from a deep stack by moving
the cleanup of the mount into a work queue.
To avoid semantic changes mntput waits for deactivate_super to complete
before returning.
With this change all filesystem unmounting happens with about 7400
bytes free on the stack at the point where deactivate_super is called.
Giving filesystems plenty of room to do I/O and not overflow the
kernel stack during unmounting.
The extra fields mnt_cleanup_work, and mnt_undone are added in
a union to avoid growing struct mount unnecessarily.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/mount.h | 14 +++++++++++---
fs/namespace.c | 32 +++++++++++++++++++++++++++-----
2 files changed, 38 insertions(+), 8 deletions(-)
diff --git a/fs/mount.h b/fs/mount.h
index a4111f19fd2e..7272fc416580 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -48,8 +48,17 @@ struct mount {
struct list_head mnt_slave; /* slave list entry */
struct mount *mnt_master; /* slave is on master->mnt_slave_list */
struct mnt_namespace *mnt_ns; /* containing namespace */
- struct mountpoint *mnt_mp; /* where is it mounted */
- struct list_head mnt_mp_list; /* list mounts with the same mountpoint */
+ union {
+ struct {
+ struct path mnt_ex_mountpoint;
+ struct list_head mnt_mp_list; /* list mounts with the same mountpoint */
+ struct mountpoint *mnt_mp; /* where is it mounted */
+ };
+ struct {
+ struct work_struct mnt_cleanup_work;
+ struct completion *mnt_undone;
+ };
+ };
#ifdef CONFIG_FSNOTIFY
struct hlist_head mnt_fsnotify_marks;
__u32 mnt_fsnotify_mask;
@@ -58,7 +67,6 @@ struct mount {
int mnt_group_id; /* peer group identifier */
int mnt_expiry_mark; /* true if marked for expiry */
int mnt_pinned;
- struct path mnt_ex_mountpoint;
};
#define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */
diff --git a/fs/namespace.c b/fs/namespace.c
index 2abd5a2416d0..ac589ad9f22d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -983,8 +983,25 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
return ERR_PTR(err);
}
+static void cleanup_mnt(struct mount *mnt)
+{
+ fsnotify_vfsmount_delete(&mnt->mnt);
+ dput(mnt->mnt.mnt_root);
+ deactivate_super(mnt->mnt.mnt_sb);
+ mnt_free_id(mnt);
+ complete(mnt->mnt_undone);
+ call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
+}
+
+static void cleanup_mnt_work(struct work_struct *work)
+{
+ cleanup_mnt(container_of(work, struct mount, mnt_cleanup_work));
+}
+
static void mntput_no_expire(struct mount *mnt)
{
+ struct completion undone;
+
rcu_read_lock();
mnt_add_count(mnt, -1);
if (likely(mnt->mnt_ns)) { /* shouldn't be the last one */
@@ -1027,11 +1044,16 @@ static void mntput_no_expire(struct mount *mnt)
* so mnt_get_writers() below is safe.
*/
WARN_ON(mnt_get_writers(mnt));
- fsnotify_vfsmount_delete(&mnt->mnt);
- dput(mnt->mnt.mnt_root);
- deactivate_super(mnt->mnt.mnt_sb);
- mnt_free_id(mnt);
- call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
+ /* The stack may be deep here, cleanup the mount on a work
+ * queue where the stack is guaranteed to be shallow.
+ */
+ init_completion(&undone);
+ mnt->mnt_undone = &undone;
+
+ INIT_WORK(&mnt->mnt_cleanup_work, cleanup_mnt_work);
+ schedule_work(&mnt->mnt_cleanup_work);
+
+ wait_for_completion(&undone);
}
void mntput(struct vfsmount *mnt)
--
1.9.1
Have the mntput that started the BSD accounting files closing block
until the superblock has been succesfully freed. This is what used to
happen until fput started calling mntput assynchronously, and is arguably
the right thing to do in the case of umount.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/namespace.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index ac589ad9f22d..b10db3d69943 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1015,11 +1015,14 @@ static void mntput_no_expire(struct mount *mnt)
return;
}
if (unlikely(mnt->mnt_pinned)) {
+ init_completion(&undone);
+ mnt->mnt_undone = &undone;
mnt_add_count(mnt, mnt->mnt_pinned);
mnt->mnt_pinned = 0;
rcu_read_unlock();
unlock_mount_hash();
acct_auto_close_mnt(&mnt->mnt);
+ wait_for_completion(&undone);
return;
}
if (unlikely(mnt->mnt.mnt_flags & MNT_DOOMED)) {
@@ -1048,7 +1051,10 @@ static void mntput_no_expire(struct mount *mnt)
* queue where the stack is guaranteed to be shallow.
*/
init_completion(&undone);
- mnt->mnt_undone = &undone;
+ if (!mnt->mnt_undone)
+ mnt->mnt_undone = &undone;
+ else
+ complete(&undone);
INIT_WORK(&mnt->mnt_cleanup_work, cleanup_mnt_work);
schedule_work(&mnt->mnt_cleanup_work);
--
1.9.1
There are a lot of ways we could approach this, and I sat down and wrote
the simplest variant I could think of, so hopefully there are not silly
bugs that get overlooked.
The code move all cleanup from mntput that would do filesystem I/O into
work queues.
The code waits in mntput for that I/O to complete if we wait today.
The code has been tested and it works, and it succeeds in running
deactivate_super in a stack from with just short of 7500 bytes free.
7500 bytes should be enough for anybody!
I want to double check that I am using work queues correctly, they used
to be deadlock prone, and sleep on everything before I commit to a final
version but this version is probably good enough.
I may be going overboard in the case where we auto close BSD accounting
files, but at this point I figure better safe than sorry.
Eric W. Biederman (4):
vfs: Remove useless loop in mntput_no_expire
vfs: Move autoclose of BSD accounting into a work queue
vfs: In mntput run deactivate_super on a shallow stack.
vfs: Block intuitively in the case of BSD accounting files
fs/mount.h | 14 +++++++++++---
fs/namespace.c | 43 +++++++++++++++++++++++++++++++++++--------
kernel/acct.c | 25 ++++++++++++++++++++-----
3 files changed, 66 insertions(+), 16 deletions(-)
[email protected] (Eric W. Biederman) writes:
> There are a lot of ways we could approach this, and I sat down and wrote
> the simplest variant I could think of, so hopefully there are not silly
> bugs that get overlooked.
>
> The code move all cleanup from mntput that would do filesystem I/O into
> work queues.
>
> The code waits in mntput for that I/O to complete if we wait today.
>
> The code has been tested and it works, and it succeeds in running
> deactivate_super in a stack from with just short of 7500 bytes free.
> 7500 bytes should be enough for anybody!
>
> I want to double check that I am using work queues correctly, they used
> to be deadlock prone, and sleep on everything before I commit to a final
> version but this version is probably good enough.
I have confirmed by emperical testing, reading the code and with lockdep
that moving the work of mntput onto the system_wq does not introduce
deadlocks.
I have read through all of fsnotify_vfsmount_delete looking for anything
that might be affected by moving fsnotify_vfsmount_delete into another
context and thankfully there is nothing. At most audit will log the
removal of a rule, and inotify will log a removal. But in neither case
is there any information taken from the context of the current process.
> I may be going overboard in the case where we auto close BSD accounting
> files, but at this point I figure better safe than sorry.
After sleeping on the code I realized that putting a union in mount.h
to keep the size the same as bit to clever. Otherwise I am satisfied
with the code and have tested it every way I can think of.
I have looked a bit more at the issue of removing the blocking from
mntput and it looks this set of changes can easily serve as an
intermediate step to get there.
Certainly the suggested flush_pending_mntput(fs_type) which is probably
better called flush_filesystem(fs_type) is straight forward to
implement. The code just needs to wait for fs_type->fs_supers to become
an empty list, after walking fs_supers->s_mounts and force a lazy
unmount on all mounts of the filesystem.
A few stray thoughts, in modules filesystems and error handling.
- kern_mount from a module is nasty because rmmod is only supported if
(struct file_system_type).owner == NULL. So unlike other filesystems
which we know can not be in use during module unload there are
no programatic guarantees about a kernel mount.
- Today a module that calls register_filestem or kern_mount and does
anything else afterward that might fail is problematic.
We have nothing that would flush or otherwise force a registered
filesystem to be unmounted. Ouch.
kern_mount for anything except debugging is in a similar situation.
Presumably the filesystem initialization will make something available
to the kernel at large that will have access to the mount point.
At which point we can have filesystem reference counts that we can not
statically account for and we don't have the code to flush them.
Which is a long way of saying it would probably be a good idea to
write and use flush_filesystem even if we never move to a mntput that
doesn't sleep.
Eric
Linus,
Please pull the for-linus branch from the git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus
HEAD: 3efe1ac78e996da8e141b86667cc15758aad4366 vfs: Block intuitively in the case of BSD accounting files
This branch contains 3 significant fixes.
- Allowing a file or directory that is only a mountpoint in another
mount namespace to be renamed or removed.
- Removal of the need to lie to the vfs when a mountpoint has been
renamed or removed on a remote filesystem.
- Cleaning up a mount in a separate context so that mntput is safe
to call even from deep on the kernel stack.
Allowing files and directories to be renamed and removed prevents
a DOS attack where an unprivileged user could prevent root from renaming
or deleting files and directories by using them as mountpoints in
another mount namespace.
To verify that nothing significant changed from since the time this code
was written I performed a test merge, and tested the result. As of
roughly 3.15-rc1 there were no real conflicts just changes in context.
This is the same code I sent a pull request for in the merge window plus
changes to mntput to prevent the class of stack overflow the other
changes made easier to trigger, that was pointed out by Al Viro.
Eric W. Biederman (16):
vfs: Document the effect of d_revalidate on d_find_alias
vfs: More precise tests in d_invalidate
vfs: Don't allow overwriting mounts in the current mount namespace
vfs: Keep a list of mounts on a mount point
vfs: factor out lookup_mountpoint from new_mountpoint
vfs: Add a function to lazily unmount all mounts from any dentry.
vfs: Lazily remove mounts on unlinked files and directories.
vfs: Remove unnecessary calls of check_submounts_and_drop
vfs: Merge check_submounts_and_drop and d_invalidate
vfs: Make d_invalidate return void
vfs: Remove d_drop calls from d_revalidate implementations
proc: Update proc_flush_task_mnt to use d_invalidate
vfs: Remove useless loop in mntput_no_expire
vfs: Move autoclose of BSD accounting into a work queue
vfs: In mntput run deactivate_super on a shallow stack.
vfs: Block intuitively in the case of BSD accounting files
fs/afs/dir.c | 5 --
fs/btrfs/ioctl.c | 5 +-
fs/ceph/dir.c | 1 -
fs/cifs/readdir.c | 6 +--
fs/dcache.c | 140 ++++++++++++++++++-------------------------------
fs/fuse/dir.c | 7 +--
fs/gfs2/dentry.c | 3 --
fs/kernfs/dir.c | 11 ----
fs/mount.h | 22 ++++++++
fs/namei.c | 28 ++++++----
fs/namespace.c | 130 +++++++++++++++++++++++++++++++++++++++++----
fs/nfs/dir.c | 7 +--
fs/proc/base.c | 10 +---
fs/proc/fd.c | 2 -
include/linux/dcache.h | 3 +-
kernel/acct.c | 25 +++++++--
16 files changed, 235 insertions(+), 170 deletions(-)
Eric
---
My merge conflict resolution
__d_move gained an argument, my changes took __d_move out of a conditional.
m_hash went from a struct list to a struct hlist changing nearby code,
and affecting my factoring out of lookup_mountpoint from new_mountpoint.
There was a major refactoring of rename. As long as d_mountpoint
becomes is_local_mounptoint and detach_mount is added after dont_mount
all is well.
delayed_free was renamed vfsmount_delayed_free
dcache.c | 6 +----
mount.h | 1
namei.c | 9 +++++++-
namespace.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
4 files changed, 74 insertions(+), 8 deletions(-)
diff --cc fs/dcache.c
index 40707d88a945,5b78bd98649c..0407ed4a2604
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@@ -2701,10 -2631,8 +2663,8 @@@ static struct dentry *__d_unalias(struc
goto out_err;
m2 = &alias->d_parent->d_inode->i_mutex;
out_unalias:
- if (likely(!d_mountpoint(alias))) {
- __d_move(alias, dentry, false);
- ret = alias;
- }
- __d_move(alias, dentry);
++ __d_move(alias, dentry, false);
+ ret = alias;
out_err:
spin_unlock(&inode->i_lock);
if (m2)
diff --cc fs/mount.h
index d55297f2fa05,4104a3cca238..84c282321d67
--- a/fs/mount.h
+++ b/fs/mount.h
@@@ -19,8 -19,9 +19,9 @@@ struct mnt_pcp
};
struct mountpoint {
- struct list_head m_hash;
+ struct hlist_node m_hash;
struct dentry *m_dentry;
+ struct list_head m_list;
int m_count;
};
diff --cc fs/namei.c
index c6157c894fce,384fcc6a5606..38cb09a05765
--- a/fs/namei.c
+++ b/fs/namei.c
@@@ -4101,33 -4045,17 +4104,33 @@@ int vfs_rename(struct inode *old_dir, s
if (error)
return error;
+ old_name = fsnotify_oldname_init(old_dentry->d_name.name);
dget(new_dentry);
- lock_two_nondirectories(source, target);
+ if (!is_dir || (flags & RENAME_EXCHANGE))
+ lock_two_nondirectories(source, target);
+ else if (target)
+ mutex_lock(&target->i_mutex);
error = -EBUSY;
- if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
+ if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
goto out;
- error = try_break_deleg(source, delegated_inode);
- if (error)
- goto out;
- if (target) {
+ if (max_links && new_dir != old_dir) {
+ error = -EMLINK;
+ if (is_dir && !new_is_dir && new_dir->i_nlink >= max_links)
+ goto out;
+ if ((flags & RENAME_EXCHANGE) && !is_dir && new_is_dir &&
+ old_dir->i_nlink >= max_links)
+ goto out;
+ }
+ if (is_dir && !(flags & RENAME_EXCHANGE) && target)
+ shrink_dcache_parent(new_dentry);
+ if (!is_dir) {
+ error = try_break_deleg(source, delegated_inode);
+ if (error)
+ goto out;
+ }
+ if (target && !new_is_dir) {
error = try_break_deleg(target, delegated_inode);
if (error)
goto out;
@@@ -4142,31 -4064,73 +4145,38 @@@
if (error)
goto out;
+ if (!(flags & RENAME_EXCHANGE) && target) {
+ if (is_dir)
+ target->i_flags |= S_DEAD;
+ dont_mount(new_dentry);
++ detach_mounts(new_dentry);
+ }
+ if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE)) {
+ if (!(flags & RENAME_EXCHANGE))
+ d_move(old_dentry, new_dentry);
+ else
+ d_exchange(old_dentry, new_dentry);
+ }
+ if (target) {
+ dont_mount(new_dentry);
+ detach_mounts(new_dentry);
+ }
+ if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE))
+ d_move(old_dentry, new_dentry);
out:
- unlock_two_nondirectories(source, target);
+ if (!is_dir || (flags & RENAME_EXCHANGE))
+ unlock_two_nondirectories(source, target);
+ else if (target)
+ mutex_unlock(&target->i_mutex);
dput(new_dentry);
- return error;
-}
-
-/**
- * vfs_rename - rename a filesystem object
- * @old_dir: parent of source
- * @old_dentry: source
- * @new_dir: parent of destination
- * @new_dentry: destination
- * @delegated_inode: returns an inode needing a delegation break
- *
- * The caller must hold multiple mutexes--see lock_rename()).
- *
- * If vfs_rename discovers a delegation in need of breaking at either
- * the source or destination, it will return -EWOULDBLOCK and return a
- * reference to the inode in delegated_inode. The caller should then
- * break the delegation and retry. Because breaking a delegation may
- * take a long time, the caller should drop all locks before doing
- * so.
- *
- * Alternatively, a caller may pass NULL for delegated_inode. This may
- * be appropriate for callers that expect the underlying filesystem not
- * to be NFS exported.
- */
-int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
- struct inode *new_dir, struct dentry *new_dentry,
- struct inode **delegated_inode)
-{
- int error;
- int is_dir = d_is_directory(old_dentry) || d_is_autodir(old_dentry);
- const unsigned char *old_name;
-
- if (old_dentry->d_inode == new_dentry->d_inode)
- return 0;
-
- error = may_delete(old_dir, old_dentry, is_dir);
- if (error)
- return error;
-
- if (!new_dentry->d_inode)
- error = may_create(new_dir, new_dentry);
- else
- error = may_delete(new_dir, new_dentry, is_dir);
- if (error)
- return error;
-
- if (!old_dir->i_op->rename)
- return -EPERM;
-
- old_name = fsnotify_oldname_init(old_dentry->d_name.name);
-
- if (is_dir)
- error = vfs_rename_dir(old_dir,old_dentry,new_dir,new_dentry);
- else
- error = vfs_rename_other(old_dir,old_dentry,new_dir,new_dentry,delegated_inode);
- if (!error)
+ if (!error) {
fsnotify_move(old_dir, new_dir, old_name, is_dir,
- new_dentry->d_inode, old_dentry);
+ !(flags & RENAME_EXCHANGE) ? target : NULL, old_dentry);
+ if (flags & RENAME_EXCHANGE) {
+ fsnotify_move(new_dir, old_dir, old_dentry->d_name.name,
+ new_is_dir, NULL, new_dentry);
+ }
+ }
fsnotify_oldname_free(old_name);
return error;
diff --cc fs/namespace.c
index 182bc41cd887,128c051041be..b10db3d69943
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@@ -667,13 -632,47 +668,47 @@@ struct vfsmount *lookup_mnt(struct pat
return m;
}
- static struct mountpoint *new_mountpoint(struct dentry *dentry)
+ /*
+ * __is_local_mountpoint - Test to see if dentry is a mountpoint in the
+ * current mount namespace.
+ *
+ * The common case is dentries are not mountpoints at all and that
+ * test is handled inline. For the slow case when we are actually
+ * dealing with a mountpoint of some kind, walk through all of the
+ * mounts in the current mount namespace and test to see if the dentry
+ * is a mountpoint.
+ *
+ * The mount_hashtable is not usable in the context because we
+ * need to identify all mounts that may be in the current mount
+ * namespace not just a mount that happens to have some specified
+ * parent mount.
+ */
+ bool __is_local_mountpoint(struct dentry *dentry)
+ {
+ struct mnt_namespace *ns = current->nsproxy->mnt_ns;
+ struct mount *mnt;
+ bool is_covered = false;
+
+ if (!d_mountpoint(dentry))
+ goto out;
+
+ down_read(&namespace_sem);
+ list_for_each_entry(mnt, &ns->list, mnt_list) {
+ is_covered = (mnt->mnt_mountpoint == dentry);
+ if (is_covered)
+ break;
+ }
+ up_read(&namespace_sem);
+ out:
+ return is_covered;
+ }
+
+ static struct mountpoint *lookup_mountpoint(struct dentry *dentry)
{
- struct list_head *chain = mountpoint_hashtable + hash(NULL, dentry);
+ struct hlist_head *chain = mp_hash(dentry);
struct mountpoint *mp;
- int ret;
- list_for_each_entry(mp, chain, m_hash) {
+ hlist_for_each_entry(mp, chain, m_hash) {
if (mp->m_dentry == dentry) {
/* might be worth a WARN_ON() */
if (d_unlinked(dentry))
@@@ -682,6 -681,14 +717,14 @@@
return mp;
}
}
+ return NULL;
+ }
+
+ static struct mountpoint *new_mountpoint(struct dentry *dentry)
+ {
- struct list_head *chain = mountpoint_hashtable + hash(NULL, dentry);
++ struct hlist_head *chain = mp_hash(dentry);
+ struct mountpoint *mp;
+ int ret;
mp = kmalloc(sizeof(struct mountpoint), GFP_KERNEL);
if (!mp)
@@@ -695,7 -702,8 +738,8 @@@
mp->m_dentry = dentry;
mp->m_count = 1;
- list_add(&mp->m_hash, chain);
+ hlist_add_head(&mp->m_hash, chain);
+ INIT_LIST_HEAD(&mp->m_list);
return mp;
}
@@@ -748,7 -757,8 +793,8 @@@ static void detach_mnt(struct mount *mn
mnt->mnt_parent = mnt;
mnt->mnt_mountpoint = mnt->mnt.mnt_root;
list_del_init(&mnt->mnt_child);
- list_del_init(&mnt->mnt_hash);
+ hlist_del_init_rcu(&mnt->mnt_hash);
+ list_del_init(&mnt->mnt_mp_list);
put_mountpoint(mnt->mnt_mp);
mnt->mnt_mp = NULL;
}
@@@ -936,9 -943,35 +983,25 @@@ static struct mount *clone_mnt(struct m
return ERR_PTR(err);
}
-static void delayed_free(struct rcu_head *head)
-{
- struct mount *mnt = container_of(head, struct mount, mnt_rcu);
- kfree(mnt->mnt_devname);
-#ifdef CONFIG_SMP
- free_percpu(mnt->mnt_pcp);
-#endif
- kmem_cache_free(mnt_cache, mnt);
-}
-
+ static void cleanup_mnt(struct mount *mnt)
+ {
+ fsnotify_vfsmount_delete(&mnt->mnt);
+ dput(mnt->mnt.mnt_root);
+ deactivate_super(mnt->mnt.mnt_sb);
+ mnt_free_id(mnt);
+ complete(mnt->mnt_undone);
- call_rcu(&mnt->mnt_rcu, delayed_free);
++ call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
+ }
+
+ static void cleanup_mnt_work(struct work_struct *work)
+ {
+ cleanup_mnt(container_of(work, struct mount, mnt_cleanup_work));
+ }
+
static void mntput_no_expire(struct mount *mnt)
{
- put_again:
+ struct completion undone;
+
rcu_read_lock();
mnt_add_count(mnt, -1);
if (likely(mnt->mnt_ns)) { /* shouldn't be the last one */
Have you tried to profile something like umount -l on a large mount tree?
You variant causes a shitstorm of
schedule work
switch to workqueue
do actual fs shutdown
wake umount(8) up
get through wait_for_completion()
for every bleeding vfsmount in there. And no, it's *not* guaranteed to
be dominated by fs shutdown time. Here's the case where it definitely
won't be:
mkdir /tmp/a
mount --rbind / /tmp/a
umount -l /tmp/a
All vfsmounts involved are killed off with no fs shutdown. And that's
*not* a rare case - exit of the last process in namespace is very likely
to look that way too.
That's far too heavy.
Al Viro <[email protected]> writes:
> Have you tried to profile something like umount -l on a large mount tree?
> You variant causes a shitstorm of
> schedule work
> switch to workqueue
> do actual fs shutdown
> wake umount(8) up
> get through wait_for_completion()
> for every bleeding vfsmount in there. And no, it's *not* guaranteed to
> be dominated by fs shutdown time. Here's the case where it definitely
> won't be:
> mkdir /tmp/a
> mount --rbind / /tmp/a
> umount -l /tmp/a
>
> All vfsmounts involved are killed off with no fs shutdown. And that's
> *not* a rare case - exit of the last process in namespace is very likely
> to look that way too.
>
> That's far too heavy.
The typically system has about 12 mounts and no mount namespaces.
Making umount of any kind a rare case.
Al my apologies for not picking your favorite solution and in choosing
not to audit every call to mntput in the tree this week I have made an
uncommon case slower.
These bug fixes are important, the code is correct, and it is
technically straight forward to remove the wait in mntput in most
cases.
I am very frustrated that when given lots of opportunities to look at
and comment on my code it is hard to get you to engage and comment
unless I send a pull request to Linus.
This last objection really looks to be a case of perfection getting in
the way of the good.
Eric
On Thu, Apr 17, 2014 at 02:23:27PM -0700, Eric W. Biederman wrote:
> The typically system has about 12 mounts and no mount namespaces.
> Making umount of any kind a rare case.
The typical system has no userns either. Take a look at e.g. docker
setups (and unless I'm mistaken, they are going to become a serious
usecase for your stuff as well).
> Al my apologies for not picking your favorite solution and in choosing
> not to audit every call to mntput in the tree this week I have made an
> uncommon case slower.
>
> These bug fixes are important, the code is correct, and it is
> technically straight forward to remove the wait in mntput in most
> cases.
>
> I am very frustrated that when given lots of opportunities to look at
> and comment on my code it is hard to get you to engage and comment
> unless I send a pull request to Linus.
Sigh... FYI, here's what was going on through since Sunday: I had been
beating vfs.git#for-next into shape, exactly to avoid the kind of fun
we had this cycle. With xfstests update in the middle of it, picking
a bunch of fun both in my code and elsewhere. With any luck, this stuff
is dealt with (see vfs.git on git.kernel.org) and we might be able to
avoid that kind of shit this time around.
Now I can get to your series. And no, I'm _not_ suggesting an audit of
every mntput() in the tree. Really. It's much more limited thing -
we only need to take care of
* mntput() from kernel threads. The same kind of situation we
have with final fput(), the same solution works.
* *KERNEL* vfsmounts; anything without MNT_INTERNAL is fair game
as far as ordering of fs shutdown is concerned. If it wasn't MNT_INTERNAL,
and we'd done mntput() before destroying some data structures needed for
fs shutdown, we had been fucked anyway - no warranties that mntput() had
been the final one. Moreover, that means going through kern_unmount() or
simple_release_fs(). Which is trivially dealt with by providing
mntput_sync(mnt) that *will* wait and using it in those two. Again, if
we have an extra reference (e.g. stuffed into a struct file, etc.) -
we can't rely on mntput() being final.
I'd probably turn mntput_no_expire() into something like
static struct mount *__mntput(struct mount *m)
that would return NULL if nothing needs to be killed and returned m
if m really needs killing. Leaving the caller to decide what to do
with that puppy. We have, as it is, exactly two callers - exit path
in sys_umount() and mntput(). So we add two more functions:
static void kill_mnt_async(struct mount *m)
and
static void kill_mnt_sync(struct mount *m)
both being no-op on NULL. Then in sys_umount() and mntput() we do
kill_mnt_async(__mntput(mnt));
and in mntput_sync() - kill_mnt_sync(__mntput(mnt));
For that matter, kill_mnt_sync() (basically, your variant with completions)
can be folded into mntput_sync().
kill_mnt_async() would be a direct analog of this:
struct task_struct *task = current;
if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
init_task_work(&file->f_u.fu_rcuhead, ____fput);
if (!task_work_add(task, &file->f_u.fu_rcuhead, true))
return;
/*
* After this task has run exit_task_work(),
* task_work_add() will fail. Fall through to delayed
* fput to avoid leaking *file.
*/
}
if (llist_add(&file->f_u.fu_llist, &delayed_fput_list))
schedule_delayed_work(&delayed_fput_work, 1);
with obvious replacements. Works for struct file, will work here as well.
That's all. And yes, I believe that such series would make sense on its
own and once it survives beating (see above about docker - that bastard has
surprised me quite a bit re stressing namespace-related codepaths), I would
be quite willing to help with getting it merged.
On Thu, Apr 17, 2014 at 11:12:03PM +0100, Al Viro wrote:
> * *KERNEL* vfsmounts; anything without MNT_INTERNAL is fair game
> as far as ordering of fs shutdown is concerned. If it wasn't MNT_INTERNAL,
> and we'd done mntput() before destroying some data structures needed for
s/before/just before/, of course ;-)
> fs shutdown, we had been fucked anyway - no warranties that mntput() had
> been the final one. Moreover, that means going through kern_unmount() or
> simple_release_fs(). Which is trivially dealt with by providing
> mntput_sync(mnt) that *will* wait and using it in those two. Again, if
> we have an extra reference (e.g. stuffed into a struct file, etc.) -
> we can't rely on mntput() being final.
On Thu, Apr 17, 2014 at 11:12:03PM +0100, Al Viro wrote:
> That's all. And yes, I believe that such series would make sense on its
> own and once it survives beating (see above about docker - that bastard has
> surprised me quite a bit re stressing namespace-related codepaths), I would
> be quite willing to help with getting it merged.
FWIW, the tricky part around auto-close of acct is that we really want to
preserve the following property:
In usual setups, umount(2) will not return until fs has been
shut down.
fput() being async is not a problem - it will be processed before we
return to userland. I agree that we don't need the loop anymore (it's
basically a stack depth reduction measure that was needed with sync
fput() - without "add one more and deal with it when we return" we
would be getting mntput_no_expire -> fput -> mntput -> fs shutdown
back then). But offloading that fput() to workqueue makes it really
possible to have actual fs shutdown happen after umount(2) returns,
without any extra mounts of the same fs, etc. And since that shutdown
*can* take a long time (lots of dirty pages around, slow device or
slow network, etc.), we really might be talking about e.g. umount(8)
being finished before fs shutdown finishes. It's an expected situation
when we have the same thing still mounted elsewhere or lazy-umounted
and busy, but this changes behaviour on setups where we had been
guaranteed that umount -a *will* wait until all filesystems except
root are shut down and root is remounted r/o. So this change really
can cause data loss on reboot(8)/halt(8) on existing boxen...
IOW, workqueue is not the right tool here. OTOH, it looks like we do have
a problem with kernel/acct.c vs. umount; it just requires a race between
auto-closing and acct_process_in_ns(). It's narrow, so it doesn't bite
us all the time, but it's there... Damn, it had been a long time since
I really looked at that code ;-/
Actually, there's another reason why workqueue is bogus - we call
do_acct_process(), same as we do on acct(NULL) (which might or might
not be a good idea), but at least with do that from the context of
real process doing umount(2). Doing that from workqueue is going to
produce a really bogus record...
On Fri, Apr 18, 2014 at 01:37:26AM +0100, Al Viro wrote:
> IOW, workqueue is not the right tool here. OTOH, it looks like we do have
> a problem with kernel/acct.c vs. umount; it just requires a race between
> auto-closing and acct_process_in_ns(). It's narrow, so it doesn't bite
> us all the time, but it's there... Damn, it had been a long time since
> I really looked at that code ;-/
>
> Actually, there's another reason why workqueue is bogus - we call
> do_acct_process(), same as we do on acct(NULL) (which might or might
> not be a good idea), but at least with do that from the context of
> real process doing umount(2). Doing that from workqueue is going to
> produce a really bogus record...
Egads... Why the hell are we forming (almost) the same record again
and again for every pidns the process belongs to? Sure, we want
pid/ppid/uid/gid munged, but the rest of it?
And there's something else wrong here - what happens if the last process
in a namespace where we have accounting going on just plain exits?
All mounts in that namespace get dissolved. Which leads to acct being
autoclosed. From the context of a process that already has done
acct_process().
Do we ever want to write an acct record on autoclose-on-umount? Do we
want that record of umount(8) we would've missed otherwise (along with
those of all other still living processes - those we *will* miss anyway)?
Linus, do you have objections against dropping that behaviour? In theory,
some tools might look at the last record in acct file to figure out what
has stopped the sucker (accton vs. umount), so it's a user-visible change,
but...
On Fri, Apr 18, 2014 at 01:37:26AM +0100, Al Viro wrote:
> FWIW, the tricky part around auto-close of acct is that we really want to
> preserve the following property:
>
> In usual setups, umount(2) will not return until fs has been
> shut down.
>
> fput() being async is not a problem - it will be processed before we
> return to userland. I agree that we don't need the loop anymore (it's
> basically a stack depth reduction measure that was needed with sync
> fput() - without "add one more and deal with it when we return" we
> would be getting mntput_no_expire -> fput -> mntput -> fs shutdown
> back then). But offloading that fput() to workqueue makes it really
> possible to have actual fs shutdown happen after umount(2) returns,
> without any extra mounts of the same fs, etc. And since that shutdown
> *can* take a long time (lots of dirty pages around, slow device or
> slow network, etc.), we really might be talking about e.g. umount(8)
> being finished before fs shutdown finishes. It's an expected situation
> when we have the same thing still mounted elsewhere or lazy-umounted
> and busy, but this changes behaviour on setups where we had been
> guaranteed that umount -a *will* wait until all filesystems except
> root are shut down and root is remounted r/o. So this change really
> can cause data loss on reboot(8)/halt(8) on existing boxen...
My apologies for confusion - I have not looked at your last commit.
I *really* don't like that solution, but it probably does close that
particular problem. Consider that objection withdrawn (modulo "you
have created a bisect hazard here, that part of series needs to be
reordered").
I still don't think that this is the right approach. Hell knows ;-/
Maybe you are right about offloading auto-close to a wq... The whole
mnt_pin/mnt_unpin is bloody ugly, especially with multiple references
held by kernel/acct.c-opened files ;-/ And yes, this horror is originally
mine; the only excuse I have is that previous variant had been even more
kludgy.
_If_ we push it to wq, we really want to stop calling do_process_acct()
on auto-close. With umount(8) it's borderline defensible, but with
workqueue it makes no sense whatsoever. Moreover, it makes no sense
on acct_exit_ns() - it's done when PID 1 of that pidns dies, after having
called acct_process(). What's the point of having accounting in
non-root pidns spew two entries for its init?
Assuming we can stop generating records on auto-close, do we need
the wq thing at all? After all, we could rip the affected acct->file
out, then flush and fput them all right there. The usual logics for fput()
will make sure that they (and all of them will be final) will be
processed before we leave umount(2). And let the last one trigger the
final mntput, now that mnt_pinned is zero...
PS: AFAICT, *BSD do not spew that even on explicit acct(NULL). 4.3 variants
found on PUPS do not, ditto for 4.4BSD-Alpha2 there; in Net2 kern_acct.c
is one of the castration victims, so it doesn't do anything at all and in
current {Free,Net,Open}BSD kernels it also doesn't happen.
Why do we bother with that at all? It had been that way all way back to
2.1.68pre1 when kernel/acct.c had been merged, and it's probably too
late by now to try and ask the author about the reasons...
On Sat, Apr 19, 2014 at 02:35:26AM +0100, Al Viro wrote:
> My apologies for confusion - I have not looked at your last commit.
> I *really* don't like that solution, but it probably does close that
> particular problem. Consider that objection withdrawn (modulo "you
> have created a bisect hazard here, that part of series needs to be
> reordered").
>
> I still don't think that this is the right approach. Hell knows ;-/
> Maybe you are right about offloading auto-close to a wq... The whole
> mnt_pin/mnt_unpin is bloody ugly, especially with multiple references
> held by kernel/acct.c-opened files ;-/
Actually, it's worse than ugly. Consider the following race:
* umount(2) decides that victim isn't busy and does everything up to
the final mntput_no_expire().
* acct(NULL) is called, and gets to
if (old_acct) {
mnt_unpin(old_acct->f_path.mnt);
spin_unlock(&acct_lock);
do_acct_process(acct, old_ns, old_acct);
filp_close(old_acct, NULL);
spin_lock(&acct_lock);
}
It starts writing and blocks. Now mnt_pinned is 0 and refcount is 1, so
mntput_no_expire() from umount(2) does not see anything untowards and just
returns. Eventually, write finishes and acct(2) does filp_close(). Fine,
but by that point umount(2) has already returned to userland and we have
the problem I'd been complaining about in your solution. And your patches
won't go anywhere near wait_for_completion() in that case, so they don't
close that hole either...
Not your fault, and not that scary wrt dirty shutdown, but it still needs
fixing. While we are at it, consider what happens if something is busy
in acct_process() while we are hitting that race. This filp_close()
will happen before the final fput(), so the actual fs shutdown is moved
to whatever exiting process that was in acct_process() at that point.
Might be more than one of those, even - then the last one to finish
writing will end up carrying the bucket.
Actually, even without umount(2) in the mix (just acct(NULL) vs. exiting
processes) it's somewhat fishy - we are, after all, calling ->flush()
before the final entries are written.
That, BTW, is another reason why "let's write one last entry on acct(NULL)"
is bogus - userland tools might hope to use that to get information about the
command that has stopped logging, but it is not guaranteed to be the last
one.
On Sat, Apr 19, 2014 at 03:16:46AM +0100, Al Viro wrote:
> On Sat, Apr 19, 2014 at 02:35:26AM +0100, Al Viro wrote:
>
> > My apologies for confusion - I have not looked at your last commit.
> > I *really* don't like that solution, but it probably does close that
> > particular problem. Consider that objection withdrawn (modulo "you
> > have created a bisect hazard here, that part of series needs to be
> > reordered").
> >
> > I still don't think that this is the right approach. Hell knows ;-/
> > Maybe you are right about offloading auto-close to a wq... The whole
> > mnt_pin/mnt_unpin is bloody ugly, especially with multiple references
> > held by kernel/acct.c-opened files ;-/
>
> Actually, it's worse than ugly. Consider the following race:
> * umount(2) decides that victim isn't busy and does everything up to
> the final mntput_no_expire().
> * acct(NULL) is called, and gets to
> if (old_acct) {
> mnt_unpin(old_acct->f_path.mnt);
> spin_unlock(&acct_lock);
> do_acct_process(acct, old_ns, old_acct);
> filp_close(old_acct, NULL);
> spin_lock(&acct_lock);
> }
> It starts writing and blocks. Now mnt_pinned is 0 and refcount is 1, so
> mntput_no_expire() from umount(2) does not see anything untowards and just
> returns. Eventually, write finishes and acct(2) does filp_close(). Fine,
> but by that point umount(2) has already returned to userland and we have
> the problem I'd been complaining about in your solution. And your patches
> won't go anywhere near wait_for_completion() in that case, so they don't
> close that hole either...
>
> Not your fault, and not that scary wrt dirty shutdown, but it still needs
> fixing. While we are at it, consider what happens if something is busy
> in acct_process() while we are hitting that race. This filp_close()
> will happen before the final fput(), so the actual fs shutdown is moved
> to whatever exiting process that was in acct_process() at that point.
> Might be more than one of those, even - then the last one to finish
> writing will end up carrying the bucket.
>
> Actually, even without umount(2) in the mix (just acct(NULL) vs. exiting
> processes) it's somewhat fishy - we are, after all, calling ->flush()
> before the final entries are written.
>
> That, BTW, is another reason why "let's write one last entry on acct(NULL)"
> is bogus - userland tools might hope to use that to get information about the
> command that has stopped logging, but it is not guaranteed to be the last
> one.
FWIW, I think I have a solution for that one. Let's drop mnt_pinned crap
completely; we don't really have to use the _same_ vfsmount for that struct
file. A clone will do just as well. So let's do the following:
* have sys_acct() clone the vfsmount it got in file->f_path.mnt
and replace the original with it. And hang it off the original vfsmount
(on a new cyclic list, for example). mntput() the original after that,
don't have it pinned down at all.
* to hell with special-casing that in mntput_no_expire(); let's
deal with those guys in the function you've called cleanup_mnt(). That's
where we'll do auto-close.
* serialize auto-close taking acct out/explicit acct(2)/writes to that
acct; writes *will* be serialized anyway, so we are not losing any concurrency
here. Lose get_file/fput on the write side (acct_process_in_ns()).
Refcount acct (protecting the counter with acct_lock spinlock we already
have). auto-close would walk the list under acct_lock, skip the ones we
don't care about. Found one we want to shut down? Bump the refcount,
drop acct_lock, grab mutex on acct. Recheck if it's still of interest
for us and if it is, do what acct(NULL) would've done with it. Then
drop mutex (as acct(NULL) would've done), retake acct_lock, move to the
next list element and decrement the refcount on the one we'd dealt with.
If the refcount has reached zero, list_del() and kfree, which is fine
under acct_lock. No need to rescan anything - no new ones of interest
will be appearing.
* acct_exit_ns() does an equivalent of acct(NULL), then grabs
acct_lock, decrements refcount and if it has reached zero does the same
list_del() + kfree().
AFAICS, that ought to work; however, I'm half-asleep right now, so I might
be easily missing something. If you see any holes in the above - yell.
I'll try to put something together tomorrow after I get some sleep (it's
3am here)...
On Thu, Apr 17, 2014 at 11:12:03PM +0100, Al Viro wrote:
> I'd probably turn mntput_no_expire() into something like
> static struct mount *__mntput(struct mount *m)
> that would return NULL if nothing needs to be killed and returned m
> if m really needs killing. Leaving the caller to decide what to do
> with that puppy. We have, as it is, exactly two callers - exit path
> in sys_umount() and mntput(). So we add two more functions:
> static void kill_mnt_async(struct mount *m)
> and
> static void kill_mnt_sync(struct mount *m)
> both being no-op on NULL. Then in sys_umount() and mntput() we do
> kill_mnt_async(__mntput(mnt));
> and in mntput_sync() - kill_mnt_sync(__mntput(mnt));
> For that matter, kill_mnt_sync() (basically, your variant with completions)
> can be folded into mntput_sync().
Actually, all kern_unmount() callers are doing that from fairly shallow
stack depth and all simple_release_fs() ones are dealing with rather
trivial ->kill_sb(). So mntput_sync() is an overkill; all we need is
if (mnt->mnt_flags & MNT_INTERNAL) {
cleanup_mnt(mnt);
return;
}
<do task_work_add or schedule_delayed_work song and dance>
right in the end of mntput_no_expire(). OK, now I have something that
looks like a complete solution. The last missing bit is to take all
filp_close() of acct->file to kernel thread, and have them done via
__fput_sync() there. Then auto-close (done from cleanup_mnt()) will
consist of shutting down all affected acct and waiting for that kernel
thread to run through everything currently in its queue. That'll take
care of waiting until acct(NULL) done by somebody else gets through closing
the file and through corresponding mntput(). And *those* mntput() also
can be synchronous - they are clones of the one we hadn't finished shutting
down yet, so both dput() and deactivate_super() will bugger off immediately.
So we just mark those instead-of-mnt_pin() clones as MNT_INTERNAL. Voila.
After that ->mnt_pinned crap dies, acct auto-close ought to be race-free
and we get the actual fs shutdown guaranteed to be on shallow stack, without
extra context switches, etc. in the normal case.
Let's see if that survives testing...