Some full-OS container software bind mounts debugfs into containers to
satisfy the assumptions of older userspaces which expect to be able to
mount debugfs. This regressed in 4.1 due to the addition of tracefs,
which gets automounted in the tracing subdirectory of debugfs. In a
cloned mount namespace the bind mount now fails because the tracefs
mount is a locked child of the debugfs mount.
For new mounts we already make an exception to the "locked child mount"
rule. Directories in psuedo filesystems created for the sole purpose of
being mountpoints are created as permanently empty directories which can
never contain any entries, therefore the kernel can know than any mounts
on these directories are not for security purposes. These mounts are
then excluded from locked mount tests in some circumstances.
The same logic clearly applies to directories created in
debugfs_create_automount(). The following patches update this function
to create permanently empty directories for mountpoints and adds an
exclusion to the tests for bind mounts to exclude child mounts on
permanently empty directories.
Thanks,
Seth
Seth Forshee (2):
fs: Allow bind mounts with locked children on permaenetly empty
directories
debugfs: Make automount point inodes permanently empty
fs/debugfs/inode.c | 2 +-
fs/namespace.c | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)
Starting with 4.1 the tracing subsystem has its own filesystem
which is automounted in the tracing subdirectory of debugfs.
Prior to this debugfs could be bind mounted in a cloned mount
namespace, but if tracefs has been mounted under debugfs this
now fails because there is a locked child mount. This creates
a regression for container software which bind mounts debugfs
to satisfy the assumption of some userspace software.
In other pseudo filesystems such as proc and sysfs we're already
creating mountpoints like this in such a way that no dirents can
be created in the directories, allowing them to be exceptions to
some MNT_LOCKED tests. In fact we're already do this for the
tracefs mountpoint in sysfs.
Do the same in debugfs_create_automount(), since the intention
here is clearly to create a mountpoint. This fixes the regression,
as locked child mounts on permanently empty directories do not
cause a bind mount to fail.
Cc: [email protected] # v4.1+
Signed-off-by: Seth Forshee <[email protected]>
---
fs/debugfs/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index bece948b363d..8580831ed237 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -457,7 +457,7 @@ struct dentry *debugfs_create_automount(const char *name,
if (unlikely(!inode))
return failed_creating(dentry);
- inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
+ make_empty_dir_inode(inode);
inode->i_flags |= S_AUTOMOUNT;
inode->i_private = data;
dentry->d_fsdata = (void *)f;
--
1.9.1
Forbidding a bind mount due to a locked child on a permanently
empty directory provides no security benefit since the
directory cannot contain any contents which have been overmounted
for security reasons.
Cc: [email protected] # v4.1+
Signed-off-by: Seth Forshee <[email protected]>
---
fs/namespace.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/namespace.c b/fs/namespace.c
index 4fb1691b4355..930f5557b1d1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2065,6 +2065,8 @@ static bool has_locked_children(struct mount *mnt, struct dentry *dentry)
list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
if (!is_subdir(child->mnt_mountpoint, dentry))
continue;
+ if (is_empty_dir_inode(child->mnt_mountpoint->d_inode))
+ continue;
if (child->mnt.mnt_flags & MNT_LOCKED)
return true;
--
1.9.1
Quoting Seth Forshee ([email protected]):
> Starting with 4.1 the tracing subsystem has its own filesystem
> which is automounted in the tracing subdirectory of debugfs.
> Prior to this debugfs could be bind mounted in a cloned mount
> namespace, but if tracefs has been mounted under debugfs this
> now fails because there is a locked child mount. This creates
> a regression for container software which bind mounts debugfs
> to satisfy the assumption of some userspace software.
>
> In other pseudo filesystems such as proc and sysfs we're already
> creating mountpoints like this in such a way that no dirents can
> be created in the directories, allowing them to be exceptions to
> some MNT_LOCKED tests. In fact we're already do this for the
> tracefs mountpoint in sysfs.
>
> Do the same in debugfs_create_automount(), since the intention
> here is clearly to create a mountpoint. This fixes the regression,
> as locked child mounts on permanently empty directories do not
> cause a bind mount to fail.
>
> Cc: [email protected] # v4.1+
> Signed-off-by: Seth Forshee <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
> ---
> fs/debugfs/inode.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
> index bece948b363d..8580831ed237 100644
> --- a/fs/debugfs/inode.c
> +++ b/fs/debugfs/inode.c
> @@ -457,7 +457,7 @@ struct dentry *debugfs_create_automount(const char *name,
> if (unlikely(!inode))
> return failed_creating(dentry);
>
> - inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
> + make_empty_dir_inode(inode);
> inode->i_flags |= S_AUTOMOUNT;
> inode->i_private = data;
> dentry->d_fsdata = (void *)f;
> --
> 1.9.1
Quoting Seth Forshee ([email protected]):
> Forbidding a bind mount due to a locked child on a permanently
> empty directory provides no security benefit since the
> directory cannot contain any contents which have been overmounted
> for security reasons.
>
> Cc: [email protected] # v4.1+
> Signed-off-by: Seth Forshee <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
> ---
> fs/namespace.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 4fb1691b4355..930f5557b1d1 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2065,6 +2065,8 @@ static bool has_locked_children(struct mount *mnt, struct dentry *dentry)
> list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
> if (!is_subdir(child->mnt_mountpoint, dentry))
> continue;
> + if (is_empty_dir_inode(child->mnt_mountpoint->d_inode))
> + continue;
>
> if (child->mnt.mnt_flags & MNT_LOCKED)
> return true;
> --
> 1.9.1
Seth Forshee <[email protected]> writes:
> Some full-OS container software bind mounts debugfs into containers to
> satisfy the assumptions of older userspaces which expect to be able to
> mount debugfs. This regressed in 4.1 due to the addition of tracefs,
> which gets automounted in the tracing subdirectory of debugfs. In a
> cloned mount namespace the bind mount now fails because the tracefs
> mount is a locked child of the debugfs mount.
>
> For new mounts we already make an exception to the "locked child mount"
> rule. Directories in psuedo filesystems created for the sole purpose of
> being mountpoints are created as permanently empty directories which can
> never contain any entries, therefore the kernel can know than any mounts
> on these directories are not for security purposes. These mounts are
> then excluded from locked mount tests in some circumstances.
>
> The same logic clearly applies to directories created in
> debugfs_create_automount(). The following patches update this function
> to create permanently empty directories for mountpoints and adds an
> exclusion to the tests for bind mounts to exclude child mounts on
> permanently empty directories.
So I don't know that this approach is bad. However in reading through
your patch descriptions I do not see any consideration of using
"mount --rbind" instead of "mount --bind". AKA adding the MS_REC flag
to your bind mount.
I would think simply using MS_REC would solve this problem, without
needing any additional kernel support. Am I missing something?
Eric
Quoting Eric W. Biederman ([email protected]):
> Seth Forshee <[email protected]> writes:
>
> > Some full-OS container software bind mounts debugfs into containers to
> > satisfy the assumptions of older userspaces which expect to be able to
> > mount debugfs. This regressed in 4.1 due to the addition of tracefs,
> > which gets automounted in the tracing subdirectory of debugfs. In a
> > cloned mount namespace the bind mount now fails because the tracefs
> > mount is a locked child of the debugfs mount.
> >
> > For new mounts we already make an exception to the "locked child mount"
> > rule. Directories in psuedo filesystems created for the sole purpose of
> > being mountpoints are created as permanently empty directories which can
> > never contain any entries, therefore the kernel can know than any mounts
> > on these directories are not for security purposes. These mounts are
> > then excluded from locked mount tests in some circumstances.
> >
> > The same logic clearly applies to directories created in
> > debugfs_create_automount(). The following patches update this function
> > to create permanently empty directories for mountpoints and adds an
> > exclusion to the tests for bind mounts to exclude child mounts on
> > permanently empty directories.
>
> So I don't know that this approach is bad. However in reading through
> your patch descriptions I do not see any consideration of using
> "mount --rbind" instead of "mount --bind". AKA adding the MS_REC flag
> to your bind mount.
>
> I would think simply using MS_REC would solve this problem, without
> needing any additional kernel support. Am I missing something?
That's what we're doing to work around it fwiw, but it would be nice to
not have to.