When a container shuts down, it likes to do 'mount -o remount,ro /'.
That sets the superblock's readonly flag, not the mount's. So unless
the mount action fails for some reason (i.e. a file is held open on
the fs), if the container's rootfs is just a directory on the host's
fs, the host fs will be marked readonly.
Thanks to Dave Hansen for pointing out how simple the fix can be. If
the devices cgroup denies the mounting task write access to the
underlying superblock (as it usually does when the container's root fs
is on a block device shared with the host), then it do_remount_sb should
deny the right to change mount flags as well.
This patch adds that check.
Note that another possibility would be to have the LSM step in. We
can't catch this (as is) at the LSM level because security_remount_sb
doesn't get the mount flags, so we can't distinguish
mount -o remount,ro
from
mount --bind -o remount,ro.
Sending the flags to that hook would probably be a good idea in addition
to this patch, but I haven't done it here.
Signed-off-by: Serge Hallyn <[email protected]>
---
fs/super.c | 5 +++++
include/linux/device_cgroup.h | 3 +++
security/device_cgroup.c | 32 ++++++++++++++++++++++++++++++++
3 files changed, 40 insertions(+), 0 deletions(-)
diff --git a/fs/super.c b/fs/super.c
index afd0f1a..e29cdd1 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -32,6 +32,7 @@
#include <linux/backing-dev.h>
#include <linux/rculist_bl.h>
#include <linux/cleancache.h>
+#include <linux/device_cgroup.h>
#include "internal.h"
@@ -709,6 +710,10 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
return -EACCES;
#endif
+ retval = devcgroup_remount(sb);
+ if (retval)
+ return retval;
+
if (flags & MS_RDONLY)
acct_auto_close(sb);
shrink_dcache_sb(sb);
diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
index 8b64221..8c77b40 100644
--- a/include/linux/device_cgroup.h
+++ b/include/linux/device_cgroup.h
@@ -11,9 +11,12 @@ static inline int devcgroup_inode_permission(struct inode *inode, int mask)
return 0;
return __devcgroup_inode_permission(inode, mask);
}
+extern int devcgroup_remount(struct super_block *sb);
#else
static inline int devcgroup_inode_permission(struct inode *inode, int mask)
{ return 0; }
static inline int devcgroup_inode_mknod(int mode, dev_t dev)
{ return 0; }
+static inline int devcgroup_remount(struct super_block *sb)
+{ return 0; }
#endif
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 662ae5f..8b76d73 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -538,3 +538,35 @@ found:
return -EPERM;
}
+
+int devcgroup_remount(struct super_block *sb)
+{
+ struct dev_cgroup *dev_cgroup;
+ struct dev_whitelist_item *wh;
+ u32 major = MAJOR(sb->s_dev), minor = MINOR(sb->s_dev);
+
+ rcu_read_lock();
+
+ dev_cgroup = task_devcgroup(current);
+
+ list_for_each_entry_rcu(wh, &dev_cgroup->whitelist, list) {
+ if (wh->type & DEV_ALL)
+ goto found;
+ if (!(wh->type & DEV_BLOCK))
+ continue;
+ if (wh->major != ~0 && wh->major != major)
+ continue;
+ if (wh->minor != ~0 && wh->minor != minor)
+ continue;
+ if (!(wh->access & ACC_WRITE))
+ continue;
+found:
+ rcu_read_unlock();
+ return 0;
+ }
+
+ rcu_read_unlock();
+
+ return -EPERM;
+}
+EXPORT_SYMBOL(devcgroup_remount);
--
1.7.9
On Fri, Feb 10, 2012 at 09:19:39PM -0600, Serge Hallyn wrote:
> When a container shuts down, it likes to do 'mount -o remount,ro /'.
> That sets the superblock's readonly flag, not the mount's. So unless
> the mount action fails for some reason (i.e. a file is held open on
> the fs), if the container's rootfs is just a directory on the host's
> fs, the host fs will be marked readonly.
>
> Thanks to Dave Hansen for pointing out how simple the fix can be. If
> the devices cgroup denies the mounting task write access to the
> underlying superblock (as it usually does when the container's root fs
> is on a block device shared with the host), then it do_remount_sb should
> deny the right to change mount flags as well.
>
> This patch adds that check.
>
> Note that another possibility would be to have the LSM step in. We
> can't catch this (as is) at the LSM level because security_remount_sb
> doesn't get the mount flags, so we can't distinguish
> mount -o remount,ro
> from
> mount --bind -o remount,ro.
> Sending the flags to that hook would probably be a good idea in addition
> to this patch, but I haven't done it here.
NAK. This is just plain wrong - what about the filesystems that are not
bdev-backed or, as e.g. btrfs, sit on more than one device?
<comments about inadequacy of cgroup as an API censored - far too unprintable>
Quoting Al Viro ([email protected]):
> On Fri, Feb 10, 2012 at 09:19:39PM -0600, Serge Hallyn wrote:
> > When a container shuts down, it likes to do 'mount -o remount,ro /'.
> > That sets the superblock's readonly flag, not the mount's. So unless
> > the mount action fails for some reason (i.e. a file is held open on
> > the fs), if the container's rootfs is just a directory on the host's
> > fs, the host fs will be marked readonly.
> >
> > Thanks to Dave Hansen for pointing out how simple the fix can be. If
> > the devices cgroup denies the mounting task write access to the
> > underlying superblock (as it usually does when the container's root fs
> > is on a block device shared with the host), then it do_remount_sb should
> > deny the right to change mount flags as well.
> >
> > This patch adds that check.
> >
> > Note that another possibility would be to have the LSM step in. We
> > can't catch this (as is) at the LSM level because security_remount_sb
> > doesn't get the mount flags, so we can't distinguish
> > mount -o remount,ro
> > from
> > mount --bind -o remount,ro.
> > Sending the flags to that hook would probably be a good idea in addition
> > to this patch, but I haven't done it here.
>
> NAK. This is just plain wrong - what about the filesystems that are not
> bdev-backed or, as e.g. btrfs, sit on more than one device?
>
> <comments about inadequacy of cgroup as an API censored - far too unprintable>
Drat.
Would passing the mount flags from do_remount() to security_sb_remount()
be acceptable? Then at least the LSM could distinguish and act
accordingly.
Thanks for looking.
-serge
Quoting Al Viro ([email protected]):
> On Fri, Feb 10, 2012 at 09:19:39PM -0600, Serge Hallyn wrote:
> > When a container shuts down, it likes to do 'mount -o remount,ro /'.
> > That sets the superblock's readonly flag, not the mount's. So unless
> > the mount action fails for some reason (i.e. a file is held open on
> > the fs), if the container's rootfs is just a directory on the host's
> > fs, the host fs will be marked readonly.
> >
> > Thanks to Dave Hansen for pointing out how simple the fix can be. If
> > the devices cgroup denies the mounting task write access to the
> > underlying superblock (as it usually does when the container's root fs
> > is on a block device shared with the host), then it do_remount_sb should
> > deny the right to change mount flags as well.
> >
> > This patch adds that check.
> >
> > Note that another possibility would be to have the LSM step in. We
> > can't catch this (as is) at the LSM level because security_remount_sb
> > doesn't get the mount flags, so we can't distinguish
> > mount -o remount,ro
> > from
> > mount --bind -o remount,ro.
> > Sending the flags to that hook would probably be a good idea in addition
> > to this patch, but I haven't done it here.
>
> NAK. This is just plain wrong - what about the filesystems that are not
BTW, sorry - the patch clearly should've taken non-bdevs into account, but
I accept that wouldn't have been enough to evade a NAK.
> bdev-backed or, as e.g. btrfs, sit on more than one device?
btrfs is actually one of my main motivators - to quickly snapshot containers
with btrfs means that the containers all share one fs, but that means one
container can mark them all ro.
> <comments about inadequacy of cgroup as an API censored - far too unprintable>
-serge
Serge Hallyn <[email protected]> writes:
> Quoting Al Viro ([email protected]):
>> On Fri, Feb 10, 2012 at 09:19:39PM -0600, Serge Hallyn wrote:
>> > When a container shuts down, it likes to do 'mount -o remount,ro /'.
>> > That sets the superblock's readonly flag, not the mount's. So unless
>> > the mount action fails for some reason (i.e. a file is held open on
>> > the fs), if the container's rootfs is just a directory on the host's
>> > fs, the host fs will be marked readonly.
>> >
>> > Thanks to Dave Hansen for pointing out how simple the fix can be. If
>> > the devices cgroup denies the mounting task write access to the
>> > underlying superblock (as it usually does when the container's root fs
>> > is on a block device shared with the host), then it do_remount_sb should
>> > deny the right to change mount flags as well.
>> >
>> > This patch adds that check.
>> >
>> > Note that another possibility would be to have the LSM step in. We
>> > can't catch this (as is) at the LSM level because security_remount_sb
>> > doesn't get the mount flags, so we can't distinguish
>> > mount -o remount,ro
>> > from
>> > mount --bind -o remount,ro.
>> > Sending the flags to that hook would probably be a good idea in addition
>> > to this patch, but I haven't done it here.
>>
>> NAK. This is just plain wrong - what about the filesystems that are not
>
> BTW, sorry - the patch clearly should've taken non-bdevs into account, but
> I accept that wouldn't have been enough to evade a NAK.
>
>> bdev-backed or, as e.g. btrfs, sit on more than one device?
>
> btrfs is actually one of my main motivators - to quickly snapshot containers
> with btrfs means that the containers all share one fs, but that means one
> container can mark them all ro.
Serge let me respectfully suggest that getting the user namespace done
will deal with this issue nicely.
In the simple case you simply won't be root so remount will just be
denied.
When/if we allow a limited form of unprivileged mounts in a user
namespace your user won't have mounted the filesystem so you should not
have the privilege to call remount on the filesystem.
I think I will have a set of patches ready for serious scrutiny in
the next week or so. So we aren't talking impossible pie in the sky
distance to see this happen.
Eric
Quoting Eric W. Biederman ([email protected]):
> Serge Hallyn <[email protected]> writes:
>
> > Quoting Al Viro ([email protected]):
> >> On Fri, Feb 10, 2012 at 09:19:39PM -0600, Serge Hallyn wrote:
> >> > When a container shuts down, it likes to do 'mount -o remount,ro /'.
> >> > That sets the superblock's readonly flag, not the mount's. So unless
> >> > the mount action fails for some reason (i.e. a file is held open on
> >> > the fs), if the container's rootfs is just a directory on the host's
> >> > fs, the host fs will be marked readonly.
> >> >
> >> > Thanks to Dave Hansen for pointing out how simple the fix can be. If
> >> > the devices cgroup denies the mounting task write access to the
> >> > underlying superblock (as it usually does when the container's root fs
> >> > is on a block device shared with the host), then it do_remount_sb should
> >> > deny the right to change mount flags as well.
> >> >
> >> > This patch adds that check.
> >> >
> >> > Note that another possibility would be to have the LSM step in. We
> >> > can't catch this (as is) at the LSM level because security_remount_sb
> >> > doesn't get the mount flags, so we can't distinguish
> >> > mount -o remount,ro
> >> > from
> >> > mount --bind -o remount,ro.
> >> > Sending the flags to that hook would probably be a good idea in addition
> >> > to this patch, but I haven't done it here.
> >>
> >> NAK. This is just plain wrong - what about the filesystems that are not
> >
> > BTW, sorry - the patch clearly should've taken non-bdevs into account, but
> > I accept that wouldn't have been enough to evade a NAK.
> >
> >> bdev-backed or, as e.g. btrfs, sit on more than one device?
> >
> > btrfs is actually one of my main motivators - to quickly snapshot containers
> > with btrfs means that the containers all share one fs, but that means one
> > container can mark them all ro.
>
> Serge let me respectfully suggest that getting the user namespace done
> will deal with this issue nicely.
>
> In the simple case you simply won't be root so remount will just be
> denied.
>
> When/if we allow a limited form of unprivileged mounts in a user
> namespace your user won't have mounted the filesystem so you should not
> have the privilege to call remount on the filesystem.
Hm, that's a good point. Though note it'll require the userns code to
distinguish between the a bind remount and superblock remount. The
last time we seriously discussed this, that wasn't even on the roadmap.
It was only going to support fully assigning the whole filesystem to
a user namespace. In that case, the remount issue doesn't apply anyway
as the fs isn't shared with another container.
In any case, there are other workarounds, so I wasn't in a hurry to
address this - it just should be addressed eventually. I just figured
that to bring up the issue I needed a patch :)
> I think I will have a set of patches ready for serious scrutiny in
> the next week or so. So we aren't talking impossible pie in the sky
> distance to see this happen.
Awesome.
thanks,
-serge
"Serge E. Hallyn" <[email protected]> writes:
>> Serge let me respectfully suggest that getting the user namespace done
>> will deal with this issue nicely.
>>
>> In the simple case you simply won't be root so remount will just be
>> denied.
>>
>> When/if we allow a limited form of unprivileged mounts in a user
>> namespace your user won't have mounted the filesystem so you should not
>> have the privilege to call remount on the filesystem.
>
> Hm, that's a good point. Though note it'll require the userns code to
> distinguish between the a bind remount and superblock remount. The
> last time we seriously discussed this, that wasn't even on the roadmap.
> It was only going to support fully assigning the whole filesystem to
> a user namespace. In that case, the remount issue doesn't apply anyway
> as the fs isn't shared with another container.
Come to think of it unmounting and remounting is a bit tricky, and
it is a bit parallel to having a disk base filesystem being in one
user namespace. Currently my patches have the rule that everything
maps to the initial user namespace, so using a filesystem from multiple
user namespaces is not a problem.
Unmounting is pretty safe if the rule is that you control the entire
mount namespace.
Remounting though that does become tricky in the unprivileged situation.
I honestly haven't thought through what that permission check should
look like yet.
Eric