Hello,
How to reproduce:
1. Prepare a container, enable userns and disable netns
2. Use libvirt-lxc to start a container
3. libvirt could not mount sysfs and then failed to start.
Then I found that
commit 7dc5dbc879bd0779924b5132a48b731a0bc04a1e says:
"Don't allow mounting sysfs unless the caller has CAP_SYS_ADMIN rights
over the net namespace."
But why should we check sysfs mount permission over the net namespace?
We've already checked CAP_SYS_ADMIN though.
What is the relationship between sysfs and the net namespace,
or is this check a little redundant?
Any insights on this?
Thanks,
- Chen
PS: the code below could be a workaround
@@ -34,7 +35,8 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
 		if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type))
 			return ERR_PTR(-EPERM);

-		if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
+		if (current->nsproxy->net_ns != &init_net &&
+		    !kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
 			return ERR_PTR(-EPERM);
 	}
Quoting [email protected] ([email protected]):
> Hello,
>
> How to reproduce:
> 1. Prepare a container, enable userns and disable netns
> 2. Use libvirt-lxc to start a container
> 3. libvirt could not mount sysfs and then failed to start.
>
> Then I found that
> commit 7dc5dbc879bd0779924b5132a48b731a0bc04a1e says:
> "Don't allow mounting sysfs unless the caller has CAP_SYS_ADMIN rights
> over the net namespace."
>
> But why should we check sysfs mount permission over the net namespace?
> We've already checked CAP_SYS_ADMIN though.
>
> What is the relationship between sysfs and the net namespace,
> or is this check a little redundant?
It is not redundant. The whole point is that after clone(CLONE_NEWUSER)
you get a newly filled set of capabilities. But you should not have
privileges over the host's network namespace. After you unshare a new
network namespace, you *should* have privilege over it. So the fact
that we've already checked CAP_SYS_ADMIN means nothing, because the
capabilities need to be targeted.
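To illustrate what "targeted" means here, below is a minimal userspace
sketch in C. It is only a toy model that loosely follows the ownership walk
the kernel does when checking a capability against a user namespace; the
types and names (struct user_ns, struct task_cred, struct net_ns,
model_ns_capable, model_may_mount_sysfs) are invented for this example and
are not kernel APIs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these structs are illustrative, not the kernel's. */
struct user_ns {
	struct user_ns *parent;   /* NULL for the initial user namespace */
	unsigned int owner_uid;   /* uid that created this namespace */
};

struct task_cred {
	struct user_ns *user_ns;  /* user namespace the task lives in */
	unsigned int euid;        /* effective uid as seen from the parent namespace */
	bool has_sys_admin;       /* CAP_SYS_ADMIN inside the task's own namespace */
};

struct net_ns {
	struct user_ns *user_ns;  /* user namespace that owns this net namespace */
};

/* Does the task hold CAP_SYS_ADMIN with respect to the target user namespace? */
bool model_ns_capable(const struct task_cred *cred, const struct user_ns *target)
{
	const struct user_ns *ns = target;

	for (;;) {
		/* Same namespace: the task's own capability set decides. */
		if (ns == cred->user_ns)
			return cred->has_sys_admin;
		/* Reached the initial namespace without a match: no privilege. */
		if (ns->parent == NULL)
			return false;
		/* The creator of a namespace has all capabilities in it. */
		if (ns->parent == cred->user_ns && ns->owner_uid == cred->euid)
			return true;
		/* Privilege in an ancestor implies privilege in the descendant. */
		ns = ns->parent;
	}
}

/* The sysfs check is targeted at the owner of the net namespace. */
bool model_may_mount_sysfs(const struct task_cred *cred, const struct net_ns *net)
{
	return model_ns_capable(cred, net->user_ns);
}
```

With a task that is root inside a child user namespace, the model denies
the check against a network namespace owned by the initial user namespace
and allows it against one owned by the task's own user namespace, which
matches the behaviour described above.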
> Any insights on this?
>
> Thanks,
> - Chen
>
> PS: the code below could be a workaround
>
> @@ -34,7 +35,8 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
>  		if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type))
>  			return ERR_PTR(-EPERM);
>
> -		if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
> +		if (current->nsproxy->net_ns != &init_net &&
> +		    !kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
>  			return ERR_PTR(-EPERM);
>  	}
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers
"Serge E. Hallyn" <[email protected]> writes:
> Quoting [email protected] ([email protected]):
>> Hello,
>>
>> How to reproduce:
>> 1. Prepare a container, enable userns and disable netns
>> 2. Use libvirt-lxc to start a container
>> 3. libvirt could not mount sysfs and then failed to start.
>>
>> Then I found that
>> commit 7dc5dbc879bd0779924b5132a48b731a0bc04a1e says:
>> "Don't allow mounting sysfs unless the caller has CAP_SYS_ADMIN rights
>> over the net namespace."
>>
>> But why should we check sysfs mount permission over the net namespace?
>> We've already checked CAP_SYS_ADMIN though.
We already checked capable(CAP_SYS_ADMIN) and it failed.
>> What is the relationship between sysfs and the net namespace,
>> or is this check a little redundant?
You want a bind mount, not a new fresh mount.
When looking at how evil actors could abuse things, it turned out that in
some circumstances the root user (before a user namespace is created)
needs to control the policy on which filesystems may be mounted. There
are files in sysfs and in proc that you never want to see in a chroot
jail, as they just create more surface area to attack.
The only reason for creating a new fresh mount of sysfs is to get access
to /sys/class/net. So to keep things simple we restrict creation of
that mount to cases where the mounter has permissions over the network
namespace, and cases where nothing interesting is mounted on top of
sysfs.
If a new /sys/class/net is not needed it is possible to bind mount the
existing copy of sysfs to the new location without loss of
functionality.
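For reference, a bind mount of an existing sysfs can be sketched from
userspace with mount(2) and MS_BIND. This is a minimal sketch, not libvirt's
code: the helper name bind_mount_sysfs and its return convention are
assumptions for illustration, and the caller still needs mount privileges
in its own mount namespace.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/*
 * Bind-mount the sysfs tree already mounted at `source` onto `target`.
 * Returns 0 on success or a negative errno value on failure.
 * MS_REC also carries along anything mounted below `source`.
 * (Hypothetical helper; the name and signature are not from any library.)
 */
int bind_mount_sysfs(const char *source, const char *target)
{
	if (mount(source, target, NULL, MS_BIND | MS_REC, NULL) < 0)
		return -errno;
	return 0;
}
```

A caller might use it as bind_mount_sysfs("/sys", "<container rootfs>/sys"),
where the target path is a placeholder for a directory that has to exist
beforehand.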
> It is not redundant. The whole point is that after clone(CLONE_NEWUSER)
> you get a newly filled set of capabilities. But you should not have
> privileges over the host's network namespace. After you unshare a new
> network namespace, you *should* have privilege over it. So the fact
> that we've already checked CAP_SYS_ADMIN means nothing, because the
> capabilities need to be targeted.
Exactly. The tests are failing because the caller is not the global root,
and so the code is properly failing the permission checks.
Eric
> -----Original Message-----
> From: Eric W. Biederman [mailto:[email protected]]
> Sent: Saturday, July 12, 2014 12:29 AM
> To: Serge E. Hallyn
> Cc: Chen, Hanxiao/陈 晗霄; Serge Hallyn ([email protected]); Greg
> Kroah-Hartman; [email protected];
> [email protected]
> Subject: Re: Could not mount sysfs when enable userns but disable netns
>
> "Serge E. Hallyn" <[email protected]> writes:
>
> > Quoting [email protected] ([email protected]):
> >> Hello,
> >>
> >> How to reproduce:
> >> 1. Prepare a container, enable userns and disable netns
> >> 2. Use libvirt-lxc to start a container
> >> 3. libvirt could not mount sysfs and then failed to start.
> >>
> >> Then I found that
> >> commit 7dc5dbc879bd0779924b5132a48b731a0bc04a1e says:
> >> "Don't allow mounting sysfs unless the caller has CAP_SYS_ADMIN rights
> >> over the net namespace."
> >>
> >> But why should we check sysfs mount permission over the net namespace?
> >> We've already checked CAP_SYS_ADMIN though.
>
> We already checked capable(CAP_SYS_ADMIN) and it failed.
But on my machine, capable(CAP_SYS_ADMIN) passed,
and the mount failed in kobj_ns_current_may_mount.
I added some printks in sysfs_mount:
 	if (!(flags & MS_KERNMOUNT)) {
-		if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type))
+		if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type)) {
+			printk(KERN_WARNING "Failed in capable\n");
 			return ERR_PTR(-EPERM);
+		}

-		if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
+		if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET)) {
+			printk(KERN_WARNING "Failed in kobj_ns_current_may_mount\n");
 			return ERR_PTR(-EPERM);
+		}
And found:
Jul 14 09:55:26 localhost systemd: Starting Container lxc-chx.
Jul 14 09:55:26 localhost systemd-machined: New machine lxc-chx.
Jul 14 09:55:26 localhost systemd: Started Container lxc-chx.
Jul 14 09:55:26 localhost kernel: [ 784.044709] Failed in kobj_ns_current_may_mount
Jul 14 09:55:26 localhost systemd-machined: Machine lxc-chx terminated.
>
> >> What is the relationship between sysfs and the net namespace,
> >> or is this check a little redundant?
>
> You want a bind mount, not a new fresh mount.
>
Yes, we need to modify libvirt's code to deal with sysfs
when userns is enabled but netns is disabled.
Thanks,
- Chen
> When looking at how evil actors could abuse things, it turned out that in
> some circumstances the root user (before a user namespace is created)
> needs to control the policy on which filesystems may be mounted. There
> are files in sysfs and in proc that you never want to see in a chroot
> jail, as they just create more surface area to attack.
>
> The only reason for creating a new fresh mount of sysfs is to get access
> to /sys/class/net. So to keep things simple we restrict creation of
> that mount to cases where the mounter has permissions over the network
> namespace, and cases where nothing interesting is mounted on top of
> sysfs.
>
> If a new /sys/class/net is not needed it is possible to bind mount the
> existing copy of sysfs to the new location without loss of
> functionality.
>
> > It is not redundant. The whole point is that after clone(CLONE_NEWUSER)
> > you get a newly filled set of capabilities. But you should not have
> > privileges over the host's network namespace. After you unshare a new
> > network namespace, you *should* have privilege over it. So the fact
> > that we've already checked CAP_SYS_ADMIN means nothing, because the
> > capabilities need to be targeted.
>
> Exactly. The tests are failing because the caller is not the global root,
> and so the code is properly failing the permission checks.
>
> Eric
"[email protected]" <[email protected]> writes:
>> -----Original Message-----
>> From: Eric W. Biederman [mailto:[email protected]]
>> Sent: Saturday, July 12, 2014 12:29 AM
>> To: Serge E. Hallyn
>> Cc: Chen, Hanxiao/陈 晗霄; Serge Hallyn ([email protected]); Greg
>> Kroah-Hartman; [email protected];
>> [email protected]
>> Subject: Re: Could not mount sysfs when enable userns but disable netns
>>
>> "Serge E. Hallyn" <[email protected]> writes:
>>
>> > Quoting [email protected] ([email protected]):
>> >> Hello,
>> >>
>> >> How to reproduce:
>> >> 1. Prepare a container, enable userns and disable netns
>> >> 2. Use libvirt-lxc to start a container
>> >> 3. libvirt could not mount sysfs and then failed to start.
>> >>
>> >> Then I found that
>> >> commit 7dc5dbc879bd0779924b5132a48b731a0bc04a1e says:
>> >> "Don't allow mounting sysfs unless the caller has CAP_SYS_ADMIN rights
>> >> over the net namespace."
>> >>
>> >> But why should we check sysfs mount permission over the net namespace?
>> >> We've already checked CAP_SYS_ADMIN though.
>>
>> We already checked capable(CAP_SYS_ADMIN) and it failed.
>
> But on my machine, capable(CAP_SYS_ADMIN) passed,
> and the mount failed in kobj_ns_current_may_mount.
No. capable(CAP_SYS_ADMIN) did not pass.
fs_fully_visible did pass.
There is a significant distinction. If capable(CAP_SYS_ADMIN) had
passed, kobj_ns_current_may_mount (which is a fancy way of saying
ns_capable(net->user_ns, CAP_SYS_ADMIN)) would also have passed.
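The distinction can be modelled in a few lines of userspace C. This is a
sketch of the decision order only, based on the sysfs_mount snippet quoted
in this thread; struct mount_attempt, enum mount_result, and
model_sysfs_mount are hypothetical names, and the three booleans stand in
for the results of capable(CAP_SYS_ADMIN), fs_fully_visible(), and
ns_capable(net->user_ns, CAP_SYS_ADMIN).

```c
#include <stdbool.h>

/* Inputs to the model; each field stands in for one kernel check. */
struct mount_attempt {
	bool global_cap_sys_admin; /* capable(CAP_SYS_ADMIN), i.e. against the initial userns */
	bool sysfs_fully_visible;  /* fs_fully_visible(): an unobscured sysfs is already mounted */
	bool cap_over_netns;       /* ns_capable(net->user_ns, CAP_SYS_ADMIN) */
};

enum mount_result {
	MOUNT_OK,
	MOUNT_EPERM_FIRST_CHECK,   /* neither global privilege nor full visibility */
	MOUNT_EPERM_NETNS_CHECK    /* no privilege over the owner of the net namespace */
};

/*
 * Mirrors the order of the two tests in the quoted sysfs_mount code.
 * The first test passes if EITHER condition holds, so getting past it
 * does not prove that capable(CAP_SYS_ADMIN) succeeded.
 */
enum mount_result model_sysfs_mount(const struct mount_attempt *a)
{
	if (!a->global_cap_sys_admin && !a->sysfs_fully_visible)
		return MOUNT_EPERM_FIRST_CHECK;
	if (!a->cap_over_netns)
		return MOUNT_EPERM_NETNS_CHECK;
	return MOUNT_OK;
}
```

For the reported case (no global CAP_SYS_ADMIN, sysfs fully visible, no
privilege over the net namespace) the model returns
MOUNT_EPERM_NETNS_CHECK, which is consistent with the printk that fired.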
> I added some printks in sysfs_mount:
>  	if (!(flags & MS_KERNMOUNT)) {
> -		if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type))
> +		if (!capable(CAP_SYS_ADMIN) && !fs_fully_visible(fs_type)) {
> +			printk(KERN_WARNING "Failed in capable\n");
>  			return ERR_PTR(-EPERM);
> +		}
>
> -		if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
> +		if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET)) {
> +			printk(KERN_WARNING "Failed in kobj_ns_current_may_mount\n");
>  			return ERR_PTR(-EPERM);
> +		}
>
> And found:
> Jul 14 09:55:26 localhost systemd: Starting Container lxc-chx.
> Jul 14 09:55:26 localhost systemd-machined: New machine lxc-chx.
> Jul 14 09:55:26 localhost systemd: Started Container lxc-chx.
> Jul 14 09:55:26 localhost kernel: [ 784.044709] Failed in kobj_ns_current_may_mount
> Jul 14 09:55:26 localhost systemd-machined: Machine lxc-chx terminated.
>
>>
>> >> What is the relationship between sysfs and the net namespace,
>> >> or is this check a little redundant?
>>
>> You want a bind mount, not a new fresh mount.
>>
>
> Yes, we need to modify libvirt's code to deal with sysfs
> when userns is enabled but netns is disabled.
Please go for it. I don't have any insight into libvirt, so I can't help
you there.
Eric