Hello,
While using user namespaces I found a bug in the capability check done by the ext4 ioctl handler.
If someone runs chattr +i from inside a different user namespace, the call fails with the following:
ioctl(3, EXT2_IOC_SETFLAGS, 0x7fffa4fedacc) = -1 EPERM (Operation not permitted)
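For reference, the failing call can be reproduced from userspace with something like the sketch below. This is not the chattr source: the helper names with_immutable() and set_immutable() are made up for illustration, and it assumes a filesystem that implements the FS_IOC_GETFLAGS/FS_IOC_SETFLAGS ioctls (EXT2_IOC_SETFLAGS in the strace output is the same request).

```c
#include <assert.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_IMMUTABLE_FL */

/* Hypothetical helper: return flags with the immutable bit set or cleared. */
int with_immutable(int flags, int enable)
{
	return enable ? (flags | FS_IMMUTABLE_FL) : (flags & ~FS_IMMUTABLE_FL);
}

/*
 * Roughly what "chattr +i" / "chattr -i" does on an open file descriptor.
 * Returns 0 on success or a negative errno; -EPERM is the failure
 * reported above when the capability check does not pass.
 */
int set_immutable(int fd, int enable)
{
	int flags = 0;

	assert(fd >= 0);
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
		return -errno;
	flags = with_immutable(flags, enable);
	if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
		return -errno;
	return 0;
}
```

Calling set_immutable(fd, 1) on a file opened inside the container should hit the same EPERM as long as the kernel checks capable(CAP_LINUX_IMMUTABLE).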
I'm proposing a fix to this, by replacing the capable(CAP_LINUX_IMMUTABLE) check with
ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE).
If you agree I can send patches for all filesystems.
I'm proposing the following patch:
---
fs/ext4/ioctl.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index d011b69..25683d0 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -265,7 +265,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		 * This test looks nicer. Thanks to Pauline Middelink
 		 */
 		if ((flags ^ oldflags) & (EXT4_APPEND_FL | EXT4_IMMUTABLE_FL)) {
-			if (!capable(CAP_LINUX_IMMUTABLE))
+			if (!ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE))
 				goto flags_out;
 		}
--
1.8.4
--
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: [email protected]
ICQ: 7556201
Mobile: +359 886 660 270
On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:
>
> I'm proposing a fix to this, by replacing the capable(CAP_LINUX_IMMUTABLE)
> check with ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE).
Um, wouldn't it be better to simply fix the capable() function?
/**
 * capable - Determine if the current task has a superior capability in effect
 * @cap: The capability to be tested for
 *
 * Return true if the current task has the given superior capability currently
 * available for use, false if not.
 *
 * This sets PF_SUPERPRIV on the task if the capability is available on the
 * assumption that it's about to be used.
 */
bool capable(int cap)
{
	return ns_capable(&init_user_ns, cap);
}
EXPORT_SYMBOL(capable);
The documentation states that it is for "the current task", and I
can't imagine any use case where user namespaces are in effect and
using init_user_ns would ever make sense.
No? Otherwise, pretty much every single use of capable() would be
broken, not just this one instance in ext4/ioctl.c.
- Ted
Quoting Theodore Ts'o ([email protected]):
> On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:
> >
> > I'm proposing a fix to this, by replacing the capable(CAP_LINUX_IMMUTABLE)
> > check with ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE).
>
> Um, wouldn't it be better to simply fix the capable() function?
>
> /**
> * capable - Determine if the current task has a superior capability in effect
> * @cap: The capability to be tested for
> *
> * Return true if the current task has the given superior capability currently
> * available for use, false if not.
> *
> * This sets PF_SUPERPRIV on the task if the capability is available on the
> * assumption that it's about to be used.
> */
> bool capable(int cap)
> {
> return ns_capable(&init_user_ns, cap);
> }
> EXPORT_SYMBOL(capable);
>
> The documentation states that it is for "the current task", and I
> can't imagine any use case, where user namespaces are in effect, where
> using init_user_ns would ever make sense.
The init_user_ns represents the user_ns owning the object, not the
subject.
The patch by Marian is wrong. Anyone can do 'clone(CLONE_NEWUSER)',
setuid(0), execve, and end up satisfying 'ns_capable(current_cred()->user_ns,
CAP_LINUX_IMMUTABLE)' by definition.
So NACK to that particular patch. I'm not sure, but IIUC it should be
safe to check against the userns owning the inode?
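To make the subject/object distinction concrete, here is a small userspace model of the check. It is a sketch, not kernel code: the struct layouts and the model_* names are invented for illustration, and the loop is a simplified version of the namespace walk in cap_capable(). It shows why a task that is root in a namespace it just created always passes a check against its own namespace, while capable() (which targets the initial namespace) still refuses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: simplified stand-ins, not the kernel structs. */
struct user_ns {
	struct user_ns *parent;	/* NULL for the initial namespace */
	unsigned int owner;	/* uid (in the parent) that created this ns */
};

struct cred {
	struct user_ns *user_ns;	/* namespace the task lives in */
	unsigned int euid;
	unsigned long cap_effective;	/* bitmask, one bit per capability */
};

#define CAP_LINUX_IMMUTABLE 9

/* Does the subject (cred) hold cap relative to the target namespace? */
bool model_ns_capable(const struct cred *cred, const struct user_ns *target,
		      int cap)
{
	const struct user_ns *ns = target;

	assert(cred && target);
	for (;;) {
		/* Same namespace: the task's own capability set decides. */
		if (ns == cred->user_ns)
			return (cred->cap_effective >> cap) & 1UL;
		/* Reached the initial namespace without a match: denied. */
		if (ns->parent == NULL)
			return false;
		/* The creator of a namespace has all caps over it. */
		if (ns->parent == cred->user_ns && ns->owner == cred->euid)
			return true;
		ns = ns->parent;
	}
}

/* capable(cap) is the same check with the initial namespace as target. */
bool model_capable(const struct cred *cred, const struct user_ns *init_ns,
		   int cap)
{
	return model_ns_capable(cred, init_ns, cap);
}
```

In this model a task with a full capability set inside a child namespace passes model_ns_capable() against that child but fails model_capable(), which is the behaviour described above.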
> No? Otherwise, pretty much every single use of capable() would be
> broken, not just this once instances in ext4/ioctl.c.
>
> - Ted
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers
On 04/29/2014 09:52 PM, Serge Hallyn wrote:
> Quoting Theodore Ts'o ([email protected]):
>> On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:
>>>
>>> I'm proposing a fix to this, by replacing the capable(CAP_LINUX_IMMUTABLE)
>>> check with ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE).
>>
>> Um, wouldn't it be better to simply fix the capable() function?
>>
>> /**
>> * capable - Determine if the current task has a superior capability in effect
>> * @cap: The capability to be tested for
>> *
>> * Return true if the current task has the given superior capability currently
>> * available for use, false if not.
>> *
>> * This sets PF_SUPERPRIV on the task if the capability is available on the
>> * assumption that it's about to be used.
>> */
>> bool capable(int cap)
>> {
>> return ns_capable(&init_user_ns, cap);
>> }
>> EXPORT_SYMBOL(capable);
>>
>> The documentation states that it is for "the current task", and I
>> can't imagine any use case, where user namespaces are in effect, where
>> using init_user_ns would ever make sense.
>
> the init_user_ns represents the user_ns owning the object, not the
> subject.
>
> The patch by Marian is wrong. Anyone can do 'clone(CLONE_NEWUSER)',
> setuid(0), execve, and end up satisfying 'ns_capable(current_cred()->userns,
> CAP_SYS_IMMUTABLE)' by definition.
>
> So NACK to that particular patch. I'm not sure, but IIUC it should be
> safe to check against the userns owning the inode?
>
So what you are proposing is to replace 'ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE)' with
'inode_capable(inode, CAP_LINUX_IMMUTABLE)'?
I agree that this is more sane.
Marian
>> No? Otherwise, pretty much every single use of capable() would be
>> broken, not just this once instances in ext4/ioctl.c.
>>
>> - Ted
>
Quoting Marian Marinov ([email protected]):
> On 04/29/2014 09:52 PM, Serge Hallyn wrote:
> >Quoting Theodore Ts'o ([email protected]):
> >>On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:
> >>>
> >>>I'm proposing a fix to this, by replacing the capable(CAP_LINUX_IMMUTABLE)
> >>>check with ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE).
> >>
> >>Um, wouldn't it be better to simply fix the capable() function?
> >>
> >>/**
> >> * capable - Determine if the current task has a superior capability in effect
> >> * @cap: The capability to be tested for
> >> *
> >> * Return true if the current task has the given superior capability currently
> >> * available for use, false if not.
> >> *
> >> * This sets PF_SUPERPRIV on the task if the capability is available on the
> >> * assumption that it's about to be used.
> >> */
> >>bool capable(int cap)
> >>{
> >> return ns_capable(&init_user_ns, cap);
> >>}
> >>EXPORT_SYMBOL(capable);
> >>
> >>The documentation states that it is for "the current task", and I
> >>can't imagine any use case, where user namespaces are in effect, where
> >>using init_user_ns would ever make sense.
> >
> >the init_user_ns represents the user_ns owning the object, not the
> >subject.
> >
> >The patch by Marian is wrong. Anyone can do 'clone(CLONE_NEWUSER)',
> >setuid(0), execve, and end up satisfying 'ns_capable(current_cred()->userns,
> >CAP_SYS_IMMUTABLE)' by definition.
> >
> >So NACK to that particular patch. I'm not sure, but IIUC it should be
> >safe to check against the userns owning the inode?
> >
>
> So what you are proposing is to replace 'ns_capable(current_cred()->userns, CAP_SYS_IMMUTABLE)' with
> 'inode_capable(inode, CAP_SYS_IMMUTABLE)' ?
>
> I agree that this is more sane.
Right, and I think the two operations you're looking at seem sane
to allow.
thanks,
-serge
On 04/30/2014 01:02 AM, Serge Hallyn wrote:
> Quoting Marian Marinov ([email protected]):
>> On 04/29/2014 09:52 PM, Serge Hallyn wrote:
>>> Quoting Theodore Ts'o ([email protected]):
>>>> On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:
>>>>>
>>>>> I'm proposing a fix to this, by replacing the capable(CAP_LINUX_IMMUTABLE)
>>>>> check with ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE).
>>>>
>>>> Um, wouldn't it be better to simply fix the capable() function?
>>>>
>>>> /**
>>>> * capable - Determine if the current task has a superior capability in effect
>>>> * @cap: The capability to be tested for
>>>> *
>>>> * Return true if the current task has the given superior capability currently
>>>> * available for use, false if not.
>>>> *
>>>> * This sets PF_SUPERPRIV on the task if the capability is available on the
>>>> * assumption that it's about to be used.
>>>> */
>>>> bool capable(int cap)
>>>> {
>>>> return ns_capable(&init_user_ns, cap);
>>>> }
>>>> EXPORT_SYMBOL(capable);
>>>>
>>>> The documentation states that it is for "the current task", and I
>>>> can't imagine any use case, where user namespaces are in effect, where
>>>> using init_user_ns would ever make sense.
>>>
>>> the init_user_ns represents the user_ns owning the object, not the
>>> subject.
>>>
>>> The patch by Marian is wrong. Anyone can do 'clone(CLONE_NEWUSER)',
>>> setuid(0), execve, and end up satisfying 'ns_capable(current_cred()->userns,
>>> CAP_SYS_IMMUTABLE)' by definition.
>>>
>>> So NACK to that particular patch. I'm not sure, but IIUC it should be
>>> safe to check against the userns owning the inode?
>>>
>>
>> So what you are proposing is to replace 'ns_capable(current_cred()->userns, CAP_SYS_IMMUTABLE)' with
>> 'inode_capable(inode, CAP_SYS_IMMUTABLE)' ?
>>
>> I agree that this is more sane.
>
> Right, and I think the two operations you're looking at seem sane
> to allow.
If you are ok with this patch, I will fix all file systems and send patches.
Signed-off-by: Marian Marinov <[email protected]>
---
fs/ext4/ioctl.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index d011b69..9418634 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -265,7 +265,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		 * This test looks nicer. Thanks to Pauline Middelink
 		 */
 		if ((flags ^ oldflags) & (EXT4_APPEND_FL | EXT4_IMMUTABLE_FL)) {
-			if (!capable(CAP_LINUX_IMMUTABLE))
+			if (!inode_capable(inode, CAP_LINUX_IMMUTABLE))
 				goto flags_out;
 		}
---
1.8.4
Marian
>
> thanks,
> -serge
>
Quoting Marian Marinov ([email protected]):
> On 04/30/2014 01:02 AM, Serge Hallyn wrote:
> >Quoting Marian Marinov ([email protected]):
> >>On 04/29/2014 09:52 PM, Serge Hallyn wrote:
> >>>Quoting Theodore Ts'o ([email protected]):
> >>>>On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:
> >>>>>
> >>>>>I'm proposing a fix to this, by replacing the capable(CAP_LINUX_IMMUTABLE)
> >>>>>check with ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE).
> >>>>
> >>>>Um, wouldn't it be better to simply fix the capable() function?
> >>>>
> >>>>/**
> >>>> * capable - Determine if the current task has a superior capability in effect
> >>>> * @cap: The capability to be tested for
> >>>> *
> >>>> * Return true if the current task has the given superior capability currently
> >>>> * available for use, false if not.
> >>>> *
> >>>> * This sets PF_SUPERPRIV on the task if the capability is available on the
> >>>> * assumption that it's about to be used.
> >>>> */
> >>>>bool capable(int cap)
> >>>>{
> >>>> return ns_capable(&init_user_ns, cap);
> >>>>}
> >>>>EXPORT_SYMBOL(capable);
> >>>>
> >>>>The documentation states that it is for "the current task", and I
> >>>>can't imagine any use case, where user namespaces are in effect, where
> >>>>using init_user_ns would ever make sense.
> >>>
> >>>the init_user_ns represents the user_ns owning the object, not the
> >>>subject.
> >>>
> >>>The patch by Marian is wrong. Anyone can do 'clone(CLONE_NEWUSER)',
> >>>setuid(0), execve, and end up satisfying 'ns_capable(current_cred()->userns,
> >>>CAP_SYS_IMMUTABLE)' by definition.
> >>>
> >>>So NACK to that particular patch. I'm not sure, but IIUC it should be
> >>>safe to check against the userns owning the inode?
> >>>
> >>
> >>So what you are proposing is to replace 'ns_capable(current_cred()->userns, CAP_SYS_IMMUTABLE)' with
> >>'inode_capable(inode, CAP_SYS_IMMUTABLE)' ?
> >>
> >>I agree that this is more sane.
> >
> >Right, and I think the two operations you're looking at seem sane
> >to allow.
>
> If you are ok with this patch, I will fix all file systems and send patches.
Sounds good, thanks.
> Signed-off-by: Marian Marinov <[email protected]>
Acked-by: Serge E. Hallyn <[email protected]>
> ---
> fs/ext4/ioctl.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index d011b69..9418634 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -265,7 +265,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> * This test looks nicer. Thanks to Pauline Middelink
> */
> if ((flags ^ oldflags) & (EXT4_APPEND_FL | EXT4_IMMUTABLE_FL)) {
> - if (!capable(CAP_LINUX_IMMUTABLE))
> + if (!inode_capable(inode, CAP_LINUX_IMMUTABLE))
> goto flags_out;
> }
>
> ---
> 1.8.4
>
> Marian
>
>
> >
> >thanks,
> >-serge
> >
>
>
On 04/29/2014 03:29 PM, Serge Hallyn wrote:
> Quoting Marian Marinov ([email protected]):
>> On 04/30/2014 01:02 AM, Serge Hallyn wrote:
>>> Quoting Marian Marinov ([email protected]):
>>>> On 04/29/2014 09:52 PM, Serge Hallyn wrote:
>>>>> Quoting Theodore Ts'o ([email protected]):
>>>>>> On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:
>>>>>>>
>>>>>>> I'm proposing a fix to this, by replacing the capable(CAP_LINUX_IMMUTABLE)
>>>>>>> check with ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE).
>>>>>>
>>>>>> Um, wouldn't it be better to simply fix the capable() function?
>>>>>>
>>>>>> /**
>>>>>> * capable - Determine if the current task has a superior capability in effect
>>>>>> * @cap: The capability to be tested for
>>>>>> *
>>>>>> * Return true if the current task has the given superior capability currently
>>>>>> * available for use, false if not.
>>>>>> *
>>>>>> * This sets PF_SUPERPRIV on the task if the capability is available on the
>>>>>> * assumption that it's about to be used.
>>>>>> */
>>>>>> bool capable(int cap)
>>>>>> {
>>>>>> return ns_capable(&init_user_ns, cap);
>>>>>> }
>>>>>> EXPORT_SYMBOL(capable);
>>>>>>
>>>>>> The documentation states that it is for "the current task", and I
>>>>>> can't imagine any use case, where user namespaces are in effect, where
>>>>>> using init_user_ns would ever make sense.
>>>>>
>>>>> the init_user_ns represents the user_ns owning the object, not the
>>>>> subject.
>>>>>
>>>>> The patch by Marian is wrong. Anyone can do 'clone(CLONE_NEWUSER)',
>>>>> setuid(0), execve, and end up satisfying 'ns_capable(current_cred()->userns,
>>>>> CAP_SYS_IMMUTABLE)' by definition.
>>>>>
>>>>> So NACK to that particular patch. I'm not sure, but IIUC it should be
>>>>> safe to check against the userns owning the inode?
>>>>>
>>>>
>>>> So what you are proposing is to replace 'ns_capable(current_cred()->userns, CAP_SYS_IMMUTABLE)' with
>>>> 'inode_capable(inode, CAP_SYS_IMMUTABLE)' ?
>>>>
>>>> I agree that this is more sane.
>>>
>>> Right, and I think the two operations you're looking at seem sane
>>> to allow.
>>
>> If you are ok with this patch, I will fix all file systems and send patches.
>
> Sounds good, thanks.
>
>> Signed-off-by: Marian Marinov <[email protected]>
>
> Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/[email protected]>
Wait, what?
Inodes aren't owned by user namespaces; they're owned by users. And any
user can arrange to have a user namespace in which they pass an
inode_capable check on any inode that they own.
Presumably there's a reason that CAP_LINUX_IMMUTABLE is needed. If this
gets merged, then it would be better to just drop CAP_LINUX_IMMUTABLE
entirely.
Nacked-by: Andy Lutomirski <[email protected]>
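The objection can be illustrated with a small userspace model. This is a sketch under assumptions, not the kernel implementation: it assumes inode_capable() amounts to "the caller is capable in its own user namespace and the inode's owner uid has a mapping there", and the struct layouts and model_* names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in: a namespace that maps a single range of uids. */
struct user_ns {
	unsigned int first_outside;	/* first mapped uid, as seen outside */
	unsigned int count;		/* number of mapped uids, 0 = none */
};

struct task {
	const struct user_ns *ns;
	bool has_cap_in_own_ns;	/* true for whoever created the namespace */
};

struct inode {
	unsigned int i_uid;	/* owner, as a global uid */
};

/* Does the namespace have a mapping for this global uid? */
bool model_uid_has_mapping(const struct user_ns *ns, unsigned int uid)
{
	assert(ns);
	return uid >= ns->first_outside && uid - ns->first_outside < ns->count;
}

/* Assumed shape of the check: caps in own ns AND inode owner is mapped. */
bool model_inode_capable(const struct task *t, const struct inode *inode)
{
	assert(t && inode);
	return t->has_cap_in_own_ns &&
	       model_uid_has_mapping(t->ns, inode->i_uid);
}
```

With a namespace that maps only uid 1000, the model grants the check for any inode owned by uid 1000 and denies it for an inode owned by uid 0, so an ordinary user gets the capability over every file they own.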
On Tue, Apr 29, 2014 at 03:45:24PM -0700, Andy Lutomirski wrote:
>
> Wait, what?
>
> Inodes aren't owned by user namespaces; they're owned by users. And any
> user can arrange to have a user namespace in which they pass an
> inode_capable check on any inode that they own.
>
> Presumably there's a reason that CAP_SYS_IMMUTABLE is needed. If this
> gets merged, then it would be better to just drop CAP_SYS_IMMUTABLE
> entirely.
>
> Nacked-by: Andy Lutomirski <[email protected]>
Right, but you can't set a mapping in a child namespace unless you
have CAP_SETUID in the parent namespace, right? Otherwise user
namespaces are completely broken from a security perspective, since
inode_capable() could never do the right thing.
Personally, reading how user namespaces work, it makes the hair rise
on the back of my neck. I'm not sure the concept works at all from a
security perspective, but hey, I'm not using user namespaces, and some
fool thought it was worth merging. :-)
- Ted
On Tue, Apr 29, 2014 at 4:06 PM, Theodore Ts'o <[email protected]> wrote:
> On Tue, Apr 29, 2014 at 03:45:24PM -0700, Andy Lutomirski wrote:
>>
>> Wait, what?
>>
>> Inodes aren't owned by user namespaces; they're owned by users. And any
>> user can arrange to have a user namespace in which they pass an
>> inode_capable check on any inode that they own.
>>
>> Presumably there's a reason that CAP_SYS_IMMUTABLE is needed. If this
>> gets merged, then it would be better to just drop CAP_SYS_IMMUTABLE
>> entirely.
>>
>> Nacked-by: Andy Lutomirski <[email protected]>
>
> Right, but you can't set a mapping in a child namespace unless you
> have CAP_SETUID in the parent namespace, right?
Nope. You can't set a mapping for someone else's uid, but you can
certainly map your own.
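A minimal sketch of mapping your own uid, with no privilege in the parent namespace, might look like this. The helper names format_uid_map() and become_ns_root() are hypothetical; the sketch assumes a kernel with unprivileged user namespaces enabled and goes through syscall() for unshare only to avoid depending on feature-test macros.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>	/* SYS_unshare */
#include <sys/types.h>
#include <linux/sched.h>	/* CLONE_NEWUSER */

/* Build a one-entry uid_map line "<inside> <outside> 1". Returns its length or -1. */
int format_uid_map(char *buf, size_t len, unsigned int inside,
		   unsigned int outside)
{
	int n;

	assert(buf && len > 0);
	n = snprintf(buf, len, "%u %u 1\n", inside, outside);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Enter a new user namespace and map our own uid to uid 0 inside it.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
int become_ns_root(void)
{
	char line[64];
	unsigned int outside = (unsigned int)geteuid();	/* read before unshare */
	int fd, n;
	ssize_t written;

	if (syscall(SYS_unshare, CLONE_NEWUSER) < 0)
		return -1;
	n = format_uid_map(line, sizeof(line), 0, outside);
	if (n < 0)
		return -1;
	fd = open("/proc/self/uid_map", O_WRONLY);
	if (fd < 0)
		return -1;
	written = write(fd, line, (size_t)n);
	close(fd);
	return written == n ? 0 : -1;
}
```

The single-line map covers only the caller's own uid; trying to map any other uid this way is expected to be refused without privilege in the parent namespace.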
> Otherwise user
> namespaces are completely broken from a security perspective, since
> inode_capable() could never do the right thing.
I don't know what inode_capable's "right thing" is, but at least one
of the existing callers is blatantly wrong. Patches coming shortly.
>
> Personally, reading how user namespaces work, it makes the hair rise
> on the back of my neck. I'm not sure the concept works at all from a
> security perspective, but hey, I'm not using user namespaces, and some
> fool thought it was worth merging. :-)
I like them. I've also found quite a few serious bugs in them. So go figure :)
--Andy
On 04/30/2014 01:45 AM, Andy Lutomirski wrote:
> On 04/29/2014 03:29 PM, Serge Hallyn wrote:
>> Quoting Marian Marinov ([email protected]):
>>> On 04/30/2014 01:02 AM, Serge Hallyn wrote:
>>>> Quoting Marian Marinov ([email protected]):
>>>>> On 04/29/2014 09:52 PM, Serge Hallyn wrote:
>>>>>> Quoting Theodore Ts'o ([email protected]):
>>>>>>> On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:
>>>>>>>>
>>>>>>>> I'm proposing a fix to this, by replacing the capable(CAP_LINUX_IMMUTABLE)
>>>>>>>> check with ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE).
>>>>>>>
>>>>>>> Um, wouldn't it be better to simply fix the capable() function?
>>>>>>>
>>>>>>> /**
>>>>>>> * capable - Determine if the current task has a superior capability in effect
>>>>>>> * @cap: The capability to be tested for
>>>>>>> *
>>>>>>> * Return true if the current task has the given superior capability currently
>>>>>>> * available for use, false if not.
>>>>>>> *
>>>>>>> * This sets PF_SUPERPRIV on the task if the capability is available on the
>>>>>>> * assumption that it's about to be used.
>>>>>>> */
>>>>>>> bool capable(int cap)
>>>>>>> {
>>>>>>> return ns_capable(&init_user_ns, cap);
>>>>>>> }
>>>>>>> EXPORT_SYMBOL(capable);
>>>>>>>
>>>>>>> The documentation states that it is for "the current task", and I
>>>>>>> can't imagine any use case, where user namespaces are in effect, where
>>>>>>> using init_user_ns would ever make sense.
>>>>>>
>>>>>> the init_user_ns represents the user_ns owning the object, not the
>>>>>> subject.
>>>>>>
>>>>>> The patch by Marian is wrong. Anyone can do 'clone(CLONE_NEWUSER)',
>>>>>> setuid(0), execve, and end up satisfying 'ns_capable(current_cred()->userns,
>>>>>> CAP_SYS_IMMUTABLE)' by definition.
>>>>>>
>>>>>> So NACK to that particular patch. I'm not sure, but IIUC it should be
>>>>>> safe to check against the userns owning the inode?
>>>>>>
>>>>>
>>>>> So what you are proposing is to replace 'ns_capable(current_cred()->userns, CAP_SYS_IMMUTABLE)' with
>>>>> 'inode_capable(inode, CAP_SYS_IMMUTABLE)' ?
>>>>>
>>>>> I agree that this is more sane.
>>>>
>>>> Right, and I think the two operations you're looking at seem sane
>>>> to allow.
>>>
>>> If you are ok with this patch, I will fix all file systems and send patches.
>>
>> Sounds good, thanks.
>>
>>> Signed-off-by: Marian Marinov <[email protected]>
>>
>> Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/[email protected]>
>
> Wait, what?
>
> Inodes aren't owned by user namespaces; they're owned by users. And any
> user can arrange to have a user namespace in which they pass an
> inode_capable check on any inode that they own.
>
> Presumably there's a reason that CAP_SYS_IMMUTABLE is needed. If this
> gets merged, then it would be better to just drop CAP_SYS_IMMUTABLE
> entirely.
The problem I'm trying to solve is this:
A container with its own user namespace and CAP_LINUX_IMMUTABLE should be able to use chattr on all files which the
container has access to.
Unfortunately, with the capable(CAP_LINUX_IMMUTABLE) check this does not work.
With either of the two proposed fixes, CAP_LINUX_IMMUTABLE started working in the container.
The first solution takes the user namespace from the currently running process, and the second takes it from
the currently opened inode.
So what would be the best solution in this case?
Marian
>
> Nacked-by: Andy Lutomirski <[email protected]>
>
>
>
On Tue, Apr 29, 2014 at 4:20 PM, Marian Marinov <[email protected]> wrote:
> On 04/30/2014 01:45 AM, Andy Lutomirski wrote:
>>
>> On 04/29/2014 03:29 PM, Serge Hallyn wrote:
>>>
>>> Quoting Marian Marinov ([email protected]):
>>>>
>>>> On 04/30/2014 01:02 AM, Serge Hallyn wrote:
>>>>>
>>>>> Quoting Marian Marinov ([email protected]):
>>>>>>
>>>>>> On 04/29/2014 09:52 PM, Serge Hallyn wrote:
>>>>>>>
>>>>>>> Quoting Theodore Ts'o ([email protected]):
>>>>>>>>
>>>>>>>> On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm proposing a fix to this, by replacing the
>>>>>>>>> capable(CAP_LINUX_IMMUTABLE)
>>>>>>>>> check with ns_capable(current_cred()->user_ns,
>>>>>>>>> CAP_LINUX_IMMUTABLE).
>>>>>>>>
>>>>>>>>
>>>>>>>> Um, wouldn't it be better to simply fix the capable() function?
>>>>>>>>
>>>>>>>> /**
>>>>>>>> * capable - Determine if the current task has a superior
>>>>>>>> capability in effect
>>>>>>>> * @cap: The capability to be tested for
>>>>>>>> *
>>>>>>>> * Return true if the current task has the given superior
>>>>>>>> capability currently
>>>>>>>> * available for use, false if not.
>>>>>>>> *
>>>>>>>> * This sets PF_SUPERPRIV on the task if the capability is
>>>>>>>> available on the
>>>>>>>> * assumption that it's about to be used.
>>>>>>>> */
>>>>>>>> bool capable(int cap)
>>>>>>>> {
>>>>>>>> return ns_capable(&init_user_ns, cap);
>>>>>>>> }
>>>>>>>> EXPORT_SYMBOL(capable);
>>>>>>>>
>>>>>>>> The documentation states that it is for "the current task", and I
>>>>>>>> can't imagine any use case, where user namespaces are in effect,
>>>>>>>> where
>>>>>>>> using init_user_ns would ever make sense.
>>>>>>>
>>>>>>>
>>>>>>> the init_user_ns represents the user_ns owning the object, not the
>>>>>>> subject.
>>>>>>>
>>>>>>> The patch by Marian is wrong. Anyone can do 'clone(CLONE_NEWUSER)',
>>>>>>> setuid(0), execve, and end up satisfying
>>>>>>> 'ns_capable(current_cred()->userns,
>>>>>>> CAP_SYS_IMMUTABLE)' by definition.
>>>>>>>
>>>>>>> So NACK to that particular patch. I'm not sure, but IIUC it should
>>>>>>> be
>>>>>>> safe to check against the userns owning the inode?
>>>>>>>
>>>>>>
>>>>>> So what you are proposing is to replace
>>>>>> 'ns_capable(current_cred()->userns, CAP_SYS_IMMUTABLE)' with
>>>>>> 'inode_capable(inode, CAP_SYS_IMMUTABLE)' ?
>>>>>>
>>>>>> I agree that this is more sane.
>>>>>
>>>>>
>>>>> Right, and I think the two operations you're looking at seem sane
>>>>> to allow.
>>>>
>>>>
>>>> If you are ok with this patch, I will fix all file systems and send
>>>> patches.
>>>
>>>
>>> Sounds good, thanks.
>>>
>>>> Signed-off-by: Marian Marinov <[email protected]>
>>>
>>>
>>> Acked-by: Serge E. Hallyn
>>> <serge.hallyn-GeWIH/[email protected]>
>>
>>
>> Wait, what?
>>
>> Inodes aren't owned by user namespaces; they're owned by users. And any
>> user can arrange to have a user namespace in which they pass an
>> inode_capable check on any inode that they own.
>>
>> Presumably there's a reason that CAP_SYS_IMMUTABLE is needed. If this
>> gets merged, then it would be better to just drop CAP_SYS_IMMUTABLE
>> entirely.
>
>
> The problem I'm trying to solve is this:
>
> container with its own user namespace and CAP_SYS_IMMUTABLE should be able
> to use chattr on all files witch this container has access to.
>
> Unfortunately with the capable(CAP_SYS_IMMUTABLE) check this is not working.
>
> With the proposed two fixes CAP_SYS_IMMUTABLE started working in the
> container.
>
> The first solution got its user namespace from the currently running process
> and the second gets its user namespace from the currently opened inode.
>
> So what would be the best solution in this case?
I'd suggest adding a mount option like fs_owner_uid that names a uid
that owns, in the sense of having unlimited access to, a filesystem.
Then anyone with caps on a namespace owned by that uid could do
whatever.
Eric?
--Andy
On Tue, Apr 29, 2014 at 4:47 PM, Stéphane Graber <[email protected]> wrote:
> On Tue, Apr 29, 2014 at 04:22:55PM -0700, Andy Lutomirski wrote:
>> On Tue, Apr 29, 2014 at 4:20 PM, Marian Marinov <[email protected]> wrote:
>> > On 04/30/2014 01:45 AM, Andy Lutomirski wrote:
>> >>
>> >> On 04/29/2014 03:29 PM, Serge Hallyn wrote:
>> >>>
>> >>> Quoting Marian Marinov ([email protected]):
>> >>>>
>> >>>> On 04/30/2014 01:02 AM, Serge Hallyn wrote:
>> >>>>>
>> >>>>> Quoting Marian Marinov ([email protected]):
>> >>>>>>
>> >>>>>> On 04/29/2014 09:52 PM, Serge Hallyn wrote:
>> >>>>>>>
>> >>>>>>> Quoting Theodore Ts'o ([email protected]):
>> >>>>>>>>
>> >>>>>>>> On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> I'm proposing a fix to this, by replacing the
>> >>>>>>>>> capable(CAP_LINUX_IMMUTABLE)
>> >>>>>>>>> check with ns_capable(current_cred()->user_ns,
>> >>>>>>>>> CAP_LINUX_IMMUTABLE).
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Um, wouldn't it be better to simply fix the capable() function?
>> >>>>>>>>
>> >>>>>>>> /**
>> >>>>>>>> * capable - Determine if the current task has a superior
>> >>>>>>>> capability in effect
>> >>>>>>>> * @cap: The capability to be tested for
>> >>>>>>>> *
>> >>>>>>>> * Return true if the current task has the given superior
>> >>>>>>>> capability currently
>> >>>>>>>> * available for use, false if not.
>> >>>>>>>> *
>> >>>>>>>> * This sets PF_SUPERPRIV on the task if the capability is
>> >>>>>>>> available on the
>> >>>>>>>> * assumption that it's about to be used.
>> >>>>>>>> */
>> >>>>>>>> bool capable(int cap)
>> >>>>>>>> {
>> >>>>>>>> return ns_capable(&init_user_ns, cap);
>> >>>>>>>> }
>> >>>>>>>> EXPORT_SYMBOL(capable);
>> >>>>>>>>
>> >>>>>>>> The documentation states that it is for "the current task", and I
>> >>>>>>>> can't imagine any use case, where user namespaces are in effect,
>> >>>>>>>> where
>> >>>>>>>> using init_user_ns would ever make sense.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> the init_user_ns represents the user_ns owning the object, not the
>> >>>>>>> subject.
>> >>>>>>>
>> >>>>>>> The patch by Marian is wrong. Anyone can do 'clone(CLONE_NEWUSER)',
>> >>>>>>> setuid(0), execve, and end up satisfying
>> >>>>>>> 'ns_capable(current_cred()->userns,
>> >>>>>>> CAP_SYS_IMMUTABLE)' by definition.
>> >>>>>>>
>> >>>>>>> So NACK to that particular patch. I'm not sure, but IIUC it should
>> >>>>>>> be
>> >>>>>>> safe to check against the userns owning the inode?
>> >>>>>>>
>> >>>>>>
>> >>>>>> So what you are proposing is to replace
>> >>>>>> 'ns_capable(current_cred()->userns, CAP_SYS_IMMUTABLE)' with
>> >>>>>> 'inode_capable(inode, CAP_SYS_IMMUTABLE)' ?
>> >>>>>>
>> >>>>>> I agree that this is more sane.
>> >>>>>
>> >>>>>
>> >>>>> Right, and I think the two operations you're looking at seem sane
>> >>>>> to allow.
>> >>>>
>> >>>>
>> >>>> If you are ok with this patch, I will fix all file systems and send
>> >>>> patches.
>> >>>
>> >>>
>> >>> Sounds good, thanks.
>> >>>
>> >>>> Signed-off-by: Marian Marinov <[email protected]>
>> >>>
>> >>>
>> >>> Acked-by: Serge E. Hallyn
>> >>> <serge.hallyn-GeWIH/[email protected]>
>> >>
>> >>
>> >> Wait, what?
>> >>
>> >> Inodes aren't owned by user namespaces; they're owned by users. And any
>> >> user can arrange to have a user namespace in which they pass an
>> >> inode_capable check on any inode that they own.
>> >>
>> >> Presumably there's a reason that CAP_SYS_IMMUTABLE is needed. If this
>> >> gets merged, then it would be better to just drop CAP_SYS_IMMUTABLE
>> >> entirely.
>> >
>> >
>> > The problem I'm trying to solve is this:
>> >
>> > container with its own user namespace and CAP_SYS_IMMUTABLE should be able
>> > to use chattr on all files witch this container has access to.
>> >
>> > Unfortunately with the capable(CAP_SYS_IMMUTABLE) check this is not working.
>> >
>> > With the proposed two fixes CAP_SYS_IMMUTABLE started working in the
>> > container.
>> >
>> > The first solution got its user namespace from the currently running process
>> > and the second gets its user namespace from the currently opened inode.
>> >
>> > So what would be the best solution in this case?
>>
>> I'd suggest adding a mount option like fs_owner_uid that names a uid
>> that owns, in the sense of having unlimited access to, a filesystem.
>> Then anyone with caps on a namespace owned by that uid could do
>> whatever.
>>
>> Eric?
>>
>> --Andy
>
> The most obvious problem I can think of with "do whatever" is that this
> will likely include mknod of char and block devices which you can then
> chown/chmod as you wish and use to access any devices on the system from
> an unprivileged container.
> This can however be mitigated by using the devices cgroup controller.
Or 'nodev'. setuid/setgid may have the same problem, too.
Implementing something like this would also make CAP_DAC_READ_SEARCH
and CAP_DAC_OVERRIDE work.
Arguably it should be impossible to mount such a thing in the first
place without global privilege.
>
> You also probably wouldn't want any unprivileged user from the host to
> find a way to access that mounted filesystem but so long as you do the
> mount in a separate mountns and don't share uids between the host and
> the container, that should be fine too.
This part should be a nonissue -- an unprivileged user who has the
right uid owns the namespace anyway, so this is the least of your
worries.
--Andy
On Tue, Apr 29, 2014 at 04:22:55PM -0700, Andy Lutomirski wrote:
> On Tue, Apr 29, 2014 at 4:20 PM, Marian Marinov <[email protected]> wrote:
> > On 04/30/2014 01:45 AM, Andy Lutomirski wrote:
> >>
> >> On 04/29/2014 03:29 PM, Serge Hallyn wrote:
> >>>
> >>> Quoting Marian Marinov ([email protected]):
> >>>>
> >>>> On 04/30/2014 01:02 AM, Serge Hallyn wrote:
> >>>>>
> >>>>> Quoting Marian Marinov ([email protected]):
> >>>>>>
> >>>>>> On 04/29/2014 09:52 PM, Serge Hallyn wrote:
> >>>>>>>
> >>>>>>> Quoting Theodore Ts'o ([email protected]):
> >>>>>>>>
> >>>>>>>> On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I'm proposing a fix to this, by replacing the
> >>>>>>>>> capable(CAP_LINUX_IMMUTABLE)
> >>>>>>>>> check with ns_capable(current_cred()->user_ns,
> >>>>>>>>> CAP_LINUX_IMMUTABLE).
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Um, wouldn't it be better to simply fix the capable() function?
> >>>>>>>>
> >>>>>>>> /**
> >>>>>>>> * capable - Determine if the current task has a superior
> >>>>>>>> capability in effect
> >>>>>>>> * @cap: The capability to be tested for
> >>>>>>>> *
> >>>>>>>> * Return true if the current task has the given superior
> >>>>>>>> capability currently
> >>>>>>>> * available for use, false if not.
> >>>>>>>> *
> >>>>>>>> * This sets PF_SUPERPRIV on the task if the capability is
> >>>>>>>> available on the
> >>>>>>>> * assumption that it's about to be used.
> >>>>>>>> */
> >>>>>>>> bool capable(int cap)
> >>>>>>>> {
> >>>>>>>> return ns_capable(&init_user_ns, cap);
> >>>>>>>> }
> >>>>>>>> EXPORT_SYMBOL(capable);
> >>>>>>>>
> >>>>>>>> The documentation states that it is for "the current task", and I
> >>>>>>>> can't imagine any use case, where user namespaces are in effect,
> >>>>>>>> where
> >>>>>>>> using init_user_ns would ever make sense.
> >>>>>>>
> >>>>>>>
> >>>>>>> the init_user_ns represents the user_ns owning the object, not the
> >>>>>>> subject.
> >>>>>>>
> >>>>>>> The patch by Marian is wrong. Anyone can do 'clone(CLONE_NEWUSER)',
> >>>>>>> setuid(0), execve, and end up satisfying
> >>>>>>> 'ns_capable(current_cred()->userns,
> >>>>>>> CAP_SYS_IMMUTABLE)' by definition.
> >>>>>>>
> >>>>>>> So NACK to that particular patch. I'm not sure, but IIUC it should
> >>>>>>> be
> >>>>>>> safe to check against the userns owning the inode?
> >>>>>>>
> >>>>>>
> >>>>>> So what you are proposing is to replace
> >>>>>> 'ns_capable(current_cred()->userns, CAP_SYS_IMMUTABLE)' with
> >>>>>> 'inode_capable(inode, CAP_SYS_IMMUTABLE)' ?
> >>>>>>
> >>>>>> I agree that this is more sane.
> >>>>>
> >>>>>
> >>>>> Right, and I think the two operations you're looking at seem sane
> >>>>> to allow.
> >>>>
> >>>>
> >>>> If you are ok with this patch, I will fix all file systems and send
> >>>> patches.
> >>>
> >>>
> >>> Sounds good, thanks.
> >>>
> >>>> Signed-off-by: Marian Marinov <[email protected]>
> >>>
> >>>
> >>> Acked-by: Serge E. Hallyn
> >>> <serge.hallyn-GeWIH/[email protected]>
> >>
> >>
> >> Wait, what?
> >>
> >> Inodes aren't owned by user namespaces; they're owned by users. And any
> >> user can arrange to have a user namespace in which they pass an
> >> inode_capable check on any inode that they own.
> >>
> >> Presumably there's a reason that CAP_SYS_IMMUTABLE is needed. If this
> >> gets merged, then it would be better to just drop CAP_SYS_IMMUTABLE
> >> entirely.
> >
> >
> > The problem I'm trying to solve is this:
> >
> > container with its own user namespace and CAP_SYS_IMMUTABLE should be able
> > to use chattr on all files witch this container has access to.
> >
> > Unfortunately with the capable(CAP_SYS_IMMUTABLE) check this is not working.
> >
> > With the proposed two fixes CAP_SYS_IMMUTABLE started working in the
> > container.
> >
> > The first solution got its user namespace from the currently running process
> > and the second gets its user namespace from the currently opened inode.
> >
> > So what would be the best solution in this case?
>
> I'd suggest adding a mount option like fs_owner_uid that names a uid
> that owns, in the sense of having unlimited access to, a filesystem.
> Then anyone with caps on a namespace owned by that uid could do
> whatever.
>
> Eric?
>
> --Andy
The most obvious problem I can think of with "do whatever" is that this
will likely include mknod of char and block devices which you can then
chown/chmod as you wish and use to access any devices on the system from
an unprivileged container.
This can however be mitigated by using the devices cgroup controller.
You also probably wouldn't want any unprivileged user from the host to
find a way to access that mounted filesystem, but so long as you do the
mount in a separate mountns and don't share uids between the host and
the container, that should be fine too.
--
Stéphane Graber
Ubuntu developer
http://www.ubuntu.com
On Tue, Apr 29, 2014 at 04:51:54PM -0700, Andy Lutomirski wrote:
> On Tue, Apr 29, 2014 at 4:47 PM, Stéphane Graber <[email protected]> wrote:
> > On Tue, Apr 29, 2014 at 04:22:55PM -0700, Andy Lutomirski wrote:
> >> On Tue, Apr 29, 2014 at 4:20 PM, Marian Marinov <[email protected]> wrote:
> >> > On 04/30/2014 01:45 AM, Andy Lutomirski wrote:
> >> >>
> >> >> On 04/29/2014 03:29 PM, Serge Hallyn wrote:
> >> >>>
> >> >>> Quoting Marian Marinov ([email protected]):
> >> >>>>
> >> >>>> On 04/30/2014 01:02 AM, Serge Hallyn wrote:
> >> >>>>>
> >> >>>>> Quoting Marian Marinov ([email protected]):
> >> >>>>>>
> >> >>>>>> On 04/29/2014 09:52 PM, Serge Hallyn wrote:
> >> >>>>>>>
> >> >>>>>>> Quoting Theodore Ts'o ([email protected]):
> >> >>>>>>>>
> >> >>>>>>>> On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> I'm proposing a fix to this, by replacing the
> >> >>>>>>>>> capable(CAP_LINUX_IMMUTABLE)
> >> >>>>>>>>> check with ns_capable(current_cred()->user_ns,
> >> >>>>>>>>> CAP_LINUX_IMMUTABLE).
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Um, wouldn't it be better to simply fix the capable() function?
> >> >>>>>>>>
> >> >>>>>>>> /**
> >> >>>>>>>> * capable - Determine if the current task has a superior
> >> >>>>>>>> capability in effect
> >> >>>>>>>> * @cap: The capability to be tested for
> >> >>>>>>>> *
> >> >>>>>>>> * Return true if the current task has the given superior
> >> >>>>>>>> capability currently
> >> >>>>>>>> * available for use, false if not.
> >> >>>>>>>> *
> >> >>>>>>>> * This sets PF_SUPERPRIV on the task if the capability is
> >> >>>>>>>> available on the
> >> >>>>>>>> * assumption that it's about to be used.
> >> >>>>>>>> */
> >> >>>>>>>> bool capable(int cap)
> >> >>>>>>>> {
> >> >>>>>>>> return ns_capable(&init_user_ns, cap);
> >> >>>>>>>> }
> >> >>>>>>>> EXPORT_SYMBOL(capable);
> >> >>>>>>>>
> >> >>>>>>>> The documentation states that it is for "the current task", and I
> >> >>>>>>>> can't imagine any use case, where user namespaces are in effect,
> >> >>>>>>>> where
> >> >>>>>>>> using init_user_ns would ever make sense.
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> the init_user_ns represents the user_ns owning the object, not the
> >> >>>>>>> subject.
> >> >>>>>>>
> >> >>>>>>> The patch by Marian is wrong. Anyone can do 'clone(CLONE_NEWUSER)',
> >> >>>>>>> setuid(0), execve, and end up satisfying
> >> >>>>>>> 'ns_capable(current_cred()->userns,
> >> >>>>>>> CAP_SYS_IMMUTABLE)' by definition.
> >> >>>>>>>
> >> >>>>>>> So NACK to that particular patch. I'm not sure, but IIUC it should
> >> >>>>>>> be
> >> >>>>>>> safe to check against the userns owning the inode?
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>> So what you are proposing is to replace
> >> >>>>>> 'ns_capable(current_cred()->userns, CAP_SYS_IMMUTABLE)' with
> >> >>>>>> 'inode_capable(inode, CAP_SYS_IMMUTABLE)' ?
> >> >>>>>>
> >> >>>>>> I agree that this is more sane.
> >> >>>>>
> >> >>>>>
> >> >>>>> Right, and I think the two operations you're looking at seem sane
> >> >>>>> to allow.
> >> >>>>
> >> >>>>
> >> >>>> If you are ok with this patch, I will fix all file systems and send
> >> >>>> patches.
> >> >>>
> >> >>>
> >> >>> Sounds good, thanks.
> >> >>>
> >> >>>> Signed-off-by: Marian Marinov <[email protected]>
> >> >>>
> >> >>>
> >> >>> Acked-by: Serge E. Hallyn
> >> >>> <serge.hallyn-GeWIH/[email protected]>
> >> >>
> >> >>
> >> >> Wait, what?
> >> >>
> >> >> Inodes aren't owned by user namespaces; they're owned by users. And any
> >> >> user can arrange to have a user namespace in which they pass an
> >> >> inode_capable check on any inode that they own.
> >> >>
> >> >> Presumably there's a reason that CAP_SYS_IMMUTABLE is needed. If this
> >> >> gets merged, then it would be better to just drop CAP_SYS_IMMUTABLE
> >> >> entirely.
> >> >
> >> >
> >> > The problem I'm trying to solve is this:
> >> >
> >> > container with its own user namespace and CAP_SYS_IMMUTABLE should be able
> >> > to use chattr on all files witch this container has access to.
> >> >
> >> > Unfortunately with the capable(CAP_SYS_IMMUTABLE) check this is not working.
> >> >
> >> > With the proposed two fixes CAP_SYS_IMMUTABLE started working in the
> >> > container.
> >> >
> >> > The first solution got its user namespace from the currently running process
> >> > and the second gets its user namespace from the currently opened inode.
> >> >
> >> > So what would be the best solution in this case?
> >>
> >> I'd suggest adding a mount option like fs_owner_uid that names a uid
> >> that owns, in the sense of having unlimited access to, a filesystem.
> >> Then anyone with caps on a namespace owned by that uid could do
> >> whatever.
> >>
> >> Eric?
> >>
> >> --Andy
> >
> > The most obvious problem I can think of with "do whatever" is that this
> > will likely include mknod of char and block devices which you can then
> > chown/chmod as you wish and use to access any devices on the system from
> > an unprivileged container.
> > This can however be mitigated by using the devices cgroup controller.
>
> Or 'nodev'. setuid/setgid may have the same problem, too.
>
> Implementing something like this would also make CAP_DAC_READ_SEARCH
> and CAP_DAC_OVERRIDE work.
>
> Arguably it should be impossible to mount such a thing in the first
> place without global privilege.
>
> >
> > You also probably wouldn't want any unprivileged user from the host to
> > find a way to access that mounted filesytem but so long as you do the
> > mount in a separate mountns and don't share uids between the host and
> > the container, that should be fine too.
>
> This part should be a nonissue -- an unprivileged user who has the
> right uid owns the namespace anyway, so this is the least of your
> worries.
>
> --Andy
It should be a nonissue so long as we make sure that a file owned by a
uid outside the scope of the container may not be changed even though
fs_owner_uid is set. Otherwise, it's just a matter of chmod +s on, say,
a shell, and anyone who can see the fs from the host will be getting a
root shell (assuming said file is owned by the host's uid 0).
So that's restricting slightly what "do whatever" would do in this case.
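To make that restriction concrete, here is a minimal userspace sketch of the check, not kernel code. It assumes the container has a single contiguous uid mapping; struct uid_range, uid_in_scope() and may_change_file() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical model of one container uid mapping: host uids in
 * [host_first, host_first + count) are the ones mapped into it. */
struct uid_range {
    uid_t host_first;
    uid_t count;
};

/* True if host_uid falls inside the range mapped into the container. */
bool uid_in_scope(const struct uid_range *map, uid_t host_uid)
{
    assert(map != NULL);
    return host_uid >= map->host_first &&
           host_uid - map->host_first < map->count;
}

/* Assumed policy: even with fs_owner_uid set, the container may only
 * change files whose owner is mapped into it, so a file owned by the
 * host's uid 0 stays out of reach. */
bool may_change_file(const struct uid_range *map, uid_t file_owner)
{
    return uid_in_scope(map, file_owner);
}
```

With a mapping of 65536 uids starting at host uid 100000, a file owned by host uid 0 is refused while one owned by 100000 is allowed.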
--
Stéphane Graber
Ubuntu developer
http://www.ubuntu.com
On 04/30/2014 03:01 AM, Stéphane Graber wrote:
> On Tue, Apr 29, 2014 at 04:51:54PM -0700, Andy Lutomirski wrote:
>> On Tue, Apr 29, 2014 at 4:47 PM, Stéphane Graber <[email protected]> wrote:
>>> On Tue, Apr 29, 2014 at 04:22:55PM -0700, Andy Lutomirski wrote:
>>>> On Tue, Apr 29, 2014 at 4:20 PM, Marian Marinov <[email protected]> wrote:
>>>>> On 04/30/2014 01:45 AM, Andy Lutomirski wrote:
>>>>>>
>>>>>> On 04/29/2014 03:29 PM, Serge Hallyn wrote:
>>>>>>>
>>>>>>> Quoting Marian Marinov ([email protected]):
>>>>>>>>
>>>>>>>> On 04/30/2014 01:02 AM, Serge Hallyn wrote:
>>>>>>>>>
>>>>>>>>> Quoting Marian Marinov ([email protected]):
>>>>>>>>>>
>>>>>>>>>> On 04/29/2014 09:52 PM, Serge Hallyn wrote:
>>>>>>>>>>>
>>>>>>>>>>> Quoting Theodore Ts'o ([email protected]):
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm proposing a fix to this, by replacing the
>>>>>>>>>>>>> capable(CAP_LINUX_IMMUTABLE)
>>>>>>>>>>>>> check with ns_capable(current_cred()->user_ns,
>>>>>>>>>>>>> CAP_LINUX_IMMUTABLE).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Um, wouldn't it be better to simply fix the capable() function?
>>>>>>>>>>>>
>>>>>>>>>>>> /**
>>>>>>>>>>>> * capable - Determine if the current task has a superior
>>>>>>>>>>>> capability in effect
>>>>>>>>>>>> * @cap: The capability to be tested for
>>>>>>>>>>>> *
>>>>>>>>>>>> * Return true if the current task has the given superior
>>>>>>>>>>>> capability currently
>>>>>>>>>>>> * available for use, false if not.
>>>>>>>>>>>> *
>>>>>>>>>>>> * This sets PF_SUPERPRIV on the task if the capability is
>>>>>>>>>>>> available on the
>>>>>>>>>>>> * assumption that it's about to be used.
>>>>>>>>>>>> */
>>>>>>>>>>>> bool capable(int cap)
>>>>>>>>>>>> {
>>>>>>>>>>>> return ns_capable(&init_user_ns, cap);
>>>>>>>>>>>> }
>>>>>>>>>>>> EXPORT_SYMBOL(capable);
>>>>>>>>>>>>
>>>>>>>>>>>> The documentation states that it is for "the current task", and I
>>>>>>>>>>>> can't imagine any use case, where user namespaces are in effect,
>>>>>>>>>>>> where
>>>>>>>>>>>> using init_user_ns would ever make sense.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> the init_user_ns represents the user_ns owning the object, not the
>>>>>>>>>>> subject.
>>>>>>>>>>>
>>>>>>>>>>> The patch by Marian is wrong. Anyone can do 'clone(CLONE_NEWUSER)',
>>>>>>>>>>> setuid(0), execve, and end up satisfying
>>>>>>>>>>> 'ns_capable(current_cred()->userns,
>>>>>>>>>>> CAP_SYS_IMMUTABLE)' by definition.
>>>>>>>>>>>
>>>>>>>>>>> So NACK to that particular patch. I'm not sure, but IIUC it should
>>>>>>>>>>> be
>>>>>>>>>>> safe to check against the userns owning the inode?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> So what you are proposing is to replace
>>>>>>>>>> 'ns_capable(current_cred()->userns, CAP_SYS_IMMUTABLE)' with
>>>>>>>>>> 'inode_capable(inode, CAP_SYS_IMMUTABLE)' ?
>>>>>>>>>>
>>>>>>>>>> I agree that this is more sane.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Right, and I think the two operations you're looking at seem sane
>>>>>>>>> to allow.
>>>>>>>>
>>>>>>>>
>>>>>>>> If you are ok with this patch, I will fix all file systems and send
>>>>>>>> patches.
>>>>>>>
>>>>>>>
>>>>>>> Sounds good, thanks.
>>>>>>>
>>>>>>>> Signed-off-by: Marian Marinov <[email protected]>
>>>>>>>
>>>>>>>
>>>>>>> Acked-by: Serge E. Hallyn
>>>>>>> <serge.hallyn-GeWIH/[email protected]>
>>>>>>
>>>>>>
>>>>>> Wait, what?
>>>>>>
>>>>>> Inodes aren't owned by user namespaces; they're owned by users. And any
>>>>>> user can arrange to have a user namespace in which they pass an
>>>>>> inode_capable check on any inode that they own.
>>>>>>
>>>>>> Presumably there's a reason that CAP_SYS_IMMUTABLE is needed. If this
>>>>>> gets merged, then it would be better to just drop CAP_SYS_IMMUTABLE
>>>>>> entirely.
>>>>>
>>>>>
>>>>> The problem I'm trying to solve is this:
>>>>>
>>>>> container with its own user namespace and CAP_SYS_IMMUTABLE should be able
>>>>> to use chattr on all files witch this container has access to.
>>>>>
>>>>> Unfortunately with the capable(CAP_SYS_IMMUTABLE) check this is not working.
>>>>>
>>>>> With the proposed two fixes CAP_SYS_IMMUTABLE started working in the
>>>>> container.
>>>>>
>>>>> The first solution got its user namespace from the currently running process
>>>>> and the second gets its user namespace from the currently opened inode.
>>>>>
>>>>> So what would be the best solution in this case?
>>>>
>>>> I'd suggest adding a mount option like fs_owner_uid that names a uid
>>>> that owns, in the sense of having unlimited access to, a filesystem.
>>>> Then anyone with caps on a namespace owned by that uid could do
>>>> whatever.
>>>>
>>>> Eric?
>>>>
>>>> --Andy
>>>
>>> The most obvious problem I can think of with "do whatever" is that this
>>> will likely include mknod of char and block devices which you can then
>>> chown/chmod as you wish and use to access any devices on the system from
>>> an unprivileged container.
>>> This can however be mitigated by using the devices cgroup controller.
>>
>> Or 'nodev'. setuid/setgid may have the same problem, too.
>>
>> Implementing something like this would also make CAP_DAC_READ_SEARCH
>> and CAP_DAC_OVERRIDE work.
>>
>> Arguably it should be impossible to mount such a thing in the first
>> place without global privilege.
>>
>>>
>>> You also probably wouldn't want any unprivileged user from the host to
>>> find a way to access that mounted filesytem but so long as you do the
>>> mount in a separate mountns and don't share uids between the host and
>>> the container, that should be fine too.
>>
>> This part should be a nonissue -- an unprivileged user who has the
>> right uid owns the namespace anyway, so this is the least of your
>> worries.
>>
>> --Andy
>
> It should be a nonissue so long as we make sure that a file owned by a
> uid outside the scope of the container may not be changed even though
> fs_owner_uid is set. Otherwise, it's just a matter of chmod +S on say
> a shell and anyone who can see the fs from the host will be getting a
> root shell (assuming said file is owned by the host's uid 0).
>
> So that's restricting slightly what "do whatever" would do in this case.
>
In my case I give an LVM volume to each container and limit the container to only this block device using the devices
cgroup.
So the inode_capable() fix worked like a charm for me.
The container cannot see any filesystem other than its own.
And I have another patch for my kernel that prohibits setns from any cgroup other than /, which prevents programs in the
container from getting out. clone() can be used to create new namespaces but cannot be used to attach to already
created namespaces.
Marian
On Tue, Apr 29, 2014 at 5:01 PM, Stéphane Graber <[email protected]> wrote:
> On Tue, Apr 29, 2014 at 04:51:54PM -0700, Andy Lutomirski wrote:
>> On Tue, Apr 29, 2014 at 4:47 PM, Stéphane Graber <[email protected]> wrote:
>> > On Tue, Apr 29, 2014 at 04:22:55PM -0700, Andy Lutomirski wrote:
>> >> On Tue, Apr 29, 2014 at 4:20 PM, Marian Marinov <[email protected]> wrote:
>> >> > On 04/30/2014 01:45 AM, Andy Lutomirski wrote:
>> >> >>
>> >> >> On 04/29/2014 03:29 PM, Serge Hallyn wrote:
>> >> >>>
>> >> >>> Quoting Marian Marinov ([email protected]):
>> >> >>>>
>> >> >>>> On 04/30/2014 01:02 AM, Serge Hallyn wrote:
>> >> >>>>>
>> >> >>>>> Quoting Marian Marinov ([email protected]):
>> >> >>>>>>
>> >> >>>>>> On 04/29/2014 09:52 PM, Serge Hallyn wrote:
>> >> >>>>>>>
>> >> >>>>>>> Quoting Theodore Ts'o ([email protected]):
>> >> >>>>>>>>
>> >> >>>>>>>> On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>> I'm proposing a fix to this, by replacing the
>> >> >>>>>>>>> capable(CAP_LINUX_IMMUTABLE)
>> >> >>>>>>>>> check with ns_capable(current_cred()->user_ns,
>> >> >>>>>>>>> CAP_LINUX_IMMUTABLE).
>> >> >>>>>>>>
>> >> >>>>>>>>
>> >> >>>>>>>> Um, wouldn't it be better to simply fix the capable() function?
>> >> >>>>>>>>
>> >> >>>>>>>> /**
>> >> >>>>>>>> * capable - Determine if the current task has a superior
>> >> >>>>>>>> capability in effect
>> >> >>>>>>>> * @cap: The capability to be tested for
>> >> >>>>>>>> *
>> >> >>>>>>>> * Return true if the current task has the given superior
>> >> >>>>>>>> capability currently
>> >> >>>>>>>> * available for use, false if not.
>> >> >>>>>>>> *
>> >> >>>>>>>> * This sets PF_SUPERPRIV on the task if the capability is
>> >> >>>>>>>> available on the
>> >> >>>>>>>> * assumption that it's about to be used.
>> >> >>>>>>>> */
>> >> >>>>>>>> bool capable(int cap)
>> >> >>>>>>>> {
>> >> >>>>>>>> return ns_capable(&init_user_ns, cap);
>> >> >>>>>>>> }
>> >> >>>>>>>> EXPORT_SYMBOL(capable);
>> >> >>>>>>>>
>> >> >>>>>>>> The documentation states that it is for "the current task", and I
>> >> >>>>>>>> can't imagine any use case, where user namespaces are in effect,
>> >> >>>>>>>> where
>> >> >>>>>>>> using init_user_ns would ever make sense.
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>> the init_user_ns represents the user_ns owning the object, not the
>> >> >>>>>>> subject.
>> >> >>>>>>>
>> >> >>>>>>> The patch by Marian is wrong. Anyone can do 'clone(CLONE_NEWUSER)',
>> >> >>>>>>> setuid(0), execve, and end up satisfying
>> >> >>>>>>> 'ns_capable(current_cred()->userns,
>> >> >>>>>>> CAP_SYS_IMMUTABLE)' by definition.
>> >> >>>>>>>
>> >> >>>>>>> So NACK to that particular patch. I'm not sure, but IIUC it should
>> >> >>>>>>> be
>> >> >>>>>>> safe to check against the userns owning the inode?
>> >> >>>>>>>
>> >> >>>>>>
>> >> >>>>>> So what you are proposing is to replace
>> >> >>>>>> 'ns_capable(current_cred()->userns, CAP_SYS_IMMUTABLE)' with
>> >> >>>>>> 'inode_capable(inode, CAP_SYS_IMMUTABLE)' ?
>> >> >>>>>>
>> >> >>>>>> I agree that this is more sane.
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> Right, and I think the two operations you're looking at seem sane
>> >> >>>>> to allow.
>> >> >>>>
>> >> >>>>
>> >> >>>> If you are ok with this patch, I will fix all file systems and send
>> >> >>>> patches.
>> >> >>>
>> >> >>>
>> >> >>> Sounds good, thanks.
>> >> >>>
>> >> >>>> Signed-off-by: Marian Marinov <[email protected]>
>> >> >>>
>> >> >>>
>> >> >>> Acked-by: Serge E. Hallyn
>> >> >>> <serge.hallyn-GeWIH/[email protected]>
>> >> >>
>> >> >>
>> >> >> Wait, what?
>> >> >>
>> >> >> Inodes aren't owned by user namespaces; they're owned by users. And any
>> >> >> user can arrange to have a user namespace in which they pass an
>> >> >> inode_capable check on any inode that they own.
>> >> >>
>> >> >> Presumably there's a reason that CAP_SYS_IMMUTABLE is needed. If this
>> >> >> gets merged, then it would be better to just drop CAP_SYS_IMMUTABLE
>> >> >> entirely.
>> >> >
>> >> >
>> >> > The problem I'm trying to solve is this:
>> >> >
>> >> > container with its own user namespace and CAP_SYS_IMMUTABLE should be able
>> >> > to use chattr on all files witch this container has access to.
>> >> >
>> >> > Unfortunately with the capable(CAP_SYS_IMMUTABLE) check this is not working.
>> >> >
>> >> > With the proposed two fixes CAP_SYS_IMMUTABLE started working in the
>> >> > container.
>> >> >
>> >> > The first solution got its user namespace from the currently running process
>> >> > and the second gets its user namespace from the currently opened inode.
>> >> >
>> >> > So what would be the best solution in this case?
>> >>
>> >> I'd suggest adding a mount option like fs_owner_uid that names a uid
>> >> that owns, in the sense of having unlimited access to, a filesystem.
>> >> Then anyone with caps on a namespace owned by that uid could do
>> >> whatever.
>> >>
>> >> Eric?
>> >>
>> >> --Andy
>> >
>> > The most obvious problem I can think of with "do whatever" is that this
>> > will likely include mknod of char and block devices which you can then
>> > chown/chmod as you wish and use to access any devices on the system from
>> > an unprivileged container.
>> > This can however be mitigated by using the devices cgroup controller.
>>
>> Or 'nodev'. setuid/setgid may have the same problem, too.
>>
>> Implementing something like this would also make CAP_DAC_READ_SEARCH
>> and CAP_DAC_OVERRIDE work.
>>
>> Arguably it should be impossible to mount such a thing in the first
>> place without global privilege.
>>
>> >
>> > You also probably wouldn't want any unprivileged user from the host to
>> > find a way to access that mounted filesytem but so long as you do the
>> > mount in a separate mountns and don't share uids between the host and
>> > the container, that should be fine too.
>>
>> This part should be a nonissue -- an unprivileged user who has the
>> right uid owns the namespace anyway, so this is the least of your
>> worries.
>>
>> --Andy
>
> It should be a nonissue so long as we make sure that a file owned by a
> uid outside the scope of the container may not be changed even though
> fs_owner_uid is set. Otherwise, it's just a matter of chmod +S on say
> a shell and anyone who can see the fs from the host will be getting a
> root shell (assuming said file is owned by the host's uid 0).
I feel like that's too fragile. I'd rather add a rule that one of
these filesystems always acts like it's nosuid unless you're inside a
user namespace that matches fs_owner_uid.
Maybe even that is too weird. How about setuid, setgid, and fcaps
only work on mounts that are in mount namespaces that are owned by the
current user namespace or one of its parents? IOW, a struct mount is
only trusted if mnt->mnt_ns->user_ns == current user ns or one of its
parents?
Untrusted mounts would act like they are nosuid,nodev. Someone can
try to figure out a safe way to relax nodev at some point.
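Roughly, the check I have in mind looks like the sketch below. This is a userspace model, not a patch: the structs are simplified stand-ins that keep only the fields named above, and mount_is_trusted() and honor_setuid() are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel objects; only the fields needed here. */
struct user_ns {
    const struct user_ns *parent;   /* NULL for the initial user namespace */
};

struct mnt_ns {
    const struct user_ns *user_ns;  /* user namespace owning this mount namespace */
};

struct mount {
    const struct mnt_ns *mnt_ns;
};

/* A mount is trusted if the user namespace that owns its mount namespace
 * is the caller's user namespace or one of that namespace's ancestors. */
bool mount_is_trusted(const struct mount *mnt, const struct user_ns *current_ns)
{
    assert(mnt != NULL && mnt->mnt_ns != NULL);
    for (const struct user_ns *ns = current_ns; ns != NULL; ns = ns->parent) {
        if (ns == mnt->mnt_ns->user_ns)
            return true;
    }
    return false;
}

/* Untrusted mounts behave as if mounted nosuid,nodev, so setuid, setgid
 * and fcaps are honored only on trusted mounts. */
bool honor_setuid(const struct mount *mnt, const struct user_ns *current_ns)
{
    return mount_is_trusted(mnt, current_ns);
}
```

The walk goes up from the caller's namespace, so a task in a child namespace still trusts host mounts, while a host task (or a task in a sibling namespace) does not trust a mount owned by the container.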
--Andy
On Tue, Apr 29, 2014 at 5:10 PM, Marian Marinov <[email protected]> wrote:
> On 04/30/2014 03:01 AM, Stéphane Graber wrote:
>>
>> On Tue, Apr 29, 2014 at 04:51:54PM -0700, Andy Lutomirski wrote:
>>>
>>> On Tue, Apr 29, 2014 at 4:47 PM, Stéphane Graber <[email protected]>
>>> wrote:
>>>>
>>>> On Tue, Apr 29, 2014 at 04:22:55PM -0700, Andy Lutomirski wrote:
>>>>>
>>>>> On Tue, Apr 29, 2014 at 4:20 PM, Marian Marinov <[email protected]> wrote:
>>>>>>
>>>>>> On 04/30/2014 01:45 AM, Andy Lutomirski wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 04/29/2014 03:29 PM, Serge Hallyn wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Quoting Marian Marinov ([email protected]):
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 04/30/2014 01:02 AM, Serge Hallyn wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Quoting Marian Marinov ([email protected]):
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 04/29/2014 09:52 PM, Serge Hallyn wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Quoting Theodore Ts'o ([email protected]):
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm proposing a fix to this, by replacing the
>>>>>>>>>>>>>> capable(CAP_LINUX_IMMUTABLE)
>>>>>>>>>>>>>> check with ns_capable(current_cred()->user_ns,
>>>>>>>>>>>>>> CAP_LINUX_IMMUTABLE).
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Um, wouldn't it be better to simply fix the capable() function?
>>>>>>>>>>>>>
>>>>>>>>>>>>> /**
>>>>>>>>>>>>> * capable - Determine if the current task has a superior
>>>>>>>>>>>>> capability in effect
>>>>>>>>>>>>> * @cap: The capability to be tested for
>>>>>>>>>>>>> *
>>>>>>>>>>>>> * Return true if the current task has the given superior
>>>>>>>>>>>>> capability currently
>>>>>>>>>>>>> * available for use, false if not.
>>>>>>>>>>>>> *
>>>>>>>>>>>>> * This sets PF_SUPERPRIV on the task if the capability is
>>>>>>>>>>>>> available on the
>>>>>>>>>>>>> * assumption that it's about to be used.
>>>>>>>>>>>>> */
>>>>>>>>>>>>> bool capable(int cap)
>>>>>>>>>>>>> {
>>>>>>>>>>>>> return ns_capable(&init_user_ns, cap);
>>>>>>>>>>>>> }
>>>>>>>>>>>>> EXPORT_SYMBOL(capable);
>>>>>>>>>>>>>
>>>>>>>>>>>>> The documentation states that it is for "the current task", and
>>>>>>>>>>>>> I
>>>>>>>>>>>>> can't imagine any use case, where user namespaces are in
>>>>>>>>>>>>> effect,
>>>>>>>>>>>>> where
>>>>>>>>>>>>> using init_user_ns would ever make sense.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> the init_user_ns represents the user_ns owning the object, not
>>>>>>>>>>>> the
>>>>>>>>>>>> subject.
>>>>>>>>>>>>
>>>>>>>>>>>> The patch by Marian is wrong. Anyone can do
>>>>>>>>>>>> 'clone(CLONE_NEWUSER)',
>>>>>>>>>>>> setuid(0), execve, and end up satisfying
>>>>>>>>>>>> 'ns_capable(current_cred()->userns,
>>>>>>>>>>>> CAP_SYS_IMMUTABLE)' by definition.
>>>>>>>>>>>>
>>>>>>>>>>>> So NACK to that particular patch. I'm not sure, but IIUC it
>>>>>>>>>>>> should
>>>>>>>>>>>> be
>>>>>>>>>>>> safe to check against the userns owning the inode?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> So what you are proposing is to replace
>>>>>>>>>>> 'ns_capable(current_cred()->userns, CAP_SYS_IMMUTABLE)' with
>>>>>>>>>>> 'inode_capable(inode, CAP_SYS_IMMUTABLE)' ?
>>>>>>>>>>>
>>>>>>>>>>> I agree that this is more sane.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Right, and I think the two operations you're looking at seem sane
>>>>>>>>>> to allow.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> If you are ok with this patch, I will fix all file systems and send
>>>>>>>>> patches.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Sounds good, thanks.
>>>>>>>>
>>>>>>>>> Signed-off-by: Marian Marinov <[email protected]>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Acked-by: Serge E. Hallyn
>>>>>>>> <serge.hallyn-GeWIH/[email protected]>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Wait, what?
>>>>>>>
>>>>>>> Inodes aren't owned by user namespaces; they're owned by users. And
>>>>>>> any
>>>>>>> user can arrange to have a user namespace in which they pass an
>>>>>>> inode_capable check on any inode that they own.
>>>>>>>
>>>>>>> Presumably there's a reason that CAP_SYS_IMMUTABLE is needed. If
>>>>>>> this
>>>>>>> gets merged, then it would be better to just drop CAP_SYS_IMMUTABLE
>>>>>>> entirely.
>>>>>>
>>>>>>
>>>>>>
>>>>>> The problem I'm trying to solve is this:
>>>>>>
>>>>>> container with its own user namespace and CAP_SYS_IMMUTABLE should be
>>>>>> able
>>>>>> to use chattr on all files witch this container has access to.
>>>>>>
>>>>>> Unfortunately with the capable(CAP_SYS_IMMUTABLE) check this is not
>>>>>> working.
>>>>>>
>>>>>> With the proposed two fixes CAP_SYS_IMMUTABLE started working in the
>>>>>> container.
>>>>>>
>>>>>> The first solution got its user namespace from the currently running
>>>>>> process
>>>>>> and the second gets its user namespace from the currently opened
>>>>>> inode.
>>>>>>
>>>>>> So what would be the best solution in this case?
>>>>>
>>>>>
>>>>> I'd suggest adding a mount option like fs_owner_uid that names a uid
>>>>> that owns, in the sense of having unlimited access to, a filesystem.
>>>>> Then anyone with caps on a namespace owned by that uid could do
>>>>> whatever.
>>>>>
>>>>> Eric?
>>>>>
>>>>> --Andy
>>>>
>>>>
>>>> The most obvious problem I can think of with "do whatever" is that this
>>>> will likely include mknod of char and block devices which you can then
>>>> chown/chmod as you wish and use to access any devices on the system from
>>>> an unprivileged container.
>>>> This can however be mitigated by using the devices cgroup controller.
>>>
>>>
>>> Or 'nodev'. setuid/setgid may have the same problem, too.
>>>
>>> Implementing something like this would also make CAP_DAC_READ_SEARCH
>>> and CAP_DAC_OVERRIDE work.
>>>
>>> Arguably it should be impossible to mount such a thing in the first
>>> place without global privilege.
>>>
>>>>
>>>> You also probably wouldn't want any unprivileged user from the host to
>>>> find a way to access that mounted filesytem but so long as you do the
>>>> mount in a separate mountns and don't share uids between the host and
>>>> the container, that should be fine too.
>>>
>>>
>>> This part should be a nonissue -- an unprivileged user who has the
>>> right uid owns the namespace anyway, so this is the least of your
>>> worries.
>>>
>>> --Andy
>>
>>
>> It should be a nonissue so long as we make sure that a file owned by a
>> uid outside the scope of the container may not be changed even though
>> fs_owner_uid is set. Otherwise, it's just a matter of chmod +S on say
>> a shell and anyone who can see the fs from the host will be getting a
>> root shell (assuming said file is owned by the host's uid 0).
>>
>> So that's restricting slightly what "do whatever" would do in this case.
>>
>
> In my case I give an LVM volume to each container and limit the container to
> only this block device using the devices cgroup.
> So the inode_capable() fix worked like a charm for me.
> The container can not see any filesystem other then its own.
> And I have another patch for my kernel that prohibits setns from cgroup
> other then / which prevents programs from the container from getting out.
> clone() can be used to create new namespaces but can not be used to attach
> to already created namespaces.
Doesn't matter -- the risk here is that an attacker outside the
namespace can get an fd that points to a directory in the namespace.
SCM_RIGHTS would be the major vector.
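For anyone not familiar with the mechanism, this is roughly what passing a directory fd over a unix socket looks like. It is a minimal sketch assuming Linux and a connected AF_UNIX socket between a process inside the container and one outside; send_fd(), recv_fd() and share_directory() are illustrative helper names, and only the sendmsg()/recvmsg() ancillary-data calls are the real interface.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Ancillary-data buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send one open file descriptor over a connected AF_UNIX socket. */
int send_fd(int sock, int fd)
{
    char byte = 'x';    /* one byte of ordinary data travels with the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor; returns it, or -1 on failure. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Open a directory and hand the descriptor to whoever holds the peer socket. */
int share_directory(int sock, const char *path)
{
    int dirfd, rc;

    assert(path != NULL);
    dirfd = open(path, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;
    rc = send_fd(sock, dirfd);
    close(dirfd);       /* the in-flight copy keeps the directory open */
    return rc;
}
```

Once the receiver has the descriptor it can use openat() and friends relative to it, regardless of which filesystems its own mount namespace can see.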
--Andy
Quoting Andy Lutomirski ([email protected]):
> On 04/29/2014 03:29 PM, Serge Hallyn wrote:
> > Quoting Marian Marinov ([email protected]):
> >> On 04/30/2014 01:02 AM, Serge Hallyn wrote:
> >>> Quoting Marian Marinov ([email protected]):
> >>>> On 04/29/2014 09:52 PM, Serge Hallyn wrote:
> >>>>> Quoting Theodore Ts'o ([email protected]):
> >>>>>> On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:
> >>>>>>>
> >>>>>>> I'm proposing a fix to this, by replacing the capable(CAP_LINUX_IMMUTABLE)
> >>>>>>> check with ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE).
> >>>>>>
> >>>>>> Um, wouldn't it be better to simply fix the capable() function?
> >>>>>>
> >>>>>> /**
> >>>>>> * capable - Determine if the current task has a superior capability in effect
> >>>>>> * @cap: The capability to be tested for
> >>>>>> *
> >>>>>> * Return true if the current task has the given superior capability currently
> >>>>>> * available for use, false if not.
> >>>>>> *
> >>>>>> * This sets PF_SUPERPRIV on the task if the capability is available on the
> >>>>>> * assumption that it's about to be used.
> >>>>>> */
> >>>>>> bool capable(int cap)
> >>>>>> {
> >>>>>> return ns_capable(&init_user_ns, cap);
> >>>>>> }
> >>>>>> EXPORT_SYMBOL(capable);
> >>>>>>
> >>>>>> The documentation states that it is for "the current task", and I
> >>>>>> can't imagine any use case, where user namespaces are in effect, where
> >>>>>> using init_user_ns would ever make sense.
> >>>>>
> >>>>> the init_user_ns represents the user_ns owning the object, not the
> >>>>> subject.
> >>>>>
> >>>>> The patch by Marian is wrong. Anyone can do 'clone(CLONE_NEWUSER)',
> >>>>> setuid(0), execve, and end up satisfying 'ns_capable(current_cred()->user_ns,
> >>>>> CAP_LINUX_IMMUTABLE)' by definition.
> >>>>>
> >>>>> So NACK to that particular patch. I'm not sure, but IIUC it should be
> >>>>> safe to check against the userns owning the inode?
> >>>>>
> >>>>
> >>>> So what you are proposing is to replace 'ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE)' with
> >>>> 'inode_capable(inode, CAP_LINUX_IMMUTABLE)' ?
> >>>>
> >>>> I agree that this is more sane.
> >>>
> >>> Right, and I think the two operations you're looking at seem sane
> >>> to allow.
> >>
> >> If you are ok with this patch, I will fix all file systems and send patches.
> >
> > Sounds good, thanks.
> >
> >> Signed-off-by: Marian Marinov <[email protected]>
> >
> > Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/[email protected]>
>
> Wait, what?
>
> Inodes aren't owned by user namespaces; they're owned by users. And any
> user can arrange to have a user namespace in which they pass an
> inode_capable check on any inode that they own.
>
> Presumably there's a reason that CAP_LINUX_IMMUTABLE is needed. If this
Sigh, yeah... I just don't understand what it is. But you're right.
> gets merged, then it would be better to just drop CAP_LINUX_IMMUTABLE
> entirely.
>
> Nacked-by: Andy Lutomirski <[email protected]>
I forget the details, but there was another case where I wanted to
have the userns which 'owns' the whole fs available. I guess we'd
have to check against that instead of using inode_capable.
-serge
Quoting Andy Lutomirski ([email protected]):
> On Tue, Apr 29, 2014 at 5:01 PM, Stéphane Graber <[email protected]> wrote:
> > On Tue, Apr 29, 2014 at 04:51:54PM -0700, Andy Lutomirski wrote:
> >> On Tue, Apr 29, 2014 at 4:47 PM, Stéphane Graber <[email protected]> wrote:
> >> > On Tue, Apr 29, 2014 at 04:22:55PM -0700, Andy Lutomirski wrote:
> >> >> On Tue, Apr 29, 2014 at 4:20 PM, Marian Marinov <[email protected]> wrote:
> >> >> > On 04/30/2014 01:45 AM, Andy Lutomirski wrote:
> >> >> >>
> >> >> >> On 04/29/2014 03:29 PM, Serge Hallyn wrote:
> >> >> >>>
> >> >> >>> Quoting Marian Marinov ([email protected]):
> >> >> >>>>
> >> >> >>>> On 04/30/2014 01:02 AM, Serge Hallyn wrote:
> >> >> >>>>>
> >> >> >>>>> Quoting Marian Marinov ([email protected]):
> >> >> >>>>>>
> >> >> >>>>>> On 04/29/2014 09:52 PM, Serge Hallyn wrote:
> >> >> >>>>>>>
> >> >> >>>>>>> Quoting Theodore Ts'o ([email protected]):
> >> >> >>>>>>>>
> >> >> >>>>>>>> On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:
> >> >> >>>>>>>>>
> >> >> >>>>>>>>>
> >> >> >>>>>>>>> I'm proposing a fix to this, by replacing the
> >> >> >>>>>>>>> capable(CAP_LINUX_IMMUTABLE)
> >> >> >>>>>>>>> check with ns_capable(current_cred()->user_ns,
> >> >> >>>>>>>>> CAP_LINUX_IMMUTABLE).
> >> >> >>>>>>>>
> >> >> >>>>>>>>
> >> >> >>>>>>>> Um, wouldn't it be better to simply fix the capable() function?
> >> >> >>>>>>>>
> >> >> >>>>>>>> /**
> >> >> >>>>>>>> * capable - Determine if the current task has a superior
> >> >> >>>>>>>> capability in effect
> >> >> >>>>>>>> * @cap: The capability to be tested for
> >> >> >>>>>>>> *
> >> >> >>>>>>>> * Return true if the current task has the given superior
> >> >> >>>>>>>> capability currently
> >> >> >>>>>>>> * available for use, false if not.
> >> >> >>>>>>>> *
> >> >> >>>>>>>> * This sets PF_SUPERPRIV on the task if the capability is
> >> >> >>>>>>>> available on the
> >> >> >>>>>>>> * assumption that it's about to be used.
> >> >> >>>>>>>> */
> >> >> >>>>>>>> bool capable(int cap)
> >> >> >>>>>>>> {
> >> >> >>>>>>>> return ns_capable(&init_user_ns, cap);
> >> >> >>>>>>>> }
> >> >> >>>>>>>> EXPORT_SYMBOL(capable);
> >> >> >>>>>>>>
> >> >> >>>>>>>> The documentation states that it is for "the current task", and I
> >> >> >>>>>>>> can't imagine any use case, where user namespaces are in effect,
> >> >> >>>>>>>> where
> >> >> >>>>>>>> using init_user_ns would ever make sense.
> >> >> >>>>>>>
> >> >> >>>>>>>
> >> >> >>>>>>> the init_user_ns represents the user_ns owning the object, not the
> >> >> >>>>>>> subject.
> >> >> >>>>>>>
> >> >> >>>>>>> The patch by Marian is wrong. Anyone can do 'clone(CLONE_NEWUSER)',
> >> >> >>>>>>> setuid(0), execve, and end up satisfying
> >> >> >>>>>>> 'ns_capable(current_cred()->user_ns,
> >> >> >>>>>>> CAP_LINUX_IMMUTABLE)' by definition.
> >> >> >>>>>>>
> >> >> >>>>>>> So NACK to that particular patch. I'm not sure, but IIUC it should
> >> >> >>>>>>> be
> >> >> >>>>>>> safe to check against the userns owning the inode?
> >> >> >>>>>>>
> >> >> >>>>>>
> >> >> >>>>>> So what you are proposing is to replace
> >> >> >>>>>> 'ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE)' with
> >> >> >>>>>> 'inode_capable(inode, CAP_LINUX_IMMUTABLE)' ?
> >> >> >>>>>>
> >> >> >>>>>> I agree that this is more sane.
> >> >> >>>>>
> >> >> >>>>>
> >> >> >>>>> Right, and I think the two operations you're looking at seem sane
> >> >> >>>>> to allow.
> >> >> >>>>
> >> >> >>>>
> >> >> >>>> If you are ok with this patch, I will fix all file systems and send
> >> >> >>>> patches.
> >> >> >>>
> >> >> >>>
> >> >> >>> Sounds good, thanks.
> >> >> >>>
> >> >> >>>> Signed-off-by: Marian Marinov <[email protected]>
> >> >> >>>
> >> >> >>>
> >> >> >>> Acked-by: Serge E. Hallyn
> >> >> >>> <serge.hallyn-GeWIH/[email protected]>
> >> >> >>
> >> >> >>
> >> >> >> Wait, what?
> >> >> >>
> >> >> >> Inodes aren't owned by user namespaces; they're owned by users. And any
> >> >> >> user can arrange to have a user namespace in which they pass an
> >> >> >> inode_capable check on any inode that they own.
> >> >> >>
> >> >> >> Presumably there's a reason that CAP_LINUX_IMMUTABLE is needed. If this
> >> >> >> gets merged, then it would be better to just drop CAP_LINUX_IMMUTABLE
> >> >> >> entirely.
> >> >> >
> >> >> >
> >> >> > The problem I'm trying to solve is this:
> >> >> >
> >> >> > A container with its own user namespace and CAP_LINUX_IMMUTABLE should be able
> >> >> > to use chattr on all files which this container has access to.
> >> >> >
> >> >> > Unfortunately with the capable(CAP_LINUX_IMMUTABLE) check this is not working.
> >> >> >
> >> >> > With the proposed two fixes CAP_LINUX_IMMUTABLE started working in the
> >> >> > container.
> >> >> >
> >> >> > The first solution got its user namespace from the currently running process
> >> >> > and the second gets its user namespace from the currently opened inode.
> >> >> >
> >> >> > So what would be the best solution in this case?
> >> >>
> >> >> I'd suggest adding a mount option like fs_owner_uid that names a uid
> >> >> that owns, in the sense of having unlimited access to, a filesystem.
> >> >> Then anyone with caps on a namespace owned by that uid could do
> >> >> whatever.
> >> >>
> >> >> Eric?
> >> >>
> >> >> --Andy
> >> >
> >> > The most obvious problem I can think of with "do whatever" is that this
> >> > will likely include mknod of char and block devices which you can then
> >> > chown/chmod as you wish and use to access any devices on the system from
> >> > an unprivileged container.
> >> > This can however be mitigated by using the devices cgroup controller.
> >>
> >> Or 'nodev'. setuid/setgid may have the same problem, too.
> >>
> >> Implementing something like this would also make CAP_DAC_READ_SEARCH
> >> and CAP_DAC_OVERRIDE work.
> >>
> >> Arguably it should be impossible to mount such a thing in the first
> >> place without global privilege.
> >>
> >> >
> >> > You also probably wouldn't want any unprivileged user from the host to
> >> > find a way to access that mounted filesystem but so long as you do the
> >> > mount in a separate mountns and don't share uids between the host and
> >> > the container, that should be fine too.
> >>
> >> This part should be a nonissue -- an unprivileged user who has the
> >> right uid owns the namespace anyway, so this is the least of your
> >> worries.
> >>
> >> --Andy
> >
> > It should be a nonissue so long as we make sure that a file owned by a
> > uid outside the scope of the container may not be changed even though
> > fs_owner_uid is set. Otherwise, it's just a matter of chmod +S on say
> > a shell and anyone who can see the fs from the host will be getting a
> > root shell (assuming said file is owned by the host's uid 0).
>
> I feel like that's too fragile. I'd rather add a rule that one of
yeah I don't want to rush something like that. I'd rather stash
the userns of the task which did the mounting and check against
that. Note that would make it worthless unless and until we allowed
mounting from non-init userns, but then we can only claim "our fs
superblock readers suck and therefore containers can't mount an fs"
so long before we start to feel some shame and audit them...
> these filesystems always acts like it's nosuid unless you're inside a
> user namespace that matches fs_owner_uid.
>
> Maybe even that is too weird. How about setuid, setgid, and fcaps
> only work on mounts that are in mount namespaces that are owned by the
> current user namespace or one of its parents? IOW, a struct mount is
> only trusted if mnt->mnt_ns->user_ns == current user ns or one of its
> parents?
>
> Untrusted mounts would act like they are nosuid,nodev. Someone can
> try to figure out a safe way to relax nodev at some point.
>
> --Andy
On Tue, Apr 29, 2014 at 5:21 PM, Serge Hallyn <[email protected]> wrote:
> Quoting Andy Lutomirski ([email protected]):
>> > It should be a nonissue so long as we make sure that a file owned by a
>> > uid outside the scope of the container may not be changed even though
>> > fs_owner_uid is set. Otherwise, it's just a matter of chmod +S on say
>> > a shell and anyone who can see the fs from the host will be getting a
>> > root shell (assuming said file is owned by the host's uid 0).
>>
>> I feel like that's too fragile. I'd rather add a rule that one of
>
> yeah I don't want to rush something like that. I'd rather stash
> the userns of the task which did the mounting and check against
> that. Note that would make it worthless unless and until we allowed
> mounting from non-init userns, but then we can only claim "our fs
> superblock readers suck and therefore containers can't mount an fs"
> so long before we start to feel some shame and audit them...
>
>> these filesystems always acts like it's nosuid unless you're inside a
>> user namespace that matches fs_owner_uid.
>>
>> Maybe even that is too weird. How about setuid, setgid, and fcaps
>> only work on mounts that are in mount namespaces that are owned by the
>> current user namespace or one of its parents? IOW, a struct mount is
>> only trusted if mnt->mnt_ns->user_ns == current user ns or one of its
>> parents?
>>
>> Untrusted mounts would act like they are nosuid,nodev. Someone can
>> try to figure out a safe way to relax nodev at some point.
Do you like this variant? We could add a way for global root to mount
an fs on behalf of a userns. I'd rather this be more explicit than
just mounting it in a mount ns owned by the user namespace, though.
--Andy
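
For illustration, the proposed rule ("a struct mount is only trusted if mnt->mnt_ns->user_ns == current user ns or one of its parents") can be modelled in plain userspace C. This is only a sketch of the idea under discussion, not kernel code: the struct layouts and the names mount_is_trusted() and setuid_honoured() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel objects. */
struct user_ns {
	struct user_ns *parent;		/* NULL for the initial user namespace */
};

struct mnt_ns {
	struct user_ns *user_ns;	/* user namespace owning this mount namespace */
};

struct mount {
	struct mnt_ns *mnt_ns;
};

/* A mount is trusted if its mount namespace is owned by the caller's
 * user namespace or by one of that namespace's ancestors. */
bool mount_is_trusted(const struct mount *mnt, const struct user_ns *cur)
{
	const struct user_ns *ns;

	assert(mnt && mnt->mnt_ns && cur);
	for (ns = cur; ns; ns = ns->parent)
		if (mnt->mnt_ns->user_ns == ns)
			return true;
	return false;
}

/* Untrusted mounts behave as if mounted nosuid (and nodev). */
bool setuid_honoured(const struct mount *mnt, const struct user_ns *cur,
		     bool mounted_nosuid)
{
	return !mounted_nosuid && mount_is_trusted(mnt, cur);
}
```

Under this model a mount created inside a container is trusted by tasks in that container (and its descendants), but a host task or a task in a sibling namespace sees it as nosuid,nodev, while mounts made by the host stay trusted everywhere below it.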
On Wed, Apr 30, 2014 at 12:16:41AM +0000, Serge Hallyn wrote:
> I forget the details, but there was another case where I wanted to
> have the userns which 'owns' the whole fs available. I guess we'd
> have to check against that instead of using inode_capable.
Yes, that sounds right.
And *please* tell me that under no circumstances is anyone other
than root@init_user_ns allowed to use mknod....
- Ted
On Tue, Apr 29, 2014 at 5:32 PM, Theodore Ts'o <[email protected]> wrote:
> On Wed, Apr 30, 2014 at 12:16:41AM +0000, Serge Hallyn wrote:
>> I forget the details, but there was another case where I wanted to
>> have the userns which 'owns' the whole fs available. I guess we'd
>> have to check against that instead of using inode_capable.
>
> Yes, that sounds right.
>
> And *please* tell me that under no circumstances is anyone other
> than root@init_user_ns allowed to use mknod....
I haven't read the code, but I tried it the other day, and I got
-EPERM. So we're okay for now. (Well, other than the issue I just
sent to [email protected], but that's not quite the same thing.)
--Andy
Quoting Theodore Ts'o ([email protected]):
> On Wed, Apr 30, 2014 at 12:16:41AM +0000, Serge Hallyn wrote:
> > I forget the details, but there was another case where I wanted to
> > have the userns which 'owns' the whole fs available. I guess we'd
> > have to check against that instead of using inode_capable.
>
> Yes, that sounds right.
>
> And *please* tell me that under no circumstances is anyone other
> than root@init_user_ns allowed to use mknod....
That's the case. We've considered making exceptions for things like
/dev/null, but in practice bind-mounting devices from the host has
worked out just fine.
Quoting Andy Lutomirski ([email protected]):
> On Tue, Apr 29, 2014 at 5:21 PM, Serge Hallyn <[email protected]> wrote:
> > Quoting Andy Lutomirski ([email protected]):
> >> > It should be a nonissue so long as we make sure that a file owned by a
> >> > uid outside the scope of the container may not be changed even though
> >> > fs_owner_uid is set. Otherwise, it's just a matter of chmod +S on say
> >> > a shell and anyone who can see the fs from the host will be getting a
> >> > root shell (assuming said file is owned by the host's uid 0).
> >>
> >> I feel like that's too fragile. I'd rather add a rule that one of
> >
> > yeah I don't want to rush something like that. I'd rather stash
> > the userns of the task which did the mounting and check against
> > that. Note that would make it worthless unless and until we allowed
> > mounting from non-init userns, but then we can only claim "our fs
> > superblock readers suck and therefore containers can't mount an fs"
> > so long before we start to feel some shame and audit them...
> >
> >> these filesystems always acts like it's nosuid unless you're inside a
> >> user namespace that matches fs_owner_uid.
> >>
> >> Maybe even that is too weird. How about setuid, setgid, and fcaps
> >> only work on mounts that are in mount namespaces that are owned by the
> >> current user namespace or one of its parents? IOW, a struct mount is
> >> only trusted if mnt->mnt_ns->user_ns == current user ns or one of its
> >> parents?
> >>
> >> Untrusted mounts would act like they are nosuid,nodev. Someone can
> >> try to figure out a safe way to relax nodev at some point.
>
> Do you like this variant? We could add a way for global root to mount
> an fs on behalf of a userns. I'd rather this be more explicit than
> just mounting it in a mount ns owned by the user namespace, though.
I'm missing something. Which mnt are you talking about? A user
can just clone a new userns and then clone(CLONE_NEWNS) to get a set
of mounts owned by himself... We need to get a mnt (or a cred or
straight to a userns) tied to the first mount of the superblock, istm.
-serge
On Tue, Apr 29, 2014 at 5:44 PM, Serge Hallyn <[email protected]> wrote:
> Quoting Andy Lutomirski ([email protected]):
>> On Tue, Apr 29, 2014 at 5:21 PM, Serge Hallyn <[email protected]> wrote:
>> > Quoting Andy Lutomirski ([email protected]):
>> >> > It should be a nonissue so long as we make sure that a file owned by a
>> >> > uid outside the scope of the container may not be changed even though
>> >> > fs_owner_uid is set. Otherwise, it's just a matter of chmod +S on say
>> >> > a shell and anyone who can see the fs from the host will be getting a
>> >> > root shell (assuming said file is owned by the host's uid 0).
>> >>
>> >> I feel like that's too fragile. I'd rather add a rule that one of
>> >
>> > yeah I don't want to rush something like that. I'd rather stash
>> > the userns of the task which did the mounting and check against
>> > that. Note that would make it worthless unless and until we allowed
>> > mounting from non-init userns, but then we can only claim "our fs
>> > superblock readers suck and therefore containers can't mount an fs"
>> > so long before we start to feel some shame and audit them...
>> >
>> >> these filesystems always acts like it's nosuid unless you're inside a
>> >> user namespace that matches fs_owner_uid.
>> >>
>> >> Maybe even that is too weird. How about setuid, setgid, and fcaps
>> >> only work on mounts that are in mount namespaces that are owned by the
>> >> current user namespace or one of its parents? IOW, a struct mount is
>> >> only trusted if mnt->mnt_ns->user_ns == current user ns or one of its
>> >> parents?
>> >>
>> >> Untrusted mounts would act like they are nosuid,nodev. Someone can
>> >> try to figure out a safe way to relax nodev at some point.
>>
>> Do you like this variant? We could add a way for global root to mount
>> an fs on behalf of a userns. I'd rather this be more explicit than
>> just mounting it in a mount ns owned by the user namespace, though.
>
> I'm missing something. Which mnt are you talking about? A user
> can just clone a new userns and then clone(CLONE_NEWNS) to get a set
> of mounts owned by himself... We need to get a mnt (or a cred or
> straight to a userns) tied to the first mount of the superblock, istm.
Sure, but then that user is the only user that ends up trusting the
mount. This could end up being surprising, though -- it would be
weird for a bind mount of an implicitly nosuid mount to end up not
being nosuid as seen by the mounter.
This still feels a bit overcomplicated. Grr. I do like the idea
that, if someone creates a tmpfs mount, sticks a setuid file in it,
and hands someone outside the namespace an fd to the mount, the
file won't be setuid as seen from outside. This will make using the
same uids in different containers a lot safer, although it still won't
really be safe.
Another wart: chroot on a directory in someone else's mount namespace
works, I think. That just seems wrong, although I don't immediately
see how it's a problem.
--Andy
Theodore Ts'o <[email protected]> writes:
> On Wed, Apr 30, 2014 at 12:16:41AM +0000, Serge Hallyn wrote:
>> I forget the details, but there was another case where I wanted to
>> have the userns which 'owns' the whole fs available. I guess we'd
>> have to check against that instead of using inode_capable.
>
> Yes, that sounds right.
>
> And *please* tell me that under no circumstances is anyone other
> than root@init_user_ns allowed to use mknod....
Nope. mknod is not allowed. capable(CAP_MKNOD) is required
and I can't see any reason to change that.
As a rule of thumb, the only additional actions allowed in a user
namespace above and beyond what an ordinary unprivileged user would be
allowed to do are those things which we disallow only because they
could confuse a setuid root executable.
If we ever allow the creation of immutable files by unprivileged users
those files would at least have to be kept completely separate from the
files the global root encounters (aka a disjoint mount namespace).
I do not currently see a path to safely using immutable files with just
user namespace root permission.
Eric
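
For illustration, a userspace model of why capable(CAP_MKNOD) fails for root inside a user namespace while a check against the task's own namespace succeeds. The walk below is a simplified sketch of the namespace-aware capability check, not the kernel source: the struct layouts and the names model_ns_capable() and model_capable() are hypothetical, and only the CAP_* numbers match the real ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CAP_LINUX_IMMUTABLE	9
#define CAP_MKNOD		27

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct user_ns {
	struct user_ns *parent;	/* NULL for the initial namespace */
	unsigned int owner;	/* uid, in the parent, that created this namespace */
};

struct cred {
	struct user_ns *user_ns;		/* namespace the task lives in */
	unsigned int euid;
	unsigned long long cap_effective;	/* bitmask of effective caps */
};

/* Does the task described by cred hold cap relative to the target namespace? */
bool model_ns_capable(const struct cred *cred, const struct user_ns *target,
		      int cap)
{
	const struct user_ns *ns = target;

	assert(cred && target);
	for (;;) {
		/* Same namespace: the task's own capability set decides. */
		if (ns == cred->user_ns)
			return (cred->cap_effective >> cap) & 1;
		/* Reached the initial namespace without a match: denied. */
		if (ns->parent == NULL)
			return false;
		/* The creator of a namespace holds all caps over it. */
		if (ns->parent == cred->user_ns && ns->owner == cred->euid)
			return true;
		ns = ns->parent;
	}
}

/* capable() always checks against the initial user namespace. */
bool model_capable(const struct cred *cred, const struct user_ns *init_ns,
		   int cap)
{
	return model_ns_capable(cred, init_ns, cap);
}
```

In this model, uid 0 in a child namespace with a full capability set still fails model_capable() for CAP_MKNOD, whereas an unprivileged host user who created that namespace passes model_ns_capable() against it for CAP_LINUX_IMMUTABLE. The second result is the reason a check against the caller's own namespace grants nothing meaningful.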
Quoting Eric W. Biederman ([email protected]):
> Theodore Ts'o <[email protected]> writes:
>
> > On Wed, Apr 30, 2014 at 12:16:41AM +0000, Serge Hallyn wrote:
> >> I forget the details, but there was another case where I wanted to
> >> have the userns which 'owns' the whole fs available. I guess we'd
> >> have to check against that instead of using inode_capable.
> >
> > Yes, that sounds right.
> >
> > And *please* tell me that under no circumstances is anyone other
> > than root@init_user_ns allowed to use mknod....
>
> Nope. mknod is not allowed. capable(CAP_MKNOD) is required
> and I can't see any reason to change that.
>
> As a rule of thumb, the only additional actions allowed in a user
> namespace above and beyond what an ordinary unprivileged user would be
> allowed to do are those things which we disallow only because they
> could confuse a setuid root executable.
>
>
> If we ever allow the creation of immutable files by unprivileged users
> those files would at least have to be kept completely separate from the
> files the global root encounters (aka a disjoint mount namespace).
>
> I do not currently see a path to safely using immutable files with just
> user namespace root permission.
It's very far off, but I think the path is:
1. at first mount of a blockdev, note the cred (or just userns) which
mounted it
2. work on auditing superblock readers so we can start allowing some
blockdev mounts in user namespaces :)
3. check for privilege against the userns owning a superblock
-serge