Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755406AbaD3ALG (ORCPT ); Tue, 29 Apr 2014 20:11:06 -0400 Received: from mail.siteground.com ([67.19.240.234]:58006 "EHLO mail.siteground.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752127AbaD3ALE (ORCPT ); Tue, 29 Apr 2014 20:11:04 -0400 Message-ID: <53603F8F.6070304@1h.com> Date: Wed, 30 Apr 2014 03:10:55 +0300 From: Marian Marinov Organization: 1H Ltd. User-Agent: Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: =?ISO-8859-1?Q?St=E9phane_Graber?= , Andy Lutomirski CC: "Eric W. Biederman" , Linux Containers , Serge Hallyn , "Ted Ts'o" , Linux Kernel Mailing List , lxc-devel Subject: Re: ioctl CAP_LINUX_IMMUTABLE is checked in the wrong namespace References: <20140429185251.GA27969@ubuntumail> <53601E5B.5050004@1h.com> <20140429220234.GC28410@ubuntumail> <536026B3.1020905@1h.com> <20140429222913.GD28410@ubuntumail> <53602B84.1020304@mit.edu> <536033A9.5070504@1h.com> <20140429234739.GB2997@dakara> <20140430000101.GC2997@dakara> In-Reply-To: <20140430000101.GC2997@dakara> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - mail.siteground.com X-AntiAbuse: Original Domain - vger.kernel.org X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12] X-AntiAbuse: Sender Address Domain - 1h.com X-Get-Message-Sender-Via: mail.siteground.com: none X-Source: X-Source-Args: X-Source-Dir: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/30/2014 03:01 AM, St?phane Graber wrote: > On Tue, Apr 29, 2014 at 04:51:54PM -0700, Andy Lutomirski wrote: >> On Tue, Apr 29, 2014 at 4:47 PM, St?phane Graber wrote: >>> On Tue, Apr 29, 2014 at 04:22:55PM -0700, Andy Lutomirski wrote: >>>> On Tue, Apr 29, 2014 at 4:20 PM, Marian Marinov wrote: >>>>> On 04/30/2014 01:45 AM, Andy Lutomirski wrote: >>>>>> >>>>>> On 04/29/2014 03:29 PM, Serge Hallyn wrote: >>>>>>> >>>>>>> Quoting Marian Marinov (mm-108MBtLGafw@public.gmane.org): >>>>>>>> >>>>>>>> On 04/30/2014 01:02 AM, Serge Hallyn wrote: >>>>>>>>> >>>>>>>>> Quoting Marian Marinov (mm-108MBtLGafw@public.gmane.org): >>>>>>>>>> >>>>>>>>>> On 04/29/2014 09:52 PM, Serge Hallyn wrote: >>>>>>>>>>> >>>>>>>>>>> Quoting Theodore Ts'o (tytso-3s7WtUTddSA@public.gmane.org): >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I'm proposing a fix to this, by replacing the >>>>>>>>>>>>> capable(CAP_LINUX_IMMUTABLE) >>>>>>>>>>>>> check with ns_capable(current_cred()->user_ns, >>>>>>>>>>>>> CAP_LINUX_IMMUTABLE). >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Um, wouldn't it be better to simply fix the capable() function? >>>>>>>>>>>> >>>>>>>>>>>> /** >>>>>>>>>>>> * capable - Determine if the current task has a superior >>>>>>>>>>>> capability in effect >>>>>>>>>>>> * @cap: The capability to be tested for >>>>>>>>>>>> * >>>>>>>>>>>> * Return true if the current task has the given superior >>>>>>>>>>>> capability currently >>>>>>>>>>>> * available for use, false if not. >>>>>>>>>>>> * >>>>>>>>>>>> * This sets PF_SUPERPRIV on the task if the capability is >>>>>>>>>>>> available on the >>>>>>>>>>>> * assumption that it's about to be used. >>>>>>>>>>>> */ >>>>>>>>>>>> bool capable(int cap) >>>>>>>>>>>> { >>>>>>>>>>>> return ns_capable(&init_user_ns, cap); >>>>>>>>>>>> } >>>>>>>>>>>> EXPORT_SYMBOL(capable); >>>>>>>>>>>> >>>>>>>>>>>> The documentation states that it is for "the current task", and I >>>>>>>>>>>> can't imagine any use case, where user namespaces are in effect, >>>>>>>>>>>> where >>>>>>>>>>>> using init_user_ns would ever make sense. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> the init_user_ns represents the user_ns owning the object, not the >>>>>>>>>>> subject. >>>>>>>>>>> >>>>>>>>>>> The patch by Marian is wrong. Anyone can do 'clone(CLONE_NEWUSER)', >>>>>>>>>>> setuid(0), execve, and end up satisfying >>>>>>>>>>> 'ns_capable(current_cred()->userns, >>>>>>>>>>> CAP_SYS_IMMUTABLE)' by definition. >>>>>>>>>>> >>>>>>>>>>> So NACK to that particular patch. I'm not sure, but IIUC it should >>>>>>>>>>> be >>>>>>>>>>> safe to check against the userns owning the inode? >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> So what you are proposing is to replace >>>>>>>>>> 'ns_capable(current_cred()->userns, CAP_SYS_IMMUTABLE)' with >>>>>>>>>> 'inode_capable(inode, CAP_SYS_IMMUTABLE)' ? >>>>>>>>>> >>>>>>>>>> I agree that this is more sane. >>>>>>>>> >>>>>>>>> >>>>>>>>> Right, and I think the two operations you're looking at seem sane >>>>>>>>> to allow. >>>>>>>> >>>>>>>> >>>>>>>> If you are ok with this patch, I will fix all file systems and send >>>>>>>> patches. >>>>>>> >>>>>>> >>>>>>> Sounds good, thanks. >>>>>>> >>>>>>>> Signed-off-by: Marian Marinov >>>>>>> >>>>>>> >>>>>>> Acked-by: Serge E. Hallyn >>>>>>> >>>>>> >>>>>> >>>>>> Wait, what? >>>>>> >>>>>> Inodes aren't owned by user namespaces; they're owned by users. And any >>>>>> user can arrange to have a user namespace in which they pass an >>>>>> inode_capable check on any inode that they own. >>>>>> >>>>>> Presumably there's a reason that CAP_SYS_IMMUTABLE is needed. If this >>>>>> gets merged, then it would be better to just drop CAP_SYS_IMMUTABLE >>>>>> entirely. >>>>> >>>>> >>>>> The problem I'm trying to solve is this: >>>>> >>>>> container with its own user namespace and CAP_SYS_IMMUTABLE should be able >>>>> to use chattr on all files witch this container has access to. >>>>> >>>>> Unfortunately with the capable(CAP_SYS_IMMUTABLE) check this is not working. >>>>> >>>>> With the proposed two fixes CAP_SYS_IMMUTABLE started working in the >>>>> container. >>>>> >>>>> The first solution got its user namespace from the currently running process >>>>> and the second gets its user namespace from the currently opened inode. >>>>> >>>>> So what would be the best solution in this case? >>>> >>>> I'd suggest adding a mount option like fs_owner_uid that names a uid >>>> that owns, in the sense of having unlimited access to, a filesystem. >>>> Then anyone with caps on a namespace owned by that uid could do >>>> whatever. >>>> >>>> Eric? >>>> >>>> --Andy >>> >>> The most obvious problem I can think of with "do whatever" is that this >>> will likely include mknod of char and block devices which you can then >>> chown/chmod as you wish and use to access any devices on the system from >>> an unprivileged container. >>> This can however be mitigated by using the devices cgroup controller. >> >> Or 'nodev'. setuid/setgid may have the same problem, too. >> >> Implementing something like this would also make CAP_DAC_READ_SEARCH >> and CAP_DAC_OVERRIDE work. >> >> Arguably it should be impossible to mount such a thing in the first >> place without global privilege. >> >>> >>> You also probably wouldn't want any unprivileged user from the host to >>> find a way to access that mounted filesytem but so long as you do the >>> mount in a separate mountns and don't share uids between the host and >>> the container, that should be fine too. >> >> This part should be a nonissue -- an unprivileged user who has the >> right uid owns the namespace anyway, so this is the least of your >> worries. >> >> --Andy > > It should be a nonissue so long as we make sure that a file owned by a > uid outside the scope of the container may not be changed even though > fs_owner_uid is set. Otherwise, it's just a matter of chmod +S on say > a shell and anyone who can see the fs from the host will be getting a > root shell (assuming said file is owned by the host's uid 0). > > So that's restricting slightly what "do whatever" would do in this case. > In my case I give an LVM volume to each container and limit the container to only this block device using the devices cgroup. So the inode_capable() fix worked like a charm for me. The container can not see any filesystem other then its own. And I have another patch for my kernel that prohibits setns from cgroup other then / which prevents programs from the container from getting out. clone() can be used to create new namespaces but can not be used to attach to already created namespaces. Marian -- Marian Marinov Founder & CEO of 1H Ltd. Jabber/GTalk: hackman@jabber.org ICQ: 7556201 Mobile: +359 886 660 270 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/