Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751226AbeACBfh (ORCPT + 1 other); Tue, 2 Jan 2018 20:35:37 -0500 Received: from mail-yw0-f193.google.com ([209.85.161.193]:43944 "EHLO mail-yw0-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751105AbeACBfe (ORCPT ); Tue, 2 Jan 2018 20:35:34 -0500 X-Google-Smtp-Source: ACJfBovEx14rCl+vB9GkNxcyqs4cjWh46a6jCxWRwxUdDCwR95zux3+6X9V6CC2PFyfwXhPBnFPZXuauCYfs2GlP3DU= MIME-Version: 1.0 In-Reply-To: <7658ff18-23fc-6aff-6bac-15573f67ddfb@gmail.com> References: <20171205223052.12687-1-mahesh@bandewar.net> <7658ff18-23fc-6aff-6bac-15573f67ddfb@gmail.com> From: =?UTF-8?B?TWFoZXNoIEJhbmRld2FyICjgpK7gpLngpYfgpLYg4KSs4KSC4KSh4KWH4KS14KS+4KSwKQ==?= Date: Tue, 2 Jan 2018 17:35:12 -0800 Message-ID: Subject: Re: [PATCHv3 0/2] capability controlled user-namespaces To: "Michael Kerrisk (man-pages)" Cc: James Morris , LKML , Netdev , Kernel-hardening , Linux API , Kees Cook , Serge Hallyn , "Eric W . Biederman" , Eric Dumazet , David Miller , Mahesh Bandewar Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Return-Path: Hello Michael, I really don't want to turn this into how-to-hack guide but I do see few points in your argument to make the case clearer. Please see the comments inline. On Sat, Dec 30, 2017 at 12:50 AM, Michael Kerrisk (man-pages) wrote: > Hello Mahesh, > > On 12/28/2017 01:45 AM, Mahesh Bandewar (महेश बंडेवार) wrote: >> On Wed, Dec 27, 2017 at 12:23 PM, Michael Kerrisk (man-pages) >> wrote: >>> Hello Mahesh, >>> >>> On 27 December 2017 at 18:09, Mahesh Bandewar (महेश बंडेवार) >>> wrote: >>>> Hello James, >>>> >>>> Seems like I missed your name to be added into the review of this >>>> patch series. Would you be willing be pull this into the security >>>> tree? Serge Hallyn has already ACKed it. >>> >>> We seem to have no formal documentation/specification of this feature. >>> I think that should be written up before this patch goes into >>> mainline... >>> >> absolutely. I have added enough information into the Documentation dir >> relevant to this feature (please look at the individual patches), >> that could be used. I could help if needed. > > Yes, but I think that the documentation is rather incomplete. > I'll also reply to the relevant Documentation thread. > > See also some comments below about this commit message, which > should make things *much* easier for the reader. > >>>> On Tue, Dec 5, 2017 at 2:30 PM, Mahesh Bandewar wrote: >>>>> From: Mahesh Bandewar >>>>> >>>>> TL;DR version >>>>> ------------- >>>>> Creating a sandbox environment with namespaces is challenging >>>>> considering what these sandboxed processes can engage into. e.g. >>>>> CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few. >>>>> Current form of user-namespaces, however, if changed a bit can allow >>>>> us to create a sandbox environment without locking down user- >>>>> namespaces. >>>>> >>>>> Detailed version >>>>> ---------------- >>>>> >>>>> Problem >>>>> ------- >>>>> User-namespaces in the current form have increased the attack surface as >>>>> any process can acquire capabilities which are not available to them (by >>>>> default) by performing combination of clone()/unshare()/setns() syscalls. >>>>> >>>>> #define _GNU_SOURCE >>>>> #include >>>>> #include >>>>> #include >>>>> >>>>> int main(int ac, char **av) >>>>> { >>>>> int sock = -1; >>>>> >>>>> printf("Attempting to open RAW socket before unshare()...\n"); >>>>> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW); >>>>> if (sock < 0) { >>>>> perror("socket() SOCK_RAW failed: "); >>>>> } else { >>>>> printf("Successfully opened RAW-Sock before unshare().\n"); >>>>> close(sock); >>>>> sock = -1; >>>>> } >>>>> >>>>> if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) { >>>>> perror("unshare() failed: "); >>>>> return 1; >>>>> } >>>>> >>>>> printf("Attempting to open RAW socket after unshare()...\n"); >>>>> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW); >>>>> if (sock < 0) { >>>>> perror("socket() SOCK_RAW failed: "); >>>>> } else { >>>>> printf("Successfully opened RAW-Sock after unshare().\n"); >>>>> close(sock); >>>>> sock = -1; >>>>> } >>>>> >>>>> return 0; >>>>> } >>>>> >>>>> The above example shows how easy it is to acquire NET_RAW capabilities >>>>> and once acquired, these processes could take benefit of above mentioned >>>>> or similar issues discovered/undiscovered with malicious intent. > > But you do not actually describe what the problem is. I think > it's not sufficient to simply refer to some CVEs. > Your mail message/commit should clearly describe what the issue is, > rather than leave the reader to decipher a bunch of CVEs, and derive > your concerns from those CVEs. > I have mentioned in the 'problem' section of this log - how easy to acquire 'the capability' and then CVE describes if you have 'the capability' you can exploit! So I'm not sure why there is any decipher needed. Also that is the general example while this patch-set addresses this for any generic capability and not just the one mentioned in the a.m.CVEs. So rather than going deep into the CVE, I will try to demonstrate the modified behavior to give an idea. >>>>> Note >>>>> that this is just an example and the problem/solution is not limited >>>>> to NET_RAW capability *only*. >>>>> >>>>> The easiest fix one can apply here is to lock-down user-namespaces which >>>>> many of the distros do (i.e. don't allow users to create user namespaces), >>>>> but unfortunately that prevents everyone from using them. >>>>> >>>>> Approach >>>>> -------- >>>>> Introduce a notion of 'controlled' user-namespaces. Every process on >>>>> the host is allowed to create user-namespaces (governed by the limit >>>>> imposed by per-ns sysctl) however, mark user-namespaces created by >>>>> sandboxed processes as 'controlled'. Use this 'mark' at the time of >>>>> capability check in conjunction with a global capability whitelist. >>>>> If the capability is not whitelisted, processes that belong to >>>>> controlled user-namespaces will not be allowed. >>>>> >>>>> Once a user-ns is marked as 'controlled'; all its child user- >>>>> namespaces are marked as 'controlled' too. > > How is a user-ns marked as "controlled"? It is not clear at this point. > Please clarify this in your cover mail (and on the Documentation patch.) > Yes, I would add some more text describing how user-ns gets marked as controlled. It's actually part of the Documentation patch in this series but it does makes sense to add some text here to describe it clearly. >>>>> A global whitelist is list of capabilities governed by the >>>>> sysctl which is available to (privileged) user in init-ns to modify > > What "the sysctl? Please name it at this point. (This may be purely a > language issue. Do you mean "...governed by *a* sysctl, > [sysctl-name-inserted-here]"?) > Correct, '..governed by a sysctl var kernel.controlled_userns_caps_whitelist'. >>>>> while it's applicable to all controlled user-namespaces on the host. >>>>> >>>>> Marking user-namespaces controlled without modifying the whitelist is >>>>> equivalent of the current behavior. The default value of whitelist includes >>>>> all capabilities so that the compatibility is maintained. However it gives >>>>> admins fine-grained ability to control various capabilities system wide >>>>> without locking down user-namespaces. > > Is there a way that a process can see whether it is a controlled user-ns > versus an uncontrolled user-ns? I think it would be good to explain that > in this cover mail, and perhaps also in the documentation patch. > There is no direct way of doing this and I'm not sure it's a good idea/investment to add a new syscall just for that. > In general, it's not too obvious what you are trying to do, based on > this commit message. > > Can I suggest including as part of the commit messages a walk through > shell session that demonstrates the use of these interfaces and how > they allow/disallow capabilities. I think such a walkthrough might also > be worth including in the Documentation patch. > OK, I'll add something to address these concerns. Thanks, --mahesh.. > Thanks, > > Michael > > >>>>> >>>>> Please see individual patches in this series. >>>>> >>>>> Mahesh Bandewar (2): >>>>> capability: introduce sysctl for controlled user-ns capability whitelist >>>>> userns: control capabilities of some user namespaces >>>>> >>>>> Documentation/sysctl/kernel.txt | 21 +++++++++++++++++ >>>>> include/linux/capability.h | 7 ++++++ >>>>> include/linux/user_namespace.h | 25 ++++++++++++++++++++ >>>>> kernel/capability.c | 52 +++++++++++++++++++++++++++++++++++++++++ >>>>> kernel/sysctl.c | 5 ++++ >>>>> kernel/user_namespace.c | 4 ++++ >>>>> security/commoncap.c | 8 +++++++ >>>>> 7 files changed, 122 insertions(+) >>>>> >>>>> -- >>>>> 2.15.0.531.g2ccb3012c9-goog >>>>> >>>> -- >>>> To unsubscribe from this list: send the line "unsubscribe linux-api" in >>>> the body of a message to majordomo@vger.kernel.org >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >>> >>> >>> -- >>> Michael Kerrisk >>> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ >>> Linux/UNIX System Programming Training: http://man7.org/training/ >> > > > -- > Michael Kerrisk > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ > Linux/UNIX System Programming Training: http://man7.org/training/