Date: Thu, 6 Aug 2020 01:44:01 -0400
From: "Michael S. Tsirkin"
To: Nick Kralevich
Cc: Lokesh Gidra, Jeffrey Vander Stoep, Andrea Arcangeli,
    Suren Baghdasaryan, Kees Cook, Jonathan Corbet, Alexander Viro,
    Luis Chamberlain, Iurii Zaikin, Mauro Carvalho Chehab, Andrew Morton,
    Andy Shevchenko, Vlastimil Babka, Mel Gorman,
    Sebastian Andrzej Siewior, Peter Xu, Mike Rapoport, Jerome Glisse,
    Shaohua Li, linux-doc@vger.kernel.org, LKML, Linux FS Devel,
    Tim Murray, Minchan Kim, Sandeep Patil, kernel@android.com,
    Daniel Colascione, Kalesh Singh
Subject: Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
Message-ID: <20200806004351-mutt-send-email-mst@kernel.org>
References: <202005200921.2BD5A0ADD@keescook>
    <20200520194804.GJ26186@redhat.com>
    <20200520195134.GK26186@redhat.com>
    <20200520211634.GL26186@redhat.com>
    <20200724093852-mutt-send-email-mst@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Aug 05, 2020 at 05:43:02PM -0700, Nick Kralevich wrote:
> On Fri, Jul 24, 2020 at 6:40 AM Michael S. Tsirkin wrote:
> >
> > On Thu, Jul 23, 2020 at 05:13:28PM -0700, Nick Kralevich wrote:
> > > On Thu, Jul 23, 2020 at 10:30 AM Lokesh Gidra wrote:
> > > > From the discussion so far it seems that there is a consensus that
> > > > patch 1/2 in this series should be upstreamed in any case. Is there
> > > > anything that is pending on that patch?
> > >
> > > That's my reading of this thread too.
> > >
> > > > > > Unless I'm mistaken that you can already enforce bit 1 of the second
> > > > > > parameter of the userfaultfd syscall to be set with seccomp-bpf, this
> > > > > > would be more a question to the Android userland team.
> > > > > >
> > > > > > The question would be: does it ever happen that a seccomp filter isn't
> > > > > > already applied to unprivileged software running without
> > > > > > SYS_CAP_PTRACE capability?
> > > > >
> > > > > Yes.
> > > > >
> > > > > Android uses selinux as our primary sandboxing mechanism. We do use
> > > > > seccomp on a few processes, but we have found that it has a
> > > > > surprisingly high performance cost [1] on arm64 devices so turning it
> > > > > on system wide is not a good option.
> > > > >
> > > > > [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad
> > >
> > > As Jeff mentioned, seccomp is used strategically on Android, but is
> > > not applied to all processes. It's too expensive and impractical when
> > > simpler implementations (such as this sysctl) can exist. It's also
> > > significantly simpler to test a sysctl value for correctness as
> > > opposed to a seccomp filter.
> >
> > Given that selinux is already used system-wide on Android, what is wrong
> > with using selinux to control userfaultfd as opposed to seccomp?
>
> Userfaultfd file descriptors will be generally controlled by SELinux.
> You can see the patchset at
> https://lore.kernel.org/lkml/20200401213903.182112-3-dancol@google.com/
> (which is also referenced in the original commit message for this
> patchset). However, the SELinux patchset doesn't include the ability
> to control FAULT_FLAG_USER / UFFD_USER_MODE_ONLY directly.
>
> SELinux already has the ability to control who gets CAP_SYS_PTRACE,
> which combined with this patch, is largely equivalent to direct
> UFFD_USER_MODE_ONLY checks. Additionally, with the SELinux patch
> above, movement of userfaultfd file descriptors can be mediated by
> SELinux, preventing one process from acquiring userfaultfd descriptors
> of other processes unless allowed by security policy.
>
> It's an interesting question whether finer-grain SELinux support for
> controlling UFFD_USER_MODE_ONLY should be added. I can see some
> advantages to implementing this. However, we don't need to decide that
> now.
>
> Kernel security checks generally break down into DAC (discretionary
> access control) and MAC (mandatory access control) controls. Most
> kernel security features check via both of these mechanisms. Security
> attributes of the system should be settable without necessarily
> relying on an LSM such as SELinux. This patch follows the same basic
> model -- system wide control of a hardening feature is provided by the
> unprivileged_userfaultfd_user_mode_only sysctl (DAC), and if needed,
> SELinux support for this can also be implemented on top of the DAC
> controls.
>
> This DAC/MAC split has been successful in several other security
> features. For example, the ability to map at page zero is controlled
> in DAC via the mmap_min_addr sysctl [1], and via SELinux via the
> mmap_zero access vector [2]. Similarly, access to the kernel ring
> buffer is controlled both via DAC as the dmesg_restrict sysctl [3], as
> well as the SELinux syslog_read [2] check. Indeed, the dmesg_restrict
> sysctl is very similar to this patch -- it introduces a capability
> (CAP_SYSLOG, CAP_SYS_PTRACE) check on access to a sensitive resource.
>
> If we want to ensure that a security feature will be well tested and
> vetted, it's important to not limit its use to LSMs only. This ensures
> that kernel and application developers will always be able to test the
> effects of a security feature, without relying on LSMs like SELinux.
> It also ensures that all distributions can enable this security
> mitigation should it be necessary for their unique environments,
> without introducing an SELinux dependency. And this patch does not
> preclude an SELinux implementation should it be necessary.
>
> Even if we decide to implement fine-grain SELinux controls on
> UFFD_USER_MODE_ONLY, we still need this patch. We shouldn't make this
> an either/or choice between SELinux and this patch. Both are
> necessary.
>
> -- Nick
>
> [1] https://wiki.debian.org/mmap_min_addr
> [2] https://selinuxproject.org/page/NB_ObjectClassesPermissions
> [3] https://www.kernel.org/doc/Documentation/sysctl/kernel.txt

I am not sure I agree that this is similar to dmesg access. Here is
why: it is pretty easy for admins to know whether they run something
that needs to access the kernel ring buffer. Or, if it's a tool
developer poking at dmesg, they can tell admins "we need these
permissions". But it seems impossible for either an admin or a
developer to know that a userfaultfd page, e.g. one used with shared
memory, is accessed from the kernel.

So I guess the question is: how does anyone not running Android know
to set this flag? I get the feeling it's not really possible, and so
for a single-user feature like this a single API seems enough. Given a
choice between a knob an admin is supposed to set and selinux policy
written by presumably knowledgeable OS vendors, I'd opt for the second
option.

Hope this helps.

> > > > > >
> > > > > > If answer is "no" the behavior of the new sysctl in patch 2/2 (in
> > > > > > subject) should be enforceable with minor changes to the BPF
> > > > > > assembly. Otherwise it'd require more changes.
> > >
> > > It would be good to understand what these changes are.
> > >
> > > > > > Why exactly is it preferable to enlarge the surface of attack of the
> > > > > > kernel and take the risk there is a real bug in userfaultfd code (not
> > > > > > just a facilitation of exploiting some other kernel bug) that leads to
> > > > > > a privilege escalation, when you still break 99% of userfaultfd users,
> > > > > > if you set with option "2"?
> > >
> > > I can see your point if you think about the feature as a whole.
> > > However, distributions (such as Android) have specialized knowledge of
> > > their security environments, and may not want to support the typical
> > > usages of userfaultfd. For such distributions, providing a mechanism
> > > to prevent userfaultfd from being useful as an exploit primitive,
> > > while still allowing the very limited use of userfaultfd for userspace
> > > faults only, is desirable. Distributions shouldn't be forced into
> > > supporting 100% of the use cases envisioned by userfaultfd when their
> > > needs may be more specialized, and this sysctl knob empowers
> > > distributions to make this choice for themselves.
> > >
> > > > > > Is the system owner really going to purely run on his systems CRIU
> > > > > > postcopy live migration (which already runs with CAP_SYS_PTRACE) and
> > > > > > nothing else that could break?
> > >
> > > This is a great example of a capability which a distribution may not
> > > want to support, due to distribution specific security policies.
> > >
> > > > > >
> > > > > > Option "2" to me looks with a single possible user, and incidentally
> > > > > > this single user can already enforce model "2" by only tweaking its
> > > > > > seccomp-bpf filters without applying 2/2. It'd be a bug if android
> > > > > > apps runs unprotected by seccomp regardless of 2/2.
> > >
> > > Can you elaborate on what bug is present by processes being
> > > unprotected by seccomp?
> > >
> > > Seccomp cannot be universally applied on Android due to previously
> > > mentioned performance concerns. Seccomp is used in Android primarily
> > > as a tool to enforce the list of allowed syscalls, so that such
> > > syscalls can be audited before being included as part of the Android
> > > API.
> > >
> > > -- Nick
> > >
> > > --
> > > Nick Kralevich | nnk@google.com
> >
> >
> --
> Nick Kralevich | nnk@google.com
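
As a reading aid for the thread, the policy that the proposed sysctl is described as enforcing can be modeled in a few lines. This is only a sketch based on the description quoted in this thread, not the kernel implementation: the names `Caller` and `may_create_userfaultfd` and the boolean parameters are hypothetical, and the model ignores every other check the kernel or an LSM may apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caller:
    """Hypothetical description of the process calling userfaultfd()."""
    has_cap_sys_ptrace: bool


def may_create_userfaultfd(caller: Caller,
                           user_mode_only_flag: bool,
                           sysctl_user_mode_only: bool) -> bool:
    """Model of the check debated in this thread (not kernel code).

    Assumption: when the sysctl is enabled, a caller without
    CAP_SYS_PTRACE must request UFFD_USER_MODE_ONLY, otherwise the
    request is refused. Callers holding the capability are unaffected.
    """
    if not sysctl_user_mode_only:
        return True              # knob off: this check imposes nothing
    if caller.has_cap_sys_ptrace:
        return True              # privileged callers are exempt
    return user_mode_only_flag   # unprivileged: user-mode faults only


if __name__ == "__main__":
    unprivileged = Caller(has_cap_sys_ptrace=False)
    # With the knob on, an unprivileged caller is allowed only if it
    # asks for user-mode-only fault handling.
    print(may_create_userfaultfd(unprivileged, True, True))
    print(may_create_userfaultfd(unprivileged, False, True))
```

Under this model the only way for an LSM to influence the outcome is through who holds CAP_SYS_PTRACE, which is the sense in which the quoted text calls the combination "largely equivalent" to a direct UFFD_USER_MODE_ONLY check.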