Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933843AbcJRISX (ORCPT ); Tue, 18 Oct 2016 04:18:23 -0400 Received: from mail-lf0-f66.google.com ([209.85.215.66]:34097 "EHLO mail-lf0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751689AbcJRISC (ORCPT ); Tue, 18 Oct 2016 04:18:02 -0400 MIME-Version: 1.0 Reply-To: mtk.manpages@gmail.com In-Reply-To: References: <1476743724-9104-1-git-send-email-john.stultz@linaro.org> From: "Michael Kerrisk (man-pages)" Date: Tue, 18 Oct 2016 10:17:39 +0200 Message-ID: Subject: Re: [PATCH] cgroup: Add new capability to allow a process to migrate other tasks between cgroups To: John Stultz Cc: Andy Lutomirski , lkml , Tejun Heo , Li Zefan , Jonathan Corbet , "open list:CONTROL GROUP (CGROUP)" , Android Kernel Team , Rom Lemarchand , Colin Cross , Dmitry Shmidt , Ricky Zhou , Dmitry Torokhov , Todd Kjos , Christian Poetzsch , Amit Pundir , "Serge E . Hallyn" , Linux API Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6208 Lines: 150 Hi John, On 18 October 2016 at 01:35, John Stultz wrote: > On Mon, Oct 17, 2016 at 3:40 PM, Andy Lutomirski wrote: >> On Mon, Oct 17, 2016 at 3:35 PM, John Stultz wrote: >>> This patch adds CAP_GROUP_MIGRATE and logic to allows a process >>> to migrate other tasks between cgroups. >>> >>> In Android (where this feature originated), the ActivityManager tracks >>> various application states (TOP_APP, FOREGROUND, BACKGROUND, SYSTEM, >>> etc), and then as applications change states, the SchedPolicy logic >>> will migrate the application tasks between different cgroups used >>> to control the different application states (for example, there is a >>> background cpuset cgroup which can limit background tasks to stay >>> on one low-power cpu, and the bg_non_interactive cpuctrl cgroup can >>> then further limit those background tasks to a small percentage of >>> that one cpu's cpu time). >>> >>> However, for security reasons, Android doesn't want to make the >>> system_server (the process that runs the ActivityManager and >>> SchedPolicy logic), run as root. So in the Android common.git >>> kernel, they have some logic to allow cgroups to loosen their >>> permissions so CAP_SYS_NICE tasks can migrate other tasks between >>> cgroups. >>> >>> The approach taken there overloads CAP_SYS_NICE a bit much, and >>> is maybe more complicated then needed. >>> >>> So this patch, as suggested by Tejun, simply adds a new process >>> capability flag (CAP_CGROUP_MIGRATE), and uses it when checking >>> if a task can migrate other tasks between cgroups. >>> >>> I've tested this with AOSP master (though its a bit hacked in as I >>> still need to properly get the selinux bits aware of the new >>> capability bit) with selinux set to permissive and it seems to be >>> working well. >>> >>> Thoughts and feedback would be appreciated! >>> >>> Cc: Tejun Heo >>> Cc: Li Zefan >>> Cc: Jonathan Corbet >>> Cc: cgroups@vger.kernel.org >>> Cc: Android Kernel Team >>> Cc: Rom Lemarchand >>> Cc: Colin Cross >>> Cc: Dmitry Shmidt >>> Cc: Ricky Zhou >>> Cc: Dmitry Torokhov >>> Cc: Todd Kjos >>> Cc: Christian Poetzsch >>> Cc: Amit Pundir >>> Cc: Serge E. Hallyn >>> Cc: linux-api@vger.kernel.org >>> Signed-off-by: John Stultz >>> --- >>> v2: Renamed to just CAP_CGROUP_MIGRATE as reccomended by Tejun >>> --- >>> include/uapi/linux/capability.h | 5 ++++- >>> kernel/cgroup.c | 3 ++- >>> 2 files changed, 6 insertions(+), 2 deletions(-) >>> >>> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h >>> index 49bc062..44d7ff4 100644 >>> --- a/include/uapi/linux/capability.h >>> +++ b/include/uapi/linux/capability.h >>> @@ -349,8 +349,11 @@ struct vfs_cap_data { >>> >>> #define CAP_AUDIT_READ 37 >>> >>> +/* Allow migrating tasks between cgroups */ >>> >>> -#define CAP_LAST_CAP CAP_AUDIT_READ >>> +#define CAP_CGROUP_MIGRATE 38 >>> + >>> +#define CAP_LAST_CAP CAP_CGROUP_MIGRATE >>> >>> #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP) >>> >>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c >>> index 85bc9be..09f84d2 100644 >>> --- a/kernel/cgroup.c >>> +++ b/kernel/cgroup.c >>> @@ -2856,7 +2856,8 @@ static int cgroup_procs_write_permission(struct task_struct *task, >>> */ >>> if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) && >>> !uid_eq(cred->euid, tcred->uid) && >>> - !uid_eq(cred->euid, tcred->suid)) >>> + !uid_eq(cred->euid, tcred->suid) && >>> + !ns_capable(tcred->user_ns, CAP_CGROUP_MIGRATE)) >>> ret = -EACCES; >> >> This logic seems rather confused to me. Without this patch, a user >> can write to procs if it's root *or* it matches the target uid *or* it >> matches the target suid. How does this make sense? How about >> ptrace_may_access(...) || ns_capable(tcred->user_ns, >> CAP_CGROUP_MIGRATE)? > > Though ptrace_may_access would open it also to apps with > CAP_SYS_PTRACE as well, no? > > Would pulling out from __ptrace_may_access the: > if (uid_eq(caller_uid, tcred->euid) && > uid_eq(caller_uid, tcred->suid) && > uid_eq(caller_uid, tcred->uid) && > gid_eq(caller_gid, tcred->egid) && > gid_eq(caller_gid, tcred->sgid) && > gid_eq(caller_gid, tcred->gid)) > goto ok; > > check and creating a new helper that could be shared between them be > the right approach? So, is creating a new capability here necessarily the right approach? Is this operation so unique, or is there an existing silo (not CAP_SYS_ADMIN) that we can re-use? I ask, because we currently use 38 silos out of a possible 64 capabilities, and when everyone chooses single-use capabilities, we will quickly exhaust the silos. I'm not saying that creating a new capability here is wrong, but it is worth further considering the existing silos to see if there is one that is a suitable match. Looking at http://man7.org/linux/man-pages/man7/capabilities.7.html throws up the following possibilities: CAP_SYS_NICE CAP_SYS_PTRACE CAP_SYS_RESOURCE I'm aware that you said above that use CAP_SYS_NICE overloads that capability a bit too much. Maybe it's true, but on the other hand, by my count from dome rough grepping of the kernel source, there are a total of 14 capable() checks for CAP_SYS_NICE, out of a total of around 1256 capable() checks altogether. So, I think this does need to be balanced against the limited number of silos. Also, CAP_SYS_RESOURCE deserves consideration (34 uses in capable() checks). I'd say, since cgroups are about resources, so there's something of a match there., so it's also worth considering. Cheers, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/