Received: by 2002:a25:2c96:0:0:0:0:0 with SMTP id s144csp1190863ybs; Mon, 25 May 2020 09:18:42 -0700 (PDT) X-Google-Smtp-Source: ABdhPJz+WYD7HC+xiBLy7U62I19bxoBBtPcbXeX59wH8AeVfWC90w6h1Zo9dpDHOPeDRGfizSgiG X-Received: by 2002:a50:a7a5:: with SMTP id i34mr6484429edc.55.1590423521961; Mon, 25 May 2020 09:18:41 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1590423521; cv=none; d=google.com; s=arc-20160816; b=XZ9i1eoZ0xgL969QkFqwV3ENhRzij3xyel9N374kQqDKUmYwHxt+9XxfQY5RqUtZcm YXrDU3MtA9gzPmcx6Xg2H+qzlD/3hbkNgvYnP5yn1Lv7TCo1u26RWLxWE0nZCX15V8aj HbFemLuuQRVa6+hfqUzTHCSyhq2cJ/ky0mzpjCwl3BmLMcNR8Ur5a900jy+Pah34QHae 6fTyGDSrUroPCepntC3ZNeoS++djpS41Py0NMj+l+1JMgfbwmE5fqYVEW6mFgu34CEFh RPjW6aKcEUuFgIrzD/oeKSvFh3ak0G2dP9Co2v1IVcwR9FTB1lPy/fd7FZxXtYUqtk2S p5DA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:organization:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature; bh=z+opzIRNoYb9OnlfPQTVpRV9gxLXN2N/oIbR4/2XJL4=; b=uDugzHaM66fWKV/KY/VhMjbAdOV8URj5jFot5CKsog/wRwBOk3H2Ouhn7nZoi61w1i R77Dz02s4LpDQtk4T6GV4s7ufKSb/wDh6G6619/5NFIKNSh0m4W6ZwIY7I2xj/CTwITG NrVJvvXj/xdNifwBO3oAcw5e0HCm8q1kVnCxFpBTpRc8gE3BSGmcTvmcRuaxemdTlx1p /QooLprRaQqcvnaVYb7vnsJikIdhfSpBRU74jmL0rnkzUlGSLZMujQ0F+utTEwJvQTlW +qe1sj6m09qNbr9JOE3fIdJjWnBRTp1DfSh5yVhQyOynn7aFTEYkuDBpbEQjv9jm1uzj dfVQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=bdhr2H6x; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id f22si9629481edx.204.2020.05.25.09.18.18; Mon, 25 May 2020 09:18:41 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=bdhr2H6x; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2388361AbgEYIGI (ORCPT + 99 others); Mon, 25 May 2020 04:06:08 -0400 Received: from us-smtp-1.mimecast.com ([205.139.110.61]:25546 "EHLO us-smtp-1.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2388112AbgEYIGF (ORCPT ); Mon, 25 May 2020 04:06:05 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1590393961; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=z+opzIRNoYb9OnlfPQTVpRV9gxLXN2N/oIbR4/2XJL4=; b=bdhr2H6xkKbXCUcUcr30wOKSSAlYDkfx+V4FkIBCCfFPs55rGDhrPoKEkdH7ZI+SMwpyu1 skSQgSVvTWmOrYduwuM9SSxoCE1YkcnXQsIVUF/9Mp3ra99nG4X0pWVdlIBsfxDu8WH4pp zoIdZ2UMhfxsb7Wa2h9H9oV5OMMoFao= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-8-C6rZBi3eNGWN9IGUps9eaw-1; Mon, 25 May 2020 04:05:57 -0400 X-MC-Unique: C6rZBi3eNGWN9IGUps9eaw-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id C0C901855A01; Mon, 25 May 2020 08:05:54 +0000 (UTC) Received: from dcbz.redhat.com (ovpn-112-116.ams2.redhat.com [10.36.112.116]) by smtp.corp.redhat.com (Postfix) with ESMTPS id BBA726EDAF; Mon, 25 May 2020 08:05:43 +0000 (UTC) Date: Mon, 25 May 2020 10:05:41 +0200 From: Adrian Reber To: Casey Schaufler Cc: Christian Brauner , Eric Biederman , Pavel Emelyanov , Oleg Nesterov , Dmitry Safonov <0x7f454c46@gmail.com>, Andrei Vagin , Nicolas Viennot , =?utf-8?B?TWljaGHFgiBDxYJhcGnFhHNraQ==?= , Kamil Yurtsever , Dirk Petersen , Christine Flood , Mike Rapoport , Radostin Stoyanov , Cyrill Gorcunov , Serge Hallyn , Stephen Smalley , Sargun Dhillon , Arnd Bergmann , Aaron Goidel , linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org, selinux@vger.kernel.org, Eric Paris , Jann Horn Subject: Re: [PATCH] capabilities: Introduce CAP_RESTORE Message-ID: <20200525080541.GF104922@dcbz.redhat.com> References: <20200522055350.806609-1-areber@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Operating-System: Linux (5.6.11-300.fc32.x86_64) X-Load-Average: 0.84 0.48 0.30 X-Unexpected: The Spanish Inquisition X-GnuPG-Key: gpg --recv-keys D3C4906A Organization: Red Hat X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, May 22, 2020 at 09:40:37AM -0700, Casey Schaufler wrote: > On 5/21/2020 10:53 PM, Adrian Reber wrote: > > This enables CRIU to checkpoint and restore a process as non-root. > > I know it sounds pedantic, but could you spell out CRIU once? > While I know that everyone who cares either knows or can guess > what you're talking about, it may be a mystery to some of the > newer kernel developers. Sure. CRIU - Checkpoint/Restore In Userspace. > > Over the last years CRIU upstream has been asked a couple of time if it > > is possible to checkpoint and restore a process as non-root. The answer > > usually was: 'almost'. > > > > The main blocker to restore a process was that selecting the PID of the > > restored process, which is necessary for CRIU, is guarded by CAP_SYS_ADMIN. > > What are the other blockers? Are you going to suggest additional new > capabilities to clear them? As mentioned somewhere else access to /proc//map_files/ would be helpful. Right now I am testing with a JVM and it works without root just with the attached patch. Without access to /proc//map_files/ not everything CRIU can do will actually work, but we are a lot closer to what our users have been asking for. > > In the last two years the questions about checkpoint/restore as non-root > > have increased and especially in the last few months we have seen > > multiple people inventing workarounds. > > Giving a process CAP_SYS_ADMIN is a non-root solution. Yes, but like mentioned somewhere else not a solution that actually works, because CAP_SYS_ADMIN allows too much. Especially for the checkpoint/restore case, we really need one (setting the PID of a new process) and to make it complete a second (reading map_files). Reading the comments in include/uapi/linux/capability.h concerning CAP_SYS_ADMIN it allows the binary to do at least 35 things. The two (three) I mentioned above (ns_last_pid (clone3) map_files) are not mentioned in that list, so CAP_SYS_ADMIN allows probably much more. To allow checkpoint/restore as non-root nobody will give CRIU CAP_SYS_ADMIN because it is too wide. > > The use-cases so far and their workarounds: > > > > * Checkpoint/Restore in an HPC environment in combination with > > a resource manager distributing jobs. Users are always running > > as non root, but there was the desire to provide a way to > > checkpoint and restore long running jobs. > > Workaround: setuid wrapper to start CRIU as root as non-root > > https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c > > This is a classic and well understood mechanism for dealing with > this kind of situation. You could have checkpointer-filecap-sys_admin.c > instead, if you want to reduce use of the super-user. > > > * Another use case to checkpoint/restore processes as non-root > > uses as workaround a non privileged process which cycles through > > PIDs by calling fork() as fast as possible with a rate of > > 100,000 pids/s instead of writing to ns_last_pid > > https://github.com/twosigma/set_ns_last_pid > > Oh dear. > > > * Fast Java startup using checkpoint/restore. > > We have been in contact with JVM developers who are integrating > > CRIU into a JVM to decrease the startup time. > > Workaround so far: patch out CAP_SYS_ADMIN checks in the kernel > > That's not a workaround, it's a policy violation. > Bad JVM! No biscuit! This was used as a proof of concept to see if we can checkpoint and restore a JVM without root. Only the ns_last_pid check was removed to see if it works and it does. > > * Container migration as non root. There are people already > > using CRIU to migrate containers as non-root. The solution > > there is to run it in a user namespace. So if you are able > > to carefully setup your environment with the namespaces > > it is already possible to restore a container/process as non-root. > > This is exactly the kind of situation that user namespaces are > supposed to address. > > > Unfortunately it is not always possible to setup an environment > > in such a way and for easier access to non-root based container > > migration this patch is also required. > > If a user namespace solution is impossible or (more likely) too > expensive, there's always the checkpointer-filecap-sys_admin option. But then again we open up all of CAP_SYS_ADMIN, which is not necessary. > > There are probably a few more things guarded by CAP_SYS_ADMIN required > > to run checkpoint/restore as non-root, > > If you need CAP_SYS_ADMIN anyway you're not gaining anything by > separating out CAP_RESTORE. No, as described we can checkpoint and restore a JVM with this patch and it also solves the problem the set_ns_last_pid fork() loop daemon tries to solve. It is not enough to support the full functionality of CRIU as map_files is also important, but we do not need CAP_SYS_ADMIN and CAP_RESTORE. Only CAP_RESTORE would be necessary. With a new capability users can enable checkpoint/restore as non-root without giving CRIU access to any of the other possibilities offered by CAP_SYS_ADMIN. Setting a PID and map_files have been introduced for CRIU and used to live behind CONFIG_CHECKPOINT_RESTORE. Having a capability for checkpoint/restore would make it easier for CRIU users to run it as non-root and make it very clear what is possible when giving CRIU the new capability. No other things would be allowed than necessary for checkpoint/restore. Setting a PID is most important for the restore part and reading map_files would be helpful during checkpoint. So it actually should be called CAP_CHECKPOINT_RESTORE as Christian mentioned in another email. > > but by applying this patch I can > > already checkpoint and restore processes as non-root. As there are > > already multiple workarounds I would prefer to do it correctly in the > > kernel to avoid that CRIU users are starting to invent more workarounds. > > You've presented a couple of really inappropriate implementations > that would qualify as workarounds. But the other two are completely > appropriate within the system security policy. They don't "get around" > the problem, they use existing mechanisms as they are intended. I agree with the user namespace approach to be appropriate, but not the CAP_SYS_ADMIN approach as CRIU only needs a tiny subset (2 things) of what CAP_SYS_ADMIN allows. > > I have used the following tests to verify that this change works as > > expected by setting the new capability CAP_RESTORE on the two resulting > > test binaries: > > > > $ cat ns_last_pid.c > > // http://efiop-notes.blogspot.com/2014/06/how-to-set-pid-using-nslastpid.html > > #include > > #include > > #include > > #include > > #include > > #include > > > > int main(int argc, char *argv[]) > > { > > pid_t pid, new_pid; > > char buf[32]; > > int fd; > > > > if (argc != 2) > > return 1; > > > > printf("Opening ns_last_pid...\n"); > > fd = open("/proc/sys/kernel/ns_last_pid", O_RDWR | O_CREAT, 0644); > > if (fd < 0) { > > perror("Cannot open ns_last_pid"); > > return 1; > > } > > > > printf("Locking ns_last_pid...\n"); > > if (flock(fd, LOCK_EX)) { > > close(fd); > > printf("Cannot lock ns_last_pid\n"); > > return 1; > > } > > > > pid = atoi(argv[1]); > > snprintf(buf, sizeof(buf), "%d", pid - 1); > > printf("Writing pid-1 to ns_last_pid...\n"); > > if (write(fd, buf, strlen(buf)) != strlen(buf)) { > > printf("Cannot write to buf\n"); > > return 1; > > } > > > > printf("Forking...\n"); > > new_pid = fork(); > > if (new_pid == 0) { > > printf("I am the child!\n"); > > exit(0); > > } else if (new_pid == pid) > > printf("I am the parent. My child got the pid %d!\n", new_pid); > > else > > printf("pid (%d) does not match expected pid (%d)\n", new_pid, pid); > > > > printf("Cleaning up...\n"); > > if (flock(fd, LOCK_UN)) > > printf("Cannot unlock\n"); > > close(fd); > > return 0; > > } > > $ id -u; /home/libcap/ns_last_pid 300000 > > 1001 > > Opening ns_last_pid... > > Locking ns_last_pid... > > Writing pid-1 to ns_last_pid... > > Forking... > > I am the parent. My child got the pid 300000! > > I am the child! > > Cleaning up... > > > > For the clone3() based approach: > > $ cat clone3_set_tid.c > > #define _GNU_SOURCE > > #include > > #include > > #include > > #include > > #include > > #include > > #include > > #include > > #include > > > > #define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr))) > > > > int main(int argc, char *argv[]) > > { > > struct clone_args c_args = { }; > > pid_t pid, new_pid; > > > > if (argc != 2) > > return 1; > > > > pid = atoi(argv[1]); > > c_args.set_tid = ptr_to_u64(&pid); > > c_args.set_tid_size = 1; > > > > printf("Forking...\n"); > > new_pid = syscall(__NR_clone3, &c_args, sizeof(c_args)); > > if (new_pid == 0) { > > printf("I am the child!\n"); > > exit(0); > > } else if (new_pid == pid) > > printf("I am the parent. My child got the pid %d!\n", new_pid); > > else > > printf("pid (%d) does not match expected pid (%d)\n", new_pid, pid); > > printf("Done\n"); > > > > return 0; > > } > > $ id -u; /home/libcap/clone3_set_tid 300000 > > 1001 > > Forking... > > I am the parent. My child got the pid 300000! > > Done > > I am the child! > > > > Signed-off-by: Adrian Reber > > --- > > include/linux/capability.h | 5 +++++ > > include/uapi/linux/capability.h | 9 ++++++++- > > kernel/pid.c | 2 +- > > kernel/pid_namespace.c | 2 +- > > security/selinux/include/classmap.h | 5 +++-- > > 5 files changed, 18 insertions(+), 5 deletions(-) > > > > diff --git a/include/linux/capability.h b/include/linux/capability.h > > index b4345b38a6be..1278313cb2bc 100644 > > --- a/include/linux/capability.h > > +++ b/include/linux/capability.h > > @@ -261,6 +261,11 @@ static inline bool bpf_capable(void) > > return capable(CAP_BPF) || capable(CAP_SYS_ADMIN); > > } > > > > +static inline bool restore_ns_capable(struct user_namespace *ns) > > +{ > > + return ns_capable(ns, CAP_RESTORE) || ns_capable(ns, CAP_SYS_ADMIN); > > +} > > + > > /* audit system wants to get cap info from files as well */ > > extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps); > > > > diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h > > index c7372180a0a9..4bcc4e3d41ff 100644 > > --- a/include/uapi/linux/capability.h > > +++ b/include/uapi/linux/capability.h > > @@ -406,7 +406,14 @@ struct vfs_ns_cap_data { > > */ > > #define CAP_BPF 39 > > > > -#define CAP_LAST_CAP CAP_BPF > > + > > +/* Allow checkpoint/restore related operations */ > > +/* Allow PID selection during clone3() */ > > +/* Allow writing to ns_last_pid */ > > + > > +#define CAP_RESTORE 40 > > + > > +#define CAP_LAST_CAP CAP_RESTORE > > > > #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP) > > > > diff --git a/kernel/pid.c b/kernel/pid.c > > index 3122043fe364..bbc26f2bcff6 100644 > > --- a/kernel/pid.c > > +++ b/kernel/pid.c > > @@ -198,7 +198,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid, > > if (tid != 1 && !tmp->child_reaper) > > goto out_free; > > retval = -EPERM; > > - if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN)) > > + if (!restore_ns_capable(tmp->user_ns)) > > goto out_free; > > set_tid_size--; > > } > > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c > > index 0e5ac162c3a8..f58186b31ce6 100644 > > --- a/kernel/pid_namespace.c > > +++ b/kernel/pid_namespace.c > > @@ -269,7 +269,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write, > > struct ctl_table tmp = *table; > > int ret, next; > > > > - if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN)) > > + if (write && !restore_ns_capable(pid_ns->user_ns)) > > return -EPERM; > > > > /* > > diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h > > index 98e1513b608a..f8b8f12a6ebd 100644 > > --- a/security/selinux/include/classmap.h > > +++ b/security/selinux/include/classmap.h > > @@ -27,9 +27,10 @@ > > "audit_control", "setfcap" > > > > #define COMMON_CAP2_PERMS "mac_override", "mac_admin", "syslog", \ > > - "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf" > > + "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \ > > + "restore" > > > > -#if CAP_LAST_CAP > CAP_BPF > > +#if CAP_LAST_CAP > CAP_RESTORE > > #error New capability defined, please update COMMON_CAP2_PERMS. > > #endif > > > > > > base-commit: e8f3274774b45b5f4e9e3d5cad7ff9f43ae3add5 >