Received: by 2002:a25:683:0:0:0:0:0 with SMTP id 125csp64821ybg; Thu, 11 Jun 2020 17:19:26 -0700 (PDT) X-Google-Smtp-Source: ABdhPJyFiXnS4liL15k7NrUv3F4HgZOgT+4pbiHK2n1oe/IC7H48Vyv+wCvD+rV9WVJMnMRxRnnS X-Received: by 2002:aa7:c403:: with SMTP id j3mr9432578edq.294.1591921166196; Thu, 11 Jun 2020 17:19:26 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1591921166; cv=none; d=google.com; s=arc-20160816; b=fcDNvbYAnjlXnca4yBKODT/yNNoBLR3Pr9Q0TVI6vyeNJXEjcbkBLgtOKp2FFkfDG5 vrR92WqMLNwQb2Wff2fJqCxOtvuLY7jpOwQ+R+5IwTGfKQ0xocsM3O7N4pQqbUHJP9A/ nYw0ZNTKdWenHFbBCwyOFFYVHh0ISba+WfU/Le0pua4jcUMyfNekUtaE/AKibFT51xAn t2N4GUT8XtQziaQD00GPSNjfwJg+5Uc79ZWJhw0utJg9CetNQQd4gYkLrmHcmhkmlBgr /qQItr8jSEifXYDeQhwC2ROY24EqgRW7x3DceOi5Hy9qNJcs1b97XfI0xyufzAQjJdp9 BtJw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:in-reply-to:content-disposition :mime-version:references:mail-followup-to:message-id:subject:cc:to :from:date; bh=YxHeVNClq1Vy/iRdtuMxR9+LyuuJKX3yY9F6lz2w1oU=; b=PHb/HMyVLFEToa/des7XE53wn7MdrAcKm5j+bkZeiXINCKwH4w1+DU63H2cIM5y8Kr 8WsXbd35iFgxa7p3tTWeY6PA9wfrVed/GUAcotBHuk3LVqiDxgwRaDsFcyy6HPskLZjS aMTNGi9OZlqe+X8gvCmb3oBkIohg9K01jreSnyCMizlllgi7vU7a4xTmE1e1z58AMtr3 eeg8wVyzFsiwvaZjAQXVcMtH71akr3fpMSx9Faxdad56BdT5HUzvSd1X4XszHMOOw68y LFPdBD3HSgy5vtNg3pHe3WkcmVRioA6gKQ9NZj9SzVXGhj1SNHzFoKoO1iu0mKVYdZB4 WZrQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=vmware.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id e26si2947608ejc.211.2020.06.11.17.19.03; Thu, 11 Jun 2020 17:19:26 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=vmware.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726336AbgFLARK (ORCPT + 99 others); Thu, 11 Jun 2020 20:17:10 -0400 Received: from ex13-edg-ou-002.vmware.com ([208.91.0.190]:53937 "EHLO EX13-EDG-OU-002.vmware.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726285AbgFLARK (ORCPT ); Thu, 11 Jun 2020 20:17:10 -0400 Received: from sc9-mailhost3.vmware.com (10.113.161.73) by EX13-EDG-OU-002.vmware.com (10.113.208.156) with Microsoft SMTP Server id 15.0.1156.6; Thu, 11 Jun 2020 17:17:05 -0700 Received: from localhost (unknown [10.200.193.92]) by sc9-mailhost3.vmware.com (Postfix) with ESMTP id 7A01A40CBD; Thu, 11 Jun 2020 17:17:09 -0700 (PDT) Date: Thu, 11 Jun 2020 17:17:09 -0700 From: Matt Helsley To: Adrian Reber CC: Christian Brauner , Eric Biederman , Pavel Emelyanov , Oleg Nesterov , Dmitry Safonov <0x7f454c46@gmail.com>, Andrei Vagin , Nicolas Viennot , =?utf-8?B?TWljaGHFgiBDxYJhcGnFhHNraQ==?= , Kamil Yurtsever , Dirk Petersen , Christine Flood , Mike Rapoport , Radostin Stoyanov , Cyrill Gorcunov , Serge Hallyn , Stephen Smalley , Sargun Dhillon , Arnd Bergmann , Aaron Goidel , , , , Eric Paris , Jann Horn Subject: Re: [PATCH] capabilities: Introduce CAP_RESTORE Message-ID: <20200612001709.GA25598@rlwimi.vmware.com> Mail-Followup-To: Matt Helsley , Adrian Reber , Christian Brauner , Eric Biederman , Pavel Emelyanov , Oleg Nesterov , Dmitry Safonov <0x7f454c46@gmail.com>, Andrei Vagin , Nicolas Viennot , =?utf-8?B?TWljaGHFgiBDxYJhcGnFhHNraQ==?= , Kamil Yurtsever , Dirk Petersen , Christine Flood , Mike Rapoport , Radostin Stoyanov , Cyrill Gorcunov , Serge Hallyn , Stephen Smalley , Sargun Dhillon , Arnd Bergmann , Aaron Goidel , linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org, selinux@vger.kernel.org, Eric Paris , Jann Horn References: <20200522055350.806609-1-areber@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <20200522055350.806609-1-areber@redhat.com> Received-SPF: None (EX13-EDG-OU-002.vmware.com: mhelsley@vmware.com does not designate permitted sender hosts) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, May 22, 2020 at 07:53:50AM +0200, Adrian Reber wrote: > This enables CRIU to checkpoint and restore a process as non-root. > > Over the last years CRIU upstream has been asked a couple of time if it > is possible to checkpoint and restore a process as non-root. The answer > usually was: 'almost'. > > The main blocker to restore a process was that selecting the PID of the > restored process, which is necessary for CRIU, is guarded by CAP_SYS_ADMIN. > > In the last two years the questions about checkpoint/restore as non-root > have increased and especially in the last few months we have seen > multiple people inventing workarounds. > > The use-cases so far and their workarounds: > > * Checkpoint/Restore in an HPC environment in combination with > a resource manager distributing jobs. Users are always running > as non root, but there was the desire to provide a way to > checkpoint and restore long running jobs. > Workaround: setuid wrapper to start CRIU as root as non-root > https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c > * Another use case to checkpoint/restore processes as non-root > uses as workaround a non privileged process which cycles through > PIDs by calling fork() as fast as possible with a rate of > 100,000 pids/s instead of writing to ns_last_pid > https://github.com/twosigma/set_ns_last_pid > * Fast Java startup using checkpoint/restore. > We have been in contact with JVM developers who are integrating > CRIU into a JVM to decrease the startup time. > Workaround so far: patch out CAP_SYS_ADMIN checks in the kernel > * Container migration as non root. There are people already > using CRIU to migrate containers as non-root. The solution > there is to run it in a user namespace. So if you are able > to carefully setup your environment with the namespaces > it is already possible to restore a container/process as non-root. > Unfortunately it is not always possible to setup an environment > in such a way and for easier access to non-root based container > migration this patch is also required. > > There are probably a few more things guarded by CAP_SYS_ADMIN required > to run checkpoint/restore as non-root, but by applying this patch I can > already checkpoint and restore processes as non-root. As there are > already multiple workarounds I would prefer to do it correctly in the > kernel to avoid that CRIU users are starting to invent more workarounds. > > I have used the following tests to verify that this change works as > expected by setting the new capability CAP_RESTORE on the two resulting > test binaries: > > $ cat ns_last_pid.c > // http://efiop-notes.blogspot.com/2014/06/how-to-set-pid-using-nslastpid.html > #include > #include > #include > #include > #include > #include > > int main(int argc, char *argv[]) > { > pid_t pid, new_pid; > char buf[32]; > int fd; > > if (argc != 2) > return 1; > > printf("Opening ns_last_pid...\n"); > fd = open("/proc/sys/kernel/ns_last_pid", O_RDWR | O_CREAT, 0644); > if (fd < 0) { > perror("Cannot open ns_last_pid"); > return 1; > } > > printf("Locking ns_last_pid...\n"); > if (flock(fd, LOCK_EX)) { > close(fd); > printf("Cannot lock ns_last_pid\n"); > return 1; > } > > pid = atoi(argv[1]); > snprintf(buf, sizeof(buf), "%d", pid - 1); > printf("Writing pid-1 to ns_last_pid...\n"); > if (write(fd, buf, strlen(buf)) != strlen(buf)) { > printf("Cannot write to buf\n"); > return 1; > } > > printf("Forking...\n"); > new_pid = fork(); > if (new_pid == 0) { > printf("I am the child!\n"); > exit(0); > } else if (new_pid == pid) > printf("I am the parent. My child got the pid %d!\n", new_pid); > else > printf("pid (%d) does not match expected pid (%d)\n", new_pid, pid); > > printf("Cleaning up...\n"); > if (flock(fd, LOCK_UN)) > printf("Cannot unlock\n"); > close(fd); > return 0; > } > $ id -u; /home/libcap/ns_last_pid 300000 > 1001 > Opening ns_last_pid... > Locking ns_last_pid... > Writing pid-1 to ns_last_pid... > Forking... > I am the parent. My child got the pid 300000! > I am the child! > Cleaning up... > > For the clone3() based approach: > $ cat clone3_set_tid.c > #define _GNU_SOURCE > #include > #include > #include > #include > #include > #include > #include > #include > #include > > #define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr))) > > int main(int argc, char *argv[]) > { > struct clone_args c_args = { }; > pid_t pid, new_pid; > > if (argc != 2) > return 1; > > pid = atoi(argv[1]); > c_args.set_tid = ptr_to_u64(&pid); > c_args.set_tid_size = 1; > > printf("Forking...\n"); > new_pid = syscall(__NR_clone3, &c_args, sizeof(c_args)); (Note: I'm going to call the capability CAP_RESTORE but I think this applies regardless of whether the permissions stay with CAP_SYS_ADMIN..) I haven't fully reviewed the discussion of the security consequences but my sense is this would require retaining CAP_RESTORE down the entire tree of processes being restored so each parent could call clone3() with the correct pid value for its child(ren). Ideally you would drop CAP_RESTORE sooner -- preferrably only one process would need it. I think you could do that by changing what you pass down; instead of passing down a capability and a pid number, pass down a special "reservation" pidfd: 1. Have CAP_RESTORE enable opening a pidfd with the desired pid as a reservation for the pid (i.e. can't use it to signal, wait, ... perhaps these return -EBUSY, -EAGAIN or something...). 2. Only one process needs CAP_RESTORE -- it can drop CAP_RESTORE after reserving all of the pids but before kicking off the clone3() calls to recreate all of the tasks. 3. Pass the pidfd down the tree of restoring processes. Note how the set of specific pids to be created is limited at this point -- the software cannot be tricked into recreating processes other pids using a capability, CAP_RESTORE, it doesn't have. You might even set a flag on the pidfd which closes the pidfd in the sender when it's passed over a socket so that only the appropriate processes retain the pidfd. 4. The parent can then pass the *reserved* pidfd into clone3(). The reserved pidfd seamlessly turns into a pidfd reference to the child if clone3() succeeds. If it fails the eservation is still consumed -- a given pidfd reserved for clone3() can only be passed to clone3() and succeed once. Perhaps this scheme could concentrate the need for CAP_SYS_ADMIN to one process so maybe it would obviate the need for CAP_RESTORE. That said, perhaps there's something that prevents implementing such a pidfd or perhaps I have misunderstood what CRIU is doing and this pidfd idea isn't workable. Cheers, -Matt Helsley