Date: Fri, 22 May 2020 09:53:31 +0200
From: Christian Brauner
To: Adrian Reber
Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov <0x7f454c46@gmail.com>, Andrei Vagin, Nicolas Viennot, Michał Cłapiński, Kamil Yurtsever, Dirk Petersen, Christine Flood, Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann, Aaron Goidel, linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org, selinux@vger.kernel.org, Eric Paris, Jann Horn
Subject: Re: [PATCH] capabilities: Introduce CAP_RESTORE
Message-ID: <20200522075331.ef7zcz3hbke7qvem@wittgenstein>
In-Reply-To: <20200522055350.806609-1-areber@redhat.com>
On Fri, May 22, 2020 at 07:53:50AM +0200, Adrian Reber wrote:
> This enables CRIU to checkpoint and restore a process as non-root.
>
> Over the last years CRIU upstream has been asked a couple of times if it
> is possible to checkpoint and restore a process as non-root. The answer
> usually was: 'almost'.
>
> The main blocker to restoring a process was that selecting the PID of the
> restored process, which is necessary for CRIU, is guarded by CAP_SYS_ADMIN.
>
> In the last two years the questions about checkpoint/restore as non-root
> have increased, and especially in the last few months we have seen
> multiple people inventing workarounds.
>
> The use-cases so far and their workarounds:
>
> * Checkpoint/Restore in an HPC environment in combination with
>   a resource manager distributing jobs. Users are always running
>   as non-root, but there was the desire to provide a way to
>   checkpoint and restore long running jobs.
>   Workaround: a setuid wrapper that lets a non-root user start CRIU as root
>   https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
> * Another use case to checkpoint/restore processes as non-root
>   uses, as a workaround, a non-privileged process which cycles through
>   PIDs by calling fork() as fast as possible (at a rate of
>   100,000 PIDs/s) instead of writing to ns_last_pid
>   https://github.com/twosigma/set_ns_last_pid
> * Fast Java startup using checkpoint/restore.
>   We have been in contact with JVM developers who are integrating
>   CRIU into a JVM to decrease the startup time.
>   Workaround so far: patch out CAP_SYS_ADMIN checks in the kernel
> * Container migration as non-root. There are people already
>   using CRIU to migrate containers as non-root. The solution
>   there is to run it in a user namespace. So if you are able
>   to carefully set up your environment with the namespaces
>   it is already possible to restore a container/process as non-root.
>   Unfortunately it is not always possible to set up an environment
>   in such a way, and for easier access to non-root based container
>   migration this patch is also required.
>
> There are probably a few more things guarded by CAP_SYS_ADMIN required
> to run checkpoint/restore as non-root, but by applying this patch I can
> already checkpoint and restore processes as non-root. As there are
> already multiple workarounds, I would prefer to do it correctly in the
> kernel to keep CRIU users from inventing more workarounds.

It sounds ok to me as long as this feature is guarded by a sensible
capability. I don't want users to be able to randomly choose their pid
without any capability required.

We've heard the plea for unprivileged checkpoint/restore through the
grapevine, and CAP_RESTORE came up a few times at Plumbers, but it's one
of those cases where nobody pushed for it, so its urgency was questionable.

This is 5.9 material, though. Could you please add selftests?

It also seems you have future changes planned that would make certain
things accessible via CAP_RESTORE that are currently guarded by other
capabilities. Any specific things in mind? It might be worth knowing what
we'd be getting ourselves into if you're planning on flipping switches in
other places.

Christian
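For readers unfamiliar with the PID-selection step the patch description refers to: a restorer that wants a specific PID can write target - 1 to /proc/sys/kernel/ns_last_pid and then fork(), so that the next PID allocated in that PID namespace is the target. What follows is a minimal C sketch under stated assumptions. It is not CRIU's actual code, and the helper names format_ns_last_pid(), set_next_pid() and fork_with_pid() are hypothetical. It assumes the write is checked against CAP_SYS_ADMIN in the user namespace that owns the PID namespace, consistent with the quoted text about user namespaces. An unprivileged caller should therefore see -EPERM, and if /proc/sys is mounted read-only, open() fails instead.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: format the value to write to ns_last_pid so that
 * the next PID allocated in this PID namespace is `target`, i.e. target - 1.
 * Returns the string length, or -EINVAL for a bad target or short buffer. */
int format_ns_last_pid(char *buf, size_t size, pid_t target)
{
    if (target < 1)
        return -EINVAL;
    int len = snprintf(buf, size, "%d", (int)(target - 1));
    if (len < 0 || (size_t)len >= size)
        return -EINVAL;
    return len;
}

/* Hypothetical helper: write target - 1 to ns_last_pid.
 * Assumption: the kernel rejects this write without CAP_SYS_ADMIN in the
 * owning user namespace, so unprivileged callers normally get -EPERM.
 * Returns 0 on success or a negative errno. */
int set_next_pid(pid_t target)
{
    char buf[32];
    int len = format_ns_last_pid(buf, sizeof(buf), target);
    if (len < 0)
        return len;

    int fd = open("/proc/sys/kernel/ns_last_pid", O_WRONLY);
    if (fd < 0)
        return -errno;

    ssize_t n = write(fd, buf, (size_t)len);
    int err = (n < 0) ? errno : 0;
    close(fd);
    if (n < 0)
        return -err;
    return (n == len) ? 0 : -EIO;
}

/* Hypothetical helper: fork a child that should receive `target` as its PID.
 * This is racy: any fork() elsewhere in the namespace between the write and
 * ours takes the PID first, so a real restorer must serialise this and
 * check the returned PID. Returns the child's PID or a negative errno. */
pid_t fork_with_pid(pid_t target)
{
    int ret = set_next_pid(target);
    if (ret < 0)
        return ret;

    pid_t pid = fork();
    if (pid < 0)
        return -errno;
    if (pid == 0)
        _exit(0); /* a real restorer would rebuild the process state here */

    waitpid(pid, NULL, 0);
    return pid;
}
```

The fork()-cycling workaround mentioned in the quoted text presumably reaches the same end without the privileged write. It burns through PIDs until the allocator sits just below the desired value. That is why it needs no capability, but it is slow.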