References: <20200331124017.2252-1-ignat@cloudflare.com> <20200331124017.2252-2-ignat@cloudflare.com> <20200401063620.catm73fbp5n4wv5r@yavin.dot.cyphar.com> <20200401063806.5crx6pnm6vzuc3la@yavin.dot.cyphar.com>
In-Reply-To: <20200401063806.5crx6pnm6vzuc3la@yavin.dot.cyphar.com>
From: Ignat Korchagin
Date: Wed, 1 Apr 2020 10:30:50 +0100
Subject: Re: [PATCH v2 1/1] mnt: add support for non-rootfs initramfs
To: Aleksa Sarai
Cc: Al Viro, linux-fsdevel@vger.kernel.org, linux-kernel, kernel-team, containers@lists.linux-foundation.org, christian.brauner@ubuntu.com
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Apr 1, 2020 at 7:38 AM Aleksa Sarai wrote:
>
> On 2020-04-01, Aleksa Sarai wrote:
> > On 2020-03-31, Ignat Korchagin wrote:
> > > The main need for this is to support container runtimes on stateless Linux
> > > systems (pivot_root system call from initramfs).
> > >
> > > Normally, the task of initramfs is to mount and switch to a "real" root
> > > filesystem. However, on stateless systems (booting over the network) it is just
> > > convenient to have your "real" filesystem as initramfs from the start.
> > >
> > > This, however, breaks different container runtimes, because they usually use
> > > pivot_root system call after creating their mount namespace. But pivot_root does
> > > not work from initramfs, because initramfs runs from rootfs, which is the root
> > > of the mount tree and can't be unmounted.
> > >
> > > One workaround is to do:
> > >
> > >   mount --bind / /
> > >
> > > However, that defeats one of the purposes of using pivot_root in the cloned
> > > containers: get rid of host root filesystem, should the code somehow escape the
> > > chroot.
> > >
> > > There is a way to solve this problem from userspace, but it is much more
> > > cumbersome:
> > >   * we either have to create a multilayered archive for initramfs, where the outer
> > >     layer creates a tmpfs filesystem and unpacks the inner layer, switches root
> > >     and does not forget to properly clean up the old rootfs
> > >   * or we need to use keepinitrd kernel cmdline option, unpack initramfs to
> > >     rootfs, run a script to create our target tmpfs root, unpack the same
> > >     initramfs there, switch root to it and again properly clean up the old root,
> > >     thus unpacking the same archive twice and also wasting memory, because
> > >     the kernel stores compressed initramfs image indefinitely.
> > >
> > > With this change we can ask the kernel (by specifying nonroot_initramfs kernel
> > > cmdline option) to create a "leaf" tmpfs mount for us and switch root to it
> > > before the initramfs handling code, so initramfs gets unpacked directly into
> > > the "leaf" tmpfs with rootfs being empty and no need to clean up anything.
> > >
> > > This also brings the behaviour in line with the older style initrd, where the
> > > initrd is located on some leaf filesystem in the mount tree and rootfs remaining
> > > empty.
> > >
> > > Signed-off-by: Ignat Korchagin
> >
> > I know this is a bit of a stretch, but I thought I'd ask -- is it
> > possible to solve the problem with pivot_root(2) without requiring this
> > workaround (and an additional cmdline option)?
> >
> > From the container runtime side of things, most runtimes do support
> > working on initramfs but it requires disabling pivot_root(2) support (in
> > the runc world this is --no-pivot-root).
> > We would love to be able to remove support for disabling pivot_root(2)
> > because lots of projects have been shipping with pivot_root(2) disabled
> > (such as minikube until recently[1]) -- which opens such systems to quite
> > a few breakout and other troubling exploits (obviously they also ship
> > without using user namespaces *sigh*).
> >
> > But requiring a new cmdline option might dissuade people from switching.
> > If there was a way to fix the underlying restriction on pivot_root(2),
> > I'd be much happier with that as a solution.
> >
> > Thanks.
> >
> > [1]: https://github.com/kubernetes/minikube/issues/3512
>
> (I forgot to add the kernel containers ML to Cc.)
>
> --
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
>

In my opinion, we simply did not expect pivot_root to become so popular
with containers, nor did we expect people to run full stateless systems
from initramfs rather than immediately switching to another root
filesystem on boot. These feel to me like use cases that were not
considered before for the pivot_root+initramfs combination. However, we
now see more and more cases that need this, and the boilerplate code and
additional memory copying required to handle it from userspace (and
sometimes the security issues you mentioned) are becoming too much.

I understand the simplicity argument in [1] ("You can't unmount rootfs
for approximately the same reason you can't kill the init process..."),
but to keep that simplicity and still support the new containerised
Linux world, the kernel should give us a hand.

I currently see no reason why we can't apply this patch without the
cmdline conditional, because we would just end up in the same place as
if we had used initrd instead of initramfs. But I leave that decision to
the subsystem maintainers. After all, a system running from initramfs is
a stateless system, so I would expect the maintainers of such a system
to have an easy way to add the cmdline parameter on reboot.

[1]: https://www.kernel.org/doc/Documentation/filesystems/ramfs-rootfs-initramfs.txt

Ignat
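
As a rough illustration of the restriction discussed in this thread, here is a minimal sketch of how userspace could tell whether it is running directly from rootfs (the case where pivot_root is refused) by looking at the contents of /proc/self/mountinfo. The function names and the sample inputs are hypothetical and not part of the patch. The sketch also assumes that the last entry listed for "/" is the mount currently on top, which should be checked before relying on it.

```python
def root_fstype(mountinfo_text):
    """Return the filesystem type of the mount visible at "/", or None.

    mountinfo_text is the text of /proc/self/mountinfo. Each line has
    per-mount fields, then " - ", then "fstype source super-options".
    """
    fstype = None
    for line in mountinfo_text.splitlines():
        left, sep, right = line.partition(" - ")
        if not sep:
            continue  # not a well-formed mountinfo line
        mount_fields = left.split()
        fs_fields = right.split()
        # Field 5 (index 4) is the mount point.
        if len(mount_fields) >= 5 and fs_fields and mount_fields[4] == "/":
            # Assumption: a later entry for "/" is stacked over earlier ones.
            fstype = fs_fields[0]
    return fstype


def runs_from_rootfs(mountinfo_text):
    """True if "/" is rootfs itself, as when booted straight into initramfs."""
    return root_fstype(mountinfo_text) == "rootfs"


if __name__ == "__main__":
    with open("/proc/self/mountinfo") as f:
        print("running from rootfs:", runs_from_rootfs(f.read()))
```

On a system that has switched to a disk or tmpfs root, the check should report False because another filesystem is mounted over rootfs. A container runtime could use such a check to decide whether it needs a fallback such as the bind-mount workaround quoted above.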