From: Florian Weimer
To: Aleksa Sarai
Cc: Andy Lutomirski, Jann Horn, "Eric W. Biederman", jlayton@kernel.org,
    Bruce Fields, Al Viro, Arnd Bergmann, shuah@kernel.org, David Howells,
    Andy Lutomirski, christian@brauner.io, Tycho Andersen, kernel list,
    linux-fsdevel@vger.kernel.org, linux-arch,
    linux-kselftest@vger.kernel.org, dev@opencontainers.org,
    containers@lists.linux-foundation.org, Linux API
Subject: Re: [PATCH 2/3] namei: implement AT_THIS_ROOT chroot-like path resolution
References: <20180929103453.12025-1-cyphar@cyphar.com>
    <20180929131534.24472-1-cyphar@cyphar.com>
    <20181001054246.gfinmx3api7kjhmc@ryuk>
    <20181002073220.7mzndna4tdnxdvdt@ryuk>
Date: Sat, 06 Oct 2018 22:56:23 +0200
In-Reply-To: <20181002073220.7mzndna4tdnxdvdt@ryuk> (Aleksa Sarai's message of "Tue, 2 Oct 2018 17:32:20 +1000")
Message-ID: <875zyeg5fs.fsf@mid.deneb.enyo.de>
X-Mailing-List: linux-kernel@vger.kernel.org

* Aleksa Sarai:

> On 2018-10-01, Andy Lutomirski wrote:
>> >>> Currently most container runtimes try to do this resolution in
>> >>> userspace[1], causing many potential race conditions. In addition, the
>> >>> "obvious" alternative (actually performing a {ch,pivot_}root(2))
>> >>> requires a fork+exec which is *very* costly if necessary for every
>> >>> filesystem operation involving a container.
>> >>
>> >> Wait. fork() I understand, but why exec? And actually, you don't need
>> >> a full fork() either, clone() lets you do this with some process parts
>> >> shared. And then you also shouldn't need to use SCM_RIGHTS, just keep
>> >> the file descriptor table shared. And why chroot()/pivot_root(),
>> >> wouldn't you want to use setns()?
>> >
>> > You're right about this -- for C runtimes. In Go we cannot do a raw
>> > clone() or fork() (if you do it manually with RawSyscall you'll end with
>> > broken runtime state).
>> > So you're forced to do fork+exec (which then means that you can't
>> > use CLONE_FILES and must use SCM_RIGHTS). Same goes for CLONE_VFORK.
>>
>> I must admit that I’m not very sympathetic to the argument that “Go’s
>> runtime model is incompatible with the simpler solution.”
>
> Multi-threaded programs have a similar issue (though with Go it's much
> worse). If you fork a multi-threaded C program then you can only safely
> use AS-Safe glibc functions (those that are safe within a signal
> handler). But if you're just doing three syscalls this shouldn't be as
> big of a problem as Go where you can't even do said syscalls.

The situation is a bit more complicated. There are many programs out
there which use malloc and free (at least indirectly) after a fork,
and we cannot break them. In glibc, we have a couple of subsystems
which are put into a known state before calling the fork/clone system
call if the application calls fork. The price we pay for that is a
fork which is not POSIX-compliant because it is not async-signal-safe.
Admittedly, other libcs chose different trade-offs.

However, what is the same across libcs is this: You cannot call the
clone system call directly and get a fully working new process. Some
things break. For example, for recursive mutexes, we need to know the
TID of the current thread, and we cannot perform a system call to get
it for performance reasons. So everyone has a TID cache for that.
But the TID cache does not get reset when you bypass the fork
implementation in libc, so you end up with subtle corruption bugs on
TID reuse.

So I'd say that in most cases, the C situation is pretty much the same
as the Go situation. If I recall correctly, the problem for Go is
that it cannot call setns from Go code because it fails in the kernel
for multi-threaded processes, and Go processes are already
multi-threaded when user Go code runs.
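
To make the TID cache point concrete, here is a minimal sketch. It is
not glibc's actual implementation: tid_cache, cached_tid,
reset_tid_cache and child_sees_stale_tid are hypothetical names made up
for illustration, and the cache is reset through pthread_atfork, which
only stands in for whatever bookkeeping a libc does inside its own
fork. The raw clone call assumes Linux on x86-64, where the flags are
the first system call argument.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical per-thread TID cache, standing in for the one a libc keeps. */
static __thread pid_t tid_cache;

/* Return the TID, performing the system call only on the first use. */
static pid_t cached_tid(void)
{
    if (tid_cache == 0)
        tid_cache = (pid_t)syscall(SYS_gettid);
    return tid_cache;
}

/* Child-side fork handler: runs for fork(), never for a raw clone. */
static void reset_tid_cache(void)
{
    tid_cache = 0;
}

/*
 * Create a child through fork() or through the raw clone system call and
 * report what the child observed: 1 if its cached TID was stale, 0 if it
 * was correct, -1 if the child could not be created or reaped.
 */
int child_sees_stale_tid(int use_raw_clone)
{
    static int hook_installed;

    if (!hook_installed) {
        if (pthread_atfork(NULL, NULL, reset_tid_cache) != 0)
            return -1;
        hook_installed = 1;
    }

    /* Warm the cache in the parent so that the child inherits a value. */
    pid_t before = cached_tid();
    assert(before == (pid_t)syscall(SYS_gettid));
    (void)before;

    pid_t pid = use_raw_clone
        ? (pid_t)syscall(SYS_clone, (long)SIGCHLD, 0L, 0L, 0L, 0L)
        : fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: compare the inherited cache against the kernel's answer. */
        int stale = cached_tid() != (pid_t)syscall(SYS_gettid);
        _exit(stale);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

On a glibc system I would expect child_sees_stale_tid(0) to return 0,
because fork() runs the handler and the child refills the cache, and
child_sees_stale_tid(1) to return 1, because the raw clone leaves the
parent's TID in the child's copy of the cache. Older glibc versions may
need -pthread at link time.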