References: <20180929103453.12025-1-cyphar@cyphar.com> <20180929131534.24472-1-cyphar@cyphar.com> <20181001054246.gfinmx3api7kjhmc@ryuk> <20181002073220.7mzndna4tdnxdvdt@ryuk>
In-Reply-To: <20181002073220.7mzndna4tdnxdvdt@ryuk>
From: Andy Lutomirski
Date: Wed, 3 Oct 2018 15:09:18 -0700
Subject: Re: [PATCH 2/3] namei: implement AT_THIS_ROOT chroot-like path resolution
To: Aleksa Sarai
Cc: Jann Horn, "Eric W. Biederman", Jeff Layton, "J. Bruce Fields", Al Viro, Arnd Bergmann, Shuah Khan, David Howells, Andrew Lutomirski, Christian Brauner, Tycho Andersen, LKML, Linux FS Devel, linux-arch, "open list:KERNEL SELFTEST FRAMEWORK", dev@opencontainers.org, Linux Containers, Linux API
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Oct 2, 2018 at 12:32 AM Aleksa Sarai wrote:
>
> On 2018-10-01, Andy Lutomirski wrote:
> > >>> Currently most container runtimes try to do this resolution in
> > >>> userspace[1], causing many potential race conditions. In addition, the
> > >>> "obvious" alternative (actually performing a {ch,pivot_}root(2))
> > >>> requires a fork+exec which is *very* costly if necessary for every
> > >>> filesystem operation involving a container.
> > >>
> > >> Wait.
> > >> fork() I understand, but why exec? And actually, you don't need
> > >> a full fork() either, clone() lets you do this with some process parts
> > >> shared. And then you also shouldn't need to use SCM_RIGHTS, just keep
> > >> the file descriptor table shared. And why chroot()/pivot_root(),
> > >> wouldn't you want to use setns()?
> > >
> > > You're right about this -- for C runtimes. In Go we cannot do a raw
> > > clone() or fork() (if you do it manually with RawSyscall you'll end with
> > > broken runtime state). So you're forced to do fork+exec (which then
> > > means that you can't use CLONE_FILES and must use SCM_RIGHTS). Same goes
> > > for CLONE_VFORK.
> >
> > I must admit that I’m not very sympathetic to the argument that “Go’s
> > runtime model is incompatible with the simpler solution.”
>
> Multi-threaded programs have a similar issue (though with Go it's much
> worse). If you fork a multi-threaded C program then you can only safely
> use AS-Safe glibc functions (those that are safe within a signal
> handler). But if you're just doing three syscalls this shouldn't be as
> big of a problem as Go where you can't even do said syscalls.
>
> > Anyway, it occurs to me that the real problem is that setns() and
> > chroot() are both overkill for this use case.
>
> I agree. My diversion to Go was to explain why it was particularly bad
> for cri-o/rkt/runc/Docker/etc.
>
> > What’s needed is to start your walk from /proc/pid-in-container/root,
> > with two twists:
> >
> > 1. Do the walk as though rooted at a directory. This is basically just
> > your AT_THIS_ROOT, but the footgun is avoided because the dirfd you
> > use is from a foreign namespace, and, except for symlinks to absolute
> > paths, no amount of .. racing is going to escape the *namespace*.
>
> This is quite clever and I'll admit I hadn't thought of this. This
> definitely fixes the ".."
> issue, but as you've said it won't handle
> absolute symlinks (which means userspace has the same races that we
> currently have even if you assume that you have a container process
> already running -- CVE-2018-15664 is one of many examples of this).
>
> (AT_THIS_ROOT using /proc/$container/root would in principle fix all of
> the mentioned issues -- but as I said before I'd like to see whether
> hardening ".." would be enough to solve the escape problem.)

Hmm. Good point.
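
For illustration only, here is a minimal userspace sketch of the dirfd-relative walk discussed above, written in Python for brevity. It is not the proposed kernel interface and does not implement AT_THIS_ROOT. The helper names `container_root_fd` and `open_at_root` are hypothetical; the only real calls used are `os.open` with its `dir_fd` parameter (an openat(2)-style open) and the /proc/<pid>/root magic link. The sketch assumes Linux and a caller with permission to open the target process's root.

```python
import os


def container_root_fd(pid: int) -> int:
    """Open a directory fd for the root of process `pid` (hypothetical helper).

    /proc/<pid>/root refers to that process's root directory, so the
    returned fd can serve as the starting point for relative lookups.
    """
    return os.open(f"/proc/{pid}/root", os.O_RDONLY | os.O_DIRECTORY)


def open_at_root(root_fd: int, relpath: str, flags: int = os.O_RDONLY) -> int:
    """Open `relpath` relative to `root_fd`, openat(2)-style (hypothetical helper).

    Caveat from the thread: this only changes where the walk starts.
    A symlink whose target is an absolute path is still resolved against
    the caller's own root, so it can lead outside the tree under root_fd.
    """
    if os.path.isabs(relpath):
        # An absolute path would make the kernel ignore root_fd entirely.
        raise ValueError("path must be relative to the root dirfd")
    return os.open(relpath, flags, dir_fd=root_fd)
```

A caller would do something like `open_at_root(container_root_fd(pid), "etc/hostname")`, where both `pid` and the path are placeholders. The absolute-symlink caveat in the docstring is the gap the quoted text points out: starting the walk from a dirfd does nothing for a link such as `link -> /somewhere`, which is why in-kernel handling along the lines of AT_THIS_ROOT is being discussed.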