Received: by 2002:ac0:a582:0:0:0:0:0 with SMTP id m2-v6csp5126427imm; Tue, 9 Oct 2018 10:03:14 -0700 (PDT) X-Google-Smtp-Source: ACcGV63uPCIhpBWsNdkc2A948kkbZYcDWvby+3XjgUh1eM1E+1HzalTg1+EyNwyXKO3I4RitSB5+ X-Received: by 2002:a17:902:d01:: with SMTP id 1-v6mr29947321plu.88.1539104594629; Tue, 09 Oct 2018 10:03:14 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1539104594; cv=none; d=google.com; s=arc-20160816; b=arjmAlDpJM13p1MDCyIHPZx43qweSTBkD7Vb9YgKztScn44ds9j3xh41zIpKBegwJq qEjuZ6srQN6fORTJbnfHRwHmFHZKuEp0I1aET+syzQZfN/02GjxYR3KLgjaWf/7l3S2p ptJfiRNhx29NvRVzR+e4dJUO8SITGuuaxhrtuZpz6hQ+Yw1TpYOC3qpU9T5gjwejrQdc M+clsm24bzR3C0lwwLp0+4YfbiTZeobksxybAZRNd0xPZzbm604tELTf+1cUBnPuWErc DNpXs6NGo87pk+h2eqVgYFuddsnOze/MgECBVqBpyp94L5jq1GNWc/6hzKq3WyfQQdsb 4aZA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=eYFPc4s+QwgY+HY9B2CjCHW9yeLVZZtyLJMTEOLHSIk=; b=MdAXzWus1YsZ68HCxMykDwck275BgH+G4wCTyKW16Okh/Cl9m90mkUlI3js7q8JYnm jg23GqTwHv0g6U+35IFCXbnhW2tZZvTtDdW83e/mlrGEOLtzHj6Cs7kmz39od5iTnnaq JzbFSedQGtS7pR2dHa4Lvikk6gxDMftLkkeIa7hllZEf2+djpK87TBrR3a9LKGuAdp7d g/qgxliywFa0kxSdhSgpGpdks6CBbvpZLQgHDZ5RznspOOtqHWhGG+V6scXZXBPxwqXy F8cPFjYv2yq4dncjf+r3gk/hF+X4BKUiTt5NQzwQCGzlE4d2l7lnXCN1TwDfDnHXkFIQ E2vg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b="W2UnPXR/"; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id c6-v6si7528106pfg.2.2018.10.09.10.02.59; Tue, 09 Oct 2018 10:03:14 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b="W2UnPXR/"; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726996AbeJJAUS (ORCPT + 99 others); Tue, 9 Oct 2018 20:20:18 -0400 Received: from mail-ot1-f67.google.com ([209.85.210.67]:35380 "EHLO mail-ot1-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726486AbeJJAUR (ORCPT ); Tue, 9 Oct 2018 20:20:17 -0400 Received: by mail-ot1-f67.google.com with SMTP id j9-v6so2412183otl.2 for ; Tue, 09 Oct 2018 10:02:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=eYFPc4s+QwgY+HY9B2CjCHW9yeLVZZtyLJMTEOLHSIk=; b=W2UnPXR/V2+7n0pT9Nf2EMmVOVE3ra4JNkdKbP/ZrkxnwjNHiyExIwmsvK838h4C/l n7S9KQc8O2zf3IoGaYt4VFKubmBh50NK1am8PaeURY3mnmtFO3PZZB7jSWRVjsQEzBPc vld2gM364sT0g6UTuLuu2M7JnZ37ROdtDd7J/Zloy/2MqEQQsb6nORf8eDBST/xMMdFm GstGGITuOW6li1HXb/EfVt1YZQo88q16AtXoJLsX7VEr0tTDQ4EdYhgBRxh/2rzb/NwN XfamLd/za/Iq9qaPiAlK0PTZVwyPDqv+doONpdsqcGc/7MW9hHRpxF/RLW2GaTS1dfhG reYA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=eYFPc4s+QwgY+HY9B2CjCHW9yeLVZZtyLJMTEOLHSIk=; b=LPwXCze8E3gW2ikmvxGnKmoty/D2VWDgyFOtpLhZa6UT6qCzBlZpeNNUXci67XsYls ZQ02UQ2ilHv7ZU3m2fltX88CBhwCwTP0lhVpiRElx1NekmHbO8w4LjZ3aYhMo1QsQ/KA 0w4oUSnikdqN5lo1JQTrR3j6BSXg8wquOfRSo2J0azJe103YYZyVIyYQR/AOjap+UXvb takqdjZm1py5ZVvRJJdXlm04PNu4+KXyCHJTfXOgSQJdRPVajvXlR46Vw+o034dDb2tx CbXO91ZeRIG/jpuJjUaKCG7mJdBYW2wXeklpuBoHW9DFHRTz6b2RpEV8I/N87vki8HTs 2dHg== X-Gm-Message-State: ABuFfogIM+3mALO+79wKRix8CmwRzIBnQFRbBS6rDepuDZ7Cp+kyBMbG 5KhJEgW/8Gq9C6gDRIcyBM2pZYVHByT5zl0dQWJazQ== X-Received: by 2002:a9d:4c15:: with SMTP id l21mr17856286otf.242.1539104542365; Tue, 09 Oct 2018 10:02:22 -0700 (PDT) MIME-Version: 1.0 References: <20181009103752.21482-1-laurent@vivier.eu> <20181009103752.21482-2-laurent@vivier.eu> <795fea1a-97bd-87f5-6a14-6d47a4e42f74@vivier.eu> In-Reply-To: <795fea1a-97bd-87f5-6a14-6d47a4e42f74@vivier.eu> From: Jann Horn Date: Tue, 9 Oct 2018 19:01:56 +0200 Message-ID: Subject: Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace To: Laurent Vivier Cc: ktkhai@virtuozzo.com, kernel list , "Eric W. Biederman" , dima@arista.com, Linux API , James Bottomley , Al Viro , linux-fsdevel@vger.kernel.org, Andrei Vagin , containers@lists.linux-foundation.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Oct 9, 2018 at 6:57 PM Laurent Vivier wrote: > Le 09/10/2018 =C3=A0 18:53, Jann Horn a =C3=A9crit : > > On Tue, Oct 9, 2018 at 6:45 PM Laurent Vivier wrote= : > >> Le 09/10/2018 =C3=A0 18:15, Kirill Tkhai a =C3=A9crit : > >>> On 09.10.2018 13:37, Laurent Vivier wrote: > >>>> This patch allows to have a different binfmt_misc configuration > >>>> for each new user namespace. By default, the binfmt_misc configurati= on > >>>> is the one of the previous level, but if the binfmt_misc filesystem = is > >>>> mounted in the new namespace a new empty binfmt instance is created = and > >>>> used in this namespace. > >>>> > >>>> For instance, using "unshare" we can start a chroot of an another > >>>> architecture and configure the binfmt_misc interpreter without being= root > >>>> to run the binaries in this chroot. > >>>> > >>>> Signed-off-by: Laurent Vivier > >>>> --- > >>>> fs/binfmt_misc.c | 106 ++++++++++++++++++++++++------= --- > >>>> include/linux/user_namespace.h | 13 ++++ > >>>> kernel/user.c | 13 ++++ > >>>> kernel/user_namespace.c | 3 + > >>>> 4 files changed, 107 insertions(+), 28 deletions(-) > >>>> > >>>> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c > >>>> index aa4a7a23ff99..1e0029d097d9 100644 > >>>> --- a/fs/binfmt_misc.c > >>>> +++ b/fs/binfmt_misc.c > >> ... > >>>> @@ -80,18 +74,32 @@ static int entry_count; > >>>> */ > >>>> #define MAX_REGISTER_LENGTH 1920 > >>>> > >>>> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns= ) > >>>> +{ > >>>> + struct binfmt_namespace *b_ns; > >>>> + > >>>> + while (ns) { > >>>> + b_ns =3D READ_ONCE(ns->binfmt_ns); > >>>> + if (b_ns) > >>>> + return b_ns; > >>>> + ns =3D ns->parent; > >>>> + } > >>>> + WARN_ON_ONCE(1); > >>>> + return NULL; > >>>> +} > >>>> + > >> ... > >>>> @@ -823,12 +847,34 @@ static const struct super_operations s_ops =3D= { > >>>> static int bm_fill_super(struct super_block *sb, void *data, int si= lent) > >>>> { > >>>> int err; > >>>> + struct user_namespace *ns =3D sb->s_user_ns; > >>>> static const struct tree_descr bm_files[] =3D { > >>>> [2] =3D {"status", &bm_status_operations, S_IWUSR|S_IRU= GO}, > >>>> [3] =3D {"register", &bm_register_operations, S_IWUSR}, > >>>> /* last one */ {""} > >>>> }; > >>>> > >>>> + /* create a new binfmt namespace > >>>> + * if we are not in the first user namespace > >>>> + * but the binfmt namespace is the first one > >>>> + */ > >>>> + if (READ_ONCE(ns->binfmt_ns) =3D=3D NULL) { > >>>> + struct binfmt_namespace *new_ns; > >>>> + > >>>> + new_ns =3D kmalloc(sizeof(struct binfmt_namespace), > >>>> + GFP_KERNEL); > >>>> + if (new_ns =3D=3D NULL) > >>>> + return -ENOMEM; > >>>> + INIT_LIST_HEAD(&new_ns->entries); > >>>> + new_ns->enabled =3D 1; > >>>> + rwlock_init(&new_ns->entries_lock); > >>>> + new_ns->bm_mnt =3D NULL; > >>>> + new_ns->entry_count =3D 0; > >>>> + /* ensure new_ns is completely initialized before shari= ng it */ > >>>> + smp_wmb(); > >>> > >>> (I haven't dived into patch logic, here just small barrier remark fro= m quick sight). > >>> smp_wmb() has no sense without paired smp_rmb() on the read side. Pos= sible, > >>> you want something like below in read hunk: > >>> > >>> + b_ns =3D READ_ONCE(ns->binfmt_ns); > >>> + if (b_ns) { > >>> + smp_rmb(); > >>> + return b_ns; > >>> + } > >>> > >>> > >> > >> The write barrier is here to ensure the structure is fully written > >> before we set the pointer. > >> > >> I don't understand how read barrier can change something at this level= , > >> IMHO the couple WRITE_ONCE()/READ_ONCE() should be enough to ensure we > >> have correctly initialized the pointer and the structure when we read > >> the pointer back. > >> > >> I think the pointer itself is the "barrier" to access the memory > >> modified before. > > > > Things don't work that way on alpha, but that's why READ_ONCE() > > includes an smp_read_barrier_depends(): > > > > #define __READ_ONCE(x, check) = \ > > ({ = \ > > union { typeof(x) __val; char __c[1]; } __u; = \ > > if (check) = \ > > __read_once_size(&(x), __u.__c, sizeof(x)); = \ > > else = \ > > __read_once_size_nocheck(&(x), __u.__c, sizeof(x)); = \ > > smp_read_barrier_depends(); /* Enforce dependency ordering from= x */ \ > > __u.__val; = \ > > }) > > #define READ_ONCE(x) __READ_ONCE(x, 1) > > > > So my questions are: > > - do we need a smp_wmb() barrier if we use READ_ONCE() and WRITE_ONCE()? Yes. > - if we need an smp_wmb() barrier, do we need an smp_rmb() barrier as > the data we want to "protect" are behind an access to the pointer? No. You only need an smp_read_barrier_depends() to access things behind the pointer you're reading (documented in a big comment block in arch/alpha/include/asm/barrier.h); and that barrier is implied by READ_ONCE(), so you don't have to write it yourself.