Received: by 2002:ac0:98c7:0:0:0:0:0 with SMTP id g7-v6csp2077342imd; Fri, 2 Nov 2018 05:45:03 -0700 (PDT) X-Google-Smtp-Source: AJdET5ffRvQuHireqTIUqVJsaa5k49JKeeaEg09nu0JBPUph4+Uo4Du4zXDD13Nziqf/qcgJucoS X-Received: by 2002:a17:902:8a93:: with SMTP id p19-v6mr11437365plo.41.1541162703715; Fri, 02 Nov 2018 05:45:03 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1541162703; cv=none; d=google.com; s=arc-20160816; b=i3cWP9uKlgU3tFKNUyWIy7OTQnFhONpJZ/YzATcKjBK8fNmhRVpboMoUSwGCROgHoA V0bbgToSWWvOBCh7YJZ6nvOGnUyr3UXAjdFoqydU9v/k4hDvKTTn/C37BGsVTKn5ayFj Ml0HRkJAQ0v1PYL1loh5+jpnrVJazt6E1yz2ofMAKW+rieeRaOiBeylNS0TJpx34A8fS JsEw6WSoU+rJe5PK+1zsJ6cbn23LF5+Klg2WZw7DuTWBbkzk8080WxJgnEukrdcC39y9 2cPJU4j2rtI5Q1pxMXtROCdNH5i8KMtZA7xfVf1lJ/HPzl8gVHYnWfmmlSlUB0YDt7KW /7GQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date; bh=LiTGJy/yASg4Kf9qswzmezhHG46xu34E64ej2cVtOOo=; b=kqm+N4eOQO4ZgkPFlk0rhsE4NWsAVAODhOXbeViRTvMowfZHbeeCs8HuT3ufHo9iEs pVIuWvMdcS0d7K7WVDvzTpcnpzsSvVnSW6ySqcB/xGRc92J45brPftlAAavk0L/qCUmB N7JHmZ0Qies1nsMG0kO4G/SFDZbgzYjd3sPLiMDAH2iEO16YNJB9W+QEF76dwKwqNAEH PzWI8AYAGuZeOMhB7/iov/JUplIeYpw72fPoGPpFydV41BqjBhdmYRnwlJtnBb32QZ+I VGSFhg0/CYiF2aTsu9u4Kwz8x4JYzdej6rKf1uPycheSw+IBApgbkTJmYUaBMsNB2Tx+ e+VQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=canonical.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id p3-v6si12353718plb.14.2018.11.02.05.44.48; Fri, 02 Nov 2018 05:45:03 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=canonical.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727150AbeKBVvL (ORCPT + 99 others); Fri, 2 Nov 2018 17:51:11 -0400 Received: from youngberry.canonical.com ([91.189.89.112]:44416 "EHLO youngberry.canonical.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726104AbeKBVvL (ORCPT ); Fri, 2 Nov 2018 17:51:11 -0400 Received: from mail-io1-f72.google.com ([209.85.166.72]) by youngberry.canonical.com with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.76) (envelope-from ) id 1gIYo0-0007Fp-D4 for linux-kernel@vger.kernel.org; Fri, 02 Nov 2018 12:44:04 +0000 Received: by mail-io1-f72.google.com with SMTP id o7-v6so1786422ioh.22 for ; Fri, 02 Nov 2018 05:44:04 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=LiTGJy/yASg4Kf9qswzmezhHG46xu34E64ej2cVtOOo=; b=Ll52BzLw4FfeHgw/KCj4EcCbBkubKKYP+FUUo//0R7rBZ7y9NEA7QJ+FQ2x165HnmA ckLCnM87ETYHn7JxpeSDXJSuCZ4La/dogxEOBsL1tS3bl3OhpQkaccbMWqyoxBKsln46 j5vs6p+3dVkR9T11JRvLk0r7RhuhGKbWp2ffjMDUuUYqoDMMqZJzyRP989I5SrgkrtpC qLzXOQOC/esXBOvvMqq91Bq7yH71LUHEih6EW/O0Qfg1gxhnvvC7rTtW+Z8sIG5Iy/CV sSQz1rNOi9Hc/MH86/u3W5TqspJQDRzOrrcdBykkOWf4U6L4PaynpBoilktPP6ZKvHBe XBPg== X-Gm-Message-State: AGRZ1gLfB+butPTkF6ZPkPLIxOhv2ONnT/aAtJV6We52hMWoP03OMR3E yZz3EpfIdsvSb/ZLqOdyhidonuZHaPa9yAcqhys+0hWWZIXX6aWQ40pD+3l4eXftHIbXudXWt2Z W9bUtB50vdMztl6ZXuJgcpLs7T3SDDd215XnatHeWPg== X-Received: by 2002:a6b:5f13:: with SMTP id t19-v6mr9382106iob.268.1541162643196; Fri, 02 Nov 2018 05:44:03 -0700 (PDT) X-Received: by 2002:a6b:5f13:: with SMTP id t19-v6mr9382076iob.268.1541162642629; Fri, 02 Nov 2018 05:44:02 -0700 (PDT) Received: from localhost ([2605:a601:ac7:2a20:7c8b:4047:a2ef:69cd]) by smtp.gmail.com with ESMTPSA id v134-v6sm13546918ita.23.2018.11.02.05.44.01 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Fri, 02 Nov 2018 05:44:01 -0700 (PDT) Date: Fri, 2 Nov 2018 07:44:00 -0500 From: Seth Forshee To: Amir Goldstein Cc: linux-fsdevel , linux-kernel , Linux Containers , James Bottomley , overlayfs Subject: Re: [RFC PATCH 6/6] shiftfs: support nested shiftfs mounts Message-ID: <20181102124400.GB29262@ubuntu-xps13> References: <20181101214856.4563-1-seth.forshee@canonical.com> <20181101214856.4563-7-seth.forshee@canonical.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Nov 02, 2018 at 12:02:45PM +0200, Amir Goldstein wrote: > On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee wrote: > > > > shiftfs mounts cannot be nested for two reasons -- global > > CAP_SYS_ADMIN is required to set up a mark mount, and a single > > functional shiftfs mount meets the filesystem stacking depth > > limit. > > > > The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel > > ids in a mount must be within that mount's s_user_ns, so all that > > is needed is CAP_SYS_ADMIN within that s_user_ns. > > > > The stack depth issue can be worked around with a couple of > > adjustments. First, a mark mount doesn't really need to count > > against the stacking depth as it doesn't contribute to the call > > stack depth during filesystem operations. Therefore the mount > > over the mark mount only needs to count as one more than the > > lower filesystems stack depth. > > That's true, but it also highlights the point that the "mark" sb is > completely unneeded and it really is just a nice little hack. > All the information it really stores is a lower mount reference, > a lower root dentry and a declarative statement "I am shiftable!". Seems I should have saved some of the things I said in my previous response for this one. As you no doubt gleaned from that email, I agree with this. > Come to think about it, "container shiftable" really isn't that different from > NFS export with "no_root_squash" and auto mounted USB drive. > I mean the shifting itself is different of course, but the > declaration, not so much. > If I am allowing sudoers on another machine to mess with root owned > files visible > on my machine, I am pretty much have the same issues as container admins > accessing root owned files on my init_user_ns filesystem. In all those cases, > I'd better not be executing suid binaries from the untrusted "external" source. > > Instead of mounting a dummy filesystem to make the declaration, you could > get the same thing with: > mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC) > and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED > or whatnot) constant to uapi and manage to come up good man page description. > > Then users could actually mount a filesystem in init_user_ns MS_EXTERN and > avoid the extra bind mount step (for a full filesystem tree export). > Declaring a mounted image MS_EXTERN has merits on its own even without > containers and shitfs, for example with pluggable storage. Other LSMs could make > good use of that declaration. I'm missing how we figure out the target user ns in this scheme. We need a context with privileges towards the source path's s_user_ns to say it's okay to shift ids for the files under the source path, and then we need a target user ns for the id shifts. Currently the target is current_user_ns when the final shiftfs mount is created. So, how do we determine the target s_user_ns in your scheme? > > > > > Second, when the lower mount is shiftfs we can just skip down to > > that mount's lower filesystem and shift ids relative to that. > > There is no reason to shift ids twice, and the lower path has > > already been marked safe for id shifting by a user privileged > > towards all ids in that mount's user ns. > > > > Signed-off-by: Seth Forshee > > --- > > fs/shiftfs.c | 68 +++++++++++++++++++++++++++++++++++----------------- > > 1 file changed, 46 insertions(+), 22 deletions(-) > > > > diff --git a/fs/shiftfs.c b/fs/shiftfs.c > > index b19af7b2fe75..008ace2842b9 100644 > > --- a/fs/shiftfs.c > > +++ b/fs/shiftfs.c > > @@ -930,7 +930,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data, > > struct shiftfs_data *data = raw_data; > > char *name = kstrdup(data->path, GFP_KERNEL); > > int err = -ENOMEM; > > - struct shiftfs_super_info *ssi = NULL; > > + struct shiftfs_super_info *ssi = NULL, *mp_ssi; > > struct path path; > > struct dentry *dentry; > > > > @@ -946,11 +946,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data, > > if (err) > > goto out; > > > > - /* to mark a mount point, must be real root */ > > - if (ssi->mark && !capable(CAP_SYS_ADMIN)) > > - goto out; > > - > > - /* else to mount a mark, must be userns admin */ > > + /* to mount a mark, must be userns admin */ > > if (!ssi->mark && !ns_capable(current_user_ns(), CAP_SYS_ADMIN)) > > goto out; > > Isn't this check performed by vfs anyway? i.e. in mount_nodev() -> sget() Yeah, I noticed that too. I left it in for the moment to emphasize the change I was making, but it can be removed. > > > > > @@ -962,41 +958,66 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data, > > > > if (!S_ISDIR(path.dentry->d_inode->i_mode)) { > > err = -ENOTDIR; > > - goto out_put; > > - } > > - > > - sb->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1; > > - if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) { > > - printk(KERN_ERR "shiftfs: maximum stacking depth exceeded\n"); > > - err = -EINVAL; > > - goto out_put; > > + goto out_put_path; > > } > > > > if (ssi->mark) { > > + struct super_block *lower_sb = path.mnt->mnt_sb; > > + > > + /* to mark a mount point, must root wrt lower s_user_ns */ > > + if (!ns_capable(lower_sb->s_user_ns, CAP_SYS_ADMIN)) > > + goto out_put_path; > > + > > + > > /* > > * this part is visible unshifted, so make sure no > > * executables that could be used to give suid > > * privileges > > */ > > sb->s_iflags = SB_I_NOEXEC; > > As commented on cover letter, why allow access to any files besides root at all? > In fact, the only justification for a dummy sb (instead of bind mount with > MS_EXTERN flag) would be in order to override inode operations with noop ops > to prevent access to unshifted files from within container. Summarizing my response to the other message, if the mark mount is kept (and I would prefer that it isn't kept) that seems reasonable to me. Thanks, Seth