Date: Wed, 9 May 2018 09:38:40 +1000
From: Dave Chinner
To: Mark Fasheh
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-btrfs@vger.kernel.org
Subject: Re: [RFC][PATCH 0/76] vfs: 'views' for filesystems with more than one root
Message-ID: <20180508233840.GM10363@dastard>
References: <20180508180436.716-1-mfasheh@suse.de>
In-Reply-To: <20180508180436.716-1-mfasheh@suse.de>

On Tue, May 08, 2018 at 11:03:20AM -0700, Mark Fasheh wrote:
> Hi,
>
> The VFS's super_block covers a variety of filesystem functionality.
> In particular we have a single structure representing both I/O and
> namespace domains.
>
> There are requirements to de-couple this functionality. For example,
> filesystems with more than one root (such as btrfs subvolumes) can
> have multiple inode namespaces. This starts to confuse userspace when
> it notices multiple inodes with the same inode/device tuple on a
> filesystem.

Devil's advocate - I'm not looking at the code, I'm commenting on the
architectural issues I see here.

The XFS subvolume work I've been doing explicitly uses a superblock per
subvolume. That's because subvolumes are designed to be completely
independent of the backing storage - they know nothing about the
underlying storage except that they share a BDI for writeback purposes
and write to whatever block device the remapping layer gives them at IO
time. Hence XFS subvolumes have (at this point) their own unique s_dev,
on-disk format configuration, journal, space accounting, etc. i.e. they
are fully independent filesystems in their own right, and as such we do
not have multiple inode namespaces per superblock.

So this doesn't sound like a "subvolume problem" - it's a "how do we
sanely support multiple independent namespaces per superblock" problem.

AFAICT, this same problem exists with bind mounts and mount namespaces -
they are effectively multiple roots on a single superblock, but it's
done at the vfsmount level and so the superblock knows nothing about
them.

So this kinda feels like there's still an impedance mismatch between
btrfs subvolumes being mounted as subtrees on the underlying root
vfsmount rather than being created as truly independent vfs namespaces
that share a superblock. To put that as a question: why aren't btrfs
subvolumes vfsmounts in their own right, with the unique subvolume
information stored in (or obtained from) the vfsmount?

> In addition, it's currently impossible for a filesystem subvolume to
> have a different security context from its parent.
> If we could allow for subvolumes to optionally specify their own
> security context, we could use them as containers directly instead of
> having to go through an overlay.

Again, XFS subvolumes don't have this problem. So really we need to
frame this discussion in terms of sanely supporting multiple namespaces
within a superblock, not subvolumes.

> I ran into this particular problem with respect to Btrfs some years
> ago and sent out a very naive set of patches which were (rightfully)
> not incorporated:
>
> https://marc.info/?l=linux-btrfs&m=130074451403261&w=2
> https://marc.info/?l=linux-btrfs&m=130532890824992&w=2
>
> During the discussion, one question did come up - why can't
> filesystems like Btrfs use a superblock per subvolume? There's a
> couple of problems with that:
>
> - It's common for a single Btrfs filesystem to have thousands of
>   subvolumes. So keeping a superblock for each subvol in memory would
>   get prohibitively expensive - imagine having 8000 copies of struct
>   super_block for a file system just because we wanted some separation
>   of, say, s_dev.

That's no different to using individual overlay mounts for the
thousands of containers that are on the system. This doesn't seem to
be a major problem...

> - Writeback would also have to walk all of these superblocks -
>   again not very good for system performance.

Background writeback is backing device focussed, not superblock
focussed. It will only iterate the superblocks that have dirty inodes
on the bdi writeback lists, not all the superblocks on the bdi. IOWs,
this isn't a major problem except for sync() operations that iterate
superblocks.....

> - Anyone wanting to lock down I/O on a filesystem would have to
>   freeze all the superblocks. This goes for most things related to
>   I/O really - we simply can't afford to have the kernel walking
>   thousands of superblocks to sync a single fs.

Not with XFS subvolumes.
Freezing the underlying parent filesystem will effectively stop all IO
from the mounted subvolumes by freezing remapping calls before IO.
Sure, those subvolumes aren't in a consistent state, but we don't
freeze userspace, so none of the application data is ever in a
consistent state when filesystems are frozen.

So, again, I'm not sure there's a /subvolume/ problem here. There's
definitely a "freeze hierarchy" problem, but that already exists and
it's something we talked about at LSFMM because we need to solve it
for reliable hibernation.

> It's far more efficient, then, to pull those fields we need for a
> subvolume namespace into their own structure.

I'm not convinced yet - it still feels like it's the wrong layer to be
solving the multiple-namespaces-per-superblock problem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com