Date: Fri, 14 Aug 2020 09:58:36 +0200
From: Lennart Poettering
To: Linus Torvalds
Cc: Steven Whitehouse, David Howells, Miklos Szeredi, linux-fsdevel,
    Al Viro, Karel Zak, Jeff Layton, Nicolas Dichtel, Christian Brauner,
    Linux API, Ian Kent, LSM, Linux Kernel Mailing List
Subject: Re: file metadata via fs API
Message-ID: <20200814075836.GA230635@gardel-login>
References: <20200811135419.GA1263716@miu.piliscsaba.redhat.com>
 <52483.1597190733@warthog.procyon.org.uk>
 <066f9aaf-ee97-46db-022f-5d007f9e6edb@redhat.com>

On Wed, 12.08.20 12:50, Linus Torvalds (torvalds@linux-foundation.org) wrote:

> On Wed, Aug 12, 2020 at 12:34 PM Steven Whitehouse wrote:
> >
> > The point of this is to give us the ability to monitor mounts from
> > userspace.
>
> We haven't had that before, I don't see why it's suddenly such a big deal.
>
> The notification side I understand. Polling /proc files is not the answer.
>
> But the whole "let's design this crazy subsystem for it" seems way
> overkill. I don't see anybody caring that deeply.
>
> It really smells like "do it because we can, not because we must".

With my systemd maintainer hat on (and my other userspace hats too), there
are a couple of things I really want from the kernel, because they would
fix real problems for us:

1. We want mount notifications that don't require us to rescan
   /proc/self/mountinfo in its entirety every time something changes, over
   and over again, simply because that doesn't scale. We have various bugs
   open about this performance bottleneck that I could point you to, but I
   figure it's easy to see why the current approach doesn't scale...

2. We want an unprivileged API to query (and maybe set) the fs UUID, like
   we nowadays have for the fs label with FS_IOC_[GS]ETFSLABEL (see the
   label sketch below).

3. We want an API to query the granularity of a file system's timestamps.
   Otherwise it's hard in userspace to reproducibly re-generate directory
   trees. We need to know, for example, that some fs only has 2s
   granularity (like FAT).

4. Similarly, we want to know whether an fs is case-sensitive or
   case-preserving for file names, and which charset it accepts for file
   names.

5. We want to know whether a file system supports access modes, xattrs,
   file ownership, device nodes, symlinks, hardlinks, fifos, atimes,
   btimes, ACLs and so on. All of this can currently only be figured out
   by changing something and reading back whether it worked. Which sucks,
   of course.

6. We'd like to know the maximum file size on a file system.

7. Right now it's hard to figure out the mount options used for the fs
   backing some file: you can now statx() the file, determine the mnt_id
   from that, and then search for it in /proc/self/mountinfo, but that's
   slow, because again we need to scan the whole file until we find the
   entry we need. And that file can be huge in real life. (See the
   mountinfo sketch below for what this currently looks like.)

8. Similarly: we quite often want to know the submounts of a mount. It
   would be great if for that kind of information (i.e. the list of
   mnt_ids below some other mnt_id) we wouldn't have to scan the whole of
   /proc/self/mountinfo again. In many places our code operates
   recursively and wants to know the mounts below some specific dir, but
   currently pays a performance price for that if the number of file
   systems on the host is huge. This doesn't sound like a biggie, but it
   actually is: in systemd we spend a lot of time scanning
   /proc/self/mountinfo...

9. How are file locks implemented on this fs? Are they local only, and
   orthogonal to remote locks? Are POSIX and BSD locks possibly merged in
   the backend? Do they work at all?
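
Just to make point 2 concrete: this is roughly what the existing,
unprivileged label query looks like today. A minimal sketch only, assuming
headers new enough to provide FS_IOC_GETFSLABEL/FSLABEL_MAX (Linux >= 4.18,
and only on file systems that wire the ioctl up), error handling mostly
trimmed:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>           /* FS_IOC_GETFSLABEL, FSLABEL_MAX */

    int main(int argc, char *argv[]) {
            /* Any readable fd on the file system in question will do. */
            int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY|O_CLOEXEC);
            if (fd < 0)
                    return 1;

            /* Unprivileged label query. Point 2 above asks for the
             * analogous call for the UUID, which doesn't exist. */
            char label[FSLABEL_MAX] = {0};
            if (ioctl(fd, FS_IOC_GETFSLABEL, label) < 0)
                    return 1;

            printf("fs label: %s\n", label);
            return 0;
    }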
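
And to illustrate points 7 and 8, this is the dance a process has to do
today to get from a file to the mount options backing it. Again just a
sketch, assuming a toolchain that already exposes STATX_MNT_ID and struct
statx's stx_mnt_id field (Linux >= 5.8, glibc >= 2.28 for the statx()
wrapper), error handling trimmed:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char *argv[]) {
            const char *path = argc > 1 ? argv[1] : ".";
            struct statx sx;

            /* Step 1: ask the kernel for the mount ID of the file. */
            if (statx(AT_FDCWD, path, 0, STATX_MNT_ID, &sx) < 0 ||
                !(sx.stx_mask & STATX_MNT_ID))
                    return 1;

            /* Step 2: linearly scan /proc/self/mountinfo until the first
             * field matches that ID. This is the O(number-of-mounts)
             * rescan complained about above: with thousands of mounts,
             * every lookup rereads and reparses the whole file. */
            FILE *f = fopen("/proc/self/mountinfo", "re");
            if (!f)
                    return 1;

            char line[4096];
            while (fgets(line, sizeof line, f)) {
                    unsigned long long id;
                    if (sscanf(line, "%llu ", &id) == 1 &&
                        id == sx.stx_mnt_id) {
                            fputs(line, stdout); /* per-mount + super options */
                            break;
                    }
            }
            fclose(f);
            return 0;
    }

The point of 7 and 8 is to make both the rescan and the text parsing
unnecessary.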
I don't really care too much what an API for this looks like, but let me
just say that I am not a fan of APIs that require allocating an fd just to
query info about another fd. This 'feels' a bit too recursive: if you
expose information about some fd in some magic procfs subdir, or even in
some virtual pseudo-file below the file's path, then we have to allocate a
new fd to figure things out for the first fd, and if we wanted the same
info for that new fd, we'd theoretically recurse down. Now of course, in
real life we most likely wouldn't actually recurse down, but it's still
smelly. In particular if fd limits are tight.

I mean, I really don't care if you expose non-file-system stuff via the fs
if that's what you want, but exposing *fs* meta-info in the *fs* itself is
just ugly.

I generally detest APIs that have no chance of ever returning multiple bits
of information atomically. Splitting the query of multiple attributes into
multiple system calls means they can never be determined in a congruent
way. I much prefer APIs where we provide a struct to fill in and make a
single syscall: at least for some fields we'd then know they were filled in
together and are congruent with each other.

I am a fan of the statx() system call, I must say. If we had something like
it for the file system itself I'd be quite happy; it could tick off many of
the requests I listed above.

Hope this is useful,

Lennart

--
Lennart Poettering, Berlin