From: Miklos Szeredi
Date: Fri, 15 Sep 2023 10:56:58 +0200
Subject: Re: [RFC PATCH 2/3] add statmnt(2) syscall
To: Christian Brauner
Cc: Miklos Szeredi, Linus Torvalds, linux-fsdevel@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
 linux-man@vger.kernel.org, linux-security-module@vger.kernel.org,
 Karel Zak, Ian Kent, David Howells, Al Viro, Christian Brauner,
 Amir Goldstein

On Thu, 14 Sept 2023 at 17:27, Christian Brauner wrote:
>
> On Thu, Sep 14, 2023 at 12:13:54PM +0200, Miklos Szeredi wrote:

> No worries, I think the discussion touching on this starts at:
> https://youtu.be/j3fp2MtRr2I?si=f-YBg6uWq80dV3VC&t=1603
> (with David talking quietly without a microphone for some parts
> unfortunately...)

(Thanks for digging that out.)

That discussion touched on two aspects of using a single call vs.
multiple calls:

 - atomicity
 - marshalling

Atomicity of getting a snapshot of the current mount tree with all of
its attributes was never guaranteed, although reading
/proc/self/mountinfo into a sufficiently large buffer would work that
way.  However, I don't see why mount trees would require stronger
guarantees than dentry trees (for which we have basically none).

Marshalling/demarshalling of arbitrary structures is indeed ugly.  I
think what Linus suggested, and what this interface was based on, is
much less than that.  Also see my suggestion below: it doesn't need
demarshalling at all, because the kernel can fill in the pointers.
And yes, this could be used for arbitrary structures without
compromising type safety, but at the cost of adding more complexity to
the kernel (at least ASCII strings are just one type).

An even more type-clean interface:

  struct statmnt *statmnt(u64 mnt_id, u64 mask, void *buf,
                          size_t bufsize, unsigned int flags);

The kernel would return a fully initialized struct with the numeric as
well as string fields filled.  That part is trivial for userspace to
deal with.
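
As an illustration of the pointer-returning style described above, here
is a minimal userspace sketch.  It is not the proposed kernel code: the
names demo_statmnt, demo_put and the DEMO_* mask bits are hypothetical,
the field list is trimmed, and the strings "/", "/mnt" and "ext4" are
placeholder values.  It only shows how a filler can place the strings
after the struct in the caller's buffer and hand back ordinary pointers,
so the caller has nothing to demarshal.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mask bits, invented for this sketch. */
#define DEMO_MNT_ROOT   0x1u
#define DEMO_MOUNTPOINT 0x2u
#define DEMO_FS_TYPE    0x4u

/* Hypothetical, trimmed layout modelled on the struct quoted in the thread. */
struct demo_statmnt {
	uint64_t mask;            /* which string fields below are valid */
	uint64_t mnt_id;
	const char *mnt_root;     /* all three point into the caller's buffer */
	const char *mountpoint;
	const char *fs_type;
};

/* Copy s (with its NUL) into the unused tail of buf; NULL if it does not fit. */
static const char *demo_put(char *buf, size_t bufsize, size_t *used, const char *s)
{
	size_t len = strlen(s) + 1;
	char *dst;

	assert(*used <= bufsize);
	if (len > bufsize - *used)
		return NULL;
	dst = buf + *used;
	memcpy(dst, s, len);
	*used += len;
	return dst;
}

/*
 * Userspace stand-in for the filler.  buf must be suitably aligned for
 * struct demo_statmnt.  Returns buf cast to the struct, or NULL with
 * errno set to EOVERFLOW when the requested fields do not fit.
 */
struct demo_statmnt *demo_statmnt(uint64_t mnt_id, uint64_t mask,
				  void *buf, size_t bufsize)
{
	struct demo_statmnt *sm = buf;
	size_t used = sizeof(*sm);

	if (bufsize < sizeof(*sm))
		goto overflow;
	memset(sm, 0, sizeof(*sm));
	sm->mnt_id = mnt_id;

	if (mask & DEMO_MNT_ROOT) {
		sm->mnt_root = demo_put(buf, bufsize, &used, "/");
		if (!sm->mnt_root)
			goto overflow;
		sm->mask |= DEMO_MNT_ROOT;
	}
	if (mask & DEMO_MOUNTPOINT) {
		sm->mountpoint = demo_put(buf, bufsize, &used, "/mnt");
		if (!sm->mountpoint)
			goto overflow;
		sm->mask |= DEMO_MOUNTPOINT;
	}
	if (mask & DEMO_FS_TYPE) {
		sm->fs_type = demo_put(buf, bufsize, &used, "ext4");
		if (!sm->fs_type)
			goto overflow;
		sm->mask |= DEMO_FS_TYPE;
	}
	return sm;

overflow:
	errno = EOVERFLOW;
	return NULL;
}
```

A caller that only wants the numeric fields can pass a mask of 0 and a
buffer of exactly sizeof(struct demo_statmnt); a caller that wants
strings passes a larger buffer and reads only the fields whose bits are
set in the returned mask.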

For sizing the buffer and versioning the struct see discussion below.

> > What I'm thinking is making it even simpler for userspace:
> >
> > struct statmnt {
> >   ...
> >   char *mnt_root;
> >   char *mountpoint;
> >   char *fs_type;
> >   u32 num_opts;
> >   char *opts;
> > };
> >
> > I'd still just keep options nul delimited.
> >
> > Is there a good reason not to return pointers (pointing to within the
> > supplied buffer obviously) to userspace?
>
> It's really unpleasant to program with. Yes, I think you pointed out
> before that it often doesn't matter much as long as the system call is
> really only relevant to some special purpose userspace.
>
> But statmount() will be used pretty extensively pretty quickly for the
> purpose of finding out mount options on a mount (Querying a whole
> sequences of mounts via repeated listmount() + statmount() calls on the
> other hand will be rarer.).
>
> And there's just so many tools that need this: libmount, systemd, all
> kinds of container runtimes, path lookup libraries such as libpathrs,
> languages like go and rust that expose and wrap these calls and so on.
>
> Most of these tools don't need to know about filesystem mount options
> and if they do they can just query that through an extra system call. No
> harm in doing that.

Just pass sizeof(struct statmnt) as the buffer size, and it will work
that way.

> The agreement we came to to split out listing submounts into a separate
> system call was exactly to avoid having to have a variable sized pointer
> at the end of the struct statmnt (That's also part of the video above
> btw.) and to make it as simple as possible.
>
> Plus, the format for how to return arbitrary filesystem mount options
> warrants a separate discussion imho as that's not really vfs level
> information.

Okay.  Let's take fs options out of this.

That leaves:

 - fs type and optionally subtype
 - root of mount within fs
 - mountpoint path

The type and subtype are naturally limited to sane sizes, those are
not an issue.

For paths the evolution of the relevant system/library calls was:

  char *getwd(char buf[PATH_MAX]);
  char *getcwd(char *buf, size_t size);
  char *get_current_dir_name(void);

It started out using a fixed size buffer, then a variable sized
buffer, then an automatically allocated buffer by the library, hiding
the need to resize on overflow.

The latest style is suitable for the statmnt() call as well, if we
worry about pleasantness of the API.

> > > This will also allow us to turn statmnt() into an extensible argument
> > > system call versioned by size just like we do any new system calls with
> > > struct arguments (e.g., mount_setattr(), clone3(), openat2() and so on).
> > > Which is how we should do things like that.
> >
> > The mask mechanism also allow versioning of the struct.
>
> Yes, but this is done with reserved space which just pushes away the
> problem and bloats the struct for the sake of an unknown future. If we
> were to use an extensible argument struct we would just version by size.
> The only requirement is that you extend by 64 bit (see struct
> clone_args) which had been extended. No need for reserved space in fact.

Versioning would still work, as long as userspace is strictly checking
the return mask.  I.e. newly added fields will come after the old
buffer, as assumed by the kernel.  But the kernel will never set the
mask bits for these fields, so userspace should not ever look at them.
Note: the interface does have a bufsize parameter, so no possibility
of memory corruption in any event.

I added the reserved space so that userspace would be protected from
rubbish at the end of the struct if the kernel was older.
A library wrapper could work around that issue (move the variable
part beyond the end of the new struct), but it would require a code
update in the wrapper, not just updating the struct.

But in fact it's much simpler to just add ample reserved space and be
done with it forever, no need to worry about versioning at all.

> > > numbers for sub types as well. So we don't need to use strings here.
> >
> > Ugh.
>
> Hm, idk. It's not that bad imho. We'll have to make some ugly tradeoffs.

Subtype is a fuse thing (e.g. sshfs would show up as fuse.sshfs in
/proc/self/mountinfo).  Forcing each fuse filesystem to invent a magic
number... please no.

Thanks,
Miklos
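
For reference, the get_current_dir_name() style mentioned earlier (the
library allocates the buffer and hides the resize-on-overflow loop) can
be sketched as follows.  This is only an illustration under stated
assumptions: demo_get_path() is a hypothetical stand-in for any
(buf, bufsize) call that fails with EOVERFLOW when the buffer is too
small, and the path it returns is a placeholder.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for a (buf, bufsize) style call: copies a path
 * into buf, or fails with EOVERFLOW when buf is too small.
 */
static int demo_get_path(char *buf, size_t bufsize)
{
	static const char path[] = "/mnt/some/fairly/long/mountpoint";

	assert(buf != NULL);
	if (bufsize < sizeof(path)) {
		errno = EOVERFLOW;
		return -1;
	}
	memcpy(buf, path, sizeof(path));
	return 0;
}

/*
 * Library-style wrapper: grow the buffer until the call fits.
 * Returns a malloc()ed string the caller must free(), or NULL with
 * errno set on failure.
 */
char *demo_get_path_alloc(void)
{
	size_t size = 16;
	char *buf = NULL;

	for (;;) {
		char *tmp = realloc(buf, size);
		int saved;

		if (!tmp) {
			saved = errno;
			free(buf);
			errno = saved;
			return NULL;
		}
		buf = tmp;

		if (demo_get_path(buf, size) == 0)
			return buf;

		if (errno != EOVERFLOW) {
			saved = errno;
			free(buf);
			errno = saved;
			return NULL;
		}
		size *= 2;	/* too small: retry with a larger buffer */
	}
}
```

The caller never sees a size argument; the cost is an allocation per
call and a retry when the first guess is too small.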