Date: Mon, 31 Jul 2023 15:46:15 +0200
From: Jan Kara
To: Amir Goldstein
Cc: Ivan Babrou, Greg Kroah-Hartman, linux-fsdevel@vger.kernel.org,
    kernel-team@cloudflare.com, linux-kernel@vger.kernel.org,
    cgroups@vger.kernel.org, Tejun Heo, Hugh Dickins, Andrew Morton,
    Christoph Hellwig, Jan Kara, Zefan Li, Johannes Weiner,
    Christian Brauner
Subject: Re: [PATCH] kernfs: attach uuid for every kernfs and report it in fsid
Message-ID: <20230731134615.delje45enx3tkyco@quack3>
References: <20230710183338.58531-1-ivan@cloudflare.com>
    <2023071039-negate-stalemate-6987@gregkh>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue 11-07-23 12:49:05, Amir Goldstein wrote:
> On Tue, Jul 11, 2023 at 12:21 AM Ivan Babrou wrote:
> >
> > On Mon, Jul 10, 2023 at 12:40 PM Greg Kroah-Hartman wrote:
> > >
> > > On Mon, Jul 10, 2023 at 11:33:38AM -0700, Ivan Babrou wrote:
> > > > The following two commits added the same thing for tmpfs:
> > > >
> > > > * commit 2b4db79618ad ("tmpfs: generate random sb->s_uuid")
> > > > * commit 59cda49ecf6c ("shmem: allow reporting fanotify events with file handles on tmpfs")
> > > >
> > > > Having fsid allows using fanotify, which is especially handy for cgroups,
> > > > where one might be interested in knowing when they are created or removed.
> > > >
> > > > Signed-off-by: Ivan Babrou
> > > > ---
> > > >  fs/kernfs/mount.c | 13 ++++++++++++-
> > > >  1 file changed, 12 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> > > > index d49606accb07..930026842359 100644
> > > > --- a/fs/kernfs/mount.c
> > > > +++ b/fs/kernfs/mount.c
> > > > @@ -16,6 +16,8 @@
> > > >  #include
> > > >  #include
> > > >  #include
> > > > +#include
> > > > +#include
> > > >
> > > >  #include "kernfs-internal.h"
> > > >
> > > > @@ -45,8 +47,15 @@ static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry)
> > > >          return 0;
> > > >  }
> > > >
> > > > +int kernfs_statfs(struct dentry *dentry, struct kstatfs *buf)
> > > > +{
> > > > +        simple_statfs(dentry, buf);
> > > > +        buf->f_fsid = uuid_to_fsid(dentry->d_sb->s_uuid.b);
> > > > +        return 0;
> > > > +}
> > > > +
> > > >  const struct super_operations kernfs_sops = {
> > > > -        .statfs = simple_statfs,
> > > > +        .statfs = kernfs_statfs,
> > > >          .drop_inode = generic_delete_inode,
> > > >          .evict_inode = kernfs_evict_inode,
> > > >
> > > > @@ -351,6 +360,8 @@ int kernfs_get_tree(struct fs_context *fc)
> > > >                  }
> > > >                  sb->s_flags |= SB_ACTIVE;
> > > >
> > > > +                uuid_gen(&sb->s_uuid);
> > >
> > > Since kernfs has a lot of nodes (like hundreds of thousands if not more
> > > at times, being created at boot time), did you just slow down creating
> > > them all, and increase the memory usage in a measurable way?
> >
> > This is just for the superblock, not every inode. The memory increase
> > is one UUID per kernfs instance (there are maybe 10 of them on a basic
> > system), which is trivial. Same goes for CPU usage.
> >
> > > We were trying to slim things down, what userspace tools need this
> > > change? Who is going to use it, and what for?
> >
> > The one concrete thing is ebpf_exporter:
> >
> > * https://github.com/cloudflare/ebpf_exporter
> >
> > I want to monitor cgroup changes, so that I can have an up to date map
> > of inode -> cgroup path, so that I can resolve the value returned from
> > bpf_get_current_cgroup_id() into something that a human can easily
> > grasp (think system.slice/nginx.service). Currently I do a full sweep
> > to build a map, which doesn't work if a cgroup is short lived, as it
> > just disappears before I can resolve it. Unfortunately, systemd
> > recycles cgroups on restart, changing inode number, so this is a very
> > real issue.
> >
> > There's also this old wiki page from systemd:
> >
> > * https://freedesktop.org/wiki/Software/systemd/Optimizations
> >
> > Quoting from there:
> >
> > > Get rid of systemd-cgroups-agent. Currently, whenever a systemd cgroup runs empty a tool "systemd-cgroups-agent" is invoked by the kernel which then notifies systemd about it. The need for this tool should really go away, which will save a number of forked processes at boot, and should make things faster (especially shutdown). This requires introduction of a new kernel interface to get notifications for cgroups running empty, for example via fanotify() on cgroupfs.
> >
> > So a similar need to mine, but for different systemd-related needs.
> >
> > Initially I tried adding this for cgroup fs only, but the problem felt
> > very generic, so I pivoted to having it in kernfs instead, so that any
> > kernfs based filesystem would benefit.
> >
> > Given pretty much non-existing overhead and simplicity of this, I
> > think it's a change worth doing, unless there's a good reason to not
> > do it. I cc'd plenty of people to make sure it's not a bad decision.
>
> I agree. I think it was a good decision.
> I have some followup questions though.
>
> I guess your use case cares about the creation of cgroups?
> as long as the only way to create a cgroup is via vfs
> vfs_mkdir() -> ... cgroup_mkdir()
> fsnotify_mkdir() will be called.
> Is that a correct statement?
> Because if not, then explicit fsnotify_mkdir() calls may be needed
> similar to tracefs/debugfs.
>
> I don't think that the statement holds for dying cgroups,
> so explicit fsnotify_rmdir() are almost certainly needed to make
> inotify/fanotify monitoring on cgroups complete.

Yeah, as Ivan writes, we should already have all that is needed to
generate CREATE and DELETE events for the cgroup filesystem. In theory,
inotify or fanotify inode marks could already be used with cgroupfs
today. Thus I have no objection to providing an fsid for it so that
filesystem-wide notifications can be used for it as well. Feel free to
add:

Acked-by: Jan Kara

to your patch.

> On an unrelated side topic,
> I would like to point your attention to this comment in the patch that
> was just merged to v6.5-rc1:
>
> 69562eb0bd3e ("fanotify: disallow mount/sb marks on kernel internal pseudo fs")
>
>         /*
>          * mount and sb marks are not allowed on kernel internal pseudo fs,
>          * like pipe_mnt, because that would subscribe to events on all the
>          * anonynous pipes in the system.
>          *
>          * SB_NOUSER covers all of the internal pseudo fs whose objects are not
>          * exposed to user's mount namespace, but there are other SB_KERNMOUNT
>          * fs, like nsfs, debugfs, for which the value of allowing sb and mount
>          * mark is questionable. For now we leave them alone.
>          */
>
> My question to you, as the only user I know of for fanotify FAN_REPORT_FID
> on SB_KERNMOUNT, do you have plans to use a mount or filesystem mark
> to monitor cgroups? or only inotify-like directory watches?

Yeah, for situations like cgroupfs the filesystem-wide watches do make
sense and are useful at times, as Ivan describes, so I guess we'll leave
those alone...

								Honza
-- 
Jan Kara
SUSE Labs, CR
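
For anyone following the quoted patch from userspace: the fsid that kernfs_statfs() reports is derived from the 16-byte superblock UUID. Below is a rough userspace sketch of that derivation, not kernel code. It assumes uuid_to_fsid() XORs the two little-endian 64-bit halves of the UUID and splits the result into a low and a high 32-bit word; that is my reading of include/linux/statfs.h and should be checked against the tree you build. The helper names load_le64, fold_uuid_to_fsid and path_has_fsid are made up for this illustration. The last helper only checks whether statfs() reports a non-zero fsid, which, as far as I know, is what fanotify wants before it accepts FAN_REPORT_FID marks on a filesystem.

```c
#include <stdint.h>
#include <string.h>
#include <sys/vfs.h>

/* Hypothetical helper: read 8 bytes as a little-endian 64-bit value. */
static uint64_t load_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Assumed equivalent of the kernel's uuid_to_fsid(): XOR the two
 * little-endian halves of the UUID, then store the low 32 bits in
 * val[0] and the high 32 bits in val[1].
 */
static void fold_uuid_to_fsid(const uint8_t uuid[16], uint32_t val[2])
{
	uint64_t v = load_le64(uuid) ^ load_le64(uuid + 8);

	val[0] = (uint32_t)v;
	val[1] = (uint32_t)(v >> 32);
}

/*
 * Returns 1 if statfs() reports a non-zero fsid for path, 0 if the
 * fsid is all zeroes, and -1 if statfs() itself fails.
 */
static int path_has_fsid(const char *path)
{
	struct statfs st;
	uint32_t val[2];

	if (statfs(path, &st) != 0)
		return -1;
	/* Copy the raw words; the member name inside fsid_t varies by libc. */
	memcpy(val, &st.f_fsid, sizeof(val));
	return (val[0] | val[1]) != 0;
}
```

I would expect path_has_fsid("/sys/fs/cgroup") to return 0 on a kernel without the patch and 1 with it, but I have not verified that here, and the mount point is only the usual default.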