2024-01-17 14:37:09

by Steven Rostedt

[permalink] [raw]
Subject: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

From: "Steven Rostedt (Google)" <[email protected]>

The dentries and inodes are created in the readdir for the sole purpose of
getting a consistent inode number. Linus stated that is unnecessary, and
that all inodes can have the same inode number. For a virtual file system
they are pretty meaningless.

Instead use a single unique inode number for all files and one for all
directories.

Link: https://lore.kernel.org/all/[email protected]/
Link: https://lore.kernel.org/linux-trace-kernel/[email protected]

Cc: Masami Hiramatsu <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ajay Kaher <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
---
fs/tracefs/event_inode.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index fdff53d5a1f8..5edf0b96758b 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -32,6 +32,10 @@
*/
static DEFINE_MUTEX(eventfs_mutex);

+/* Choose something "unique" ;-) */
+#define EVENTFS_FILE_INODE_INO 0x12c4e37
+#define EVENTFS_DIR_INODE_INO 0x134b2f5
+
/*
* The eventfs_inode (ei) itself is protected by SRCU. It is released from
* its parent's list and will have is_freed set (under eventfs_mutex).
@@ -352,6 +356,9 @@ static struct dentry *create_file(const char *name, umode_t mode,
inode->i_fop = fop;
inode->i_private = data;

+ /* All files will have the same inode number */
+ inode->i_ino = EVENTFS_FILE_INODE_INO;
+
ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;
d_instantiate(dentry, inode);
@@ -388,6 +395,9 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
inode->i_op = &eventfs_root_dir_inode_operations;
inode->i_fop = &eventfs_file_operations;

+ /* All directories will have the same inode number */
+ inode->i_ino = EVENTFS_DIR_INODE_INO;
+
ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;

--
2.43.0




2024-01-22 10:39:17

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

Hi Stephen,

On Wed, Jan 17, 2024 at 3:37 PM Steven Rostedt <[email protected]> wrote:
> From: "Steven Rostedt (Google)" <[email protected]>
>
> The dentries and inodes are created in the readdir for the sole purpose of
> getting a consistent inode number. Linus stated that is unnecessary, and
> that all inodes can have the same inode number. For a virtual file system
> they are pretty meaningless.
>
> Instead use a single unique inode number for all files and one for all
> directories.
>
> Link: https://lore.kernel.org/all/[email protected]/
> Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
>
> Cc: Masami Hiramatsu <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Mathieu Desnoyers <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Ajay Kaher <[email protected]>
> Suggested-by: Linus Torvalds <[email protected]>
> Signed-off-by: Steven Rostedt (Google) <[email protected]>

Thanks for your patch, which is now commit 53c41052ba312176 ("eventfs:
Have the inodes all for files and directories all be the same") in
v6.8-rc1, to which I have bisected the issue below.

> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -32,6 +32,10 @@
> */
> static DEFINE_MUTEX(eventfs_mutex);
>
> +/* Choose something "unique" ;-) */
> +#define EVENTFS_FILE_INODE_INO 0x12c4e37
> +#define EVENTFS_DIR_INODE_INO 0x134b2f5
> +
> /*
> * The eventfs_inode (ei) itself is protected by SRCU. It is released from
> * its parent's list and will have is_freed set (under eventfs_mutex).
> @@ -352,6 +356,9 @@ static struct dentry *create_file(const char *name, umode_t mode,
> inode->i_fop = fop;
> inode->i_private = data;
>
> + /* All files will have the same inode number */
> + inode->i_ino = EVENTFS_FILE_INODE_INO;
> +
> ti = get_tracefs(inode);
> ti->flags |= TRACEFS_EVENT_INODE;
> d_instantiate(dentry, inode);
> @@ -388,6 +395,9 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
> inode->i_op = &eventfs_root_dir_inode_operations;
> inode->i_fop = &eventfs_file_operations;
>
> + /* All directories will have the same inode number */
> + inode->i_ino = EVENTFS_DIR_INODE_INO;
> +
> ti = get_tracefs(inode);
> ti->flags |= TRACEFS_EVENT_INODE;

This confuses "find".
Running "find /sys/" now prints lots of error messages to stderr:

find: File system loop detected;
‘/sys/kernel/debug/tracing/events/initcall/initcall_finish’ is part of
the same file system loop as
‘/sys/kernel/debug/tracing/events/initcall’.
find: File system loop detected;
‘/sys/kernel/debug/tracing/events/initcall/initcall_start’ is part of
the same file system loop as
‘/sys/kernel/debug/tracing/events/initcall’.
find: File system loop detected;
‘/sys/kernel/debug/tracing/events/initcall/initcall_level’ is part of
the same file system loop as
‘/sys/kernel/debug/tracing/events/initcall’.
[...]

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68korg

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2024-01-22 15:36:41

by Steven Rostedt

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On Mon, 22 Jan 2024 11:38:52 +0100
Geert Uytterhoeven <[email protected]> wrote:

> Hi Stephen,

I don't know who "Stephen" is, but I'll reply to this message.

>
> On Wed, Jan 17, 2024 at 3:37 PM Steven Rostedt <[email protected]> wrote:
> > From: "Steven Rostedt (Google)" <[email protected]>
> >
> > The dentries and inodes are created in the readdir for the sole purpose of
> > getting a consistent inode number. Linus stated that is unnecessary, and
> > that all inodes can have the same inode number. For a virtual file system
> > they are pretty meaningless.
> >
> > Instead use a single unique inode number for all files and one for all
> > directories.
> >
> > Link: https://lore.kernel.org/all/[email protected]/

Yeah, Linus wanted me to try this first and see if there's any regressions.
Well, I guess you just answered that.

The above link has me saying to Linus:

It was me being paranoid that using the same inode number would break user
space. If that is not a concern, then I'm happy to just make it either the
same, or maybe just hash the ei and name that it is associated with.

> > Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
> >
> > Cc: Masami Hiramatsu <[email protected]>
> > Cc: Mark Rutland <[email protected]>
> > Cc: Mathieu Desnoyers <[email protected]>
> > Cc: Christian Brauner <[email protected]>
> > Cc: Al Viro <[email protected]>
> > Cc: Ajay Kaher <[email protected]>
> > Suggested-by: Linus Torvalds <[email protected]>
> > Signed-off-by: Steven Rostedt (Google) <[email protected]>
>
> Thanks for your patch, which is now commit 53c41052ba312176 ("eventfs:
> Have the inodes all for files and directories all be the same") in
> v6.8-rc1, to which I have bisected the issue below.
>
> > --- a/fs/tracefs/event_inode.c
> > +++ b/fs/tracefs/event_inode.c
> > @@ -32,6 +32,10 @@
> > */
> > static DEFINE_MUTEX(eventfs_mutex);
> >
> > +/* Choose something "unique" ;-) */
> > +#define EVENTFS_FILE_INODE_INO 0x12c4e37
> > +#define EVENTFS_DIR_INODE_INO 0x134b2f5
> > +
> > /*
> > * The eventfs_inode (ei) itself is protected by SRCU. It is released from
> > * its parent's list and will have is_freed set (under eventfs_mutex).
> > @@ -352,6 +356,9 @@ static struct dentry *create_file(const char *name, umode_t mode,
> > inode->i_fop = fop;
> > inode->i_private = data;
> >
> > + /* All files will have the same inode number */
> > + inode->i_ino = EVENTFS_FILE_INODE_INO;
> > +
> > ti = get_tracefs(inode);
> > ti->flags |= TRACEFS_EVENT_INODE;
> > d_instantiate(dentry, inode);
> > @@ -388,6 +395,9 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
> > inode->i_op = &eventfs_root_dir_inode_operations;
> > inode->i_fop = &eventfs_file_operations;
> >
> > + /* All directories will have the same inode number */
> > + inode->i_ino = EVENTFS_DIR_INODE_INO;
> > +
> > ti = get_tracefs(inode);
> > ti->flags |= TRACEFS_EVENT_INODE;
>
> This confuses "find".
> Running "find /sys/" now prints lots of error messages to stderr:
>
> find: File system loop detected;
> ‘/sys/kernel/debug/tracing/events/initcall/initcall_finish’ is part of
> the same file system loop as
> ‘/sys/kernel/debug/tracing/events/initcall’.

So at a minimum, the directories need to have unique inode numbers.


> find: File system loop detected;
> ‘/sys/kernel/debug/tracing/events/initcall/initcall_start’ is part of
> the same file system loop as
> ‘/sys/kernel/debug/tracing/events/initcall’.
> find: File system loop detected;
> ‘/sys/kernel/debug/tracing/events/initcall/initcall_level’ is part of
> the same file system loop as
> ‘/sys/kernel/debug/tracing/events/initcall’.
> [...]

Does this fix it for you? It hashes the eventfs_inode data structure after
adding some salt to it.

Kees,

I'm using the eventfs_inode pointer to create a unique value for the inode.
But it's being salted, hashed and then truncated. As it is very easy to
read inodes (although by default, only root has access to read these
inodes), the inode numbers themselves shouldn't be able to leak kernel
addresses via the results of these inode numbers, would it?

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 6795fda2af19..d54897b84596 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -19,6 +19,7 @@
#include <linux/namei.h>
#include <linux/workqueue.h>
#include <linux/security.h>
+#include <linux/siphash.h>
#include <linux/tracefs.h>
#include <linux/kref.h>
#include <linux/delay.h>
@@ -36,6 +37,31 @@ static DEFINE_MUTEX(eventfs_mutex);
#define EVENTFS_FILE_INODE_INO 0x12c4e37
#define EVENTFS_DIR_INODE_INO 0x134b2f5

+/* Used for making inode numbers */
+static siphash_key_t inode_key;
+
+/* Copied from scripts/kconfig/symbol.c */
+static unsigned strhash(const char *s)
+{
+ /* fnv32 hash */
+ unsigned hash = 2166136261U;
+ for (; *s; s++)
+ hash = (hash ^ *s) * 0x01000193;
+ return hash;
+}
+
+/* Just try to make something consistent and unique */
+static int eventfs_dir_ino(struct event_inode *ei, const char *name)
+{
+ unsigned long sip = (unsigned long)ei;
+
+ sip += strhash(name) + EVENTFS_DIR_INODE_INO;
+ sip = siphash_1u32((int)sip, &inode_key);
+
+ /* keep it positive */
+ return sip & ((1U << 31) - 1);
+}
+
/*
* The eventfs_inode (ei) itself is protected by SRCU. It is released from
* its parent's list and will have is_freed set (under eventfs_mutex).
@@ -396,7 +422,7 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
inode->i_fop = &eventfs_file_operations;

/* All directories will have the same inode number */
- inode->i_ino = EVENTFS_DIR_INODE_INO;
+ inode->i_ino = eventfs_dir_ino(ei, ei->name);

ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;
@@ -802,7 +828,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)

name = ei_child->name;

- ino = EVENTFS_DIR_INODE_INO;
+ ino = eventfs_dir_ino(ei_child, name);

if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR))
goto out_dec;
@@ -932,6 +958,9 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
if (IS_ERR(dentry))
return ERR_CAST(dentry);

+ if (siphash_key_is_zero(&inode_key))
+ get_random_bytes(&inode_key, sizeof(inode_key));
+
ei = kzalloc(sizeof(*ei), GFP_KERNEL);
if (!ei)
goto fail_ei;


2024-01-22 17:21:50

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

Hi Steven,

On Mon, Jan 22, 2024 at 4:05 PM Steven Rostedt <[email protected]> wrote:
> On Mon, 22 Jan 2024 11:38:52 +0100
> Geert Uytterhoeven <[email protected]> wrote:
>
> > Hi Stephen,
>
> I don't know who "Stephen" is, but I'll reply to this message.

My apologies (too many different spellings in too many languages)...

> > On Wed, Jan 17, 2024 at 3:37 PM Steven Rostedt <rostedt@goodmisorg> wrote:
> > > From: "Steven Rostedt (Google)" <[email protected]>
> > >
> > > The dentries and inodes are created in the readdir for the sole purpose of
> > > getting a consistent inode number. Linus stated that is unnecessary, and
> > > that all inodes can have the same inode number. For a virtual file system
> > > they are pretty meaningless.
> > >
> > > Instead use a single unique inode number for all files and one for all
> > > directories.
> > >
> > > Link: https://lore.kernel.org/all/[email protected]/
>
> Yeah, Linus wanted me to try this first and see if there's any regressions.
> Well, I guess you just answered that.
>
> The above link has me saying to Linus:
>
> It was me being paranoid that using the same inode number would break user
> space. If that is not a concern, then I'm happy to just make it either the
> same, or maybe just hash the ei and name that it is associated with.
>
> > > Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
> > >
> > > Cc: Masami Hiramatsu <[email protected]>
> > > Cc: Mark Rutland <[email protected]>
> > > Cc: Mathieu Desnoyers <[email protected]>
> > > Cc: Christian Brauner <[email protected]>
> > > Cc: Al Viro <[email protected]>
> > > Cc: Ajay Kaher <[email protected]>
> > > Suggested-by: Linus Torvalds <[email protected]>
> > > Signed-off-by: Steven Rostedt (Google) <[email protected]>
> >
> > Thanks for your patch, which is now commit 53c41052ba312176 ("eventfs:
> > Have the inodes all for files and directories all be the same") in
> > v6.8-rc1, to which I have bisected the issue below.

> > This confuses "find".
> > Running "find /sys/" now prints lots of error messages to stderr:
> >
> > find: File system loop detected;
> > ‘/sys/kernel/debug/tracing/events/initcall/initcall_finish’ is part of
> > the same file system loop as
> > ‘/sys/kernel/debug/tracing/events/initcall’.
>
> So at a minimum, the directories need to have unique inode numbers.
>
>
> > find: File system loop detected;
> > ‘/sys/kernel/debug/tracing/events/initcall/initcall_start’ is part of
> > the same file system loop as
> > ‘/sys/kernel/debug/tracing/events/initcall’.
> > find: File system loop detected;
> > ‘/sys/kernel/debug/tracing/events/initcall/initcall_level’ is part of
> > the same file system loop as
> > ‘/sys/kernel/debug/tracing/events/initcall’.
> > [...]
>
> Does this fix it for you? It hashes the eventfs_inode data structure after
> adding some salt to it.
>
> Kees,
>
> I'm using the eventfs_inode pointer to create a unique value for the inode.
> But it's being salted, hashed and then truncated. As it is very easy to
> read inodes (although by default, only root has access to read these
> inodes), the inode numbers themselves shouldn't be able to leak kernel
> addresses via the results of these inode numbers, would it?
>
> -- Steve

Gotya ;-)

>
> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index 6795fda2af19..d54897b84596 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c

> @@ -36,6 +37,31 @@ static DEFINE_MUTEX(eventfs_mutex);
> #define EVENTFS_FILE_INODE_INO 0x12c4e37
> #define EVENTFS_DIR_INODE_INO 0x134b2f5
>
> +/* Used for making inode numbers */
> +static siphash_key_t inode_key;
> +
> +/* Copied from scripts/kconfig/symbol.c */
> +static unsigned strhash(const char *s)
> +{
> + /* fnv32 hash */
> + unsigned hash = 2166136261U;
> + for (; *s; s++)
> + hash = (hash ^ *s) * 0x01000193;
> + return hash;
> +}
> +
> +/* Just try to make something consistent and unique */
> +static int eventfs_dir_ino(struct event_inode *ei, const char *name)

eventfs_inode

> +{
> + unsigned long sip = (unsigned long)ei;
> +
> + sip += strhash(name) + EVENTFS_DIR_INODE_INO;
> + sip = siphash_1u32((int)sip, &inode_key);
> +
> + /* keep it positive */
> + return sip & ((1U << 31) - 1);
> +}
> +
> /*
> * The eventfs_inode (ei) itself is protected by SRCU. It is released from
> * its parent's list and will have is_freed set (under eventfs_mutex).

Thanks, find is happy now the directories have different inode numbers.
The files still have identical inodes numbers, I hope that doesn't cause
any issues...

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68korg

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2024-01-22 17:49:10

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On 2024-01-22 10:06, Steven Rostedt wrote:
> On Mon, 22 Jan 2024 11:38:52 +0100
> Geert Uytterhoeven <[email protected]> wrote:
>
[...]
>>
>> On Wed, Jan 17, 2024 at 3:37 PM Steven Rostedt <[email protected]> wrote:
>>> From: "Steven Rostedt (Google)" <[email protected]>
>>>
>>> The dentries and inodes are created in the readdir for the sole purpose of
>>> getting a consistent inode number. Linus stated that is unnecessary, and
>>> that all inodes can have the same inode number. For a virtual file system
>>> they are pretty meaningless.
>>>
>>> Instead use a single unique inode number for all files and one for all
>>> directories.
>>>
>>> Link: https://lore.kernel.org/all/[email protected]/
>
> Yeah, Linus wanted me to try this first and see if there's any regressions.
> Well, I guess you just answered that.
>
>> This confuses "find".
>> Running "find /sys/" now prints lots of error messages to stderr:
>>
>> find: File system loop detected;
>> ‘/sys/kernel/debug/tracing/events/initcall/initcall_finish’ is part of
>> the same file system loop as
>> ‘/sys/kernel/debug/tracing/events/initcall’.
>
> So at a minimum, the directories need to have unique inode numbers.

[...]

> I'm using the eventfs_inode pointer to create a unique value for the inode.
> But it's being salted, hashed and then truncated. As it is very easy to
> read inodes (although by default, only root has access to read these
> inodes), the inode numbers themselves shouldn't be able to leak kernel
> addresses via the results of these inode numbers, would it?

Why use an improvised hashing function (re-purposed from
scripts/kconfig/symbol.c to a use-case which is exposed through a
userspace ABI prone to kernel address leaks) rather than simply
reserving values by setting bits in a bitmap ?

How many inodes do we realistically expect to have there ?

On my 6.1.0 kernel:

find /sys/kernel/tracing | wc -l
15598

(mainly due to TRACE_EVENT ABI files)

Hashing risks:

- Exposing kernel addresses if the hashing algorithm is broken,
- Collisions if users are unlucky (which could trigger those
'find' errors).

Those 15598 inode values fit within a single page (bitmap of
1922 bytes).

So I would recommend simply adding a bitmap per tracefs filesystem
instance to keep track of inode number allocation.

Creation/removal of files/directories in tracefs should not be
a fast-path anyway, so who cares about the speed of a find first
bit within a single page ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2024-01-22 18:07:01

by Steven Rostedt

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On Mon, 22 Jan 2024 17:23:29 +0100
Geert Uytterhoeven <[email protected]> wrote:

> > +/* Just try to make something consistent and unique */
> > +static int eventfs_dir_ino(struct event_inode *ei, const char *name)
>
> eventfs_inode

Heh, I fixed that, but must have created the patch before catching it. :-/

>
> > +{
> > + unsigned long sip = (unsigned long)ei;
> > +
> > + sip += strhash(name) + EVENTFS_DIR_INODE_INO;
> > + sip = siphash_1u32((int)sip, &inode_key);
> > +
> > + /* keep it positive */
> > + return sip & ((1U << 31) - 1);
> > +}
> > +
> > /*
> > * The eventfs_inode (ei) itself is protected by SRCU. It is released from
> > * its parent's list and will have is_freed set (under eventfs_mutex).
>
> Thanks, find is happy now the directories have different inode numbers.
> The files still have identical inodes numbers, I hope that doesn't cause
> any issues...

Well, I guess I should ask Linus.

Linus,

I can add this patch to make sure directory inodes are unique, as it causes
a regression in find, but keep the file inodes the same. I can see how the
issue is with directories, as find (and probably other applications) check
for invalid directory loops. But with files, there should be no issue. It
could just think it's another hard link.

The question I have is, should this just make dir inodes unique and keep
files the same, as this patch does? Or make all inodes unique?

This is assuming that my algorithm is good enough to not leak kernel
addresses. I could also chop it down to 28 bits, as that's probably still
"good enough" to keep things unique.

return sip & ((1U << 28) - 1);

That would make it even harder to unhash to get an address.

-- Steve

2024-01-22 18:23:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On Mon, 22 Jan 2024 at 08:46, Steven Rostedt <[email protected]> wrote:
>
> I can add this patch to make sure directory inodes are unique, as it causes
> a regression in find, but keep the file inodes the same.

Yeah, limiting it to directories will at least somewhat help the
address leaking.

However, I also note that you never did the "set i_nlink to one"
trick, which is the traditional thing to do to tell 'find' that it
cannot do its directory optimization thing.

I'm not sure that the nlink trick disables this part of the find
sanity checks, but the *first* thing to check would be something like
this

--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -182,6 +182,7 @@ static int tracefs_getattr(struct mnt_idmap *idmap,

set_tracefs_inode_owner(inode);
generic_fillattr(idmap, request_mask, inode, stat);
+ stat->nlink = 1;
return 0;
}

because it might just fix the issue.

Having nlink == 1 is how non-unix filesystems (like FAT etc) indicate
that you can't try to count directory entries to optimize traversal.

And it is possible that that is where the whole find thing comes from,
but who knows, it could be a generic loop detector that runs
independently of the usual link detection.

Linus

2024-01-22 18:27:42

by Steven Rostedt

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On Mon, 22 Jan 2024 12:14:36 -0500
Mathieu Desnoyers <[email protected]> wrote:

>
> Why use an improvised hashing function (re-purposed from
> scripts/kconfig/symbol.c to a use-case which is exposed through a

That hash is just salt to the real hash function, which is the
siphash_1u32(). I added the name hash so that each file will get a little
different salt to the hash.

The siphash_1u32() is what the rest of the kernel uses for hashing kernel
address space.

> userspace ABI prone to kernel address leaks) rather than simply
> reserving values by setting bits in a bitmap ?
>
> How many inodes do we realistically expect to have there ?

If I only do directories, it is actually significantly less.

>
> On my 6.1.0 kernel:
>
> find /sys/kernel/tracing | wc -l
> 15598
>
> (mainly due to TRACE_EVENT ABI files)
>
> Hashing risks:
>
> - Exposing kernel addresses if the hashing algorithm is broken,

Well this was my biggest concern, but if I truncate at least a nibble, with
the unique salt to the algorithm for each file, how easily does that expose
kernel addresses.

The ei itself, is created from kmalloc() so you would at best get a heap
address. But with the missing nibble (if I mask it with ((1 << 28) - 1),
and much more taken away for 64 bit systems), and the added unique salt, is
it possible for this to expose anything that could be used in an attack?

> - Collisions if users are unlucky (which could trigger those
> 'find' errors).
>
> Those 15598 inode values fit within a single page (bitmap of
> 1922 bytes).
>
> So I would recommend simply adding a bitmap per tracefs filesystem
> instance to keep track of inode number allocation.

And how do I recover this bit after the inode is freed, but then referenced
again?

>
> Creation/removal of files/directories in tracefs should not be
> a fast-path anyway, so who cares about the speed of a find first
> bit within a single page ?
>

When an inode is no longer referenced, it is freed. When it is referenced
again, I want it to be recreated with the same inode number it had
previously. How would having a bitmask help with that? I need a way to map
an ei structure with a unique number without adding another 4 bytes to the
structure itself.

-- Steve

2024-01-22 18:29:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On Mon, 22 Jan 2024 at 09:37, Linus Torvalds
<[email protected]> wrote:
>
> Yeah, limiting it to directories will at least somewhat help the
> address leaking.

Actually, why not juist add an inode number to your data structures,
at least for directories? And just do a static increment on it as they
get registered?

That avoids the whole issue with possibly leaking kernel address data.

Linus

2024-01-22 18:58:49

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On 2024-01-22 13:19, Linus Torvalds wrote:
> On Mon, 22 Jan 2024 at 09:39, Linus Torvalds
> <[email protected]> wrote:
>>
>> Actually, why not juist add an inode number to your data structures,
>> at least for directories? And just do a static increment on it as they
>> get registered?
>>
>> That avoids the whole issue with possibly leaking kernel address data.
>
> The 'nlink = 1' thing doesn't seem to make 'find' any happier for this
> case, sadly.
>
> But the inode number in the 'struct eventfs_inode' looks trivial. And
> doesn't even grow that structure on 64-bit architectures at least,
> because the struct is already 64-bit aligned, and had only one 32-bit
> entry at the end.
>
> On 32-bit architectures the structure size grows, but I'm not sure the
> allocation size grows. Our kmalloc() is quantized at odd numbers.
>
> IOW, this trivial patch seems to be much safer than worrying about
> some pointer exposure.

My only concern about the simple ino_counter static increment is what
happens in the unlikely scenario of a 32-bit overflow. This is why
I suggested using a bitmap to track inode allocation. It's compact, and
we don't care that much about the linear bitmap scan overhead because
it's far from being a fast path.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2024-01-22 19:11:46

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On 2024-01-22 12:50, Steven Rostedt wrote:
> On Mon, 22 Jan 2024 12:14:36 -0500
> Mathieu Desnoyers <[email protected]> wrote:
[...]
>> On my 6.1.0 kernel:
>>
>> find /sys/kernel/tracing | wc -l
>> 15598
>>
>> (mainly due to TRACE_EVENT ABI files)
>>
>> Hashing risks:
>>
>> - Exposing kernel addresses if the hashing algorithm is broken,
>
> Well this was my biggest concern, but if I truncate at least a nibble, with
> the unique salt to the algorithm for each file, how easily does that expose
> kernel addresses.
>
> The ei itself, is created from kmalloc() so you would at best get a heap
> address. But with the missing nibble (if I mask it with ((1 << 28) - 1),
> and much more taken away for 64 bit systems), and the added unique salt, is
> it possible for this to expose anything that could be used in an attack?

I don't know, which is why I am concerned about it.

But I don't think we should spend time trying to understand all
possible attack scenarios associated with hashing of kernel addresses
when there are much simpler options available.

>
>> - Collisions if users are unlucky (which could trigger those
>> 'find' errors).
>>
>> Those 15598 inode values fit within a single page (bitmap of
>> 1922 bytes).
>>
>> So I would recommend simply adding a bitmap per tracefs filesystem
>> instance to keep track of inode number allocation.
>
> And how do I recover this bit after the inode is freed, but then referenced
> again?

You would keep the allocated inode number value within your data
structure associated with the inode.

If you never free inodes, then you can just use a static increment
as Linus suggested. But AFAIU there are cases where you free inodes,
hence my suggestion of bitmap.

When the inode is freed, you know which inode number is associated from
the field in your data structure, so you can clear this bit in the bitmap.

On the next inode allocation, you find-first-zero-bit in the bitmap, and
set it to one to reserve it.

>
>>
>> Creation/removal of files/directories in tracefs should not be
>> a fast-path anyway, so who cares about the speed of a find first
>> bit within a single page ?
>>
>
> When an inode is no longer referenced, it is freed. When it is referenced
> again, I want it to be recreated with the same inode number it had
> previously. How would having a bitmask help with that? I need a way to map
> an ei structure with a unique number without adding another 4 bytes to the
> structure itself.

As discussed in a separate exchange with Linus, why do you care so much about
not adding a 4 bytes field to the structure ?

Thanks,

Mathieu



--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2024-01-22 19:13:01

by Kees Cook

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same



On January 22, 2024 10:19:12 AM PST, Linus Torvalds <[email protected]> wrote:
>On Mon, 22 Jan 2024 at 09:39, Linus Torvalds
><[email protected]> wrote:
>>
>> Actually, why not juist add an inode number to your data structures,
>> at least for directories? And just do a static increment on it as they
>> get registered?

Yeah, this is what I'd suggest too. It avoids all the hash complexity. Is wrap-around realistic for it?

-Kees

--
Kees Cook

2024-01-22 19:13:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On Mon, 22 Jan 2024 at 09:39, Linus Torvalds
<[email protected]> wrote:
>
> Actually, why not juist add an inode number to your data structures,
> at least for directories? And just do a static increment on it as they
> get registered?
>
> That avoids the whole issue with possibly leaking kernel address data.

The 'nlink = 1' thing doesn't seem to make 'find' any happier for this
case, sadly.

But the inode number in the 'struct eventfs_inode' looks trivial. And
doesn't even grow that structure on 64-bit architectures at least,
because the struct is already 64-bit aligned, and had only one 32-bit
entry at the end.

On 32-bit architectures the structure size grows, but I'm not sure the
allocation size grows. Our kmalloc() is quantized at odd numbers.

IOW, this trivial patch seems to be much safer than worrying about
some pointer exposure.

Linus


Attachments:
patch.diff (1.56 kB)

2024-01-22 19:36:26

by Steven Rostedt

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On Mon, 22 Jan 2024 13:27:25 -0500
Mathieu Desnoyers <[email protected]> wrote:

> > IOW, this trivial patch seems to be much safer than worrying about
> > some pointer exposure.
>
> My only concern about the simple ino_counter static increment is what
> happens in the unlikely scenario of a 32-bit overflow. This is why
> I suggested using a bitmap to track inode allocation. It's compact, and
> we don't care that much about the linear bitmap scan overhead because
> it's far from being a fast path.

The original code use to get its inode via "get_next_ino()" I don't think
there's any reason not to just go back and do that again.

It can still overflow, but it's not anything new that couldn't have happen
to debugfs and tracefs over the last decade.

-- Steve

2024-01-22 19:46:38

by Steven Rostedt

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On Mon, 22 Jan 2024 10:19:12 -0800
Linus Torvalds <[email protected]> wrote:

> On Mon, 22 Jan 2024 at 09:39, Linus Torvalds
> <[email protected]> wrote:
> >
> > Actually, why not juist add an inode number to your data structures,
> > at least for directories? And just do a static increment on it as they
> > get registered?
> >
> > That avoids the whole issue with possibly leaking kernel address data.
>
> The 'nlink = 1' thing doesn't seem to make 'find' any happier for this
> case, sadly.
>
> But the inode number in the 'struct eventfs_inode' looks trivial. And
> doesn't even grow that structure on 64-bit architectures at least,
> because the struct is already 64-bit aligned, and had only one 32-bit
> entry at the end.
>
> On 32-bit architectures the structure size grows, but I'm not sure the
> allocation size grows. Our kmalloc() is quantized at odd numbers.
>
> IOW, this trivial patch seems to be much safer than worrying about
> some pointer exposure.

I originally wanted to avoid the addition of the 4 bytes, but your comment
about it not making a difference on 64bit due to alignment makes sense.

Slightly different version below.

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 6795fda2af19..6b211522a13e 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -34,7 +34,15 @@ static DEFINE_MUTEX(eventfs_mutex);

/* Choose something "unique" ;-) */
#define EVENTFS_FILE_INODE_INO 0x12c4e37
-#define EVENTFS_DIR_INODE_INO 0x134b2f5
+
+/* Just try to make something consistent and unique */
+static int eventfs_dir_ino(struct eventfs_inode *ei)
+{
+ if (!ei->ino)
+ ei->ino = get_next_ino();
+
+ return ei->ino;
+}

/*
* The eventfs_inode (ei) itself is protected by SRCU. It is released from
@@ -396,7 +404,7 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
inode->i_fop = &eventfs_file_operations;

/* All directories will have the same inode number */
- inode->i_ino = EVENTFS_DIR_INODE_INO;
+ inode->i_ino = eventfs_dir_ino(ei);

ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;
@@ -802,7 +810,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)

name = ei_child->name;

- ino = EVENTFS_DIR_INODE_INO;
+ ino = eventfs_dir_ino(ei_child);

if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR))
goto out_dec;
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 12b7d0150ae9..1a574d306ea9 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -64,6 +64,7 @@ struct eventfs_inode {
struct llist_node llist;
struct rcu_head rcu;
};
+ unsigned int ino;
unsigned int is_freed:1;
unsigned int is_events:1;
unsigned int nr_entries:30;

2024-01-22 19:48:09

by Steven Rostedt

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On Mon, 22 Jan 2024 14:44:43 -0500
Steven Rostedt <[email protected]> wrote:

> Slightly different version below.

The main difference between this and your patch is that the inode numbers
are only generated when needed, and files that are never referenced, will
not add to the possibility of overflow or collision.

-- Steve

2024-01-22 19:59:44

by Steven Rostedt

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On Mon, 22 Jan 2024 13:35:43 -0500
Mathieu Desnoyers <[email protected]> wrote:

> > When an inode is no longer referenced, it is freed. When it is referenced
> > again, I want it to be recreated with the same inode number it had
> > previously. How would having a bitmask help with that? I need a way to map
> > an ei structure with a unique number without adding another 4 bytes to the
> > structure itself.
>
> As discussed in a separate exchange with Linus, why do you care so much about
> not adding a 4 bytes field to the structure ?

I'm trying to keep the memory overhead of tracing down as much as possible.
4 bytes for 2000 events (and growing) just adds to the memory footprint of
tracing.

And an eventfs_inode is the link between control of an event per instance.
Thus, it may only be 8k for those 2000 events, but that's another 8k for
each instance you use. You make 10 instances, that is now 80k.

This is used in embedded systems as well, so every byte counts.

But as Linus pointed out, the issue is moot, as the structure ends with a
single 32bit int. And on 64 bit machines, there's likely a 4 byte hole that
we can use there.

-- Steve

2024-01-22 21:35:34

by Kees Cook

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On Mon, Jan 22, 2024 at 02:44:43PM -0500, Steven Rostedt wrote:
> On Mon, 22 Jan 2024 10:19:12 -0800
> Linus Torvalds <[email protected]> wrote:
>
> > On Mon, 22 Jan 2024 at 09:39, Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > Actually, why not juist add an inode number to your data structures,
> > > at least for directories? And just do a static increment on it as they
> > > get registered?
> > >
> > > That avoids the whole issue with possibly leaking kernel address data.
> >
> > The 'nlink = 1' thing doesn't seem to make 'find' any happier for this
> > case, sadly.
> >
> > But the inode number in the 'struct eventfs_inode' looks trivial. And
> > doesn't even grow that structure on 64-bit architectures at least,
> > because the struct is already 64-bit aligned, and had only one 32-bit
> > entry at the end.
> >
> > On 32-bit architectures the structure size grows, but I'm not sure the
> > allocation size grows. Our kmalloc() is quantized at odd numbers.
> >
> > IOW, this trivial patch seems to be much safer than worrying about
> > some pointer exposure.
>
> I originally wanted to avoid the addition of the 4 bytes, but your comment
> about it not making a difference on 64bit due to alignment makes sense.
>
> Slightly different version below.
>
> -- Steve
>
> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index 6795fda2af19..6b211522a13e 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -34,7 +34,15 @@ static DEFINE_MUTEX(eventfs_mutex);
>
> /* Choose something "unique" ;-) */
> #define EVENTFS_FILE_INODE_INO 0x12c4e37
> -#define EVENTFS_DIR_INODE_INO 0x134b2f5
> +
> +/* Just try to make something consistent and unique */
> +static int eventfs_dir_ino(struct eventfs_inode *ei)
> +{
> + if (!ei->ino)
> + ei->ino = get_next_ino();
> +
> + return ei->ino;
> +}
>
> /*
> * The eventfs_inode (ei) itself is protected by SRCU. It is released from
> @@ -396,7 +404,7 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
> inode->i_fop = &eventfs_file_operations;
>
> /* All directories will have the same inode number */
> - inode->i_ino = EVENTFS_DIR_INODE_INO;
> + inode->i_ino = eventfs_dir_ino(ei);
>
> ti = get_tracefs(inode);
> ti->flags |= TRACEFS_EVENT_INODE;
> @@ -802,7 +810,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
>
> name = ei_child->name;
>
> - ino = EVENTFS_DIR_INODE_INO;
> + ino = eventfs_dir_ino(ei_child);
>
> if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR))
> goto out_dec;
> diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
> index 12b7d0150ae9..1a574d306ea9 100644
> --- a/fs/tracefs/internal.h
> +++ b/fs/tracefs/internal.h
> @@ -64,6 +64,7 @@ struct eventfs_inode {
> struct llist_node llist;
> struct rcu_head rcu;
> };
> + unsigned int ino;
> unsigned int is_freed:1;
> unsigned int is_events:1;
> unsigned int nr_entries:30;

I like it! :)

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2024-01-25 17:40:45

by Christian Brauner

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On Mon, Jan 22, 2024 at 02:44:43PM -0500, Steven Rostedt wrote:
> On Mon, 22 Jan 2024 10:19:12 -0800
> Linus Torvalds <[email protected]> wrote:
>
> > On Mon, 22 Jan 2024 at 09:39, Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > Actually, why not juist add an inode number to your data structures,
> > > at least for directories? And just do a static increment on it as they
> > > get registered?
> > >
> > > That avoids the whole issue with possibly leaking kernel address data.
> >
> > The 'nlink = 1' thing doesn't seem to make 'find' any happier for this
> > case, sadly.
> >
> > But the inode number in the 'struct eventfs_inode' looks trivial. And
> > doesn't even grow that structure on 64-bit architectures at least,
> > because the struct is already 64-bit aligned, and had only one 32-bit
> > entry at the end.
> >
> > On 32-bit architectures the structure size grows, but I'm not sure the
> > allocation size grows. Our kmalloc() is quantized at odd numbers.
> >
> > IOW, this trivial patch seems to be much safer than worrying about
> > some pointer exposure.
>
> I originally wanted to avoid the addition of the 4 bytes, but your comment
> about it not making a difference on 64bit due to alignment makes sense.

Non-unique inode numbers aren't exactly great for userspace and there
are a surprisingly small yet large enough number of tools that trip over
this in various ways. So if that can be avoided then great.

But luckily no one is probably going to tar up tracefs. ;)

2024-01-25 18:07:40

by Steven Rostedt

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On Thu, 25 Jan 2024 18:40:31 +0100
Christian Brauner <[email protected]> wrote:

> But luckily no one is probably going to tar up tracefs. ;)

Actually, inodes isn't the biggest issue of tar, as tar *is* a common
operation on tracefs.

If you want to create a synthetic event using the sqlhist tool for an
embedded board, we recommend copying the tracefs directory over to your
workstation from the embedded device. Unfortunately tar never works. That's
because all tracefs (and debugfs) files have zero size in stat().

# cd /tmp
# (cd /sys/kernel && tar cvf - tracing) | tar xvf -
# ls -s tracing
total 28
0 available_events 0 max_graph_depth 0 stack_trace
0 available_filter_functions 4 options 0 stack_trace_filter
0 available_filter_functions_addrs 4 osnoise 0 synthetic_events
0 available_tracers 4 per_cpu 0 timestamp_mode
0 buffer_percent 0 printk_formats 0 touched_functions
0 buffer_size_kb 0 README 0 trace
0 buffer_subbuf_size_kb 0 recursed_functions 0 trace_clock
0 buffer_total_size_kb 0 saved_cmdlines 0 trace_marker
0 current_tracer 0 saved_cmdlines_size 0 trace_marker_raw
0 dynamic_events 0 saved_tgids 0 trace_options
0 dyn_ftrace_total_info 0 set_event 0 trace_pipe
0 enabled_functions 0 set_event_notrace_pid 4 trace_stat
0 error_log 0 set_event_pid 0 tracing_cpumask
0 eval_map 0 set_ftrace_filter 0 tracing_max_latency
4 events 0 set_ftrace_notrace 0 tracing_on
0 free_buffer 0 set_ftrace_notrace_pid 0 tracing_thresh
0 function_profile_enabled 0 set_ftrace_pid 0 uprobe_events
4 hwlat_detector 0 set_graph_function 0 uprobe_profile
4 instances 0 set_graph_notrace 0 user_events_data
0 kprobe_events 0 snapshot 0 user_events_status
0 kprobe_profile 0 stack_max_size

So instead we have been recommending cp -r

Note, for the sqlhist command, only the events are needed, so the
instructions is only to copy the events directory, because copying all of
tracefs will try to copy the "trace_pipe" file which will block which hangs
'cp'. And I don't know an option to tell the cp command not to block.

# cd /tmp
# mkdir tracing
# cp -r /sys/kernel/tracing/events tracing/events
# ls -s tracing/events/sched/sched_switch/
total 20
4 enable 4 filter 4 format 0 hist 0 hist_debug 4 id 0 inject 4 trigger

Where the tar would have had:

# ls -s tracing/events/sched/sched_switch/
total 0
0 enable 0 filter 0 format 0 hist 0 hist_debug 0 id 0 inject 0 trigger


-- Steve

2024-01-25 18:09:26

by Steven Rostedt

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On Thu, 25 Jan 2024 13:07:31 -0500
Steven Rostedt <[email protected]> wrote:

> Actually, inodes isn't the biggest issue of tar, as tar *is* a common
> operation on tracefs.

Correction. tar would be a common operation if it worked ;-)

-- Steve

2024-01-26 09:10:40

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

Hi Steven.

On Thu, Jan 25, 2024 at 7:08 PM Steven Rostedt <[email protected]> wrote:
> On Thu, 25 Jan 2024 13:07:31 -0500
> Steven Rostedt <[email protected]> wrote:
> > Actually, inodes isn't the biggest issue of tar, as tar *is* a common
> > operation on tracefs.
>
> Correction. tar would be a common operation if it worked ;-)

What would be needed to fix that? I regularly use tar on other virtual
file systems (e.g. /sys/firmware/devicetree/), which works fine.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68korg

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2024-01-26 10:29:30

by Christian Brauner

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On Fri, Jan 26, 2024 at 09:07:06AM +0100, Geert Uytterhoeven wrote:
> Hi Steven.
>
> On Thu, Jan 25, 2024 at 7:08 PM Steven Rostedt <[email protected]> wrote:
> > On Thu, 25 Jan 2024 13:07:31 -0500
> > Steven Rostedt <[email protected]> wrote:
> > > Actually, inodes isn't the biggest issue of tar, as tar *is* a common
> > > operation on tracefs.
> >
> > Correction. tar would be a common operation if it worked ;-)
>
> What would be needed to fix that? I regularly use tar on other virtual
> file systems (e.g. /sys/firmware/devicetree/), which works fine.

The size would be one thing. The other is that tar requires unique inode
numbers for all files iirc (That's why we have this whole btrfs problem
- let's not get into this here - where inode numbers aren't unique and
are duplicated per subvolume.).

2024-01-26 13:16:27

by Steven Rostedt

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On Fri, 26 Jan 2024 09:07:06 +0100
Geert Uytterhoeven <[email protected]> wrote:

> Hi Steven.
>
> On Thu, Jan 25, 2024 at 7:08 PM Steven Rostedt <[email protected]> wrote:
> > On Thu, 25 Jan 2024 13:07:31 -0500
> > Steven Rostedt <[email protected]> wrote:
> > > Actually, inodes isn't the biggest issue of tar, as tar *is* a common
> > > operation on tracefs.
> >
> > Correction. tar would be a common operation if it worked ;-)
>
> What would be needed to fix that? I regularly use tar on other virtual
> file systems (e.g. /sys/firmware/devicetree/), which works fine.

Looks like all /sys files have one page in size. I could change the default
file size to one page and it might work (if the inodes had different
numbers), as I don't see any format file greater than 4k.

This would fix the events for tar, but it couldn't do the same for tracing.
As some files don't actually have a size.

-- Steve




2024-01-26 14:06:10

by Steven Rostedt

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On Fri, 26 Jan 2024 08:16:16 -0500
Steven Rostedt <[email protected]> wrote:

> On Fri, 26 Jan 2024 09:07:06 +0100
> Geert Uytterhoeven <[email protected]> wrote:
>
> > Hi Steven.
> >
> > On Thu, Jan 25, 2024 at 7:08 PM Steven Rostedt <rostedt@goodmisorg> wrote:
> > > On Thu, 25 Jan 2024 13:07:31 -0500
> > > Steven Rostedt <[email protected]> wrote:
> > > > Actually, inodes isn't the biggest issue of tar, as tar *is* a common
> > > > operation on tracefs.
> > >
> > > Correction. tar would be a common operation if it worked ;-)
> >
> > What would be needed to fix that? I regularly use tar on other virtual
> > file systems (e.g. /sys/firmware/devicetree/), which works fine.
>
> Looks like all /sys files have one page in size. I could change the default
> file size to one page and it might work (if the inodes had different
> numbers), as I don't see any format file greater than 4k.
>
> This would fix the events for tar, but it couldn't do the same for tracing.
> As some files don't actually have a size.

And this patch makes a unique inode number for files too.

I also found a slight bug where the inode number is generated twice. Once
to create the inode, and then again overwriting it with the eventfs inode
logic. That just makes it more likely for the inode numbers to overflow.

This fixes that too.

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 6b211522a13e..7be7a694b106 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -32,14 +32,11 @@
*/
static DEFINE_MUTEX(eventfs_mutex);

-/* Choose something "unique" ;-) */
-#define EVENTFS_FILE_INODE_INO 0x12c4e37
-
/* Just try to make something consistent and unique */
-static int eventfs_dir_ino(struct eventfs_inode *ei)
+static int eventfs_dir_ino(struct eventfs_inode *ei, int nr_files)
{
if (!ei->ino)
- ei->ino = get_next_ino();
+ ei->ino = tracefs_get_next_ino(nr_files);

return ei->ino;
}
@@ -327,6 +324,7 @@ void eventfs_update_gid(struct dentry *dentry, kgid_t gid)
* @parent: parent dentry for this file.
* @data: something that the caller will want to get to later on.
* @fop: struct file_operations that should be used for this file.
+ * @ino: inode number for this file
*
* This function creates a dentry that represents a file in the eventsfs_inode
* directory. The inode.i_private pointer will point to @data in the open()
@@ -335,7 +333,8 @@ void eventfs_update_gid(struct dentry *dentry, kgid_t gid)
static struct dentry *create_file(const char *name, umode_t mode,
struct eventfs_attr *attr,
struct dentry *parent, void *data,
- const struct file_operations *fop)
+ const struct file_operations *fop,
+ unsigned int ino)
{
struct tracefs_inode *ti;
struct dentry *dentry;
@@ -363,9 +362,7 @@ static struct dentry *create_file(const char *name, umode_t mode,
inode->i_op = &eventfs_file_inode_operations;
inode->i_fop = fop;
inode->i_private = data;
-
- /* All files will have the same inode number */
- inode->i_ino = EVENTFS_FILE_INODE_INO;
+ inode->i_ino = ino;

ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;
@@ -377,12 +374,14 @@ static struct dentry *create_file(const char *name, umode_t mode,
/**
* create_dir - create a dir in the tracefs filesystem
* @ei: the eventfs_inode that represents the directory to create
- * @parent: parent dentry for this file.
+ * @parent: parent dentry for this directory.
+ * @nr_files: The number of files (not directories) this directory has
*
* This function will create a dentry for a directory represented by
* a eventfs_inode.
*/
-static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent)
+static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent,
+ int nr_files)
{
struct tracefs_inode *ti;
struct dentry *dentry;
@@ -404,7 +403,7 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
inode->i_fop = &eventfs_file_operations;

/* All directories will have the same inode number */
- inode->i_ino = eventfs_dir_ino(ei);
+ inode->i_ino = eventfs_dir_ino(ei, nr_files);

ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;
@@ -504,7 +503,7 @@ create_file_dentry(struct eventfs_inode *ei, int idx,

mutex_unlock(&eventfs_mutex);

- dentry = create_file(name, mode, attr, parent, data, fops);
+ dentry = create_file(name, mode, attr, parent, data, fops, ei->ino + idx + 1);

mutex_lock(&eventfs_mutex);

@@ -598,7 +597,7 @@ create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei,
}
mutex_unlock(&eventfs_mutex);

- dentry = create_dir(ei, parent);
+ dentry = create_dir(ei, parent, ei->nr_entries);

mutex_lock(&eventfs_mutex);

@@ -786,7 +785,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
if (r <= 0)
continue;

- ino = EVENTFS_FILE_INODE_INO;
+ ino = ei->ino + i + 1;

if (!dir_emit(ctx, name, strlen(name), ino, DT_REG))
goto out;
@@ -810,7 +809,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)

name = ei_child->name;

- ino = eventfs_dir_ino(ei_child);
+ ino = eventfs_dir_ino(ei_child, ei_child->nr_entries);

if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR))
goto out_dec;
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index e1b172c0e091..2187be6d7b23 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -223,13 +223,41 @@ static const struct inode_operations tracefs_file_inode_operations = {
.setattr = tracefs_setattr,
};

+/* Copied from get_next_ino() but adds allocation for multiple inodes */
+#define LAST_INO_BATCH 1024
+#define LAST_INO_MASK (~(LAST_INO_BATCH - 1))
+static DEFINE_PER_CPU(unsigned int, last_ino);
+
+unsigned int tracefs_get_next_ino(int files)
+{
+ unsigned int *p = &get_cpu_var(last_ino);
+ unsigned int res = *p;
+
+#ifdef CONFIG_SMP
+ /* Check if adding files+1 overflows */
+ if (unlikely(!res || (res & LAST_INO_MASK) != ((res + files + 1) & LAST_INO_MASK))) {
+ static atomic_t shared_last_ino;
+ int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
+
+ res = next - LAST_INO_BATCH;
+ }
+#endif
+
+ res++;
+ /* get_next_ino should not provide a 0 inode number */
+ if (unlikely(!res))
+ res++;
+ *p = res + files;
+ put_cpu_var(last_ino);
+ return res;
+}
+
struct inode *tracefs_get_inode(struct super_block *sb)
{
struct inode *inode = new_inode(sb);
- if (inode) {
- inode->i_ino = get_next_ino();
+ if (inode)
simple_inode_init_ts(inode);
- }
+
return inode;
}

@@ -644,6 +672,8 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode,
inode->i_private = data;
inode->i_uid = d_inode(dentry->d_parent)->i_uid;
inode->i_gid = d_inode(dentry->d_parent)->i_gid;
+ inode->i_ino = tracefs_get_next_ino(0);
+
d_instantiate(dentry, inode);
fsnotify_create(d_inode(dentry->d_parent), dentry);
return tracefs_end_creating(dentry);
@@ -669,6 +699,7 @@ static struct dentry *__create_dir(const char *name, struct dentry *parent,
inode->i_fop = &simple_dir_operations;
inode->i_uid = d_inode(dentry->d_parent)->i_uid;
inode->i_gid = d_inode(dentry->d_parent)->i_gid;
+ inode->i_ino = tracefs_get_next_ino(0);

ti = get_tracefs(inode);
ti->private = instance_inode(parent, inode);
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 45397df9bb65..7dd6678229d0 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -75,6 +75,7 @@ static inline struct tracefs_inode *get_tracefs(const struct inode *inode)
return container_of(inode, struct tracefs_inode, vfs_inode);
}

+unsigned int tracefs_get_next_ino(int files);
struct dentry *tracefs_start_creating(const char *name, struct dentry *parent);
struct dentry *tracefs_end_creating(struct dentry *dentry);
struct dentry *tracefs_failed_creating(struct dentry *dentry);


2024-01-26 16:27:11

by Steven Rostedt

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On Fri, 26 Jan 2024 11:11:39 +0100
Christian Brauner <[email protected]> wrote:

> The size would be one thing. The other is that tar requires unique inode
> numbers for all files iirc (That's why we have this whole btrfs problem
> - let's not get into this here - where inode numbers aren't unique and
> are duplicated per subvolume.).

Well, I guess that answers Linus's question about wondering if there's any
user space program that actually cares what the inodes are for files. The
answer is "yes" and the program is "tar".

And because tar cares, I think I should fix it for tracefs even if it
doesn't work because of size. But the size issue is a trivial fix if I just
default it to 1 page.

I currently use get_next_ino(), but I can use my own version of that, and
because each directory has a fixed number of files, I could have it be:

/* Copied from get_next_ino but adds allocation for multiple inodes */
#define LAST_INO_BATCH 1024
#define LAST_INO_MASK (~(LAST_INO_BATCH - 1))
static DEFINE_PER_CPU(unsigned int, last_ino);

unsigned int tracefs_get_next_ino(int files)
{
unsigned int *p = &get_cpu_var(last_ino);
unsigned int res = *p;

#ifdef CONFIG_SMP
/* Check if adding files+1 overflows */
if (unlikely(!res || (res & LAST_INO_MASK) != ((res + files + 1) & LAST_INO_MASK))) {
static atomic_t shared_last_ino;
int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);

res = next - LAST_INO_BATCH;
}
#endif

res++;
/* get_next_ino should not provide a 0 inode number */
if (unlikely(!res))
res++;
*p = res + files;
put_cpu_var(last_ino);
return res;
}


This way the eventfs inode would tell tracefs_get_next_ino() how many inode
numbers it needs for its files and then when it creates the file inode, it
can use:

inode->i_ino = ei->ino + idx;

Where idx is the index into the d_children array of the directory's
eventfs_inode.

-- Steve

2024-01-26 19:10:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [for-linus][PATCH 1/3] eventfs: Have the inodes all for files and directories all be the same

On Fri, 26 Jan 2024 at 08:26, Steven Rostedt <[email protected]> wrote:
>
> On Fri, 26 Jan 2024 11:11:39 +0100
> Christian Brauner <[email protected]> wrote:
>
> > The size would be one thing. The other is that tar requires unique inode
> > numbers for all files iirc (That's why we have this whole btrfs problem
> > - let's not get into this here - where inode numbers aren't unique and
> > are duplicated per subvolume.).
>
> Well, I guess that answers Linus's question about wondering if there's any
> user space program that actually cares what the inodes are for files. The
> answer is "yes" and the program is "tar".

Well, the fact that it hits snapshots, shows that the real problem is
just "tar does stupid things that it shouldn't do".

Yes, inode numbers used to be special, and there's history behind it.
But we should basically try very hard to walk away from that broken
history.

An inode number just isn't a unique descriptor any more. We're not
living in the 1970s, and filesystems have changed.

Linus