2024-01-29 02:59:23

by kernel test robot

[permalink] [raw]
Subject: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address



Hello,

kernel test robot noticed "BUG:unable_to_handle_page_fault_for_address" on:

commit: 852e46e239ee6db3cd220614cf8bce96e79227c2 ("eventfs: Do not create dentries nor inodes in iterate_shared")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

[test failed on linus/master 7ed2632ec7d72e926b9e8bcc9ad1bb0cd37274bf]
[test failed on linux-next/master 774551425799cb5bbac94e1768fd69eec4f78dd4]

in testcase: stress-ng
version: stress-ng-x86_64-1744999cb-1_20240112
with following parameters:

nr_threads: 10%
disk: 1HDD
testtime: 60s
fs: xfs
class: filesystem
test: getdent
cpufreq_governor: performance



compiler: gcc-12
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/[email protected]


[ 52.318955][ T4398] BUG: unable to handle page fault for address: 0000042c000004db
[ 52.326780][ T4398] #PF: supervisor read access in kernel mode
[ 52.332830][ T4398] #PF: error_code(0x0000) - not-present page
[ 52.338864][ T4398] PGD 186970067 P4D 0
[ 52.342993][ T4398] Oops: 0000 [#1] SMP NOPTI
[ 52.347552][ T4398] CPU: 32 PID: 4398 Comm: stress-ng-getde Not tainted 6.7.0-rc2-00049-g852e46e239ee #1
[ 52.357235][ T4398] Hardware name: Inspur NF5180M6/NF5180M6, BIOS 06.00.04 04/12/2022
[ 52.365278][ T4398] RIP: 0010:set_top_events_ownership (fs/tracefs/event_inode.c:172 (discriminator 1))
[ 52.371424][ T4398] Code: 00 00 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 8b 47 f8 48 85 c0 74 11 <f6> 40 78 02 74 0b 8b 50 50 f7 c2 00 00 08 00 75 05 c3 cc cc cc cc
All code
========
0: 00 00 add %al,(%rax)
2: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
9: 00 00 00
c: 90 nop
d: 90 nop
e: 90 nop
f: 90 nop
10: 90 nop
11: 90 nop
12: 90 nop
13: 90 nop
14: 90 nop
15: 90 nop
16: 90 nop
17: 90 nop
18: 90 nop
19: 90 nop
1a: 90 nop
1b: 90 nop
1c: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
21: 48 8b 47 f8 mov -0x8(%rdi),%rax
25: 48 85 c0 test %rax,%rax
28: 74 11 je 0x3b
2a:* f6 40 78 02 testb $0x2,0x78(%rax) <-- trapping instruction
2e: 74 0b je 0x3b
30: 8b 50 50 mov 0x50(%rax),%edx
33: f7 c2 00 00 08 00 test $0x80000,%edx
39: 75 05 jne 0x40
3b: c3 retq
3c: cc int3
3d: cc int3
3e: cc int3
3f: cc int3

Code starting with the faulting instruction
===========================================
0: f6 40 78 02 testb $0x2,0x78(%rax)
4: 74 0b je 0x11
6: 8b 50 50 mov 0x50(%rax),%edx
9: f7 c2 00 00 08 00 test $0x80000,%edx
f: 75 05 jne 0x16
11: c3 retq
12: cc int3
13: cc int3
14: cc int3
15: cc int3
[ 52.391324][ T4398] RSP: 0018:ffa000000f2efc90 EFLAGS: 00010206
[ 52.397481][ T4398] RAX: 0000042c00000463 RBX: ff110002452ae2c8 RCX: 0000000000004000
[ 52.405547][ T4398] RDX: 0000000000000024 RSI: ff110002452ae2c8 RDI: ff110002452ae2c8
[ 52.413616][ T4398] RBP: ffffffff83080320 R08: 0000000000000064 R09: ff1100018a25dd40
[ 52.421686][ T4398] R10: ffffffffffff8c98 R11: 0000000000000000 R12: 0000000000000024
[ 52.429755][ T4398] R13: ffffffff83080320 R14: ffa000000f2efedc R15: 0000000000018000
[ 52.437828][ T4398] FS: 00007f3b60e3d740(0000) GS:ff11001fffe00000(0000) knlGS:0000000000000000
[ 52.446854][ T4398] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 52.453547][ T4398] CR2: 0000042c000004db CR3: 0000000113750004 CR4: 0000000000771ef0
[ 52.461629][ T4398] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 52.469707][ T4398] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 52.477783][ T4398] PKRU: 55555554
[ 52.481444][ T4398] Call Trace:
[ 52.484848][ T4398] <TASK>
[ 52.487894][ T4398] ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434)
[ 52.491898][ T4398] ? page_fault_oops (arch/x86/mm/fault.c:707)
[ 52.496854][ T4398] ? exc_page_fault (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:72 arch/x86/mm/fault.c:1513 arch/x86/mm/fault.c:1561)
[ 52.501725][ T4398] ? asm_exc_page_fault (arch/x86/include/asm/idtentry.h:570)
[ 52.506858][ T4398] ? set_top_events_ownership (fs/tracefs/event_inode.c:172 (discriminator 1))
[ 52.512425][ T4398] eventfs_permission (fs/tracefs/event_inode.c:203)
[ 52.517390][ T4398] inode_permission (fs/namei.c:462 fs/namei.c:529 fs/namei.c:504)
[ 52.522271][ T4398] may_open (fs/namei.c:3250)
[ 52.526453][ T4398] do_open (fs/namei.c:3621)
[ 52.530540][ T4398] ? open_last_lookups (fs/namei.c:3569)
[ 52.535670][ T4398] path_openat (fs/namei.c:3780)
[ 52.540187][ T4398] do_filp_open (fs/namei.c:3809)
[ 52.544696][ T4398] do_sys_openat2 (fs/open.c:1440)
[ 52.549285][ T4398] __x64_sys_openat (fs/open.c:1466)
[ 52.554044][ T4398] do_syscall_64 (arch/x86/entry/common.c:51 arch/x86/entry/common.c:82)
[ 52.558541][ T4398] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
[ 52.564509][ T4398] RIP: 0033:0x7f3b60fcb127
[ 52.568991][ T4398] Code: 25 00 00 41 00 3d 00 00 41 00 74 47 64 8b 04 25 18 00 00 00 85 c0 75 6b 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 95 00 00 00 48 8b 4c 24 28 64 48 2b 0c 25
All code
========
0: 25 00 00 41 00 and $0x410000,%eax
5: 3d 00 00 41 00 cmp $0x410000,%eax
a: 74 47 je 0x53
c: 64 8b 04 25 18 00 00 mov %fs:0x18,%eax
13: 00
14: 85 c0 test %eax,%eax
16: 75 6b jne 0x83
18: 44 89 e2 mov %r12d,%edx
1b: 48 89 ee mov %rbp,%rsi
1e: bf 9c ff ff ff mov $0xffffff9c,%edi
23: b8 01 01 00 00 mov $0x101,%eax
28: 0f 05 syscall
2a:* 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax <-- trapping instruction
30: 0f 87 95 00 00 00 ja 0xcb
36: 48 8b 4c 24 28 mov 0x28(%rsp),%rcx
3b: 64 fs
3c: 48 rex.W
3d: 2b .byte 0x2b
3e: 0c 25 or $0x25,%al

Code starting with the faulting instruction
===========================================
0: 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax
6: 0f 87 95 00 00 00 ja 0xa1
c: 48 8b 4c 24 28 mov 0x28(%rsp),%rcx
11: 64 fs
12: 48 rex.W
13: 2b .byte 0x2b
14: 0c 25 or $0x25,%al


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240129/[email protected]



--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



2024-01-29 04:36:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Sun, 28 Jan 2024 at 18:59, kernel test robot <[email protected]> wrote:
>
> 21: 48 8b 47 f8 mov -0x8(%rdi),%rax
> 25: 48 85 c0 test %rax,%rax
> 28: 74 11 je 0x3b
> 2a:* f6 40 78 02 testb $0x2,0x78(%rax) <-- trapping instruction

So this is

struct tracefs_inode *ti = get_tracefs(inode);
struct eventfs_inode *ei = ti->private;

if (!ei || !ei->is_events || ..

in set_top_events_ownership(), and it's that 'ei->is_events' test that oopses.

The 'inode' is the incoming argument (in %rdi), and you don't see code
generation for the "get_tracefs(inode)" because it's just an offset
off the inode.

So the "ti->private" read is that read off "-8(%rdi)", because

struct tracefs_inode {
unsigned long flags;
void *private;
struct inode vfs_inode;
};

so 'private' is 8 bytes below the 'struct inode' pointer.

So 'ei' isn't NULL, but it's a bad pointer, and 'ei->is_events' is
that "testb $0x2,0x78(%rax)" and it oopses as a result.

I don't think this is directly related to that commit 852e46e239ee
("eventfs: Do not create dentries nor inodes in iterate_shared") that
the kernel test robot talks about.

It looks like some inode creation never filled in the ->private field
at all, and it's just garbage.

I note that we have code like this:

create_dir_dentry():
...
mutex_unlock(&eventfs_mutex);

dentry = create_dir(ei, parent);

mutex_lock(&eventfs_mutex);
...
if (!ei->dentry && !ei->is_freed) {
ei->dentry = dentry;
eventfs_post_create_dir(ei);
dentry->d_fsdata = ei;
} else {

and that eventfs_post_create_dir() seems to be where it fills in that
->private pointer:

ti = get_tracefs(ei->dentry->d_inode);
ti->private = ei;

but notice how create_dir() has done that

d_instantiate(dentry, inode);

which exposes the inode to lookup - long before it's actually then filled in.

IOW, what I think is going on is a very basic race, where
create_dir_dentry() will allocate the inode and attach it to the
dentry long before the inode has been fully initialized.

So if somebody comes in *immediately* after that, and does a 'stat()'
on that name that now is exposed, it will see an inode that has not
yet made it to that eventfs_post_create_dir() phase, and that in turn
explains why

struct eventfs_inode *ei = ti->private;

just reads random garbage values.

So if I'm right (big "if" - it looks likely, but who knows) this bug
is entirely unrelated to any dentry caching or any reference counting.

It just needs just the right timing, where one CPU happens to do a
'stat()' just as another one creates the directory.

It's not a big window, so you do need some timing "luck".

End result: the d_instantiate() needs to be done *after* the inode has
been fully filled in.

Alternatively, you could

(a) not drop the eventfs_mutex around the create_dir() call

(b) take the eventfs_mutex around all of set_top_events_ownership()

and just fix it by having the lock protect the lack of ordering.

Linus

2024-01-29 17:21:47

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Sun, 28 Jan 2024 20:36:12 -0800
Linus Torvalds <[email protected]> wrote:


> End result: the d_instantiate() needs to be done *after* the inode has
> been fully filled in.
>
> Alternatively, you could
>
> (a) not drop the eventfs_mutex around the create_dir() call
>
> (b) take the eventfs_mutex around all of set_top_events_ownership()
>
> and just fix it by having the lock protect the lack of ordering.

Hi Linus,

Thanks for the analysis. I have a patch that removes the dropping of the
mutex over the create_dir/file() calls, and lockdep hasn't complained about
it.

I was going to add that to my queue for the next merge window along with
other clean ups but this looks like it actually fixes a real bug. I'll move
that over to my urgent queue and start testing it.

-- Steve


2024-01-29 17:53:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 at 09:01, Steven Rostedt <[email protected]> wrote:
>
> Thanks for the analysis. I have a patch that removes the dropping of the
> mutex over the create_dir/file() calls, and lockdep hasn't complained about
> it.
>
> I was going to add that to my queue for the next merge window along with
> other clean ups but this looks like it actually fixes a real bug. I'll move
> that over to my urgent queue and start testing it.

Note that it is *not* enough to just keep the mutex.

All the *users* need to get the mutex too.

Otherwise you have this:

CPU1:

create_dir_dentry():
mutex locked the whole time..
dentry = create_dir(ei, parent);
does d_instantiate(dentry, inode);
eventfs_post_create_dir(ei);
dentry->d_fsdata = ei;
mutex dropped;

but CPU2 can still come in, see the dentry immediately after the
"d_instantiate()", and do an "open()" or "stat()" on the dentry (which
will *not* cause a 'lookup()', since it's in the dentry cache), and
that will then cause either

->permission() -> eventfs_permission() -> set_top_events_ownership()

or

->get_attr() -> eventfs_get_attr() -> set_top_events_ownership()

and both of those will now do the dentry->inode->ei lookup. And
neither of them takes the mutex.

So then it doesn't even matter that you didn't drop the mutex in the
middle, because the users simply won't be serializing with it anyway.

So you'd have to make set_top_events_ownership() take the mutex around
it all too.

In fact, pretty much *any* use of "ti->private" needs the mutex.

Which is obviously a bit painful.

Honestly, I think the right model is to just make sure that the inode
is fully initialized when you do 'd_instantiate()'

The patch looks obvious, and I think this actually fixes *another*
bug, namely that the old

ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;

was buggy, because 'ti->flags' was uninitialized before.

eventfs_create_events_dir() seems to have the same bug with ti->flags,
btw, but got the ti->private initialization right.

Funnily enough, create_file() got this right. I don't even understand
why create_dir() did what it did.

IOW, I think the right fix is really just this:

--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -328,7 +328,8 @@
inode->i_ino = EVENTFS_FILE_INODE_INO;

ti = get_tracefs(inode);
- ti->flags |= TRACEFS_EVENT_INODE;
+ ti->flags = TRACEFS_EVENT_INODE;
+ ti->private = ei;
d_instantiate(dentry, inode);
fsnotify_create(dentry->d_parent->d_inode, dentry);
return eventfs_end_creating(dentry);
@@ -513,7 +514,6 @@
static void eventfs_post_create_dir(struct eventfs_inode *ei)
{
struct eventfs_inode *ei_child;
- struct tracefs_inode *ti;

lockdep_assert_held(&eventfs_mutex);

@@ -523,9 +523,6 @@
srcu_read_lock_held(&eventfs_srcu)) {
ei_child->d_parent = ei->dentry;
}
-
- ti = get_tracefs(ei->dentry->d_inode);
- ti->private = ei;
}

/**
@@ -943,7 +940,7 @@
INIT_LIST_HEAD(&ei->list);

ti = get_tracefs(inode);
- ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
+ ti->flags = TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
ti->private = ei;

inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;

which fixes the initialization errors with the 'ti' fields.

Now, I do agree that your locking is strange, and that should be fixed
*too*, but I think the above is the "right" fix for this particular
issue.

Linus

2024-01-29 17:58:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 at 09:40, Linus Torvalds
<[email protected]> wrote:
>
> eventfs_create_events_dir() seems to have the same bug with ti->flags,
> btw, but got the ti->private initialization right.
>
> Funnily enough, create_file() got this right. I don't even understand
> why create_dir() did what it did.
>
> IOW, I think the right fix is really just this:

Actually, I think you have another uninitialized field here too:
'dentry->d_fsdata'.

And it looks like both create_file and create_dir got that wrong, but
eventfs_create_events_dir() got it right.

So you *also* need to do that

dentry->d_fsdata = ei;

before you do the d_instantiate().

Now, from a quick look, all the d_fsdata accesses *do* seem to be
protected by the eventfs_mutex, except

(a) eventfs_create_events_dir() doesn't seem to take the mutex, but
gets the ordering right, so is ok

(b) create_dir_dentry() drops the mutex in the middle, so the mutex
doesn't actually serialize anything

Not dropping the mutex in create_dir_dentry() *will* fix this bug, but
honestly, I'd suggest that in addition to not dropping the mutex, you
just fix the ordering too.

IOW, just do that

dentry->d_fsdata = ei;

before you do d_instantiate(), and now accessing d_fsdata is just
always safe and doesn't even need the mutex.

The whole "initialize everything before exposing it to others" is
simply just a good idea.

Linus

2024-01-29 19:24:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 at 09:40, Linus Torvalds
<[email protected]> wrote:
>
> IOW, I think the right fix is really just this:

Oh, for some reason I sent out the original patch I had which didn't
fix the create_dir() case.

So that patch was missing the important hunk that added the

ti->flags = TRACEFS_EVENT_INODE;
ti->private = ei;

to create_dir() (to match the removal in eventfs_post_create_dir()).

I had incorrectly put it in the create_file() case, that should just
set ->private to NULL. afaik

So the patch was completely broken. Here's the one that should
actually compile (although still not actually *tested*).

Linus


Attachments:
0001-eventfsfs-initialize-the-tracefs-inode-properly.patch (2.48 kB)

2024-01-29 19:54:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 at 11:24, Linus Torvalds
<[email protected]> wrote:
>
> So the patch was completely broken. Here's the one that should
> actually compile (although still not actually *tested*).

Note that this fixes the d_instantiate() ordering wrt initializing the inode.

But as I look up the call chain, I see many more fundamental mistakes.

Steven - the reason you think that the VFS doesn't have documentation
is that we *do* have tons of documentation, but it's of the kind "Here
is what you should do".

It is *not* of the kind that says "You messed up and did something
else, and how do you recover from it?".

So the fundamental bug I now find is that eventfs_root_lookup() gets a
target dentry, and for some unfathomable reason it then does

ret = simple_lookup(dir, dentry, flags);

on it. Which is *completely* broken, because what "simple_lookup()"
does is just say "oh, you didn't have a dentry of this kind before, so
clearly a lookup must be a non-existent file". Remember: this is for
'tmpfs' kinds of filesystems where the dentry cache cotnains *ALL*
files.

For the tracefs kind of filesystem, it's TOTALLY BOGUS. What the
"simple_lookup()" will do is just a plain

d_add(dentry, NULL);

and nothing else. And guess what *that* does? It basically
instantiates a negative dentry, telling all other lookups that the
path does not exist.

So if you have two concurrent lookups, one will do that
simple_lookup(), and the other will then - depending on timing -
either see the negative dentry and return -ENOENT, or - if it comes in
a bit later - see the new inode that then later gets added by the
first lookup with d_instantiate().

See? That simple_lookup() is not just unnecessary, but it's also
actively completely WRONG. Because it instantiates a NULL pointer,
other processes that race with the lookup may now end up saying "that
file doesn't exist", even though it should.

Basically, you can't use *any* of the "simple" filesystem helpers.
Because they are all designed for that "the dentry tree is all there
is" case.

Linus

2024-01-29 20:26:01

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 11:51:52 -0800
Linus Torvalds <[email protected]> wrote:

> On Mon, 29 Jan 2024 at 11:24, Linus Torvalds
> <[email protected]> wrote:
> >
> > So the patch was completely broken. Here's the one that should
> > actually compile (although still not actually *tested*).
>
> Note that this fixes the d_instantiate() ordering wrt initializing the inode.
>
> But as I look up the call chain, I see many more fundamental mistakes.
>
> Steven - the reason you think that the VFS doesn't have documentation
> is that we *do* have tons of documentation, but it's of the kind "Here
> is what you should do".
>
> It is *not* of the kind that says "You messed up and did something
> else, and how do you recover from it?".

When I first worked on this, I did read all the VFS documentation, and it
was difficult to understand what I needed and what I didn't. It is focused
on a real file system and not a pseudo ones. Another mistake I made was
thinking that debugfs was doing things the "right" way as well. And having
a good idea of how debugfs worked, and thinking it was correct, just made
reading VFS documentation even more difficult as I couldn't relate what I
knew (about debugfs) with what was being explained. I thought it was
perfectly fine to use dentry as a handle for the file system. I did until
you told me it wasn't. That made a profound change in my understanding of
how things are supposed to work.

>
> So the fundamental bug I now find is that eventfs_root_lookup() gets a
> target dentry, and for some unfathomable reason it then does
>
> ret = simple_lookup(dir, dentry, flags);
>
> on it. Which is *completely* broken, because what "simple_lookup()"
> does is just say "oh, you didn't have a dentry of this kind before, so
> clearly a lookup must be a non-existent file". Remember: this is for
> 'tmpfs' kinds of filesystems where the dentry cache cotnains *ALL*
> files.

Sorry, I don't really understand what you mean by "ALL files"? You mean
that all files in the pseudo file system has a dentry to it (like debugfs,
and the rest of tracefs)?

>
> For the tracefs kind of filesystem, it's TOTALLY BOGUS. What the
> "simple_lookup()" will do is just a plain
>
> d_add(dentry, NULL);
>
> and nothing else. And guess what *that* does? It basically
> instantiates a negative dentry, telling all other lookups that the
> path does not exist.
>
> So if you have two concurrent lookups, one will do that
> simple_lookup(), and the other will then - depending on timing -
> either see the negative dentry and return -ENOENT, or - if it comes in
> a bit later - see the new inode that then later gets added by the
> first lookup with d_instantiate().
>
> See? That simple_lookup() is not just unnecessary, but it's also
> actively completely WRONG. Because it instantiates a NULL pointer,
> other processes that race with the lookup may now end up saying "that
> file doesn't exist", even though it should.
>
> Basically, you can't use *any* of the "simple" filesystem helpers.
> Because they are all designed for that "the dentry tree is all there
> is" case.

Yeah, the above code did come about with me not fully understanding the
above. It's not that it wasn't documented, but I admit, when I read the VFS
documentation, I had a lot of trouble trying to make sense of things like
negative dentries and how they relate.

I now have a much better understanding of most of this, thanks to our
discussion here, and also knowing that using dentry as the main handle to a
file is *not* how to do it. When thinking it was, it made things much more
difficult to comprehend.

-- Steve

2024-01-29 20:53:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 at 12:25, Steven Rostedt <[email protected]> wrote:
>
> > So the fundamental bug I now find is that eventfs_root_lookup() gets a
> > target dentry, and for some unfathomable reason it then does
> >
> > ret = simple_lookup(dir, dentry, flags);
> >
> > on it. Which is *completely* broken, because what "simple_lookup()"
> > does is just say "oh, you didn't have a dentry of this kind before, so
> > clearly a lookup must be a non-existent file". Remember: this is for
> > 'tmpfs' kinds of filesystems where the dentry cache cotnains *ALL*
> > files.
>
> Sorry, I don't really understand what you mean by "ALL files"? You mean
> that all files in the pseudo file system has a dentry to it (like debugfs,
> and the rest of tracefs)?

Yes.

So the whole - and *ONLY* - point of 'simple_lookup()' is for
filesystems like tmpfs, or like debugfs or other filesystems like
that, which never actually *need* to look anything up, because
everything is already cached in the dentry tree.

That's what the "simple" part of the simple functions mean. They are
simple from a dcache standpoint, because the dcache is all there is.

End result: what simple_lookup() does is say "oh, you didn't have the
file, so it's by definition a negative dentry", and thus all it does
is to do "d_add(dentry, NULL)".

Anyway, removing this was painful. I initially thought "I'll just
remove the calls". But it all ended up cascading into "that's also
wrong".

So now I have a patch that tries to fix this all up, and it looks like thisL:

1 file changed, 50 insertions(+), 219 deletions(-)

because it basically removed all the old code, and replaced it with
much simpler code.

I'm including the patch here as an attachment, but I want to note very
clearly that this *builds* for me, and it looks a *lot* more obvious
and correct than the old code did, but I haven't tested it. AT ALL.

Also note that it depends on my previous patches, so I guess I'll
include them here again just to make it unambiguous.

Finally - this does *not* fix up the refcounting. I still think the
SRCU stuff is completely broken. But that's another headache. But at
least now the *lookup* parts look like they DTRT wrt eventfs_mutex.

The SRCU logic from the directory iteration parts still needs crapectomy.

AGAIN: these patches (ie particularly that last one - 0004) were all
done entirely "blindly" - I've looked at the code, and fixed the bugs
and problems I've seen by pure code inspection.

That's great, but it really means that it's all untested. It *looks*
better than the old code, but there may be some silly gotcha that I
have missed.

Linus


Attachments:
0002-tracefs-avoid-using-the-ei-dentry-pointer-unnecessar.patch (3.48 kB)
0003-eventfsfs-initialize-the-tracefs-inode-properly.patch (2.49 kB)
0001-tracefs-remove-stale-update_gid-code.patch (2.55 kB)
0004-tracefs-dentry-lookup-crapectomy.patch (11.49 kB)
Download all attachments

2024-01-29 21:45:20

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 12:51:59 -0800
Linus Torvalds <[email protected]> wrote:

> End result: what simple_lookup() does is say "oh, you didn't have the
> file, so it's by definition a negative dentry", and thus all it does
> is to do "d_add(dentry, NULL)".
>
> Anyway, removing this was painful. I initially thought "I'll just
> remove the calls". But it all ended up cascading into "that's also
> wrong".
>
> So now I have a patch that tries to fix this all up, and it looks like thisL:
>
> 1 file changed, 50 insertions(+), 219 deletions(-)

Thanks, much appreciated.

>
> because it basically removed all the old code, and replaced it with
> much simpler code.
>
> I'm including the patch here as an attachment, but I want to note very
> clearly that this *builds* for me, and it looks a *lot* more obvious
> and correct than the old code did, but I haven't tested it. AT ALL.

I'm going to stare at them as I test them. Because I want to understand
them. I may come back with questions.

>
> Also note that it depends on my previous patches, so I guess I'll
> include them here again just to make it unambiguous.
>
> Finally - this does *not* fix up the refcounting. I still think the
> SRCU stuff is completely broken. But that's another headache. But at
> least now the *lookup* parts look like they DTRT wrt eventfs_mutex.
>
> The SRCU logic from the directory iteration parts still needs crapectomy.

I think the not dropping the mutex lock lets me get rid of the SRCU. I
added the SRCU when I was hitting the deadlocks with the iput code which
I'm not hitting anymore. So getting rid of the SRCU shouldn't be hard.

>
> AGAIN: these patches (ie particularly that last one - 0004) were all
> done entirely "blindly" - I've looked at the code, and fixed the bugs
> and problems I've seen by pure code inspection.
>
> That's great, but it really means that it's all untested. It *looks*
> better than the old code, but there may be some silly gotcha that I
> have missed.

I'll let you know.

Oh, does b4 handle attachments? Because this breaks the patchwork flow.
I haven't used b4 yet.

Thanks,

-- Steve

2024-01-29 21:55:55

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 12:51:59 -0800
Linus Torvalds <[email protected]> wrote:

> [0001-tracefs-remove-stale-update_gid-code.patch text/x-patch (2612 bytes)]

Oh, I already applied this and even sent you a pull request with it.

https://lore.kernel.org/all/[email protected]/

-- Steve

2024-01-29 22:20:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 at 13:45, Steven Rostedt <[email protected]> wrote:
> > 1 file changed, 50 insertions(+), 219 deletions(-)
>
> Thanks, much appreciated.

Well, I decided I might as well give it a test-run, and there's an
immediate deadlock on eventfs_mutex, because I missed removing it from
eventfs_find_events() when the callers now already hold it.

So at a minimum, it will require this patch on top:

--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -230,7 +230,6 @@ static struct eventfs_inode
*eventfs_find_events(
{
struct eventfs_inode *ei;

- mutex_lock(&eventfs_mutex);
do {
// The parent is stable because we do not do renames
dentry = dentry->d_parent;
@@ -247,7 +246,6 @@
}
// Walk upwards until you find the events inode
} while (!ei->is_events);
- mutex_unlock(&eventfs_mutex);

update_top_events_attr(ei, dentry->d_sb);

to not deadlock immediately on the first lookup.

And honestly, there might be other such obvious "I missed that when
reading the code".

Let me reboot into a fixed system and do some more basic smoke-testing.

Linus

2024-01-29 22:22:18

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 12:51:59 -0800
Linus Torvalds <[email protected]> wrote:

> [0004-tracefs-dentry-lookup-crapectomy.patch text/x-patch (11761 bytes)]

I had to add:

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index cd6de3244442..89897d934302 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -230,7 +230,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
{
struct eventfs_inode *ei;

- mutex_lock(&eventfs_mutex);
do {
// The parent is stable because we do not do renames
dentry = dentry->d_parent;
@@ -247,7 +246,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
}
// Walk upwards until you find the events inode
} while (!ei->is_events);
- mutex_unlock(&eventfs_mutex);

update_top_events_attr(ei, dentry->d_sb);


As eventfs_find_events() is only called by update_inode_attr() which is
only called by: lookup_file() and lookup_dir_entry() which are called by
eventfs_root_lookup() where eventfs_mutex is already held.

But crashes with just a:

# ls /sys/kernel/tracing/events

[ 66.423983] ------------[ cut here ]------------
[ 66.426447] kernel BUG at fs/dcache.c:1876!
[ 66.428363] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
[ 66.430320] CPU: 4 PID: 863 Comm: ls Not tainted 6.8.0-rc1-test-00009-gcaff43732484-dirty #463
[ 66.433192] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 66.436122] RIP: 0010:d_instantiate+0x69/0x70
[ 66.437537] Code: 00 4c 89 e7 e8 18 f0 49 01 48 89 df 48 89 ee e8 0d fe ff ff 4c 89 e7 5b 5d 41 5c e9 e1 f3 49 01 5b 5d 41 5c c3 cc cc cc cc 90 <0f> 0b 90 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[ 66.442759] RSP: 0018:ffffc900015f7830 EFLAGS: 00010282
[ 66.444315] RAX: 0000000000000000 RBX: ffff8881006aec40 RCX: ffffffffb77fcbff
[ 66.446237] RDX: dffffc0000000000 RSI: ffff888127c46ec0 RDI: ffff8881006aed70
[ 66.448170] RBP: ffff888127c46ec0 R08: 0000000000000001 R09: fffffbfff7766c4f
[ 66.450094] R10: ffffffffbbb3627f R11: 0000000000000000 R12: 00000000000081a0
[ 66.452007] R13: ffff88810f850000 R14: 0000000000000000 R15: 0000000000000000
[ 66.455224] FS: 00007fe1e56d9800(0000) GS:ffff88823c800000(0000) knlGS:0000000000000000
[ 66.457484] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 66.459139] CR2: 0000556b5293b000 CR3: 000000010f056001 CR4: 0000000000170ef0
[ 66.461192] Call Trace:
[ 66.462094] <TASK>
[ 66.462915] ? die+0x36/0x90
[ 66.463928] ? do_trap+0x133/0x230
[ 66.465089] ? d_instantiate+0x69/0x70
[ 66.466325] ? d_instantiate+0x69/0x70
[ 66.467532] ? do_error_trap+0x90/0x130
[ 66.468787] ? d_instantiate+0x69/0x70
[ 66.470020] ? handle_invalid_op+0x2c/0x40
[ 66.471340] ? d_instantiate+0x69/0x70
[ 66.472559] ? exc_invalid_op+0x2e/0x50
[ 66.473807] ? asm_exc_invalid_op+0x1a/0x20
[ 66.475154] ? d_instantiate+0x1f/0x70
[ 66.476396] ? d_instantiate+0x69/0x70
[ 66.477629] eventfs_root_lookup+0x366/0x660
[ 66.479021] ? __pfx_eventfs_root_lookup+0x10/0x10
[ 66.480567] ? print_circular_bug_entry+0x170/0x170
[ 66.482107] ? lockdep_init_map_type+0xd3/0x3a0
[ 66.485243] __lookup_slow+0x194/0x2a0
[ 66.486410] ? __pfx___lookup_slow+0x10/0x10
[ 66.487694] ? rwsem_read_trylock+0x118/0x1b0
[ 66.489057] ? i915_ttm_backup+0x2a0/0x5e0
[ 66.490358] ? down_read+0xbb/0x240
[ 66.491506] ? down_read+0xbb/0x240
[ 66.492674] ? trace_preempt_on+0xc8/0xe0
[ 66.493962] ? i915_ttm_backup+0x2a0/0x5e0
[ 66.495291] walk_component+0x166/0x220
[ 66.496564] path_lookupat+0xa9/0x2e0
[ 66.497766] ? __pfx_mark_lock+0x10/0x10
[ 66.499026] filename_lookup+0x19c/0x350
[ 66.500335] ? __pfx_filename_lookup+0x10/0x10
[ 66.501688] ? __pfx___lock_acquire+0x10/0x10
[ 66.502996] ? __pfx___lock_acquire+0x10/0x10
[ 66.504305] ? _raw_read_unlock_irqrestore+0x40/0x80
[ 66.505765] ? stack_depot_save_flags+0x1f0/0x790
[ 66.507137] vfs_statx+0xe1/0x270
[ 66.508196] ? __pfx_vfs_statx+0x10/0x10
[ 66.509385] ? __virt_addr_valid+0x155/0x330
[ 66.510667] ? __pfx_lock_release+0x10/0x10
[ 66.511924] do_statx+0xac/0x110
[ 66.515394] ? __pfx_do_statx+0x10/0x10
[ 66.516631] ? getname_flags.part.0+0xd6/0x260
[ 66.517956] __x64_sys_statx+0xa0/0xc0
[ 66.519079] do_syscall_64+0xca/0x1e0
[ 66.520206] entry_SYSCALL_64_after_hwframe+0x6f/0x77
[ 66.521674] RIP: 0033:0x7fe1e586e2ea
[ 66.522780] Code: 48 8b 05 31 bb 0d 00 ba ff ff ff ff 64 c7 00 16 00 00 00 e9 a5 fd ff ff e8 03 10 02 00 0f 1f 00 41 89 ca b8 4c 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2e 89 c1 85 c0 74 0f 48 8b 05 f9 ba 0d 00 64
[ 66.527728] RSP: 002b:00007fff06f5ef98 EFLAGS: 00000246 ORIG_RAX: 000000000000014c
[ 66.529922] RAX: ffffffffffffffda RBX: 0000556b52935d18 RCX: 00007fe1e586e2ea
[ 66.531863] RDX: 0000000000000900 RSI: 00007fff06f5f0d0 RDI: 00000000ffffff9c
[ 66.533832] RBP: 0000000000000002 R08: 00007fff06f5efa0 R09: 0000000000000000
[ 66.535768] R10: 0000000000000002 R11: 0000000000000246 R12: 00007fff06f5f0d0
[ 66.537733] R13: 0000000000000005 R14: 0000556b52935d00 R15: 0000000000000003
[ 66.539682] </TASK>
[ 66.540454] Modules linked in: vmw_vsock_virtio_transport vmw_vsock_virtio_transport_common vsock ip_tables
[ 66.543272] ---[ end trace 0000000000000000 ]---


void d_instantiate(struct dentry *entry, struct inode * inode)
{
BUG_ON(!hlist_unhashed(&entry->d_u.d_alias)); <<<--- HERE
if (inode) {
security_d_instantiate(entry, inode);
spin_lock(&inode->i_lock);
__d_instantiate(entry, inode);
spin_unlock(&inode->i_lock);
}
}

-- Steve

2024-01-29 22:37:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 at 14:21, Steven Rostedt <[email protected]> wrote:
>
> But crashes with just a:
>
> # ls /sys/kernel/tracing/events
>
> [ 66.423983] ------------[ cut here ]------------
> [ 66.426447] kernel BUG at fs/dcache.c:1876!

Duh.

That's a bit too much copy-and-paste by me.

So what is going on is that a ->lookup() function should *not* call
d_instantiate() at all, and the only reason it actually used to work
here was due to the incorrect "simple_lookup()", which basically did
all the preliminaries.

A ->lookup() should do 'd_add()' on the dentry.

So just replace all the d_instantiate() calls there with "d_add()"
instead. I think that will fix it.

Basically the "simple_lookup()" had done the "d_add(dentry, NULL)",
and at that point the "d_instantiate()" just exposed the inode and
turned the negative dentry into a positive one.

So "d_add()" is "I'm adding the inode to a new dentry under lookup".
And "d_instantiate()" is "I'm adding this inode to an existing dentry
that used to be negative"

And so the old "d_add(NULL)+d_instantiate(inode)" _kind_ of worked,
except it made that negative dentry visible for a short while.

And when I did the cleanup, I didn't think of this thing, so I left
the d_instantiate() calls as such, even though they now really need to
be d_add().

Hope that explains it.

And I hope there aren't any other stupid things I missed like that.

Linus

2024-01-29 22:43:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 at 14:35, Linus Torvalds
<[email protected]> wrote:
>
> So just replace all the d_instantiate() calls there with "d_add()"
> instead. I think that will fix it.

I can confirm that with the mutex deadlock removed and the d_add()
fix, at least things *look* superficially ok.

I didn't actually do anything with it. So it might be leaking dentry
refs like mad or something like that, but at least the obvious cases
look fine.

just for completeness, here's the fixup diff I used.

Linus


Attachments:
fixup.diff (1.75 kB)

2024-01-29 22:47:31

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 14:35:37 -0800
Linus Torvalds <[email protected]> wrote:

> And I hope there aren't any other stupid things I missed like that.

Well the preliminary tests pass with this added to your patch:

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index cd6de3244442..ad11063bdd53 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -230,7 +230,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
{
struct eventfs_inode *ei;

- mutex_lock(&eventfs_mutex);
do {
// The parent is stable because we do not do renames
dentry = dentry->d_parent;
@@ -247,7 +246,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
}
// Walk upwards until you find the events inode
} while (!ei->is_events);
- mutex_unlock(&eventfs_mutex);

update_top_events_attr(ei, dentry->d_sb);

@@ -324,7 +322,7 @@ static struct dentry *lookup_file(struct dentry *dentry,
ti->flags = TRACEFS_EVENT_INODE;
ti->private = NULL; // Directories have 'ei', files not

- d_instantiate(dentry, inode);
+ d_add(dentry, inode);
fsnotify_create(dentry->d_parent->d_inode, dentry);
return eventfs_end_creating(dentry);
};
@@ -365,7 +363,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry,
ei->dentry = dentry; // Remove me!

inc_nlink(inode);
- d_instantiate(dentry, inode);
+ d_add(dentry, inode);
inc_nlink(dentry->d_parent->d_inode);
fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
return eventfs_end_creating(dentry);

-- Steve

2024-01-29 22:49:46

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 14:42:47 -0800
Linus Torvalds <[email protected]> wrote:

> @@ -324,7 +322,7 @@ static struct dentry *lookup_file(struct dentry *dentry,
> ti->flags = TRACEFS_EVENT_INODE;
> ti->private = NULL; // Directories have 'ei', files not
>
> - d_instantiate(dentry, inode);
> + d_add(dentry, inode);
> fsnotify_create(dentry->d_parent->d_inode, dentry);
> return eventfs_end_creating(dentry);
> };
> @@ -365,7 +363,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry,
> ei->dentry = dentry; // Remove me!
>
> inc_nlink(inode);
> - d_instantiate(dentry, inode);
> + d_add(dentry, inode);
> inc_nlink(dentry->d_parent->d_inode);
> fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
> return eventfs_end_creating(dentry);
> @@ -786,7 +784,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
>
> /* directory inodes start off with i_nlink == 2 (for "." entry) */
> inc_nlink(inode);
> - d_instantiate(dentry, inode);
> + d_add(dentry, inode);

Now I didn't change this last d_instantiate, because this is not called
through the lookup code. This is the root events directory and acts more
like debugfs. It's not "dynamically" added.

-- Steve


> inc_nlink(dentry->d_parent->d_inode);
> fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
> tracefs_end_creating(dentry);


2024-01-29 23:15:31

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 17:47:32 -0500
Steven Rostedt <[email protected]> wrote:

> > And I hope there aren't any other stupid things I missed like that.
>
> Well the preliminary tests pass with this added to your patch:

Spoke too soon. The later tests started failing.

It fails on creating a kprobe, deleting it, and then recreating it. Even
though the directory is there, it can't be accessed.

# cd /sys/kernel/tracing

// Create a kprobe

# echo 'p:sched schedule' >> kprobe_events
# ls events/kprobes/
enable filter sched

// Now delete the kprobe

# echo '-:sched schedule' >> kprobe_events

// Make sure it's gone

# ls events/kprobes/
ls: cannot access 'events/kprobes/': No such file or directory

// Recreate it

# echo 'p:sched schedule' >> kprobe_events
# ls events/kprobes/
ls: cannot access 'events/kprobes/': No such file or directory
# ls events | grep kprobes
kprobes

No longer able to access it.

# ls -l events | grep kprobes
ls: cannot access 'events/kprobes': No such file or directory
d????????? ? ? ? ? ? kprobes


-- Steve

2024-01-30 00:02:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 at 14:49, Steven Rostedt <[email protected]> wrote:
>
> Now I didn't change this last d_instantiate, because this is not called
> through the lookup code. This is the root events directory and acts more
> like debugfs. It's not "dynamically" added.

Ahh, yes, I see, the dentry was created (as a negative one) with
tracefs_start_creating() -> lookup_one_len().

So yes, there d_instantiate() is correct, as it's exactly that "turn
negative dentry into a positive one" case.

I'll go see what's up with the "create it again" case - I don't
immediately see what's wrong.

Linus

2024-01-30 00:35:48

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 16:01:25 -0800
Linus Torvalds <[email protected]> wrote:

> I'll go see what's up with the "create it again" case - I don't
> immediately see what's wrong.

Interesting. I added a printk in the lookup, and just did this:

# cd /sys/kernel/tracing
# ls events/kprobes

And it showed that it tried to see if "kprobes" existed in the lookup.
Which it did not because I haven't created any kprobes yet.

Then I did:

# echo 'p:sched schedule' >> /sys/kernel/tracing/kprobe_events
# ls -l events/kprobes/
ls: cannot access 'events/kprobes/': No such file or directory

Where it should now exist but doesn't. But the lookup code never triggered.

If the lookup fails, does it cache the result?

-- Steve

2024-01-30 01:58:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 at 16:35, Steven Rostedt <[email protected]> wrote:
>
> # echo 'p:sched schedule' >> /sys/kernel/tracing/kprobe_events
> # ls -l events/kprobes/
> ls: cannot access 'events/kprobes/': No such file or directory
>
> Where it should now exist but doesn't. But the lookup code never triggered.
>
> If the lookup fails, does it cache the result?

I think you end up still having negative dentries around.

The old code then tried to compensate for that by trying to remember
the old dentry with 'ei->dentry' and 'ei->d_children[]', and would at
lookup time try to use the *old* dentry instead of the new one.

And because dentries are just caches and can go away, it then had that
odd dance with '.d_iput', so that when a dentry was removed, it would
be removed from the 'ei->dentry' and 'ei->d_children[]' array too.

Except that d_iput() of an old dentry isn't actually serialized with
->d_lookup() in any way, so you end up with the whole race that I
already talked about earlier, where you could still have an
'ei->dentry' that pointed to something that had already been unhashed,
but d_iput() hadn't been called *yet*, so d_lookup() is called with a
new dentry, but the tracefs code then desperately tries to use the old
dentry pointer that just isn't _valid_ any more, but it doesn't know
that because d_iput() hasn't been called yet...

And as I *also* pointed out when I described that originally, you'll
practically never hit this race, because you just need to be *very*
unlucky with the whole "dentry is freed due to memory pressure".

But basically, this is why I absolutely *HATE* that "ei->dentry"
backpointer. It's truly fundamentally broken.

You can't reference-count it, since the whole point of your current
tracefs scheme is to *not* keep dentries and inodes around forever,
and doing a "dget()" on that 'ei->dentry' would thus fundamentally
screw that up.

But you also cannot keep it in sync with dentries being released due
to memory pressure, because of the above thing.

See why I've tried to tell you that the back-pointer is basically a
100% sign of a bug.

The *only* time you can have a valid dentry pointer is when you have
also taken a ref to it with dget(), and you can't do that.

So then you have all that completely broken code that _tries_ to
maintain consistency with ->d_children[] etc, and it works 99.9% in
practice, because the race is just so hard to hit because dentries
only normally get evicted either synchronously (which you do under the
eventfs_mutex) or under memory pressure (which is basically never
going to be something you can test).

And yes, my lookup patch removed all the band-aids for "if I have an
ei->dentry, I'll reuse it". So I think it ends up exposing all the
previous bugs that the old "let's reuse the old dentry" code tried to
hide.

But, as mentioned, that ei->dentry pointer really REALLY is broken.

NBow, having looked at this a lot, I think I have a way forward.
Because there is actually *one* case where you actually *do* do the
whole "dget()" to get a stable dentry pointer. And that's exactly the
"events" directory creation (ie eventfs_create_events_dir()).

So what I propose is that

- ei->dentry and ei->d_children[] need to die. Really. They are
buggy. There is no way to save them. There never was.

- but we *can* introduce a new 'ei->events_dir' pointer that is
*only* set by eventfs_create_events_dir(), and which is stable exactly
because that function also does a dget() on it, so now the dentry will
actually continue to exist reliably

I think that works. The only thing that actually *needs* the existing
'ei->dentry' is literally the eventfs_remove_events_dir() that gets
rid of the stable events directory. It's undoing
eventfs_create_events_dir(), and it will do the final dput() too.

I will try to make a patch for this. I do think it means that every
time we do that

dentry->d_fsdata = ei;

we need to also do proper reference counting of said 'ei'. Because we
can't release 'ei' early when we have dentries that point to it.

Let me see how painful this will be.

Linus

2024-01-30 03:57:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 at 17:50, Linus Torvalds
<[email protected]> wrote:
>
> So what I propose is that
>
> - ei->dentry and ei->d_children[] need to die. Really. They are
> buggy. There is no way to save them. There never was.
>
> - but we *can* introduce a new 'ei->events_dir' pointer that is
> *only* set by eventfs_create_events_dir(), and which is stable exactly
> because that function also does a dget() on it, so now the dentry will
> actually continue to exist reliably
>
> I think that works.

Well, it doesn't. I don't see where the bug is, but since Al is now
aware of the thread, maybe when he wakes up he will tell me where I've
gone wrong.

In the meantime I did do the pending tracefs pull, so the series has
changed a bit, and this is the rebased series on top of my current
public git tree.

It is still broken wrt 'events' directories. You don't even need the
"create, delete, create" sequence that Steven pointed out, just a
plain sequence of

# cd /sys/kernel/tracing
# ls events/kprobes/
# echo 'p:sched schedule' >> kprobe_events

messes up - ie it's enough to just have 'lookup' create a negative
dentry by trying to look up 'events/kprobes/' before actually trying
to create that kprobe_events.

But I've been staring at this code for too long, so I'm posting this
just as a "it's broken, but _something_ like this", because I'm taking
a break from looking at this.

I'll get back to it tomorrow, but I hope that Al will show me the
error of my ways.

Linus


Attachments:
0005-eventfs-get-rid-of-dentry-pointers-without-refcounts.patch (15.29 kB)
0003-tracefs-dentry-lookup-crapectomy.patch (12.33 kB)
0001-tracefs-avoid-using-the-ei-dentry-pointer-unnecessar.patch (3.48 kB)
0002-eventfsfs-initialize-the-tracefs-inode-properly.patch (2.49 kB)
0004-eventfs-remove-unused-d_parent-pointer-field.patch (1.81 kB)
Download all attachments

2024-01-30 08:12:03

by Al Viro

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Sun, Jan 28, 2024 at 08:36:12PM -0800, Linus Torvalds wrote:

[snip]

apologies for being MIA on that, will post tomorrow morning once I get some
sleep...

2024-01-30 08:45:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 at 19:56, Linus Torvalds
<[email protected]> wrote:
>
> But I've been staring at this code for too long, so I'm posting this
> just as a "it's broken, but _something_ like this", because I'm taking
> a break from looking at this.

Bah. I literally woke up and realized what the problem is.

We're caching negative dentries, but eventfs_create_dir() has no way
to invalidate the old negative dentry when it adds a new entry.

The old "try to reuse dentry" ended up hiding that issue, because it
would look up the negative dentry from the 'ei' and turn it into a
positive one.

And I was stupidly staring at the wrong code - all these functions
with almost the same names.

eventfs_create_events_dir() is fine, because it gets the parent as a
dentry, so looking up the new dentry is trivial.

But eventfs_create_dir() doesn't get a dentry at all. It gets the parent as a

struct eventfs_inode *parent

which is no good.

So that explains why creating an event after deleting an old one - or
after just looking it up before it was created - ends up with the
nonsensical "ls" output - it gets listed by readdir() because the
entry is there in the eventfs data structures, but then doing a
"stat()" on it will just hit the negative dentry. So it won't actually
look up the dentry.

The simple solution is probably just to not cache negative dentries
for eventfs. So instead of doing the "d_add(dentry, NULL);" we should
just return "ERR_PTR(ENOENT)".

Which is actually what /proc/<pid> lookups do, for the same reason.

I'll go back to bed, but I think the fix is something trivial like this:

--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -517,9 +517,8 @@ static struct dentry *eventfs_root_lookup(
}

enoent:
- /* Nothing found? */
- d_add(dentry, NULL);
- result = NULL;
+ /* Don't cache negative lookups, just return an error */
+ result = ERR_PTR(ENOENT);

out:
mutex_unlock(&eventfs_mutex);

I just wanted to write it down when the likely solution struck me.

Linus

2024-01-30 09:27:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Tue, 30 Jan 2024 at 00:43, Linus Torvalds
<[email protected]> wrote:
>
> I'll go back to bed, but I think the fix is something trivial like this:

Almost.

> + result = ERR_PTR(ENOENT);

That needs a '-' in front of the ENOENT, otherwise you have a positive
error number and things go wrong very quickly.

And that does indeed fix the lookup problem, but you end up with the
same problem later when you do the eventfs_remove_dir(). Again the
eventfs data structure changes, but we don't have a reliable dentry
that we can invalidate.

The dentry cache is just very good at caching those old dentries, and
the interface for eventfs_create_dir() and eventfs_remove_dir() is
just not great.

If those did an actual path lookup (like eventfs_create_events_dir()
does), we'd have the dentry, and it's trivial to get from dentry to
eventfs_inode.

But going the other way is the broken thing because of how the
dentries are just temporary caches.

I suspect the solution is to make eventfs_create_dir() do the same as
the events directory case does, and actually pin the directory dentry
and save it off.

Oh well. At least I understand what the problem is. Now I'm going to
try to go back to sleep.

Linus

2024-01-30 12:45:58

by Rasmus Villemoes

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On 30/01/2024 10.12, Linus Torvalds wrote:
> On Tue, 30 Jan 2024 at 00:43, Linus Torvalds
> <[email protected]> wrote:
>>
>> I'll go back to bed, but I think the fix is something trivial like this:
>
> Almost.
>
>> + result = ERR_PTR(ENOENT);
>
> That needs a '-' in front of the ENOENT, otherwise you have a positive
> error number and things go wrong very quickly.

OT, but 'git grep' suggests that drivers/soc/apple/mailbox.c might need
exactly that fix... (there's also ext4, but their custom E thing is
already negative).

Rasmus


2024-01-30 14:39:42

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Tue, 30 Jan 2024 01:12:05 -0800
Linus Torvalds <[email protected]> wrote:

> On Tue, 30 Jan 2024 at 00:43, Linus Torvalds
> <[email protected]> wrote:
> >
> > I'll go back to bed, but I think the fix is something trivial like this:
>
> Almost.
>
> > + result = ERR_PTR(ENOENT);
>
> That needs a '-' in front of the ENOENT, otherwise you have a positive
> error number and things go wrong very quickly.
>
> And that does indeed fix the lookup problem, but you end up with the
> same problem later when you do the eventfs_remove_dir(). Again the
> eventfs data structure changes, but we don't have a reliable dentry
> that we can invalidate.
>
> The dentry cache is just very good at caching those old dentries, and
> the interface for eventfs_create_dir() and eventfs_remove_dir() is
> just not great.
>
> If those did an actual path lookup (like eventfs_create_events_dir()
> does), we'd have the dentry, and it's trivial to get from dentry to
> eventfs_inode.
>
> But going the other way is the broken thing because of how the
> dentries are just temporary caches.
>
> I suspect the solution is to make eventfs_create_dir() do the same as
> the events directory case does, and actually pin the directory dentry
> and save it off.

I rather not have the create do that because that happens for every event
directory. On my machine that's:

# find events -type d | wc -l
2102

And that's regardless if tracefs is mounted or not. And that's how many are
also created with every instance creation. And doesn't pinning the dentry
also require it to be positive? That is, have a full inode allocated with
it?

I may try something that will still let me get rid of the ei->dentry.

>
> Oh well. At least I understand what the problem is.


Yep!


> Now I'm going to try to go back to sleep.

Hope you sleep better than I did. We just bought a new mattress, which felt
great in the store, but now after 4 or 5 hours sleeping in it, I wake up
with a sore back and have to sleep on the couch. And we bought the most
expensive "best" mattress in the store :-p

Oh, and NY State law has it that you can't return in once it is delivered.

-- Steve

2024-01-30 16:49:54

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Tue, 30 Jan 2024 09:39:42 -0500
Steven Rostedt <[email protected]> wrote:

> I may try something that will still let me get rid of the ei->dentry.

This appears to work, but like always, I may have missed something. I need
to add debugging (printks) to make sure the that I don't leave any dangling
dentries around (dget without a dput).

Here's what I did:

- Removed the ei->dentry that you loved so much

- Removed the ei->d_children that you shared as much love for.

- Removed SRCU. All accesses are under the eventfs_mutex
I need to add comments that the callbacks need to be aware of this.
Currently, the callbacks do not take any other locks. I may comment
that they should never take a lock.

- Added the kref that you recommended

- Created a eventfs_root_inode that has the structure of:

struct eventfs_root_inode {
struct eventfs_inode ei;
struct dentry *root_dentry;
};

The "events" directory is the only directory that allocates this.
It is required that its ei->is_events is set, and no other ei has that
set. This will hold the only non-dynamic dentry.

- I added "parent" to the eventfs_inode that points to the parent
eventfs_inode.

- On removal, I got rid of the SRCU callback and the work queue.
Instead, I find the dentry of the current eventfs_inode that is being
deleted by walking the ei->parent until I find the events inode that has
a dentry. I then use that to do a lookup walking back down to the
eventfs_inode I want to delete. This gives me the dentry that I can call
d_invalidate() on.

This all works with light testing. I'm sure I did something wrong, but
hopefully this is more inline to what you are looking for.

This patch is on top of your last patch series.

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index ad11063bdd53..49d4630d5d70 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -24,17 +24,24 @@
#include <linux/delay.h>
#include "internal.h"

-/*
- * eventfs_mutex protects the eventfs_inode (ei) dentry. Any access
- * to the ei->dentry must be done under this mutex and after checking
- * if ei->is_freed is not set. When ei->is_freed is set, the dentry
- * is on its way to being freed after the last dput() is made on it.
- */
+/* eventfs_mutex protects ei->is_freed and the ei->ref counting. */
static DEFINE_MUTEX(eventfs_mutex);

/* Choose something "unique" ;-) */
#define EVENTFS_FILE_INODE_INO 0x12c4e37

+/* Used only by the "events" directory */
+struct eventfs_root_inode {
+ struct eventfs_inode ei;
+ struct dentry *root_dentry;
+};
+
+static struct eventfs_root_inode *get_root_inode(struct eventfs_inode *ei)
+{
+ WARN_ON_ONCE(!ei->is_events);
+ return container_of(ei, struct eventfs_inode, ei);
+}
+
/* Just try to make something consistent and unique */
static int eventfs_dir_ino(struct eventfs_inode *ei)
{
@@ -44,14 +51,6 @@ static int eventfs_dir_ino(struct eventfs_inode *ei)
return ei->ino;
}

-/*
- * The eventfs_inode (ei) itself is protected by SRCU. It is released from
- * its parent's list and will have is_freed set (under eventfs_mutex).
- * After the SRCU grace period is over and the last dput() is called
- * the ei is freed.
- */
-DEFINE_STATIC_SRCU(eventfs_srcu);
-
/* Mode is unsigned short, use the upper bits for flags */
enum {
EVENTFS_SAVE_MODE = BIT(16),
@@ -360,7 +359,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry,
ti->private = ei;

dentry->d_fsdata = ei;
- ei->dentry = dentry; // Remove me!
+ kref_get(&ei->kref);

inc_nlink(inode);
d_add(dentry, inode);
@@ -371,10 +370,30 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry,

static void free_ei(struct eventfs_inode *ei)
{
+ struct eventfs_root_inode *rei;
+
kfree_const(ei->name);
- kfree(ei->d_children);
kfree(ei->entry_attrs);
- kfree(ei);
+
+ if (ei->is_events) {
+ rei = get_root_inode(ei);
+ kfree(rei);
+ } else {
+ kfree(ei);
+ }
+}
+
+static void kref_ei_release(struct kref *kref)
+{
+ struct eventfs_inode *ei = container_of(kref, struct eventfs_inode, kref);
+ WARN_ON_ONCE(!ei->is_freed);
+ free_ei(ei);
+}
+
+
+static void eventfs_inode_put(struct eventfs_inode *ei)
+{
+ kref_put(&ei->kref, kref_ei_release);
}

/**
@@ -387,7 +406,6 @@ static void free_ei(struct eventfs_inode *ei)
void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
{
struct eventfs_inode *ei;
- int i;

mutex_lock(&eventfs_mutex);

@@ -395,20 +413,7 @@ void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
if (!ei)
goto out;

- /* This could belong to one of the files of the ei */
- if (ei->dentry != dentry) {
- for (i = 0; i < ei->nr_entries; i++) {
- if (ei->d_children[i] == dentry)
- break;
- }
- if (WARN_ON_ONCE(i == ei->nr_entries))
- goto out;
- ei->d_children[i] = NULL;
- } else if (ei->is_freed) {
- free_ei(ei);
- } else {
- ei->dentry = NULL;
- }
+ eventfs_inode_put(ei);

dentry->d_fsdata = NULL;
out:
@@ -435,16 +440,14 @@ lookup_file_dentry(struct dentry *dentry,
const struct file_operations *fops)
{
struct eventfs_attr *attr = NULL;
- struct dentry **e_dentry = &ei->d_children[idx];

if (ei->entry_attrs)
attr = &ei->entry_attrs[idx];

dentry->d_fsdata = ei; // NOTE: ei of _parent_
+ kref_get(&ei->kref);
lookup_file(dentry, mode, attr, data, fops);

- *e_dentry = dentry; // Remove me
-
return dentry;
}

@@ -465,6 +468,7 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
struct eventfs_inode *ei_child;
struct tracefs_inode *ti;
struct eventfs_inode *ei;
+ struct dentry *result = NULL;
const char *name = dentry->d_name.name;

ti = get_tracefs(dir);
@@ -505,11 +509,10 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,

enoent:
/* Nothing found? */
- d_add(dentry, NULL);
-
+ result = ERR_PTR(-ENOENT);
out:
mutex_unlock(&eventfs_mutex);
- return NULL;
+ return result;
}

/*
@@ -525,7 +528,6 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
struct eventfs_inode *ei;
const char *name;
umode_t mode;
- int idx;
int ret = -EINVAL;
int ino;
int i, r, c;
@@ -539,15 +541,9 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)

c = ctx->pos - 2;

- idx = srcu_read_lock(&eventfs_srcu);
-
mutex_lock(&eventfs_mutex);
ei = READ_ONCE(ti->private);
if (ei && ei->is_freed)
- ei = NULL;
- mutex_unlock(&eventfs_mutex);
-
- if (!ei)
goto out;

/*
@@ -563,14 +559,12 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
entry = &ei->entries[i];
name = entry->name;

- mutex_lock(&eventfs_mutex);
/* If ei->is_freed then just bail here, nothing more to do */
if (ei->is_freed) {
mutex_unlock(&eventfs_mutex);
goto out;
}
r = entry->callback(name, &mode, &cdata, &fops);
- mutex_unlock(&eventfs_mutex);
if (r <= 0)
continue;

@@ -583,8 +577,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
/* Subtract the skipped entries above */
c -= min((unsigned int)c, (unsigned int)ei->nr_entries);

- list_for_each_entry_srcu(ei_child, &ei->children, list,
- srcu_read_lock_held(&eventfs_srcu)) {
+ list_for_each_entry(ei_child, &ei->children, list) {

if (c > 0) {
c--;
@@ -605,7 +598,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
}
ret = 1;
out:
- srcu_read_unlock(&eventfs_srcu, idx);
+ mutex_unlock(&eventfs_mutex);

return ret;

@@ -669,25 +662,17 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode
return ERR_PTR(-ENOMEM);
}

- if (size) {
- ei->d_children = kcalloc(size, sizeof(*ei->d_children), GFP_KERNEL);
- if (!ei->d_children) {
- kfree_const(ei->name);
- kfree(ei);
- return ERR_PTR(-ENOMEM);
- }
- }
-
ei->entries = entries;
ei->nr_entries = size;
ei->data = data;
INIT_LIST_HEAD(&ei->children);
INIT_LIST_HEAD(&ei->list);
+ kref_init(&ei->kref);

mutex_lock(&eventfs_mutex);
if (!parent->is_freed) {
list_add_tail(&ei->list, &parent->children);
- ei->d_parent = parent->dentry;
+ ei->parent = parent;
}
mutex_unlock(&eventfs_mutex);

@@ -716,6 +701,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
int size, void *data)
{
struct dentry *dentry = tracefs_start_creating(name, parent);
+ struct eventfs_root_inode *rei;
struct eventfs_inode *ei;
struct tracefs_inode *ti;
struct inode *inode;
@@ -728,24 +714,21 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
if (IS_ERR(dentry))
return ERR_CAST(dentry);

- ei = kzalloc(sizeof(*ei), GFP_KERNEL);
- if (!ei)
+ rei = kzalloc(sizeof(*rei), GFP_KERNEL);
+ if (!rei)
goto fail_ei;

+ ei = &rei->ei;
+ ei->is_events = 1;
+
inode = tracefs_get_inode(dentry->d_sb);
if (unlikely(!inode))
goto fail;

- if (size) {
- ei->d_children = kcalloc(size, sizeof(*ei->d_children), GFP_KERNEL);
- if (!ei->d_children)
- goto fail;
- }
+ rei->root_dentry = dentry;

- ei->dentry = dentry;
ei->entries = entries;
ei->nr_entries = size;
- ei->is_events = 1;
ei->data = data;
ei->name = kstrdup_const(name, GFP_KERNEL);
if (!ei->name)
@@ -781,6 +764,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
inode->i_fop = &eventfs_file_operations;

dentry->d_fsdata = ei;
+ kref_init(&ei->kref);

/* directory inodes start off with i_nlink == 2 (for "." entry) */
inc_nlink(inode);
@@ -792,83 +776,24 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
return ei;

fail:
- kfree(ei->d_children);
- kfree(ei);
+ kfree(rei);
fail_ei:
tracefs_failed_creating(dentry);
return ERR_PTR(-ENOMEM);
}

-static LLIST_HEAD(free_list);
-
-static void eventfs_workfn(struct work_struct *work)
-{
- struct eventfs_inode *ei, *tmp;
- struct llist_node *llnode;
-
- llnode = llist_del_all(&free_list);
- llist_for_each_entry_safe(ei, tmp, llnode, llist) {
- /* This dput() matches the dget() from unhook_dentry() */
- for (int i = 0; i < ei->nr_entries; i++) {
- if (ei->d_children[i])
- dput(ei->d_children[i]);
- }
- /* This should only get here if it had a dentry */
- if (!WARN_ON_ONCE(!ei->dentry))
- dput(ei->dentry);
- }
-}
-
-static DECLARE_WORK(eventfs_work, eventfs_workfn);
-
-static void free_rcu_ei(struct rcu_head *head)
-{
- struct eventfs_inode *ei = container_of(head, struct eventfs_inode, rcu);
-
- if (ei->dentry) {
- /* Do not free the ei until all references of dentry are gone */
- if (llist_add(&ei->llist, &free_list))
- queue_work(system_unbound_wq, &eventfs_work);
- return;
- }
-
- /* If the ei doesn't have a dentry, neither should its children */
- for (int i = 0; i < ei->nr_entries; i++) {
- WARN_ON_ONCE(ei->d_children[i]);
- }
-
- free_ei(ei);
-}
-
-static void unhook_dentry(struct dentry *dentry)
-{
- if (!dentry)
- return;
- /*
- * Need to add a reference to the dentry that is expected by
- * simple_recursive_removal(), which will include a dput().
- */
- dget(dentry);
-
- /*
- * Also add a reference for the dput() in eventfs_workfn().
- * That is required as that dput() will free the ei after
- * the SRCU grace period is over.
- */
- dget(dentry);
-}
-
/**
* eventfs_remove_rec - remove eventfs dir or file from list
* @ei: eventfs_inode to be removed.
+ * @head: List head to add the ei's to remove
* @level: prevent recursion from going more than 3 levels deep.
*
* This function recursively removes eventfs_inodes which
* contains info of files and/or directories.
*/
-static void eventfs_remove_rec(struct eventfs_inode *ei, int level)
+static void eventfs_remove_rec(struct eventfs_inode *ei, struct list_head *head, int level)
{
- struct eventfs_inode *ei_child;
+ struct eventfs_inode *ei_child, *tmp;

if (!ei)
return;
@@ -883,28 +808,67 @@ static void eventfs_remove_rec(struct eventfs_inode *ei, int level)
return;

/* search for nested folders or files */
- list_for_each_entry_srcu(ei_child, &ei->children, list,
- lockdep_is_held(&eventfs_mutex)) {
- /* Children only have dentry if parent does */
- WARN_ON_ONCE(ei_child->dentry && !ei->dentry);
- eventfs_remove_rec(ei_child, level + 1);
+ list_for_each_entry_safe(ei_child, tmp, &ei->children, list) {
+ eventfs_remove_rec(ei_child, head, level + 1);
}

-
ei->is_freed = 1;

- for (int i = 0; i < ei->nr_entries; i++) {
- if (ei->d_children[i]) {
- /* Children only have dentry if parent does */
- WARN_ON_ONCE(!ei->dentry);
- unhook_dentry(ei->d_children[i]);
- }
+ list_del(&ei->list);
+ list_add(&ei->list, head);
+}
+
+static struct dentry *find_ei_dentry(struct eventfs_inode *ei, int level)
+{
+ struct dentry *d_parent;
+ struct dentry *dentry;
+ const char *name = ei->name;
+
+ /*
+ * Check recursion depth. It should never be greater than 2;
+ * 0 - events/
+ * 1 - events/group/
+ * 2 - events/group/event/
+ */
+ if (WARN_ON_ONCE(level > 2))
+ return NULL;
+
+ /* Only the events directory has a dentry we can use */
+ if (ei->parent->is_events) {
+ struct eventfs_root_inode *rei;
+
+ rei = get_root_inode(ei);
+ d_parent = rei->root_dentry;
+ dget(d_parent);
+ } else {
+ d_parent = find_ei_dentry(ei->parent, level + 1);
+ if (!d_parent)
+ return NULL;
}

- unhook_dentry(ei->dentry);
+ inode_lock(d_inode(d_parent));
+ dentry = lookup_one_len(name, d_parent, strlen(name));
+ inode_unlock(d_inode(d_parent));
+ dput(d_parent);
+
+ if (IS_ERR(dentry))
+ dentry = NULL;
+ return dentry;
+}
+
+static void remove_dir(struct eventfs_inode *ei)
+{
+ struct eventfs_inode *tmp;
+ LIST_HEAD(head);

- list_del_rcu(&ei->list);
- call_srcu(&eventfs_srcu, &ei->rcu, free_rcu_ei);
+ mutex_lock(&eventfs_mutex);
+ eventfs_remove_rec(ei, &head, 0);
+ mutex_unlock(&eventfs_mutex);
+
+ list_for_each_entry_safe(ei, tmp, &head, list) {
+ list_del(&ei->list);
+ eventfs_inode_put(ei);
+ }
}

/**
@@ -920,17 +884,14 @@ void eventfs_remove_dir(struct eventfs_inode *ei)
if (!ei)
return;

- mutex_lock(&eventfs_mutex);
- dentry = ei->dentry;
- eventfs_remove_rec(ei, 0);
- mutex_unlock(&eventfs_mutex);
+ dentry = find_ei_dentry(ei, 0);

- /*
- * If any of the ei children has a dentry, then the ei itself
- * must have a dentry.
- */
- if (dentry)
- simple_recursive_removal(dentry, NULL);
+ remove_dir(ei);
+
+ if (dentry) {
+ d_invalidate(dentry);
+ dput(dentry);
+ }
}

/**
@@ -941,10 +902,13 @@ void eventfs_remove_dir(struct eventfs_inode *ei)
*/
void eventfs_remove_events_dir(struct eventfs_inode *ei)
{
+ struct eventfs_root_inode *rei;
struct dentry *dentry;

- dentry = ei->dentry;
- eventfs_remove_dir(ei);
+ rei = get_root_inode(ei);
+ dentry = rei->root_dentry;
+
+ remove_dir(ei);

/*
* Matches the dget() done by tracefs_start_creating()
@@ -953,5 +917,8 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei)
* sticks around while the other ei->dentry are created
* and destroyed dynamically.
*/
+ simple_recursive_removal(dentry, NULL);
+
+ eventfs_inode_put(ei);
dput(dentry);
}
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 91c2bf0b91d9..2af78fd95c93 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -34,9 +34,7 @@ struct eventfs_attr {
* @entries: the array of entries representing the files in the directory
* @name: the name of the directory to create
* @children: link list into the child eventfs_inode
- * @dentry: the dentry of the directory
* @d_parent: pointer to the parent's dentry
- * @d_children: The array of dentries to represent the files when created
* @entry_attrs: Saved mode and ownership of the @d_children
* @attr: Saved mode and ownership of eventfs_inode itself
* @data: The private data to pass to the callbacks
@@ -49,12 +47,11 @@ struct eventfs_inode {
const struct eventfs_entry *entries;
const char *name;
struct list_head children;
- struct dentry *dentry; /* Check is_freed to access */
- struct dentry *d_parent;
- struct dentry **d_children;
+ struct eventfs_inode *parent;
struct eventfs_attr *entry_attrs;
struct eventfs_attr attr;
void *data;
+ struct kref kref;
unsigned int is_freed:1;
unsigned int is_events:1;
unsigned int nr_entries:30;


2024-01-30 16:53:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Tue, 30 Jan 2024 at 06:39, Steven Rostedt <[email protected]> wrote:
>
> On Tue, 30 Jan 2024 01:12:05 -0800
> >
> > I suspect the solution is to make eventfs_create_dir() do the same as
> > the events directory case does, and actually pin the directory dentry
> > and save it off.
>
> I rather not have the create do that because that happens for every event
> directory.

I wasn't thinking straight yesterday - the real fix is to just do a
"d_revalidate()". It's literally why that thing exists: check whether
a dentry is still valid.

In fact, that trivially gets rid of the 'events' subdirectory issues
too, so we can stop doing that horrendous simple_recursive_removal()
too.

Let me reboot into the trivial fix. I just needed to think about this
the right way.

None of this is anything that the VFS layer has any problems with.

Linus

2024-01-30 16:56:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Tue, 30 Jan 2024 at 08:49, Steven Rostedt <[email protected]> wrote:
>
> - On removal, I got rid of the SRCU callback and the work queue.
> Instead, I find the dentry of the current eventfs_inode that is being
> deleted by walking the ei->parent until I find the events inode that has
> a dentry. I then use that to do a lookup walking back down to the
> eventfs_inode I want to delete. This gives me the dentry that I can call
> d_invalidate() on.

Yes, that works.

However, I have a patch that is *much* smaller and simpler, and
doesn't need that walk.

The VFS layer already has a good interface for "should I still use
this dentry", which is needed for various network filesystems etc that
want to time out caches (or check explicitly whether the file still
exists etc): it's the dentry d_revalidate() check.

Let me just reboot into it to test that I got all the cases.

It makes the code even more obvious, and avoids all the complexity.

Linus

2024-01-30 17:06:40

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Tue, 30 Jan 2024 08:55:51 -0800
Linus Torvalds <[email protected]> wrote:

> On Tue, 30 Jan 2024 at 08:49, Steven Rostedt <[email protected]> wrote:
> >
> > - On removal, I got rid of the SRCU callback and the work queue.
> > Instead, I find the dentry of the current eventfs_inode that is being
> > deleted by walking the ei->parent until I find the events inode that has
> > a dentry. I then use that to do a lookup walking back down to the
> > eventfs_inode I want to delete. This gives me the dentry that I can call
> > d_invalidate() on.
>
> Yes, that works.
>
> However, I have a patch that is *much* smaller and simpler, and
> doesn't need that walk.
>
> The VFS layer already has a good interface for "should I still use
> this dentry", which is needed for various network filesystems etc that
> want to time out caches (or check explicitly whether the file still
> exists etc): it's the dentry d_revalidate() check.
>
> Let me just reboot into it to test that I got all the cases.
>
> It makes the code even more obvious, and avoids all the complexity.


I actually had this before, but it wasn't working (likely to something else
that wasn't working or I did it wrong) so I reverted it.

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 49d4630d5d70..9867b39ae24c 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -451,6 +451,13 @@ lookup_file_dentry(struct dentry *dentry,
return dentry;
}

+int eventfs_revalidate(struct dentry *dentry, unsigned int flags)
+{
+ struct eventfs_inode *ei = dentry->d_fsdata;
+
+ return ei && !ei->is_freed;
+}
+
/**
* eventfs_root_lookup - lookup routine to create file/dir
* @dir: in which a lookup is being done
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index e1b172c0e091..0395459d919e 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -392,8 +392,24 @@ static void tracefs_dentry_iput(struct dentry *dentry, struct inode *inode)
iput(inode);
}

+static int tracefs_revalidate(struct dentry *dentry, unsigned int flags)
+{
+ struct inode *inode = dentry->d_inode;
+ struct tracefs_inode *ti;
+
+ if (!dentry || !inode)
+ return 0;
+
+ ti = get_tracefs(inode);
+ if (!ti || !(ti->flags & TRACEFS_EVENT_INODE))
+ return 1;
+
+ return eventfs_revalidate(dentry, flags);
+}
+
static const struct dentry_operations tracefs_dentry_operations = {
- .d_iput = tracefs_dentry_iput,
+ .d_iput = tracefs_dentry_iput,
+ .d_revalidate = tracefs_revalidate,
};

static int trace_fill_super(struct super_block *sb, void *data, int silent)
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 2af78fd95c93..a1024202c4e5 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -80,5 +80,6 @@ struct dentry *eventfs_start_creating(const char *name, struct dentry *parent);
struct dentry *eventfs_failed_creating(struct dentry *dentry);
struct dentry *eventfs_end_creating(struct dentry *dentry);
void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry);
+int eventfs_revalidate(struct dentry *dentry, unsigned int flags);

#endif /* _TRACEFS_INTERNAL_H */


2024-01-30 17:31:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Tue, 30 Jan 2024 at 08:55, Linus Torvalds
<[email protected]> wrote:
>
> Let me just reboot into it to test that I got all the cases.
>
> It makes the code even more obvious, and avoids all the complexity.

Yup, this works at least for the case you pointed out, and it really
feels like the RightThing(tm) from a VFS standpoint.

NOTE! This patch is on top of my previous 5-patch series.

Linus


Attachments:
0001-eventfs-clean-up-dentry-ops-and-add-revalidate-funct.patch (4.82 kB)

2024-01-30 18:24:04

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 19:56:52 -0800
Linus Torvalds <[email protected]> wrote:

> [0005-eventfs-get-rid-of-dentry-pointers-without-refcounts.patch text/x-patch (15652 bytes)]

I missed this email when I wrote basically the same thing :-p

It was sent just as I went to bed and this morning I mistaken it as an
email I already read. When you mentioned that your last patch was on top of
your other 5, I said to myself, "Wait? there's five?" and went back though
this thread looking at every email with an attachment. Well, even though I
duplicated your work, I'll take that as a learning experience.

Anyway, I'm assuming that the other 4 patches are exactly the same as the
previous version, as applying patches from attachments is a manual process.
I don't even develop on the machine with my email client, so it either is
me downloading each and scp'ing them over to my development box or finding
the lore link and going through them individually one by one.

I know you don't send patches inlined anymore, which is a shame, because
patchwork takes care of all the administering when patches are inlined, and
I don't miss patches like I use to.

So, I'll down load patch 5 here and just assume that I already have your
other 4 patches from the previous email.

-- Steve

2024-01-30 18:36:35

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Mon, 29 Jan 2024 19:56:52 -0800
Linus Torvalds <[email protected]> wrote:

> [0001-tracefs-avoid-using-the-ei-dentry-pointer-unnecessar.patch text/x-patch (3561 bytes)]
>
> [0002-eventfsfs-initialize-the-tracefs-inode-properly.patch text/x-patch (2548 bytes)]

Ah these two are new.

-- Steve

2024-01-30 19:19:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Tue, 30 Jan 2024 at 10:23, Steven Rostedt <[email protected]> wrote:
>
> I know you don't send patches inlined anymore, which is a shame, because
> patchwork takes care of all the administering when patches are inlined, and
> I don't miss patches like I use to.

I just sent the whole series as individual patches., and doing a

b4 am [email protected]

should get them all (or if you use patchwork, do that)

I don't normally do these inlined patches any more, because honestly,
99% of all patches I do end up being "I can't test this very well, I
think you should do something like this".

In fact, for simple ones where I might not have even compile-tested
them, much less booted into a kernel with them, I will actively
whitespace-corrupt the patch, just to make sure they aren't applied as
any kind of real workflow - they are almost always meant as a "I think
you should do this, and take credit for it all".

And so when I'm working on a series like this, I'll send attachments
just because it's easier, and because I don't want to patch-bomb
people with some kind of crazy work-in-progress thing.

But I'm reasonably comfortable with this series now, so I sent it as a
"real" patch series. I like it partly because it just removes a lot of
lines:

3 files changed, 160 insertions(+), 433 deletions(-)

but mostly because the lines it removes are what I consider actively
broken code. So it's not just getting rid of LOC, it's getting rid of
complexity (and bugs) IMHO.

That said, I also don't think that patch series is any kind of
"final". I didn't fix up the readdir iterator locking, for example. I
don't think the SRCU parts are needed at all any more thanks to the
refcounting - the 'ei' is no longer protected by SRCU, it's protected
by virtue of us having the file open (and thus holding the dentry).

So please think of that series not as any kind of final word. More as
a "I strongly believe this is the direction eventfs should go".

I am perfectly ok with you munging those patches and taking full
credit for them, for example.

My "testing" has not involved any real tracing usage, and while I
*have* booted that thing, and have done some very basic smoke-testing
in /sys/kernel/tracing, 99% of the work was literally me just going
through the lookup code, removing everything I found objectionable,
and replacing it with what the VFS layer generally would want.

Linus

2024-01-30 19:37:30

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Tue, 30 Jan 2024 11:19:01 -0800
Linus Torvalds <[email protected]> wrote:

> On Tue, 30 Jan 2024 at 10:23, Steven Rostedt <[email protected]> wrote:
> >
> > I know you don't send patches inlined anymore, which is a shame, because
> > patchwork takes care of all the administering when patches are inlined, and
> > I don't miss patches like I use to.
>
> I just sent the whole series as individual patches., and doing a
>
> b4 am [email protected]
>
> should get them all (or if you use patchwork, do that)

Thanks. I can get the mbox with:

wget https://patchwork.kernel.org/series/821406/mbox/

For those interested, the series is at:

https://patchwork.kernel.org/project/linux-trace-kernel/list/?series=821406

And I got the above mbox link from the first patch and selecting the 'mbox'
link.

I download that and then run my scripts which adds the Link tags to them as
well as Cc's. I'll add the link to this thread as well. Then I simply do a
'git am -s'

Note, I mentioned the pain of pulling in your multi-attachment patches by
hand on IRC, and Konstantin replied that he'll fix b4 to do those too ;-)

But I still prefer patchwork, as that keeps history as well.

>
> I don't normally do these inlined patches any more, because honestly,
> 99% of all patches I do end up being "I can't test this very well, I
> think you should do something like this".
>
> In fact, for simple ones where I might not have even compile-tested
> them, much less booted into a kernel with them, I will actively
> whitespace-corrupt the patch, just to make sure they aren't applied as
> any kind of real workflow - they are almost always meant as a "I think
> you should do this, and take credit for it all".
>
> And so when I'm working on a series like this, I'll send attachments
> just because it's easier, and because I don't want to patch-bomb
> people with some kind of crazy work-in-progress thing.
>
> But I'm reasonably comfortable with this series now, so I sent it as a
> "real" patch series. I like it partly because it just removes a lot of
> lines:
>
> 3 files changed, 160 insertions(+), 433 deletions(-)
>
> but mostly because the lines it removes are what I consider actively
> broken code. So it's not just getting rid of LOC, it's getting rid of
> complexity (and bugs) IMHO.

Do you want me to put this in my urgent branch and mark them for stable,
and then send them for 6.8?

>
> That said, I also don't think that patch series is any kind of
> "final". I didn't fix up the readdir iterator locking, for example. I
> don't think the SRCU parts are needed at all any more thanks to the
> refcounting - the 'ei' is no longer protected by SRCU, it's protected
> by virtue of us having the file open (and thus holding the dentry).
>
> So please think of that series not as any kind of final word. More as
> a "I strongly believe this is the direction eventfs should go".
>
> I am perfectly ok with you munging those patches and taking full
> credit for them, for example.

Before seeing your 5 patch series, I wrote that patch I showed you which
did basically the same thing. It passed all the preliminary tests so I'm
sure your version should work too. I'll still have to run it through my full
ktest suite, but that takes several hours to complete. I do that before I
send out my 'for-linus/next' or any git pulls.


>
> My "testing" has not involved any real tracing usage, and while I
> *have* booted that thing, and have done some very basic smoke-testing
> in /sys/kernel/tracing, 99% of the work was literally me just going
> through the lookup code, removing everything I found objectionable,
> and replacing it with what the VFS layer generally would want.

I'll take your patches and start working on them.

Thanks for doing this!

-- Steve

2024-01-30 19:57:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Tue, 30 Jan 2024 at 11:37, Steven Rostedt <[email protected]> wrote:
>
> Do you want me to put this in my urgent branch and mark them for stable,
> and then send them for 6.8?

Hmm. I think the only one that fixes a _reported_ bug is that [PTCH
2/6]. And it turns out that while 'ti->private' really is entirely
uninitialized (and I still think it's the cause of the kernel test
robot report that started this thread), the ti->flags field _is_
initialized to zero in tracefs_alloc_inode().

So even in that [PATCH 2/6], these parts:

- ti->flags |= TRACEFS_EVENT_INODE;
+ ti->flags = TRACEFS_EVENT_INODE;

aren't strictly needed (but aren't wrong either).

The 'make sure to initialize ti->private before exposing the dentry'
part *definitely* needs to be part of 6.8, though. That has an
outstanding actually triggered bug report on it.

I do think that tracefs_alloc_inode() should also initialize
ti->private to NULL, but while that would fix the oops that the test
robot reported, it wouldn't fix the data-race on any ti->private
accesses.

So that "ti->private = ei" needs to be done before the d_instantiate()
(that later became a d_add()) regardless. But not having random fields
left uninitialized for future subtle bugs would be a good idea too.

Anyway.

If you do run the full tracefs tests on the whole series, and there
are no other major problems, I'll happily take it all for 6.8. And
yes, even mark it for stable. I think the other bugs are much harder
to hit, but I do think they exist. And code deletion is always good.

So give it the full test attention, and *if* it all still looks good
and there are no new subtle issues that crop up, let's just put this
saga behind us asap.

Linus

2024-01-30 20:03:57

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Tue, 30 Jan 2024 11:54:56 -0800
Linus Torvalds <[email protected]> wrote:

> On Tue, 30 Jan 2024 at 11:37, Steven Rostedt <[email protected]> wrote:
> >
> > Do you want me to put this in my urgent branch and mark them for stable,
> > and then send them for 6.8?
>
> Hmm. I think the only one that fixes a _reported_ bug is that [PTCH
> 2/6]. And it turns out that while 'ti->private' really is entirely
> uninitialized (and I still think it's the cause of the kernel test
> robot report that started this thread), the ti->flags field _is_
> initialized to zero in tracefs_alloc_inode().
>
> So even in that [PATCH 2/6], these parts:
>
> - ti->flags |= TRACEFS_EVENT_INODE;
> + ti->flags = TRACEFS_EVENT_INODE;
>
> aren't strictly needed (but aren't wrong either).
>
> The 'make sure to initialize ti->private before exposing the dentry'
> part *definitely* needs to be part of 6.8, though. That has an
> outstanding actually triggered bug report on it.
>
> I do think that tracefs_alloc_inode() should also initialize
> ti->private to NULL, but while that would fix the oops that the test
> robot reported, it wouldn't fix the data-race on any ti->private
> accesses.
>
> So that "ti->private = ei" needs to be done before the d_instantiate()
> (that later became a d_add()) regardless. But not having random fields
> left uninitialized for future subtle bugs would be a good idea too.


I'll add a patch to add __GFP_ZERO to the tracefs inode allocation, and
modify your patch 2 to just move the ti->private = ei;


>
> Anyway.
>
> If you do run the full tracefs tests on the whole series, and there
> are no other major problems, I'll happily take it all for 6.8. And
> yes, even mark it for stable. I think the other bugs are much harder
> to hit, but I do think they exist. And code deletion is always good.

Sounds good.

>
> So give it the full test attention, and *if* it all still looks good
> and there are no new subtle issues that crop up, let's just put this
> saga behind us asap.

Yes, I've been wanting to get away from eventfs for a month now.

Again, I really do appreciate the time you put in to fixing this for me.
I'm going to be backporting this to older Chromebooks as we really need to
cut down on the memory overhead of instances.

-- Steve

2024-01-31 15:58:43

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Tue, 30 Jan 2024 11:54:56 -0800
Linus Torvalds <[email protected]> wrote:

> If you do run the full tracefs tests on the whole series, and there
> are no other major problems, I'll happily take it all for 6.8. And
> yes, even mark it for stable. I think the other bugs are much harder
> to hit, but I do think they exist. And code deletion is always good.
>
> So give it the full test attention, and *if* it all still looks good
> and there are no new subtle issues that crop up, let's just put this
> saga behind us asap.

BTW, I ran my full test suite on your patches with the below updates and it
all passed. Note, I did not run the "bisectable" portion of my test. That
is, the part that runs tests on each patch in the series. Because I know
that fails. I just ran the tests on the last applied patch.

I can break up and clean up the patches so that they are bisectable, and if
that passes the bisectable portion of my tests, I can still send them to
you for 6.8. I think this does fix a lot of hidden bugs, and would like all
this to go back to 6.6 when the eventfs was first introduced.

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index acdc797bd9c0..31cbe38739fa 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -74,7 +74,8 @@ static void release_ei(struct kref *ref)
{
struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref);
kfree(ei->entry_attrs);
- kfree(ei);
+ kfree_const(ei->name);
+ kfree_rcu(ei, rcu);
}

static inline void put_ei(struct eventfs_inode *ei)
@@ -348,8 +349,7 @@ static struct dentry *lookup_file(struct eventfs_inode *parent_ei,
inode->i_ino = EVENTFS_FILE_INODE_INO;

ti = get_tracefs(inode);
- ti->flags = TRACEFS_EVENT_INODE;
- ti->private = NULL; // Directories have 'ei', files not
+ ti->flags |= TRACEFS_EVENT_INODE;

// Files have their parent's ei as their fsdata
dentry->d_fsdata = get_ei(parent_ei);
@@ -388,7 +388,8 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry,
inode->i_ino = eventfs_dir_ino(ei);

ti = get_tracefs(inode);
- ti->flags = TRACEFS_EVENT_INODE;
+ ti->flags |= TRACEFS_EVENT_INODE;
+ /* Only directories have ti->private set to an ei, not files */
ti->private = ei;

dentry->d_fsdata = get_ei(ei);
@@ -402,17 +403,20 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry,

static inline struct eventfs_inode *alloc_ei(const char *name)
{
- int namesize = strlen(name) + 1;
- struct eventfs_inode *ei = kzalloc(sizeof(*ei) + namesize, GFP_KERNEL);
+ struct eventfs_inode *ei = kzalloc(sizeof(*ei), GFP_KERNEL);

- if (ei) {
- memcpy((char *)ei->name, name, namesize);
- kref_init(&ei->kref);
+ if (!ei)
+ return NULL;
+
+ ei->name = kstrdup_const(name, GFP_KERNEL);
+ if (!ei->name) {
+ kfree(ei);
+ return NULL;
}
+ kref_init(&ei->kref);
return ei;
}

-
/**
* eventfs_d_release - dentry is going away
* @dentry: dentry which has the reference to remove.
@@ -750,7 +754,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
INIT_LIST_HEAD(&ei->list);

ti = get_tracefs(inode);
- ti->flags = TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
+ ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
ti->private = ei;

inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
@@ -847,5 +851,6 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei)
* sticks around while the other ei->dentry are created
* and destroyed dynamically.
*/
+ d_invalidate(dentry);
dput(dentry);
}
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 64122787e5d0..cf90ea86baf8 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -38,8 +38,6 @@ static struct inode *tracefs_alloc_inode(struct super_block *sb)
if (!ti)
return NULL;

- ti->flags = 0;
-
return &ti->vfs_inode;
}

@@ -506,75 +504,6 @@ struct dentry *tracefs_end_creating(struct dentry *dentry)
return dentry;
}

-/**
- * eventfs_start_creating - start the process of creating a dentry
- * @name: Name of the file created for the dentry
- * @parent: The parent dentry where this dentry will be created
- *
- * This is a simple helper function for the dynamically created eventfs
- * files. When the directory of the eventfs files are accessed, their
- * dentries are created on the fly. This function is used to start that
- * process.
- */
-struct dentry *eventfs_start_creating(const char *name, struct dentry *parent)
-{
- struct dentry *dentry;
- int error;
-
- /* Must always have a parent. */
- if (WARN_ON_ONCE(!parent))
- return ERR_PTR(-EINVAL);
-
- error = simple_pin_fs(&trace_fs_type, &tracefs_mount,
- &tracefs_mount_count);
- if (error)
- return ERR_PTR(error);
-
- if (unlikely(IS_DEADDIR(parent->d_inode)))
- dentry = ERR_PTR(-ENOENT);
- else
- dentry = lookup_one_len(name, parent, strlen(name));
-
- if (!IS_ERR(dentry) && dentry->d_inode) {
- dput(dentry);
- dentry = ERR_PTR(-EEXIST);
- }
-
- if (IS_ERR(dentry))
- simple_release_fs(&tracefs_mount, &tracefs_mount_count);
-
- return dentry;
-}
-
-/**
- * eventfs_failed_creating - clean up a failed eventfs dentry creation
- * @dentry: The dentry to clean up
- *
- * If after calling eventfs_start_creating(), a failure is detected, the
- * resources created by eventfs_start_creating() needs to be cleaned up. In
- * that case, this function should be called to perform that clean up.
- */
-struct dentry *eventfs_failed_creating(struct dentry *dentry)
-{
- dput(dentry);
- simple_release_fs(&tracefs_mount, &tracefs_mount_count);
- return NULL;
-}
-
-/**
- * eventfs_end_creating - Finish the process of creating a eventfs dentry
- * @dentry: The dentry that has successfully been created.
- *
- * This function is currently just a place holder to match
- * eventfs_start_creating(). In case any synchronization needs to be added,
- * this function will be used to implement that without having to modify
- * the callers of eventfs_start_creating().
- */
-struct dentry *eventfs_end_creating(struct dentry *dentry)
-{
- return dentry;
-}
-
/* Find the inode that this will use for default */
static struct inode *instance_inode(struct dentry *parent, struct inode *inode)
{
@@ -788,6 +717,7 @@ static void init_once(void *foo)
{
struct tracefs_inode *ti = (struct tracefs_inode *) foo;

+ memset(ti, 0, sizeof(*ti));
inode_init_once(&ti->vfs_inode);
}

diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index d4194466b643..ca1ccc86986f 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -46,6 +46,7 @@ struct eventfs_inode {
struct kref kref;
struct list_head list;
const struct eventfs_entry *entries;
+ const char *name;
struct list_head children;
struct dentry *events_dir;
struct eventfs_attr *entry_attrs;
@@ -64,7 +65,6 @@ struct eventfs_inode {
struct llist_node llist;
struct rcu_head rcu;
};
- const char name[];
};

static inline struct tracefs_inode *get_tracefs(const struct inode *inode)
@@ -76,9 +76,6 @@ struct dentry *tracefs_start_creating(const char *name, struct dentry *parent);
struct dentry *tracefs_end_creating(struct dentry *dentry);
struct dentry *tracefs_failed_creating(struct dentry *dentry);
struct inode *tracefs_get_inode(struct super_block *sb);
-struct dentry *eventfs_start_creating(const char *name, struct dentry *parent);
-struct dentry *eventfs_failed_creating(struct dentry *dentry);
-struct dentry *eventfs_end_creating(struct dentry *dentry);

void eventfs_d_release(struct dentry *dentry);


2024-01-31 16:14:58

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Wed, 31 Jan 2024 10:58:47 -0500
Steven Rostedt <[email protected]> wrote:

> @@ -788,6 +717,7 @@ static void init_once(void *foo)
> {
> struct tracefs_inode *ti = (struct tracefs_inode *) foo;
>
> + memset(ti, 0, sizeof(*ti));
> inode_init_once(&ti->vfs_inode);
> }
>

Note, that inode_init_once() also does a memset on the entire inode, so the
initial memset is redundant on the inode portion. But I didn't think it was
really worth the time to complicate the code by optimizing it. I guess if I
changed the structure to:

struct tracefs_inode {
+ struct inode vfs_inode;
unsigned long flags;
void *private;
- struct inode vfs_inode;
};

I could have it do:

memset_after(ti, 0, vfs_inode);

But this can be done as a separate clean up and doesn't need to be done now.

-- Steve

2024-01-31 17:26:23

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Wed, 31 Jan 2024 10:58:47 -0500
Steven Rostedt <[email protected]> wrote:

> BTW, I ran my full test suite on your patches with the below updates and it
> all passed. Note, I did not run the "bisectable" portion of my test. That
> is, the part that runs tests on each patch in the series. Because I know
> that fails. I just ran the tests on the last applied patch.
>
> I can break up and clean up the patches so that they are bisectable, and if
> that passes the bisectable portion of my tests, I can still send them to
> you for 6.8. I think this does fix a lot of hidden bugs, and would like all
> this to go back to 6.6 when the eventfs was first introduced.

So I swapped the order of patch 5 and patch 6, and it appears that patch 6
works without 5 (with some massaging).

Here's the new patch 6 without patch 5, and now patch 5 just finishes the
rest of the changes.

I'll post this series, pull it in through my normal work flow, and rerun
the full test suite with the bisecting check as well.

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 0289ec787367..0213a3375d53 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -378,13 +378,12 @@ static void free_ei(struct eventfs_inode *ei)
}

/**
- * eventfs_set_ei_status_free - remove the dentry reference from an eventfs_inode
- * @ti: the tracefs_inode of the dentry
+ * eventfs_d_release - dentry is going away
* @dentry: dentry which has the reference to remove.
*
* Remove the association between a dentry from an eventfs_inode.
*/
-void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
+void eventfs_d_release(struct dentry *dentry)
{
struct eventfs_inode *ei;
int i;
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index f403d32bd7cd..cf90ea86baf8 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -377,21 +377,30 @@ static const struct super_operations tracefs_super_operations = {
.show_options = tracefs_show_options,
};

-static void tracefs_dentry_iput(struct dentry *dentry, struct inode *inode)
+/*
+ * It would be cleaner if eventfs had its own dentry ops.
+ *
+ * Note that d_revalidate is called potentially under RCU,
+ * so it can't take the eventfs mutex etc. It's fine - if
+ * we open a file just as it's marked dead, things will
+ * still work just fine, and just see the old stale case.
+ */
+static void tracefs_d_release(struct dentry *dentry)
{
- struct tracefs_inode *ti;
+ if (dentry->d_fsdata)
+ eventfs_d_release(dentry);
+}

- if (!dentry || !inode)
- return;
+static int tracefs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+ struct eventfs_inode *ei = dentry->d_fsdata;

- ti = get_tracefs(inode);
- if (ti && ti->flags & TRACEFS_EVENT_INODE)
- eventfs_set_ei_status_free(ti, dentry);
- iput(inode);
+ return !(ei && ei->is_freed);
}

static const struct dentry_operations tracefs_dentry_operations = {
- .d_iput = tracefs_dentry_iput,
+ .d_revalidate = tracefs_d_revalidate,
+ .d_release = tracefs_d_release,
};

static int trace_fill_super(struct super_block *sb, void *data, int silent)
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 9e64ca4829c7..687faba9807b 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -77,6 +77,7 @@ struct dentry *tracefs_start_creating(const char *name, struct dentry *parent);
struct dentry *tracefs_end_creating(struct dentry *dentry);
struct dentry *tracefs_failed_creating(struct dentry *dentry);
struct inode *tracefs_get_inode(struct super_block *sb);
-void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry);
+
+void eventfs_d_release(struct dentry *dentry);

#endif /* _TRACEFS_INTERNAL_H */

2024-01-31 17:29:14

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Wed, 31 Jan 2024 11:13:58 -0500
Steven Rostedt <[email protected]> wrote:

> On Wed, 31 Jan 2024 10:58:47 -0500
> Steven Rostedt <[email protected]> wrote:
>
> > @@ -788,6 +717,7 @@ static void init_once(void *foo)
> > {
> > struct tracefs_inode *ti = (struct tracefs_inode *) foo;
> >
> > + memset(ti, 0, sizeof(*ti));
> > inode_init_once(&ti->vfs_inode);
> > }
> >
>
> Note, that inode_init_once() also does a memset on the entire inode, so the
> initial memset is redundant on the inode portion. But I didn't think it was
> really worth the time to complicate the code by optimizing it. I guess if I
> changed the structure to:
>
> struct tracefs_inode {
> + struct inode vfs_inode;
> unsigned long flags;
> void *private;
> - struct inode vfs_inode;
> };
>
> I could have it do:
>
> memset_after(ti, 0, vfs_inode);
>
> But this can be done as a separate clean up and doesn't need to be done now.
>

Hmm, since I need to run all the tests again, I think I'll change this
patch to do the above.

-- Steve

2024-01-31 19:35:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Wed, 31 Jan 2024 at 07:58, Steven Rostedt <[email protected]> wrote:
>
> BTW, I ran my full test suite on your patches with the below updates and it
> all passed.

Those patch updates all look sane to me.

> I can break up and clean up the patches so that they are bisectable, and if
> that passes the bisectable portion of my tests, I can still send them to
> you for 6.8.

Ack. That series you posted looks fine. I didn't do any actual testing
or applying the patches, just looking at them.

The one thing I noticed is that the 'llist' removal still needs to be
done. The logical point is that "[PATCH v2 7/7]" where the
eventfs_workfn stuff is ripped out.

And the 'rcu' head should now be a union with something that is no
longer used after the last kref. The only thing that *is* used after
the last kref is the "is_freed" bit, so there's lots of choice. Using
the 'struct list_head listl' that is used for the child list would
seem to be the obvious choice, but it could be anything (including all
of the beginning of that eventfs_inode, but then you would need to
group that as another nested unnamed struct, so picking a "big enough"
entry like 'list' makes it syntactically simpler.

Linus

2024-01-31 19:42:15

by Steven Rostedt

[permalink] [raw]
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

On Wed, 31 Jan 2024 11:35:18 -0800
Linus Torvalds <[email protected]> wrote:

> On Wed, 31 Jan 2024 at 07:58, Steven Rostedt <[email protected]> wrote:
> >
> > BTW, I ran my full test suite on your patches with the below updates and it
> > all passed.
>
> Those patch updates all look sane to me.
>
> > I can break up and clean up the patches so that they are bisectable, and if
> > that passes the bisectable portion of my tests, I can still send them to
> > you for 6.8.
>
> Ack. That series you posted looks fine. I didn't do any actual testing
> or applying the patches, just looking at them.
>
> The one thing I noticed is that the 'llist' removal still needs to be
> done. The logical point is that "[PATCH v2 7/7]" where the
> eventfs_workfn stuff is ripped out.
>
> And the 'rcu' head should now be a union with something that is no
> longer used after the last kref. The only thing that *is* used after
> the last kref is the "is_freed" bit, so there's lots of choice. Using
> the 'struct list_head listl' that is used for the child list would
> seem to be the obvious choice, but it could be anything (including all
> of the beginning of that eventfs_inode, but then you would need to
> group that as another nested unnamed struct, so picking a "big enough"
> entry like 'list' makes it syntactically simpler.

Yeah, that was what I was talking about in my cover letter with:

Note, there's more clean ups that can happen. One being cleaning up the
eventfs_inode structure. But that's not critical now and can be added
later.

I just want to get the majority of the broken parts done. The clean up of
the eventfs_inode is something that I'd add a separate patch. Not sure that
falls in your "fixes" category for 6.8.

-- Steve