LinuxLists.cc - [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On Sun, Jun 11, 2023 at 7:22 PM Dave Chinner <[email protected]> wrote:
>
> I guess the regression fix needs a regression fix....

Yup.

From the description of the problem, it sounds like this happens on
real hardware, no vhost anywhere?

Or maybe Darrick (who doesn't see the issue) is running on raw
hardware, and you and Zorro are running in a virtual environment?

It sounds like zap_other_threads() and coredump_task_exit() do not
agree about the core_state->nr_threads counting, which is part of what
changed there.

[ Goes off to look ]

Hmm. Both seem to be using the same test for

(t->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER

which I don't love - I don't think io_uring threads should participate
in core dumping either, so I think the test could just be

(t->flags & PF_IO_WORKER)

but that shouldn't be the issue here.

But according to

https://lore.kernel.org/all/20230611124836.whfktwaumnefm5z5@zlang-mailbox/

it's clearly hanging in wait_for_completion_state() in
coredump_wait(), so it really looks like some confusion about that
core_waiters (aka core_state->nr_threads) count.

Oh. Humm. Mike changed that initial rough patch of mine, and I had
moved the "if you don't participate in c ore dumps" test up also past
the "do_coredump()" logic.

And I think it's horribly *wrong* for a thread that doesn't get
counted for core-dumping to go into do_coredump(), because then it
will set the "core_state" to possibly be the core-state of the vhost
thread that isn't even counted.

So *maybe* this attached patch might fix it? I haven't thought very
deeply about this, but vhost workers most definitely shouldn't call
do_coredump(), since they are then not counted.

(And again, I think we should just check that PF_IO_WORKER bit, not
use this more complex test, but that's a separate and bigger change).

Linus

Attachments:

patch.diff (554.00 B)

2023-06-12 05:38:46

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On Sun, Jun 11, 2023 at 08:14:25PM -0700, Linus Torvalds wrote:
> On Sun, Jun 11, 2023 at 7:22 PM Dave Chinner <[email protected]> wrote:
> >
> > I guess the regression fix needs a regression fix....
>
> Yup.
>
> From the description of the problem, it sounds like this happens on
> real hardware, no vhost anywhere?
>
> Or maybe Darrick (who doesn't see the issue) is running on raw
> hardware, and you and Zorro are running in a virtual environment?

I'm testing inside VMs and seeing it, I can't speak for anyone else.

....

> So *maybe* this attached patch might fix it? I haven't thought very
> deeply about this, but vhost workers most definitely shouldn't call
> do_coredump(), since they are then not counted.
>
> (And again, I think we should just check that PF_IO_WORKER bit, not
> use this more complex test, but that's a separate and bigger change).
>
> Linus

> kernel/signal.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 2547fa73bde5..a1e11ee8537c 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2847,6 +2847,10 @@ bool get_signal(struct ksignal *ksig)
> */
> current->flags |= PF_SIGNALED;
>
> + /* vhost workers don't participate in core dups */
> + if ((current->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER)
> + goto out;
> +
> if (sig_kernel_coredump(signr)) {
> if (print_fatal_signals)
> print_fatal_signal(ksig->info.si_signo);

That would appear to make things worse. mkfs.xfs hung in Z state on
exit and never returned to the shell. Also, multiple processes are
livelocked like this:

Sending NMI from CPU 0 to CPUs 1-3:
NMI backtrace for cpu 2
CPU: 2 PID: 3409 Comm: pmlogger_farm Not tainted 6.4.0-rc5-dgc+ #1822
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:uprobe_deny_signal+0x5/0x90
Code: 48 c7 c1 c4 64 62 82 48 c7 c7 d1 64 62 82 e8 b2 39 ec ff e9 70 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 <55> 31 4
RSP: 0018:ffffc900023abdf0 EFLAGS: 00000202
RAX: 0000000000000004 RBX: ffff888103b127c0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000296 RDI: ffffc900023abe70
RBP: ffffc900023abe60 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: ffff88813bd2ccf0 R12: ffff888103b127c0
R13: ffffc900023abe70 R14: ffff888110413700 R15: ffff888103d26e80
FS: 00007f35497a4740(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 00007ffd4ca0ce80 CR3: 000000010f7d1000 CR4: 00000000000006e0
Call Trace:
<NMI>
? show_regs+0x61/0x70
? nmi_cpu_backtrace+0x88/0xf0
? nmi_cpu_backtrace_handler+0x11/0x20
? nmi_handle+0x57/0x150
? default_do_nmi+0x49/0x240
? exc_nmi+0xf4/0x110
? end_repeat_nmi+0x16/0x31
? uprobe_deny_signal+0x5/0x90
? uprobe_deny_signal+0x5/0x90
? uprobe_deny_signal+0x5/0x90
</NMI>
<TASK>
? get_signal+0x94/0x9b0
? signal_setup_done+0x66/0x190
arch_do_signal_or_restart+0x2f/0x260
exit_to_user_mode_prepare+0x181/0x1c0
syscall_exit_to_user_mode+0x16/0x40
do_syscall_64+0x40/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0023:0xffff888103b127c0
Code: Unable to access opcode bytes at 0xffff888103b12796.
RSP: 002b:00007ffd4ca0d0ac EFLAGS: 00000202 ORIG_RAX: 000000000000003d
RAX: 0000000000000009 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 00007ffd4d20bb9c RDI: 00000000ffffffff
RBP: 00007ffd4d20bb9c R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
R13: 00007ffd4d20bba0 R14: 00005604571fc380 R15: 0000000000000001
</TASK>
NMI backtrace for cpu 3
CPU: 3 PID: 3526 Comm: pmlogger_check Not tainted 6.4.0-rc5-dgc+ #1822
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:fixup_exception+0x72/0x260
Code: 14 0f 87 03 02 00 00 ff 24 d5 98 67 22 82 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 41 81 cd 00 00 00 40 4d 63 ed 4d 89 6c 24 50 <31> c0 9
RSP: 0018:ffffc9000275bb58 EFLAGS: 00000083
RAX: 000000000000000f RBX: ffffffff827d0a4c RCX: ffffffff810c5f95
RDX: 000000000000000f RSI: ffffffff827d0a4c RDI: ffffc9000275bb28
RBP: ffffc9000275bb80 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffc9000275bc78
R13: 000000000000000e R14: 000000008f5ded3f R15: 0000000000000000
FS: 00007f56a36de740(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 000000008f5ded3f CR3: 000000010dcde000 CR4: 00000000000006e0
Call Trace:
<NMI>
? show_regs+0x61/0x70
? nmi_cpu_backtrace+0x88/0xf0
? nmi_cpu_backtrace_handler+0x11/0x20
? nmi_handle+0x57/0x150
? default_do_nmi+0x49/0x240
? exc_nmi+0xf4/0x110
? end_repeat_nmi+0x16/0x31
? copy_fpstate_to_sigframe+0x1c5/0x3a0
? fixup_exception+0x72/0x260
? fixup_exception+0x72/0x260
? fixup_exception+0x72/0x260
</NMI>
<TASK>
kernelmode_fixup_or_oops+0x49/0x120
__bad_area_nosemaphore+0x15a/0x230
? __bad_area+0x57/0x80
bad_area_nosemaphore+0x16/0x20
exc_page_fault+0x323/0x880
asm_exc_page_fault+0x27/0x30
RIP: 0010:copy_fpstate_to_sigframe+0x1c5/0x3a0
Code: 45 89 bc 24 40 25 00 00 f0 41 80 64 24 01 bf e9 f5 fe ff ff be 3c 00 00 00 48 c7 c7 77 9c 5f 82 e8 00 2a 23 00 31 c0 0f 1f 00 <49> 0f 1
RSP: 0018:ffffc9000275bd28 EFLAGS: 00010246
RAX: 000000000000000e RBX: 000000008f5de7ec RCX: ffffc9000275bda8
RDX: 000000008f5ded40 RSI: 000000000000003c RDI: ffffffff825f9c77
RBP: ffffc9000275bd98 R08: ffffc9000275be30 R09: 0000000000000001
R10: 0000000000000000 R11: ffffc90000138ff8 R12: ffff8881106527c0
R13: 000000008f5deb40 R14: ffff888110654d40 R15: ffff88810a653f40
? copy_fpstate_to_sigframe+0x1c0/0x3a0
? __might_sleep+0x42/0x70
get_sigframe+0xcd/0x2b0
ia32_setup_frame+0x61/0x230
arch_do_signal_or_restart+0x1d1/0x260
exit_to_user_mode_prepare+0x181/0x1c0
irqentry_exit_to_user_mode+0x9/0x30
irqentry_exit+0x33/0x40
exc_page_fault+0x1b6/0x880
asm_exc_page_fault+0x27/0x30
RIP: 0023:0x106527c0
Code: Unable to access opcode bytes at 0x10652796.
RSP: 002b:000000008f5ded6c EFLAGS: 00010202
RAX: 000000000000000b RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 00007ffd8f5df2ec RDI: 00000000ffffffff
RBP: 00007ffd8f5df2ec R08: 0000000000000000 R09: 00005558962eb526
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
R13: 00007ffd8f5df2f0 R14: 00005558962b5e60 R15: 0000000000000001
</TASK>

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-06-12 06:13:31

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On Sun, Jun 11, 2023 at 10:16 PM Dave Chinner <[email protected]> wrote:
>
> > + /* vhost workers don't participate in core dups */
> > + if ((current->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER)
> > + goto out;
> > +
>
> That would appear to make things worse. mkfs.xfs hung in Z state on
> exit and never returned to the shell.

Well, duh, that's because I'm a complete nincompoop who just copied
the condition from the other cases, but those other cases were for
testing the "this is *not* a vhost worker".

Here the logic - as per the comment I added - was supposed to be "is
this a vhost worker".

So that "!=" should obviously have been a "==".

Not that I'm at all convinced that that will fix the problem you are
seeing, but at least it shouldn't have made things worse like the
getting the condition completely wrong did.

Linus

2023-06-12 06:35:16

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On Sun, Jun 11, 2023 at 10:49 PM Dave Chinner <[email protected]> wrote:
>
> On Sun, Jun 11, 2023 at 10:34:29PM -0700, Linus Torvalds wrote:
> >
> > So that "!=" should obviously have been a "==".
>
> Same as without the condition - all the fsstress tasks hang in
> do_coredump().

Ok, that at least makes sense. Your "it made things worse" made me go
"What?" until I noticed the stupid backwards test.

I'm not seeing anything else that looks odd in that commit
f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps
regression").

Let's see if somebody else goes "Ahh" when they wake up tomorrow...

Linus

2023-06-12 06:36:13

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On Sun, Jun 11, 2023 at 10:34:29PM -0700, Linus Torvalds wrote:
> On Sun, Jun 11, 2023 at 10:16 PM Dave Chinner <[email protected]> wrote:
> >
> > > + /* vhost workers don't participate in core dups */
> > > + if ((current->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER)
> > > + goto out;
> > > +
> >
> > That would appear to make things worse. mkfs.xfs hung in Z state on
> > exit and never returned to the shell.
>
> Well, duh, that's because I'm a complete nincompoop who just copied
> the condition from the other cases, but those other cases were for
> testing the "this is *not* a vhost worker".
>
> Here the logic - as per the comment I added - was supposed to be "is
> this a vhost worker".
>
> So that "!=" should obviously have been a "==".

Same as without the condition - all the fsstress tasks hang in
do_coredump().

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-06-12 07:36:37

by Zorro Lang

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

2023-06-12 09:51:17

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

Linus Torvalds <[email protected]> writes:

> On Sun, Jun 11, 2023 at 10:49 PM Dave Chinner <[email protected]> wrote:
>>
>> On Sun, Jun 11, 2023 at 10:34:29PM -0700, Linus Torvalds wrote:
>> >
>> > So that "!=" should obviously have been a "==".
>>
>> Same as without the condition - all the fsstress tasks hang in
>> do_coredump().
>
> Ok, that at least makes sense. Your "it made things worse" made me go
> "What?" until I noticed the stupid backwards test.
>
> I'm not seeing anything else that looks odd in that commit
> f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps
> regression").
>
> Let's see if somebody else goes "Ahh" when they wake up tomorrow...

It feels like there have been about half a dozen bugs pointed out in
that version of the patch. I am going to have to sleep before I can get
as far as "Ahh"

One thing that really stands out for me is.

if (test_if_loop_should_continue) {
set_current_state(TASK_INTERRUPTIBLE);
schedule();
}

/* elsewhere */
llist_add(...);
wake_up_process()

So it is possible that the code can sleep indefinitely waiting for a
wake-up that has already come, because the order of set_current_state
and the test are in the wrong order.

Unfortunately I don't see what would effect a coredump on a process that
does not trigger the vhost_worker code.

About the only thing I can image is if io_uring is involved. Some of
the PF_IO_WORKER code was changed, and the test
"((t->flags & (PF_USER_WORKER | PF_IO_WORKER)) != PF_USER_WORKER)"
was sprinkled around.

That is the only code outside of vhost specific code that was changed.

Is io_uring involved in the cases that hang?

Eric

2023-06-12 10:18:17

by Zorro Lang

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On Mon, Jun 12, 2023 at 03:45:12AM -0500, Eric W. Biederman wrote:
> Linus Torvalds <[email protected]> writes:
>
> > On Sun, Jun 11, 2023 at 10:49 PM Dave Chinner <[email protected]> wrote:
> >>
> >> On Sun, Jun 11, 2023 at 10:34:29PM -0700, Linus Torvalds wrote:
> >> >
> >> > So that "!=" should obviously have been a "==".
> >>
> >> Same as without the condition - all the fsstress tasks hang in
> >> do_coredump().
> >
> > Ok, that at least makes sense. Your "it made things worse" made me go
> > "What?" until I noticed the stupid backwards test.
> >
> > I'm not seeing anything else that looks odd in that commit
> > f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps
> > regression").
> >
> > Let's see if somebody else goes "Ahh" when they wake up tomorrow...
>
> It feels like there have been about half a dozen bugs pointed out in
> that version of the patch. I am going to have to sleep before I can get
> as far as "Ahh"
>
> One thing that really stands out for me is.
>
> if (test_if_loop_should_continue) {
> set_current_state(TASK_INTERRUPTIBLE);
> schedule();
> }
>
> /* elsewhere */
> llist_add(...);
> wake_up_process()
>
> So it is possible that the code can sleep indefinitely waiting for a
> wake-up that has already come, because the order of set_current_state
> and the test are in the wrong order.
>
> Unfortunately I don't see what would effect a coredump on a process that
> does not trigger the vhost_worker code.
>
>
>
> About the only thing I can image is if io_uring is involved. Some of
> the PF_IO_WORKER code was changed, and the test
> "((t->flags & (PF_USER_WORKER | PF_IO_WORKER)) != PF_USER_WORKER)"
> was sprinkled around.
>
> That is the only code outside of vhost specific code that was changed.
>
>
> Is io_uring involved in the cases that hang?

Oh, right, I involved io_uring into in fstests' fsstress.c, and I built kernel
with CONFIG_IO_URING=y. If Darrick (said he didn't hit this issue) didn't enable
io_uring, that might mean it's io_uring related.

>
>
> Eric
>

2023-06-12 11:54:32

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

2023-06-12 15:59:50

by Darrick J. Wong

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

2023-06-12 16:13:30

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On Mon, Jun 12, 2023 at 8:36 AM Darrick J. Wong <[email protected]> wrote:
>
> > Or maybe Darrick (who doesn't see the issue) is running on raw
> > hardware, and you and Zorro are running in a virtual environment?
>
> Ahah, it turns out that liburing-dev isn't installed on the test fleet,
> so fstests didn't get built with io_uring support. That probably
> explains why I don't see any of these hangs.
>
> Oh. I can't *install* the debian liburing-dev package because it has
> a versioned dependency on linux-libc-dev >= 5.1, which isn't compatible
> with me having a linux-libc-dev-djwong package that contains the uapi
> headers for the latest upstream kernel and Replaces: linux-libc-dev.
> So either I have to create a dummy linux-libc-dev with adequate version
> number that pulls in my own libc header package, or rename that package.
>
> <sigh> It's going to take me a while to research how best to split this
> stupid knot.

Oh, no, that's great. It explains why you don't see the problem, and
Dave and Zorro do. Perfect.

No need for you to install any liburing packages, at least for this
issue. You'll probably want it eventually just for test coverage, but
for now it's the smoking gun we wanted - I was looking at why vhost
would be impacted, because that commit so intentionally *tried* to not
do anything at all to io_uring.

But it obviously failed. Which then in turn explains the bug.

Not that I see exactly where it went wrong yet, but at least we're
looking at the right thing. Adding Jens to the participants, in case
he sees what goes wrong.

Jens, commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix
freezer/ps regression") seems to have broken core dumping with
io_uring threads, even though it tried very hard not to. See

https://lore.kernel.org/all/20230611124836.whfktwaumnefm5z5@zlang-mailbox/

for the beginning of this thread.

Honestly, that "try to not change io_uring" was my least favorite part
of that patch, because I really think we want to try to aim for these
user helper threads having as much infrastructure in common as
possible. And when it comes to core dumps, I do not believe that
waiting for the io_uring thread adds anything to the end result,
because the only reason we wait for it is to put in the thread
register state into the core dump, and for kernel helper threads, that
information just isn't useful. It's going to be the state that caused
the thread to be created, not anything that is worth saving in a core
dump for.

So I'd actually prefer to just simplify the logic entirely, and say
"PF_USER_WORKER tasks do not participate in core dumps, end of story".
io_uring didn't _care_, so including them wasn't a pain, but if the
vhost exit case can be delayed, I'd rather just say "let's do thig
thing for both io_uring and vhost, and not split those two cases up".

Anyway, I don't see exactly what goes wrong, but I feel better just
from this having been narrowed down to io_uring threads. I suspect
Jens actually might even have a core-dumping test-case somewhere,
since core dumping was a thing that io_uring ended up having some
issues with at one point.

Linus

2023-06-12 16:51:24

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

Can someone who can reproduce the hang run this test patch.

I am currently drawing a blank looking at the changes, so I am
proposing some debug code to help us narrow things down.

Can someone who can reproduce this run the code below?

The tests reproducing this don't appear to use use /dev/host-net or
/dev/vhost-vsock. So if the WARN_ON's trigger it is a good sign
that code connected to the WARN_ON's are wrong.

If the WARN_ON's don't trigger I suspect the code in kernel/fork.c

But as I said staring at the code I don't see anything wrong.

Eric

diff --git a/fs/coredump.c b/fs/coredump.c
index 88740c51b942..e9acf0a2d2f0 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -374,6 +374,7 @@ static int zap_process(struct task_struct *start, int exit_code)
/* The vhost_worker does not particpate in coredumps */
if ((t->flags & (PF_USER_WORKER | PF_IO_WORKER)) != PF_USER_WORKER)
nr++;
+ else WARN_ON_ONCE(true);
}
}

diff --git a/kernel/exit.c b/kernel/exit.c
index edb50b4c9972..56002a58ec33 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -437,6 +437,7 @@ static void coredump_task_exit(struct task_struct *tsk)
}
__set_current_state(TASK_RUNNING);
}
+ else if (core_state) WARN_ON_ONCE(true);
}

#ifdef CONFIG_MEMCG
diff --git a/kernel/signal.c b/kernel/signal.c
index 2547fa73bde5..1be27dbbce62 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1371,6 +1371,7 @@ int zap_other_threads(struct task_struct *p)
/* Don't require de_thread to wait for the vhost_worker */
if ((t->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER)
count++;
+ else WARN_ON_ONCE(true);

/* Don't bother with already dead threads */
if (t->exit_state)

2023-06-12 17:12:26

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On 6/12/23 9:56?AM, Linus Torvalds wrote:
> On Mon, Jun 12, 2023 at 8:36?AM Darrick J. Wong <[email protected]> wrote:
>>
>>> Or maybe Darrick (who doesn't see the issue) is running on raw
>>> hardware, and you and Zorro are running in a virtual environment?
>>
>> Ahah, it turns out that liburing-dev isn't installed on the test fleet,
>> so fstests didn't get built with io_uring support. That probably
>> explains why I don't see any of these hangs.
>>
>> Oh. I can't *install* the debian liburing-dev package because it has
>> a versioned dependency on linux-libc-dev >= 5.1, which isn't compatible
>> with me having a linux-libc-dev-djwong package that contains the uapi
>> headers for the latest upstream kernel and Replaces: linux-libc-dev.
>> So either I have to create a dummy linux-libc-dev with adequate version
>> number that pulls in my own libc header package, or rename that package.
>>
>> <sigh> It's going to take me a while to research how best to split this
>> stupid knot.
>
> Oh, no, that's great. It explains why you don't see the problem, and
> Dave and Zorro do. Perfect.
>
> No need for you to install any liburing packages, at least for this
> issue. You'll probably want it eventually just for test coverage, but
> for now it's the smoking gun we wanted - I was looking at why vhost
> would be impacted, because that commit so intentionally *tried* to not
> do anything at all to io_uring.
>
> But it obviously failed. Which then in turn explains the bug.
>
> Not that I see exactly where it went wrong yet, but at least we're
> looking at the right thing. Adding Jens to the participants, in case
> he sees what goes wrong.
>
> Jens, commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix
> freezer/ps regression") seems to have broken core dumping with
> io_uring threads, even though it tried very hard not to. See
>
> https://lore.kernel.org/all/20230611124836.whfktwaumnefm5z5@zlang-mailbox/
>
> for the beginning of this thread.
>
> Honestly, that "try to not change io_uring" was my least favorite part
> of that patch, because I really think we want to try to aim for these
> user helper threads having as much infrastructure in common as
> possible. And when it comes to core dumps, I do not believe that
> waiting for the io_uring thread adds anything to the end result,
> because the only reason we wait for it is to put in the thread
> register state into the core dump, and for kernel helper threads, that
> information just isn't useful. It's going to be the state that caused
> the thread to be created, not anything that is worth saving in a core
> dump for.
>
> So I'd actually prefer to just simplify the logic entirely, and say
> "PF_USER_WORKER tasks do not participate in core dumps, end of story".
> io_uring didn't _care_, so including them wasn't a pain, but if the
> vhost exit case can be delayed, I'd rather just say "let's do thig
> thing for both io_uring and vhost, and not split those two cases up".
>
> Anyway, I don't see exactly what goes wrong, but I feel better just
> from this having been narrowed down to io_uring threads. I suspect
> Jens actually might even have a core-dumping test-case somewhere,
> since core dumping was a thing that io_uring ended up having some
> issues with at one point.

I'll take a look - at the exact same time as your email, someone just
reported this issue separately on the liburing GH page as well. Tried
myself, and yup, anything that ends up spawning an io-wq worker and then
core dumps will now get stuck:

[ 136.271250] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 136.271711] task:ih state:D stack:0 pid:736 ppid:727 flags:0x00000004
[ 136.272218] Call trace:
[ 136.272353] __switch_to+0xb0/0xc8
[ 136.272555] __schedule+0x528/0x584
[ 136.272757] schedule+0x4c/0x90
[ 136.272936] schedule_timeout+0x30/0xdc
[ 136.273179] __wait_for_common+0x8c/0x118
[ 136.273407] wait_for_completion_state+0x1c/0x30
[ 136.273686] do_coredump+0x334/0x1000
[ 136.273898] get_signal+0x19c/0x5d8
[ 136.274108] do_notify_resume+0x10c/0xa0c
[ 136.274346] el0_da+0x50/0x5c
[ 136.274555] el0t_64_sync_handler+0xb8/0x134
[ 136.274812] el0t_64_sync+0x168/0x16c

Not good... I don't immediately see what the issue is, but I'll poke
shortly after a few meetings.

--
Jens Axboe

2023-06-12 17:14:50

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On 6/12/23 10:27?AM, Jens Axboe wrote:
> On 6/12/23 9:56?AM, Linus Torvalds wrote:
>> On Mon, Jun 12, 2023 at 8:36?AM Darrick J. Wong <[email protected]> wrote:
>>>
>>>> Or maybe Darrick (who doesn't see the issue) is running on raw
>>>> hardware, and you and Zorro are running in a virtual environment?
>>>
>>> Ahah, it turns out that liburing-dev isn't installed on the test fleet,
>>> so fstests didn't get built with io_uring support. That probably
>>> explains why I don't see any of these hangs.
>>>
>>> Oh. I can't *install* the debian liburing-dev package because it has
>>> a versioned dependency on linux-libc-dev >= 5.1, which isn't compatible
>>> with me having a linux-libc-dev-djwong package that contains the uapi
>>> headers for the latest upstream kernel and Replaces: linux-libc-dev.
>>> So either I have to create a dummy linux-libc-dev with adequate version
>>> number that pulls in my own libc header package, or rename that package.
>>>
>>> <sigh> It's going to take me a while to research how best to split this
>>> stupid knot.
>>
>> Oh, no, that's great. It explains why you don't see the problem, and
>> Dave and Zorro do. Perfect.
>>
>> No need for you to install any liburing packages, at least for this
>> issue. You'll probably want it eventually just for test coverage, but
>> for now it's the smoking gun we wanted - I was looking at why vhost
>> would be impacted, because that commit so intentionally *tried* to not
>> do anything at all to io_uring.
>>
>> But it obviously failed. Which then in turn explains the bug.
>>
>> Not that I see exactly where it went wrong yet, but at least we're
>> looking at the right thing. Adding Jens to the participants, in case
>> he sees what goes wrong.
>>
>> Jens, commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix
>> freezer/ps regression") seems to have broken core dumping with
>> io_uring threads, even though it tried very hard not to. See
>>
>> https://lore.kernel.org/all/20230611124836.whfktwaumnefm5z5@zlang-mailbox/
>>
>> for the beginning of this thread.
>>
>> Honestly, that "try to not change io_uring" was my least favorite part
>> of that patch, because I really think we want to try to aim for these
>> user helper threads having as much infrastructure in common as
>> possible. And when it comes to core dumps, I do not believe that
>> waiting for the io_uring thread adds anything to the end result,
>> because the only reason we wait for it is to put in the thread
>> register state into the core dump, and for kernel helper threads, that
>> information just isn't useful. It's going to be the state that caused
>> the thread to be created, not anything that is worth saving in a core
>> dump for.
>>
>> So I'd actually prefer to just simplify the logic entirely, and say
>> "PF_USER_WORKER tasks do not participate in core dumps, end of story".
>> io_uring didn't _care_, so including them wasn't a pain, but if the
>> vhost exit case can be delayed, I'd rather just say "let's do thig
>> thing for both io_uring and vhost, and not split those two cases up".
>>
>> Anyway, I don't see exactly what goes wrong, but I feel better just
>> from this having been narrowed down to io_uring threads. I suspect
>> Jens actually might even have a core-dumping test-case somewhere,
>> since core dumping was a thing that io_uring ended up having some
>> issues with at one point.
>
> I'll take a look - at the exact same time as your email, someone just
> reported this issue separately on the liburing GH page as well. Tried
> myself, and yup, anything that ends up spawning an io-wq worker and then
> core dumps will now get stuck:
>
> [ 136.271250] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 136.271711] task:ih state:D stack:0 pid:736 ppid:727 flags:0x00000004
> [ 136.272218] Call trace:
> [ 136.272353] __switch_to+0xb0/0xc8
> [ 136.272555] __schedule+0x528/0x584
> [ 136.272757] schedule+0x4c/0x90
> [ 136.272936] schedule_timeout+0x30/0xdc
> [ 136.273179] __wait_for_common+0x8c/0x118
> [ 136.273407] wait_for_completion_state+0x1c/0x30
> [ 136.273686] do_coredump+0x334/0x1000
> [ 136.273898] get_signal+0x19c/0x5d8
> [ 136.274108] do_notify_resume+0x10c/0xa0c
> [ 136.274346] el0_da+0x50/0x5c
> [ 136.274555] el0t_64_sync_handler+0xb8/0x134
> [ 136.274812] el0t_64_sync+0x168/0x16c
>
> Not good... I don't immediately see what the issue is, but I'll poke
> shortly after a few meetings.

Quick peek would suggest that it's because io-wq clears PF_IO_WORKER on
exit, and now we fail the check in coredump_task_exit() that was added.

From my quick recollection, this is to avoid hitting the schedule out
callback on exit. But I could be totally wrong... In any case, I'd be
surprised if this isn't why it got broken by Mike's patch.

--
Jens Axboe

2023-06-12 17:15:53

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On Mon, Jun 12, 2023 at 9:38 AM Jens Axboe <[email protected]> wrote:
>
> Quick peek would suggest that it's because io-wq clears PF_IO_WORKER on
> exit, and now we fail the check in coredump_task_exit() that was added.

Oh, that makes sense.

Well, it makes sense for the bug, but that whole

preempt_disable();
current->flags &= ~PF_IO_WORKER;
preempt_enable();

thin in io_worker_exit() does *not* make sense to me.

Does removing those three lines make things "JustWork(tm)"?

Linus

2023-06-12 17:16:09

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On 6/12/23 10:42?AM, Linus Torvalds wrote:
> On Mon, Jun 12, 2023 at 9:38?AM Jens Axboe <[email protected]> wrote:
>>
>> Quick peek would suggest that it's because io-wq clears PF_IO_WORKER on
>> exit, and now we fail the check in coredump_task_exit() that was added.
>
> Oh, that makes sense.
>
> Well, it makes sense for the bug, but that whole
>
> preempt_disable();
> current->flags &= ~PF_IO_WORKER;
> preempt_enable();
>
> thin in io_worker_exit() does *not* make sense to me.
>
> Does removing those three lines make things "JustWork(tm)"?

You snipped the suspicion in my reply on why that exists, to avoid
io_wq_worker_sleeping() triggering. But arguably this should be covered
by the RUNNING flag. I'll poke at it shortly, and yeah then we should
just remove the PF_IO_WORKER clearing.

Or maybe I'm just smoking crack and it's no longer needed at all. Will
test and send a patch.

--
Jens Axboe

2023-06-12 17:17:08

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On Mon, Jun 12, 2023 at 9:42 AM Linus Torvalds
<[email protected]> wrote:
>
> Well, it makes sense for the bug, but that whole
>
> preempt_disable();
> current->flags &= ~PF_IO_WORKER;
> preempt_enable();
>
> thin in io_worker_exit() does *not* make sense to me.

Oh, it looks like that goes back to the original io_worker
implementation, when io_worker_start() would set the PF_IO_WORKER bit,
and io_worker_exit() would clear it, and they were otherwise created
like normal kthreads.

It *looks* all entirely historical to me, and removing those three
lines as left-over baggage smells like the right thing.

Linus

2023-06-12 17:21:43

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

Linus Torvalds <[email protected]> writes:

> On Mon, Jun 12, 2023 at 9:45 AM Jens Axboe <[email protected]> wrote:
>>
>> You snipped the suspicion in my reply on why that exists, to avoid
>> io_wq_worker_sleeping() triggering.
>
> I'm not seeing why triggering io_wq_worker_sleeping() should even be a
> problem in the first place.
>
> I suspect that is entirely historical too, and has to do with how it
> used to do that
>
> struct io_worker *worker = kthread_data(tsk);
> struct io_wqe *wqe = worker->wqe;
>
> back in the bad old days of kthreads.
>
> But yeah, I don't know that code.

If it is a problem it looks like the thread shutdown can clear
"worker->flags & IO_WORKER_F_UP" rather than
"current->flags & PF_IO_WORKER".

I don't see how it makes sense for the load balancing logic for
a per-process thread pool to be running at that point.

Eric

2023-06-12 17:31:28

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On Mon, Jun 12, 2023 at 9:45 AM Jens Axboe <[email protected]> wrote:
>
> You snipped the suspicion in my reply on why that exists, to avoid
> io_wq_worker_sleeping() triggering.

I'm not seeing why triggering io_wq_worker_sleeping() should even be a
problem in the first place.

I suspect that is entirely historical too, and has to do with how it
used to do that

struct io_worker *worker = kthread_data(tsk);
struct io_wqe *wqe = worker->wqe;

back in the bad old days of kthreads.

But yeah, I don't know that code.

Linus

2023-06-12 17:31:48

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On 6/12/23 10:57 AM, Linus Torvalds wrote:
> On Mon, Jun 12, 2023 at 9:45 AM Jens Axboe <[email protected]> wrote:
>>
>> You snipped the suspicion in my reply on why that exists, to avoid
>> io_wq_worker_sleeping() triggering.
>
> I'm not seeing why triggering io_wq_worker_sleeping() should even be a
> problem in the first place.
>
> I suspect that is entirely historical too, and has to do with how it
> used to do that
>
> struct io_worker *worker = kthread_data(tsk);
> struct io_wqe *wqe = worker->wqe;
>
> back in the bad old days of kthreads.
>
> But yeah, I don't know that code.

Looks fine to me to just kill it indeed, whatever we did need this
for is definitely no longer the case. I _think_ we used to have
something in the worker exit that would potentially sleep which
is why we killed it before doing that, now it just looks like dead
code.

--
Jens Axboe

2023-06-12 17:31:56

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On 6/12/23 11:11?AM, Eric W. Biederman wrote:
> Linus Torvalds <[email protected]> writes:
>
>> On Mon, Jun 12, 2023 at 9:45?AM Jens Axboe <[email protected]> wrote:
>>>
>>> You snipped the suspicion in my reply on why that exists, to avoid
>>> io_wq_worker_sleeping() triggering.
>>
>> I'm not seeing why triggering io_wq_worker_sleeping() should even be a
>> problem in the first place.
>>
>> I suspect that is entirely historical too, and has to do with how it
>> used to do that
>>
>> struct io_worker *worker = kthread_data(tsk);
>> struct io_wqe *wqe = worker->wqe;
>>
>> back in the bad old days of kthreads.
>>
>> But yeah, I don't know that code.
>
> If it is a problem it looks like the thread shutdown can clear
> "worker->flags & IO_WORKER_F_UP" rather than
> "current->flags & PF_IO_WORKER".
>
> I don't see how it makes sense for the load balancing logic for
> a per-process thread pool to be running at that point.

Yep that was my thinking too, if we did need it, we could fiddle with
the UP flag instead. But as per the previous reply, it should be able to
just get removed at this point.

--
Jens Axboe

2023-06-12 18:05:41

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On 6/12/23 11:51?AM, Linus Torvalds wrote:
> On Mon, Jun 12, 2023 at 10:29?AM Jens Axboe <[email protected]> wrote:
>>
>> Looks fine to me to just kill it indeed, whatever we did need this
>> for is definitely no longer the case. I _think_ we used to have
>> something in the worker exit that would potentially sleep which
>> is why we killed it before doing that, now it just looks like dead
>> code.
>
> Ok, can you (and the fsstress people) confirm that this
> whitespace-damaged patch fixes the coredump issue:
>
>
> --- a/io_uring/io-wq.c
> +++ b/io_uring/io-wq.c
> @@ -221,9 +221,6 @@ static void io_worker_exit(..
> raw_spin_unlock(&wq->lock);
> io_wq_dec_running(worker);
> worker->flags = 0;
> - preempt_disable();
> - current->flags &= ~PF_IO_WORKER;
> - preempt_enable();
>
> kfree_rcu(worker, rcu);
> io_worker_ref_put(wq);

Yep, it fixes it on my end and it passes my basic tests as well.

> Jens, I think that the two lines above there, ie the whole
>
> io_wq_dec_running(worker);
> worker->flags = 0;
>
> thing may actually be the (partial?) reason for those PF_IO_WORKER
> games. It's basically doing "now I'm doing stats by hand", and I
> wonder if now it decrements the running worker one time too much?
>
> Ie when the finally *dead* worker schedules away, never to return,
> that's when that io_wq_worker_sleeping() case triggers and decrements
> things one more time.
>
> So there might be some bookkeeping reason for those games, but it
> looks like if that's the case, then that
>
> worker->flags = 0;
>
> will have already taken care of it.
>
> I wonder if those two lines could just be removed too. But I think
> that's separate from the "let's fix the core dumping" issue.

Something like that was/is my worry. Let me add some tracing to confirm
it's fine, don't fully trust just my inspection of it. I'll send out the
patch separately once done, and then would be great if you can just pick
it up so it won't have to wait until I need to send a pull later in the
week. Particularly as I have nothing planned for 6.4 unless something
else comes up of course.

--
Jens Axboe

2023-06-12 18:16:36

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On Mon, Jun 12, 2023 at 10:54 AM Jens Axboe <[email protected]> wrote:
>
> Something like that was/is my worry. Let me add some tracing to confirm
> it's fine, don't fully trust just my inspection of it. I'll send out the
> patch separately once done, and then would be great if you can just pick
> it up so it won't have to wait until I need to send a pull later in the
> week. Particularly as I have nothing planned for 6.4 unless something
> else comes up of course.

Ack, sounds good.

Linus

2023-06-12 18:19:50

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On Mon, Jun 12, 2023 at 10:29 AM Jens Axboe <[email protected]> wrote:
>
> Looks fine to me to just kill it indeed, whatever we did need this
> for is definitely no longer the case. I _think_ we used to have
> something in the worker exit that would potentially sleep which
> is why we killed it before doing that, now it just looks like dead
> code.

Ok, can you (and the fsstress people) confirm that this
whitespace-damaged patch fixes the coredump issue:

--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -221,9 +221,6 @@ static void io_worker_exit(..
raw_spin_unlock(&wq->lock);
io_wq_dec_running(worker);
worker->flags = 0;
- preempt_disable();
- current->flags &= ~PF_IO_WORKER;
- preempt_enable();

kfree_rcu(worker, rcu);
io_worker_ref_put(wq);

Jens, I think that the two lines above there, ie the whole

io_wq_dec_running(worker);
worker->flags = 0;

thing may actually be the (partial?) reason for those PF_IO_WORKER
games. It's basically doing "now I'm doing stats by hand", and I
wonder if now it decrements the running worker one time too much?

Ie when the finally *dead* worker schedules away, never to return,
that's when that io_wq_worker_sleeping() case triggers and decrements
things one more time.

So there might be some bookkeeping reason for those games, but it
looks like if that's the case, then that

worker->flags = 0;

will have already taken care of it.

I wonder if those two lines could just be removed too. But I think
that's separate from the "let's fix the core dumping" issue.

Linus

2023-06-12 18:59:52

[permalink] [raw]

Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)

On Mon, Jun 12, 2023 at 10:51 AM Linus Torvalds
<[email protected]> wrote:
>
> Ok, can you (and the fsstress people) confirm that this
> whitespace-damaged patch fixes the coredump issue:

The proper patch from Jens is now in my tree, and I'm archiving this
thread on the assumption that it's all good.

Dave/Zorro - if you still see any issues with that patch in place, holler.

Linus

2023-06-12 23:56:30