On Sun, Jun 11, 2023 at 06:51:45PM -0700, Darrick J. Wong wrote:
> On Mon, Jun 12, 2023 at 09:01:19AM +1000, Dave Chinner wrote:
> > On Sun, Jun 11, 2023 at 08:48:36PM +0800, Zorro Lang wrote:
> > > Hi,
> > >
> > > When I tried to do fstests regression testing this weekend on the latest linux
> > > v6.4-rc5+ (HEAD=64569520920a3ca5d456ddd9f4f95fc6ea9b8b45), nearly all
> > > testing jobs on xfs hang on generic/051 (more than 24 hours, still blocked).
> > > No matter whether it's 1k or 4k block size, a general disk or a pmem dev, any
> > > architecture, or any mkfs/mount option being tested, they all hang there.
> >
> > Yup, I started seeing this on upgrade to 6.4-rc5, too. xfs/079
> > generates it, because the fsstress process is crashing when the
> > XFS filesystem shuts down (maybe a SIGBUS from a mmap page fault?)
> > I don't know how reproducible it is yet; these only showed up in my
> > Thursday night testing so I haven't had a chance to triage it yet.
> >
> > > The console log is as below (a bit long). The call trace doesn't contain any
> > > xfs functions, so it might not be an xfs bug, but it can't be reproduced on ext4.
> >
> > AFAICT, the coredump is being done on the root drive (where fsstress
> > is being executed from), not the XFS test/scratch devices that
> > fsstress processes are exercising. I have ext3 root drives for my
> > test machines, so at this point I'm not sure that this is even a
> > filesystem related regression. i.e. it may be a recent regression in
> > the coredump or signal handling code....
>
> Willy was complaining about the same thing on Friday. Oddly I've not
> seen any problems with coredumps on 6.4-rc5, so .... <shrug>
A quick check against xfs/179 (the case I was seeing) indicates that
it is a regression between v6.4-rc4 and v6.4-rc5.
Looking at the kernel/ changelog, I'm expecting that it is a
regression introduced by this commit:
commit f9010dbdce911ee1f1af1398a24b1f9f992e0080
Author: Mike Christie <[email protected]>
Date: Thu Jun 1 13:32:32 2023 -0500
fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
When switching from kthreads to vhost_tasks two bugs were added:
1. The vhost worker tasks now show up as processes, so scripts doing
ps or ps a would incorrectly detect the vhost task as another
process.
2. kthreads disabled freezing by setting PF_NOFREEZE, but vhost
tasks didn't disable it or add support for it.
To fix both bugs, this switches the vhost task to be a thread in the
process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
SIGKILL/STOP support is required because CLONE_THREAD requires
CLONE_SIGHAND which requires those 2 signals to be supported.
This is a modified version of the patch written by Mike Christie
<[email protected]>, which was a modified version of a patch
originally written by Linus.
Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER.
Including ignoring signals, setting up the register state, and having
get_signal return instead of calling do_group_exit.
Tidied up the vhost_task abstraction so that the definition of
vhost_task only needs to be visible inside of vhost_task.c, making
it easier to review the code and tell what needs to be done where.
As part of this the main loop has been moved from vhost_worker into
vhost_task_fn. vhost_worker now returns true if work was done.
The main loop has been updated to call get_signal which handles
SIGSTOP, freezing, and collects the message that tells the thread to
exit as part of process exit. This collection clears
__fatal_signal_pending. This collection is not guaranteed to
clear signal_pending() so clear that explicitly so the schedule()
sleeps.
For now the vhost thread continues to exist and run work until the
last file descriptor is closed and the release function is called as
part of freeing struct file. To avoid hangs in the coredump
rendezvous and when killing threads in a multi-threaded exec, the
coredump code and de_thread have been modified to ignore vhost threads.
Removing the special case for exec appears to require teaching
vhost_dev_flush how to directly complete transactions in case
the vhost thread is no longer running.
Removing the special case for coredump rendezvous requires either the
above fix needed for exec or moving the coredump rendezvous into
get_signal.
Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
Signed-off-by: Eric W. Biederman <[email protected]>
Co-developed-by: Mike Christie <[email protected]>
Signed-off-by: Mike Christie <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
I'm just building a v6.4-rc5 with this one commit reverted to test
it right now.....
<a few minutes later>
Yup, 6.4-rc5 with that patch reverted doesn't hang at all. That's
the problem.
I guess the regression fix needs a regression fix....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Sun, Jun 11, 2023 at 7:22 PM Dave Chinner <[email protected]> wrote:
>
> I guess the regression fix needs a regression fix....
Yup.
From the description of the problem, it sounds like this happens on
real hardware, no vhost anywhere?
Or maybe Darrick (who doesn't see the issue) is running on raw
hardware, and you and Zorro are running in a virtual environment?
It sounds like zap_other_threads() and coredump_task_exit() do not
agree about the core_state->nr_threads counting, which is part of what
changed there.
[ Goes off to look ]
Hmm. Both seem to be using the same test for
(t->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER
which I don't love - I don't think io_uring threads should participate
in core dumping either, so I think the test could just be
(t->flags & PF_IO_WORKER)
but that shouldn't be the issue here.
But according to
https://lore.kernel.org/all/20230611124836.whfktwaumnefm5z5@zlang-mailbox/
it's clearly hanging in wait_for_completion_state() in
coredump_wait(), so it really looks like some confusion about that
core_waiters (aka core_state->nr_threads) count.
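Roughly, that rendezvous looks like this (a simplified sketch, not the
actual kernel code; participates_in_coredump() just stands in for the
PF_USER_WORKER/PF_IO_WORKER test above):

	/* dumping thread, coredump_wait(): zap_threads() counts how many
	 * other threads have to check in before the dump can proceed */
	atomic_set(&core_state->nr_threads, nr);
	if (nr > 0)
		wait_for_completion_state(&core_state->startup,
					  TASK_UNINTERRUPTIBLE | TASK_FREEZABLE);

	/* every other thread, on its way out in coredump_task_exit() */
	if (participates_in_coredump(current) &&
	    atomic_dec_and_test(&core_state->nr_threads))
		complete(&core_state->startup);

If a thread gets counted on one side but never does the decrement on
the other, the completion never fires and the dumper sleeps forever.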
Oh. Humm. Mike changed that initial rough patch of mine, and I had
moved the "if you don't participate in c ore dumps" test up also past
the "do_coredump()" logic.
And I think it's horribly *wrong* for a thread that doesn't get
counted for core-dumping to go into do_coredump(), because then it
will set the "core_state" to possibly be the core-state of the vhost
thread that isn't even counted.
So *maybe* this attached patch might fix it? I haven't thought very
deeply about this, but vhost workers most definitely shouldn't call
do_coredump(), since they are then not counted.
(And again, I think we should just check that PF_IO_WORKER bit, not
use this more complex test, but that's a separate and bigger change).
Linus
On Sun, Jun 11, 2023 at 08:14:25PM -0700, Linus Torvalds wrote:
> On Sun, Jun 11, 2023 at 7:22 PM Dave Chinner <[email protected]> wrote:
> >
> > I guess the regression fix needs a regression fix....
>
> Yup.
>
> From the description of the problem, it sounds like this happens on
> real hardware, no vhost anywhere?
>
> Or maybe Darrick (who doesn't see the issue) is running on raw
> hardware, and you and Zorro are running in a virtual environment?
I'm testing inside VMs and seeing it, I can't speak for anyone else.
....
> So *maybe* this attached patch might fix it? I haven't thought very
> deeply about this, but vhost workers most definitely shouldn't call
> do_coredump(), since they are then not counted.
>
> (And again, I think we should just check that PF_IO_WORKER bit, not
> use this more complex test, but that's a separate and bigger change).
>
> Linus
> kernel/signal.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 2547fa73bde5..a1e11ee8537c 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2847,6 +2847,10 @@ bool get_signal(struct ksignal *ksig)
> */
> current->flags |= PF_SIGNALED;
>
> + /* vhost workers don't participate in core dups */
> + if ((current->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER)
> + goto out;
> +
> if (sig_kernel_coredump(signr)) {
> if (print_fatal_signals)
> print_fatal_signal(ksig->info.si_signo);
That would appear to make things worse. mkfs.xfs hung in Z state on
exit and never returned to the shell. Also, multiple processes are
livelocked like this:
Sending NMI from CPU 0 to CPUs 1-3:
NMI backtrace for cpu 2
CPU: 2 PID: 3409 Comm: pmlogger_farm Not tainted 6.4.0-rc5-dgc+ #1822
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:uprobe_deny_signal+0x5/0x90
Code: 48 c7 c1 c4 64 62 82 48 c7 c7 d1 64 62 82 e8 b2 39 ec ff e9 70 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 <55> 31 4
RSP: 0018:ffffc900023abdf0 EFLAGS: 00000202
RAX: 0000000000000004 RBX: ffff888103b127c0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000296 RDI: ffffc900023abe70
RBP: ffffc900023abe60 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: ffff88813bd2ccf0 R12: ffff888103b127c0
R13: ffffc900023abe70 R14: ffff888110413700 R15: ffff888103d26e80
FS: 00007f35497a4740(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 00007ffd4ca0ce80 CR3: 000000010f7d1000 CR4: 00000000000006e0
Call Trace:
<NMI>
? show_regs+0x61/0x70
? nmi_cpu_backtrace+0x88/0xf0
? nmi_cpu_backtrace_handler+0x11/0x20
? nmi_handle+0x57/0x150
? default_do_nmi+0x49/0x240
? exc_nmi+0xf4/0x110
? end_repeat_nmi+0x16/0x31
? uprobe_deny_signal+0x5/0x90
? uprobe_deny_signal+0x5/0x90
? uprobe_deny_signal+0x5/0x90
</NMI>
<TASK>
? get_signal+0x94/0x9b0
? signal_setup_done+0x66/0x190
arch_do_signal_or_restart+0x2f/0x260
exit_to_user_mode_prepare+0x181/0x1c0
syscall_exit_to_user_mode+0x16/0x40
do_syscall_64+0x40/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0023:0xffff888103b127c0
Code: Unable to access opcode bytes at 0xffff888103b12796.
RSP: 002b:00007ffd4ca0d0ac EFLAGS: 00000202 ORIG_RAX: 000000000000003d
RAX: 0000000000000009 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 00007ffd4d20bb9c RDI: 00000000ffffffff
RBP: 00007ffd4d20bb9c R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
R13: 00007ffd4d20bba0 R14: 00005604571fc380 R15: 0000000000000001
</TASK>
NMI backtrace for cpu 3
CPU: 3 PID: 3526 Comm: pmlogger_check Not tainted 6.4.0-rc5-dgc+ #1822
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:fixup_exception+0x72/0x260
Code: 14 0f 87 03 02 00 00 ff 24 d5 98 67 22 82 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 41 81 cd 00 00 00 40 4d 63 ed 4d 89 6c 24 50 <31> c0 9
RSP: 0018:ffffc9000275bb58 EFLAGS: 00000083
RAX: 000000000000000f RBX: ffffffff827d0a4c RCX: ffffffff810c5f95
RDX: 000000000000000f RSI: ffffffff827d0a4c RDI: ffffc9000275bb28
RBP: ffffc9000275bb80 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffc9000275bc78
R13: 000000000000000e R14: 000000008f5ded3f R15: 0000000000000000
FS: 00007f56a36de740(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 000000008f5ded3f CR3: 000000010dcde000 CR4: 00000000000006e0
Call Trace:
<NMI>
? show_regs+0x61/0x70
? nmi_cpu_backtrace+0x88/0xf0
? nmi_cpu_backtrace_handler+0x11/0x20
? nmi_handle+0x57/0x150
? default_do_nmi+0x49/0x240
? exc_nmi+0xf4/0x110
? end_repeat_nmi+0x16/0x31
? copy_fpstate_to_sigframe+0x1c5/0x3a0
? fixup_exception+0x72/0x260
? fixup_exception+0x72/0x260
? fixup_exception+0x72/0x260
</NMI>
<TASK>
kernelmode_fixup_or_oops+0x49/0x120
__bad_area_nosemaphore+0x15a/0x230
? __bad_area+0x57/0x80
bad_area_nosemaphore+0x16/0x20
exc_page_fault+0x323/0x880
asm_exc_page_fault+0x27/0x30
RIP: 0010:copy_fpstate_to_sigframe+0x1c5/0x3a0
Code: 45 89 bc 24 40 25 00 00 f0 41 80 64 24 01 bf e9 f5 fe ff ff be 3c 00 00 00 48 c7 c7 77 9c 5f 82 e8 00 2a 23 00 31 c0 0f 1f 00 <49> 0f 1
RSP: 0018:ffffc9000275bd28 EFLAGS: 00010246
RAX: 000000000000000e RBX: 000000008f5de7ec RCX: ffffc9000275bda8
RDX: 000000008f5ded40 RSI: 000000000000003c RDI: ffffffff825f9c77
RBP: ffffc9000275bd98 R08: ffffc9000275be30 R09: 0000000000000001
R10: 0000000000000000 R11: ffffc90000138ff8 R12: ffff8881106527c0
R13: 000000008f5deb40 R14: ffff888110654d40 R15: ffff88810a653f40
? copy_fpstate_to_sigframe+0x1c0/0x3a0
? __might_sleep+0x42/0x70
get_sigframe+0xcd/0x2b0
ia32_setup_frame+0x61/0x230
arch_do_signal_or_restart+0x1d1/0x260
exit_to_user_mode_prepare+0x181/0x1c0
irqentry_exit_to_user_mode+0x9/0x30
irqentry_exit+0x33/0x40
exc_page_fault+0x1b6/0x880
asm_exc_page_fault+0x27/0x30
RIP: 0023:0x106527c0
Code: Unable to access opcode bytes at 0x10652796.
RSP: 002b:000000008f5ded6c EFLAGS: 00010202
RAX: 000000000000000b RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 00007ffd8f5df2ec RDI: 00000000ffffffff
RBP: 00007ffd8f5df2ec R08: 0000000000000000 R09: 00005558962eb526
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
R13: 00007ffd8f5df2f0 R14: 00005558962b5e60 R15: 0000000000000001
</TASK>
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Sun, Jun 11, 2023 at 10:16 PM Dave Chinner <[email protected]> wrote:
>
> > + /* vhost workers don't participate in core dups */
> > + if ((current->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER)
> > + goto out;
> > +
>
> That would appear to make things worse. mkfs.xfs hung in Z state on
> exit and never returned to the shell.
Well, duh, that's because I'm a complete nincompoop who just copied
the condition from the other cases, but those other cases were for
testing the "this is *not* a vhost worker".
Here the logic - as per the comment I added - was supposed to be "is
this a vhost worker".
So that "!=" should obviously have been a "==".
Not that I'm at all convinced that that will fix the problem you are
seeing, but at least it shouldn't have made things worse like
getting the condition completely wrong did.
Linus
On Sun, Jun 11, 2023 at 10:49 PM Dave Chinner <[email protected]> wrote:
>
> On Sun, Jun 11, 2023 at 10:34:29PM -0700, Linus Torvalds wrote:
> >
> > So that "!=" should obviously have been a "==".
>
> Same as without the condition - all the fsstress tasks hang in
> do_coredump().
Ok, that at least makes sense. Your "it made things worse" made me go
"What?" until I noticed the stupid backwards test.
I'm not seeing anything else that looks odd in that commit
f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps
regression").
Let's see if somebody else goes "Ahh" when they wake up tomorrow...
Linus
On Sun, Jun 11, 2023 at 10:34:29PM -0700, Linus Torvalds wrote:
> On Sun, Jun 11, 2023 at 10:16 PM Dave Chinner <[email protected]> wrote:
> >
> > > + /* vhost workers don't participate in core dups */
> > > + if ((current->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER)
> > > + goto out;
> > > +
> >
> > That would appear to make things worse. mkfs.xfs hung in Z state on
> > exit and never returned to the shell.
>
> Well, duh, that's because I'm a complete nincompoop who just copied
> the condition from the other cases, but those other cases were for
> testing the "this is *not* a vhost worker".
>
> Here the logic - as per the comment I added - was supposed to be "is
> this a vhost worker".
>
> So that "!=" should obviously have been a "==".
Same as without the condition - all the fsstress tasks hang in
do_coredump().
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Sun, Jun 11, 2023 at 08:14:25PM -0700, Linus Torvalds wrote:
> On Sun, Jun 11, 2023 at 7:22 PM Dave Chinner <[email protected]> wrote:
> >
> > I guess the regression fix needs a regression fix....
>
> Yup.
>
> From the description of the problem, it sounds like this happens on
> real hardware, no vhost anywhere?
>
> Or maybe Darrick (who doesn't see the issue) is running on raw
> hardware, and you and Zorro are running in a virtual environment?
I tested both in virtual environments and on raw hardware.
We have a pool of testing machines which contains lots of different machines,
including real machines, kvm, and other kinds of VMs, different
arches (aarch64, s390x, ppc64le and x86_64), and different kinds of
storage (virt, hardware RAID, generic scsi disk, pmem, etc...). They all hang
on fstests generic/051.
I remember Darrick saying he tests with ~160 VMs (this needs confirmation
from him).
So this issue might not be related to VMs vs. real machines. Hmm... maybe it's
related to some kernel config? If Darrick would like to provide his .config
file, I can diff it with mine to check the difference.
Thanks,
Zorro
>
> It sounds like zap_other_threads() and coredump_task_exit() do not
> agree about the core_state->nr_threads counting, which is part of what
> changed there.
>
> [ Goes off to look ]
>
> Hmm. Both seem to be using the same test for
>
> (t->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER
>
> which I don't love - I don't think io_uring threads should participate
> in core dumping either, so I think the test could just be
>
> (t->flags & PF_IO_WORKER)
>
> but that shouldn't be the issue here.
>
> But according to
>
> https://lore.kernel.org/all/20230611124836.whfktwaumnefm5z5@zlang-mailbox/
>
> it's clearly hanging in wait_for_completion_state() in
> coredump_wait(), so it really looks like some confusion about that
> core_waiters (aka core_state->nr_threads) count.
>
> Oh. Humm. Mike changed that initial rough patch of mine, and I had
> moved the "if you don't participate in c ore dumps" test up also past
> the "do_coredump()" logic.
>
> And I think it's horribly *wrong* for a thread that doesn't get
> counted for core-dumping to go into do_coredump(), because then it
> will set the "core_state" to possibly be the core-state of the vhost
> thread that isn't even counted.
>
> So *maybe* this attached patch might fix it? I haven't thought very
> deeply about this, but vhost workers most definitely shouldn't call
> do_coredump(), since they are then not counted.
>
> (And again, I think we should just check that PF_IO_WORKER bit, not
> use this more complex test, but that's a separate and bigger change).
>
> Linus
Linus Torvalds <[email protected]> writes:
> On Sun, Jun 11, 2023 at 10:49 PM Dave Chinner <[email protected]> wrote:
>>
>> On Sun, Jun 11, 2023 at 10:34:29PM -0700, Linus Torvalds wrote:
>> >
>> > So that "!=" should obviously have been a "==".
>>
>> Same as without the condition - all the fsstress tasks hang in
>> do_coredump().
>
> Ok, that at least makes sense. Your "it made things worse" made me go
> "What?" until I noticed the stupid backwards test.
>
> I'm not seeing anything else that looks odd in that commit
> f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps
> regression").
>
> Let's see if somebody else goes "Ahh" when they wake up tomorrow...
It feels like there have been about half a dozen bugs pointed out in
that version of the patch. I am going to have to sleep before I can get
as far as "Ahh"
One thing that really stands out for me is.
	if (test_if_loop_should_continue) {
		set_current_state(TASK_INTERRUPTIBLE);
		schedule();
	}

	/* elsewhere */
	llist_add(...);
	wake_up_process()
So it is possible that the code can sleep indefinitely waiting for a
wake-up that has already come, because set_current_state and the test
are in the wrong order.
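The conventional ordering publishes the sleeping state before testing
the condition, so a wake_up_process() issued right after the llist_add()
cannot be missed (a sketch only, reusing the placeholder name above):

	set_current_state(TASK_INTERRUPTIBLE);
	if (test_if_loop_should_continue)	/* re-check after publishing state */
		schedule();
	__set_current_state(TASK_RUNNING);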
Unfortunately I don't see what would affect a coredump on a process that
does not trigger the vhost_worker code.
About the only thing I can imagine is if io_uring is involved. Some of
the PF_IO_WORKER code was changed, and the test
"((t->flags & (PF_USER_WORKER | PF_IO_WORKER)) != PF_USER_WORKER)"
was sprinkled around.
That is the only code outside of vhost specific code that was changed.
Is io_uring involved in the cases that hang?
Eric
On Mon, Jun 12, 2023 at 03:45:12AM -0500, Eric W. Biederman wrote:
> Linus Torvalds <[email protected]> writes:
>
> > On Sun, Jun 11, 2023 at 10:49 PM Dave Chinner <[email protected]> wrote:
> >>
> >> On Sun, Jun 11, 2023 at 10:34:29PM -0700, Linus Torvalds wrote:
> >> >
> >> > So that "!=" should obviously have been a "==".
> >>
> >> Same as without the condition - all the fsstress tasks hang in
> >> do_coredump().
> >
> > Ok, that at least makes sense. Your "it made things worse" made me go
> > "What?" until I noticed the stupid backwards test.
> >
> > I'm not seeing anything else that looks odd in that commit
> > f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps
> > regression").
> >
> > Let's see if somebody else goes "Ahh" when they wake up tomorrow...
>
> It feels like there have been about half a dozen bugs pointed out in
> that version of the patch. I am going to have to sleep before I can get
> as far as "Ahh"
>
> One thing that really stands out for me is.
>
> if (test_if_loop_should_continue) {
> set_current_state(TASK_INTERRUPTIBLE);
> schedule();
> }
>
> /* elsewhere */
> llist_add(...);
> wake_up_process()
>
> So it is possible that the code can sleep indefinitely waiting for a
> wake-up that has already come, because the order of set_current_state
> and the test are in the wrong order.
>
> Unfortunately I don't see what would affect a coredump on a process that
> does not trigger the vhost_worker code.
>
>
>
> About the only thing I can imagine is if io_uring is involved. Some of
> the PF_IO_WORKER code was changed, and the test
> "((t->flags & (PF_USER_WORKER | PF_IO_WORKER)) != PF_USER_WORKER)"
> was sprinkled around.
>
> That is the only code outside of vhost specific code that was changed.
>
>
> Is io_uring involved in the cases that hang?
Oh, right, I enabled io_uring in fstests' fsstress.c, and I built the kernel
with CONFIG_IO_URING=y. If Darrick (who said he didn't hit this issue) didn't enable
io_uring, that might mean it's io_uring related.
>
>
> Eric
>
On Mon, Jun 12, 2023 at 03:45:12AM -0500, Eric W. Biederman wrote:
> Linus Torvalds <[email protected]> writes:
>
> > On Sun, Jun 11, 2023 at 10:49 PM Dave Chinner <[email protected]> wrote:
> >>
> >> On Sun, Jun 11, 2023 at 10:34:29PM -0700, Linus Torvalds wrote:
> >> >
> >> > So that "!=" should obviously have been a "==".
> >>
> >> Same as without the condition - all the fsstress tasks hang in
> >> do_coredump().
> >
> > Ok, that at least makes sense. Your "it made things worse" made me go
> > "What?" until I noticed the stupid backwards test.
> >
> > I'm not seeing anything else that looks odd in that commit
> > f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps
> > regression").
> >
> > Let's see if somebody else goes "Ahh" when they wake up tomorrow...
....
> About the only thing I can imagine is if io_uring is involved. Some of
> the PF_IO_WORKER code was changed, and the test
> "((t->flags & (PF_USER_WORKER | PF_IO_WORKER)) != PF_USER_WORKER)"
> was sprinkled around.
>
> That is the only code outside of vhost specific code that was changed.
>
>
> Is io_uring involved in the cases that hang?
Yes. fsstress randomly uses io_uring for the ops that it runs.
-Dave.
--
Dave Chinner
[email protected]
On Sun, Jun 11, 2023 at 08:14:25PM -0700, Linus Torvalds wrote:
> On Sun, Jun 11, 2023 at 7:22 PM Dave Chinner <[email protected]> wrote:
> >
> > I guess the regression fix needs a regression fix....
>
> Yup.
>
> From the description of the problem, it sounds like this happens on
> real hardware, no vhost anywhere?
>
> Or maybe Darrick (who doesn't see the issue) is running on raw
> hardware, and you and Zorro are running in a virtual environment?
Ahah, it turns out that liburing-dev isn't installed on the test fleet,
so fstests didn't get built with io_uring support. That probably
explains why I don't see any of these hangs.
Oh. I can't *install* the debian liburing-dev package because it has
a versioned dependency on linux-libc-dev >= 5.1, which isn't compatible
with me having a linux-libc-dev-djwong package that contains the uapi
headers for the latest upstream kernel and Replaces: linux-libc-dev.
So either I have to create a dummy linux-libc-dev with adequate version
number that pulls in my own libc header package, or rename that package.
<sigh> It's going to take me a while to research how best to split this
stupid knot.
--D
> It sounds like zap_other_threads() and coredump_task_exit() do not
> agree about the core_state->nr_threads counting, which is part of what
> changed there.
>
> [ Goes off to look ]
>
> Hmm. Both seem to be using the same test for
>
> (t->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER
>
> which I don't love - I don't think io_uring threads should participate
> in core dumping either, so I think the test could just be
>
> (t->flags & PF_IO_WORKER)
>
> but that shouldn't be the issue here.
>
> But according to
>
> https://lore.kernel.org/all/20230611124836.whfktwaumnefm5z5@zlang-mailbox/
>
> it's clearly hanging in wait_for_completion_state() in
> coredump_wait(), so it really looks like some confusion about that
> core_waiters (aka core_state->nr_threads) count.
>
> Oh. Humm. Mike changed that initial rough patch of mine, and I had
> moved the "if you don't participate in c ore dumps" test up also past
> the "do_coredump()" logic.
>
> And I think it's horribly *wrong* for a thread that doesn't get
> counted for core-dumping to go into do_coredump(), because then it
> will set the "core_state" to possibly be the core-state of the vhost
> thread that isn't even counted.
>
> So *maybe* this attached patch might fix it? I haven't thought very
> deeply about this, but vhost workers most definitely shouldn't call
> do_coredump(), since they are then not counted.
>
> (And again, I think we should just check that PF_IO_WORKER bit, not
> use this more complex test, but that's a separate and bigger change).
>
> Linus
> kernel/signal.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 2547fa73bde5..a1e11ee8537c 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2847,6 +2847,10 @@ bool get_signal(struct ksignal *ksig)
> */
> current->flags |= PF_SIGNALED;
>
> + /* vhost workers don't participate in core dups */
> + if ((current->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER)
> + goto out;
> +
> if (sig_kernel_coredump(signr)) {
> if (print_fatal_signals)
> print_fatal_signal(ksig->info.si_signo);
On Mon, Jun 12, 2023 at 8:36 AM Darrick J. Wong <[email protected]> wrote:
>
> > Or maybe Darrick (who doesn't see the issue) is running on raw
> > hardware, and you and Zorro are running in a virtual environment?
>
> Ahah, it turns out that liburing-dev isn't installed on the test fleet,
> so fstests didn't get built with io_uring support. That probably
> explains why I don't see any of these hangs.
>
> Oh. I can't *install* the debian liburing-dev package because it has
> a versioned dependency on linux-libc-dev >= 5.1, which isn't compatible
> with me having a linux-libc-dev-djwong package that contains the uapi
> headers for the latest upstream kernel and Replaces: linux-libc-dev.
> So either I have to create a dummy linux-libc-dev with adequate version
> number that pulls in my own libc header package, or rename that package.
>
> <sigh> It's going to take me a while to research how best to split this
> stupid knot.
Oh, no, that's great. It explains why you don't see the problem, and
Dave and Zorro do. Perfect.
No need for you to install any liburing packages, at least for this
issue. You'll probably want it eventually just for test coverage, but
for now it's the smoking gun we wanted - I was looking at why vhost
would be impacted, because that commit so intentionally *tried* to not
do anything at all to io_uring.
But it obviously failed. Which then in turn explains the bug.
Not that I see exactly where it went wrong yet, but at least we're
looking at the right thing. Adding Jens to the participants, in case
he sees what goes wrong.
Jens, commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix
freezer/ps regression") seems to have broken core dumping with
io_uring threads, even though it tried very hard not to. See
https://lore.kernel.org/all/20230611124836.whfktwaumnefm5z5@zlang-mailbox/
for the beginning of this thread.
Honestly, that "try to not change io_uring" was my least favorite part
of that patch, because I really think we want to try to aim for these
user helper threads having as much infrastructure in common as
possible. And when it comes to core dumps, I do not believe that
waiting for the io_uring thread adds anything to the end result,
because the only reason we wait for it is to put the thread
register state into the core dump, and for kernel helper threads, that
information just isn't useful. It's going to be the state that caused
the thread to be created, not anything that is worth saving in a core
dump for.
So I'd actually prefer to just simplify the logic entirely, and say
"PF_USER_WORKER tasks do not participate in core dumps, end of story".
io_uring didn't _care_, so including them wasn't a pain, but if the
vhost exit case can be delayed, I'd rather just say "let's do this
thing for both io_uring and vhost, and not split those two cases up".
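Concretely, I'd expect the counting test on both sides to collapse to
something like this (a sketch, not a tested patch):

	/* user helper threads - io_uring and vhost alike - never
	 * participate in the coredump rendezvous */
	if (!(t->flags & PF_USER_WORKER))
		nr++;

with a matching "is PF_USER_WORKER set" early-out in
coredump_task_exit().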
Anyway, I don't see exactly what goes wrong, but I feel better just
from this having been narrowed down to io_uring threads. I suspect
Jens actually might even have a core-dumping test-case somewhere,
since core dumping was a thing that io_uring ended up having some
issues with at one point.
Linus
Can someone who can reproduce the hang run the test patch below?
I am currently drawing a blank looking at the changes, so I am
proposing some debug code to help us narrow things down.
The tests reproducing this don't appear to use /dev/vhost-net or
/dev/vhost-vsock, so if the WARN_ONs trigger it is a good sign
that the code connected to the WARN_ONs is wrong.
If the WARN_ONs don't trigger, I suspect the code in kernel/fork.c.
But as I said, staring at the code I don't see anything wrong.
Eric
diff --git a/fs/coredump.c b/fs/coredump.c
index 88740c51b942..e9acf0a2d2f0 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -374,6 +374,7 @@ static int zap_process(struct task_struct *start, int exit_code)
 			/* The vhost_worker does not particpate in coredumps */
 			if ((t->flags & (PF_USER_WORKER | PF_IO_WORKER)) != PF_USER_WORKER)
 				nr++;
+			else WARN_ON_ONCE(true);
 		}
 	}
 
diff --git a/kernel/exit.c b/kernel/exit.c
index edb50b4c9972..56002a58ec33 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -437,6 +437,7 @@ static void coredump_task_exit(struct task_struct *tsk)
 		}
 		__set_current_state(TASK_RUNNING);
 	}
+	else if (core_state) WARN_ON_ONCE(true);
 }
 
 #ifdef CONFIG_MEMCG
diff --git a/kernel/signal.c b/kernel/signal.c
index 2547fa73bde5..1be27dbbce62 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1371,6 +1371,7 @@ int zap_other_threads(struct task_struct *p)
 		/* Don't require de_thread to wait for the vhost_worker */
 		if ((t->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER)
 			count++;
+		else WARN_ON_ONCE(true);
 
 		/* Don't bother with already dead threads */
 		if (t->exit_state)
On 6/12/23 9:56 AM, Linus Torvalds wrote:
> On Mon, Jun 12, 2023 at 8:36 AM Darrick J. Wong <[email protected]> wrote:
>>
>>> Or maybe Darrick (who doesn't see the issue) is running on raw
>>> hardware, and you and Zorro are running in a virtual environment?
>>
>> Ahah, it turns out that liburing-dev isn't installed on the test fleet,
>> so fstests didn't get built with io_uring support. That probably
>> explains why I don't see any of these hangs.
>>
>> Oh. I can't *install* the debian liburing-dev package because it has
>> a versioned dependency on linux-libc-dev >= 5.1, which isn't compatible
>> with me having a linux-libc-dev-djwong package that contains the uapi
>> headers for the latest upstream kernel and Replaces: linux-libc-dev.
>> So either I have to create a dummy linux-libc-dev with adequate version
>> number that pulls in my own libc header package, or rename that package.
>>
>> <sigh> It's going to take me a while to research how best to split this
>> stupid knot.
>
> Oh, no, that's great. It explains why you don't see the problem, and
> Dave and Zorro do. Perfect.
>
> No need for you to install any liburing packages, at least for this
> issue. You'll probably want it eventually just for test coverage, but
> for now it's the smoking gun we wanted - I was looking at why vhost
> would be impacted, because that commit so intentionally *tried* to not
> do anything at all to io_uring.
>
> But it obviously failed. Which then in turn explains the bug.
>
> Not that I see exactly where it went wrong yet, but at least we're
> looking at the right thing. Adding Jens to the participants, in case
> he sees what goes wrong.
>
> Jens, commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix
> freezer/ps regression") seems to have broken core dumping with
> io_uring threads, even though it tried very hard not to. See
>
> https://lore.kernel.org/all/20230611124836.whfktwaumnefm5z5@zlang-mailbox/
>
> for the beginning of this thread.
>
> Honestly, that "try to not change io_uring" was my least favorite part
> of that patch, because I really think we want to try to aim for these
> user helper threads having as much infrastructure in common as
> possible. And when it comes to core dumps, I do not believe that
> waiting for the io_uring thread adds anything to the end result,
> because the only reason we wait for it is to put the thread
> register state into the core dump, and for kernel helper threads, that
> information just isn't useful. It's going to be the state that caused
> the thread to be created, not anything that is worth saving in a core
> dump for.
>
> So I'd actually prefer to just simplify the logic entirely, and say
> "PF_USER_WORKER tasks do not participate in core dumps, end of story".
> io_uring didn't _care_, so including them wasn't a pain, but if the
> vhost exit case can be delayed, I'd rather just say "let's do this
> thing for both io_uring and vhost, and not split those two cases up".
>
> Anyway, I don't see exactly what goes wrong, but I feel better just
> from this having been narrowed down to io_uring threads. I suspect
> Jens actually might even have a core-dumping test-case somewhere,
> since core dumping was a thing that io_uring ended up having some
> issues with at one point.
I'll take a look - at the exact same time as your email, someone just
reported this issue separately on the liburing GH page as well. Tried
myself, and yup, anything that ends up spawning an io-wq worker and then
core dumps will now get stuck:
[ 136.271250] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 136.271711] task:ih state:D stack:0 pid:736 ppid:727 flags:0x00000004
[ 136.272218] Call trace:
[ 136.272353] __switch_to+0xb0/0xc8
[ 136.272555] __schedule+0x528/0x584
[ 136.272757] schedule+0x4c/0x90
[ 136.272936] schedule_timeout+0x30/0xdc
[ 136.273179] __wait_for_common+0x8c/0x118
[ 136.273407] wait_for_completion_state+0x1c/0x30
[ 136.273686] do_coredump+0x334/0x1000
[ 136.273898] get_signal+0x19c/0x5d8
[ 136.274108] do_notify_resume+0x10c/0xa0c
[ 136.274346] el0_da+0x50/0x5c
[ 136.274555] el0t_64_sync_handler+0xb8/0x134
[ 136.274812] el0t_64_sync+0x168/0x16c
Not good... I don't immediately see what the issue is, but I'll poke
shortly after a few meetings.
--
Jens Axboe
On 6/12/23 10:27 AM, Jens Axboe wrote:
> On 6/12/23 9:56 AM, Linus Torvalds wrote:
>> On Mon, Jun 12, 2023 at 8:36 AM Darrick J. Wong <[email protected]> wrote:
>>>
>>>> Or maybe Darrick (who doesn't see the issue) is running on raw
>>>> hardware, and you and Zorro are running in a virtual environment?
>>>
>>> Ahah, it turns out that liburing-dev isn't installed on the test fleet,
>>> so fstests didn't get built with io_uring support. That probably
>>> explains why I don't see any of these hangs.
>>>
>>> Oh. I can't *install* the debian liburing-dev package because it has
>>> a versioned dependency on linux-libc-dev >= 5.1, which isn't compatible
>>> with me having a linux-libc-dev-djwong package that contains the uapi
>>> headers for the latest upstream kernel and Replaces: linux-libc-dev.
>>> So either I have to create a dummy linux-libc-dev with adequate version
>>> number that pulls in my own libc header package, or rename that package.
>>>
>>> <sigh> It's going to take me a while to research how best to split this
>>> stupid knot.
>>
>> Oh, no, that's great. It explains why you don't see the problem, and
>> Dave and Zorro do. Perfect.
>>
>> No need for you to install any liburing packages, at least for this
>> issue. You'll probably want it eventually just for test coverage, but
>> for now it's the smoking gun we wanted - I was looking at why vhost
>> would be impacted, because that commit so intentionally *tried* to not
>> do anything at all to io_uring.
>>
>> But it obviously failed. Which then in turn explains the bug.
>>
>> Not that I see exactly where it went wrong yet, but at least we're
>> looking at the right thing. Adding Jens to the participants, in case
>> he sees what goes wrong.
>>
>> Jens, commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix
>> freezer/ps regression") seems to have broken core dumping with
>> io_uring threads, even though it tried very hard not to. See
>>
>> https://lore.kernel.org/all/20230611124836.whfktwaumnefm5z5@zlang-mailbox/
>>
>> for the beginning of this thread.
>>
>> Honestly, that "try to not change io_uring" was my least favorite part
>> of that patch, because I really think we want to try to aim for these
>> user helper threads having as much infrastructure in common as
>> possible. And when it comes to core dumps, I do not believe that
>> waiting for the io_uring thread adds anything to the end result,
>> because the only reason we wait for it is to put the thread
>> register state into the core dump, and for kernel helper threads, that
>> information just isn't useful. It's going to be the state that caused
>> the thread to be created, not anything that is worth saving in a core
>> dump for.
>>
>> So I'd actually prefer to just simplify the logic entirely, and say
>> "PF_USER_WORKER tasks do not participate in core dumps, end of story".
>> io_uring didn't _care_, so including them wasn't a pain, but if the
>> vhost exit case can be delayed, I'd rather just say "let's do this
>> thing for both io_uring and vhost, and not split those two cases up".
>>
>> Anyway, I don't see exactly what goes wrong, but I feel better just
>> from this having been narrowed down to io_uring threads. I suspect
>> Jens actually might even have a core-dumping test-case somewhere,
>> since core dumping was a thing that io_uring ended up having some
>> issues with at one point.
>
> I'll take a look - at the exact same time as your email, someone just
> reported this issue separately on the liburing GH page as well. Tried
> myself, and yup, anything that ends up spawning an io-wq worker and then
> core dumps will now get stuck:
>
> [ 136.271250] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 136.271711] task:ih state:D stack:0 pid:736 ppid:727 flags:0x00000004
> [ 136.272218] Call trace:
> [ 136.272353] __switch_to+0xb0/0xc8
> [ 136.272555] __schedule+0x528/0x584
> [ 136.272757] schedule+0x4c/0x90
> [ 136.272936] schedule_timeout+0x30/0xdc
> [ 136.273179] __wait_for_common+0x8c/0x118
> [ 136.273407] wait_for_completion_state+0x1c/0x30
> [ 136.273686] do_coredump+0x334/0x1000
> [ 136.273898] get_signal+0x19c/0x5d8
> [ 136.274108] do_notify_resume+0x10c/0xa0c
> [ 136.274346] el0_da+0x50/0x5c
> [ 136.274555] el0t_64_sync_handler+0xb8/0x134
> [ 136.274812] el0t_64_sync+0x168/0x16c
>
> Not good... I don't immediately see what the issue is, but I'll poke
> shortly after a few meetings.
Quick peek would suggest that it's because io-wq clears PF_IO_WORKER on
exit, and now we fail the check in coredump_task_exit() that was added.
From my quick recollection, this is to avoid hitting the schedule out
callback on exit. But I could be totally wrong... In any case, I'd be
surprised if this isn't why it got broken by Mike's patch.
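To spell out the mismatch (a sketch, relying on io-wq workers being
created with both PF_USER_WORKER and PF_IO_WORKER set):

	/* zap_process(), at signal-delivery time: both flags still set,
	 * so the io-wq worker is counted into core_state->nr_threads */
	(t->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER	/* true */

	/* coredump_task_exit(), after io_worker_exit() has cleared
	 * PF_IO_WORKER: the worker now looks like a vhost task and
	 * skips the decrement */
	(t->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER	/* false */

So the count never reaches zero and coredump_wait() hangs.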
--
Jens Axboe
On Mon, Jun 12, 2023 at 9:38 AM Jens Axboe <[email protected]> wrote:
>
> Quick peek would suggest that it's because io-wq clears PF_IO_WORKER on
> exit, and now we fail the check in coredump_task_exit() that was added.
Oh, that makes sense.
Well, it makes sense for the bug, but that whole
	preempt_disable();
	current->flags &= ~PF_IO_WORKER;
	preempt_enable();
thing in io_worker_exit() does *not* make sense to me.
Does removing those three lines make things "JustWork(tm)"?
Linus
On 6/12/23 10:42 AM, Linus Torvalds wrote:
> On Mon, Jun 12, 2023 at 9:38 AM Jens Axboe <[email protected]> wrote:
>>
>> Quick peek would suggest that it's because io-wq clears PF_IO_WORKER on
>> exit, and now we fail the check in coredump_task_exit() that was added.
>
> Oh, that makes sense.
>
> Well, it makes sense for the bug, but that whole
>
> preempt_disable();
> current->flags &= ~PF_IO_WORKER;
> preempt_enable();
>
> thing in io_worker_exit() does *not* make sense to me.
>
> Does removing those three lines make things "JustWork(tm)"?
You snipped the suspicion in my reply on why that exists, to avoid
io_wq_worker_sleeping() triggering. But arguably this should be covered
by the RUNNING flag. I'll poke at it shortly, and yeah then we should
just remove the PF_IO_WORKER clearing.
Or maybe I'm just smoking crack and it's no longer needed at all. Will
test and send a patch.
--
Jens Axboe
On Mon, Jun 12, 2023 at 9:42 AM Linus Torvalds
<[email protected]> wrote:
>
> Well, it makes sense for the bug, but that whole
>
> preempt_disable();
> current->flags &= ~PF_IO_WORKER;
> preempt_enable();
>
> thing in io_worker_exit() does *not* make sense to me.
Oh, it looks like that goes back to the original io_worker
implementation, when io_worker_start() would set the PF_IO_WORKER bit,
and io_worker_exit() would clear it, and they were otherwise created
like normal kthreads.
It *looks* all entirely historical to me, and removing those three
lines as left-over baggage smells like the right thing.
Linus
Linus Torvalds <[email protected]> writes:
> On Mon, Jun 12, 2023 at 9:45 AM Jens Axboe <[email protected]> wrote:
>>
>> You snipped the suspicion in my reply on why that exists, to avoid
>> io_wq_worker_sleeping() triggering.
>
> I'm not seeing why triggering io_wq_worker_sleeping() should even be a
> problem in the first place.
>
> I suspect that is entirely historical too, and has to do with how it
> used to do that
>
> struct io_worker *worker = kthread_data(tsk);
> struct io_wqe *wqe = worker->wqe;
>
> back in the bad old days of kthreads.
>
> But yeah, I don't know that code.
If it is a problem it looks like the thread shutdown can clear
"worker->flags & IO_WORKER_F_UP" rather than
"current->flags & PF_IO_WORKER".
I don't see how it makes sense for the load balancing logic for
a per-process thread pool to be running at that point.
Eric
On Mon, Jun 12, 2023 at 9:45 AM Jens Axboe <[email protected]> wrote:
>
> You snipped the suspicion in my reply on why that exists, to avoid
> io_wq_worker_sleeping() triggering.
I'm not seeing why triggering io_wq_worker_sleeping() should even be a
problem in the first place.
I suspect that is entirely historical too, and has to do with how it
used to do that
struct io_worker *worker = kthread_data(tsk);
struct io_wqe *wqe = worker->wqe;
back in the bad old days of kthreads.
But yeah, I don't know that code.
Linus
On 6/12/23 10:57 AM, Linus Torvalds wrote:
> On Mon, Jun 12, 2023 at 9:45 AM Jens Axboe <[email protected]> wrote:
>>
>> You snipped the suspicion in my reply on why that exists, to avoid
>> io_wq_worker_sleeping() triggering.
>
> I'm not seeing why triggering io_wq_worker_sleeping() should even be a
> problem in the first place.
>
> I suspect that is entirely historical too, and has to do with how it
> used to do that
>
> struct io_worker *worker = kthread_data(tsk);
> struct io_wqe *wqe = worker->wqe;
>
> back in the bad old days of kthreads.
>
> But yeah, I don't know that code.
Looks fine to me to just kill it indeed, whatever we did need this
for is definitely no longer the case. I _think_ we used to have
something in the worker exit that would potentially sleep which
is why we killed it before doing that, now it just looks like dead
code.
--
Jens Axboe
On 6/12/23 11:11 AM, Eric W. Biederman wrote:
> Linus Torvalds <[email protected]> writes:
>
>> On Mon, Jun 12, 2023 at 9:45 AM Jens Axboe <[email protected]> wrote:
>>>
>>> You snipped the suspicion in my reply on why that exists, to avoid
>>> io_wq_worker_sleeping() triggering.
>>
>> I'm not seeing why triggering io_wq_worker_sleeping() should even be a
>> problem in the first place.
>>
>> I suspect that is entirely historical too, and has to do with how it
>> used to do that
>>
>> struct io_worker *worker = kthread_data(tsk);
>> struct io_wqe *wqe = worker->wqe;
>>
>> back in the bad old days of kthreads.
>>
>> But yeah, I don't know that code.
>
> If it is a problem it looks like the thread shutdown can clear
> "worker->flags & IO_WORKER_F_UP" rather than
> "current->flags & PF_IO_WORKER".
>
> I don't see how it makes sense for the load balancing logic for
> a per-process thread pool to be running at that point.
Yep that was my thinking too, if we did need it, we could fiddle with
the UP flag instead. But as per the previous reply, it should be able to
just get removed at this point.
--
Jens Axboe
On 6/12/23 11:51 AM, Linus Torvalds wrote:
> On Mon, Jun 12, 2023 at 10:29 AM Jens Axboe <[email protected]> wrote:
>>
>> Looks fine to me to just kill it indeed, whatever we did need this
>> for is definitely no longer the case. I _think_ we used to have
>> something in the worker exit that would potentially sleep which
>> is why we killed it before doing that, now it just looks like dead
>> code.
>
> Ok, can you (and the fsstress people) confirm that this
> whitespace-damaged patch fixes the coredump issue:
>
>
> --- a/io_uring/io-wq.c
> +++ b/io_uring/io-wq.c
> @@ -221,9 +221,6 @@ static void io_worker_exit(..
> raw_spin_unlock(&wq->lock);
> io_wq_dec_running(worker);
> worker->flags = 0;
> - preempt_disable();
> - current->flags &= ~PF_IO_WORKER;
> - preempt_enable();
>
> kfree_rcu(worker, rcu);
> io_worker_ref_put(wq);
Yep, it fixes it on my end and it passes my basic tests as well.
> Jens, I think that the two lines above there, ie the whole
>
> io_wq_dec_running(worker);
> worker->flags = 0;
>
> thing may actually be the (partial?) reason for those PF_IO_WORKER
> games. It's basically doing "now I'm doing stats by hand", and I
> wonder if now it decrements the running worker one time too much?
>
> Ie when the finally *dead* worker schedules away, never to return,
> that's when that io_wq_worker_sleeping() case triggers and decrements
> things one more time.
>
> So there might be some bookkeeping reason for those games, but it
> looks like if that's the case, then that
>
> worker->flags = 0;
>
> will have already taken care of it.
>
> I wonder if those two lines could just be removed too. But I think
> that's separate from the "let's fix the core dumping" issue.
Something like that was/is my worry. Let me add some tracing to confirm
it's fine, don't fully trust just my inspection of it. I'll send out the
patch separately once done, and then would be great if you can just pick
it up so it won't have to wait until I need to send a pull later in the
week. Particularly as I have nothing planned for 6.4 unless something
else comes up of course.
--
Jens Axboe
On Mon, Jun 12, 2023 at 10:54 AM Jens Axboe <[email protected]> wrote:
>
> Something like that was/is my worry. Let me add some tracing to confirm
> it's fine, don't fully trust just my inspection of it. I'll send out the
> patch separately once done, and then would be great if you can just pick
> it up so it won't have to wait until I need to send a pull later in the
> week. Particularly as I have nothing planned for 6.4 unless something
> else comes up of course.
Ack, sounds good.
Linus
On Mon, Jun 12, 2023 at 10:29 AM Jens Axboe <[email protected]> wrote:
>
> Looks fine to me to just kill it indeed, whatever we did need this
> for is definitely no longer the case. I _think_ we used to have
> something in the worker exit that would potentially sleep which
> is why we killed it before doing that, now it just looks like dead
> code.
Ok, can you (and the fsstress people) confirm that this
whitespace-damaged patch fixes the coredump issue:
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -221,9 +221,6 @@ static void io_worker_exit(..
raw_spin_unlock(&wq->lock);
io_wq_dec_running(worker);
worker->flags = 0;
- preempt_disable();
- current->flags &= ~PF_IO_WORKER;
- preempt_enable();
kfree_rcu(worker, rcu);
io_worker_ref_put(wq);
Jens, I think that the two lines above there, ie the whole
io_wq_dec_running(worker);
worker->flags = 0;
thing may actually be the (partial?) reason for those PF_IO_WORKER
games. It's basically doing "now I'm doing stats by hand", and I
wonder if now it decrements the running worker one time too much?
Ie when the finally *dead* worker schedules away, never to return,
that's when that io_wq_worker_sleeping() case triggers and decrements
things one more time.
So there might be some bookkeeping reason for those games, but it
looks like if that's the case, then that
worker->flags = 0;
will have already taken care of it.
I wonder if those two lines could just be removed too. But I think
that's separate from the "let's fix the core dumping" issue.
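(That reading assumes the sched-out hook already bails out once the
io-wq flags are cleared, i.e. something roughly like this sketch rather
than keying off PF_IO_WORKER itself:

	void io_wq_worker_sleeping(struct task_struct *tsk)
	{
		struct io_worker *worker = tsk->worker_private;

		if (!worker || !(worker->flags & IO_WORKER_F_UP))
			return;		/* 'worker->flags = 0' short-circuits here */
		io_wq_dec_running(worker);
	}

in which case the extra decrement on the final schedule-away can't
happen anyway.)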
Linus
On Mon, Jun 12, 2023 at 10:51 AM Linus Torvalds
<[email protected]> wrote:
>
> Ok, can you (and the fsstress people) confirm that this
> whitespace-damaged patch fixes the coredump issue:
The proper patch from Jens is now in my tree, and I'm archiving this
thread on the assumption that it's all good.
Dave/Zorro - if you still see any issues with that patch in place, holler.
Linus
On Mon, Jun 12, 2023 at 11:34:28AM -0700, Linus Torvalds wrote:
> On Mon, Jun 12, 2023 at 10:51 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > Ok, can you (and the fsstress people) confirm that this
> > whitespace-damaged patch fixes the coredump issue:
>
> The proper patch from Jens is now in my tree, and I'm archiving this
> thread on the assumption that it's all good.
>
> Dave/Zorro - if you still see any issues with that patch in place, holler.
The fix seems to work. The reproducer just ran through 50 times
without fail; it was previously always failing on the first or
second cycle.
Cheers,
Dave.
--
Dave Chinner
[email protected]