2022-01-15 07:34:06

by Eric W. Biederman

[permalink] [raw]
Subject: [GIT PULL] signal/exit/ptrace changes for v5.17


Linus,

Please pull the signal-for-v5.17 branch from the git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git signal-for-v5.17

HEAD: a403df29789ba38796edb97dad9bfb47836b68c0 ptrace/m68k: Stop open coding ptrace_report_syscall


This set of changes deletes some dead code, makes a lot of cleanups
which hopefully make the code easier to follow, and fixes bugs
found along the way.

The end-game which I have not yet reached yet is for fatal signals that
generate coredumps to be short-circuit deliverable from complete_signal,
for force_siginfo_to_task not to require changing userspace configured
signal delivery state, and for the ptrace stops to always happen in
locations where we can guarantee on all architectures that the all of
the registers are saved and available on the stack.

Removal of profile_task_ext, profile_munmap, and profile_handoff_task
are the big successes for dead code removal this round.

A bunch of small bug fixes are included, as most of the issues reported
were small enough that they would not affect bisection so I simply added
the fixes and did not fold the fixes into the changes they were fixing.

There was a bug that broke coredumps piped to systemd-coredump. I
dropped the change that caused that bug and replaced it entirely with
something much more restrained. Unfortunately that required some
rebasing.

I am currently investigating to figure out if wake_up_interruptible
(instead of simply wake_up) ever makes sense outside of the signal code.

Some successes after this set of changes: There are few enough calls to
do_exit to audit in a reasonable amount of time. The lifetime of struct
kthread now matches the lifetime of struct task, and the pointer to
struct kthread is no longer stored in set_child_tid. The flag
SIGNAL_GROUP_COREDUMP is removed. The field group_exit_task is removed.
Issues where task->exit_code was examined with signal->group_exit_code
should been examined were fixed.

There are several loosely related changes included because I am cleaning
up and if I don't include them they will probably get lost.

The original postings of these changes can be found at:
https://lkml.kernel.org/r/[email protected]
https://lkml.kernel.org/r/[email protected]
https://lkml.kernel.org/r/[email protected]

I trimmed back the last set of changes to only the obviously correct
once. Simply because there was less time for review than I had hoped.

I am sending this later than I would like as there was an issue that
was discovered just before the merge window and there is a big storm
coming through where I live. Linus I hope your travel is going well.

Eric W. Biederman (37):
exit/s390: Remove dead reference to do_exit from copy_thread
exit: Add and use make_task_dead.
exit: Move oops specific logic from do_exit into make_task_dead
exit: Stop poorly open coding do_task_dead in make_task_dead
exit: Stop exporting do_exit
exit: Implement kthread_exit
exit: Rename module_put_and_exit to module_put_and_kthread_exit
exit: Rename complete_and_exit to kthread_complete_and_exit
kthread: Ensure struct kthread is present for all kthreads
exit/kthread: Move the exit code for kernel threads into struct kthread
exit/kthread: Fix the kerneldoc comment for kthread_complete_and_exit
objtool: Add a missing comma to avoid string concatenation
fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA
fork: Rename bad_fork_cleanup_threadgroup_lock to bad_fork_cleanup_delayacct
kthread: Warn about failed allocations for the init kthread
kthread: Never put_user the set_child_tid address
kthread: Generalize pf_io_worker so it can point to struct kthread
exit/xtensa: In arch/xtensa/entry.S:Linvalid_mask call make_task_dead
exit: Guarantee make_task_dead leaks the tsk when calling do_task_exit
exit: Move force_uaccess back into do_exit
signal: Have the oom killer detect coredumps using signal->core_state
signal: Have prepare_signal detect coredumps using signal->core_state
signal: Make coredump handling explicit in complete_signal
signal: During coredumps set SIGNAL_GROUP_EXIT in zap_process
signal: Remove SIGNAL_GROUP_COREDUMP
coredump: Stop setting signal->group_exit_task
signal: Rename group_exit_task group_exec_task
signal: Remove the helper signal_group_exit
exit: Remove profile_task_exit & profile_munmap
exit: Remove profile_handoff_task
exit: Coredumps reach do_group_exit
exit: Fix the exit_code for wait_task_zombie
exit: Use the correct exit_code in /proc/<pid>/stat
taskstats: Cleanup the use of task->exit_code
ptrace: Remove second setting of PT_SEIZED in ptrace_attach
ptrace: Remove unused regs argument from ptrace_report_syscall
ptrace/m68k: Stop open coding ptrace_report_syscall

Nathan Chancellor (3):
hexagon: Fix function name in die()
h8300: Fix build errors from do_exit() to make_task_dead() transition
csky: Fix function name in csky_alignment() and die()

Randy Dunlap (1):
signal: clean up kernel-doc comments

arch/alpha/kernel/traps.c | 6 +-
arch/alpha/mm/fault.c | 2 +-
arch/arm/kernel/traps.c | 2 +-
arch/arm/mm/fault.c | 2 +-
arch/arm64/kernel/traps.c | 2 +-
arch/arm64/mm/fault.c | 2 +-
arch/csky/abiv1/alignment.c | 2 +-
arch/csky/kernel/traps.c | 2 +-
arch/csky/mm/fault.c | 2 +-
arch/h8300/kernel/traps.c | 3 +-
arch/h8300/mm/fault.c | 2 +-
arch/hexagon/kernel/traps.c | 2 +-
arch/ia64/kernel/mca_drv.c | 2 +-
arch/ia64/kernel/traps.c | 2 +-
arch/ia64/mm/fault.c | 2 +-
arch/m68k/kernel/ptrace.c | 12 +---
arch/m68k/kernel/traps.c | 2 +-
arch/m68k/mm/fault.c | 2 +-
arch/microblaze/kernel/exceptions.c | 4 +-
arch/mips/kernel/traps.c | 2 +-
arch/nds32/kernel/fpu.c | 2 +-
arch/nds32/kernel/traps.c | 8 +--
arch/nios2/kernel/traps.c | 4 +-
arch/openrisc/kernel/traps.c | 2 +-
arch/parisc/kernel/traps.c | 2 +-
arch/powerpc/kernel/traps.c | 8 +--
arch/riscv/kernel/traps.c | 2 +-
arch/riscv/mm/fault.c | 2 +-
arch/s390/kernel/dumpstack.c | 2 +-
arch/s390/kernel/nmi.c | 2 +-
arch/s390/kernel/process.c | 1 -
arch/sh/kernel/traps.c | 2 +-
arch/sparc/kernel/traps_32.c | 4 +-
arch/sparc/kernel/traps_64.c | 4 +-
arch/x86/entry/entry_32.S | 6 +-
arch/x86/entry/entry_64.S | 6 +-
arch/x86/kernel/dumpstack.c | 4 +-
arch/xtensa/kernel/entry.S | 2 +-
arch/xtensa/kernel/traps.c | 2 +-
crypto/algboss.c | 4 +-
drivers/net/wireless/rsi/rsi_91x_coex.c | 2 +-
drivers/net/wireless/rsi/rsi_91x_main.c | 2 +-
drivers/net/wireless/rsi/rsi_91x_sdio_ops.c | 2 +-
drivers/net/wireless/rsi/rsi_91x_usb_ops.c | 2 +-
drivers/pnp/pnpbios/core.c | 6 +-
drivers/staging/rts5208/rtsx.c | 16 ++---
drivers/usb/atm/usbatm.c | 2 +-
drivers/usb/gadget/function/f_mass_storage.c | 2 +-
fs/cifs/connect.c | 2 +-
fs/coredump.c | 14 ++--
fs/exec.c | 12 ++--
fs/io-wq.c | 6 +-
fs/io-wq.h | 2 +-
fs/jffs2/background.c | 2 +-
fs/nfs/callback.c | 4 +-
fs/nfs/nfs4state.c | 2 +-
fs/nfsd/nfssvc.c | 2 +-
fs/proc/array.c | 6 +-
include/linux/kernel.h | 1 -
include/linux/kthread.h | 4 +-
include/linux/module.h | 6 +-
include/linux/profile.h | 45 -------------
include/linux/sched.h | 4 +-
include/linux/sched/signal.h | 18 +-----
include/linux/sched/task.h | 1 +
include/linux/tracehook.h | 7 +-
kernel/exit.c | 97 +++++++++++++++-------------
kernel/fork.c | 20 +++---
kernel/futex/core.c | 2 +-
kernel/kexec_core.c | 2 +-
kernel/kthread.c | 88 +++++++++++++++++--------
kernel/module.c | 6 +-
kernel/profile.c | 73 ---------------------
kernel/ptrace.c | 2 -
kernel/sched/core.c | 16 ++---
kernel/signal.c | 19 +++---
kernel/tsacct.c | 7 +-
lib/kunit/try-catch.c | 4 +-
mm/mmap.c | 1 -
mm/oom_kill.c | 2 +-
net/bluetooth/bnep/core.c | 2 +-
net/bluetooth/cmtp/core.c | 2 +-
net/bluetooth/hidp/core.c | 2 +-
tools/objtool/check.c | 8 ++-
84 files changed, 274 insertions(+), 377 deletions(-)

Eric


2022-01-17 12:01:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] signal/exit/ptrace changes for v5.17

On Sat, Jan 15, 2022 at 2:00 AM Eric W. Biederman <[email protected]> wrote:
>
> I am currently investigating to figure out if wake_up_interruptible
> (instead of simply wake_up) ever makes sense outside of the signal code.

It may not be a huge deal, but it *absolutely* makes sense.

Any subsystem that uses "wait_event_interruptible()" (or variations of
that) should by default only use "wake_up_interruptible()" to wake the
wait queue.

The reason? Being (interruptibly) on one wait-queue does *NOT* make it
impossible that the very same process isn't waiting non-interruptibly
on another one.

It's not hugely common, but the Linux kernel wait-queues have very
much been designed for the whole "one process can be on multiple wait
queues for different reasons at the same time" model. That is *very*
core.

People sometimes think that is just a "poll/select()" thing, but
that's not at all true. It's quite valid to do things like

add_wait_queue(..)
for (.. some loop ..) {
set_current_state(TASK_INTERRUPTIBLE);
... do various things, checking state etc ..
schedule();
}
set_current_state(TASK_RUNNABLE);
remove_wait_queue();

and part of that "do various things" is obviously checking for signals
and other "exit this wait loop", but other things can be things like
taking a page fault because you copied data from user space etc.

And that in turn may end up having a nested wait on another waitqueue
(for the IO), and the outer wait queue should basically strive to not
wake up unrelated "deeper" sleeps, because that is pointless and just
causes extra wakeups.

So the page fault event will cause a nested TASK_UNINTERRUPTIBLE
sleep, and when that IO has completed, it goes into TASK_RUNNABLE, so
the outer (interruptible) loop above will have a "dummy schedule()"
and loop around again to be put back into TASK_INTERRUPTIBLE sleep
next time around.

But note how it would be pointless to use a "wake_up()" on this outer
wait queue - it would wake up the deeper IO sleep too, and that would
just see "oh, the IO I'm waiting for hasn't completed, so I'll just
have to go to sleep again".

Would it still _work_? Yes. Is the above _common_? No. But it's a
really fundamnetal pattern in the kernel, and it's fundamentally how
wait queues work, and how they should work, and an interruptible sleep
should generally be seen as pairing with an interruptible wake,
because that's just how things are.

Why would you want to change something basic like that? Yes, using
"wake_up()" instead of "wake_up_interruptible()" would still result in
a working kernel, but it's just _pointless_.

Linus

2022-01-17 12:40:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] signal/exit/ptrace changes for v5.17

On Mon, Jan 17, 2022 at 6:08 AM Linus Torvalds
<[email protected]> wrote:
>
>
> People sometimes think that is just a "poll/select()" thing, but
> that's not at all true. It's quite valid to do things like
>
> add_wait_queue(..)
> for (.. some loop ..) {
> set_current_state(TASK_INTERRUPTIBLE);
> ... do various things, checking state etc ..
> schedule();
> }
> set_current_state(TASK_RUNNABLE);
> remove_wait_queue();

Of course, in most modern cases, the above sequence is actually
encoded as a "wait_event_interruptible()", because we don't generally
want to open-code the whole thing.

But the actual end result still ends up being exactly the same, it's
just syntactically denser and more legible version of the above thing,
and you can still have the "event" you wait on have nested waiting
situations.

The nested waiting is by no means common. The only _common_situation
where you're on multiple wait queues tends to be select/poll kind of
things, when they aren't really nested as much as iterated over, but
conceptually the nested case is still quite important, and it allows
you to do things that a traditional "wait_on()" interface with just
one single wait-queue just doesn't allow for.

Linus

2022-01-17 12:45:46

by pr-tracker-bot

[permalink] [raw]
Subject: Re: [GIT PULL] signal/exit/ptrace changes for v5.17

The pull request you sent on Fri, 14 Jan 2022 17:59:37 -0600:

> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git signal-for-v5.17

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/35ce8ae9ae2e471f92759f9d6880eab42cc1c3b6

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

2022-01-18 02:34:28

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [GIT PULL] signal/exit/ptrace changes for v5.17

Linus Torvalds <[email protected]> writes:

> On Sat, Jan 15, 2022 at 2:00 AM Eric W. Biederman <[email protected]> wrote:
>>
>> I am currently investigating to figure out if wake_up_interruptible
>> (instead of simply wake_up) ever makes sense outside of the signal code.
>
> It may not be a huge deal, but it *absolutely* makes sense.
>
> Any subsystem that uses "wait_event_interruptible()" (or variations of
> that) should by default only use "wake_up_interruptible()" to wake the
> wait queue.
>
> The reason? Being (interruptibly) on one wait-queue does *NOT* make it
> impossible that the very same process isn't waiting non-interruptibly
> on another one.
>
> It's not hugely common, but the Linux kernel wait-queues have very
> much been designed for the whole "one process can be on multiple wait
> queues for different reasons at the same time" model. That is *very*
> core.
>
> People sometimes think that is just a "poll/select()" thing, but
> that's not at all true. It's quite valid to do things like
>
> add_wait_queue(..)
> for (.. some loop ..) {
> set_current_state(TASK_INTERRUPTIBLE);
> ... do various things, checking state etc ..
> schedule();
> }
> set_current_state(TASK_RUNNABLE);
> remove_wait_queue();
>
> and part of that "do various things" is obviously checking for signals
> and other "exit this wait loop", but other things can be things like
> taking a page fault because you copied data from user space etc.
>
> And that in turn may end up having a nested wait on another waitqueue
> (for the IO), and the outer wait queue should basically strive to not
> wake up unrelated "deeper" sleeps, because that is pointless and just
> causes extra wakeups.
>
> So the page fault event will cause a nested TASK_UNINTERRUPTIBLE
> sleep, and when that IO has completed, it goes into TASK_RUNNABLE, so
> the outer (interruptible) loop above will have a "dummy schedule()"
> and loop around again to be put back into TASK_INTERRUPTIBLE sleep
> next time around.
>
> But note how it would be pointless to use a "wake_up()" on this outer
> wait queue - it would wake up the deeper IO sleep too, and that would
> just see "oh, the IO I'm waiting for hasn't completed, so I'll just
> have to go to sleep again".
>
> Would it still _work_? Yes. Is the above _common_? No. But it's a
> really fundamnetal pattern in the kernel, and it's fundamentally how
> wait queues work, and how they should work, and an interruptible sleep
> should generally be seen as pairing with an interruptible wake,
> because that's just how things are.
>
> Why would you want to change something basic like that? Yes, using
> "wake_up()" instead of "wake_up_interruptible()" would still result in
> a working kernel, but it's just _pointless_.

Thank you for the detailed reply. I am going to have to spend a little
bit digesting it.

They why is that I am digging into how to reliably test for and get
just the wakeups needed in the coredump code.

As a first approximation writing to files and causing dump_interrupted
to change for signal_pending to fatal_sending_pending worked like a
charm. Unfortunately the pipe code is still performing interruptible
waits and io_uring causes truncated coredumps.

I would like to have a version of pipe_write that sleeps in
TASK_KILLABLE. I want the I/O wake-ups and I want the SIGKILL wake ups
but I don't want any other wake-ups. Unfortunately the I/O wake-ups in
the pipe code are sent with wake_up_interruptible. So a task sleeping
in TASK_KILLABLE won't get them.

Which means that the obvious solution of changing
wait_event_interruptible to wake_event_killable breaks coredump support
(as I found out the hard way).

I understand things well enough that I can change the signal code and
not make the coredump code worse. I am still trying to figure out what
a clean maintainable solution for coredumps writing to pipes is going to
be. So I will dig in and read the code that has the nested wait queues
and see if I can understand that logic, and keep thinking about the
coredump support.

Eric

2022-01-18 02:35:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] signal/exit/ptrace changes for v5.17

On Mon, Jan 17, 2022 at 5:32 PM Eric W. Biederman <[email protected]> wrote:
>
> I would like to have a version of pipe_write that sleeps in
> TASK_KILLABLE.

That would actually be horrible for another reason - now it would
count towards the load average. That's another difference between
interruptible waits and non-interruptible ones.

Admittedly it's an entirely arbitrary one, but it's part of the whole
semantic difference between TASK_INTERRUPTIBLE and
TASK_UNINTERRUPTIBLE.

You can play with TASK_NOLOAD of course, so it's something that can be
worked around, but it gets a bit ugly.

> I want the I/O wake-ups and I want the SIGKILL wake ups
> but I don't want any other wake-ups. Unfortunately the I/O wake-ups in
> the pipe code are sent with wake_up_interruptible. So a task sleeping
> in TASK_KILLABLE won't get them.

Yeah. The code *could* use the non-interruptible 'wake_up()', and
everything should work - because waking things up too much doesn't
change semantics, it's just a slight pessimization. Plus the whole
"nested waitqueues" isn't actually any remotely normal case, so it
doesn't really matter for performance either.

But I really think it's wrong.

You're trying to work around a problem the wrong way around. If a task
is dead, and is dumping core, then signals just shouldn't matter in
the first place, and thus the whole "TASK_INTERRUPTIBLE vs
TASK_UNINTERRUPTIBLE" really shouldn't be an issue. The fact that it
is an issue means there's something wrong in signaling, not in the
pipe code.

So I really think that's where the fix should be - on the signal delivery side.

Linus

2022-01-18 02:36:23

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [GIT PULL] signal/exit/ptrace changes for v5.17

Linus Torvalds <[email protected]> writes:

> On Mon, Jan 17, 2022 at 6:08 AM Linus Torvalds
> <[email protected]> wrote:
>>
>>
>> People sometimes think that is just a "poll/select()" thing, but
>> that's not at all true. It's quite valid to do things like
>>
>> add_wait_queue(..)
>> for (.. some loop ..) {
>> set_current_state(TASK_INTERRUPTIBLE);
>> ... do various things, checking state etc ..
>> schedule();
>> }
>> set_current_state(TASK_RUNNABLE);
>> remove_wait_queue();
>
> Of course, in most modern cases, the above sequence is actually
> encoded as a "wait_event_interruptible()", because we don't generally
> want to open-code the whole thing.

Yes.

What I was looking at that inspired the question is that
"wake_up" ultimately expands to "try_to_wake_up(task, TASK_NORMAL, 0)".

Whereas "wake_up_interruptible" expands to
"try_to_wake_up(task, TASK_INTERRUPTIBLE, 0)".

With the practical challenge that if I want to change
wait_event_interruptible to wait_event_killable I need to change all of
the wakers.

> But the actual end result still ends up being exactly the same, it's
> just syntactically denser and more legible version of the above thing,
> and you can still have the "event" you wait on have nested waiting
> situations.
>
> The nested waiting is by no means common. The only _common_situation
> where you're on multiple wait queues tends to be select/poll kind of
> things, when they aren't really nested as much as iterated over, but
> conceptually the nested case is still quite important, and it allows
> you to do things that a traditional "wait_on()" interface with just
> one single wait-queue just doesn't allow for.

I think it may just be the part of the kernel where I usually work.
Changing wait_event_interruptible to wait_event_killable has always just
worked for me, but it doesn't in the pipe code.

It doesn't because of wake_up_interruptible.

I do know that short-term-disk-sleep aka task_uninterruptible is special
to performing things like disk I/O, and really short term things.

It might just be the names but I look at wake_up_interruptible and my
klaxon's go off in my head saying something doesn't make sense. So I
will read up and look at those nested wait-queue scenarios and see if I
can find the piece of understanding I am missing.

Eric

2022-01-18 02:39:13

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [GIT PULL] signal/exit/ptrace changes for v5.17

Linus Torvalds <[email protected]> writes:

> On Mon, Jan 17, 2022 at 5:32 PM Eric W. Biederman <[email protected]> wrote:
>>
>> I would like to have a version of pipe_write that sleeps in
>> TASK_KILLABLE.
>
> That would actually be horrible for another reason - now it would
> count towards the load average. That's another difference between
> interruptible waits and non-interruptible ones.
>
> Admittedly it's an entirely arbitrary one, but it's part of the whole
> semantic difference between TASK_INTERRUPTIBLE and
> TASK_UNINTERRUPTIBLE.
>
> You can play with TASK_NOLOAD of course, so it's something that can be
> worked around, but it gets a bit ugly.

Yes. I don't want to make a change that changes the load average.

>> I want the I/O wake-ups and I want the SIGKILL wake ups
>> but I don't want any other wake-ups. Unfortunately the I/O wake-ups in
>> the pipe code are sent with wake_up_interruptible. So a task sleeping
>> in TASK_KILLABLE won't get them.
>
> Yeah. The code *could* use the non-interruptible 'wake_up()', and
> everything should work - because waking things up too much doesn't
> change semantics, it's just a slight pessimization. Plus the whole
> "nested waitqueues" isn't actually any remotely normal case, so it
> doesn't really matter for performance either.
>
> But I really think it's wrong.
>
> You're trying to work around a problem the wrong way around. If a task
> is dead, and is dumping core, then signals just shouldn't matter in
> the first place, and thus the whole "TASK_INTERRUPTIBLE vs
> TASK_UNINTERRUPTIBLE" really shouldn't be an issue. The fact that it
> is an issue means there's something wrong in signaling, not in the
> pipe code.
>
> So I really think that's where the fix should be - on the signal delivery side.

The actual signaling is shutdown, (except for the special case of
SIGKILL being able to terminate the coredump). It is io_uring and
anything else that is not a signal that causes signal_pending() to
return true.

I have not found any solution I am happy with yet, I am just
brainstorming.

Part of the problem is that I really don't want to perform process
shutdown and remove evidence of why the process crashed. So maybe
shutting down io_uring is fine in that case but I don't like that either.

The more I look at all of the interesting corner cases the more I wonder
if the solution isn't to have the coredump code fork a kernel-only
userspace process (like the io_uring threads are kernel-only userspace
threads). That would at least allow kernel functionality to work like
normal and greatly reduce the chance of weird feature interactions.

Hmm. A special kernel-only thread might even be enough as io_uring would
not be directing task_work at it.

You have been me some good information and I think I just need to sleep
on this problem a bit more to come up with a non-hacky solution.

Eric