2021-12-08 20:18:13

by Eric W. Biederman

Subject: [PATCH 00/10] Removal of most do_exit calls


We have a lot of calls to do_exit that really don't want the semantics
of userspace calling pthread_exit, aka exit(2). Instead the interesting
semantics are those of the current task exiting.

This set of changes removes a dead reference to do_exit on s390,
adds a function make_task_dead and changes all of the oops
implementations to use it, and adds function kthread_exit and
changes all of the kthread exits to use it.

The short term win of this set of changes is being able to move many
sanity checks out of do_exit that are only really interesting during an
oops, making it easier to see what do_exit is actually doing.

After this set of changes there is only about a big screenful of
do_exit calls left, making future changes much easier to review.

s390 folks, can you please verify I read the s390 code correctly when
observing the reference to do_exit really is dead? I would really
appreciate it. I am not very familiar with s390.

This is on top of:
https://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git/ signal-for-v5.17

My plan is to apply these changes to my signal-for-v5.17 branch once
they have been reviewed. After that I can get to cleaning up where
signals, coredumps and the exit code meet.

Eric W. Biederman (10):
exit/s390: Remove dead reference to do_exit from copy_thread
exit: Add and use make_task_dead.
exit: Move oops specific logic from do_exit into make_task_dead
exit: Stop poorly open coding do_task_dead in make_task_dead
exit: Stop exporting do_exit
exit: Implement kthread_exit
exit: Rename module_put_and_exit to module_put_and_kthread_exit
exit: Rename complete_and_exit to kthread_complete_and_exit
kthread: Ensure struct kthread is present for all kthreads
exit/kthread: Move the exit code for kernel threads into struct kthread

arch/alpha/kernel/traps.c | 6 +-
arch/alpha/mm/fault.c | 2 +-
arch/arm/kernel/traps.c | 2 +-
arch/arm/mm/fault.c | 2 +-
arch/arm64/kernel/traps.c | 2 +-
arch/arm64/mm/fault.c | 2 +-
arch/csky/abiv1/alignment.c | 2 +-
arch/csky/kernel/traps.c | 2 +-
arch/csky/mm/fault.c | 2 +-
arch/h8300/kernel/traps.c | 2 +-
arch/h8300/mm/fault.c | 2 +-
arch/hexagon/kernel/traps.c | 2 +-
arch/ia64/kernel/mca_drv.c | 2 +-
arch/ia64/kernel/traps.c | 2 +-
arch/ia64/mm/fault.c | 2 +-
arch/m68k/kernel/traps.c | 2 +-
arch/m68k/mm/fault.c | 2 +-
arch/microblaze/kernel/exceptions.c | 4 +-
arch/mips/kernel/traps.c | 2 +-
arch/nds32/kernel/fpu.c | 2 +-
arch/nds32/kernel/traps.c | 8 +--
arch/nios2/kernel/traps.c | 4 +-
arch/openrisc/kernel/traps.c | 2 +-
arch/parisc/kernel/traps.c | 2 +-
arch/powerpc/kernel/traps.c | 8 +--
arch/riscv/kernel/traps.c | 2 +-
arch/riscv/mm/fault.c | 2 +-
arch/s390/kernel/dumpstack.c | 2 +-
arch/s390/kernel/nmi.c | 2 +-
arch/s390/kernel/process.c | 1 -
arch/sh/kernel/traps.c | 2 +-
arch/sparc/kernel/traps_32.c | 4 +-
arch/sparc/kernel/traps_64.c | 4 +-
arch/x86/entry/entry_32.S | 6 +-
arch/x86/entry/entry_64.S | 6 +-
arch/x86/kernel/dumpstack.c | 4 +-
arch/xtensa/kernel/traps.c | 2 +-
crypto/algboss.c | 4 +-
drivers/net/wireless/rsi/rsi_91x_coex.c | 2 +-
drivers/net/wireless/rsi/rsi_91x_main.c | 2 +-
drivers/net/wireless/rsi/rsi_91x_sdio_ops.c | 2 +-
drivers/net/wireless/rsi/rsi_91x_usb_ops.c | 2 +-
drivers/pnp/pnpbios/core.c | 6 +-
drivers/staging/rts5208/rtsx.c | 16 ++---
drivers/usb/atm/usbatm.c | 2 +-
drivers/usb/gadget/function/f_mass_storage.c | 2 +-
fs/cifs/connect.c | 2 +-
fs/exec.c | 2 +
fs/jffs2/background.c | 2 +-
fs/nfs/callback.c | 4 +-
fs/nfs/nfs4state.c | 2 +-
fs/nfsd/nfssvc.c | 2 +-
include/linux/kernel.h | 1 -
include/linux/kthread.h | 4 +-
include/linux/module.h | 6 +-
include/linux/sched/task.h | 1 +
kernel/exit.c | 88 ++++++++++++++--------------
kernel/fork.c | 4 ++
kernel/futex/core.c | 2 +-
kernel/kexec_core.c | 2 +-
kernel/kthread.c | 78 +++++++++++++++++-------
kernel/module.c | 6 +-
kernel/sched/core.c | 16 ++---
lib/kunit/try-catch.c | 4 +-
net/bluetooth/bnep/core.c | 2 +-
net/bluetooth/cmtp/core.c | 2 +-
net/bluetooth/hidp/core.c | 2 +-
tools/objtool/check.c | 8 ++-
68 files changed, 212 insertions(+), 173 deletions(-)

Eric


2021-12-08 20:26:17

by Eric W. Biederman

Subject: [PATCH 01/10] exit/s390: Remove dead reference to do_exit from copy_thread

My s390 assembly is not particularly good, so I have read the history
of the reference to do_exit in copy_thread and have been able to
verify that do_exit is not used.

The general argument is that s390 has been changed to use the generic
kernel_thread and kernel_execve and the generic versions do not call
do_exit. So it is strange to see a do_exit reference sitting there.

The history of the do_exit reference in s390's version of copy_thread
seems conclusive that the do_exit reference is something that lingers
and should have been removed several years ago.

Up through 8d19f15a60be ("[PATCH] s390 update (1/27): arch.") the
s390 code made a call to the exit(2) system call when a kernel thread
finished. Then kernel_thread_starter was added which branched
directly to the value in register 11 when the kernel thread finished.
The value in register 11 was set in kernel_thread to
"regs.gprs[11] = (unsigned long) do_exit"

In commit 37fe5d41f640 ("s390: fold kernel_thread_helper() into
ret_from_fork()") kernel_thread_starter was moved into entry.S and
entry64.S unchanged (except for the syntax differences between inline
assembly and a standalone assembly file).

In commit f9a7e025dfc2 ("s390: switch to generic kernel_thread()") the
assignment to "gprs[11]" was moved into copy_thread from the old
kernel_thread. The helper kernel_thread_starter was still being used
and was still branching to "%r11" at the end.

In commit 30dcb0996e40 ("s390: switch to saner kernel_execve()
semantics") kernel_thread_starter was changed to unconditionally
branch to sysc_tracenogo instead of to %r11, which held the value of
do_exit. Unfortunately copy_thread was not updated to stop passing
do_exit in "gprs[11]".

In commit 56e62a737028 ("s390: convert to generic entry")
kernel_thread_starter was replaced by __ret_from_fork. And the code
still continued to pass do_exit in "gprs[11]" despite __ret_from_fork
not caring in the slightest.

Remove this dead reference to do_exit to make it clear that s390 is
not doing anything with do_exit in copy_thread.

Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Al Viro <[email protected]>
Fixes: 30dcb0996e40 ("s390: switch to saner kernel_execve() semantics")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
arch/s390/kernel/process.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index e8858b2de24b..71d86f73b02c 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -139,7 +139,6 @@ int copy_thread(unsigned long clone_flags, unsigned long new_stackp,
(unsigned long)__ret_from_fork;
frame->childregs.gprs[9] = new_stackp; /* function */
frame->childregs.gprs[10] = arg;
- frame->childregs.gprs[11] = (unsigned long)do_exit;
frame->childregs.orig_gpr2 = -1;
frame->childregs.last_break = 1;
return 0;
--
2.29.2


2021-12-08 20:26:21

by Eric W. Biederman

Subject: [PATCH 02/10] exit: Add and use make_task_dead.

There are two big uses of do_exit. The first is its designed use, to
be the guts of the exit(2) system call. The second use is to terminate
a task after something catastrophic has happened, like a NULL pointer
dereference in kernel code.

Add a function make_task_dead that is initially exactly the same as
do_exit to cover the cases where do_exit is called to handle
catastrophic failure. In time this can probably be reduced to just a
light wrapper around do_task_dead. For now keep it exactly the same so
that there will be no behavioral differences from introducing this new
concept.

Replace all of the uses of do_exit that use it for catastrophic
task cleanup with make_task_dead to make it clear what the code
is doing.

As part of this, rename rewind_stack_do_exit to
rewind_stack_and_make_dead.
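
Most call sites are a one-for-one rename. The two sparc die_if_kernel
call sites instead fold a pair of do_exit calls into a single
make_task_dead call with a conditional argument. A minimal userspace
sketch of why the two shapes pick the same signal follows. The function
names are hypothetical, and the "privileged" flag stands in for the
regs->psr & PSR_PS and regs->tstate & TSTATE_PRIV tests:

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/*
 * Old shape: two calls in a row. In the kernel the first call never
 * returns, which the early return models here.
 */
static int signal_old_style(bool privileged)
{
	if (privileged)
		return SIGKILL;
	return SIGSEGV;
}

/* New shape: one call site, with the signal chosen by a conditional. */
static int signal_new_style(bool privileged)
{
	return privileged ? SIGKILL : SIGSEGV;
}

/* Both shapes must agree for every input. */
static void check_same_signal(void)
{
	assert(signal_old_style(true) == signal_new_style(true));
	assert(signal_old_style(false) == signal_new_style(false));
}
```

Calling check_same_signal() from a test program exercises both inputs.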

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
arch/alpha/kernel/traps.c | 6 +++---
arch/alpha/mm/fault.c | 2 +-
arch/arm/kernel/traps.c | 2 +-
arch/arm/mm/fault.c | 2 +-
arch/arm64/kernel/traps.c | 2 +-
arch/arm64/mm/fault.c | 2 +-
arch/csky/abiv1/alignment.c | 2 +-
arch/csky/kernel/traps.c | 2 +-
arch/csky/mm/fault.c | 2 +-
arch/h8300/kernel/traps.c | 2 +-
arch/h8300/mm/fault.c | 2 +-
arch/hexagon/kernel/traps.c | 2 +-
arch/ia64/kernel/mca_drv.c | 2 +-
arch/ia64/kernel/traps.c | 2 +-
arch/ia64/mm/fault.c | 2 +-
arch/m68k/kernel/traps.c | 2 +-
arch/m68k/mm/fault.c | 2 +-
arch/microblaze/kernel/exceptions.c | 4 ++--
arch/mips/kernel/traps.c | 2 +-
arch/nds32/kernel/fpu.c | 2 +-
arch/nds32/kernel/traps.c | 8 ++++----
arch/nios2/kernel/traps.c | 4 ++--
arch/openrisc/kernel/traps.c | 2 +-
arch/parisc/kernel/traps.c | 2 +-
arch/powerpc/kernel/traps.c | 8 ++++----
arch/riscv/kernel/traps.c | 2 +-
arch/riscv/mm/fault.c | 2 +-
arch/s390/kernel/dumpstack.c | 2 +-
arch/s390/kernel/nmi.c | 2 +-
arch/sh/kernel/traps.c | 2 +-
arch/sparc/kernel/traps_32.c | 4 +---
arch/sparc/kernel/traps_64.c | 4 +---
arch/x86/entry/entry_32.S | 6 +++---
arch/x86/entry/entry_64.S | 6 +++---
arch/x86/kernel/dumpstack.c | 4 ++--
arch/xtensa/kernel/traps.c | 2 +-
include/linux/sched/task.h | 1 +
kernel/exit.c | 9 +++++++++
tools/objtool/check.c | 3 ++-
39 files changed, 63 insertions(+), 56 deletions(-)

diff --git a/arch/alpha/kernel/traps.c b/arch/alpha/kernel/traps.c
index 2ae34702456c..8a66fe544c69 100644
--- a/arch/alpha/kernel/traps.c
+++ b/arch/alpha/kernel/traps.c
@@ -190,7 +190,7 @@ die_if_kernel(char * str, struct pt_regs *regs, long err, unsigned long *r9_15)
local_irq_enable();
while (1);
}
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

#ifndef CONFIG_MATHEMU
@@ -575,7 +575,7 @@ do_entUna(void * va, unsigned long opcode, unsigned long reg,

printk("Bad unaligned kernel access at %016lx: %p %lx %lu\n",
pc, va, opcode, reg);
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);

got_exception:
/* Ok, we caught the exception, but we don't want it. Is there
@@ -630,7 +630,7 @@ do_entUna(void * va, unsigned long opcode, unsigned long reg,
local_irq_enable();
while (1);
}
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

/*
diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index eee5102c3d88..e9193d52222e 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -204,7 +204,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
printk(KERN_ALERT "Unable to handle kernel paging request at "
"virtual address %016lx\n", address);
die_if_kernel("Oops", regs, cause, (unsigned long*)regs - 16);
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);

/* We ran out of memory, or some other thing happened to us that
made us unable to handle the page fault gracefully. */
diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c
index 195dff58bafc..b4bd2e5f17c1 100644
--- a/arch/arm/kernel/traps.c
+++ b/arch/arm/kernel/traps.c
@@ -333,7 +333,7 @@ static void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
if (panic_on_oops)
panic("Fatal exception");
if (signr)
- do_exit(signr);
+ make_task_dead(signr);
}

/*
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index bc8779d54a64..bf1a0c618c49 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -111,7 +111,7 @@ static void die_kernel_fault(const char *msg, struct mm_struct *mm,
show_pte(KERN_ALERT, mm, addr);
die("Oops", regs, fsr);
bust_spinlocks(0);
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
}

/*
diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index 7b21213a570f..bdd456e4e7f4 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -235,7 +235,7 @@ void die(const char *str, struct pt_regs *regs, int err)
raw_spin_unlock_irqrestore(&die_lock, flags);

if (ret != NOTIFY_STOP)
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

static void arm64_show_signal(int signo, const char *str)
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 9ae24e3b72be..11a28cace2d2 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -302,7 +302,7 @@ static void die_kernel_fault(const char *msg, unsigned long addr,
show_pte(addr);
die("Oops", regs, esr);
bust_spinlocks(0);
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
}

#ifdef CONFIG_KASAN_HW_TAGS
diff --git a/arch/csky/abiv1/alignment.c b/arch/csky/abiv1/alignment.c
index cb2a0d94a144..5e2fb45d605c 100644
--- a/arch/csky/abiv1/alignment.c
+++ b/arch/csky/abiv1/alignment.c
@@ -294,7 +294,7 @@ void csky_alignment(struct pt_regs *regs)
__func__, opcode, rz, rx, imm, addr);
show_regs(regs);
bust_spinlocks(0);
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
}

force_sig_fault(SIGBUS, BUS_ADRALN, (void __user *)addr);
diff --git a/arch/csky/kernel/traps.c b/arch/csky/kernel/traps.c
index e5fbf8653a21..88a47035b925 100644
--- a/arch/csky/kernel/traps.c
+++ b/arch/csky/kernel/traps.c
@@ -109,7 +109,7 @@ void die(struct pt_regs *regs, const char *str)
if (panic_on_oops)
panic("Fatal exception");
if (ret != NOTIFY_STOP)
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

void do_trap(struct pt_regs *regs, int signo, int code, unsigned long addr)
diff --git a/arch/csky/mm/fault.c b/arch/csky/mm/fault.c
index 466ad949818a..7215a46b6b8e 100644
--- a/arch/csky/mm/fault.c
+++ b/arch/csky/mm/fault.c
@@ -67,7 +67,7 @@ static inline void no_context(struct pt_regs *regs, unsigned long addr)
pr_alert("Unable to handle kernel paging request at virtual "
"addr 0x%08lx, pc: 0x%08lx\n", addr, regs->pc);
die(regs, "Oops");
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
}

static inline void mm_fault_error(struct pt_regs *regs, unsigned long addr, vm_fault_t fault)
diff --git a/arch/h8300/kernel/traps.c b/arch/h8300/kernel/traps.c
index bdbe988d8dbc..3d4e0bde37ae 100644
--- a/arch/h8300/kernel/traps.c
+++ b/arch/h8300/kernel/traps.c
@@ -106,7 +106,7 @@ void die(const char *str, struct pt_regs *fp, unsigned long err)
dump(fp);

spin_unlock_irq(&die_lock);
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

static int kstack_depth_to_print = 24;
diff --git a/arch/h8300/mm/fault.c b/arch/h8300/mm/fault.c
index d4bc9c16f2df..0223528565dd 100644
--- a/arch/h8300/mm/fault.c
+++ b/arch/h8300/mm/fault.c
@@ -51,7 +51,7 @@ asmlinkage int do_page_fault(struct pt_regs *regs, unsigned long address,
printk(" at virtual address %08lx\n", address);
if (!user_mode(regs))
die("Oops", regs, error_code);
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);

return 1;
}
diff --git a/arch/hexagon/kernel/traps.c b/arch/hexagon/kernel/traps.c
index edfc35dafeb1..6dd6cf0ab711 100644
--- a/arch/hexagon/kernel/traps.c
+++ b/arch/hexagon/kernel/traps.c
@@ -214,7 +214,7 @@ int die(const char *str, struct pt_regs *regs, long err)
panic("Fatal exception");

oops_exit();
- do_exit(err);
+ make_task_dead(err);
return 0;
}

diff --git a/arch/ia64/kernel/mca_drv.c b/arch/ia64/kernel/mca_drv.c
index 5bfc79be4cef..23c203639a96 100644
--- a/arch/ia64/kernel/mca_drv.c
+++ b/arch/ia64/kernel/mca_drv.c
@@ -176,7 +176,7 @@ mca_handler_bh(unsigned long paddr, void *iip, unsigned long ipsr)
spin_unlock(&mca_bh_lock);

/* This process is about to be killed itself */
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
}

/**
diff --git a/arch/ia64/kernel/traps.c b/arch/ia64/kernel/traps.c
index e13cb905930f..753642366e12 100644
--- a/arch/ia64/kernel/traps.c
+++ b/arch/ia64/kernel/traps.c
@@ -85,7 +85,7 @@ die (const char *str, struct pt_regs *regs, long err)
if (panic_on_oops)
panic("Fatal exception");

- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
return 0;
}

diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 02de2e70c587..4796cccbf74f 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -259,7 +259,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
regs = NULL;
bust_spinlocks(0);
if (regs)
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
return;

out_of_memory:
diff --git a/arch/m68k/kernel/traps.c b/arch/m68k/kernel/traps.c
index 34d6458340b0..59fc63feb0dc 100644
--- a/arch/m68k/kernel/traps.c
+++ b/arch/m68k/kernel/traps.c
@@ -1131,7 +1131,7 @@ void die_if_kernel (char *str, struct pt_regs *fp, int nr)
pr_crit("%s: %08x\n", str, nr);
show_registers(fp);
add_taint(TAINT_DIE, LOCKDEP_NOW_UNRELIABLE);
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

asmlinkage void set_esp0(unsigned long ssp)
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index ef46e77e97a5..fcb3a0d8421c 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -48,7 +48,7 @@ int send_fault_sig(struct pt_regs *regs)
pr_alert("Unable to handle kernel access");
pr_cont(" at virtual address %p\n", addr);
die_if_kernel("Oops", regs, 0 /*error_code*/);
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
}

return 1;
diff --git a/arch/microblaze/kernel/exceptions.c b/arch/microblaze/kernel/exceptions.c
index 908788497b28..fd153d5fab98 100644
--- a/arch/microblaze/kernel/exceptions.c
+++ b/arch/microblaze/kernel/exceptions.c
@@ -44,10 +44,10 @@ void die(const char *str, struct pt_regs *fp, long err)
pr_warn("Oops: %s, sig: %ld\n", str, err);
show_regs(fp);
spin_unlock_irq(&die_lock);
- /* do_exit() should take care of panic'ing from an interrupt
+ /* make_task_dead() should take care of panic'ing from an interrupt
* context so we don't handle it here
*/
- do_exit(err);
+ make_task_dead(err);
}

/* for user application debugging */
diff --git a/arch/mips/kernel/traps.c b/arch/mips/kernel/traps.c
index d26b0fb8ea06..a486486b2355 100644
--- a/arch/mips/kernel/traps.c
+++ b/arch/mips/kernel/traps.c
@@ -422,7 +422,7 @@ void __noreturn die(const char *str, struct pt_regs *regs)
if (regs && kexec_should_crash(current))
crash_kexec(regs);

- do_exit(sig);
+ make_task_dead(sig);
}

extern struct exception_table_entry __start___dbe_table[];
diff --git a/arch/nds32/kernel/fpu.c b/arch/nds32/kernel/fpu.c
index 9edd7ed7d7bf..701c09a668de 100644
--- a/arch/nds32/kernel/fpu.c
+++ b/arch/nds32/kernel/fpu.c
@@ -223,7 +223,7 @@ inline void handle_fpu_exception(struct pt_regs *regs)
}
} else if (fpcsr & FPCSR_mskRIT) {
if (!user_mode(regs))
- do_exit(SIGILL);
+ make_task_dead(SIGILL);
si_signo = SIGILL;
}

diff --git a/arch/nds32/kernel/traps.c b/arch/nds32/kernel/traps.c
index ca75d475eda4..c0a8f3344fb9 100644
--- a/arch/nds32/kernel/traps.c
+++ b/arch/nds32/kernel/traps.c
@@ -141,7 +141,7 @@ void __noreturn die(const char *str, struct pt_regs *regs, int err)

bust_spinlocks(0);
spin_unlock_irq(&die_lock);
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

EXPORT_SYMBOL(die);
@@ -240,7 +240,7 @@ void unhandled_interruption(struct pt_regs *regs)
pr_emerg("unhandled_interruption\n");
show_regs(regs);
if (!user_mode(regs))
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
force_sig(SIGKILL);
}

@@ -251,7 +251,7 @@ void unhandled_exceptions(unsigned long entry, unsigned long addr,
addr, type);
show_regs(regs);
if (!user_mode(regs))
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
force_sig(SIGKILL);
}

@@ -278,7 +278,7 @@ void do_revinsn(struct pt_regs *regs)
pr_emerg("Reserved Instruction\n");
show_regs(regs);
if (!user_mode(regs))
- do_exit(SIGILL);
+ make_task_dead(SIGILL);
force_sig(SIGILL);
}

diff --git a/arch/nios2/kernel/traps.c b/arch/nios2/kernel/traps.c
index 596986a74a26..85ac49d64cf7 100644
--- a/arch/nios2/kernel/traps.c
+++ b/arch/nios2/kernel/traps.c
@@ -37,10 +37,10 @@ void die(const char *str, struct pt_regs *regs, long err)
show_regs(regs);
spin_unlock_irq(&die_lock);
/*
- * do_exit() should take care of panic'ing from an interrupt
+ * make_task_dead() should take care of panic'ing from an interrupt
* context so we don't handle it here
*/
- do_exit(err);
+ make_task_dead(err);
}

void _exception(int signo, struct pt_regs *regs, int code, unsigned long addr)
diff --git a/arch/openrisc/kernel/traps.c b/arch/openrisc/kernel/traps.c
index 0898cb159fac..0446a3c34372 100644
--- a/arch/openrisc/kernel/traps.c
+++ b/arch/openrisc/kernel/traps.c
@@ -212,7 +212,7 @@ void __noreturn die(const char *str, struct pt_regs *regs, long err)
__asm__ __volatile__("l.nop 1");
do {} while (1);
#endif
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

/* This is normally the 'Oops' routine */
diff --git a/arch/parisc/kernel/traps.c b/arch/parisc/kernel/traps.c
index b11fb26ce299..df2122c50d78 100644
--- a/arch/parisc/kernel/traps.c
+++ b/arch/parisc/kernel/traps.c
@@ -269,7 +269,7 @@ void die_if_kernel(char *str, struct pt_regs *regs, long err)
panic("Fatal exception");

oops_exit();
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

/* gdb uses break 4,8 */
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 11741703d26e..a08bb7cefdc5 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -245,7 +245,7 @@ static void oops_end(unsigned long flags, struct pt_regs *regs,

if (panic_on_oops)
panic("Fatal exception");
- do_exit(signr);
+ make_task_dead(signr);
}
NOKPROBE_SYMBOL(oops_end);

@@ -792,9 +792,9 @@ int machine_check_generic(struct pt_regs *regs)
void die_mce(const char *str, struct pt_regs *regs, long err)
{
/*
- * The machine check wants to kill the interrupted context, but
- * do_exit() checks for in_interrupt() and panics in that case, so
- * exit the irq/nmi before calling die.
+ * The machine check wants to kill the interrupted context,
+ * but make_task_dead() checks for in_interrupt() and panics
+ * in that case, so exit the irq/nmi before calling die.
*/
if (in_nmi())
nmi_exit();
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index 0daaa3e4630d..fe92e119e6a3 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -54,7 +54,7 @@ void die(struct pt_regs *regs, const char *str)
if (panic_on_oops)
panic("Fatal exception");
if (ret != NOTIFY_STOP)
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

void do_trap(struct pt_regs *regs, int signo, int code, unsigned long addr)
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index aa08dd2f8fae..42118bc728f9 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -31,7 +31,7 @@ static void die_kernel_fault(const char *msg, unsigned long addr,

bust_spinlocks(0);
die(regs, "Oops");
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
}

static inline void no_context(struct pt_regs *regs, unsigned long addr)
diff --git a/arch/s390/kernel/dumpstack.c b/arch/s390/kernel/dumpstack.c
index 0681c55e831d..1e3233eb510a 100644
--- a/arch/s390/kernel/dumpstack.c
+++ b/arch/s390/kernel/dumpstack.c
@@ -224,5 +224,5 @@ void __noreturn die(struct pt_regs *regs, const char *str)
if (panic_on_oops)
panic("Fatal exception: panic_on_oops");
oops_exit();
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}
diff --git a/arch/s390/kernel/nmi.c b/arch/s390/kernel/nmi.c
index 20f8e1868853..a4d8c058dd27 100644
--- a/arch/s390/kernel/nmi.c
+++ b/arch/s390/kernel/nmi.c
@@ -175,7 +175,7 @@ void __s390_handle_mcck(void)
"malfunction (code 0x%016lx).\n", mcck.mcck_code);
printk(KERN_EMERG "mcck: task: %s, pid: %d.\n",
current->comm, current->pid);
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}
}

diff --git a/arch/sh/kernel/traps.c b/arch/sh/kernel/traps.c
index cbe3201d4f21..01884054aeb2 100644
--- a/arch/sh/kernel/traps.c
+++ b/arch/sh/kernel/traps.c
@@ -57,7 +57,7 @@ void __noreturn die(const char *str, struct pt_regs *regs, long err)
if (panic_on_oops)
panic("Fatal exception");

- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

void die_if_kernel(const char *str, struct pt_regs *regs, long err)
diff --git a/arch/sparc/kernel/traps_32.c b/arch/sparc/kernel/traps_32.c
index 5630e5a395e0..179aabfa712e 100644
--- a/arch/sparc/kernel/traps_32.c
+++ b/arch/sparc/kernel/traps_32.c
@@ -86,9 +86,7 @@ void __noreturn die_if_kernel(char *str, struct pt_regs *regs)
}
printk("Instruction DUMP:");
instruction_dump ((unsigned long *) regs->pc);
- if(regs->psr & PSR_PS)
- do_exit(SIGKILL);
- do_exit(SIGSEGV);
+ make_task_dead((regs->psr & PSR_PS) ? SIGKILL : SIGSEGV);
}

void do_hw_interrupt(struct pt_regs *regs, unsigned long type)
diff --git a/arch/sparc/kernel/traps_64.c b/arch/sparc/kernel/traps_64.c
index 6863025ed56d..21077821f427 100644
--- a/arch/sparc/kernel/traps_64.c
+++ b/arch/sparc/kernel/traps_64.c
@@ -2559,9 +2559,7 @@ void __noreturn die_if_kernel(char *str, struct pt_regs *regs)
}
if (panic_on_oops)
panic("Fatal exception");
- if (regs->tstate & TSTATE_PRIV)
- do_exit(SIGKILL);
- do_exit(SIGSEGV);
+ make_task_dead((regs->tstate & TSTATE_PRIV) ? SIGKILL : SIGSEGV);
}
EXPORT_SYMBOL(die_if_kernel);

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index ccb9d32768f3..7738fad6a85e 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1248,14 +1248,14 @@ SYM_CODE_START(asm_exc_nmi)
SYM_CODE_END(asm_exc_nmi)

.pushsection .text, "ax"
-SYM_CODE_START(rewind_stack_do_exit)
+SYM_CODE_START(rewind_stack_and_make_dead)
/* Prevent any naive code from trying to unwind to our caller. */
xorl %ebp, %ebp

movl PER_CPU_VAR(cpu_current_top_of_stack), %esi
leal -TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%esi), %esp

- call do_exit
+ call make_task_dead
1: jmp 1b
-SYM_CODE_END(rewind_stack_do_exit)
+SYM_CODE_END(rewind_stack_and_make_dead)
.popsection
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e38a4cf795d9..f09276457942 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1429,7 +1429,7 @@ SYM_CODE_END(ignore_sysret)
#endif

.pushsection .text, "ax"
-SYM_CODE_START(rewind_stack_do_exit)
+SYM_CODE_START(rewind_stack_and_make_dead)
UNWIND_HINT_FUNC
/* Prevent any naive code from trying to unwind to our caller. */
xorl %ebp, %ebp
@@ -1438,6 +1438,6 @@ SYM_CODE_START(rewind_stack_do_exit)
leaq -PTREGS_SIZE(%rax), %rsp
UNWIND_HINT_REGS

- call do_exit
-SYM_CODE_END(rewind_stack_do_exit)
+ call make_task_dead
+SYM_CODE_END(rewind_stack_and_make_dead)
.popsection
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index ea4fe192189d..53de044e5654 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -351,7 +351,7 @@ unsigned long oops_begin(void)
}
NOKPROBE_SYMBOL(oops_begin);

-void __noreturn rewind_stack_do_exit(int signr);
+void __noreturn rewind_stack_and_make_dead(int signr);

void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
{
@@ -386,7 +386,7 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
* reuse the task stack and that existing poisons are invalid.
*/
kasan_unpoison_task_stack(current);
- rewind_stack_do_exit(signr);
+ rewind_stack_and_make_dead(signr);
}
NOKPROBE_SYMBOL(oops_end);

diff --git a/arch/xtensa/kernel/traps.c b/arch/xtensa/kernel/traps.c
index 4b4dbeb2d612..9345007d474d 100644
--- a/arch/xtensa/kernel/traps.c
+++ b/arch/xtensa/kernel/traps.c
@@ -552,5 +552,5 @@ void __noreturn die(const char * str, struct pt_regs * regs, long err)
if (panic_on_oops)
panic("Fatal exception");

- do_exit(err);
+ make_task_dead(err);
}
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ba88a6987400..2d4bbd9c3278 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -59,6 +59,7 @@ extern void sched_post_fork(struct task_struct *p,
extern void sched_dead(struct task_struct *p);

void __noreturn do_task_dead(void);
+void __noreturn make_task_dead(int signr);

extern void proc_caches_init(void);

diff --git a/kernel/exit.c b/kernel/exit.c
index f702a6a63686..bfa513c5b227 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -884,6 +884,15 @@ void __noreturn do_exit(long code)
}
EXPORT_SYMBOL_GPL(do_exit);

+void __noreturn make_task_dead(int signr)
+{
+ /*
+ * Take the task off the cpu after something catastrophic has
+ * happened.
+ */
+ do_exit(signr);
+}
+
void complete_and_exit(struct completion *comp, long code)
{
if (comp)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 21735829b860..e6ab5687770b 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -168,6 +168,7 @@ static bool __dead_end_function(struct objtool_file *file, struct symbol *func,
"panic",
"do_exit",
"do_task_dead",
+ "make_task_dead",
"__module_put_and_exit",
"complete_and_exit",
"__reiserfs_panic",
@@ -175,7 +176,7 @@ static bool __dead_end_function(struct objtool_file *file, struct symbol *func,
"fortify_panic",
"usercopy_abort",
"machine_real_restart",
- "rewind_stack_do_exit",
+ "rewind_stack_and_make_dead",
"kunit_try_catch_throw",
"xen_start_kernel",
"cpu_bringup_and_idle",
--
2.29.2


2021-12-08 20:26:26

by Eric W. Biederman

Subject: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead

The beginning of do_exit has become cluttered and difficult to read as
it is filled with checks to handle things that can only happen when
the kernel is operating improperly.

Now that we have a dedicated function for cleaning up a task when the
kernel is operating improperly, move the checks there.
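
The resulting split can be pictured with a small userspace model. This
is only a sketch under stated assumptions: struct model_task, the
model_* functions and the outcome enum are all hypothetical, and since
the real functions never return, the model returns a value describing
what the kernel would have done instead:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one task; not the kernel's task_struct. */
struct model_task {
	int pid;            /* 0 stands for the idle task */
	bool in_interrupt;  /* stands in for in_interrupt() */
	bool exiting;       /* stands in for PF_EXITING */
	int preempt_count;  /* non-zero stands in for in_atomic() */
	int exit_code;
};

/* The kernel functions never return; the model reports the outcome. */
enum model_outcome {
	MODEL_EXITED,  /* ordinary exit processing ran */
	MODEL_PANIC,   /* panic() would have been called */
	MODEL_PARKED,  /* recursive fault: task left asleep for good */
};

/* After the patch: only ordinary exit work, no oops handling. */
static enum model_outcome model_do_exit(struct model_task *t, int code)
{
	assert(t != NULL);
	t->exiting = true;
	t->exit_code = code;
	return MODEL_EXITED;
}

/* After the patch: the oops-only checks run first, then do_exit. */
static enum model_outcome model_make_task_dead(struct model_task *t, int signr)
{
	assert(t != NULL);
	if (t->in_interrupt)
		return MODEL_PANIC;    /* "Aiee, killing interrupt handler!" */
	if (t->pid == 0)
		return MODEL_PANIC;    /* "Attempted to kill the idle task!" */
	if (t->preempt_count != 0)
		t->preempt_count = 0;  /* models preempt_count_set(PREEMPT_ENABLED) */
	if (t->exiting)
		return MODEL_PARKED;   /* fault during exit: do not exit again */
	return model_do_exit(t, signr);
}
```

In this model a plain exit goes straight to model_do_exit, while an
oops path pays for the extra checks only in model_make_task_dead.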

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 78 ++++++++++++++++++++++-----------------------
kernel/futex/core.c | 2 +-
kernel/kexec_core.c | 2 +-
3 files changed, 41 insertions(+), 41 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index bfa513c5b227..d0ec6f6b41cb 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -735,36 +735,8 @@ void __noreturn do_exit(long code)
struct task_struct *tsk = current;
int group_dead;

- /*
- * We can get here from a kernel oops, sometimes with preemption off.
- * Start by checking for critical errors.
- * Then fix up important state like USER_DS and preemption.
- * Then do everything else.
- */
-
WARN_ON(blk_needs_flush_plug(tsk));

- if (unlikely(in_interrupt()))
- panic("Aiee, killing interrupt handler!");
- if (unlikely(!tsk->pid))
- panic("Attempted to kill the idle task!");
-
- /*
- * If do_exit is called because this processes oopsed, it's possible
- * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
- * continuing. Amongst other possible reasons, this is to prevent
- * mm_release()->clear_child_tid() from writing to a user-controlled
- * kernel address.
- */
- force_uaccess_begin();
-
- if (unlikely(in_atomic())) {
- pr_info("note: %s[%d] exited with preempt_count %d\n",
- current->comm, task_pid_nr(current),
- preempt_count());
- preempt_count_set(PREEMPT_ENABLED);
- }
-
profile_task_exit(tsk);
kcov_task_exit(tsk);

@@ -773,17 +745,6 @@ void __noreturn do_exit(long code)

validate_creds_for_do_exit(tsk);

- /*
- * We're taking recursive faults here in do_exit. Safest is to just
- * leave this task alone and wait for reboot.
- */
- if (unlikely(tsk->flags & PF_EXITING)) {
- pr_alert("Fixing recursive fault but reboot is needed!\n");
- futex_exit_recursive(tsk);
- set_current_state(TASK_UNINTERRUPTIBLE);
- schedule();
- }
-
io_uring_files_cancel();
exit_signals(tsk); /* sets PF_EXITING */

@@ -889,7 +850,46 @@ void __noreturn make_task_dead(int signr)
/*
* Take the task off the cpu after something catastrophic has
* happened.
+ *
+ * We can get here from a kernel oops, sometimes with preemption off.
+ * Start by checking for critical errors.
+ * Then fix up important state like USER_DS and preemption.
+ * Then do everything else.
*/
+ struct task_struct *tsk = current;
+
+ if (unlikely(in_interrupt()))
+ panic("Aiee, killing interrupt handler!");
+ if (unlikely(!tsk->pid))
+ panic("Attempted to kill the idle task!");
+
+ /*
+ * If make_task_dead is called because this process oopsed, it's possible
+ * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
+ * continuing. Amongst other possible reasons, this is to prevent
+ * mm_release()->clear_child_tid() from writing to a user-controlled
+ * kernel address.
+ */
+ force_uaccess_begin();
+
+ if (unlikely(in_atomic())) {
+ pr_info("note: %s[%d] exited with preempt_count %d\n",
+ current->comm, task_pid_nr(current),
+ preempt_count());
+ preempt_count_set(PREEMPT_ENABLED);
+ }
+
+ /*
+ * We're taking recursive faults here in make_task_dead. Safest is to just
+ * leave this task alone and wait for reboot.
+ */
+ if (unlikely(tsk->flags & PF_EXITING)) {
+ pr_alert("Fixing recursive fault but reboot is needed!\n");
+ futex_exit_recursive(tsk);
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ schedule();
+ }
+
do_exit(signr);
}

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 25d8a88b32e5..39a1522865b5 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1044,7 +1044,7 @@ static void futex_cleanup(struct task_struct *tsk)
* actually finished the futex cleanup. The worst case for this is that the
* waiter runs through the wait loop until the state becomes visible.
*
- * This is called from the recursive fault handling path in do_exit().
+ * This is called from the recursive fault handling path in make_task_dead().
*
* This is best effort. Either the futex exit code has run already or
* not. If the OWNER_DIED bit has been set on the futex then the waiter can
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 5a5d192a89ac..68480f731192 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -81,7 +81,7 @@ int kexec_should_crash(struct task_struct *p)
if (crash_kexec_post_notifiers)
return 0;
/*
- * There are 4 panic() calls in do_exit() path, each of which
+ * There are 4 panic() calls in make_task_dead() path, each of which
* corresponds to each of these 4 conditions.
*/
if (in_interrupt() || !p->pid || is_global_init(p) || panic_on_oops)
--
2.29.2
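
The order of checks that patch 03 moves into make_task_dead can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the enum, the struct and classify_oops() are hypothetical names invented for illustration, and the real function acts on the current task instead of returning a value.

```c
#include <stdbool.h>

/* Hypothetical outcomes, named for illustration only. */
enum dead_action {
	DEAD_PANIC_INTERRUPT,	/* oops while in interrupt context */
	DEAD_PANIC_IDLE,	/* oops in the idle task (pid 0) */
	DEAD_PARK_FOREVER,	/* recursive fault: task was already exiting */
	DEAD_DO_EXIT,		/* ordinary case: hand over to do_exit() */
};

/* Hypothetical snapshot of the state make_task_dead looks at. */
struct oops_ctx {
	bool in_interrupt;	/* models in_interrupt() */
	int pid;		/* models tsk->pid */
	bool already_exiting;	/* models tsk->flags & PF_EXITING */
};

/*
 * Mirrors the order in the patch: critical errors first, then (in the
 * real code) the USER_DS and preempt_count fixups, then the recursive
 * fault check, and only then the normal exit path.
 */
enum dead_action classify_oops(const struct oops_ctx *ctx)
{
	if (ctx->in_interrupt)
		return DEAD_PANIC_INTERRUPT;
	if (ctx->pid == 0)
		return DEAD_PANIC_IDLE;
	/* The real code fixes up uaccess and preemption state here. */
	if (ctx->already_exiting)
		return DEAD_PARK_FOREVER;
	return DEAD_DO_EXIT;
}
```

The point of the model is only the ordering: the panic conditions are tested before any state is touched, and the recursive fault check comes last before do_exit.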


2021-12-08 20:26:29

by Eric W. Biederman

Subject: [PATCH 04/10] exit: Stop poorly open coding do_task_dead in make_task_dead

When the kernel detects that it is oopsing or otherwise force killing a
task that is already exiting, the code makes a poor attempt to
permanently stop the task from being scheduled.

I say poorly because it is possible for a task in TASK_UNINTERRUPTIBLE
to be woken up.

As it makes no sense for the task to continue, call do_task_dead
instead, which actually does the work and permanently removes the task
from the scheduler, guaranteeing the task will never be woken
up again.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index d0ec6f6b41cb..f975cd8a2ed8 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -886,8 +886,7 @@ void __noreturn make_task_dead(int signr)
if (unlikely(tsk->flags & PF_EXITING)) {
pr_alert("Fixing recursive fault but reboot is needed!\n");
futex_exit_recursive(tsk);
- set_current_state(TASK_UNINTERRUPTIBLE);
- schedule();
+ do_task_dead();
}

do_exit(signr);
--
2.29.2
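
To illustrate why sleeping in TASK_UNINTERRUPTIBLE is not the same as being dead, here is a small userspace model. It rests on assumptions: the TOY_ names and the state bit values are illustrative stand-ins, and toy_wake_up_process() only imitates the rule that a wake-up applies to tasks sleeping in the interruptible or uninterruptible states.

```c
#include <stdbool.h>

/* Illustrative state bits; treat the exact values as assumptions. */
#define TOY_TASK_RUNNING		0x0000u
#define TOY_TASK_INTERRUPTIBLE		0x0001u
#define TOY_TASK_UNINTERRUPTIBLE	0x0002u
#define TOY_TASK_DEAD			0x0080u
#define TOY_TASK_NORMAL \
	(TOY_TASK_INTERRUPTIBLE | TOY_TASK_UNINTERRUPTIBLE)

/* Hypothetical stand-in for the scheduler's view of a task. */
struct toy_sleeper {
	unsigned int state;
};

/*
 * Models a wake-up: only a task sleeping in one of the "normal"
 * states is made runnable again. Any other state is left alone.
 */
bool toy_wake_up_process(struct toy_sleeper *t)
{
	if (!(t->state & TOY_TASK_NORMAL))
		return false;
	t->state = TOY_TASK_RUNNING;
	return true;
}
```

In this model a stray wake-up brings the TASK_UNINTERRUPTIBLE sleeper back to life, which in the old code would let it fall through to do_exit(signr). A task marked dead does not match the wake-up mask, so the same call leaves it untouched.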


2021-12-08 20:26:32

by Eric W. Biederman

Subject: [PATCH 05/10] exit: Stop exporting do_exit

Now that there are no more modular uses of do_exit remove the EXPORT_SYMBOL.

Suggested-by: Christoph Hellwig <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index f975cd8a2ed8..57afac845a0a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -843,7 +843,6 @@ void __noreturn do_exit(long code)
lockdep_free_task(tsk);
do_task_dead();
}
-EXPORT_SYMBOL_GPL(do_exit);

void __noreturn make_task_dead(int signr)
{
--
2.29.2


2021-12-08 20:26:38

by Eric W. Biederman

Subject: [PATCH 06/10] exit: Implement kthread_exit

The way the per task_struct exit_code is used by kernel threads is not
quite compatible with how it is used by userspace applications. The low
byte of the userspace exit_code value encodes the exit signal, while
kthreads just use the value as an int holding an ordinary kernel function
exit status like -EPERM.

Add kthread_exit to clearly separate the two kinds of uses.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/kthread.h | 1 +
kernel/kthread.c | 23 +++++++++++++++++++----
tools/objtool/check.c | 1 +
3 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 346b0f269161..22c43d419687 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -70,6 +70,7 @@ void *kthread_probe_data(struct task_struct *k);
int kthread_park(struct task_struct *k);
void kthread_unpark(struct task_struct *k);
void kthread_parkme(void);
+void kthread_exit(long result) __noreturn;

int kthreadd(void *unused);
extern struct task_struct *kthreadd_task;
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 7113003fab63..77b7c3f23f18 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -268,6 +268,21 @@ void kthread_parkme(void)
}
EXPORT_SYMBOL_GPL(kthread_parkme);

+/**
+ * kthread_exit - Cause the current kthread to return @result to kthread_stop().
+ * @result: The integer value to return to kthread_stop().
+ *
+ * While kthread_exit can be called directly, it exists so that
+ * functions which do some additional work in non-modular code such as
+ * module_put_and_kthread_exit can be implemented.
+ *
+ * Does not return.
+ */
+void __noreturn kthread_exit(long result)
+{
+ do_exit(result);
+}
+
static int kthread(void *_create)
{
static const struct sched_param param = { .sched_priority = 0 };
@@ -286,13 +301,13 @@ static int kthread(void *_create)
done = xchg(&create->done, NULL);
if (!done) {
kfree(create);
- do_exit(-EINTR);
+ kthread_exit(-EINTR);
}

if (!self) {
create->result = ERR_PTR(-ENOMEM);
complete(done);
- do_exit(-ENOMEM);
+ kthread_exit(-ENOMEM);
}

self->threadfn = threadfn;
@@ -326,7 +341,7 @@ static int kthread(void *_create)
__kthread_parkme(self);
ret = threadfn(data);
}
- do_exit(ret);
+ kthread_exit(ret);
}

/* called from kernel_clone() to get node information for about to be created task */
@@ -627,7 +642,7 @@ EXPORT_SYMBOL_GPL(kthread_park);
* instead of calling wake_up_process(): the thread will exit without
* calling threadfn().
*
- * If threadfn() may call do_exit() itself, the caller must ensure
+ * If threadfn() may call kthread_exit() itself, the caller must ensure
* task_struct can't go away.
*
* Returns the result of threadfn(), or %-EINTR if wake_up_process()
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index e6ab5687770b..90108fe5610d 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -168,6 +168,7 @@ static bool __dead_end_function(struct objtool_file *file, struct symbol *func,
"panic",
"do_exit",
"do_task_dead",
+ "kthread_exit",
"make_task_dead",
"__module_put_and_exit",
"complete_and_exit",
--
2.29.2
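
The mismatch described in patch 06 is easy to see in a userspace sketch. The toy_ helpers below are hypothetical. The encoding follows the (error_code & 0xff) << 8 expression in SYSCALL_DEFINE1(exit, ...) quoted elsewhere in this series, and the decoding assumes the usual wait status layout in which the low 7 bits carry the terminating signal.

```c
#include <errno.h>

/* What exit(2) stores in exit_code for a plain exit status. */
long toy_exit_code_from_status(int status)
{
	return (long)((status & 0xff) << 8);
}

/* Userspace-style decoding: the low 7 bits are the terminating signal. */
int toy_termsig(long exit_code)
{
	return (int)((unsigned long)exit_code & 0x7f);
}

/* Userspace-style decoding: bits 8..15 are the exit status. */
int toy_exit_status(long exit_code)
{
	return (int)(((unsigned long)exit_code >> 8) & 0xff);
}
```

A userspace exit status of 3 round-trips cleanly through these helpers. A kthread result such as -EPERM stored in the same field decodes as "signal" 127 and "status" 255, which is why it helps to keep the two uses apart.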


2021-12-08 20:26:40

by Eric W. Biederman

Subject: [PATCH 07/10] exit: Rename module_put_and_exit to module_put_and_kthread_exit

Update module_put_and_exit to call kthread_exit instead of do_exit.

Change the name to reflect this change in functionality. All of the
users of module_put_and_exit are causing the current kthread to exit
so this change makes it clear what is happening. There is no
functional change.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
crypto/algboss.c | 4 ++--
fs/cifs/connect.c | 2 +-
fs/nfs/callback.c | 4 ++--
fs/nfs/nfs4state.c | 2 +-
fs/nfsd/nfssvc.c | 2 +-
include/linux/module.h | 6 +++---
kernel/module.c | 6 +++---
net/bluetooth/bnep/core.c | 2 +-
net/bluetooth/cmtp/core.c | 2 +-
net/bluetooth/hidp/core.c | 2 +-
tools/objtool/check.c | 2 +-
11 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/crypto/algboss.c b/crypto/algboss.c
index 1814d2c5188a..eb5fe84efb83 100644
--- a/crypto/algboss.c
+++ b/crypto/algboss.c
@@ -67,7 +67,7 @@ static int cryptomgr_probe(void *data)
complete_all(&param->larval->completion);
crypto_alg_put(&param->larval->alg);
kfree(param);
- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
}

static int cryptomgr_schedule_probe(struct crypto_larval *larval)
@@ -190,7 +190,7 @@ static int cryptomgr_test(void *data)
crypto_alg_tested(param->driver, err);

kfree(param);
- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
}

static int cryptomgr_schedule_test(struct crypto_alg *alg)
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 82577a7a5bb1..39fbe9acbf51 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -1139,7 +1139,7 @@ cifs_demultiplex_thread(void *p)
}

memalloc_noreclaim_restore(noreclaim_flag);
- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
}

/*
diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
index 86d856de1389..3c86a559a321 100644
--- a/fs/nfs/callback.c
+++ b/fs/nfs/callback.c
@@ -93,7 +93,7 @@ nfs4_callback_svc(void *vrqstp)
svc_process(rqstp);
}
svc_exit_thread(rqstp);
- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
return 0;
}

@@ -137,7 +137,7 @@ nfs41_callback_svc(void *vrqstp)
}
}
svc_exit_thread(rqstp);
- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
return 0;
}

diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index ecc4594299d6..ea41af731978 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -2689,6 +2689,6 @@ static int nfs4_run_state_manager(void *ptr)
allow_signal(SIGKILL);
nfs4_state_manager(clp);
nfs_put_client(clp);
- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
return 0;
}
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 80431921e5d7..5ce9f14318c4 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -986,7 +986,7 @@ nfsd(void *vrqstp)

/* Release module */
mutex_unlock(&nfsd_mutex);
- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
return 0;
}

diff --git a/include/linux/module.h b/include/linux/module.h
index c9f1200b2312..f03be97e9ec1 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -595,9 +595,9 @@ int module_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
/* Look for this name: can be of form module:name. */
unsigned long module_kallsyms_lookup_name(const char *name);

-extern void __noreturn __module_put_and_exit(struct module *mod,
+extern void __noreturn __module_put_and_kthread_exit(struct module *mod,
long code);
-#define module_put_and_exit(code) __module_put_and_exit(THIS_MODULE, code)
+#define module_put_and_kthread_exit(code) __module_put_and_kthread_exit(THIS_MODULE, code)

#ifdef CONFIG_MODULE_UNLOAD
int module_refcount(struct module *mod);
@@ -790,7 +790,7 @@ static inline int unregister_module_notifier(struct notifier_block *nb)
return 0;
}

-#define module_put_and_exit(code) do_exit(code)
+#define module_put_and_kthread_exit(code) kthread_exit(code)

static inline void print_modules(void)
{
diff --git a/kernel/module.c b/kernel/module.c
index 84a9141a5e15..a3aa00bf270d 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -337,12 +337,12 @@ static inline void add_taint_module(struct module *mod, unsigned flag,
* A thread that wants to hold a reference to a module only while it
* is running can call this to safely exit. nfsd and lockd use this.
*/
-void __noreturn __module_put_and_exit(struct module *mod, long code)
+void __noreturn __module_put_and_kthread_exit(struct module *mod, long code)
{
module_put(mod);
- do_exit(code);
+ kthread_exit(code);
}
-EXPORT_SYMBOL(__module_put_and_exit);
+EXPORT_SYMBOL(__module_put_and_kthread_exit);

/* Find a module section: 0 means not found. */
static unsigned int find_sec(const struct load_info *info, const char *name)
diff --git a/net/bluetooth/bnep/core.c b/net/bluetooth/bnep/core.c
index c9add7753b9f..40baa6b7321a 100644
--- a/net/bluetooth/bnep/core.c
+++ b/net/bluetooth/bnep/core.c
@@ -535,7 +535,7 @@ static int bnep_session(void *arg)

up_write(&bnep_session_sem);
free_netdev(dev);
- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
return 0;
}

diff --git a/net/bluetooth/cmtp/core.c b/net/bluetooth/cmtp/core.c
index 0a2d78e811cf..9bfded6b74b3 100644
--- a/net/bluetooth/cmtp/core.c
+++ b/net/bluetooth/cmtp/core.c
@@ -323,7 +323,7 @@ static int cmtp_session(void *arg)
up_write(&cmtp_session_sem);

kfree(session);
- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
return 0;
}

diff --git a/net/bluetooth/hidp/core.c b/net/bluetooth/hidp/core.c
index 80848dfc01db..5940744a8cd8 100644
--- a/net/bluetooth/hidp/core.c
+++ b/net/bluetooth/hidp/core.c
@@ -1305,7 +1305,7 @@ static int hidp_session_thread(void *arg)
l2cap_unregister_user(session->conn, &session->user);
hidp_session_put(session);

- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
return 0;
}

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 90108fe5610d..120e9598c11a 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -170,7 +170,7 @@ static bool __dead_end_function(struct objtool_file *file, struct symbol *func,
"do_task_dead",
"kthread_exit",
"make_task_dead",
- "__module_put_and_exit",
+ "__module_put_and_kthread_exit",
"complete_and_exit",
"__reiserfs_panic",
"lbug_with_loc",
--
2.29.2


2021-12-08 20:26:46

by Eric W. Biederman

Subject: [PATCH 08/10] exit: Rename complete_and_exit to kthread_complete_and_exit

Update complete_and_exit to call kthread_exit instead of do_exit.

Change the name to reflect this change in functionality. All of the
users of complete_and_exit are causing the current kthread to exit so
this change makes it clear what is happening.

Move the implementation of kthread_complete_and_exit from
kernel/exit.c to kernel/kthread.c. As this function is kthread
specific it makes most sense for it to live with the kthread functions.

There is no functional change.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
drivers/net/wireless/rsi/rsi_91x_coex.c | 2 +-
drivers/net/wireless/rsi/rsi_91x_main.c | 2 +-
drivers/net/wireless/rsi/rsi_91x_sdio_ops.c | 2 +-
drivers/net/wireless/rsi/rsi_91x_usb_ops.c | 2 +-
drivers/pnp/pnpbios/core.c | 6 +++---
drivers/staging/rts5208/rtsx.c | 16 +++++++--------
drivers/usb/atm/usbatm.c | 2 +-
drivers/usb/gadget/function/f_mass_storage.c | 2 +-
fs/jffs2/background.c | 2 +-
include/linux/kernel.h | 1 -
include/linux/kthread.h | 1 +
kernel/exit.c | 9 ---------
kernel/kthread.c | 21 ++++++++++++++++++++
lib/kunit/try-catch.c | 4 ++--
tools/objtool/check.c | 2 +-
15 files changed, 43 insertions(+), 31 deletions(-)

diff --git a/drivers/net/wireless/rsi/rsi_91x_coex.c b/drivers/net/wireless/rsi/rsi_91x_coex.c
index a0c5d02ae88c..8a3d86897ea8 100644
--- a/drivers/net/wireless/rsi/rsi_91x_coex.c
+++ b/drivers/net/wireless/rsi/rsi_91x_coex.c
@@ -63,7 +63,7 @@ static void rsi_coex_scheduler_thread(struct rsi_common *common)
rsi_coex_sched_tx_pkts(coex_cb);
} while (atomic_read(&coex_cb->coex_tx_thread.thread_done) == 0);

- complete_and_exit(&coex_cb->coex_tx_thread.completion, 0);
+ kthread_complete_and_exit(&coex_cb->coex_tx_thread.completion, 0);
}

int rsi_coex_recv_pkt(struct rsi_common *common, u8 *msg)
diff --git a/drivers/net/wireless/rsi/rsi_91x_main.c b/drivers/net/wireless/rsi/rsi_91x_main.c
index f1bf71e6c608..c7f5cec5e446 100644
--- a/drivers/net/wireless/rsi/rsi_91x_main.c
+++ b/drivers/net/wireless/rsi/rsi_91x_main.c
@@ -260,7 +260,7 @@ static void rsi_tx_scheduler_thread(struct rsi_common *common)
if (common->init_done)
rsi_core_qos_processor(common);
} while (atomic_read(&common->tx_thread.thread_done) == 0);
- complete_and_exit(&common->tx_thread.completion, 0);
+ kthread_complete_and_exit(&common->tx_thread.completion, 0);
}

#ifdef CONFIG_RSI_COEX
diff --git a/drivers/net/wireless/rsi/rsi_91x_sdio_ops.c b/drivers/net/wireless/rsi/rsi_91x_sdio_ops.c
index 8ace1874e5cb..b2b47a0abcbf 100644
--- a/drivers/net/wireless/rsi/rsi_91x_sdio_ops.c
+++ b/drivers/net/wireless/rsi/rsi_91x_sdio_ops.c
@@ -75,7 +75,7 @@ void rsi_sdio_rx_thread(struct rsi_common *common)

rsi_dbg(INFO_ZONE, "%s: Terminated SDIO RX thread\n", __func__);
atomic_inc(&sdev->rx_thread.thread_done);
- complete_and_exit(&sdev->rx_thread.completion, 0);
+ kthread_complete_and_exit(&sdev->rx_thread.completion, 0);
}

/**
diff --git a/drivers/net/wireless/rsi/rsi_91x_usb_ops.c b/drivers/net/wireless/rsi/rsi_91x_usb_ops.c
index 4ffcdde1acb1..5130b0e72adc 100644
--- a/drivers/net/wireless/rsi/rsi_91x_usb_ops.c
+++ b/drivers/net/wireless/rsi/rsi_91x_usb_ops.c
@@ -56,6 +56,6 @@ void rsi_usb_rx_thread(struct rsi_common *common)
out:
rsi_dbg(INFO_ZONE, "%s: Terminated thread\n", __func__);
skb_queue_purge(&dev->rx_q);
- complete_and_exit(&dev->rx_thread.completion, 0);
+ kthread_complete_and_exit(&dev->rx_thread.completion, 0);
}

diff --git a/drivers/pnp/pnpbios/core.c b/drivers/pnp/pnpbios/core.c
index 669ef4700c1a..f7e86ae9f72f 100644
--- a/drivers/pnp/pnpbios/core.c
+++ b/drivers/pnp/pnpbios/core.c
@@ -160,7 +160,7 @@ static int pnp_dock_thread(void *unused)
* No dock to manage
*/
case PNP_FUNCTION_NOT_SUPPORTED:
- complete_and_exit(&unload_sem, 0);
+ kthread_complete_and_exit(&unload_sem, 0);
case PNP_SYSTEM_NOT_DOCKED:
d = 0;
break;
@@ -170,7 +170,7 @@ static int pnp_dock_thread(void *unused)
default:
pnpbios_print_status("pnp_dock_thread", status);
printk(KERN_WARNING "PnPBIOS: disabling dock monitoring.\n");
- complete_and_exit(&unload_sem, 0);
+ kthread_complete_and_exit(&unload_sem, 0);
}
if (d != docked) {
if (pnp_dock_event(d, &now) == 0) {
@@ -183,7 +183,7 @@ static int pnp_dock_thread(void *unused)
}
}
}
- complete_and_exit(&unload_sem, 0);
+ kthread_complete_and_exit(&unload_sem, 0);
}

static int pnpbios_get_resources(struct pnp_dev *dev)
diff --git a/drivers/staging/rts5208/rtsx.c b/drivers/staging/rts5208/rtsx.c
index 91fcf85e150a..5a58dac76c88 100644
--- a/drivers/staging/rts5208/rtsx.c
+++ b/drivers/staging/rts5208/rtsx.c
@@ -450,13 +450,13 @@ static int rtsx_control_thread(void *__dev)
* after the down() -- that's necessary for the thread-shutdown
* case.
*
- * complete_and_exit() goes even further than this -- it is safe in
- * the case that the thread of the caller is going away (not just
- * the structure) -- this is necessary for the module-remove case.
- * This is important in preemption kernels, which transfer the flow
- * of execution immediately upon a complete().
+ * kthread_complete_and_exit() goes even further than this --
+ * it is safe in the case that the thread of the caller is going away
+ * (not just the structure) -- this is necessary for the module-remove
+ * case. This is important in preemption kernels, which transfer the
+ * flow of execution immediately upon a complete().
*/
- complete_and_exit(&dev->control_exit, 0);
+ kthread_complete_and_exit(&dev->control_exit, 0);
}

static int rtsx_polling_thread(void *__dev)
@@ -501,7 +501,7 @@ static int rtsx_polling_thread(void *__dev)
mutex_unlock(&dev->dev_mutex);
}

- complete_and_exit(&dev->polling_exit, 0);
+ kthread_complete_and_exit(&dev->polling_exit, 0);
}

/*
@@ -682,7 +682,7 @@ static int rtsx_scan_thread(void *__dev)
/* Should we unbind if no devices were detected? */
}

- complete_and_exit(&dev->scanning_done, 0);
+ kthread_complete_and_exit(&dev->scanning_done, 0);
}

static void rtsx_init_options(struct rtsx_chip *chip)
diff --git a/drivers/usb/atm/usbatm.c b/drivers/usb/atm/usbatm.c
index da17be1ef64e..e3a49d837609 100644
--- a/drivers/usb/atm/usbatm.c
+++ b/drivers/usb/atm/usbatm.c
@@ -969,7 +969,7 @@ static int usbatm_do_heavy_init(void *arg)
instance->thread = NULL;
mutex_unlock(&instance->serialize);

- complete_and_exit(&instance->thread_exited, ret);
+ kthread_complete_and_exit(&instance->thread_exited, ret);
}

static int usbatm_heavy_init(struct usbatm_data *instance)
diff --git a/drivers/usb/gadget/function/f_mass_storage.c b/drivers/usb/gadget/function/f_mass_storage.c
index 752439690fda..46dd11dcb3a8 100644
--- a/drivers/usb/gadget/function/f_mass_storage.c
+++ b/drivers/usb/gadget/function/f_mass_storage.c
@@ -2547,7 +2547,7 @@ static int fsg_main_thread(void *common_)
up_write(&common->filesem);

/* Let fsg_unbind() know the thread has exited */
- complete_and_exit(&common->thread_notifier, 0);
+ kthread_complete_and_exit(&common->thread_notifier, 0);
}


diff --git a/fs/jffs2/background.c b/fs/jffs2/background.c
index 2b4d5013dc5d..6da92ecaf66d 100644
--- a/fs/jffs2/background.c
+++ b/fs/jffs2/background.c
@@ -161,5 +161,5 @@ static int jffs2_garbage_collect_thread(void *_c)
spin_lock(&c->erase_completion_lock);
c->gc_task = NULL;
spin_unlock(&c->erase_completion_lock);
- complete_and_exit(&c->gc_thread_exit, 0);
+ kthread_complete_and_exit(&c->gc_thread_exit, 0);
}
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 77755ac3e189..055eb203c00e 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -187,7 +187,6 @@ static inline void might_fault(void) { }
#endif

void do_exit(long error_code) __noreturn;
-void complete_and_exit(struct completion *, long) __noreturn;

extern int num_to_str(char *buf, int size,
unsigned long long num, unsigned int width);
diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 22c43d419687..d86a7e3b9a52 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -71,6 +71,7 @@ int kthread_park(struct task_struct *k);
void kthread_unpark(struct task_struct *k);
void kthread_parkme(void);
void kthread_exit(long result) __noreturn;
+void kthread_complete_and_exit(struct completion *, long) __noreturn;

int kthreadd(void *unused);
extern struct task_struct *kthreadd_task;
diff --git a/kernel/exit.c b/kernel/exit.c
index 57afac845a0a..6c4b04531f17 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -891,15 +891,6 @@ void __noreturn make_task_dead(int signr)
do_exit(signr);
}

-void complete_and_exit(struct completion *comp, long code)
-{
- if (comp)
- complete(comp);
-
- do_exit(code);
-}
-EXPORT_SYMBOL(complete_and_exit);
-
SYSCALL_DEFINE1(exit, int, error_code)
{
do_exit((error_code&0xff)<<8);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 77b7c3f23f18..4388d6694a7f 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -283,6 +283,27 @@ void __noreturn kthread_exit(long result)
do_exit(result);
}

+/**
+ * kthread_complete_and_exit - Exit the current kthread.
+ * @comp: Completion to complete
+ * @code: The integer value to return to kthread_stop().
+ *
+ * If present, complete @comp and then return @code to kthread_stop().
+ *
+ * A kernel thread whose module may be removed after the completion of
+ * @comp can use this function to exit safely.
+ *
+ * Does not return.
+ */
+void __noreturn kthread_complete_and_exit(struct completion *comp, long code)
+{
+ if (comp)
+ complete(comp);
+
+ kthread_exit(code);
+}
+EXPORT_SYMBOL(kthread_complete_and_exit);
+
static int kthread(void *_create)
{
static const struct sched_param param = { .sched_priority = 0 };
diff --git a/lib/kunit/try-catch.c b/lib/kunit/try-catch.c
index 0dd434e40487..be38a2c5ecc2 100644
--- a/lib/kunit/try-catch.c
+++ b/lib/kunit/try-catch.c
@@ -17,7 +17,7 @@
void __noreturn kunit_try_catch_throw(struct kunit_try_catch *try_catch)
{
try_catch->try_result = -EFAULT;
- complete_and_exit(try_catch->try_completion, -EFAULT);
+ kthread_complete_and_exit(try_catch->try_completion, -EFAULT);
}
EXPORT_SYMBOL_GPL(kunit_try_catch_throw);

@@ -27,7 +27,7 @@ static int kunit_generic_run_threadfn_adapter(void *data)

try_catch->try(try_catch->context);

- complete_and_exit(try_catch->try_completion, 0);
+ kthread_complete_and_exit(try_catch->try_completion, 0);
}

static unsigned long kunit_test_timeout(void)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 120e9598c11a..282273a1ffa5 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -171,7 +171,7 @@ static bool __dead_end_function(struct objtool_file *file, struct symbol *func,
"kthread_exit",
"make_task_dead",
"__module_put_and_kthread_exit",
- "complete_and_exit",
+ "kthread_complete_and_exit",
"__reiserfs_panic",
"lbug_with_loc",
"fortify_panic",
--
2.29.2


2021-12-08 20:26:53

by Eric W. Biederman

Subject: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads

Today the rules are a bit iffy and arbitrary about which kernel
threads have struct kthread present. Both idle threads and threads
started with create_kthread want struct kthread present, so that is
effectively all kernel threads. Make the rule that if PF_KTHREAD is set
and the task is running then struct kthread is present.

This will allow the kernel thread code to use tsk->exit_code
with different semantics from ordinary processes.

To ensure that struct kthread is present for all
kernel threads move its allocation into copy_process.

Add a deallocation of struct kthread in exec for processes
that were kernel threads.

Move the allocation of struct kthread for the initial thread
earlier so that it is not repeated for each additional idle
thread.

Move the initialization of struct kthread into set_kthread_struct
so that the structure is always and reliably initialized.

Clear set_child_tid in free_kthread_struct to ensure the kthread
struct is reliably freed during exec. The function
free_kthread_struct does not need to clear vfork_done during exec as
exec_mm_release called from exec_mmap has already cleared vfork_done.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/exec.c | 2 ++
include/linux/kthread.h | 2 +-
kernel/fork.c | 4 ++++
kernel/kthread.c | 31 ++++++++++++++-----------------
kernel/sched/core.c | 16 ++++++++--------
5 files changed, 29 insertions(+), 26 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 537d92c41105..59cac7c18178 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1307,6 +1307,8 @@ int begin_new_exec(struct linux_binprm * bprm)
*/
force_uaccess_begin();

+ if (me->flags & PF_KTHREAD)
+ free_kthread_struct(me);
me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
PF_NOFREEZE | PF_NO_SETAFFINITY);
flush_thread();
diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index d86a7e3b9a52..4f3433afb54b 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -33,7 +33,7 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
unsigned int cpu,
const char *namefmt);

-void set_kthread_struct(struct task_struct *p);
+bool set_kthread_struct(struct task_struct *p);

void kthread_set_per_cpu(struct task_struct *k, int cpu);
bool kthread_is_per_cpu(struct task_struct *k);
diff --git a/kernel/fork.c b/kernel/fork.c
index 3244cc56b697..04fa3e5d97af 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2118,6 +2118,10 @@ static __latent_entropy struct task_struct *copy_process(
p->io_context = NULL;
audit_set_context(p, NULL);
cgroup_fork(p);
+ if (p->flags & PF_KTHREAD) {
+ if (!set_kthread_struct(p))
+ goto bad_fork_cleanup_threadgroup_lock;
+ }
#ifdef CONFIG_NUMA
p->mempolicy = mpol_dup(p->mempolicy);
if (IS_ERR(p->mempolicy)) {
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 4388d6694a7f..8e5f44bed027 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -93,20 +93,27 @@ static inline struct kthread *__to_kthread(struct task_struct *p)
return kthread;
}

-void set_kthread_struct(struct task_struct *p)
+bool set_kthread_struct(struct task_struct *p)
{
struct kthread *kthread;

- if (__to_kthread(p))
- return;
+ if (WARN_ON_ONCE(to_kthread(p)))
+ return false;

kthread = kzalloc(sizeof(*kthread), GFP_KERNEL);
+ if (!kthread)
+ return false;
+
+ init_completion(&kthread->exited);
+ init_completion(&kthread->parked);
+ p->vfork_done = &kthread->exited;
+
/*
* We abuse ->set_child_tid to avoid the new member and because it
- * can't be wrongly copied by copy_process(). We also rely on fact
- * that the caller can't exec, so PF_KTHREAD can't be cleared.
+ * can't be wrongly copied by copy_process().
*/
p->set_child_tid = (__force void __user *)kthread;
+ return true;
}

void free_kthread_struct(struct task_struct *k)
@@ -114,13 +121,13 @@ void free_kthread_struct(struct task_struct *k)
struct kthread *kthread;

/*
- * Can be NULL if this kthread was created by kernel_thread()
- * or if kmalloc() in kthread() failed.
+ * Can be NULL if kmalloc() in set_kthread_struct() failed.
*/
kthread = to_kthread(k);
#ifdef CONFIG_BLK_CGROUP
WARN_ON_ONCE(kthread && kthread->blkcg_css);
#endif
+ k->set_child_tid = (__force void __user *)NULL;
kfree(kthread);
}

@@ -315,7 +322,6 @@ static int kthread(void *_create)
struct kthread *self;
int ret;

- set_kthread_struct(current);
self = to_kthread(current);

/* If user was SIGKILLed, I release the structure. */
@@ -325,17 +331,8 @@ static int kthread(void *_create)
kthread_exit(-EINTR);
}

- if (!self) {
- create->result = ERR_PTR(-ENOMEM);
- complete(done);
- kthread_exit(-ENOMEM);
- }
-
self->threadfn = threadfn;
self->data = data;
- init_completion(&self->exited);
- init_completion(&self->parked);
- current->vfork_done = &self->exited;

/*
* The new thread inherited kthreadd's priority and CPU mask. Reset
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3c9b0fda64ac..0404a8c572a1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8599,14 +8599,6 @@ void __init init_idle(struct task_struct *idle, int cpu)

__sched_fork(0, idle);

- /*
- * The idle task doesn't need the kthread struct to function, but it
- * is dressed up as a per-CPU kthread and thus needs to play the part
- * if we want to avoid special-casing it in code that deals with per-CPU
- * kthreads.
- */
- set_kthread_struct(idle);
-
raw_spin_lock_irqsave(&idle->pi_lock, flags);
raw_spin_rq_lock(rq);

@@ -9427,6 +9419,14 @@ void __init sched_init(void)
mmgrab(&init_mm);
enter_lazy_tlb(&init_mm, current);

+ /*
+ * The idle task doesn't need the kthread struct to function, but it
+ * is dressed up as a per-CPU kthread and thus needs to play the part
+ * if we want to avoid special-casing it in code that deals with per-CPU
+ * kthreads.
+ */
+ WARN_ON(!set_kthread_struct(current));
+
/*
* Make us the idle thread. Technically, schedule() should not be
* called from this thread, however somewhere below it might be,
--
2.29.2
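
The allocate-once contract that patch 09 gives set_kthread_struct can be sketched in userspace as follows. Everything here is an assumption made for illustration: the toy_ types are hypothetical, calloc() stands in for kzalloc(), and a plain pointer member stands in for the pointer the kernel stashes in set_child_tid.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct kthread. */
struct toy_kthread {
	int result;
};

/* Hypothetical stand-in for the parts of task_struct that matter here. */
struct toy_kernel_task {
	bool is_kthread;		/* models PF_KTHREAD */
	struct toy_kthread *kthread;	/* models the set_child_tid pointer */
};

/*
 * Returns true on success. Returns false if the task already has a
 * struct attached or if the allocation failed.
 */
bool toy_set_kthread_struct(struct toy_kernel_task *p)
{
	if (p->kthread)
		return false;
	p->kthread = calloc(1, sizeof(*p->kthread));
	return p->kthread != NULL;
}

/* Frees the struct and clears the pointer so nothing stale is left. */
void toy_free_kthread_struct(struct toy_kernel_task *p)
{
	free(p->kthread);
	p->kthread = NULL;
}
```

Because the function reports success as true, a caller that wants a warning on failure needs to test the negated return value.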


2021-12-08 20:26:55

by Eric W. Biederman

Subject: [PATCH 10/10] exit/kthread: Move the exit code for kernel threads into struct kthread

The exit code of kernel threads has different semantics than the
exit_code of userspace tasks. To avoid confusion and allow
the userspace implementation to change as needed move
the kernel thread exit code into struct kthread.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/kthread.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 8e5f44bed027..9c6c532047c4 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -52,6 +52,7 @@ struct kthread_create_info
struct kthread {
unsigned long flags;
unsigned int cpu;
+ int result;
int (*threadfn)(void *);
void *data;
mm_segment_t oldfs;
@@ -287,7 +288,9 @@ EXPORT_SYMBOL_GPL(kthread_parkme);
*/
void __noreturn kthread_exit(long result)
{
- do_exit(result);
+ struct kthread *kthread = to_kthread(current);
+ kthread->result = result;
+ do_exit(0);
}

/**
@@ -679,7 +682,7 @@ int kthread_stop(struct task_struct *k)
kthread_unpark(k);
wake_up_process(k);
wait_for_completion(&kthread->exited);
- ret = k->exit_code;
+ ret = kthread->result;
put_task_struct(k);

trace_sched_kthread_stop_ret(ret);
--
2.29.2
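
The effect of patch 10 can be shown with a minimal userspace model. The model_ names are hypothetical, and the sketch leaves out the completion that kthread_stop waits on. It only shows where the result lives once the patch is applied.

```c
#include <errno.h>

/* Hypothetical stand-in for struct kthread with the new member. */
struct model_kthread {
	int result;
};

/* Hypothetical stand-in for the relevant parts of task_struct. */
struct model_task {
	long exit_code;			/* userspace-style wait status */
	struct model_kthread *kthread;
};

/* Models kthread_exit(): stash the result, then exit with code 0. */
void model_kthread_exit(struct model_task *t, long result)
{
	t->kthread->result = (int)result;
	t->exit_code = 0;		/* models do_exit(0) */
}

/* Models the tail of kthread_stop(): read the result from the kthread. */
int model_kthread_stop(const struct model_task *t)
{
	return t->kthread->result;
}
```

With the result kept in its own field, the encoding of exit_code can change for userspace tasks without affecting what kthread_stop hands back.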


2021-12-12 17:49:14

by Heiko Carstens

Subject: Re: [PATCH 01/10] exit/s390: Remove dead reference to do_exit from copy_thread

On Wed, Dec 08, 2021 at 02:25:23PM -0600, Eric W. Biederman wrote:
> My s390 assembly is not particularly good so I have read the history
> of the reference to do_exit copy_thread and have been able to
> verify that do_exit is not used.
>
> The general argument is that s390 has been changed to use the generic
> kernel_thread and kernel_execve and the generic versions do not call
> do_exit. So it is strange to see a do_exit reference sitting there.
>
> The history of the do_exit reference in s390's version of copy_thread
> seems conclusive that the do_exit reference is something that lingers
> and should have been removed several years ago.
...
> Remove this dead reference to do_exit to make it clear that s390 is
> not doing anything with do_exit in copy_thread.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> arch/s390/kernel/process.c | 1 -
> 1 file changed, 1 deletion(-)

Applied to s390 tree. Just in case you want to apply this to your tree too:
Acked-by: Heiko Carstens <[email protected]>

2021-12-13 14:51:30

by Eric W. Biederman

Subject: Re: [PATCH 01/10] exit/s390: Remove dead reference to do_exit from copy_thread

Heiko Carstens <[email protected]> writes:

> On Wed, Dec 08, 2021 at 02:25:23PM -0600, Eric W. Biederman wrote:
>> My s390 assembly is not particularly good so I have read the history
>> of the reference to do_exit copy_thread and have been able to
>> verify that do_exit is not used.
>>
>> The general argument is that s390 has been changed to use the generic
>> kernel_thread and kernel_execve and the generic versions do not call
>> do_exit. So it is strange to see a do_exit reference sitting there.
>>
>> The history of the do_exit reference in s390's version of copy_thread
>> seems conclusive that the do_exit reference is something that lingers
>> and should have been removed several years ago.
> ...
>> Remove this dead reference to do_exit to make it clear that s390 is
>> not doing anything with do_exit in copy_thread.
>>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> arch/s390/kernel/process.c | 1 -
>> 1 file changed, 1 deletion(-)
>
> Applied to s390 tree. Just in case you want to apply this to your tree too:
> Acked-by: Heiko Carstens <[email protected]>

Thank you for looking at this and confirming that I had read the code
properly and that the do_exit reference was no longer used.

I will probably take this through my tree as well just so I don't have
that trailing do_exit reference.

At this point I will give people a bit more time to review or say
something about the other changes, and if there is no negative feedback
I think I will just apply the lot.

Eric


2021-12-13 22:51:08

by Eric W. Biederman

Subject: [PATCH 0/8] signal: Cleanup of the signal->flags


The special case of SIGKILL during coredumps is very fragile today and
while reading through the code I realized I have almost broken it twice.
So this simplifies that special case, removes SIGNAL_GROUP_COREDUMP
which has become unnecessary with the addition of signal->core_state,
and this removes the helper signal_group_exit which is misnamed and
is not used properly.

If you squint very hard there might be a user space visible difference
in behavior somewhere but I don't think there is one in practice.

These patches are on top of:
https://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git/ signal-for-v5.17

After these patches have been reviewed it is my plan to apply them to my
signal-for-v5.17 branch.

Eric W. Biederman (8):
signal: Make SIGKILL during coredumps an explicit special case
signal: Drop signals received after a fatal signal has been processed
signal: Have the oom killer detect coredumps using signal->core_state
signal: During coredumps set SIGNAL_GROUP_EXIT in zap_process
signal: Remove SIGNAL_GROUP_COREDUMP
coredump: Stop setting signal->group_exit_task
signal: Rename group_exit_task group_exec_task
signal: Remove the helper signal_group_exit

fs/coredump.c | 20 +++++++++-----------
fs/exec.c | 10 +++++-----
include/linux/sched/signal.h | 18 +++---------------
kernel/exit.c | 12 ++++++++----
kernel/signal.c | 24 ++++++++++++++++--------
mm/oom_kill.c | 2 +-
6 files changed, 42 insertions(+), 44 deletions(-)

Eric

2021-12-13 22:54:41

by Eric W. Biederman

Subject: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Simplify the code that allows SIGKILL during coredumps to terminate
the coredump. As far as I can tell I have avoided breaking it
by dumb luck.

Historically with all of the other threads stopping in exit_mm the
wants_signal loop in complete_signal would find the dumper task and
then complete_signal would wake the dumper task with signal_wake_up.

After moving the coredump_task_exit above the setting of PF_EXITING in
commit 92307383082d ("coredump: Don't perform any cleanups before
dumping core") wants_signal will consider all of the threads in a
multi-threaded process for waking up, not just the core dumping task.

Luckily complete_signal short-circuits SIGKILL during a coredump and
marks every thread with SIGKILL and signal_wake_up. This code is arguably
buggy, however, as it tries to skip creating a group exit when one is
already present, and it fails to notice that a coredump is in progress.

Ever since commit 06af8679449d ("coredump: Limit what can interrupt
coredumps") was added dump_interrupted needs not just TIF_SIGPENDING
set on the dumper task but also SIGKILL set in its pending bitmap.
This means that if the code is ever fixed not to short-circuit and
kill a process after it has already been killed the special case
for SIGKILL during a coredump will be broken.

Sort all of this out by making the coredump special case more special,
and perform all of the work in prepare_signal and leave the rest of
the signal delivery path out of it.

In prepare_signal, when a coredumping process is sent SIGKILL, find
the task performing the coredump and use sigaddset and signal_wake_up
to ensure that task reports fatal_signal_pending.

Return false from prepare_signal to tell the rest of the signal
delivery path to ignore the signal.

Update wait_for_dump_helpers to perform a wait_event_killable wait
so that if signal_pending gets set spuriously the wait will not
be interrupted unless fatal_signal_pending is true.

I have tested this and verified I did not break SIGKILL during
coredumps by accident (before or after this change). I actually
thought I had and I had to figure out what I had misread that kept
SIGKILL during coredumps working.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 4 ++--
kernel/signal.c | 11 +++++++++--
2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index a6b3c196cdef..7b91fb32dbb8 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -448,7 +448,7 @@ static void coredump_finish(bool core_dumped)
static bool dump_interrupted(void)
{
/*
- * SIGKILL or freezing() interrupt the coredumping. Perhaps we
+ * SIGKILL or freezing() interrupted the coredumping. Perhaps we
* can do try_to_freeze() and check __fatal_signal_pending(),
* but then we need to teach dump_write() to restart and clear
* TIF_SIGPENDING.
@@ -471,7 +471,7 @@ static void wait_for_dump_helpers(struct file *file)
* We actually want wait_event_freezable() but then we need
* to clear TIF_SIGPENDING and improve dump_interrupted().
*/
- wait_event_interruptible(pipe->rd_wait, pipe->readers == 1);
+ wait_event_killable(pipe->rd_wait, pipe->readers == 1);

pipe_lock(pipe);
pipe->readers--;
diff --git a/kernel/signal.c b/kernel/signal.c
index 8272cac5f429..7e305a8ec7c2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -907,8 +907,15 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
sigset_t flush;

if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
- if (!(signal->flags & SIGNAL_GROUP_EXIT))
- return sig == SIGKILL;
+ struct core_state *core_state = signal->core_state;
+ if (core_state) {
+ if (sig == SIGKILL) {
+ struct task_struct *dumper = core_state->dumper.task;
+ sigaddset(&dumper->pending.signal, SIGKILL);
+ signal_wake_up(dumper, 1);
+ }
+ return false;
+ }
/*
* The process is in the middle of dying, nothing to do.
*/
--
2.29.2
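
The coredump branch added to prepare_signal above can be sketched as a user-space model. Everything prefixed `model_` is hypothetical. The flag values are copied from the diff context later in this thread. `model_fatal_signal_pending` is an assumption about how the kernel helper behaves: TIF_SIGPENDING set and SIGKILL present in the pending set. Every other branch of prepare_signal is outside this sketch.

```c
#include <assert.h>
#include <signal.h>   /* SIGKILL, SIGTERM */
#include <stdbool.h>
#include <stddef.h>

/* Flag values as they appear in include/linux/sched/signal.h in this series. */
#define MODEL_SIGNAL_GROUP_EXIT     0x00000004u
#define MODEL_SIGNAL_GROUP_COREDUMP 0x00000008u

/* Hypothetical stand-in for the task that writes the core file. */
struct model_dumper {
    bool sigkill_in_pending;  /* models SIGKILL in dumper->pending.signal */
    bool tif_sigpending;      /* models TIF_SIGPENDING set by signal_wake_up() */
};

struct model_signal {
    unsigned int flags;
    struct model_dumper *dumper;  /* non-NULL models signal->core_state set */
};

/* Assumed semantics of fatal_signal_pending() for the dumper task. */
static bool model_fatal_signal_pending(const struct model_dumper *d)
{
    return d->tif_sigpending && d->sigkill_in_pending;
}

/*
 * Models only the coredump branch from patch 1/8. Returns false when the
 * rest of the signal delivery path must ignore the signal.
 */
static bool model_prepare_signal(struct model_signal *s, int sig)
{
    if ((s->flags & (MODEL_SIGNAL_GROUP_EXIT | MODEL_SIGNAL_GROUP_COREDUMP)) &&
        s->dumper) {
        if (sig == SIGKILL) {
            s->dumper->sigkill_in_pending = true;  /* sigaddset() */
            s->dumper->tif_sigpending = true;      /* signal_wake_up() */
        }
        return false;
    }
    return true;  /* all other cases are not modelled here */
}

/* Send one signal to a coredumping process; report whether the dump is killed. */
static bool demo_coredump_signal(int sig)
{
    struct model_dumper d = { false, false };
    struct model_signal s = { MODEL_SIGNAL_GROUP_COREDUMP, &d };
    bool delivered = model_prepare_signal(&s, sig);

    assert(!delivered);  /* nothing reaches the normal delivery path */
    return model_fatal_signal_pending(&d);
}
```

In this model only SIGKILL makes the dumper report a fatal signal; any other signal is dropped without touching the dumper, which is why a killable wait in wait_for_dump_helpers is sufficient.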


2021-12-13 22:54:44

by Eric W. Biederman

Subject: [PATCH 2/8] signal: Drop signals received after a fatal signal has been processed

In 403bad72b67d ("coredump: only SIGKILL should interrupt the
coredumping task") Oleg modified the kernel to drop all signals that
come in during a coredump except SIGKILL, and suggested that it might
be a good idea to generalize that to other cases after the process has
received a fatal signal.

Semantically it does not make sense to perform any signal delivery
after the process has already been killed.

When a signal is sent while a process is dying today the signal is
placed in the signal queue by __send_signal and a single task of the
process is woken up with signal_wake_up, if there are any tasks that
have not set PF_EXITING.

Take things one step further and have prepare_signal report that all
signals that arrive after a process has been killed should be ignored,
while retaining the historical exception of allowing SIGKILL to
interrupt coredumps.

Remove the SIGNAL_GROUP_EXIT test from complete_signal, as it is no
longer possible for signal processing to reach complete_signal when
SIGNAL_GROUP_EXIT is true.

Update the comment in fs/coredump.c to make it clear coredumps are
special in being able to receive SIGKILL.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 2 +-
kernel/signal.c | 5 ++---
2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 7b91fb32dbb8..a9c25f20118f 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -352,7 +352,7 @@ static int zap_process(struct task_struct *start, int exit_code, int flags)
struct task_struct *t;
int nr = 0;

- /* ignore all signals except SIGKILL, see prepare_signal() */
+ /* Allow SIGKILL, see prepare_signal() */
start->signal->flags = SIGNAL_GROUP_COREDUMP | flags;
start->signal->group_exit_code = exit_code;
start->signal->group_stop_count = 0;
diff --git a/kernel/signal.c b/kernel/signal.c
index 7e305a8ec7c2..cdccbacac685 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -914,11 +914,11 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
sigaddset(&dumper->pending.signal, SIGKILL);
signal_wake_up(dumper, 1);
}
- return false;
}
/*
- * The process is in the middle of dying, nothing to do.
+ * The process is in the middle of dying, drop the signal.
*/
+ return false;
} else if (sig_kernel_stop(sig)) {
/*
* This is a stop signal. Remove SIGCONT from all queues.
@@ -1039,7 +1039,6 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
* then start taking the whole group down immediately.
*/
if (sig_fatal(p, sig) &&
- !(signal->flags & SIGNAL_GROUP_EXIT) &&
!sigismember(&t->real_blocked, sig) &&
(sig == SIGKILL || !p->ptrace)) {
/*
--
2.29.2


2021-12-13 22:54:47

by Eric W. Biederman

Subject: [PATCH 3/8] signal: Have the oom killer detect coredumps using signal->core_state

In preparation for removing the flag SIGNAL_GROUP_COREDUMP change
__task_will_free_mem to test signal->core_state instead of the flag
SIGNAL_GROUP_COREDUMP.

Both fields are protected by siglock and both live in signal_struct so
there are no real tradeoffs here, just a change to which field is
being tested.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
mm/oom_kill.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ddabefcfb5a..5c92aad8ca1a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -793,7 +793,7 @@ static inline bool __task_will_free_mem(struct task_struct *task)
* coredump_task_exit(), so the oom killer cannot assume that
* the process will promptly exit and release memory.
*/
- if (sig->flags & SIGNAL_GROUP_COREDUMP)
+ if (sig->core_state)
return false;

if (sig->flags & SIGNAL_GROUP_EXIT)
--
2.29.2
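
The two checks visible in the diff above can be modelled as follows. This is a sketch only: `struct model_signal` and `model_task_will_free_mem` are hypothetical, and the checks that follow these two in the real function are omitted. The point it illustrates is ordering: once a later patch in this series also sets SIGNAL_GROUP_EXIT during a coredump, the core_state test must come first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SIGNAL_GROUP_EXIT 0x00000004u

/* Hypothetical stand-in for the two signal_struct fields that matter here. */
struct model_signal {
    unsigned int flags;
    const void *core_state;   /* non-NULL while a coredump is in progress */
};

/* Models the first two checks of __task_will_free_mem() after this patch. */
static bool model_task_will_free_mem(const struct model_signal *sig)
{
    if (sig->core_state)
        return false;  /* coredumping: memory will not be released promptly */
    if (sig->flags & MODEL_SIGNAL_GROUP_EXIT)
        return true;   /* ordinary group exit */
    return false;      /* remaining checks of the real function omitted */
}

/* Build a state and ask the model what the oom killer would conclude. */
static bool demo_oom_check(bool coredumping, unsigned int flags)
{
    int state;         /* any non-NULL address serves as core_state */
    struct model_signal sig = { flags, coredumping ? &state : NULL };

    return model_task_will_free_mem(&sig);
}
```

With both a coredump in progress and SIGNAL_GROUP_EXIT set, the model still returns false, matching the claim made in patch 4/8 about oom_kill.c.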


2021-12-13 22:54:51

by Eric W. Biederman

Subject: [PATCH 4/8] signal: During coredumps set SIGNAL_GROUP_EXIT in zap_process

There are only a few places that test SIGNAL_GROUP_EXIT and
are not also already testing SIGNAL_GROUP_COREDUMP.

This will not affect the callers of signal_group_exit as zap_process
also sets group_exit_task so signal_group_exit will continue to return
true at the same times.

This does not affect wait_task_zombie as none of the threads
wind up in EXIT_ZOMBIE state during a coredump.

This does not affect oom_kill.c:__task_will_free_mem as
sig->core_state is tested and handled before SIGNAL_GROUP_EXIT is
tested for.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index a9c25f20118f..5e5a90de7be3 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -347,13 +347,13 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
return ispipe;
}

-static int zap_process(struct task_struct *start, int exit_code, int flags)
+static int zap_process(struct task_struct *start, int exit_code)
{
struct task_struct *t;
int nr = 0;

/* Allow SIGKILL, see prepare_signal() */
- start->signal->flags = SIGNAL_GROUP_COREDUMP | flags;
+ start->signal->flags = SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP;
start->signal->group_exit_code = exit_code;
start->signal->group_stop_count = 0;

@@ -378,7 +378,7 @@ static int zap_threads(struct task_struct *tsk,
if (!signal_group_exit(tsk->signal)) {
tsk->signal->core_state = core_state;
tsk->signal->group_exit_task = tsk;
- nr = zap_process(tsk, exit_code, 0);
+ nr = zap_process(tsk, exit_code);
clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
tsk->flags |= PF_DUMPCORE;
atomic_set(&core_state->nr_threads, nr);
--
2.29.2
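
The claim that signal_group_exit keeps returning true at the same times can be checked with a small model. The helper body is taken from the definition shown in patch 7/8 of this thread. The `model_*` names and the stage enum are hypothetical, and each stage only records which fields the coredump code sets at that point in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SIGNAL_GROUP_EXIT     0x00000004u
#define MODEL_SIGNAL_GROUP_COREDUMP 0x00000008u

struct model_signal {
    unsigned int flags;
    const void *group_exit_task;
};

/* Same test as the signal_group_exit() helper quoted later in the thread. */
static bool model_signal_group_exit(const struct model_signal *sig)
{
    return (sig->flags & MODEL_SIGNAL_GROUP_EXIT) ||
           (sig->group_exit_task != NULL);
}

/* Hypothetical labels for three points in the series. */
enum model_stage { BEFORE_PATCH_4, AFTER_PATCH_4, AFTER_PATCH_5_AND_6 };

/* What zap_threads()/zap_process() write when a coredump starts. */
static void model_start_coredump(struct model_signal *sig, const void *tsk,
                                 enum model_stage stage)
{
    switch (stage) {
    case BEFORE_PATCH_4:
        sig->group_exit_task = tsk;
        sig->flags = MODEL_SIGNAL_GROUP_COREDUMP;  /* flags argument was 0 */
        break;
    case AFTER_PATCH_4:
        sig->group_exit_task = tsk;
        sig->flags = MODEL_SIGNAL_GROUP_EXIT | MODEL_SIGNAL_GROUP_COREDUMP;
        break;
    case AFTER_PATCH_5_AND_6:
        sig->flags = MODEL_SIGNAL_GROUP_EXIT;      /* task pointer no longer set */
        break;
    }
}

/* Start a coredump at the given stage and evaluate the helper. */
static bool demo_helper_during_coredump(enum model_stage stage)
{
    int dumper;  /* any non-NULL address serves as the dumping task */
    struct model_signal sig = { 0, NULL };

    model_start_coredump(&sig, &dumper, stage);
    return model_signal_group_exit(&sig);
}
```

At all three stages the helper reports true during a coredump. Before this patch that is due to group_exit_task; afterwards the flag alone is enough, which is what lets patch 6/8 stop setting the task pointer.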


2021-12-13 22:54:55

by Eric W. Biederman

Subject: [PATCH 6/8] coredump: Stop setting signal->group_exit_task

Currently the coredump code sets group_exit_task so that
signal_group_exit() will return true during a coredump. Now that the
coredump code always sets SIGNAL_GROUP_EXIT there is no longer a need
to set signal->group_exit_task.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 2c9d16d4b57a..ef56595a0d87 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -377,7 +377,6 @@ static int zap_threads(struct task_struct *tsk,
spin_lock_irq(&tsk->sighand->siglock);
if (!signal_group_exit(tsk->signal)) {
tsk->signal->core_state = core_state;
- tsk->signal->group_exit_task = tsk;
nr = zap_process(tsk, exit_code);
clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
tsk->flags |= PF_DUMPCORE;
@@ -426,7 +425,6 @@ static void coredump_finish(bool core_dumped)
spin_lock_irq(&current->sighand->siglock);
if (core_dumped && !__fatal_signal_pending(current))
current->signal->group_exit_code |= 0x80;
- current->signal->group_exit_task = NULL;
next = current->signal->core_state->dumper.next;
current->signal->core_state = NULL;
spin_unlock_irq(&current->sighand->siglock);
--
2.29.2


2021-12-13 22:54:58

by Eric W. Biederman

Subject: [PATCH 5/8] signal: Remove SIGNAL_GROUP_COREDUMP

After the previous cleanups "signal->core_state" is set whenever
SIGNAL_GROUP_COREDUMP is set and "signal->core_state" is tested
whenever the code wants to know if a coredump is in progress. The
remaining tests of SIGNAL_GROUP_COREDUMP also test to see if
SIGNAL_GROUP_EXIT is set. Similarly the only place that sets
SIGNAL_GROUP_COREDUMP also sets SIGNAL_GROUP_EXIT.

This makes SIGNAL_GROUP_COREDUMP unnecessary and redundant, so
stop setting SIGNAL_GROUP_COREDUMP, stop testing SIGNAL_GROUP_COREDUMP,
and remove its definition, making the code slightly simpler.

With the setting of SIGNAL_GROUP_COREDUMP gone coredump_finish no
longer needs to clear SIGNAL_GROUP_COREDUMP out of signal->flags
by setting SIGNAL_GROUP_EXIT.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 3 +--
include/linux/sched/signal.h | 3 +--
kernel/signal.c | 2 +-
3 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 5e5a90de7be3..2c9d16d4b57a 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -353,7 +353,7 @@ static int zap_process(struct task_struct *start, int exit_code)
int nr = 0;

/* Allow SIGKILL, see prepare_signal() */
- start->signal->flags = SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP;
+ start->signal->flags = SIGNAL_GROUP_EXIT;
start->signal->group_exit_code = exit_code;
start->signal->group_stop_count = 0;

@@ -427,7 +427,6 @@ static void coredump_finish(bool core_dumped)
if (core_dumped && !__fatal_signal_pending(current))
current->signal->group_exit_code |= 0x80;
current->signal->group_exit_task = NULL;
- current->signal->flags = SIGNAL_GROUP_EXIT;
next = current->signal->core_state->dumper.next;
current->signal->core_state = NULL;
spin_unlock_irq(&current->sighand->siglock);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index fa26d2a58413..ecc10e148799 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -256,7 +256,6 @@ struct signal_struct {
#define SIGNAL_STOP_STOPPED 0x00000001 /* job control stop in effect */
#define SIGNAL_STOP_CONTINUED 0x00000002 /* SIGCONT since WCONTINUED reap */
#define SIGNAL_GROUP_EXIT 0x00000004 /* group exit in progress */
-#define SIGNAL_GROUP_COREDUMP 0x00000008 /* coredump in progress */
/*
* Pending notifications to parent.
*/
@@ -272,7 +271,7 @@ struct signal_struct {
static inline void signal_set_stop_flags(struct signal_struct *sig,
unsigned int flags)
{
- WARN_ON(sig->flags & (SIGNAL_GROUP_EXIT|SIGNAL_GROUP_COREDUMP));
+ WARN_ON(sig->flags & SIGNAL_GROUP_EXIT);
sig->flags = (sig->flags & ~SIGNAL_STOP_MASK) | flags;
}

diff --git a/kernel/signal.c b/kernel/signal.c
index cdccbacac685..9eb3e2c1f9f7 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -906,7 +906,7 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
struct task_struct *t;
sigset_t flush;

- if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
+ if (signal->flags & SIGNAL_GROUP_EXIT) {
struct core_state *core_state = signal->core_state;
if (core_state) {
if (sig == SIGKILL) {
--
2.29.2


2021-12-13 22:55:01

by Eric W. Biederman

Subject: [PATCH 7/8] signal: Rename group_exit_task group_exec_task

The only remaining user of group_exit_task is exec. Rename the field
so that it is clear which part of the code uses it.

Update the comment above the definition of group_exec_task
to document how it is currently used.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/exec.c | 8 ++++----
include/linux/sched/signal.h | 12 ++++--------
kernel/exit.c | 4 ++--
3 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 59cac7c18178..9d2925811011 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1054,7 +1054,7 @@ static int de_thread(struct task_struct *tsk)
return -EAGAIN;
}

- sig->group_exit_task = tsk;
+ sig->group_exec_task = tsk;
sig->notify_count = zap_other_threads(tsk);
if (!thread_group_leader(tsk))
sig->notify_count--;
@@ -1082,7 +1082,7 @@ static int de_thread(struct task_struct *tsk)
write_lock_irq(&tasklist_lock);
/*
* Do this under tasklist_lock to ensure that
- * exit_notify() can't miss ->group_exit_task
+ * exit_notify() can't miss ->group_exec_task
*/
sig->notify_count = -1;
if (likely(leader->exit_state))
@@ -1149,7 +1149,7 @@ static int de_thread(struct task_struct *tsk)
release_task(leader);
}

- sig->group_exit_task = NULL;
+ sig->group_exec_task = NULL;
sig->notify_count = 0;

no_thread_group:
@@ -1162,7 +1162,7 @@ static int de_thread(struct task_struct *tsk)
killed:
/* protects against exit_notify() and __exit_signal() */
read_lock(&tasklist_lock);
- sig->group_exit_task = NULL;
+ sig->group_exec_task = NULL;
sig->notify_count = 0;
read_unlock(&tasklist_lock);
return -EAGAIN;
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index ecc10e148799..d3248aba5183 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -109,13 +109,9 @@ struct signal_struct {

/* thread group exit support */
int group_exit_code;
- /* overloaded:
- * - notify group_exit_task when ->count is equal to notify_count
- * - everyone except group_exit_task is stopped during signal delivery
- * of fatal signals, group_exit_task processes the signal.
- */
+ /* notify group_exec_task when notify_count is less or equal to 0 */
int notify_count;
- struct task_struct *group_exit_task;
+ struct task_struct *group_exec_task;

/* thread group stop support, overloads group_exit_code too */
int group_stop_count;
@@ -275,11 +271,11 @@ static inline void signal_set_stop_flags(struct signal_struct *sig,
sig->flags = (sig->flags & ~SIGNAL_STOP_MASK) | flags;
}

-/* If true, all threads except ->group_exit_task have pending SIGKILL */
+/* If true, all threads except ->group_exec_task have pending SIGKILL */
static inline int signal_group_exit(const struct signal_struct *sig)
{
return (sig->flags & SIGNAL_GROUP_EXIT) ||
- (sig->group_exit_task != NULL);
+ (sig->group_exec_task != NULL);
}

extern void flush_signals(struct task_struct *);
diff --git a/kernel/exit.c b/kernel/exit.c
index 6c4b04531f17..527c5e4430ae 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -116,7 +116,7 @@ static void __exit_signal(struct task_struct *tsk)
* then notify it:
*/
if (sig->notify_count > 0 && !--sig->notify_count)
- wake_up_process(sig->group_exit_task);
+ wake_up_process(sig->group_exec_task);

if (tsk == sig->curr_target)
sig->curr_target = next_thread(tsk);
@@ -697,7 +697,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)

/* mt-exec, de_thread() is waiting for group leader */
if (unlikely(tsk->signal->notify_count < 0))
- wake_up_process(tsk->signal->group_exit_task);
+ wake_up_process(tsk->signal->group_exec_task);
write_unlock_irq(&tasklist_lock);

list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
--
2.29.2
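
The notify_count protocol that group_exec_task takes part in can be sketched in user space. The two wake-up conditions are copied from the __exit_signal and exit_notify hunks above. The struct, the `model_*` functions, and the loop bound in the demo are hypothetical, and wake_up_process is reduced to a boolean return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two signal_struct fields used by exec. */
struct model_signal {
    int notify_count;
    const void *group_exec_task;  /* the thread performing exec, or NULL */
};

/*
 * Condition from __exit_signal(): each exiting thread decrements a positive
 * notify_count, and the one that brings it to zero wakes group_exec_task.
 */
static bool model_exit_signal_wakes(struct model_signal *sig)
{
    return sig->notify_count > 0 && !--sig->notify_count;
}

/*
 * Condition from exit_notify(): a negative notify_count means de_thread()
 * is waiting for the group leader, so the leader's exit wakes it.
 */
static bool model_exit_notify_wakes(const struct model_signal *sig)
{
    return sig->notify_count < 0;
}

/* Count how many thread exits it takes before the exec thread is woken. */
static int demo_exits_until_wake(int other_threads)
{
    int exec_thread;  /* any non-NULL address serves as the exec task */
    struct model_signal sig = { other_threads, &exec_thread };
    int exits = 0;

    while (exits < 1000) {  /* arbitrary bound for the demo */
        exits++;
        if (model_exit_signal_wakes(&sig))
            return exits;
    }
    return -1;  /* never woken */
}

/* The leader-wait phase, where de_thread() has set notify_count to -1. */
static bool demo_leader_exit_wakes(void)
{
    int exec_thread;
    struct model_signal sig = { -1, &exec_thread };

    return model_exit_notify_wakes(&sig);
}
```

Both wake-ups target the exec thread only, which is the reason the new field name fits better than group_exit_task.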


2021-12-13 22:55:04

by Eric W. Biederman

Subject: [PATCH 8/8] signal: Remove the helper signal_group_exit

This helper is misleading. It tests for an ongoing exec as well as
the process having received a fatal signal.

Sometimes it is appropriate to treat an on-going exec differently than
a process that is shutting down due to a fatal signal. In particular
taking the fast path out of exit_signals instead of retargeting
signals is not appropriate during exec, and not changing the exit
code in do_group_exit during exec.

Removing the helper makes it more obvious what is going on, as both
cases must be coded for explicitly.

While removing the helper, fix the two cases where I have observed that
using signal_group_exit produced the wrong result.

For the unset exit_code in do_group_exit during an exec I use 0 as I
think that is what group_exit_code has been set to most of the time.
During a thread group stop group_exit_code is set to the stop signal
and when the thread group receives SIGCONT group_exit_code is reset to
0.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 5 +++--
fs/exec.c | 2 +-
include/linux/sched/signal.h | 7 -------
kernel/exit.c | 8 ++++++--
kernel/signal.c | 8 +++++---
5 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index ef56595a0d87..09302a6a0d80 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -372,11 +372,12 @@ static int zap_process(struct task_struct *start, int exit_code)
static int zap_threads(struct task_struct *tsk,
struct core_state *core_state, int exit_code)
{
+ struct signal_struct *signal = tsk->signal;
int nr = -EAGAIN;

spin_lock_irq(&tsk->sighand->siglock);
- if (!signal_group_exit(tsk->signal)) {
- tsk->signal->core_state = core_state;
+ if (!(signal->flags & SIGNAL_GROUP_EXIT) && !signal->group_exec_task) {
+ signal->core_state = core_state;
nr = zap_process(tsk, exit_code);
clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
tsk->flags |= PF_DUMPCORE;
diff --git a/fs/exec.c b/fs/exec.c
index 9d2925811011..82db656ca709 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1045,7 +1045,7 @@ static int de_thread(struct task_struct *tsk)
* Kill all other threads in the thread group.
*/
spin_lock_irq(lock);
- if (signal_group_exit(sig)) {
+ if ((sig->flags & SIGNAL_GROUP_EXIT) || sig->group_exec_task) {
/*
* Another group action in progress, just
* return so that the signal is processed.
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index d3248aba5183..b6ecb9fc4cd2 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -271,13 +271,6 @@ static inline void signal_set_stop_flags(struct signal_struct *sig,
sig->flags = (sig->flags & ~SIGNAL_STOP_MASK) | flags;
}

-/* If true, all threads except ->group_exec_task have pending SIGKILL */
-static inline int signal_group_exit(const struct signal_struct *sig)
-{
- return (sig->flags & SIGNAL_GROUP_EXIT) ||
- (sig->group_exec_task != NULL);
-}
-
extern void flush_signals(struct task_struct *);
extern void ignore_signals(struct task_struct *);
extern void flush_signal_handlers(struct task_struct *, int force_default);
diff --git a/kernel/exit.c b/kernel/exit.c
index 527c5e4430ae..e7104f803be0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -907,15 +907,19 @@ do_group_exit(int exit_code)

BUG_ON(exit_code & 0x80); /* core dumps don't get here */

- if (signal_group_exit(sig))
+ if (sig->flags & SIGNAL_GROUP_EXIT)
exit_code = sig->group_exit_code;
+ else if (sig->group_exec_task)
+ exit_code = 0;
else if (!thread_group_empty(current)) {
struct sighand_struct *const sighand = current->sighand;

spin_lock_irq(&sighand->siglock);
- if (signal_group_exit(sig))
+ if (sig->flags & SIGNAL_GROUP_EXIT)
/* Another thread got here before we took the lock. */
exit_code = sig->group_exit_code;
+ else if (sig->group_exec_task)
+ exit_code = 0;
else {
sig->group_exit_code = exit_code;
sig->flags = SIGNAL_GROUP_EXIT;
diff --git a/kernel/signal.c b/kernel/signal.c
index 9eb3e2c1f9f7..860d844542b2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2392,7 +2392,8 @@ static bool do_signal_stop(int signr)
WARN_ON_ONCE(signr & ~JOBCTL_STOP_SIGMASK);

if (!likely(current->jobctl & JOBCTL_STOP_DEQUEUED) ||
- unlikely(signal_group_exit(sig)))
+ unlikely(sig->flags & SIGNAL_GROUP_EXIT) ||
+ unlikely(sig->group_exec_task))
return false;
/*
* There is no group stop already in progress. We must
@@ -2699,7 +2700,8 @@ bool get_signal(struct ksignal *ksig)
enum pid_type type;

/* Has this task already been marked for death? */
- if (signal_group_exit(signal)) {
+ if ((signal->flags & SIGNAL_GROUP_EXIT) ||
+ signal->group_exec_task) {
ksig->info.si_signo = signr = SIGKILL;
sigdelset(&current->pending.signal, SIGKILL);
trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
@@ -2955,7 +2957,7 @@ void exit_signals(struct task_struct *tsk)
*/
cgroup_threadgroup_change_begin(tsk);

- if (thread_group_empty(tsk) || signal_group_exit(tsk->signal)) {
+ if (thread_group_empty(tsk) || (tsk->signal->flags & SIGNAL_GROUP_EXIT)) {
tsk->flags |= PF_EXITING;
cgroup_threadgroup_change_end(tsk);
return;
--
2.29.2
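
The exit code selection in the patched do_group_exit can be sketched as a user-space model. It is a simplification under stated assumptions: the names are hypothetical, the siglock and the thread_group_empty test are left out, so the sketch covers the multi-threaded path only, and zapping the other threads is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SIGNAL_GROUP_EXIT 0x00000004u

/* Hypothetical stand-in for the signal_struct fields do_group_exit() reads. */
struct model_signal {
    unsigned int flags;
    int group_exit_code;
    const void *group_exec_task;  /* non-NULL while an exec is in progress */
};

/* Returns the exit code the calling thread would pass on to do_exit(). */
static int model_group_exit_code(struct model_signal *sig, int exit_code)
{
    if (sig->flags & MODEL_SIGNAL_GROUP_EXIT)
        return sig->group_exit_code;  /* a group exit already chose the code */
    if (sig->group_exec_task)
        return 0;                     /* exec in progress: the patch uses 0 */

    /* This thread starts the group exit and records its own code. */
    sig->group_exit_code = exit_code;
    sig->flags = MODEL_SIGNAL_GROUP_EXIT;
    return exit_code;
}

/* Build a state, request an exit, and return the code that would be used. */
static int demo_group_exit(unsigned int flags, int stored_code,
                           bool exec_in_progress, int requested)
{
    int exec_thread;  /* any non-NULL address serves as the exec task */
    struct model_signal sig = {
        flags, stored_code, exec_in_progress ? &exec_thread : NULL
    };

    return model_group_exit_code(&sig, requested);
}
```

With the old helper the first two branches were one test, so a thread exiting during another thread's exec picked up whatever group_exit_code happened to hold. In the model the exec case yields 0 regardless of the stored value.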


2021-12-22 18:19:16

by Nathan Chancellor

Subject: Re: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads

Hi Eric,

On Wed, Dec 08, 2021 at 02:25:31PM -0600, Eric W. Biederman wrote:
> Today the rules are a bit iffy and arbitrary about which kernel
> threads have struct kthread present. Both idle threads and thread
> started with create_kthread want struct kthread present so that is
> effectively all kernel threads. Make the rule that if PF_KTHREAD
> and the task is running then struct kthread is present.
>
> This will allow the kernel thread code to using tsk->exit_code
> with different semantics from ordinary processes.
>
> To make ensure that struct kthread is present for all
> kernel threads move it's allocation into copy_process.
>
> Add a deallocation of struct kthread in exec for processes
> that were kernel threads.
>
> Move the allocation of struct kthread for the initial thread
> earlier so that it is not repeated for each additional idle
> thread.
>
> Move the initialization of struct kthread into set_kthread_struct
> so that the structure is always and reliably initailized.
>
> Clear set_child_tid in free_kthread_struct to ensure the kthread
> struct is reliably freed during exec. The function
> free_kthread_struct does not need to clear vfork_done during exec as
> exec_mm_release called from exec_mmap has already cleared vfork_done.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

This patch as commit 40966e316f86 ("kthread: Ensure struct kthread is
present for all kthreads") in -next causes an ARCH=arm
multi_v5_defconfig kernel to fail to boot in QEMU. I had to apply commit
6692c98c7df5 ("fork: Stop protecting back_fork_cleanup_cgroup_lock with
CONFIG_NUMA") to get it to build and I applied commit dd621ee0cf8e
("kthread: Warn about failed allocations for the init kthread") to avoid
the known runtime warning.

$ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- distclean multi_v5_defconfig all

$ qemu-system-arm \
-initrd rootfs.cpio \
-append earlycon \
-machine palmetto-bmc \
-no-reboot \
-dtb arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dtb \
-display none \
-kernel arch/arm/boot/zImage \
-m 512m \
-nodefaults \
-serial mon:stdio
qemu-system-arm: warning: nic ftgmac100.0 has no peer
qemu-system-arm: warning: nic ftgmac100.1 has no peer
Booting Linux on physical CPU 0x0
Linux version 5.16.0-rc1-00016-g40966e316f86-dirty ([email protected]) (arm-linux-gnueabi-gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 PREEMPT Wed Dec 22 18:08:53 UTC 2021
CPU: ARM926EJ-S [41069265] revision 5 (ARMv5TEJ), cr=00093177
CPU: VIVT data cache, VIVT instruction cache
OF: fdt: Machine model: Palmetto BMC
earlycon: ns16550a0 at MMIO 0x1e784000 (options '')
printk: bootconsole [ns16550a0] enabled
Memory policy: Data cache writethrough
cma: Reserved 16 MiB at 0x5b000000
Zone ranges:
DMA [mem 0x0000000040000000-0x000000005edfffff]
Normal empty
HighMem [mem 0x000000005ee00000-0x000000005fffffff]
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x0000000040000000-0x000000005bffffff]
node 0: [mem 0x000000005c000000-0x000000005dffffff]
node 0: [mem 0x000000005e000000-0x000000005edfffff]
node 0: [mem 0x000000005ee00000-0x000000005fffffff]
Initmem setup node 0 [mem 0x0000000040000000-0x000000005fffffff]
Built 1 zonelists, mobility grouping on. Total pages: 130084
Kernel command line: earlycon
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
mem auto-init: stack:off, heap alloc:off, heap free:off
Memory: 433140K/524288K available (9628K kernel code, 2019K rwdata, 2368K rodata, 340K init, 661K bss, 74764K reserved, 16384K cma-reserved, 0K highmem)
SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
rcu: Preemptible hierarchical RCU implementation.
rcu: RCU event tracing is enabled.
Trampoline variant of Tasks RCU enabled.
rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
i2c controller registered, irq 16
random: get_random_bytes called from start_kernel+0x408/0x624 with crng_init=0
clocksource: FTTMR010-TIMER2: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635851949 ns
sched_clock: 32 bits at 24MHz, resolution 41ns, wraps every 89478484971ns
Switching to timer-based delay loop, resolution 41ns
Console: colour dummy device 80x30
printk: console [tty0] enabled
printk: bootconsole [ns16550a0] disabled

After that, it just hangs.

The rootfs is available at https://github.com/ClangBuiltLinux/boot-utils
in the images/arm folder.

If there is any more information that I can provide or changes to test,
please let me know.

Cheers,
Nathan

2021-12-22 18:31:11

by Eric W. Biederman

Subject: Re: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads

Nathan Chancellor <[email protected]> writes:

> Hi Eric,
>
> On Wed, Dec 08, 2021 at 02:25:31PM -0600, Eric W. Biederman wrote:
>> Today the rules are a bit iffy and arbitrary about which kernel
>> threads have struct kthread present. Both idle threads and threads
>> started with create_kthread want struct kthread present so that is
>> effectively all kernel threads. Make the rule that if PF_KTHREAD
>> and the task is running then struct kthread is present.
>>
>> This will allow the kernel thread code to use tsk->exit_code
>> with different semantics from ordinary processes.
>>
>> To ensure that struct kthread is present for all
>> kernel threads move its allocation into copy_process.
>>
>> Add a deallocation of struct kthread in exec for processes
>> that were kernel threads.
>>
>> Move the allocation of struct kthread for the initial thread
>> earlier so that it is not repeated for each additional idle
>> thread.
>>
>> Move the initialization of struct kthread into set_kthread_struct
>> so that the structure is always and reliably initialized.
>>
>> Clear set_child_tid in free_kthread_struct to ensure the kthread
>> struct is reliably freed during exec. The function
>> free_kthread_struct does not need to clear vfork_done during exec as
>> exec_mm_release called from exec_mmap has already cleared vfork_done.
>>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>
> This patch as commit 40966e316f86 ("kthread: Ensure struct kthread is
> present for all kthreads") in -next causes an ARCH=arm
> multi_v5_defconfig kernel to fail to boot in QEMU. I had to apply commit
> 6692c98c7df5 ("fork: Stop protecting back_fork_cleanup_cgroup_lock with
> CONFIG_NUMA") to get it to build and I applied commit dd621ee0cf8e
> ("kthread: Warn about failed allocations for the init kthread") to avoid
> the known runtime warning.
>
> $ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- distclean multi_v5_defconfig all
>
> $ qemu-system-arm \
> -initrd rootfs.cpio \
> -append earlycon \
> -machine palmetto-bmc \
> -no-reboot \
> -dtb arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dtb \
> -display none \
> -kernel arch/arm/boot/zImage \
> -m 512m \
> -nodefaults \
> -serial mon:stdio
> qemu-system-arm: warning: nic ftgmac100.0 has no peer
> qemu-system-arm: warning: nic ftgmac100.1 has no peer
> Booting Linux on physical CPU 0x0
> Linux version 5.16.0-rc1-00016-g40966e316f86-dirty ([email protected]) (arm-linux-gnueabi-gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 PREEMPT Wed Dec 22 18:08:53 UTC 2021
> CPU: ARM926EJ-S [41069265] revision 5 (ARMv5TEJ), cr=00093177
> CPU: VIVT data cache, VIVT instruction cache
> OF: fdt: Machine model: Palmetto BMC
> earlycon: ns16550a0 at MMIO 0x1e784000 (options '')
> printk: bootconsole [ns16550a0] enabled
> Memory policy: Data cache writethrough
> cma: Reserved 16 MiB at 0x5b000000
> Zone ranges:
> DMA [mem 0x0000000040000000-0x000000005edfffff]
> Normal empty
> HighMem [mem 0x000000005ee00000-0x000000005fffffff]
> Movable zone start for each node
> Early memory node ranges
> node 0: [mem 0x0000000040000000-0x000000005bffffff]
> node 0: [mem 0x000000005c000000-0x000000005dffffff]
> node 0: [mem 0x000000005e000000-0x000000005edfffff]
> node 0: [mem 0x000000005ee00000-0x000000005fffffff]
> Initmem setup node 0 [mem 0x0000000040000000-0x000000005fffffff]
> Built 1 zonelists, mobility grouping on. Total pages: 130084
> Kernel command line: earlycon
> Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
> Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
> mem auto-init: stack:off, heap alloc:off, heap free:off
> Memory: 433140K/524288K available (9628K kernel code, 2019K rwdata, 2368K rodata, 340K init, 661K bss, 74764K reserved, 16384K cma-reserved, 0K highmem)
> SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
> rcu: Preemptible hierarchical RCU implementation.
> rcu: RCU event tracing is enabled.
> Trampoline variant of Tasks RCU enabled.
> rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
> NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
> i2c controller registered, irq 16
> random: get_random_bytes called from start_kernel+0x408/0x624 with crng_init=0
> clocksource: FTTMR010-TIMER2: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635851949 ns
> sched_clock: 32 bits at 24MHz, resolution 41ns, wraps every 89478484971ns
> Switching to timer-based delay loop, resolution 41ns
> Console: colour dummy device 80x30
> printk: console [tty0] enabled
> printk: bootconsole [ns16550a0] disabled
>
> After that, it just hangs.
>
> The rootfs is available at https://github.com/ClangBuiltLinux/boot-utils
> in the images/arm folder.
>
> If there is any more information that I can provide or changes to test,
> please let me know.

Well crap. I hate to hear my code is causing problems like this.

This is however a very good bug report, which I very much appreciate.

I think I have enough information. I will see if I can reproduce this
and track down what is happening.

Have you by any chance tried linux-next with just these changes backed
out?

Eric


2021-12-22 18:46:50

by Nathan Chancellor

Subject: Re: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads

On Wed, Dec 22, 2021 at 12:30:57PM -0600, Eric W. Biederman wrote:
> Well crap. I hate to hear my code is causing problems like this.
>
> This is however a very good bug report, which I very much appreciate.
>
> I think I have enough information. I will see if I can reproduce this
> and track down what is happening.
>
> Have you by any chance tried linux-next with just these changes backed
> out?

Yes, if I back out the following commits on top of next-20211222 then
the kernel boots right up.

dd621ee0cf8e ("kthread: Warn about failed allocations for the init kthread")
ff8288ff475e ("fork: Rename bad_fork_cleanup_threadgroup_lock to bad_fork_cleanup_delayacct")
6692c98c7df5 ("fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA")
1fb466dff904 ("objtool: Add a missing comma to avoid string concatenation")
5eb6f22823e0 ("exit/kthread: Fix the kerneldoc comment for kthread_complete_and_exit")
6b1248798eb6 ("exit/kthread: Move the exit code for kernel threads into struct kthread")
40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads")

Cheers,
Nathan

2021-12-22 23:25:24

by Eric W. Biederman

Subject: Re: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads

I have managed to reproduce the problem, fix it, and verify my fix.
Please see below.


Subject: [PATCH] kthread: Never put_user the set_child_tid address

Kernel threads abuse set_child_tid. Historically that has been fine
as set_child_tid was initialized after the kernel thread had been
forked. Unfortunately storing struct kthread in set_child_tid after
the thread is running makes struct kthread unusable for storing
result codes of the thread.

When set_child_tid is set to struct kthread during fork that results
in schedule_tail writing the thread id to the beginning of struct
kthread (if put_user does not realize it is a kernel address).

Solve this by skipping the put_user for all kthreads.

Reported-by: Nathan Chancellor <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
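For illustration only, the failure mode described above can be modelled
in user space. This is a minimal sketch, not kernel code: struct
fake_task, struct fake_kthread, kthread_survives() and
user_tid_written() are hypothetical stand-ins, the PF_KTHREAD value is
just a marker bit for the sketch, memcpy() stands in for put_user(), and
the sketch assumes the write is not rejected as a kernel address.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PF_KTHREAD 0x1u /* marker bit for this sketch only */

/* Hypothetical stand-in; the real struct kthread has a different layout. */
struct fake_kthread {
	unsigned long flags; /* first field: what a stray tid write clobbers */
	int result;
};

/* Hypothetical stand-in for the few task_struct fields that matter here. */
struct fake_task {
	unsigned int flags;
	int tid;
	void *set_child_tid; /* for a kthread this aliases struct fake_kthread */
};

/* Old check: write the tid whenever set_child_tid is non-NULL. */
void schedule_tail_old(struct fake_task *t)
{
	if (t->set_child_tid)
		memcpy(t->set_child_tid, &t->tid, sizeof(t->tid));
}

/* Patched check: never write the tid for kernel threads. */
void schedule_tail_new(struct fake_task *t)
{
	if (!(t->flags & PF_KTHREAD) && t->set_child_tid)
		memcpy(t->set_child_tid, &t->tid, sizeof(t->tid));
}

/* True if the fake struct kthread is left untouched by the given tail. */
bool kthread_survives(void (*tail)(struct fake_task *))
{
	struct fake_kthread k = { .flags = 0, .result = 0 };
	struct fake_task t = {
		.flags = PF_KTHREAD, .tid = 42, .set_child_tid = &k,
	};

	assert(tail != NULL);
	tail(&t);
	return k.flags == 0 && k.result == 0;
}

/* Value left in an ordinary task's tid slot by the given tail. */
int user_tid_written(void (*tail)(struct fake_task *))
{
	int slot = 0;
	struct fake_task t = { .flags = 0, .tid = 42, .set_child_tid = &slot };

	assert(tail != NULL);
	tail(&t);
	return slot;
}
```

Under these assumptions kthread_survives(schedule_tail_old) is false,
because the tid lands in the first bytes of the fake struct kthread,
while kthread_survives(schedule_tail_new) is true and
user_tid_written(schedule_tail_new) still returns the tid.
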
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee222b89c692..d8adbea77be1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4908,7 +4908,7 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
finish_task_switch(prev);
preempt_enable();

- if (current->set_child_tid)
+ if (!(current->flags & PF_KTHREAD) && current->set_child_tid)
put_user(task_pid_vnr(current), current->set_child_tid);

calculate_sigpending();
--
2.29.2


Eric

2021-12-23 00:37:29

by Nathan Chancellor

Subject: Re: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads

On Wed, Dec 22, 2021 at 05:22:45PM -0600, Eric W. Biederman wrote:
> I have managed to reproduce, fix and verify my fix, please
> see below.
>
>
> Subject: [PATCH] kthread: Never put_user the set_child_tid address
>
> Kernel threads abuse set_child_tid. Historically that has been fine
> as set_child_tid was initialized after the kernel thread had been
> forked. Unfortunately storing struct kthread in set_child_tid after
> the thread is running makes struct kthread unusable for storing
> result codes of the thread.
>
> When set_child_tid is set to struct kthread during fork that results
> in schedule_tail writing the thread id to the beginning of struct
> kthread (if put_user does not realize it is a kernel address).
>
> Solve this by skipping the put_user for all kthreads.
>
> Reported-by: Nathan Chancellor <[email protected]>
> Link: https://lkml.kernel.org/r/[email protected]
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Thanks a lot for the quick fix. I can confirm that it resolves the
failure on my side.

Tested-by: Nathan Chancellor <[email protected]>

> ---
> kernel/sched/core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ee222b89c692..d8adbea77be1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4908,7 +4908,7 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
> finish_task_switch(prev);
> preempt_enable();
>
> - if (current->set_child_tid)
> + if (!(current->flags & PF_KTHREAD) && current->set_child_tid)
> put_user(task_pid_vnr(current), current->set_child_tid);
>
> calculate_sigpending();
> --
> 2.29.2
>
>
> Eric

2021-12-23 01:44:47

by Linus Torvalds

Subject: Re: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads

On Wed, Dec 22, 2021 at 3:25 PM Eric W. Biederman <[email protected]> wrote:
>
> Solve this by skipping the put_user for all kthreads.

Ugh.

While this fixes the problem, could we please just not mis-use that
'set_child_tid' as that kthread pointer any more?

It was always kind of hacky. I think a new pointer with the proper
'struct kthread *' type would be an improvement.

One of the "arguments" in the comment for re-using that set_child_tid
pointer was that 'fork()' used to not wrongly copy it, but your patch
literally now does that "allocate new kthread struct" at fork-time, so
that argument is actually bogus now.

Linus

2021-12-23 03:35:22

by Eric W. Biederman

Subject: Re: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads


Added a couple of people from the vhost thread.

Linus Torvalds <[email protected]> writes:

> On Wed, Dec 22, 2021 at 3:25 PM Eric W. Biederman <[email protected]> wrote:
>>
>> Solve this by skipping the put_user for all kthreads.
>
> Ugh.
>
> While this fixes the problem, could we please just not mis-use that
> 'set_child_tid' as that kthread pointer any more?
>
> It was always kind of hacky. I think a new pointer with the proper
> 'struct kthread *' type would be an improvement.
>
> One of the "arguments" in the comment for re-using that set_child_tid
> pointer was that 'fork()' used to not wrongly copy it, but your patch
> literally now does that "allocate new kthread struct" at fork-time, so
> that argument is actually bogus now.

I agree. I think I saw in the recent vhost patches that were
generalizing create_io_thread that the pf_io_worker field of
struct task_struct was being generalized as well.

If so I think it makes sense just to take that approach.

Just build some basic infrastructure that can be used for io_workers,
vhost_workers, and kthreads.

Eric



2021-12-23 05:19:23

by Eric W. Biederman

Subject: [PATCH] kthread: Generalize pf_io_worker so it can point to struct kthread


The point of using set_child_tid to hold the kthread pointer was that
it already did what was necessary. There are now restrictions on when
set_child_tid can be initialized and when set_child_tid can be used in
schedule_tail, which indicates that continuing to use set_child_tid
to hold the kthread pointer is a bad idea.

Instead of continuing to use the set_child_tid field of task_struct
generalize the pf_io_worker field of task_struct and use it to hold
the kthread pointer.

Rename pf_io_worker (which is a void * pointer) to worker_private so
it can be used to store a kthread's struct kthread pointer. Update the
kthread code to store the kthread pointer in the worker_private field.
Remove the places where set_child_tid had to be dealt with carefully
because kthreads also used it.

Link: https://lkml.kernel.org/r/[email protected]om
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---

I looked again and the vhost_worker changes do not generalize
pf_io_worker, and as pf_io_worker is already a void * it is easy to
generalize. So I just did that.

Unless someone spots a problem I will add this to my signal-for-v5.17
branch in linux-next, as this seems much less error prone than using
set_child_tid.
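
As a rough user-space model of the resulting arrangement (a sketch, not
the kernel code): struct fake_task, struct fake_kthread, the flag values
and the *_sketch helpers below are hypothetical names. The lookup
mirrors the shape of __to_kthread in the diff, and set_child_tid goes
back to being only a tid slot.

```c
#include <assert.h>
#include <stddef.h>

#define PF_KTHREAD   0x1u /* marker bits for this sketch only */
#define PF_IO_WORKER 0x2u

/* Hypothetical stand-in for struct kthread. */
struct fake_kthread {
	int result;
};

/* Hypothetical stand-in for the task_struct fields involved. */
struct fake_task {
	unsigned int flags;
	int tid;
	int *set_child_tid;   /* only ever a tid slot */
	void *worker_private; /* struct kthread or io worker data */
};

/* Return the kthread data only when the task really is a kthread. */
struct fake_kthread *to_kthread_sketch(struct fake_task *p)
{
	assert(p != NULL);
	if (!(p->flags & PF_KTHREAD))
		return NULL;
	return p->worker_private;
}

/* The unconditional tid write is safe again: nothing aliases the slot. */
void schedule_tail_sketch(struct fake_task *t)
{
	if (t->set_child_tid)
		*t->set_child_tid = t->tid;
}

/* Result code read back from a kthread after the tail has run. */
int kthread_result_after_tail(int result)
{
	struct fake_kthread k = { .result = result };
	struct fake_task t = {
		.flags = PF_KTHREAD, .tid = 42,
		.set_child_tid = NULL, .worker_private = &k,
	};
	struct fake_kthread *found;

	schedule_tail_sketch(&t);
	found = to_kthread_sketch(&t);
	return found ? found->result : -1;
}

/* Non-zero if an io worker's private data is mistaken for a kthread. */
int io_worker_seen_as_kthread(void)
{
	int worker = 0; /* placeholder for an io worker object */
	struct fake_task t = {
		.flags = PF_IO_WORKER, .worker_private = &worker,
	};

	return to_kthread_sketch(&t) != NULL;
}
```

In this model the kthread's result code survives the tail unchanged,
and a task with only PF_IO_WORKER set is never handed back as a
kthread.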

fs/io-wq.c | 6 +++---
fs/io-wq.h | 2 +-
include/linux/sched.h | 4 ++--
kernel/fork.c | 8 +-------
kernel/kthread.c | 14 +++++---------
kernel/sched/core.c | 2 +-
6 files changed, 13 insertions(+), 23 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 88202de519f6..e4fc7384b40c 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -657,7 +657,7 @@ static int io_wqe_worker(void *data)
*/
void io_wq_worker_running(struct task_struct *tsk)
{
- struct io_worker *worker = tsk->pf_io_worker;
+ struct io_worker *worker = tsk->worker_private;

if (!worker)
return;
@@ -675,7 +675,7 @@ void io_wq_worker_running(struct task_struct *tsk)
*/
void io_wq_worker_sleeping(struct task_struct *tsk)
{
- struct io_worker *worker = tsk->pf_io_worker;
+ struct io_worker *worker = tsk->worker_private;

if (!worker)
return;
@@ -694,7 +694,7 @@ void io_wq_worker_sleeping(struct task_struct *tsk)
static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
struct task_struct *tsk)
{
- tsk->pf_io_worker = worker;
+ tsk->worker_private = worker;
worker->task = tsk;
set_cpus_allowed_ptr(tsk, wqe->cpu_mask);
tsk->flags |= PF_NO_SETAFFINITY;
diff --git a/fs/io-wq.h b/fs/io-wq.h
index 41bf37674a49..c7c23947cbcd 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -200,6 +200,6 @@ static inline void io_wq_worker_running(struct task_struct *tsk)
static inline bool io_wq_current_is_worker(void)
{
return in_task() && (current->flags & PF_IO_WORKER) &&
- current->pf_io_worker;
+ current->worker_private;
}
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78c351e35fec..52f2fdffa3ab 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -987,8 +987,8 @@ struct task_struct {
/* CLONE_CHILD_CLEARTID: */
int __user *clear_child_tid;

- /* PF_IO_WORKER */
- void *pf_io_worker;
+ /* PF_KTHREAD | PF_IO_WORKER */
+ void *worker_private;

u64 utime;
u64 stime;
diff --git a/kernel/fork.c b/kernel/fork.c
index 0816be1bb044..6f0293cb29c9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -950,7 +950,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
tsk->splice_pipe = NULL;
tsk->task_frag.page = NULL;
tsk->wake_q.next = NULL;
- tsk->pf_io_worker = NULL;
+ tsk->worker_private = NULL;

account_kernel_stack(tsk, 1);

@@ -2032,12 +2032,6 @@ static __latent_entropy struct task_struct *copy_process(
siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
}

- /*
- * This _must_ happen before we call free_task(), i.e. before we jump
- * to any of the bad_fork_* labels. This is to avoid freeing
- * p->set_child_tid which is (ab)used as a kthread's data pointer for
- * kernel threads (PF_KTHREAD).
- */
p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? args->child_tid : NULL;
/*
* Clear TID on mm_release()?
diff --git a/kernel/kthread.c b/kernel/kthread.c
index c14707d15341..261a3c3b9c6c 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -72,7 +72,7 @@ enum KTHREAD_BITS {
static inline struct kthread *to_kthread(struct task_struct *k)
{
WARN_ON(!(k->flags & PF_KTHREAD));
- return (__force void *)k->set_child_tid;
+ return k->worker_private;
}

/*
@@ -80,7 +80,7 @@ static inline struct kthread *to_kthread(struct task_struct *k)
*
* Per construction; when:
*
- * (p->flags & PF_KTHREAD) && p->set_child_tid
+ * (p->flags & PF_KTHREAD) && p->worker_private
*
* the task is both a kthread and struct kthread is persistent. However
* PF_KTHREAD on it's own is not, kernel_thread() can exec() (See umh.c and
@@ -88,7 +88,7 @@ static inline struct kthread *to_kthread(struct task_struct *k)
*/
static inline struct kthread *__to_kthread(struct task_struct *p)
{
- void *kthread = (__force void *)p->set_child_tid;
+ void *kthread = p->worker_private;
if (kthread && !(p->flags & PF_KTHREAD))
kthread = NULL;
return kthread;
@@ -109,11 +109,7 @@ bool set_kthread_struct(struct task_struct *p)
init_completion(&kthread->parked);
p->vfork_done = &kthread->exited;

- /*
- * We abuse ->set_child_tid to avoid the new member and because it
- * can't be wrongly copied by copy_process().
- */
- p->set_child_tid = (__force void __user *)kthread;
+ p->worker_private = kthread;
return true;
}

@@ -128,7 +124,7 @@ void free_kthread_struct(struct task_struct *k)
#ifdef CONFIG_BLK_CGROUP
WARN_ON_ONCE(kthread && kthread->blkcg_css);
#endif
- k->set_child_tid = (__force void __user *)NULL;
+ k->worker_private = NULL;
kfree(kthread);
}

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d8adbea77be1..ee222b89c692 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4908,7 +4908,7 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
finish_task_switch(prev);
preempt_enable();

- if (!(current->flags & PF_KTHREAD) && current->set_child_tid)
+ if (current->set_child_tid)
put_user(task_pid_vnr(current), current->set_child_tid);

calculate_sigpending();
--
2.29.2


2021-12-23 17:20:51

by Linus Torvalds

Subject: Re: [PATCH] kthread: Generalize pf_io_worker so it can point to struct kthread

On Wed, Dec 22, 2021 at 9:19 PM Eric W. Biederman <[email protected]> wrote:
>
> Instead of continuing to use the set_child_tid field of task_struct
> generalize the pf_io_worker field of task_struct and use it to hold
> the kthread pointer.

Well that patch certainly looks like a nice cleanup to me. Thanks.

Linus

2022-01-03 21:30:35

by Eric W. Biederman

Subject: [PATCH 00/17] exit: Making task exiting a first class concept


The changes below contain some cleanups and the work to implement
first class asynchronous task exit. Most of the cleanups are necessary
for this work, but a couple of them (removing profile_task_exit and the
extra setting of PT_SEIZED in ptrace_attach) are included because I
stumbled over them and they are worth applying, even though they aren't
interesting enough to me to be in their own patchset.

The core of this set of changes is the addition of
schedule_task_exit_locked. Ptrace is cleaned up to avoid a conflict in
task->exit_code. Then the existing task exit code is gradually moved
into the final shape of schedule_task_exit_locked.

This is the fundamental building block I need to fix alpha, m68k,
nios2 and any other architecture that does not always save all of
their registers except when entering into a ptrace context.

This is about half the work to allow coredump signals to use
short-circuit delivery.

With coredump signals available for short-circuit delivery the
SA_IMMUTABLE hack can be replaced by something clean.

The counting of the number of threads that have not been killed to
always set SIGNAL_GROUP_EXIT when a process exits and the coredump
signal short-circuit delivery is a foundation for updating the
SECCOMP_RET_KILL_THREAD implementation such that it can decide if it
should coredump without races.

I have most of those changes pretty much ready; I just need to get these
changes finalized and reviewed first. At this point they are looking like
v5.18 material.

These patches are on top of:
https://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git/ signal-for-v5.17

After these patches have been reviewed it is my plan to apply them to my
signal-for-v5.17 branch. Any and all feedback is welcome.

Eric W. Biederman (17):
exit: Remove profile_task_exit & profile_munmap
exit: Coredumps reach do_group_exit
exit: Fix the exit_code for wait_task_zombie
exit: Use the correct exit_code in /proc/<pid>/stat
taskstats: Cleanup the use of task->exit_code
ptrace: Remove second setting of PT_SEIZED in ptrace_attach
ptrace: Remove unused regs argument from ptrace_report_syscall
ptrace/m68k: Stop open coding ptrace_report_syscall
ptrace: Move setting/clearing ptrace_message into ptrace_stop
ptrace: Return the signal to continue with from ptrace_stop
ptrace: Separate task->ptrace_code out from task->exit_code
signal: Compute the process exit_code in get_signal
signal: Make individual tasks exiting a first class concept
signal: Remove zap_other_threads
signal: Add JOBCTL_WILL_EXIT to mark exiting tasks
signal: Record the exit_code when an exit is scheduled
signal: Always set SIGNAL_GROUP_EXIT on process exit

arch/m68k/kernel/ptrace.c | 12 +----
fs/coredump.c | 17 +++---
fs/exec.c | 12 +++--
fs/proc/array.c | 9 +++-
include/linux/profile.h | 26 ---------
include/linux/ptrace.h | 5 +-
include/linux/sched.h | 1 +
include/linux/sched/jobctl.h | 2 +
include/linux/sched/signal.h | 6 ++-
include/linux/tracehook.h | 21 ++++----
kernel/exit.c | 29 +++++-----
kernel/fork.c | 2 +
kernel/profile.c | 50 ------------------
kernel/ptrace.c | 14 +++--
kernel/signal.c | 122 +++++++++++++++++++++++--------------------
kernel/tsacct.c | 7 ++-
mm/mmap.c | 1 -
17 files changed, 134 insertions(+), 202 deletions(-)


2022-01-03 21:33:35

by Eric W. Biederman

Subject: [PATCH 01/17] exit: Remove profile_task_exit & profile_munmap

When I say remove I mean remove. All profile_task_exit and
profile_munmap do is call a blocking notifier chain. The helpers
profile_event_register and profile_event_unregister are not called
anywhere in the tree, which means this is all dead code.

So remove the dead code and make it easier to read do_exit.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/profile.h | 26 ---------------------
kernel/exit.c | 1 -
kernel/profile.c | 50 -----------------------------------------
mm/mmap.c | 1 -
4 files changed, 78 deletions(-)

diff --git a/include/linux/profile.h b/include/linux/profile.h
index fd18ca96f557..f7eb2b57d890 100644
--- a/include/linux/profile.h
+++ b/include/linux/profile.h
@@ -31,11 +31,6 @@ static inline int create_proc_profile(void)
}
#endif

-enum profile_type {
- PROFILE_TASK_EXIT,
- PROFILE_MUNMAP
-};
-
#ifdef CONFIG_PROFILING

extern int prof_on __read_mostly;
@@ -66,23 +61,14 @@ static inline void profile_hit(int type, void *ip)
struct task_struct;
struct mm_struct;

-/* task is in do_exit() */
-void profile_task_exit(struct task_struct * task);
-
/* task is dead, free task struct ? Returns 1 if
* the task was taken, 0 if the task should be freed.
*/
int profile_handoff_task(struct task_struct * task);

-/* sys_munmap */
-void profile_munmap(unsigned long addr);
-
int task_handoff_register(struct notifier_block * n);
int task_handoff_unregister(struct notifier_block * n);

-int profile_event_register(enum profile_type, struct notifier_block * n);
-int profile_event_unregister(enum profile_type, struct notifier_block * n);
-
#else

#define prof_on 0
@@ -117,19 +103,7 @@ static inline int task_handoff_unregister(struct notifier_block * n)
return -ENOSYS;
}

-static inline int profile_event_register(enum profile_type t, struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-static inline int profile_event_unregister(enum profile_type t, struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-#define profile_task_exit(a) do { } while (0)
#define profile_handoff_task(a) (0)
-#define profile_munmap(a) do { } while (0)

#endif /* CONFIG_PROFILING */

diff --git a/kernel/exit.c b/kernel/exit.c
index e7104f803be0..b5c35b520fda 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -737,7 +737,6 @@ void __noreturn do_exit(long code)

WARN_ON(blk_needs_flush_plug(tsk));

- profile_task_exit(tsk);
kcov_task_exit(tsk);

coredump_task_exit(tsk);
diff --git a/kernel/profile.c b/kernel/profile.c
index eb9c7f0f5ac5..9355cc934a96 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -135,14 +135,7 @@ int __ref profile_init(void)

/* Profile event notifications */

-static BLOCKING_NOTIFIER_HEAD(task_exit_notifier);
static ATOMIC_NOTIFIER_HEAD(task_free_notifier);
-static BLOCKING_NOTIFIER_HEAD(munmap_notifier);
-
-void profile_task_exit(struct task_struct *task)
-{
- blocking_notifier_call_chain(&task_exit_notifier, 0, task);
-}

int profile_handoff_task(struct task_struct *task)
{
@@ -151,11 +144,6 @@ int profile_handoff_task(struct task_struct *task)
return (ret == NOTIFY_OK) ? 1 : 0;
}

-void profile_munmap(unsigned long addr)
-{
- blocking_notifier_call_chain(&munmap_notifier, 0, (void *)addr);
-}
-
int task_handoff_register(struct notifier_block *n)
{
return atomic_notifier_chain_register(&task_free_notifier, n);
@@ -168,44 +156,6 @@ int task_handoff_unregister(struct notifier_block *n)
}
EXPORT_SYMBOL_GPL(task_handoff_unregister);

-int profile_event_register(enum profile_type type, struct notifier_block *n)
-{
- int err = -EINVAL;
-
- switch (type) {
- case PROFILE_TASK_EXIT:
- err = blocking_notifier_chain_register(
- &task_exit_notifier, n);
- break;
- case PROFILE_MUNMAP:
- err = blocking_notifier_chain_register(
- &munmap_notifier, n);
- break;
- }
-
- return err;
-}
-EXPORT_SYMBOL_GPL(profile_event_register);
-
-int profile_event_unregister(enum profile_type type, struct notifier_block *n)
-{
- int err = -EINVAL;
-
- switch (type) {
- case PROFILE_TASK_EXIT:
- err = blocking_notifier_chain_unregister(
- &task_exit_notifier, n);
- break;
- case PROFILE_MUNMAP:
- err = blocking_notifier_chain_unregister(
- &munmap_notifier, n);
- break;
- }
-
- return err;
-}
-EXPORT_SYMBOL_GPL(profile_event_unregister);
-
#if defined(CONFIG_SMP) && defined(CONFIG_PROC_FS)
/*
* Each cpu has a pair of open-addressed hashtables for pending
diff --git a/mm/mmap.c b/mm/mmap.c
index bfb0ea164a90..70318c2a47c3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2928,7 +2928,6 @@ EXPORT_SYMBOL(vm_munmap);
SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
{
addr = untagged_addr(addr);
- profile_munmap(addr);
return __vm_munmap(addr, len, true);
}

--
2.29.2


2022-01-03 21:33:37

by Eric W. Biederman

Subject: [PATCH 02/17] exit: Coredumps reach do_group_exit

The comment about coredumps not reaching do_group_exit and the
corresponding BUG_ON are bogus.

What happens and has happened for years is that get_signal calls
do_coredump (which sets SIGNAL_GROUP_EXIT and group_exit_code) and
then do_group_exit passing the signal number. Then do_group_exit
ignores the exit_code it is passed and uses signal->group_exit_code
from the coredump.
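
To make the flow above concrete, here is a minimal userspace sketch of
the exit_code selection. It is not kernel code: struct model_signal,
model_coredump and model_group_exit_code are hypothetical names invented
for illustration, and the 0x80 bit assumes the core file was actually
written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two signal_struct fields involved. */
struct model_signal {
	bool group_exit;	/* models SIGNAL_GROUP_EXIT in sig->flags */
	int group_exit_code;	/* models sig->group_exit_code */
};

/*
 * Models what do_coredump leaves behind: the group exit is already
 * marked and the recorded code carries the core-dump bit (0x80).
 * Assumption: the dump succeeded.
 */
static void model_coredump(struct model_signal *sig, int signr)
{
	assert(sig != NULL);
	sig->group_exit = true;
	sig->group_exit_code = signr | 0x80;
}

/* Models only the exit_code selection at the top of do_group_exit. */
static int model_group_exit_code(const struct model_signal *sig, int exit_code)
{
	assert(sig != NULL);
	if (sig->group_exit)
		return sig->group_exit_code;	/* the passed-in code is ignored */
	return exit_code;
}
```

In this model, calling model_coredump(&sig, SIGSEGV) and then
model_group_exit_code(&sig, SIGSEGV) yields SIGSEGV | 0x80, which is
exactly the kind of value the removed BUG_ON would have rejected had it
been applied to the code actually used.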

The comment and BUG_ON were correct when they were added during the
2.5 development cycle, but became obsolete and incorrect when
get_signal was changed to fall through to do_group_exit after
do_coredump in 2.6.10-rc2.

So remove the stale comment and BUG_ON.

Fixes: 63bd6144f191 ("[PATCH] Invalid BUG_ONs in signal.c")
History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index b5c35b520fda..34c43037450f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -904,8 +904,6 @@ do_group_exit(int exit_code)
{
struct signal_struct *sig = current->signal;

- BUG_ON(exit_code & 0x80); /* core dumps don't get here */
-
if (sig->flags & SIGNAL_GROUP_EXIT)
exit_code = sig->group_exit_code;
else if (sig->group_exec_task)
--
2.29.2


2022-01-03 21:33:42

by Eric W. Biederman

Subject: [PATCH 03/17] exit: Fix the exit_code for wait_task_zombie

The function wait_task_zombie is defined to always return the process,
not the thread, exit status. Unfortunately when process group exit support
was added to wait_task_zombie the WNOWAIT case was overlooked.

Usually tsk->exit_code and tsk->signal->group_exit_code will be in sync
so fixing this bug probably has no effect in practice. But fix
it anyway so that people aren't scratching their heads about why
the two code paths are different.
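
For reference, the two userspace views being compared can be exercised
with a small program. This is only a sketch: peek_then_reap is a
hypothetical helper, and it covers just the simple single-threaded case
where both paths are expected to agree, not the case where they could
differ.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits with `code`, peek at it with WNOWAIT, then
 * reap it. Returns 0 when both views agree on the exit status, -1
 * otherwise.
 */
static int peek_then_reap(int code)
{
	siginfo_t info = {0};
	int status = 0;
	pid_t pid;

	assert(code >= 0 && code <= 255);

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(code);

	/* WNOWAIT leaves the zombie in place so it can be waited for again. */
	if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) != 0)
		return -1;
	if (waitpid(pid, &status, 0) != pid)
		return -1;

	if (info.si_code != CLD_EXITED || !WIFEXITED(status))
		return -1;
	return (info.si_status == WEXITSTATUS(status) &&
		info.si_status == code) ? 0 : -1;
}
```

The first waitid call goes through the WNOWAIT branch touched by this
patch; the waitpid call that follows goes through the ordinary reaping
branch.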

History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: 2c66151cbc2c ("[PATCH] sys_exit() threading improvements, BK-curr")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 34c43037450f..7121db37c411 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1011,7 +1011,8 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
return 0;

if (unlikely(wo->wo_flags & WNOWAIT)) {
- status = p->exit_code;
+ status = (p->signal->flags & SIGNAL_GROUP_EXIT)
+ ? p->signal->group_exit_code : p->exit_code;
get_task_struct(p);
read_unlock(&tasklist_lock);
sched_annotate_sleep();
--
2.29.2


2022-01-03 21:33:43

by Eric W. Biederman

Subject: [PATCH 04/17] exit: Use the correct exit_code in /proc/<pid>/stat

Since do_task_stat was modified to return process wide values instead
of per task values, the exit_code calculation has never been updated.
Update it now to return the process wide exit_code when it is requested
and available.
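
For anyone wanting to observe the value from userspace, the following is
a rough sketch of pulling a numeric field out of a /proc/<pid>/stat
line. The helper stat_field and the constant STAT_FIELD_EXIT_CODE are
hypothetical names; the field number 52 is what proc(5) documents for
exit_code, and the sketch assumes the line follows that layout.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* exit_code is field 52 according to proc(5). */
#define STAT_FIELD_EXIT_CODE 52

/*
 * Return numeric field `field` (1-based, field >= 3) of a
 * /proc/<pid>/stat line, or -1 if the line has too few fields. The comm
 * field (2) may contain spaces and parentheses, so counting starts
 * after the last ')'. Note that -1 can also be a legitimate value for
 * some fields; this sketch does not distinguish the two.
 */
static long stat_field(const char *line, int field)
{
	const char *p;
	int cur = 2;	/* the last ')' closes field 2 */

	assert(line != NULL && field >= 3);

	p = strrchr(line, ')');
	if (!p)
		return -1;
	p++;

	while (*p) {
		while (*p == ' ')
			p++;
		if (!*p || *p == '\n')
			break;
		cur++;
		if (cur == field)
			return strtol(p, NULL, 10);
		while (*p && *p != ' ')
			p++;
	}
	return -1;
}
```

A caller would read the line from /proc/<pid>/stat and pass it along
with STAT_FIELD_EXIT_CODE. The value is in wait-status form, and it is
reported as 0 when the reader is not permitted to see it.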

History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: bf719d26a5c1 ("[PATCH] distinct tgid/tid CPU usage")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/proc/array.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index ff869a66b34e..43a7abde9e42 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -468,6 +468,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
u64 cgtime, gtime;
unsigned long rsslim = 0;
unsigned long flags;
+ int exit_code = task->exit_code;

state = *get_task_state(task);
vsize = eip = esp = 0;
@@ -531,6 +532,9 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
maj_flt += sig->maj_flt;
thread_group_cputime_adjusted(task, &utime, &stime);
gtime += sig->gtime;
+
+ if (sig->flags & (SIGNAL_GROUP_EXIT | SIGNAL_STOP_STOPPED))
+ exit_code = sig->group_exit_code;
}

sid = task_session_nr_ns(task, ns);
@@ -630,7 +634,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
seq_puts(m, " 0 0 0 0 0 0 0");

if (permitted)
- seq_put_decimal_ll(m, " ", task->exit_code);
+ seq_put_decimal_ll(m, " ", exit_code);
else
seq_puts(m, " 0");

--
2.29.2


2022-01-03 21:33:49

by Eric W. Biederman

Subject: [PATCH 06/17] ptrace: Remove second setting of PT_SEIZED in ptrace_attach

The code is totally redundant; remove it.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/ptrace.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index f8589bf8d7dc..eea265082e97 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -419,8 +419,6 @@ static int ptrace_attach(struct task_struct *task, long request,
if (task->ptrace)
goto unlock_tasklist;

- if (seize)
- flags |= PT_SEIZED;
task->ptrace = flags;

ptrace_link(task, current);
--
2.29.2


2022-01-03 21:33:51

by Eric W. Biederman

Subject: [PATCH 05/17] taskstats: Cleanup the use of task->exit_code

In the function bacct_add_tsk the code reading task->exit_code was
introduced in commit f3cef7a99469 ("[PATCH] csa: basic accounting over
taskstats"), and it is not entirely clear what the taskstats interface
is trying to return, as only returning the exit_code of the first task
in a process doesn't make a lot of sense.

As best as I can figure the intent is to return task->exit_code after
a task exits. The field is returned with per task fields, so the
exit_code of the entire process is not wanted. Only the value of the
first task is returned so this is not a useful way to get the per task
ptrace stop code. The ordinary case of returning this value is
returning after a task exits, which also precludes use for getting
a ptrace value.

It is common for the first task of a process to also be the last
task of a process so this field may have done something reasonable by
accident in testing.

Make ac_exitcode a reliable per task value by always returning it for
every exited task.

Setting ac_exitcode in a sensible manner makes it possible to continue
to provide this value going forward.

Cc: Balbir Singh <[email protected]>
Fixes: f3cef7a99469 ("[PATCH] csa: basic accounting over taskstats")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/tsacct.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index f00de83d0246..1d261fbe367b 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -38,11 +38,10 @@ void bacct_add_tsk(struct user_namespace *user_ns,
stats->ac_btime = clamp_t(time64_t, btime, 0, U32_MAX);
stats->ac_btime64 = btime;

- if (thread_group_leader(tsk)) {
+ if (tsk->flags & PF_EXITING)
stats->ac_exitcode = tsk->exit_code;
- if (tsk->flags & PF_FORKNOEXEC)
- stats->ac_flag |= AFORK;
- }
+ if (thread_group_leader(tsk) && (tsk->flags & PF_FORKNOEXEC))
+ stats->ac_flag |= AFORK;
if (tsk->flags & PF_SUPERPRIV)
stats->ac_flag |= ASU;
if (tsk->flags & PF_DUMPCORE)
--
2.29.2


2022-01-03 21:33:54

by Eric W. Biederman

Subject: [PATCH 07/17] ptrace: Remove unused regs argument from ptrace_report_syscall

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/tracehook.h | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 2564b7434b4d..88c007ab5ebc 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -54,8 +54,7 @@ struct linux_binprm;
/*
* ptrace report for syscall entry and exit looks identical.
*/
-static inline int ptrace_report_syscall(struct pt_regs *regs,
- unsigned long message)
+static inline int ptrace_report_syscall(unsigned long message)
{
int ptrace = current->ptrace;

@@ -102,7 +101,7 @@ static inline int ptrace_report_syscall(struct pt_regs *regs,
static inline __must_check int tracehook_report_syscall_entry(
struct pt_regs *regs)
{
- return ptrace_report_syscall(regs, PTRACE_EVENTMSG_SYSCALL_ENTRY);
+ return ptrace_report_syscall(PTRACE_EVENTMSG_SYSCALL_ENTRY);
}

/**
@@ -127,7 +126,7 @@ static inline void tracehook_report_syscall_exit(struct pt_regs *regs, int step)
if (step)
user_single_step_report(regs);
else
- ptrace_report_syscall(regs, PTRACE_EVENTMSG_SYSCALL_EXIT);
+ ptrace_report_syscall(PTRACE_EVENTMSG_SYSCALL_EXIT);
}

/**
--
2.29.2


2022-01-03 21:33:58

by Eric W. Biederman

Subject: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

The generic function ptrace_report_syscall does a little more
than syscall_trace on m68k. The function ptrace_report_syscall
stops early if PT_PTRACED is not set, it sets ptrace_message,
and returns the result of fatal_signal_pending.

Setting ptrace_message to a passed in value of 0 is effectively not
setting ptrace_message, making that additional work a noop.

Returning the result of fatal_signal_pending and letting the caller
ignore the result becomes a noop in this change.

When a process is ptraced, the flag PT_PTRACED is always set in
current->ptrace. Testing for PT_PTRACED in ptrace_report_syscall is
just an optimization to fail early if the process is not ptraced.
Later on in ptrace_notify, ptrace_stop will test current->ptrace under
tasklist_lock and skip performing any work if the task is not ptraced.

Cc: Geert Uytterhoeven <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
arch/m68k/kernel/ptrace.c | 12 +-----------
1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/arch/m68k/kernel/ptrace.c b/arch/m68k/kernel/ptrace.c
index 94b3b274186d..aa3a0b8d07e9 100644
--- a/arch/m68k/kernel/ptrace.c
+++ b/arch/m68k/kernel/ptrace.c
@@ -273,17 +273,7 @@ long arch_ptrace(struct task_struct *child, long request,

asmlinkage void syscall_trace(void)
{
- ptrace_notify(SIGTRAP | ((current->ptrace & PT_TRACESYSGOOD)
- ? 0x80 : 0));
- /*
- * this isn't the same as continuing with a signal, but it will do
- * for normal use. strace only continues with a signal if the
- * stopping signal is not SIGTRAP. -brl
- */
- if (current->exit_code) {
- send_sig(current->exit_code, current, 1);
- current->exit_code = 0;
- }
+ ptrace_report_syscall(0);
}

#if defined(CONFIG_COLDFIRE) || !defined(CONFIG_MMU)
--
2.29.2


2022-01-03 21:34:02

by Eric W. Biederman

Subject: [PATCH 09/17] ptrace: Move setting/clearing ptrace_message into ptrace_stop

Today ptrace_message is easy to overlook as it is not a core part of
ptrace_stop. It has been overlooked so much that there are places
that set ptrace_message and don't clear it, and places that never set
it. So if you get an unlucky sequence of events the ptracer may be
able to read a ptrace_message that does not apply to the current
ptrace stop.

Move setting of ptrace_message into ptrace_stop so that it always gets
set before the stop, and always gets cleared after the stop. This
prevents nonsense from being reported to userspace and makes
ptrace_message more visible in the ptrace API so that kernel
developers can see it.
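
As a reminder of what the tracer sees on its side, here is a small
userspace sketch that decodes the exit_code passed to ptrace_notify out
of a waitpid status. The helper name ptrace_event_of is hypothetical;
the layout (event number in the byte above SIGTRAP) is the one described
in ptrace(2). The message itself would then be fetched with
PTRACE_GETEVENTMSG, which is not shown here.

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

/*
 * Given a status from waitpid(), return the PTRACE_EVENT_* number
 * encoded in it, 0 for a SIGTRAP stop that carries no event (this
 * includes syscall stops marked with 0x80), or -1 if the status is not
 * a SIGTRAP-based stop at all.
 */
static int ptrace_event_of(int status)
{
	int code;

	if (!WIFSTOPPED(status))
		return -1;

	/* The bits above the low byte hold the code the tracee stopped with. */
	code = (status >> 8) & 0xffff;

	/* Same mask as the kernel side check: ignore the 0x80 syscall bit. */
	if ((code & 0x7f) != SIGTRAP)
		return -1;

	assert((code >> 8) <= 0xff);
	return code >> 8;
}
```

A tracer would typically call this on every stop and only issue
PTRACE_GETEVENTMSG when the result is a positive event number.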

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/ptrace.h | 5 ++---
include/linux/tracehook.h | 6 ++----
kernel/signal.c | 19 +++++++++++--------
3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index 8aee2945ff08..06f27736c6f8 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -60,7 +60,7 @@ extern int ptrace_writedata(struct task_struct *tsk, char __user *src, unsigned
extern void ptrace_disable(struct task_struct *);
extern int ptrace_request(struct task_struct *child, long request,
unsigned long addr, unsigned long data);
-extern void ptrace_notify(int exit_code);
+extern void ptrace_notify(int exit_code, unsigned long message);
extern void __ptrace_link(struct task_struct *child,
struct task_struct *new_parent,
const struct cred *ptracer_cred);
@@ -155,8 +155,7 @@ static inline bool ptrace_event_enabled(struct task_struct *task, int event)
static inline void ptrace_event(int event, unsigned long message)
{
if (unlikely(ptrace_event_enabled(current, event))) {
- current->ptrace_message = message;
- ptrace_notify((event << 8) | SIGTRAP);
+ ptrace_notify((event << 8) | SIGTRAP, message);
} else if (event == PTRACE_EVENT_EXEC) {
/* legacy EXEC report via SIGTRAP */
if ((current->ptrace & (PT_PTRACED|PT_SEIZED)) == PT_PTRACED)
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 88c007ab5ebc..5e60af8a11fc 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -61,8 +61,7 @@ static inline int ptrace_report_syscall(unsigned long message)
if (!(ptrace & PT_PTRACED))
return 0;

- current->ptrace_message = message;
- ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
+ ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0), message);

/*
* this isn't the same as continuing with a signal, but it will do
@@ -74,7 +73,6 @@ static inline int ptrace_report_syscall(unsigned long message)
current->exit_code = 0;
}

- current->ptrace_message = 0;
return fatal_signal_pending(current);
}

@@ -143,7 +141,7 @@ static inline void tracehook_report_syscall_exit(struct pt_regs *regs, int step)
static inline void tracehook_signal_handler(int stepping)
{
if (stepping)
- ptrace_notify(SIGTRAP);
+ ptrace_notify(SIGTRAP, 0);
}

/**
diff --git a/kernel/signal.c b/kernel/signal.c
index 802acca0207b..75bb062d8534 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2197,7 +2197,8 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
* If we actually decide not to stop at all because the tracer
* is gone, we keep current->exit_code unless clear_code.
*/
-static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t *info)
+static void ptrace_stop(int exit_code, int why, int clear_code,
+ unsigned long message, kernel_siginfo_t *info)
__releases(&current->sighand->siglock)
__acquires(&current->sighand->siglock)
{
@@ -2243,6 +2244,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
*/
smp_wmb();

+ current->ptrace_message = message;
current->last_siginfo = info;
current->exit_code = exit_code;

@@ -2321,6 +2323,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
*/
spin_lock_irq(&current->sighand->siglock);
current->last_siginfo = NULL;
+ current->ptrace_message = 0;

/* LISTENING can be set only during STOP traps, clear it */
current->jobctl &= ~JOBCTL_LISTENING;
@@ -2333,7 +2336,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
recalc_sigpending_tsk(current);
}

-static void ptrace_do_notify(int signr, int exit_code, int why)
+static void ptrace_do_notify(int signr, int exit_code, int why, unsigned long message)
{
kernel_siginfo_t info;

@@ -2344,17 +2347,17 @@ static void ptrace_do_notify(int signr, int exit_code, int why)
info.si_uid = from_kuid_munged(current_user_ns(), current_uid());

/* Let the debugger run. */
- ptrace_stop(exit_code, why, 1, &info);
+ ptrace_stop(exit_code, why, 1, message, &info);
}

-void ptrace_notify(int exit_code)
+void ptrace_notify(int exit_code, unsigned long message)
{
BUG_ON((exit_code & (0x7f | ~0xffff)) != SIGTRAP);
if (unlikely(current->task_works))
task_work_run();

spin_lock_irq(&current->sighand->siglock);
- ptrace_do_notify(SIGTRAP, exit_code, CLD_TRAPPED);
+ ptrace_do_notify(SIGTRAP, exit_code, CLD_TRAPPED, message);
spin_unlock_irq(&current->sighand->siglock);
}

@@ -2510,10 +2513,10 @@ static void do_jobctl_trap(void)
signr = SIGTRAP;
WARN_ON_ONCE(!signr);
ptrace_do_notify(signr, signr | (PTRACE_EVENT_STOP << 8),
- CLD_STOPPED);
+ CLD_STOPPED, 0);
} else {
WARN_ON_ONCE(!signr);
- ptrace_stop(signr, CLD_STOPPED, 0, NULL);
+ ptrace_stop(signr, CLD_STOPPED, 0, 0, NULL);
current->exit_code = 0;
}
}
@@ -2567,7 +2570,7 @@ static int ptrace_signal(int signr, kernel_siginfo_t *info, enum pid_type type)
* comment in dequeue_signal().
*/
current->jobctl |= JOBCTL_STOP_DEQUEUED;
- ptrace_stop(signr, CLD_TRAPPED, 0, info);
+ ptrace_stop(signr, CLD_TRAPPED, 0, 0, info);

/* We're back. Did the debugger cancel the sig? */
signr = current->exit_code;
--
2.29.2


2022-01-03 21:34:06

by Eric W. Biederman

Subject: [PATCH 10/17] ptrace: Return the signal to continue with from ptrace_stop

The signal a task should continue with after a ptrace stop is
inconsistently read, cleared, and sent. Solve this by reading and
clearing the signal to be sent in ptrace_stop.

In an ideal world everything except ptrace_signal would share a common
implementation of continuing with the signal, so ptracers could count
on the signal they ask to continue with actually being delivered. For
now retain bug compatibility and just return with the signal number
the ptracer requested the code continue with.
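
The bookkeeping this introduces can be summarized with a toy userspace
model. This is a sketch only: struct model_task and model_ptrace_stop
are hypothetical, and the tracer_present and resume_sig parameters are
inventions of the model that stand in for the real stop, during which
the tracer may rewrite exit_code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the task_struct fields involved. */
struct model_task {
	int exit_code;
	unsigned long ptrace_message;
};

/*
 * Models only the exit_code and ptrace_message handling of ptrace_stop.
 * Locking, scheduling and notification of the parent are left out.
 */
static int model_ptrace_stop(struct model_task *t, int exit_code,
			     bool clear_code, unsigned long message,
			     bool tracer_present, int resume_sig)
{
	bool read_code = true;

	assert(t != NULL);

	t->ptrace_message = message;
	t->exit_code = exit_code;

	if (tracer_present) {
		/* The stop happens here; the tracer picks the resume signal. */
		t->exit_code = resume_sig;
	} else {
		/* No tracer: do not stop and do not read the code back. */
		read_code = false;
		if (clear_code)
			exit_code = 0;
	}

	if (read_code)
		exit_code = t->exit_code;

	/* Both fields are always cleared on the way out. */
	t->ptrace_message = 0;
	t->exit_code = 0;
	return exit_code;
}
```

The point of the model is that callers only ever look at the return
value; task->exit_code is zero again by the time the function returns,
whichever path was taken.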

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/ptrace.h | 2 +-
include/linux/tracehook.h | 10 +++++-----
kernel/signal.c | 31 ++++++++++++++++++-------------
3 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index 06f27736c6f8..323c9950e705 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -60,7 +60,7 @@ extern int ptrace_writedata(struct task_struct *tsk, char __user *src, unsigned
extern void ptrace_disable(struct task_struct *);
extern int ptrace_request(struct task_struct *child, long request,
unsigned long addr, unsigned long data);
-extern void ptrace_notify(int exit_code, unsigned long message);
+extern int ptrace_notify(int exit_code, unsigned long message);
extern void __ptrace_link(struct task_struct *child,
struct task_struct *new_parent,
const struct cred *ptracer_cred);
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 5e60af8a11fc..2fd0bfe866c0 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -57,21 +57,21 @@ struct linux_binprm;
static inline int ptrace_report_syscall(unsigned long message)
{
int ptrace = current->ptrace;
+ int signr;

if (!(ptrace & PT_PTRACED))
return 0;

- ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0), message);
+ signr = ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0),
+ message);

/*
* this isn't the same as continuing with a signal, but it will do
* for normal use. strace only continues with a signal if the
* stopping signal is not SIGTRAP. -brl
*/
- if (current->exit_code) {
- send_sig(current->exit_code, current, 1);
- current->exit_code = 0;
- }
+ if (signr)
+ send_sig(signr, current, 1);

return fatal_signal_pending(current);
}
diff --git a/kernel/signal.c b/kernel/signal.c
index 75bb062d8534..9903ff12e581 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2194,15 +2194,17 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
* That makes it a way to test a stopped process for
* being ptrace-stopped vs being job-control-stopped.
*
- * If we actually decide not to stop at all because the tracer
- * is gone, we keep current->exit_code unless clear_code.
+ * Returns the signal the ptracer requested the code resume
+ * with. If the code did not stop because the tracer is gone,
+ * the stop signal remains unchanged unless clear_code.
*/
-static void ptrace_stop(int exit_code, int why, int clear_code,
+static int ptrace_stop(int exit_code, int why, int clear_code,
unsigned long message, kernel_siginfo_t *info)
__releases(&current->sighand->siglock)
__acquires(&current->sighand->siglock)
{
bool gstop_done = false;
+ bool read_code = true;

if (arch_ptrace_stop_needed()) {
/*
@@ -2311,8 +2313,9 @@ static void ptrace_stop(int exit_code, int why, int clear_code,

/* tasklist protects us from ptrace_freeze_traced() */
__set_current_state(TASK_RUNNING);
+ read_code = false;
if (clear_code)
- current->exit_code = 0;
+ exit_code = 0;
read_unlock(&tasklist_lock);
}

@@ -2322,8 +2325,10 @@ static void ptrace_stop(int exit_code, int why, int clear_code,
* any signal-sending on another CPU that wants to examine it.
*/
spin_lock_irq(&current->sighand->siglock);
+ if (read_code) exit_code = current->exit_code;
current->last_siginfo = NULL;
current->ptrace_message = 0;
+ current->exit_code = 0;

/* LISTENING can be set only during STOP traps, clear it */
current->jobctl &= ~JOBCTL_LISTENING;
@@ -2334,9 +2339,10 @@ static void ptrace_stop(int exit_code, int why, int clear_code,
* This sets TIF_SIGPENDING, but never clears it.
*/
recalc_sigpending_tsk(current);
+ return exit_code;
}

-static void ptrace_do_notify(int signr, int exit_code, int why, unsigned long message)
+static int ptrace_do_notify(int signr, int exit_code, int why, unsigned long message)
{
kernel_siginfo_t info;

@@ -2347,18 +2353,21 @@ static void ptrace_do_notify(int signr, int exit_code, int why, unsigned long me
info.si_uid = from_kuid_munged(current_user_ns(), current_uid());

/* Let the debugger run. */
- ptrace_stop(exit_code, why, 1, message, &info);
+ return ptrace_stop(exit_code, why, 1, message, &info);
}

-void ptrace_notify(int exit_code, unsigned long message)
+int ptrace_notify(int exit_code, unsigned long message)
{
+ int signr;
+
BUG_ON((exit_code & (0x7f | ~0xffff)) != SIGTRAP);
if (unlikely(current->task_works))
task_work_run();

spin_lock_irq(&current->sighand->siglock);
- ptrace_do_notify(SIGTRAP, exit_code, CLD_TRAPPED, message);
+ signr = ptrace_do_notify(SIGTRAP, exit_code, CLD_TRAPPED, message);
spin_unlock_irq(&current->sighand->siglock);
+ return signr;
}

/**
@@ -2517,7 +2526,6 @@ static void do_jobctl_trap(void)
} else {
WARN_ON_ONCE(!signr);
ptrace_stop(signr, CLD_STOPPED, 0, 0, NULL);
- current->exit_code = 0;
}
}

@@ -2570,15 +2578,12 @@ static int ptrace_signal(int signr, kernel_siginfo_t *info, enum pid_type type)
* comment in dequeue_signal().
*/
current->jobctl |= JOBCTL_STOP_DEQUEUED;
- ptrace_stop(signr, CLD_TRAPPED, 0, 0, info);
+ signr = ptrace_stop(signr, CLD_TRAPPED, 0, 0, info);

/* We're back. Did the debugger cancel the sig? */
- signr = current->exit_code;
if (signr == 0)
return signr;

- current->exit_code = 0;
-
/*
* Update the siginfo structure if the signal has
* changed. If the debugger wanted something
--
2.29.2


2022-01-03 21:34:17

by Eric W. Biederman

Subject: [PATCH 11/17] ptrace: Separate task->ptrace_code out from task->exit_code

A process can be marked for death by setting SIGNAL_GROUP_EXIT and
group_exit_code, long before do_exit is called. Unfortunately because
of PTRACE_EVENT_EXIT residing in do_exit this same tactic can not be
used for task death.

Correct this by adding a new task field task->ptrace_code that holds
the code for ptrace stops. This allows task->exit_code to be set to
the exit code long before the PTRACE_EVENT_EXIT ptrace stop.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/proc/array.c | 3 +++
include/linux/sched.h | 1 +
kernel/exit.c | 2 +-
kernel/ptrace.c | 12 ++++++------
kernel/signal.c | 18 +++++++++---------
5 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 43a7abde9e42..3042015c11ad 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -519,6 +519,9 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
cgtime = sig->cgtime;
rsslim = READ_ONCE(sig->rlim[RLIMIT_RSS].rlim_cur);

+ if (task_is_traced(task) && !(task->jobctl & JOBCTL_LISTENING))
+ exit_code = task->ptrace_code;
+
/* add up live thread stats at the group level */
if (whole) {
struct task_struct *t = task;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 52f2fdffa3ab..c3d732bf7833 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1174,6 +1174,7 @@ struct task_struct {
/* Ptrace state: */
unsigned long ptrace_message;
kernel_siginfo_t *last_siginfo;
+ int ptrace_code;

struct task_io_accounting ioac;
#ifdef CONFIG_PSI
diff --git a/kernel/exit.c b/kernel/exit.c
index 7121db37c411..aedefe5eb0eb 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1134,7 +1134,7 @@ static int *task_stopped_code(struct task_struct *p, bool ptrace)
{
if (ptrace) {
if (task_is_traced(p) && !(p->jobctl & JOBCTL_LISTENING))
- return &p->exit_code;
+ return &p->ptrace_code;
} else {
if (p->signal->flags & SIGNAL_STOP_STOPPED)
return &p->signal->group_exit_code;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index eea265082e97..8bbd73ab9a34 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -172,7 +172,7 @@ void __ptrace_unlink(struct task_struct *child)

static bool looks_like_a_spurious_pid(struct task_struct *task)
{
- if (task->exit_code != ((PTRACE_EVENT_EXEC << 8) | SIGTRAP))
+ if (task->ptrace_code != ((PTRACE_EVENT_EXEC << 8) | SIGTRAP))
return false;

if (task_pid_vnr(task) == task->ptrace_message)
@@ -573,7 +573,7 @@ static int ptrace_detach(struct task_struct *child, unsigned int data)
* tasklist_lock avoids the race with wait_task_stopped(), see
* the comment in ptrace_resume().
*/
- child->exit_code = data;
+ child->ptrace_code = data;
__ptrace_detach(current, child);
write_unlock_irq(&tasklist_lock);

@@ -863,11 +863,11 @@ static int ptrace_resume(struct task_struct *child, long request,
}

/*
- * Change ->exit_code and ->state under siglock to avoid the race
- * with wait_task_stopped() in between; a non-zero ->exit_code will
+ * Change ->ptrace_code and ->state under siglock to avoid the race
+ * with wait_task_stopped() in between; a non-zero ->ptrace_code will
* wrongly look like another report from tracee.
*
- * Note that we need siglock even if ->exit_code == data and/or this
+ * Note that we need siglock even if ->ptrace_code == data and/or this
* status was not reported yet, the new status must not be cleared by
* wait_task_stopped() after resume.
*
@@ -878,7 +878,7 @@ static int ptrace_resume(struct task_struct *child, long request,
need_siglock = data && !thread_group_empty(current);
if (need_siglock)
spin_lock_irq(&child->sighand->siglock);
- child->exit_code = data;
+ child->ptrace_code = data;
wake_up_state(child, __TASK_TRACED);
if (need_siglock)
spin_unlock_irq(&child->sighand->siglock);
diff --git a/kernel/signal.c b/kernel/signal.c
index 9903ff12e581..fd3c404de8b6 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2168,7 +2168,7 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
info.si_status = tsk->signal->group_exit_code & 0x7f;
break;
case CLD_TRAPPED:
- info.si_status = tsk->exit_code & 0x7f;
+ info.si_status = tsk->ptrace_code & 0x7f;
break;
default:
BUG();
@@ -2198,7 +2198,7 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
* with. If the code did not stop because the tracer is gone,
* the stop signal remains unchanged unless clear_code.
*/
-static int ptrace_stop(int exit_code, int why, int clear_code,
+static int ptrace_stop(int code, int why, int clear_code,
unsigned long message, kernel_siginfo_t *info)
__releases(&current->sighand->siglock)
__acquires(&current->sighand->siglock)
@@ -2248,7 +2248,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,

current->ptrace_message = message;
current->last_siginfo = info;
- current->exit_code = exit_code;
+ current->ptrace_code = code;

/*
* If @why is CLD_STOPPED, we're trapping to participate in a group
@@ -2315,7 +2315,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
__set_current_state(TASK_RUNNING);
read_code = false;
if (clear_code)
- exit_code = 0;
+ code = 0;
read_unlock(&tasklist_lock);
}

@@ -2325,10 +2325,10 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
* any signal-sending on another CPU that wants to examine it.
*/
spin_lock_irq(&current->sighand->siglock);
- if (read_code) exit_code = current->exit_code;
+ if (read_code) code = current->ptrace_code;
current->last_siginfo = NULL;
current->ptrace_message = 0;
- current->exit_code = 0;
+ current->ptrace_code = 0;

/* LISTENING can be set only during STOP traps, clear it */
current->jobctl &= ~JOBCTL_LISTENING;
@@ -2339,7 +2339,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
* This sets TIF_SIGPENDING, but never clears it.
*/
recalc_sigpending_tsk(current);
- return exit_code;
+ return code;
}

static int ptrace_do_notify(int signr, int exit_code, int why, unsigned long message)
@@ -2501,11 +2501,11 @@ static bool do_signal_stop(int signr)
*
* When PT_SEIZED, it's used for both group stop and explicit
* SEIZE/INTERRUPT traps. Both generate PTRACE_EVENT_STOP trap with
- * accompanying siginfo. If stopped, lower eight bits of exit_code contain
+ * accompanying siginfo. If stopped, lower eight bits of ptrace_code contain
* the stop signal; otherwise, %SIGTRAP.
*
* When !PT_SEIZED, it's used only for group stop trap with stop signal
- * number as exit_code and no siginfo.
+ * number as ptrace_code and no siginfo.
*
* CONTEXT:
* Must be called with @current->sighand->siglock held, which may be
--
2.29.2


2022-01-03 21:34:19

by Eric W. Biederman

Subject: [PATCH 12/17] signal: Compute the process exit_code in get_signal

In preparation for moving the work of sys_exit and sys_group_exit into
get_signal, compute exit_code in get_signal, make PF_SIGNALED depend on
the exit_code, and pass the exit_code to do_group_exit.

Anytime there is a group exit the exit_code may differ from the signal
number.

To match the historical precedent as best I can, make the exit_code 0
during exec. (The exit_code field would not have been set but probably
would have been left at a value of 0.)

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/signal.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index fd3c404de8b6..2a24cca00ca1 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2707,6 +2707,7 @@ bool get_signal(struct ksignal *ksig)
for (;;) {
struct k_sigaction *ka;
enum pid_type type;
+ int exit_code;

/* Has this task already been marked for death? */
if ((signal->flags & SIGNAL_GROUP_EXIT) ||
@@ -2716,6 +2717,10 @@ bool get_signal(struct ksignal *ksig)
trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
&sighand->action[SIGKILL - 1]);
recalc_sigpending();
+ if (signal->flags & SIGNAL_GROUP_EXIT)
+ exit_code = signal->group_exit_code;
+ else
+ exit_code = 0;
goto fatal;
}

@@ -2837,15 +2842,17 @@ bool get_signal(struct ksignal *ksig)
continue;
}

+ /*
+ * Anything else is fatal, maybe with a core dump.
+ */
+ exit_code = signr;
fatal:
spin_unlock_irq(&sighand->siglock);
if (unlikely(cgroup_task_frozen(current)))
cgroup_leave_frozen(true);

- /*
- * Anything else is fatal, maybe with a core dump.
- */
- current->flags |= PF_SIGNALED;
+ if (exit_code & 0x7f)
+ current->flags |= PF_SIGNALED;

if (sig_kernel_coredump(signr)) {
if (print_fatal_signals)
@@ -2873,7 +2880,7 @@ bool get_signal(struct ksignal *ksig)
/*
* Death signals, no core dump.
*/
- do_group_exit(ksig->info.si_signo);
+ do_group_exit(exit_code);
/* NOTREACHED */
}
spin_unlock_irq(&sighand->siglock);
--
2.29.2
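
The "exit_code & 0x7f" test in the patch above relies on the usual
layout of an exit code: the low seven bits carry the terminating signal
number, bit 0x80 is the core dump marker that coredump_finish ORs in,
and a plain exit status sits in the next byte. A minimal userspace
sketch of that layout follows. The helper names are made up for
illustration and are not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helpers, not kernel API: they only model the bit layout. */

/* Low seven bits: terminating signal number, 0 if the task was not signaled. */
static inline int model_exit_code_signal(int exit_code)
{
	return exit_code & 0x7f;
}

/* Mirrors the "if (exit_code & 0x7f)" test that now guards PF_SIGNALED. */
static inline bool model_exit_code_signaled(int exit_code)
{
	return model_exit_code_signal(exit_code) != 0;
}

/* Bit 0x80 is the core dump marker. */
static inline bool model_exit_code_dumped_core(int exit_code)
{
	return (exit_code & 0x80) != 0;
}

/* Small self check showing the cases discussed in the commit message. */
static inline void model_exit_code_self_check(void)
{
	assert(model_exit_code_signaled(9));         /* killed by SIGKILL */
	assert(!model_exit_code_signaled(0));        /* exec case: code 0 */
	assert(!model_exit_code_signaled(1 << 8));   /* exit(1): status in bits 8-15 */
	assert(model_exit_code_dumped_core(11 | 0x80));
}
```

With this layout a group exit started by exit_group(2) leaves the low
seven bits clear, so the task is not marked PF_SIGNALED even though it
reaches the fatal label.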


2022-01-03 21:34:22

by Eric W. Biederman

Subject: [PATCH 13/17] signal: Make individual tasks exiting a first class concept

Add a helper schedule_task_exit_locked that is equivalent to
asynchronously calling exit(2) except for not having an exit code.

This is a generalization of what happens in de_thread, zap_process,
prepare_signal, complete_signal, and zap_other_threads when individual
tasks are asked to shut down.

The various code paths optimize away the calls to sigaddset and
signal_wake_up based on different conditions. Neither sigaddset nor
signal_wake_up is needed if the task has already started running
do_exit, so skip the work if PF_POSTCOREDUMP is set, which is the
earliest condition any of the original hand-rolled implementations used.

Update get_signal to detect either signal group exit or a single task
exit by testing for __fatal_signal_pending. This works because all of
the tasks in a group exit are killed with schedule_task_exit_locked.

For clarity the code in get_signal has been updated to call do_exit
instead of do_group_exit when a single task is exiting.

While schedule_task_exit_locked is a generalization of what happens in
prepare_signal, I do not change prepare_signal to use
schedule_task_exit_locked to deliver SIGKILL to a coredumping process.
This keeps all of the specialness of delivering a signal to a
coredumping process limited to prepare_signal and the coredump code
itself.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 7 ++-----
include/linux/sched/signal.h | 2 ++
kernel/signal.c | 36 +++++++++++++++++++++---------------
3 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 09302a6a0d80..9559e29daada 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -358,12 +358,9 @@ static int zap_process(struct task_struct *start, int exit_code)
start->signal->group_stop_count = 0;

for_each_thread(start, t) {
- task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
- if (t != current && !(t->flags & PF_POSTCOREDUMP)) {
- sigaddset(&t->pending.signal, SIGKILL);
- signal_wake_up(t, 1);
+ schedule_task_exit_locked(t);
+ if (t != current && !(t->flags & PF_POSTCOREDUMP))
nr++;
- }
}

return nr;
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index b6ecb9fc4cd2..7c62b7c29cc0 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -427,6 +427,8 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
signal_wake_up_state(t, resume ? __TASK_TRACED : 0);
}

+void schedule_task_exit_locked(struct task_struct *task);
+
void task_join_group_stop(struct task_struct *task);

#ifdef TIF_RESTORE_SIGMASK
diff --git a/kernel/signal.c b/kernel/signal.c
index 2a24cca00ca1..cbfb9020368e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1056,9 +1056,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
signal->group_stop_count = 0;
t = p;
do {
- task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
- sigaddset(&t->pending.signal, SIGKILL);
- signal_wake_up(t, 1);
+ schedule_task_exit_locked(t);
} while_each_thread(p, t);
return;
}
@@ -1363,6 +1361,16 @@ int force_sig_info(struct kernel_siginfo *info)
return force_sig_info_to_task(info, current, HANDLER_CURRENT);
}

+void schedule_task_exit_locked(struct task_struct *task)
+{
+ task_clear_jobctl_pending(task, JOBCTL_PENDING_MASK);
+ /* Only bother with threads that might be alive */
+ if (!(task->flags & PF_POSTCOREDUMP)) {
+ sigaddset(&task->pending.signal, SIGKILL);
+ signal_wake_up(task, 1);
+ }
+}
+
/*
* Nuke all other threads in the group.
*/
@@ -1374,16 +1382,9 @@ int zap_other_threads(struct task_struct *p)
p->signal->group_stop_count = 0;

while_each_thread(p, t) {
- task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
count++;
-
- /* Don't bother with already dead threads */
- if (t->exit_state)
- continue;
- sigaddset(&t->pending.signal, SIGKILL);
- signal_wake_up(t, 1);
+ schedule_task_exit_locked(t);
}
-
return count;
}

@@ -2706,12 +2707,12 @@ bool get_signal(struct ksignal *ksig)

for (;;) {
struct k_sigaction *ka;
+ bool group_exit = true;
enum pid_type type;
int exit_code;

/* Has this task already been marked for death? */
- if ((signal->flags & SIGNAL_GROUP_EXIT) ||
- signal->group_exec_task) {
+ if (__fatal_signal_pending(current)) {
ksig->info.si_signo = signr = SIGKILL;
sigdelset(&current->pending.signal, SIGKILL);
trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
@@ -2719,8 +2720,10 @@ bool get_signal(struct ksignal *ksig)
recalc_sigpending();
if (signal->flags & SIGNAL_GROUP_EXIT)
exit_code = signal->group_exit_code;
- else
+ else {
exit_code = 0;
+ group_exit = false;
+ }
goto fatal;
}

@@ -2880,7 +2883,10 @@ bool get_signal(struct ksignal *ksig)
/*
* Death signals, no core dump.
*/
- do_group_exit(exit_code);
+ if (group_exit)
+ do_group_exit(exit_code);
+ else
+ do_exit(exit_code);
/* NOTREACHED */
}
spin_unlock_irq(&sighand->siglock);
--
2.29.2
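
The behaviour of schedule_task_exit_locked as introduced in the patch
above can be sketched in userspace as follows. This is only a model
under stated assumptions: struct model_task and its fields are invented
stand-ins for the task_struct state the helper touches, and the wakeup
is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; none of these names are kernel API. */
#define MODEL_PF_POSTCOREDUMP 0x1u

struct model_task {
	unsigned int flags;          /* models task->flags */
	unsigned int jobctl_pending; /* models the JOBCTL_PENDING_MASK bits */
	bool sigkill_pending;        /* models SIGKILL in task->pending.signal */
	int wakeups;                 /* counts modelled signal_wake_up() calls */
};

/* Models schedule_task_exit_locked() at this point in the series. */
static inline void model_schedule_task_exit(struct model_task *t)
{
	/* Pending job control work is always dropped. */
	t->jobctl_pending = 0;
	/* Only bother with threads that might be alive. */
	if (!(t->flags & MODEL_PF_POSTCOREDUMP)) {
		t->sigkill_pending = true;
		t->wakeups++;
	}
}

/* Small self check: a live task is marked and woken, a late one is not. */
static inline void model_schedule_task_exit_self_check(void)
{
	struct model_task live = { 0, 3, false, 0 };
	struct model_task late = { MODEL_PF_POSTCOREDUMP, 3, false, 0 };

	model_schedule_task_exit(&live);
	model_schedule_task_exit(&late);
	assert(live.sigkill_pending && live.wakeups == 1);
	assert(!late.sigkill_pending && late.wakeups == 0);
	assert(live.jobctl_pending == 0 && late.jobctl_pending == 0);
}
```

Note that in this version a second call on a live task would wake it
again; nothing here detects that an exit was already scheduled.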


2022-01-03 21:34:28

by Eric W. Biederman

Subject: [PATCH 14/17] signal: Remove zap_other_threads

The two callers of zap_other_threads want different things. The
function do_group_exit wants to set the exit code and it does not care
about the number of threads exiting. In de_thread the current thread
is not exiting so there is not really an exit code.

Since schedule_task_exit_locked factors out the tricky bits, stop
sharing the loop in zap_other_threads between de_thread and
do_group_exit.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/exec.c | 12 +++++++++---
include/linux/sched/signal.h | 1 -
kernel/exit.c | 9 ++++++++-
kernel/signal.c | 17 -----------------
4 files changed, 17 insertions(+), 22 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 82db656ca709..b9f646fddc51 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1037,6 +1037,7 @@ static int de_thread(struct task_struct *tsk)
struct signal_struct *sig = tsk->signal;
struct sighand_struct *oldsighand = tsk->sighand;
spinlock_t *lock = &oldsighand->siglock;
+ struct task_struct *t;

if (thread_group_empty(tsk))
goto no_thread_group;
@@ -1055,9 +1056,14 @@ static int de_thread(struct task_struct *tsk)
}

sig->group_exec_task = tsk;
- sig->notify_count = zap_other_threads(tsk);
- if (!thread_group_leader(tsk))
- sig->notify_count--;
+ sig->group_stop_count = 0;
+ sig->notify_count = 0;
+ __for_each_thread(sig, t) {
+ if (t == tsk)
+ continue;
+ sig->notify_count++;
+ schedule_task_exit_locked(t);
+ }

while (sig->notify_count) {
__set_current_state(TASK_KILLABLE);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 7c62b7c29cc0..eed54f9ea2fc 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -343,7 +343,6 @@ extern void force_sig(int);
extern void force_fatal_sig(int);
extern void force_exit_sig(int);
extern int send_sig(int, struct task_struct *, int);
-extern int zap_other_threads(struct task_struct *p);
extern struct sigqueue *sigqueue_alloc(void);
extern void sigqueue_free(struct sigqueue *);
extern int send_sigqueue(struct sigqueue *, struct pid *, enum pid_type);
diff --git a/kernel/exit.c b/kernel/exit.c
index aedefe5eb0eb..27bc0ccfea78 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -918,9 +918,16 @@ do_group_exit(int exit_code)
else if (sig->group_exec_task)
exit_code = 0;
else {
+ struct task_struct *t;
+
sig->group_exit_code = exit_code;
sig->flags = SIGNAL_GROUP_EXIT;
- zap_other_threads(current);
+ sig->group_stop_count = 0;
+ __for_each_thread(sig, t) {
+ if (t == current)
+ continue;
+ schedule_task_exit_locked(t);
+ }
}
spin_unlock_irq(&sighand->siglock);
}
diff --git a/kernel/signal.c b/kernel/signal.c
index cbfb9020368e..b0201e05be40 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1371,23 +1371,6 @@ void schedule_task_exit_locked(struct task_struct *task)
}
}

-/*
- * Nuke all other threads in the group.
- */
-int zap_other_threads(struct task_struct *p)
-{
- struct task_struct *t = p;
- int count = 0;
-
- p->signal->group_stop_count = 0;
-
- while_each_thread(p, t) {
- count++;
- schedule_task_exit_locked(t);
- }
- return count;
-}
-
struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
unsigned long *flags)
{
--
2.29.2


2022-01-03 21:34:31

by Eric W. Biederman

Subject: [PATCH 16/17] signal: Record the exit_code when an exit is scheduled

With ptrace_stop no longer using task->exit_code it is safe
to set task->exit_code when an exit is scheduled.

Use the bit JOBCTL_WILL_EXIT to detect when an exit is first scheduled
and only set exit_code the first time. Only use the code provided
to do_exit if the task has not yet been scheduled to exit.

In get_signal and do_group_exit when JOBCTL_WILL_EXIT is set read the
recorded exit_code from current->exit_code, instead of assuming
exit_code will always be 0.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 2 +-
fs/exec.c | 2 +-
include/linux/sched/signal.h | 2 +-
kernel/exit.c | 12 ++++++++----
kernel/signal.c | 7 ++++---
5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 4e82ee51633d..c54b502bf648 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -357,7 +357,7 @@ static int zap_process(struct task_struct *start, int exit_code)
start->signal->group_stop_count = 0;

for_each_thread(start, t) {
- schedule_task_exit_locked(t);
+ schedule_task_exit_locked(t, exit_code);
if (t != current && !(t->flags & PF_POSTCOREDUMP))
nr++;
}
diff --git a/fs/exec.c b/fs/exec.c
index b9f646fddc51..3203605e54cb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1062,7 +1062,7 @@ static int de_thread(struct task_struct *tsk)
if (t == tsk)
continue;
sig->notify_count++;
- schedule_task_exit_locked(t);
+ schedule_task_exit_locked(t, 0);
}

while (sig->notify_count) {
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 989bb665f107..e8034ecaee84 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -426,7 +426,7 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
signal_wake_up_state(t, resume ? __TASK_TRACED : 0);
}

-void schedule_task_exit_locked(struct task_struct *task);
+void schedule_task_exit_locked(struct task_struct *task, int exit_code);

void task_join_group_stop(struct task_struct *task);

diff --git a/kernel/exit.c b/kernel/exit.c
index 7a7a0cbac28e..e95500e2d27c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -735,6 +735,11 @@ void __noreturn do_exit(long code)
struct task_struct *tsk = current;
int group_dead;

+ spin_lock_irq(&tsk->sighand->siglock);
+ schedule_task_exit_locked(tsk, code);
+ code = tsk->exit_code;
+ spin_unlock_irq(&tsk->sighand->siglock);
+
WARN_ON(blk_needs_flush_plug(tsk));

kcov_task_exit(tsk);
@@ -773,7 +778,6 @@ void __noreturn do_exit(long code)
tty_audit_exit();
audit_free(tsk);

- tsk->exit_code = code;
taskstats_exit(tsk, group_dead);

exit_mm();
@@ -907,7 +911,7 @@ do_group_exit(int exit_code)
if (sig->flags & SIGNAL_GROUP_EXIT)
exit_code = sig->group_exit_code;
else if (current->jobctl & JOBCTL_WILL_EXIT)
- exit_code = 0;
+ exit_code = current->exit_code;
else if (!thread_group_empty(current)) {
struct sighand_struct *const sighand = current->sighand;

@@ -916,7 +920,7 @@ do_group_exit(int exit_code)
/* Another thread got here before we took the lock. */
exit_code = sig->group_exit_code;
else if (current->jobctl & JOBCTL_WILL_EXIT)
- exit_code = 0;
+ exit_code = current->exit_code;
else {
struct task_struct *t;

@@ -926,7 +930,7 @@ do_group_exit(int exit_code)
__for_each_thread(sig, t) {
if (t == current)
continue;
- schedule_task_exit_locked(t);
+ schedule_task_exit_locked(t, exit_code);
}
}
spin_unlock_irq(&sighand->siglock);
diff --git a/kernel/signal.c b/kernel/signal.c
index 6179e34ce666..e8fac8a3c935 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1057,7 +1057,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
signal->group_stop_count = 0;
t = p;
do {
- schedule_task_exit_locked(t);
+ schedule_task_exit_locked(t, sig);
} while_each_thread(p, t);
return;
}
@@ -1362,11 +1362,12 @@ int force_sig_info(struct kernel_siginfo *info)
return force_sig_info_to_task(info, current, HANDLER_CURRENT);
}

-void schedule_task_exit_locked(struct task_struct *task)
+void schedule_task_exit_locked(struct task_struct *task, int exit_code)
{
if (!(task->jobctl & JOBCTL_WILL_EXIT)) {
task_clear_jobctl_pending(task, JOBCTL_PENDING_MASK);
task->jobctl |= JOBCTL_WILL_EXIT;
+ task->exit_code = exit_code;
signal_wake_up(task, 1);
}
}
@@ -2703,7 +2704,7 @@ bool get_signal(struct ksignal *ksig)
if (signal->flags & SIGNAL_GROUP_EXIT)
exit_code = signal->group_exit_code;
else {
- exit_code = 0;
+ exit_code = current->exit_code;
group_exit = false;
}
goto fatal;
--
2.29.2
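
The "first scheduled exit wins" rule from the patch above can be
sketched as a small userspace model. The names below are invented for
illustration; only the bit position 31 and the order of operations are
taken from the diffs in this series.

```c
#include <assert.h>

/* Hypothetical model; MODEL_WILL_EXIT stands in for JOBCTL_WILL_EXIT. */
#define MODEL_WILL_EXIT (1UL << 31)

struct model_task {
	unsigned long jobctl; /* models task->jobctl */
	int exit_code;        /* models task->exit_code */
};

/* Records exit_code only the first time an exit is scheduled. */
static inline void model_schedule_task_exit(struct model_task *t, int exit_code)
{
	if (!(t->jobctl & MODEL_WILL_EXIT)) {
		t->jobctl |= MODEL_WILL_EXIT;
		t->exit_code = exit_code;
	}
}

/* Models the top of do_exit(): schedule with our code, then read back. */
static inline int model_do_exit_code(struct model_task *t, int code)
{
	model_schedule_task_exit(t, code);
	return t->exit_code;
}

/* Small self check of both orderings. */
static inline void model_first_code_self_check(void)
{
	struct model_task killed = { 0, 0 };
	struct model_task plain = { 0, 0 };

	/* Another thread scheduled the exit first, so its code sticks. */
	model_schedule_task_exit(&killed, 9);
	assert(model_do_exit_code(&killed, 0x100) == 9);

	/* Nobody scheduled an exit, so the code passed to do_exit is used. */
	assert(model_do_exit_code(&plain, 0x100) == 0x100);
}
```

In the kernel the flag and the code are both updated under siglock, which
is what makes the read-back in do_exit consistent; the model above is
single threaded and leaves the locking out.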


2022-01-03 21:34:34

by Eric W. Biederman

Subject: [PATCH 15/17] signal: Add JOBCTL_WILL_EXIT to mark exiting tasks

Mark tasks that need to exit with JOBCTL_WILL_EXIT instead of reusing
the per thread SIGKILL.

This removes the double meaning of the per thread SIGKILL and makes it
possible to detect when a task has already been scheduled for exiting
and to skip unnecessary work if the task is already scheduled to exit.

A jobctl flag was chosen for this purpose because jobctl changes are
protected by siglock, and updates are already careful not to change or
clear other bits in jobctl. Protection by a lock when changing the
value is necessary as JOBCTL_WILL_EXIT will not be limited to being
set by the current task. That task->jobctl is protected by siglock is
convenient as siglock is already held everywhere I want to set or reset
JOBCTL_WILL_EXIT.

Teach wants_signal and retarget_shared_pending to use
JOBCTL_WILL_EXIT to detect threads that have an exit pending and so
will not be processing any more signals.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 6 ++++--
include/linux/sched/jobctl.h | 2 ++
include/linux/sched/signal.h | 2 +-
kernel/exit.c | 4 ++--
kernel/signal.c | 19 +++++++++----------
5 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 9559e29daada..4e82ee51633d 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -352,7 +352,6 @@ static int zap_process(struct task_struct *start, int exit_code)
struct task_struct *t;
int nr = 0;

- /* Allow SIGKILL, see prepare_signal() */
start->signal->flags = SIGNAL_GROUP_EXIT;
start->signal->group_exit_code = exit_code;
start->signal->group_stop_count = 0;
@@ -376,9 +375,11 @@ static int zap_threads(struct task_struct *tsk,
if (!(signal->flags & SIGNAL_GROUP_EXIT) && !signal->group_exec_task) {
signal->core_state = core_state;
nr = zap_process(tsk, exit_code);
+ atomic_set(&core_state->nr_threads, nr);
+ /* Allow SIGKILL, see prepare_signal() */
clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
tsk->flags |= PF_DUMPCORE;
- atomic_set(&core_state->nr_threads, nr);
+ tsk->jobctl &= ~JOBCTL_WILL_EXIT;
}
spin_unlock_irq(&tsk->sighand->siglock);
return nr;
@@ -425,6 +426,7 @@ static void coredump_finish(bool core_dumped)
current->signal->group_exit_code |= 0x80;
next = current->signal->core_state->dumper.next;
current->signal->core_state = NULL;
+ current->jobctl |= JOBCTL_WILL_EXIT;
spin_unlock_irq(&current->sighand->siglock);

while ((curr = next) != NULL) {
diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
index fa067de9f1a9..9887d737ccfb 100644
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -19,6 +19,7 @@ struct task_struct;
#define JOBCTL_TRAPPING_BIT 21 /* switching to TRACED */
#define JOBCTL_LISTENING_BIT 22 /* ptracer is listening for events */
#define JOBCTL_TRAP_FREEZE_BIT 23 /* trap for cgroup freezer */
+#define JOBCTL_WILL_EXIT_BIT 31 /* task will exit */

#define JOBCTL_STOP_DEQUEUED (1UL << JOBCTL_STOP_DEQUEUED_BIT)
#define JOBCTL_STOP_PENDING (1UL << JOBCTL_STOP_PENDING_BIT)
@@ -28,6 +29,7 @@ struct task_struct;
#define JOBCTL_TRAPPING (1UL << JOBCTL_TRAPPING_BIT)
#define JOBCTL_LISTENING (1UL << JOBCTL_LISTENING_BIT)
#define JOBCTL_TRAP_FREEZE (1UL << JOBCTL_TRAP_FREEZE_BIT)
+#define JOBCTL_WILL_EXIT (1UL << JOBCTL_WILL_EXIT_BIT)

#define JOBCTL_TRAP_MASK (JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
#define JOBCTL_PENDING_MASK (JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index eed54f9ea2fc..989bb665f107 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -373,7 +373,7 @@ static inline int signal_pending(struct task_struct *p)

static inline int __fatal_signal_pending(struct task_struct *p)
{
- return unlikely(sigismember(&p->pending.signal, SIGKILL));
+ return unlikely(p->jobctl & JOBCTL_WILL_EXIT);
}

static inline int fatal_signal_pending(struct task_struct *p)
diff --git a/kernel/exit.c b/kernel/exit.c
index 27bc0ccfea78..7a7a0cbac28e 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -906,7 +906,7 @@ do_group_exit(int exit_code)

if (sig->flags & SIGNAL_GROUP_EXIT)
exit_code = sig->group_exit_code;
- else if (sig->group_exec_task)
+ else if (current->jobctl & JOBCTL_WILL_EXIT)
exit_code = 0;
else if (!thread_group_empty(current)) {
struct sighand_struct *const sighand = current->sighand;
@@ -915,7 +915,7 @@ do_group_exit(int exit_code)
if (sig->flags & SIGNAL_GROUP_EXIT)
/* Another thread got here before we took the lock. */
exit_code = sig->group_exit_code;
- else if (sig->group_exec_task)
+ else if (current->jobctl & JOBCTL_WILL_EXIT)
exit_code = 0;
else {
struct task_struct *t;
diff --git a/kernel/signal.c b/kernel/signal.c
index b0201e05be40..6179e34ce666 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -153,7 +153,8 @@ static inline bool has_pending_signals(sigset_t *signal, sigset_t *blocked)

static bool recalc_sigpending_tsk(struct task_struct *t)
{
- if ((t->jobctl & (JOBCTL_PENDING_MASK | JOBCTL_TRAP_FREEZE)) ||
+ if ((t->jobctl & (JOBCTL_PENDING_MASK | JOBCTL_TRAP_FREEZE |
+ JOBCTL_WILL_EXIT)) ||
PENDING(&t->pending, &t->blocked) ||
PENDING(&t->signal->shared_pending, &t->blocked) ||
cgroup_task_frozen(t)) {
@@ -911,7 +912,7 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
if (core_state) {
if (sig == SIGKILL) {
struct task_struct *dumper = core_state->dumper.task;
- sigaddset(&dumper->pending.signal, SIGKILL);
+ dumper->jobctl |= JOBCTL_WILL_EXIT;
signal_wake_up(dumper, 1);
}
}
@@ -985,7 +986,7 @@ static inline bool wants_signal(int sig, struct task_struct *p)
if (sigismember(&p->blocked, sig))
return false;

- if (p->flags & PF_EXITING)
+ if (p->jobctl & JOBCTL_WILL_EXIT)
return false;

if (sig == SIGKILL)
@@ -1363,10 +1364,9 @@ int force_sig_info(struct kernel_siginfo *info)

void schedule_task_exit_locked(struct task_struct *task)
{
- task_clear_jobctl_pending(task, JOBCTL_PENDING_MASK);
- /* Only bother with threads that might be alive */
- if (!(task->flags & PF_POSTCOREDUMP)) {
- sigaddset(&task->pending.signal, SIGKILL);
+ if (!(task->jobctl & JOBCTL_WILL_EXIT)) {
+ task_clear_jobctl_pending(task, JOBCTL_PENDING_MASK);
+ task->jobctl |= JOBCTL_WILL_EXIT;
signal_wake_up(task, 1);
}
}
@@ -2695,9 +2695,8 @@ bool get_signal(struct ksignal *ksig)
int exit_code;

/* Has this task already been marked for death? */
- if (__fatal_signal_pending(current)) {
+ if (current->jobctl & JOBCTL_WILL_EXIT) {
ksig->info.si_signo = signr = SIGKILL;
- sigdelset(&current->pending.signal, SIGKILL);
trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
&sighand->action[SIGKILL - 1]);
recalc_sigpending();
@@ -2935,7 +2934,7 @@ static void retarget_shared_pending(struct task_struct *tsk, sigset_t *which)

t = tsk;
while_each_thread(tsk, t) {
- if (t->flags & PF_EXITING)
+ if (t->jobctl & JOBCTL_WILL_EXIT)
continue;

if (!has_pending_signals(&retarget, &t->blocked))
--
2.29.2
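
The reason a jobctl bit works as the marker is that setting and
clearing it leaves every other jobctl bit alone. A minimal sketch,
assuming nothing beyond the two bit positions visible in the jobctl.h
hunk above; the MODEL_ names and helper functions are made up and are
not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit positions copied from the jobctl.h hunk; the MODEL_ names are made up. */
#define MODEL_LISTENING_BIT 22
#define MODEL_WILL_EXIT_BIT 31
#define MODEL_LISTENING (1UL << MODEL_LISTENING_BIT)
#define MODEL_WILL_EXIT (1UL << MODEL_WILL_EXIT_BIT)

/* Models the new body of __fatal_signal_pending(). */
static inline bool model_fatal_signal_pending(unsigned long jobctl)
{
	return (jobctl & MODEL_WILL_EXIT) != 0;
}

/* Models "jobctl |= JOBCTL_WILL_EXIT", as done when an exit is scheduled. */
static inline unsigned long model_mark_will_exit(unsigned long jobctl)
{
	return jobctl | MODEL_WILL_EXIT;
}

/* Models "jobctl &= ~JOBCTL_WILL_EXIT", as zap_threads does for the dumper. */
static inline unsigned long model_clear_will_exit(unsigned long jobctl)
{
	return jobctl & ~MODEL_WILL_EXIT;
}

/* Small self check: unrelated bits survive both operations. */
static inline void model_jobctl_self_check(void)
{
	unsigned long jobctl = MODEL_LISTENING;

	jobctl = model_mark_will_exit(jobctl);
	assert(model_fatal_signal_pending(jobctl));
	assert(jobctl & MODEL_LISTENING);

	jobctl = model_clear_will_exit(jobctl);
	assert(!model_fatal_signal_pending(jobctl));
	assert(jobctl == MODEL_LISTENING);
}
```

The clear and later re-set of the bit corresponds to the dumper thread
in zap_threads and coredump_finish: the dumper must not look like it has
a fatal signal pending while it writes the core file.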


2022-01-03 21:34:39

by Eric W. Biederman

Subject: [PATCH 17/17] signal: Always set SIGNAL_GROUP_EXIT on process exit

Track how many threads have not started exiting and when
the last thread starts exiting set SIGNAL_GROUP_EXIT.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 4 ----
include/linux/sched/signal.h | 1 +
kernel/exit.c | 8 +-------
kernel/fork.c | 2 ++
kernel/signal.c | 10 +++++++---
5 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index c54b502bf648..029d0f98aa90 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -352,10 +352,6 @@ static int zap_process(struct task_struct *start, int exit_code)
struct task_struct *t;
int nr = 0;

- start->signal->flags = SIGNAL_GROUP_EXIT;
- start->signal->group_exit_code = exit_code;
- start->signal->group_stop_count = 0;
-
for_each_thread(start, t) {
schedule_task_exit_locked(t, exit_code);
if (t != current && !(t->flags & PF_POSTCOREDUMP))
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index e8034ecaee84..bd9435e934a1 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -94,6 +94,7 @@ struct signal_struct {
refcount_t sigcnt;
atomic_t live;
int nr_threads;
+ int quick_threads;
struct list_head thread_head;

wait_queue_head_t wait_chldexit; /* for wait4() */
diff --git a/kernel/exit.c b/kernel/exit.c
index e95500e2d27c..be867a12de65 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -924,14 +924,8 @@ do_group_exit(int exit_code)
else {
struct task_struct *t;

- sig->group_exit_code = exit_code;
- sig->flags = SIGNAL_GROUP_EXIT;
- sig->group_stop_count = 0;
- __for_each_thread(sig, t) {
- if (t == current)
- continue;
+ __for_each_thread(sig, t)
schedule_task_exit_locked(t, exit_code);
- }
}
spin_unlock_irq(&sighand->siglock);
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 6f0293cb29c9..d973189a4014 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1644,6 +1644,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
return -ENOMEM;

sig->nr_threads = 1;
+ sig->quick_threads = 1;
atomic_set(&sig->live, 1);
refcount_set(&sig->sigcnt, 1);

@@ -2383,6 +2384,7 @@ static __latent_entropy struct task_struct *copy_process(
__this_cpu_inc(process_counts);
} else {
current->signal->nr_threads++;
+ current->signal->quick_threads++;
atomic_inc(&current->signal->live);
refcount_inc(&current->signal->sigcnt);
task_join_group_stop(p);
diff --git a/kernel/signal.c b/kernel/signal.c
index e8fac8a3c935..9bd835fcb1dc 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1052,9 +1052,6 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
* running and doing things after a slower
* thread has the fatal signal pending.
*/
- signal->flags = SIGNAL_GROUP_EXIT;
- signal->group_exit_code = sig;
- signal->group_stop_count = 0;
t = p;
do {
schedule_task_exit_locked(t, sig);
@@ -1365,10 +1362,17 @@ int force_sig_info(struct kernel_siginfo *info)
void schedule_task_exit_locked(struct task_struct *task, int exit_code)
{
if (!(task->jobctl & JOBCTL_WILL_EXIT)) {
+ struct signal_struct *signal = task->signal;
task_clear_jobctl_pending(task, JOBCTL_PENDING_MASK);
task->jobctl |= JOBCTL_WILL_EXIT;
task->exit_code = exit_code;
signal_wake_up(task, 1);
+ signal->quick_threads--;
+ if (signal->quick_threads == 0) {
+ signal->flags = SIGNAL_GROUP_EXIT;
+ signal->group_exit_code = exit_code;
+ signal->group_stop_count = 0;
+ }
}
}

--
2.29.2


2022-01-04 06:30:51

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

14.12.2021 01:53, Eric W. Biederman wrote:
> Simplify the code that allows SIGKILL during coredumps to terminate
> the coredump. As far as I can tell I have avoided breaking it
> by dumb luck.
>
> Historically with all of the other threads stopping in exit_mm the
> wants_signal loop in complete_signal would find the dumper task and
> then complete_signal would wake the dumper task with signal_wake_up.
>
> After moving the coredump_task_exit above the setting of PF_EXITING in
> commit 92307383082d ("coredump: Don't perform any cleanups before
> dumping core") wants_signal will consider all of the threads in a
> multi-threaded process for waking up, not just the core dumping task.
>
> Luckily complete_signal short circuits SIGKILL during a coredump: it marks
> every thread with SIGKILL and calls signal_wake_up. This code is arguably
> buggy however, as it tries to skip creating a group exit when one is already
> present, and it fails to notice that a coredump is in progress.
>
> Ever since commit 06af8679449d ("coredump: Limit what can interrupt
> coredumps") was added dump_interrupted needs not just TIF_SIGPENDING
> set on the dumper task but also SIGKILL set in its pending bitmap.
> This means that if the code is ever fixed not to short-circuit and
> kill a process after it has already been killed the special case
> for SIGKILL during a coredump will be broken.
>
> Sort all of this out by making the coredump special case more special,
> and perform all of the work in prepare_signal and leave the rest of
> the signal delivery path out of it.
>
> In prepare_signal when the process coredumping is sent SIGKILL find
> the task performing the coredump and use sigaddset and signal_wake_up
> to ensure that task reports fatal_signal_pending.
>
> Return false from prepare_signal to tell the rest of the signal
> delivery path to ignore the signal.
>
> Update wait_for_dump_helpers to perform a wait_event_killable wait
> so that if signal_pending gets set spuriously the wait will not
> be interrupted unless fatal_signal_pending is true.
>
> I have tested this and verified I did not break SIGKILL during
> coredumps by accident (before or after this change). I actually
> thought I had and I had to figure out what I had misread that kept
> SIGKILL during coredumps working.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/coredump.c | 4 ++--
> kernel/signal.c | 11 +++++++++--
> 2 files changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index a6b3c196cdef..7b91fb32dbb8 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -448,7 +448,7 @@ static void coredump_finish(bool core_dumped)
> static bool dump_interrupted(void)
> {
> /*
> - * SIGKILL or freezing() interrupt the coredumping. Perhaps we
> + * SIGKILL or freezing() interrupted the coredumping. Perhaps we
> * can do try_to_freeze() and check __fatal_signal_pending(),
> * but then we need to teach dump_write() to restart and clear
> * TIF_SIGPENDING.
> @@ -471,7 +471,7 @@ static void wait_for_dump_helpers(struct file *file)
> * We actually want wait_event_freezable() but then we need
> * to clear TIF_SIGPENDING and improve dump_interrupted().
> */
> - wait_event_interruptible(pipe->rd_wait, pipe->readers == 1);
> + wait_event_killable(pipe->rd_wait, pipe->readers == 1);
>
> pipe_lock(pipe);
> pipe->readers--;
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 8272cac5f429..7e305a8ec7c2 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -907,8 +907,15 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
> sigset_t flush;
>
> if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
> - if (!(signal->flags & SIGNAL_GROUP_EXIT))
> - return sig == SIGKILL;
> + struct core_state *core_state = signal->core_state;
> + if (core_state) {
> + if (sig == SIGKILL) {
> + struct task_struct *dumper = core_state->dumper.task;
> + sigaddset(&dumper->pending.signal, SIGKILL);
> + signal_wake_up(dumper, 1);
> + }
> + return false;
> + }
> /*
> * The process is in the middle of dying, nothing to do.
> */
>

Hi,

This patch breaks userspace. In particular it breaks GStreamer's
gst-plugin-scanner, which now hangs on next-20211224. IIUC, this tool builds a
registry of good/working GStreamer plugins by loading them and
blacklisting those that don't work (crash). Before the hang I see a
systemd-coredump process running, taking a snapshot of gst-plugin-scanner,
and then gst-plugin-scanner gets stuck.

Bisection points at this patch; reverting it restores
gst-plugin-scanner. systemd-coredump still runs, but there is no hang
anymore and everything works properly as before.

I'm seeing this problem on ARM32 and haven't checked other arches.
Please fix, thanks in advance.

2022-01-04 07:38:08

by Christoph Hellwig

[permalink] [raw]

2022-01-04 16:18:49

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Dmitry Osipenko <[email protected]> writes:

> 14.12.2021 01:53, Eric W. Biederman wrote:
>> Simplify the code that allows SIGKILL during coredumps to terminate
>> the coredump. As far as I can tell I have avoided breaking it
>> by dumb luck.
>>
>> Historically with all of the other threads stopping in exit_mm the
>> wants_signal loop in complete_signal would find the dumper task and
>> then complete_signal would wake the dumper task with signal_wake_up.
>>
>> After moving the coredump_task_exit above the setting of PF_EXITING in
>> commit 92307383082d ("coredump: Don't perform any cleanups before
>> dumping core") wants_signal will consider all of the threads in a
>> multi-threaded process for waking up, not just the core dumping task.
>>
>> Luckily complete_signal short circuits SIGKILL during a coredump: it marks
>> every thread with SIGKILL and calls signal_wake_up. This code is arguably
>> buggy however, as it tries to skip creating a group exit when one is already
>> present, and it fails to notice that a coredump is in progress.
>>
>> Ever since commit 06af8679449d ("coredump: Limit what can interrupt
>> coredumps") was added dump_interrupted needs not just TIF_SIGPENDING
>> set on the dumper task but also SIGKILL set in its pending bitmap.
>> This means that if the code is ever fixed not to short-circuit and
>> kill a process after it has already been killed the special case
>> for SIGKILL during a coredump will be broken.
>>
>> Sort all of this out by making the coredump special case more special,
>> and perform all of the work in prepare_signal and leave the rest of
>> the signal delivery path out of it.
>>
>> In prepare_signal when the process coredumping is sent SIGKILL find
>> the task performing the coredump and use sigaddset and signal_wake_up
>> to ensure that task reports fatal_signal_pending.
>>
>> Return false from prepare_signal to tell the rest of the signal
>> delivery path to ignore the signal.
>>
>> Update wait_for_dump_helpers to perform a wait_event_killable wait
>> so that if signal_pending gets set spuriously the wait will not
>> be interrupted unless fatal_signal_pending is true.
>>
>> I have tested this and verified I did not break SIGKILL during
>> coredumps by accident (before or after this change). I actually
>> thought I had and I had to figure out what I had misread that kept
>> SIGKILL during coredumps working.
>>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> fs/coredump.c | 4 ++--
>> kernel/signal.c | 11 +++++++++--
>> 2 files changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/coredump.c b/fs/coredump.c
>> index a6b3c196cdef..7b91fb32dbb8 100644
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -448,7 +448,7 @@ static void coredump_finish(bool core_dumped)
>> static bool dump_interrupted(void)
>> {
>> /*
>> - * SIGKILL or freezing() interrupt the coredumping. Perhaps we
>> + * SIGKILL or freezing() interrupted the coredumping. Perhaps we
>> * can do try_to_freeze() and check __fatal_signal_pending(),
>> * but then we need to teach dump_write() to restart and clear
>> * TIF_SIGPENDING.
>> @@ -471,7 +471,7 @@ static void wait_for_dump_helpers(struct file *file)
>> * We actually want wait_event_freezable() but then we need
>> * to clear TIF_SIGPENDING and improve dump_interrupted().
>> */
>> - wait_event_interruptible(pipe->rd_wait, pipe->readers == 1);
>> + wait_event_killable(pipe->rd_wait, pipe->readers == 1);
>>
>> pipe_lock(pipe);
>> pipe->readers--;
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 8272cac5f429..7e305a8ec7c2 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -907,8 +907,15 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
>> sigset_t flush;
>>
>> if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
>> - if (!(signal->flags & SIGNAL_GROUP_EXIT))
>> - return sig == SIGKILL;
>> + struct core_state *core_state = signal->core_state;
>> + if (core_state) {
>> + if (sig == SIGKILL) {
>> + struct task_struct *dumper = core_state->dumper.task;
>> + sigaddset(&dumper->pending.signal, SIGKILL);
>> + signal_wake_up(dumper, 1);
>> + }
>> + return false;
>> + }
>> /*
>> * The process is in the middle of dying, nothing to do.
>> */
>>
>
> Hi,
>
> This patch breaks userspace, in particular it breaks gst-plugin-scanner
> of GStreamer which hangs now on next-20211224. IIUC, this tool builds a
> registry of good/working GStreamer plugins by loading them and
> blacklisting those that don't work (crash). Before the hang I see
> systemd-coredump process running, taking snapshot of gst-plugin-scanner
> and then gst-plugin-scanner gets stuck.
>
> Bisection points at this patch, reverting it restores
> gst-plugin-scanner. Systemd-coredump still running, but there is no hang
> anymore and everything works properly as before.
>
> I'm seeing this problem on ARM32 and haven't checked other arches.
> Please fix, thanks in advance.

That is weird.

Doubly weird really because this should only change the case where
coredumps are interrupted by SIGKILL.

What distro are you running? I would like to match your setup as closely
as I can, so that I can reproduce the issue, figure out what
is wrong, and fix it.

Eric

2022-01-04 18:45:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

On Mon, Dec 13, 2021 at 2:54 PM Eric W. Biederman <[email protected]> wrote:
>
>
> if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
> - if (!(signal->flags & SIGNAL_GROUP_EXIT))
> - return sig == SIGKILL;
> + struct core_state *core_state = signal->core_state;
> + if (core_state) {

This change is very confusing.

Also, why does it do that 'signal->core_state->dumper.task', when we
already know that it's the same as 'signal->group_exit_task'?

The only thing that sets 'signal->core_state' also sets
'signal->group_exit_task', and the call chain has set both to the same
task.

So the code is odd and makes little sense.

But what's even more odd is how it

(a) sends the SIGKILL to somebody else

(b) does *NOT* send SIGKILL to itself

Now, (a) is explained in the commit message. The intent is to signal
the core dumper.

But (b) looks like a fundamental change in semantics. The target of
the SIGKILL is still running, might be in some loop in the kernel that
wants to be interrupted by a fatal signal, and you expressly disabled
the code that would send that fatal signal.

If I send SIGKILL to thread A, then that SIGKILL had *better* be
delivered. To thread A, which may be in a "mutex_lock_killable()" or
whatever else.

The fact that thread B may be in the process of trying to dump core
doesn't change that at all, as far as I can see.

So I think this patch is fundamentally buggy and wrong. Or at least
needs much more explanation of why you'd not send SIGKILL to the
target thread.

Linus

2022-01-04 19:47:21

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Linus Torvalds <[email protected]> writes:

> On Mon, Dec 13, 2021 at 2:54 PM Eric W. Biederman <[email protected]> wrote:
>>
>>
>> if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
>> - if (!(signal->flags & SIGNAL_GROUP_EXIT))
>> - return sig == SIGKILL;
>> + struct core_state *core_state = signal->core_state;
>> + if (core_state) {
>
> This change is very confusing.
>
> Also, why does it do that 'signal->core_state->dumper.task', when we
> already know that it's the same as 'signal->group_exit_task'?
>
> The only thing that sets 'signal->core_state' also sets
> 'signal->group_exit_task', and the call chain has set both to the same
> task.
>
> So the code is odd and makes little sense.

As you say signal->group_exit_task, and core_state->dumper.task point to
the same task. So it may be a little silly when viewed independently of
everything else to use core_state->dumper.task instead of
group_exit_task as it is an extra cache line dereference.

The thing is signal->group_exit_task is only used by coredumps currently
as a flag to tell signal_group_exit to return true. It is exec that
actually uses signal->group_exit_task in conjunction with
signal->notify_count to wake itself up.

Using a pointer as a flag and not for its value, and having different
semantics for who sets the pointer, are weird enough that
I just want to make signal->group_exit_task go away.

By using core_state->dumper.task I was able to make
signal->group_exit_task exclusive to the exec case in the following
changes, and to rename it signal->group_exec_task so no one gets
confused what the field is for.

> But what's even more odd is how it
>
> (a) sends the SIGKILL to somebody else
>
> (b) does *NOT* send SIGKILL to itself
>
> Now, (a) is explained in the commit message. The intent is to signal
> the core dumper.

Which is a specific thread of the target process, and it is
the only thread of the target process still running.

> But (b) looks like a fundamental change in semantics. The target of
> the SIGKILL is still running, might be in some loop in the kernel that
> wants to be interrupted by a fatal signal, and you expressly disabled
> the code that would send that fatal signal.
>
> If I send SIGKILL to thread A, then that SIGKILL had *better* be
> delivered. To thread A, which may be in a "mutex_lock_killable()" or
> whatever else.
>
> The fact that thread B may be in the process of trying to dump core
> doesn't change that at all, as far as I can see.
>
> So I think this patch is fundamentally buggy and wrong. Or at least
> needs much more explanation of why you'd not send SIGKILL to the
> target thread.

If you look at zap_threads, you can observe that it takes the siglock,
sets SIGNAL_GROUP_COREDUMP, sets signal->core_state, and in
zap_process makes SIGKILL pending in the per-task sigset and calls
signal_wake_up on every task.

This case in prepare_signal happens after that. After every task
has been told to die, and __fatal_signal_pending is true for all of
them if they have not reached do_exit yet.



If you look in zap_threads you will see that the core dumping thread
clears TIF_SIGPENDING, and in general makes fatal_signal_pending false
for itself. But keep in mind that this thread because it is dumping
core is already on the path to do_exit. It has already processed a
fatal signal.


So in the special case I only worry about the dumping task as it is the
only task after zap_threads that does not have fatal_signal_pending.


This is different than the ordinary case of delivering SIGKILL
where complete_signal makes SIGKILL pending in the per-task sigset
of every task in the process.
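
To make the state after zap_threads concrete, here is a minimal userspace
sketch of the argument above. It is a toy model and not kernel code: the
kill_model_* names and the three-thread process are hypothetical, it
follows the description in this mail rather than the real zap_process
loop, and tasks that have already entered do_exit are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define KILL_MODEL_NTASKS 3	/* hypothetical process with three threads */

/* Toy stand-in for the per-task state discussed above. */
struct kill_model_task {
	bool sigkill_pending;	/* SIGKILL set in the per-task sigset */
	bool woken;		/* signal_wake_up() has been called */
};

struct kill_model_process {
	struct kill_model_task tasks[KILL_MODEL_NTASKS];
	struct kill_model_task *dumper;	/* NULL when no coredump is in progress */
};

/*
 * Model of the described zap_threads behaviour: every task gets SIGKILL
 * pending and a wakeup, then the dumper clears its own pending state.
 */
void kill_model_zap_threads(struct kill_model_process *p,
			    struct kill_model_task *dumper)
{
	size_t i;

	assert(p && dumper);
	for (i = 0; i < KILL_MODEL_NTASKS; i++) {
		p->tasks[i].sigkill_pending = true;
		p->tasks[i].woken = true;
	}
	p->dumper = dumper;
	dumper->sigkill_pending = false;
	dumper->woken = false;
}

/*
 * Model of the coredump branch added to prepare_signal for SIGKILL.
 * Returns true if the normal delivery path should continue.
 */
bool kill_model_prepare_sigkill(struct kill_model_process *p)
{
	if (p->dumper) {
		p->dumper->sigkill_pending = true;
		p->dumper->woken = true;
		return false;	/* the rest of signal delivery is skipped */
	}
	return true;
}

/* True if every modelled task has SIGKILL pending. */
bool kill_model_all_pending(const struct kill_model_process *p)
{
	size_t i;

	for (i = 0; i < KILL_MODEL_NTASKS; i++)
		if (!p->tasks[i].sigkill_pending)
			return false;
	return true;
}
```

In this model the dumper is the only task left without SIGKILL pending
after kill_model_zap_threads(), so touching only the dumper in
kill_model_prepare_sigkill() is enough to make kill_model_all_pending()
true again.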


Currently I suspect that changing wait_event_interruptible to
wait_event_killable is causing problems.

Or perhaps there is some reason tasks that have already entered do_exit
need to have fatal_signal_pending set. (They will have
fatal_signal_pending set up until they enter get_signal which calls
do_group_exit which calls do_exit).
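
The difference between the two waits in the patch can be sketched as
follows. This is a simplified userspace model under the assumption that
an interruptible sleep ends on any pending signal while a killable sleep
ends only when the pending signal is fatal; the wait_model_* names are
hypothetical and freezing is ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the two bits of task state that matter here. */
struct wait_model_task {
	bool sigpending;	/* stands in for TIF_SIGPENDING */
	bool sigkill_pending;	/* stands in for SIGKILL in the pending set */
};

/* Interruptible sleep: ends when the condition holds or any signal is pending. */
bool wait_model_interruptible_ends(const struct wait_model_task *t, bool cond)
{
	assert(t);
	return cond || t->sigpending;
}

/* Killable sleep: ends when the condition holds or a fatal signal is pending. */
bool wait_model_killable_ends(const struct wait_model_task *t, bool cond)
{
	assert(t);
	return cond || (t->sigpending && t->sigkill_pending);
}
```

A task with sigpending set but no SIGKILL pending leaves the
interruptible wait immediately and stays in the killable one until the
condition becomes true, which is the behavioural change under suspicion.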

Which is why I am trying to reproduce the reported failure so I can get
the kernel to tell me what is going on. If this is not resolved quickly
I won't send you this change, and I will pull it out of linux-next.

Eric

2022-01-05 04:25:37

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 01/10] exit/s390: Remove dead reference to do_exit from copy_thread

On Sun, Dec 12, 2021 at 06:48:56PM +0100, Heiko Carstens wrote:
> On Wed, Dec 08, 2021 at 02:25:23PM -0600, Eric W. Biederman wrote:
> > My s390 assembly is not particularly good so I have read the history
> > of the reference to do_exit copy_thread and have been able to
> > verify that do_exit is not used.
> >
> > The general argument is that s390 has been changed to use the generic
> > kernel_thread and kernel_execve and the generic versions do not call
> > do_exit. So it is strange to see a do_exit reference sitting there.
> >
> > The history of the do_exit reference in s390's version of copy_thread
> > seems conclusive that the do_exit reference is something that lingers
> > and should have been removed several years ago.
> ...
> > Remove this dead reference to do_exit to make it clear that s390 is
> > not doing anything with do_exit in copy_thread.
> >
> > Signed-off-by: "Eric W. Biederman" <[email protected]>
> > ---
> > arch/s390/kernel/process.c | 1 -
> > 1 file changed, 1 deletion(-)
>
> Applied to s390 tree. Just in case you want to apply this to your tree too:
> Acked-by: Heiko Carstens <[email protected]>

FWIW, this
frame->childregs.psw.addr =
(unsigned long)__ret_from_fork;
is also pointless. We do want psw.mask (if nothing else, __ret_from_fork()
that is called by ret_from_fork() will, in effect, check user_mode(task_pt_regs())).
But psw.addr is, AFAICS, pointless - the only way the callback is allowed to
return is after successful kernel_execve(), which would set psw.addr; moreover,
psw.addr is meaningless until that happens.

2022-01-05 05:01:48

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 02/10] exit: Add and use make_task_dead.

On Wed, Dec 08, 2021 at 02:25:24PM -0600, Eric W. Biederman wrote:
> There are two big uses of do_exit. The first is its designed use to be
> the guts of the exit(2) system call. The second use is to terminate
> a task after something catastrophic has happened like a NULL pointer
> in kernel code.
>
> Add a function make_task_dead that is initially exactly the same as
> do_exit to cover the cases where do_exit is called to handle
> catastrophic failure. In time this can probably be reduced to just a
> light wrapper around do_task_dead. For now keep it exactly the same so
> that there will be no behavioral differences introducing this new
> concept.
>
> Replace all of the uses of do_exit that use it for catastrophic
> task cleanup with make_task_dead to make it clear what the code
> is doing.
>
> As part of this rename rewind_stack_do_exit to
> rewind_stack_and_make_dead.

Umm... What about .Linvalid_mask: in arch/xtensa/kernel/entry.S?
That's an obvious case for your make_task_dead().

2022-01-05 05:48:12

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead

On Wed, Dec 08, 2021 at 02:25:25PM -0600, Eric W. Biederman wrote:
> The beginning of do_exit has become cluttered and difficult to read as
> it is filled with checks to handle things that can only happen when
> the kernel is operating improperly.
>
> Now that we have a dedicated function for cleaning up a task when the
> kernel is operating improperly move the checks there.

Umm... I would probably take profile_task_exit() crap out before that
point.
1) the damn thing is dead - nothing registers notifiers there
2) blocking_notifier_call_chain() is not a nice thing to do on oops...

I'll post a patch ripping the dead parts of kernel/profile.c out tomorrow
morning (there's also profile_handoff_task(), equally useless these days
and complicating things for __put_task_struct()).

> - /*
> - * If do_exit is called because this processes oopsed, it's possible
> - * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
> - * continuing. Amongst other possible reasons, this is to prevent
> - * mm_release()->clear_child_tid() from writing to a user-controlled
> - * kernel address.
> - */
> - force_uaccess_begin();

Are you sure about that one? It shouldn't matter, but... it's a potential
change for do_exit() from a kernel thread. As it is, we have that
force_uaccess_begin() for exiting threads and for kernel ones it's not
a no-op. I'm not concerned about attempted userland access after that
point for those, obviously, but I'm not sure you won't step into something
subtle here.

I would prefer to split that particular change off into a separate commit...

2022-01-05 05:58:42

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 04/10] exit: Stop poorly open coding do_task_dead in make_task_dead

On Wed, Dec 08, 2021 at 02:25:26PM -0600, Eric W. Biederman wrote:
> When the kernel detects it is oopsing or otherwise force killing a task
> while it exits, the code poorly attempts to permanently stop the task
> from scheduling.
>
> I say poorly because it is possible for a task in TASK_UNINTERRUPTIBLE
> to be woken up.
>
> As it makes no sense for the task to continue call do_task_dead
> instead which actually does the work and permanently removes the task
> from the scheduler. Guaranteeing the task will never be woken
> up again.

NAK. This is not all do_task_dead() leads to - see what finish_task_switch()
does upon seeing TASK_DEAD:
/* Task is done with its stack. */
put_task_stack(prev);
put_task_struct_rcu_user(prev);


Now take a look at the comment just before that check for PF_EXITING -
the point is to leave the task leaked, rather than proceeding with
freeing the sucker.

We are not going through the normal "turn zombie" motions, including
waking wait(2) callers up, etc. Going ahead and freeing it could
fuck the things up quite badly.
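
As a rough illustration of the objection, here is a toy userspace model.
It is only a sketch: the dead_model_* names and the plain integer
reference counts are hypothetical stand-ins, and only the two put calls
quoted above are represented.

```c
#include <assert.h>

enum dead_model_state {
	DEAD_MODEL_RUNNING,
	DEAD_MODEL_UNINTERRUPTIBLE,
	DEAD_MODEL_DEAD,
};

struct dead_model_task {
	enum dead_model_state state;
	int stack_refs;		/* reference dropped by put_task_stack() */
	int struct_refs;	/* reference dropped by put_task_struct_rcu_user() */
};

/* Toy version of the TASK_DEAD handling quoted from finish_task_switch(). */
void dead_model_finish_task_switch(struct dead_model_task *prev)
{
	assert(prev);
	if (prev->state == DEAD_MODEL_DEAD) {
		prev->stack_refs--;	/* put_task_stack(prev) */
		prev->struct_refs--;	/* put_task_struct_rcu_user(prev) */
	}
}

/* Existing behaviour: sleep uninterruptibly and schedule away. */
void dead_model_old_path(struct dead_model_task *tsk)
{
	tsk->state = DEAD_MODEL_UNINTERRUPTIBLE;
	dead_model_finish_task_switch(tsk);
}

/* Proposed behaviour: do_task_dead() marks the task dead first. */
void dead_model_new_path(struct dead_model_task *tsk)
{
	tsk->state = DEAD_MODEL_DEAD;
	dead_model_finish_task_switch(tsk);
}
```

With both counts starting at 1, the old path leaves them untouched (the
task stays leaked), while the new path drops both, which is the freeing
the NAK is about.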

> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> kernel/exit.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index d0ec6f6b41cb..f975cd8a2ed8 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -886,8 +886,7 @@ void __noreturn make_task_dead(int signr)
> if (unlikely(tsk->flags & PF_EXITING)) {
> pr_alert("Fixing recursive fault but reboot is needed!\n");
> futex_exit_recursive(tsk);
> - set_current_state(TASK_UNINTERRUPTIBLE);
> - schedule();
> + do_task_dead();
> }
>
> do_exit(signr);
> --
> 2.29.2
>

2022-01-05 06:02:29

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 05/10] exit: Stop exporting do_exit

On Wed, Dec 08, 2021 at 02:25:27PM -0600, Eric W. Biederman wrote:
> Now that there are no more modular uses of do_exit remove the EXPORT_SYMBOL.
>
> Suggested-by: Christoph Hellwig <[email protected]>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> kernel/exit.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index f975cd8a2ed8..57afac845a0a 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -843,7 +843,6 @@ void __noreturn do_exit(long code)
> lockdep_free_task(tsk);
> do_task_dead();
> }
> -EXPORT_SYMBOL_GPL(do_exit);

"Now" in the commit message is misleading, AFAICS - there are no such users
in the mainline right now (and yes, that one could be moved all the way
up).

2022-01-05 20:02:38

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Dmitry Osipenko <[email protected]> writes:

> 14.12.2021 01:53, Eric W. Biederman wrote:
>> Simplify the code that allows SIGKILL during coredumps to terminate
>> the coredump. As far as I can tell I have avoided breaking it
>> by dumb luck.
>>
>> Historically with all of the other threads stopping in exit_mm the
>> wants_signal loop in complete_signal would find the dumper task and
>> then complete_signal would wake the dumper task with signal_wake_up.
>>
>> After moving the coredump_task_exit above the setting of PF_EXITING in
>> commit 92307383082d ("coredump: Don't perform any cleanups before
>> dumping core") wants_signal will consider all of the threads in a
>> multi-threaded process for waking up, not just the core dumping task.
>>
>> Luckily complete_signal short circuits SIGKILL during a coredump: it marks
>> every thread with SIGKILL and calls signal_wake_up. This code is arguably
>> buggy however, as it tries to skip creating a group exit when one is already
>> present, and it fails to notice that a coredump is in progress.
>>
>> Ever since commit 06af8679449d ("coredump: Limit what can interrupt
>> coredumps") was added dump_interrupted needs not just TIF_SIGPENDING
>> set on the dumper task but also SIGKILL set in its pending bitmap.
>> This means that if the code is ever fixed not to short-circuit and
>> kill a process after it has already been killed the special case
>> for SIGKILL during a coredump will be broken.
>>
>> Sort all of this out by making the coredump special case more special,
>> and perform all of the work in prepare_signal and leave the rest of
>> the signal delivery path out of it.
>>
>> In prepare_signal when the process coredumping is sent SIGKILL find
>> the task performing the coredump and use sigaddset and signal_wake_up
>> to ensure that task reports fatal_signal_pending.
>>
>> Return false from prepare_signal to tell the rest of the signal
>> delivery path to ignore the signal.
>>
>> Update wait_for_dump_helpers to perform a wait_event_killable wait
>> so that if signal_pending gets set spuriously the wait will not
>> be interrupted unless fatal_signal_pending is true.
>>
>> I have tested this and verified I did not break SIGKILL during
>> coredumps by accident (before or after this change). I actually
>> thought I had and I had to figure out what I had misread that kept
>> SIGKILL during coredumps working.
>>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> fs/coredump.c | 4 ++--
>> kernel/signal.c | 11 +++++++++--
>> 2 files changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/coredump.c b/fs/coredump.c
>> index a6b3c196cdef..7b91fb32dbb8 100644
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -448,7 +448,7 @@ static void coredump_finish(bool core_dumped)
>> static bool dump_interrupted(void)
>> {
>> /*
>> - * SIGKILL or freezing() interrupt the coredumping. Perhaps we
>> + * SIGKILL or freezing() interrupted the coredumping. Perhaps we
>> * can do try_to_freeze() and check __fatal_signal_pending(),
>> * but then we need to teach dump_write() to restart and clear
>> * TIF_SIGPENDING.
>> @@ -471,7 +471,7 @@ static void wait_for_dump_helpers(struct file *file)
>> * We actually want wait_event_freezable() but then we need
>> * to clear TIF_SIGPENDING and improve dump_interrupted().
>> */
>> - wait_event_interruptible(pipe->rd_wait, pipe->readers == 1);
>> + wait_event_killable(pipe->rd_wait, pipe->readers == 1);
>>
>> pipe_lock(pipe);
>> pipe->readers--;
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 8272cac5f429..7e305a8ec7c2 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -907,8 +907,15 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
>> sigset_t flush;
>>
>> if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
>> - if (!(signal->flags & SIGNAL_GROUP_EXIT))
>> - return sig == SIGKILL;
>> + struct core_state *core_state = signal->core_state;
>> + if (core_state) {
>> + if (sig == SIGKILL) {
>> + struct task_struct *dumper = core_state->dumper.task;
>> + sigaddset(&dumper->pending.signal, SIGKILL);
>> + signal_wake_up(dumper, 1);
>> + }
>> + return false;
>> + }
>> /*
>> * The process is in the middle of dying, nothing to do.
>> */
>>
>
> Hi,
>
> This patch breaks userspace, in particular it breaks gst-plugin-scanner
> of GStreamer which hangs now on next-20211224. IIUC, this tool builds a
> registry of good/working GStreamer plugins by loading them and
> blacklisting those that don't work (crash). Before the hang I see
> systemd-coredump process running, taking snapshot of gst-plugin-scanner
> and then gst-plugin-scanner gets stuck.
>
> Bisection points at this patch, reverting it restores
> gst-plugin-scanner. Systemd-coredump still running, but there is no hang
> anymore and everything works properly as before.
>
> I'm seeing this problem on ARM32 and haven't checked other arches.
> Please fix, thanks in advance.


I have not yet been able to figure out how to run gst-plugin-scanner in
a way that triggers this. In truth I can't figure out how to
run gst-plugin-scanner in a useful way.

I am going to set up some unit tests and see if I can reproduce your
hang another way, but if you could give me some more information on what
you are doing to trigger this I would appreciate it.

Eric


2022-01-05 20:46:28

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 02/10] exit: Add and use make_task_dead.

Al Viro <[email protected]> writes:

> On Wed, Dec 08, 2021 at 02:25:24PM -0600, Eric W. Biederman wrote:
>> There are two big uses of do_exit. The first is its designed use to be
>> the guts of the exit(2) system call. The second use is to terminate
>> a task after something catastrophic has happened like a NULL pointer
>> in kernel code.
>>
>> Add a function make_task_dead that is initially exactly the same as
>> do_exit to cover the cases where do_exit is called to handle
>> catastrophic failure. In time this can probably be reduced to just a
>> light wrapper around do_task_dead. For now keep it exactly the same so
>> that there will be no behavioral differences introducing this new
>> concept.
>>
>> Replace all of the uses of do_exit that use it for catastrophic
>> task cleanup with make_task_dead to make it clear what the code
>> is doing.
>>
>> As part of this rename rewind_stack_do_exit to
>> rewind_stack_and_make_dead.
>
> Umm... What about .Linvalid_mask: in arch/xtensa/kernel/entry.S?
> That's an obvious case for your make_task_dead().

Good catch.

Being in assembly it did not have anything after the name do_exit so it
hid from my regex "[^A-Za-z0-9_]do_exit[^A-Za-z0-9]". Thank you for
finding that.
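
Why the quoted regex misses a line that ends in do_exit can be checked
with a small POSIX regex sketch. The sample input lines used with it are
made up for illustration, relaxed_pattern is a hypothetical alternative
and not something from this thread, and line_matches() is just a helper
name.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* The pattern quoted above: it needs one more character after the name. */
const char *const strict_pattern = "[^A-Za-z0-9_]do_exit[^A-Za-z0-9]";

/* Hypothetical relaxed variant: start or end of line also delimits the name. */
const char *const relaxed_pattern =
	"(^|[^A-Za-z0-9_])do_exit($|[^A-Za-z0-9_])";

/* Return true if the POSIX extended regex matches anywhere in the line. */
bool line_matches(const char *pattern, const char *line)
{
	regex_t re;
	bool hit;

	assert(pattern && line);
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return false;	/* treat an invalid pattern as "no match" */
	hit = regexec(&re, line, 0, NULL, 0) == 0;
	regfree(&re);
	return hit;
}
```

A C call such as "\tdo_exit(SIGKILL);" has a "(" after the name and
matches strict_pattern, while an assembly-style line such as
"\tcall4\tdo_exit" has nothing after the name and only matches
relaxed_pattern.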

Skimming the surrounding code it looks like Linvalid_mask can only be
reached by buggy hardware or buggy kernel code. If userspace could
trigger the condition it would be a candidate for force_exit_sig.

I am a bit puzzled why die is not called, instead of die being
handrolled there.

xtensa folks any thoughts?

If not I will queue up a minimal patch to replace do_exit with
make_task_dead.

Eric


2022-01-05 21:39:16

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

05.01.2022 22:58, Eric W. Biederman wrote:
> Dmitry Osipenko <[email protected]> writes:
>
>> 14.12.2021 01:53, Eric W. Biederman wrote:
>>> Simplify the code that allows SIGKILL during coredumps to terminate
>>> the coredump. As far as I can tell I have avoided breaking it
>>> by dumb luck.
>>>
>>> Historically with all of the other threads stopping in exit_mm the
>>> wants_signal loop in complete_signal would find the dumper task and
>>> then complete_signal would wake the dumper task with signal_wake_up.
>>>
>>> After moving the coredump_task_exit above the setting of PF_EXITING in
>>> commit 92307383082d ("coredump: Don't perform any cleanups before
>>> dumping core") wants_signal will consider all of the threads in a
>>> multi-threaded process for waking up, not just the core dumping task.
>>>
>>> Luckily complete_signal short circuits SIGKILL during a coredump and
>>> marks every thread with SIGKILL and signal_wake_up. This code is arguably
>>> buggy, however, as it tries to skip creating a group exit when one is
>>> already present, and it fails to notice that a coredump is in progress.
>>>
>>> Ever since commit 06af8679449d ("coredump: Limit what can interrupt
>>> coredumps") was added dump_interrupted needs not just TIF_SIGPENDING
>>> set on the dumper task but also SIGKILL set in its pending bitmap.
>>> This means that if the code is ever fixed not to short-circuit and
>>> kill a process after it has already been killed the special case
>>> for SIGKILL during a coredump will be broken.
>>>
>>> Sort all of this out by making the coredump special case more special,
>>> and perform all of the work in prepare_signal and leave the rest of
>>> the signal delivery path out of it.
>>>
>>> In prepare_signal when the process coredumping is sent SIGKILL find
>>> the task performing the coredump and use sigaddset and signal_wake_up
>>> to ensure that task reports fatal_signal_pending.
>>>
>>> Return false from prepare_signal to tell the rest of the signal
>>> delivery path to ignore the signal.
>>>
>>> Update wait_for_dump_helpers to perform a wait_event_killable wait
>>> so that if signal_pending gets set spuriously the wait will not
>>> be interrupted unless fatal_signal_pending is true.
>>>
>>> I have tested this and verified I did not break SIGKILL during
>>> coredumps by accident (before or after this change). I actually
>>> thought I had and I had to figure out what I had misread that kept
>>> SIGKILL during coredumps working.
>>>
>>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>>> ---
>>> fs/coredump.c | 4 ++--
>>> kernel/signal.c | 11 +++++++++--
>>> 2 files changed, 11 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>> index a6b3c196cdef..7b91fb32dbb8 100644
>>> --- a/fs/coredump.c
>>> +++ b/fs/coredump.c
>>> @@ -448,7 +448,7 @@ static void coredump_finish(bool core_dumped)
>>> static bool dump_interrupted(void)
>>> {
>>> /*
>>> - * SIGKILL or freezing() interrupt the coredumping. Perhaps we
>>> + * SIGKILL or freezing() interrupted the coredumping. Perhaps we
>>> * can do try_to_freeze() and check __fatal_signal_pending(),
>>> * but then we need to teach dump_write() to restart and clear
>>> * TIF_SIGPENDING.
>>> @@ -471,7 +471,7 @@ static void wait_for_dump_helpers(struct file *file)
>>> * We actually want wait_event_freezable() but then we need
>>> * to clear TIF_SIGPENDING and improve dump_interrupted().
>>> */
>>> - wait_event_interruptible(pipe->rd_wait, pipe->readers == 1);
>>> + wait_event_killable(pipe->rd_wait, pipe->readers == 1);
>>>
>>> pipe_lock(pipe);
>>> pipe->readers--;
>>> diff --git a/kernel/signal.c b/kernel/signal.c
>>> index 8272cac5f429..7e305a8ec7c2 100644
>>> --- a/kernel/signal.c
>>> +++ b/kernel/signal.c
>>> @@ -907,8 +907,15 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
>>> sigset_t flush;
>>>
>>> if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
>>> - if (!(signal->flags & SIGNAL_GROUP_EXIT))
>>> - return sig == SIGKILL;
>>> + struct core_state *core_state = signal->core_state;
>>> + if (core_state) {
>>> + if (sig == SIGKILL) {
>>> + struct task_struct *dumper = core_state->dumper.task;
>>> + sigaddset(&dumper->pending.signal, SIGKILL);
>>> + signal_wake_up(dumper, 1);
>>> + }
>>> + return false;
>>> + }
>>> /*
>>> * The process is in the middle of dying, nothing to do.
>>> */
>>>
>>
>> Hi,
>>
>> This patch breaks userspace, in particular it breaks gst-plugin-scanner
>> of GStreamer which hangs now on next-20211224. IIUC, this tool builds a
>> registry of good/working GStreamer plugins by loading them and
>> blacklisting those that don't work (crash). Before the hang I see
>> systemd-coredump process running, taking snapshot of gst-plugin-scanner
>> and then gst-plugin-scanner gets stuck.
>>
>> Bisection points at this patch, reverting it restores
>> gst-plugin-scanner. Systemd-coredump is still running, but there is no hang
>> anymore and everything works properly as before.
>>
>> I'm seeing this problem on ARM32 and haven't checked other arches.
>> Please fix, thanks in advance.
>
>
> I have not yet been able to figure out how to run gst-plugin-scanner in
> a way that triggers this. In truth I can't figure out how to
> run gst-plugin-scanner in a useful way.
>
> I am going to set up some unit tests and see if I can reproduce your
> hang another way, but if you could give me some more information on what
> you are doing to trigger this I would appreciate it.

Thanks, Eric. The distro is Arch Linux, but it's a development
environment where I'm running latest GStreamer from git master. I'll try
to figure out the reproduction steps and get back to you.

2022-01-05 21:53:42

by Al Viro

Subject: Re: [PATCH 02/10] exit: Add and use make_task_dead.

On Wed, Jan 05, 2022 at 02:46:10PM -0600, Eric W. Biederman wrote:
> Al Viro <[email protected]> writes:
>
> > On Wed, Dec 08, 2021 at 02:25:24PM -0600, Eric W. Biederman wrote:
> >> There are two big uses of do_exit. The first is its designed use as
> >> the guts of the exit(2) system call. The second use is to terminate
> >> a task after something catastrophic has happened like a NULL pointer
> >> dereference in kernel code.
> >>
> >> Add a function make_task_dead that is initially exactly the same as
> >> do_exit to cover the cases where do_exit is called to handle
> >> catastrophic failure. In time this can probably be reduced to just a
> >> light wrapper around do_task_dead. For now keep it exactly the same so
> >> that there will be no behavioral differences introducing this new
> >> concept.
> >>
> >> Replace all of the uses of do_exit that use it for catastrophic
> >> task cleanup with make_task_dead to make it clear what the code
> >> is doing.
> >>
> >> As part of this rename rewind_stack_do_exit to
> >> rewind_stack_and_make_dead.
> >
> > Umm... What about .Linvalid_mask: in arch/xtensa/kernel/entry.S?
> > That's an obvious case for your make_task_dead().
>
> Good catch.
>
> Being in assembly it did not have anything after the name do_exit so it
> hid from my regex "[^A-Za-z0-9_]do_exit[^A-Za-z0-9]". Thank you for
> finding that.

Umm... What's wrong with '\<do_exit\>'? Difference in catch:

missed 6
Documentation/trace/kprobes.rst:596:do_exit() case covered. do_execve() and do_fork() are not an issue.
arch/x86/entry/entry_32.S:1258: call do_exit
arch/x86/entry/entry_64.S:1440: call do_exit
arch/xtensa/kernel/entry.S:1436: abi_call do_exit
samples/bpf/test_cgrp2_tc.sh:114:do_exit() {
tools/testing/selftests/ftrace/test.d/kprobe/kprobe_multiprobe.tc:8:SYM2=do_exit

extra 3
arch/powerpc/mm/book3s64/radix_tlb.c:815:static void do_exit_flush_lazy_tlb(void *arg)
arch/powerpc/mm/book3s64/radix_tlb.c:830: smp_call_function_many(mm_cpumask(mm), do_exit_flush_lazy_tlb,
tools/perf/ui/browsers/hists.c:2847: act->fn = do_exit_browser;

Extra catch clearly contains nothing of interest (assuming it's not a result of a typo
in your regex in the first place - you seem to have omitted _ from the second set, and if
you add that back, these 3 hits go away). And missed 6... 3 are outside of the kernel
source proper, and the rest are all genuine. You've caught x86 ones (inside the
rewind_stack_do_exit variants) and missed the xtensa one...

\< and \> are GNUisms, but both git grep and grep (both on Linux and FreeBSD, at least)
handle them... Or use \bdo_exit\b, for that matter (Perlism instead of GNUism, matching
both the beginnings and ends of words)...
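
The difference between the hand-rolled pattern and the word-boundary form can be reproduced outside of grep. The following is only a sketch: it uses Python's re module as a stand-in for grep (an assumption; the thread itself only uses grep and git grep), and Python has no \< and \>, so \b is used instead. The sample lines are modelled on the hits listed above.

```python
import re

# Hand-rolled pattern from the thread: it demands one character on each
# side of do_exit, and the trailing class omits "_".
HANDROLLED = re.compile(r"[^A-Za-z0-9_]do_exit[^A-Za-z0-9]")
# Word-boundary form (\b is the Perl-style spelling of \< and \>).
WORD = re.compile(r"\bdo_exit\b")

LINES = [
    "\tabi_call\tdo_exit",                            # nothing follows the name
    "static void do_exit_flush_lazy_tlb(void *arg)",  # a different identifier
    "void __noreturn do_exit(long code)",             # ordinary C use
]


def hits(pattern, lines):
    """Return the lines in which the pattern matches anywhere."""
    return [line for line in lines if pattern.search(line)]


handrolled_hits = hits(HANDROLLED, LINES)
word_hits = hits(WORD, LINES)

for line in LINES:
    print(line in handrolled_hits, line in word_hits, line.strip())
```

In this sketch the hand-rolled pattern skips the assembly-style line because no character follows do_exit, and it accepts do_exit_flush_lazy_tlb because "_" is missing from the trailing class; the word-boundary form handles both cases.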

2022-01-05 22:33:34

by Eric W. Biederman

Subject: Re: [PATCH 04/10] exit: Stop poorly open coding do_task_dead in make_task_dead

Al Viro <[email protected]> writes:

> On Wed, Dec 08, 2021 at 02:25:26PM -0600, Eric W. Biederman wrote:
>> When the kernel detects it is oopsing or otherwise force killing a task
>> while it exits, the code poorly attempts to permanently stop the task
>> from scheduling.
>>
>> I say poorly because it is possible for a task in TASK_UNINTERRUPTIBLE
>> to be woken up.
>>
>> As it makes no sense for the task to continue, call do_task_dead
>> instead, which actually does the work and permanently removes the task
>> from the scheduler, guaranteeing the task will never be woken
>> up again.
>
> NAK. This is not all do_task_dead() leads to - see what finish_task_switch()
> does upon seeing TASK_DEAD:
> /* Task is done with its stack. */
> put_task_stack(prev);
> put_task_struct_rcu_user(prev);
>
>
> Now take a look at the comment just before that check for PF_EXITING -
> the point is to leave the task leaked, rather than proceeding with
> freeing the sucker.
>
> We are not going through the normal "turn zombie" motions, including
> waking wait(2) callers up, etc. Going ahead and freeing it could
> fuck the things up quite badly.

I believe I was thinking this task won't be reaped because release_task
can never be called. Which I admit depending on where we oops in
do_exit is not strictly true.

We can guarantee the leak with:

tsk->exit_state = EXIT_DEAD;
refcount_inc(&tsk->rcu_users);


It just feels wrong to me to have something dead and broken sticking around
the scheduler queue. Especially as something could come along and wake
it up and then what do we do.

Hmm. I think we want that tsk->exit_state = EXIT_DEAD regardless to
prevent it from being reaped and possibly causing more harm.

Eric

2022-01-05 22:36:24

by Eric W. Biederman

Subject: Re: [PATCH 05/10] exit: Stop exporting do_exit

Al Viro <[email protected]> writes:

> On Wed, Dec 08, 2021 at 02:25:27PM -0600, Eric W. Biederman wrote:
>> Now that there are no more modular uses of do_exit remove the EXPORT_SYMBOL.
>>
>> Suggested-by: Christoph Hellwig <[email protected]>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> kernel/exit.c | 1 -
>> 1 file changed, 1 deletion(-)
>>
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index f975cd8a2ed8..57afac845a0a 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -843,7 +843,6 @@ void __noreturn do_exit(long code)
>> lockdep_free_task(tsk);
>> do_task_dead();
>> }
>> -EXPORT_SYMBOL_GPL(do_exit);
>
> "Now" in the commit message is misleading, AFAICS - there's no such users
> in the mainline right now (and yes, that one could be moved all the way
> up).

Yes. I should have said: now there are few enough users of do_exit
that I can inspect the code and see there are no more modular users.

Or words to that effect.

Because honestly my make_task_dead change got rid of most of the callers
of do_exit.

Eric


2022-01-05 22:53:15

by Linus Torvalds

Subject: Re: [PATCH 02/10] exit: Add and use make_task_dead.

On Wed, Jan 5, 2022 at 1:53 PM Al Viro <[email protected]> wrote:
>
> On Wed, Jan 05, 2022 at 02:46:10PM -0600, Eric W. Biederman wrote:
> >
> > Being in assembly it did not have anything after the name do_exit so it
> > hid from my regex "[^A-Za-z0-9_]do_exit[^A-Za-z0-9]". Thank you for
> > finding that.
>
> Umm... What's wrong with '\<do_exit\>'?

Christ people, you both make it so complicated.

If you want to search for 'do_exit', just do

git grep -w do_exit

where that '-w' does exactly that "word boundary" thing.

I thought everybody knew about this, because it's such a common thing
to do - checking my shell history, more than a third of my "git grep"
uses use '-w', exactly because it's very convenient for identifier
lookup

But yes, in more complex cases where you have other parts to the
pattern (ie you're not looking *just* for a single word), by all means
use '\<' and/or '\>'.

Linus

2022-01-05 23:35:14

by Al Viro

Subject: Re: [PATCH 02/10] exit: Add and use make_task_dead.

On Wed, Jan 05, 2022 at 02:51:05PM -0800, Linus Torvalds wrote:
> On Wed, Jan 5, 2022 at 1:53 PM Al Viro <[email protected]> wrote:
> >
> > On Wed, Jan 05, 2022 at 02:46:10PM -0600, Eric W. Biederman wrote:
> > >
> > > Being in assembly it did not have anything after the name do_exit so it
> > > hid from my regex "[^A-Za-z0-9_]do_exit[^A-Za-z0-9]". Thank you for
> > > finding that.
> >
> > Umm... What's wrong with '\<do_exit\>'?
>
> Christ people, you both make it so complicated.
>
> If you want to search for 'do_exit', just do
>
> git grep -w do_exit
>
> where that '-w' does exactly that "word boundary" thing.

Sure.

> I thought everybody knew about this, because it's such a common thing
> to do - checking my shell history, more than a third of my "git grep"
> uses use '-w', exactly because it's very convenient for identifier
> lookup
>
> But yes, in more complex cases where you have other parts to the
> pattern (ie you're not looking *just* for a single word), by all means
> use '\<' and/or '\>'.

Yep. I wanted to make it clear that you really don't need that kind
of horrors ([^A-Za-z0-9_]); sure, on the ends of regex you just need
-w and that's it, but it's not needed in more convoluted cases either.

BTW, it doesn't have to be "have other parts of pattern" - IME the typical
case when -w is not enough is something like

git grep -n '\<wait_for_completion'
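
To illustrate why -w is not enough for this kind of search: the goal is every identifier that begins with wait_for_completion, so only the left edge is anchored. A small sketch, again using Python's re with \b standing in for \< (an assumption), and with sample call sites chosen purely as examples:

```python
import re

PREFIX = re.compile(r"\bwait_for_completion")     # left edge anchored only
WHOLE = re.compile(r"\bwait_for_completion\b")    # what a -w search looks for

NAMES = [
    "wait_for_completion(&done);",
    "wait_for_completion_timeout(&done, HZ);",
    "wait_for_completion_killable(&done);",
    "try_wait_for_completion(&done);",
]

prefix_hits = [n for n in NAMES if PREFIX.search(n)]
whole_hits = [n for n in NAMES if WHOLE.search(n)]

for name in NAMES:
    print(name in prefix_hits, name in whole_hits, name)
```

The prefix form picks up the whole wait_for_completion family while still rejecting try_wait_for_completion; the whole-word form finds only the bare name.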

2022-01-06 07:08:22

by Al Viro

Subject: Re: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead

On Wed, Jan 05, 2022 at 05:48:08AM +0000, Al Viro wrote:
> On Wed, Dec 08, 2021 at 02:25:25PM -0600, Eric W. Biederman wrote:
> > The beginning of do_exit has become cluttered and difficult to read as
> > it is filled with checks to handle things that can only happen when
> > the kernel is operating improperly.
> >
> > Now that we have a dedicated function for cleaning up a task when the
> > kernel is operating improperly move the checks there.
>
> Umm... I would probably take profile_task_exit() crap out before that
> point.
> 1) the damn thing is dead - nothing registers notifiers there
> 2) blocking_notifier_call_chain() is not a nice thing to do on oops...
>
> I'll post a patch ripping the dead parts of kernel/profile.c out tomorrow
> morning (there's also profile_handoff_task(), equally useless these days
> and complicating things for __put_task_struct()).

Ugh... Forgot to post, sorry.

[PATCH] get rid of dead machinery in kernel/profile.c

Nothing is placed on the call chains in there, now that oprofile is
gone. Remove them, along with the hooks for calling them.

Signed-off-by: Al Viro <[email protected]>
---
diff --git a/include/linux/profile.h b/include/linux/profile.h
index fd18ca96f5574..88dfb0543ea63 100644
--- a/include/linux/profile.h
+++ b/include/linux/profile.h
@@ -63,26 +63,6 @@ static inline void profile_hit(int type, void *ip)
profile_hits(type, ip, 1);
}

-struct task_struct;
-struct mm_struct;
-
-/* task is in do_exit() */
-void profile_task_exit(struct task_struct * task);
-
-/* task is dead, free task struct ? Returns 1 if
- * the task was taken, 0 if the task should be freed.
- */
-int profile_handoff_task(struct task_struct * task);
-
-/* sys_munmap */
-void profile_munmap(unsigned long addr);
-
-int task_handoff_register(struct notifier_block * n);
-int task_handoff_unregister(struct notifier_block * n);
-
-int profile_event_register(enum profile_type, struct notifier_block * n);
-int profile_event_unregister(enum profile_type, struct notifier_block * n);
-
#else

#define prof_on 0
@@ -107,30 +87,6 @@ static inline void profile_hit(int type, void *ip)
return;
}

-static inline int task_handoff_register(struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-static inline int task_handoff_unregister(struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-static inline int profile_event_register(enum profile_type t, struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-static inline int profile_event_unregister(enum profile_type t, struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-#define profile_task_exit(a) do { } while (0)
-#define profile_handoff_task(a) (0)
-#define profile_munmap(a) do { } while (0)
-
#endif /* CONFIG_PROFILING */

#endif /* _LINUX_PROFILE_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index f702a6a63686e..5086a5e9d02de 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -765,7 +765,6 @@ void __noreturn do_exit(long code)
preempt_count_set(PREEMPT_ENABLED);
}

- profile_task_exit(tsk);
kcov_task_exit(tsk);

coredump_task_exit(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index 3244cc56b697d..496c0b6c8cb83 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -754,9 +754,7 @@ void __put_task_struct(struct task_struct *tsk)
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
sched_core_free(tsk);
-
- if (!profile_handoff_task(tsk))
- free_task(tsk);
+ free_task(tsk);
}
EXPORT_SYMBOL_GPL(__put_task_struct);

diff --git a/kernel/profile.c b/kernel/profile.c
index eb9c7f0f5ac52..37640a0bd8a3c 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -133,79 +133,6 @@ int __ref profile_init(void)
return -ENOMEM;
}

-/* Profile event notifications */
-
-static BLOCKING_NOTIFIER_HEAD(task_exit_notifier);
-static ATOMIC_NOTIFIER_HEAD(task_free_notifier);
-static BLOCKING_NOTIFIER_HEAD(munmap_notifier);
-
-void profile_task_exit(struct task_struct *task)
-{
- blocking_notifier_call_chain(&task_exit_notifier, 0, task);
-}
-
-int profile_handoff_task(struct task_struct *task)
-{
- int ret;
- ret = atomic_notifier_call_chain(&task_free_notifier, 0, task);
- return (ret == NOTIFY_OK) ? 1 : 0;
-}
-
-void profile_munmap(unsigned long addr)
-{
- blocking_notifier_call_chain(&munmap_notifier, 0, (void *)addr);
-}
-
-int task_handoff_register(struct notifier_block *n)
-{
- return atomic_notifier_chain_register(&task_free_notifier, n);
-}
-EXPORT_SYMBOL_GPL(task_handoff_register);
-
-int task_handoff_unregister(struct notifier_block *n)
-{
- return atomic_notifier_chain_unregister(&task_free_notifier, n);
-}
-EXPORT_SYMBOL_GPL(task_handoff_unregister);
-
-int profile_event_register(enum profile_type type, struct notifier_block *n)
-{
- int err = -EINVAL;
-
- switch (type) {
- case PROFILE_TASK_EXIT:
- err = blocking_notifier_chain_register(
- &task_exit_notifier, n);
- break;
- case PROFILE_MUNMAP:
- err = blocking_notifier_chain_register(
- &munmap_notifier, n);
- break;
- }
-
- return err;
-}
-EXPORT_SYMBOL_GPL(profile_event_register);
-
-int profile_event_unregister(enum profile_type type, struct notifier_block *n)
-{
- int err = -EINVAL;
-
- switch (type) {
- case PROFILE_TASK_EXIT:
- err = blocking_notifier_chain_unregister(
- &task_exit_notifier, n);
- break;
- case PROFILE_MUNMAP:
- err = blocking_notifier_chain_unregister(
- &munmap_notifier, n);
- break;
- }
-
- return err;
-}
-EXPORT_SYMBOL_GPL(profile_event_unregister);
-
#if defined(CONFIG_SMP) && defined(CONFIG_PROC_FS)
/*
* Each cpu has a pair of open-addressed hashtables for pending
diff --git a/mm/mmap.c b/mm/mmap.c
index bfb0ea164a90a..70318c2a47c39 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2928,7 +2928,6 @@ EXPORT_SYMBOL(vm_munmap);
SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
{
addr = untagged_addr(addr);
- profile_munmap(addr);
return __vm_munmap(addr, len, true);
}


2022-01-07 02:28:10

by Al Viro

Subject: Re: [PATCH 06/10] exit: Implement kthread_exit

On Wed, Dec 08, 2021 at 02:25:28PM -0600, Eric W. Biederman wrote:

> +/**
> + * kthread_exit - Cause the current kthread return @result to kthread_stop().
> + * @result: The integer value to return to kthread_stop().
> + *
> + * While kthread_exit can be called directly, it exists so that
> + * functions which do some additional work in non-modular code such as
> + * module_put_and_kthread_exit can be implemented.
> + *
> + * Does not return.
> + */
> +void __noreturn kthread_exit(long result)
> +{
> + do_exit(result);
> +}

> static int kthread(void *_create)
> {
> static const struct sched_param param = { .sched_priority = 0 };
> @@ -286,13 +301,13 @@ static int kthread(void *_create)
> done = xchg(&create->done, NULL);
> if (!done) {
> kfree(create);
> - do_exit(-EINTR);
> + kthread_exit(-EINTR);

This do_exit(-EINTR) is pure cargo-culting; nobody will see that return
value, since by this point nobody could have looked at the task_struct
or pid. Look: we must have had
* __kthread_create_on_node() called
it has allocated a request (create), filled it, put it on
kthreadd's request list and woke kthreadd up.
* it went to wait (killably) for completion of create->done.
* it got a SIGKILL and had successfully replaced create->done
with NULL.
* since it has gotten non-NULL, it has buggered off (with -EINTR,
incidentally).
In the meanwhile, kthreadd had picked the request from its list and
successfully forked the child. Child had run kthread() up to the point
you'd quoted, i.e. it had observed create->done already NULL, freed create
and is now terminating itself.

Caller of kernel_thread() doesn't pass the pid to anyone, since the pid
must've been positive. kthread() does not store its pid or task_struct
reference anywhere on that path. __kthread_create_on_node() doesn't even
bother looking at anything other than create->done and it returns -EINTR.
Child couldn't be traced by anyone.

So how the hell could anyone look at the value we pass to do_exit() here?
Might as well use do_exit(0)...


> if (!self) {
> create->result = ERR_PTR(-ENOMEM);
> complete(done);
> - do_exit(-ENOMEM);
> + kthread_exit(-ENOMEM);

Ditto. We must have had
* __kthread_create_on_node() called
it has allocated a request (create), filled it, put it on
kthreadd's request list and woke kthreadd.
* it went to wait (killably) for completion of create->done.
* it did *NOT* get SIGKILL until after the child successfully
forked by kthreadd got through that xchg() in kthread().
Either __kthread_create_on_node() hadn't gotten SIGKILL before our call
of complete(), or it has gotten one, observed NULL create->done and simply
proceeded to do wait_for_completion() on the same thing (it's its local
variable, so it knows what create->done used to point to). Either way,
__kthread_create_on_node() hits
task = create->result;
sees that it's ERR_PTR(-ENOMEM) and proceeds to fail with -ENOMEM.
Again, nothing is looking at the pid or task_struct of the child and
nothing could possibly observe its exit code.

> }
>
> self->threadfn = threadfn;
> @@ -326,7 +341,7 @@ static int kthread(void *_create)
> __kthread_parkme(self);
> ret = threadfn(data);
> }
> - do_exit(ret);
> + kthread_exit(ret);

This one, OTOH, is a different story. Here we have already hit the
completion and stopped the child. With __kthread_create_on_node()
having found and returned the task_struct of the child. Since that
point, somebody had already woken the child (using the pointer returned
by __kthread_create_on_node()).

What's more, the child's payload has already run to completion.
*NORMALLY* that means kthread_stop() called on it. And there we do the
following bit of nastiness:
get_task_struct(k);
...
mark it "should stop" and wake it up
...
wait_for_completion(&kthread->exited);
ret = k->exit_code;
put_task_struct(k);

kthread->exited is what k->vfork_done is left pointing to, so this
wait_for_completion() waits for that do_exit() in the kthread() and
we proceed to pick k->exit_code (using the fact that k has just
been pinned by us). Then kthread_stop() proceeds to return that
value.

Pardon me while I puke. The value being returned has nothing to do
with the things one could normally find in ->exit_code, for starters.
What's more, kthread->exited is a part of per-thread data structure,
and that same structure could bloody well be used to pass the damn
return value of threadfn(), instead of doing unnatural things with
->exit_code. And that data structure is not freed until free_task(k),
i.e. we can fetch from it whenever we can fetch k->exit_code.

Oh, and all other ways to stop the thread do not bother looking at
the exit code at all.

IMO the right way to handle that would be
1) turn these two do_exit() into do_exit(0), to reduce
confusion
2) deal with all do_exit() in kthread payloads. Your
name for the primitive is fine, IMO.
3) make that primitive pass the return value by way of
a field in struct kthread, adjusting kthread_stop() accordingly
and passing 0 to do_exit() in kthread_exit() itself.
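
The shape of (3) can be sketched with a userspace toy model. This is not kernel code: it uses Python threads, and the names Kthread, kthread_run, kthread_body and payload are hypothetical, chosen only to mirror the kernel names for readability. The point it illustrates is that the payload's return value travels through a field of the per-thread structure, and the stopper reads it from there after waiting on the "exited" completion, rather than from a generic exit code.

```python
import threading


class Kthread:
    """Toy stand-in for struct kthread: per-thread data that outlives the payload."""

    def __init__(self, threadfn):
        self.threadfn = threadfn
        self.result = None                     # hypothetical result field, as in (3)
        self.should_stop = threading.Event()   # models the "should stop" flag
        self.exited = threading.Event()        # models the kthread->exited completion


def kthread_body(kt):
    # Models the tail of kthread(): run the payload, publish its return
    # value in the per-thread structure, then signal completion.
    kt.result = kt.threadfn(kt)
    kt.exited.set()


def kthread_run(threadfn):
    kt = Kthread(threadfn)
    threading.Thread(target=kthread_body, args=(kt,), daemon=True).start()
    return kt


def kthread_stop(kt):
    # Ask the payload to stop, wait for it to finish, return the published value.
    kt.should_stop.set()
    kt.exited.wait()
    return kt.result


def payload(kt):
    # A payload that idles until asked to stop, then reports a value.
    kt.should_stop.wait()
    return 42


stop_value = kthread_stop(kthread_run(payload))
print(stop_value)
```

In this model nothing resembling ->exit_code is consulted; the early failure paths discussed above would simply never publish a result, which matches the suggestion to have them call do_exit(0).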

(2) is not as trivial as you seem to hope, though. Your patches
in drivers/staging/rt*/ had papered over the problem in there,
but hadn't really solved it.

thread_exit() should've been shot, all right, but it really ought
to have been complete_and_exit() there. The thing is, complete()
+ return does *not* guarantee that driver won't get unloaded before
the thread terminates. Possibly freeing its .code and leaving
a thread to resume running in there as soon as it regains CPU.

The point of complete_and_exit() is that it's noreturn *and* in
core kernel. So it can be safely used in a modular kthread,
if paired with wait_for_completion() in or before module_exit.
complete() + do_exit() (or complete + return as you've gotten
there) doesn't give such guarantees at all.

I'm (re)crawling through that zoo right now, will post when
I get more details.

2022-01-07 03:22:18

by Al Viro

Subject: Re: [PATCH 10/10] exit/kthread: Move the exit code for kernel threads into struct kthread

On Wed, Dec 08, 2021 at 02:25:32PM -0600, Eric W. Biederman wrote:
> The exit code of kernel threads has different semantics than the
> exit_code of userspace tasks. To avoid confusion and allow
> the userspace implementation to change as needed move
> the kernel thread exit code into struct kthread.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> kernel/kthread.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 8e5f44bed027..9c6c532047c4 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -52,6 +52,7 @@ struct kthread_create_info
> struct kthread {
> unsigned long flags;
> unsigned int cpu;
> + int result;
> int (*threadfn)(void *);
> void *data;
> mm_segment_t oldfs;
> @@ -287,7 +288,9 @@ EXPORT_SYMBOL_GPL(kthread_parkme);
> */
> void __noreturn kthread_exit(long result)
> {
> - do_exit(result);
> + struct kthread *kthread = to_kthread(current);
> + kthread->result = result;
> + do_exit(0);
> }
>
> /**
> @@ -679,7 +682,7 @@ int kthread_stop(struct task_struct *k)
> kthread_unpark(k);
> wake_up_process(k);
> wait_for_completion(&kthread->exited);
> - ret = k->exit_code;
> + ret = kthread->result;
> put_task_struct(k);
>
> trace_sched_kthread_stop_ret(ret);

Fine, except that you've turned the first two do_exit() in kthread() into
calls of kthread_exit(). If they are hit, you are screwed, especially
the second one - there you have an allocation failure for struct kthread,
so this will instantly oops on attempt to store into ->result.

See reply to your 6/10 regarding the difference between the last
call of do_exit() in kthread() and the first two of them. They
(the first two) should be simply do_exit(0); transmission of error
value happens differently and not in direction of kthread_stop().

2022-01-07 03:43:05

by Al Viro

Subject: Re: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead

On Wed, Jan 05, 2022 at 05:48:08AM +0000, Al Viro wrote:
> On Wed, Dec 08, 2021 at 02:25:25PM -0600, Eric W. Biederman wrote:
> > The beginning of do_exit has become cluttered and difficult to read as
> > it is filled with checks to handle things that can only happen when
> > the kernel is operating improperly.
> >
> > Now that we have a dedicated function for cleaning up a task when the
> > kernel is operating improperly move the checks there.
>
> Umm... I would probably take profile_task_exit() crap out before that
> point.
> 1) the damn thing is dead - nothing registers notifiers there
> 2) blocking_notifier_call_chain() is not a nice thing to do on oops...
>
> I'll post a patch ripping the dead parts of kernel/profile.c out tomorrow
> morning (there's also profile_handoff_task(), equally useless these days
> and complicating things for __put_task_struct()).

Argh... OK, so your subsequent series had pretty much the same thing.
My apologies - still digging myself out from mail pile that had accumulated
over two months ;-/

2022-01-07 03:48:22

by Al Viro

Subject: Re: [PATCH 01/17] exit: Remove profile_task_exit & profile_munmap

On Mon, Jan 03, 2022 at 03:32:56PM -0600, Eric W. Biederman wrote:
> When I say remove I mean remove. All profile_task_exit and
> profile_munmap do is call a blocking notifier chain. The helpers
> profile_event_register and profile_event_unregister are not called
> anywhere in the tree. Which means this is all dead code.
>
> So remove the dead code and make it easier to read do_exit.

How about doing the same to profile_handoff_task() and
task_handoff_register()/task_handoff_unregister(),
while we are at it? Combined diff would be this:

diff --git a/include/linux/profile.h b/include/linux/profile.h
index fd18ca96f5574..6aa64730298a0 100644
--- a/include/linux/profile.h
+++ b/include/linux/profile.h
@@ -31,11 +31,6 @@ static inline int create_proc_profile(void)
}
#endif

-enum profile_type {
- PROFILE_TASK_EXIT,
- PROFILE_MUNMAP
-};
-
#ifdef CONFIG_PROFILING

extern int prof_on __read_mostly;
@@ -63,26 +58,6 @@ static inline void profile_hit(int type, void *ip)
profile_hits(type, ip, 1);
}

-struct task_struct;
-struct mm_struct;
-
-/* task is in do_exit() */
-void profile_task_exit(struct task_struct * task);
-
-/* task is dead, free task struct ? Returns 1 if
- * the task was taken, 0 if the task should be freed.
- */
-int profile_handoff_task(struct task_struct * task);
-
-/* sys_munmap */
-void profile_munmap(unsigned long addr);
-
-int task_handoff_register(struct notifier_block * n);
-int task_handoff_unregister(struct notifier_block * n);
-
-int profile_event_register(enum profile_type, struct notifier_block * n);
-int profile_event_unregister(enum profile_type, struct notifier_block * n);
-
#else

#define prof_on 0
@@ -107,30 +82,6 @@ static inline void profile_hit(int type, void *ip)
return;
}

-static inline int task_handoff_register(struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-static inline int task_handoff_unregister(struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-static inline int profile_event_register(enum profile_type t, struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-static inline int profile_event_unregister(enum profile_type t, struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-#define profile_task_exit(a) do { } while (0)
-#define profile_handoff_task(a) (0)
-#define profile_munmap(a) do { } while (0)
-
#endif /* CONFIG_PROFILING */

#endif /* _LINUX_PROFILE_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index f702a6a63686e..5086a5e9d02de 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -765,7 +765,6 @@ void __noreturn do_exit(long code)
preempt_count_set(PREEMPT_ENABLED);
}

- profile_task_exit(tsk);
kcov_task_exit(tsk);

coredump_task_exit(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index 3244cc56b697d..496c0b6c8cb83 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -754,9 +754,7 @@ void __put_task_struct(struct task_struct *tsk)
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
sched_core_free(tsk);
-
- if (!profile_handoff_task(tsk))
- free_task(tsk);
+ free_task(tsk);
}
EXPORT_SYMBOL_GPL(__put_task_struct);

diff --git a/kernel/profile.c b/kernel/profile.c
index eb9c7f0f5ac52..37640a0bd8a3c 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -133,79 +133,6 @@ int __ref profile_init(void)
return -ENOMEM;
}

-/* Profile event notifications */
-
-static BLOCKING_NOTIFIER_HEAD(task_exit_notifier);
-static ATOMIC_NOTIFIER_HEAD(task_free_notifier);
-static BLOCKING_NOTIFIER_HEAD(munmap_notifier);
-
-void profile_task_exit(struct task_struct *task)
-{
- blocking_notifier_call_chain(&task_exit_notifier, 0, task);
-}
-
-int profile_handoff_task(struct task_struct *task)
-{
- int ret;
- ret = atomic_notifier_call_chain(&task_free_notifier, 0, task);
- return (ret == NOTIFY_OK) ? 1 : 0;
-}
-
-void profile_munmap(unsigned long addr)
-{
- blocking_notifier_call_chain(&munmap_notifier, 0, (void *)addr);
-}
-
-int task_handoff_register(struct notifier_block *n)
-{
- return atomic_notifier_chain_register(&task_free_notifier, n);
-}
-EXPORT_SYMBOL_GPL(task_handoff_register);
-
-int task_handoff_unregister(struct notifier_block *n)
-{
- return atomic_notifier_chain_unregister(&task_free_notifier, n);
-}
-EXPORT_SYMBOL_GPL(task_handoff_unregister);
-
-int profile_event_register(enum profile_type type, struct notifier_block *n)
-{
- int err = -EINVAL;
-
- switch (type) {
- case PROFILE_TASK_EXIT:
- err = blocking_notifier_chain_register(
- &task_exit_notifier, n);
- break;
- case PROFILE_MUNMAP:
- err = blocking_notifier_chain_register(
- &munmap_notifier, n);
- break;
- }
-
- return err;
-}
-EXPORT_SYMBOL_GPL(profile_event_register);
-
-int profile_event_unregister(enum profile_type type, struct notifier_block *n)
-{
- int err = -EINVAL;
-
- switch (type) {
- case PROFILE_TASK_EXIT:
- err = blocking_notifier_chain_unregister(
- &task_exit_notifier, n);
- break;
- case PROFILE_MUNMAP:
- err = blocking_notifier_chain_unregister(
- &munmap_notifier, n);
- break;
- }
-
- return err;
-}
-EXPORT_SYMBOL_GPL(profile_event_unregister);
-
#if defined(CONFIG_SMP) && defined(CONFIG_PROC_FS)
/*
* Each cpu has a pair of open-addressed hashtables for pending
diff --git a/mm/mmap.c b/mm/mmap.c
index bfb0ea164a90a..70318c2a47c39 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2928,7 +2928,6 @@ EXPORT_SYMBOL(vm_munmap);
SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
{
addr = untagged_addr(addr);
- profile_munmap(addr);
return __vm_munmap(addr, len, true);
}


2022-01-07 03:59:49

by Al Viro

Subject: Re: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads

On Wed, Dec 08, 2021 at 02:25:31PM -0600, Eric W. Biederman wrote:
> Today the rules are a bit iffy and arbitrary about which kernel
> threads have struct kthread present. Both idle threads and threads
> started with create_kthread want struct kthread present so that is
> effectively all kernel threads. Make the rule that if PF_KTHREAD
> is set and the task is running then struct kthread is present.
>
> This will allow the kernel thread code to use tsk->exit_code
> with different semantics from ordinary processes.

Getting rid of ->exit_code abuse is independent from this.
I'm not saying that this change is a bad idea, but it's an
independent thing. Simply turn these two failure exits
into do_exit(0) in 06/10 and that's it. Then this one
would get rid of if (!self) and the second of those two
calls, but it won't be nailed to that point of queue.

2022-01-07 19:01:43

by Eric W. Biederman

Subject: Re: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead

Al Viro <[email protected]> writes:

> On Wed, Dec 08, 2021 at 02:25:25PM -0600, Eric W. Biederman wrote:
>> - /*
>> - * If do_exit is called because this processes oopsed, it's possible
>> - * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
>> - * continuing. Amongst other possible reasons, this is to prevent
>> - * mm_release()->clear_child_tid() from writing to a user-controlled
>> - * kernel address.
>> - */
>> - force_uaccess_begin();
>
> Are you sure about that one? It shouldn't matter, but... it's a potential
> change for do_exit() from a kernel thread. As it is, we have that
> force_uaccess_begin() for exiting threads and for kernel ones it's not
> a no-op. I'm not concerned about attempted userland access after that
> point for those, obviously, but I'm not sure you won't step into something
> subtle here.
>
> I would prefer to split that particular change off into a separate commit...

Thank you for catching that. I was leaning too much on the description
in the comment of why force_uaccess_begin is there.

Catching up on the state of set_fs/get_fs removal it appears like a lot
of progress has been made and on a lot of architectures set_fs/get_fs is
just gone, and force_uaccess_begin is a noop.

On architectures that still have set_fs/get_fs it appears all of the old
warts are present and kernel threads still run with set_fs(KERNEL_DS).

Assuming it won't be too much longer before the rest of the arches have
set_fs/get_fs removed it looks like it makes sense to leave the
force_uaccess_begin where it is, and just let force_uaccess_begin be
removed when set_fs/get_fs are removed from the tree.

Christoph, does it look like the set_fs/get_fs removal work is going
to stall indefinitely on some architectures? If so I think we want to
find a way to get kernel threads to run with set_fs(USER_DS) on the
stalled architectures. Otherwise I think we have a real hazard of
introducing bugs that will only show up on the stalled architectures.

I finally understand why, when I updated set_child_tid in the kthread
code early in fork, x86 was fine but another architecture was not.

Eric

2022-01-07 19:02:31

by Eric W. Biederman

Subject: Re: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead

Al Viro <[email protected]> writes:

> On Wed, Jan 05, 2022 at 05:48:08AM +0000, Al Viro wrote:
>> On Wed, Dec 08, 2021 at 02:25:25PM -0600, Eric W. Biederman wrote:
>> > The beginning of do_exit has become cluttered and difficult to read as
>> > it is filled with checks to handle things that can only happen when
>> > the kernel is operating improperly.
>> >
>> > Now that we have a dedicated function for cleaning up a task when the
>> > kernel is operating improperly move the checks there.
>>
>> Umm... I would probably take profile_task_exit() crap out before that
>> point.
>> 1) the damn thing is dead - nothing registers notifiers there
>> 2) blocking_notifier_call_chain() is not a nice thing to do on oops...
>>
>> I'll post a patch ripping the dead parts of kernel/profile.c out tomorrow
>> morning (there's also profile_handoff_task(), equally useless these days
>> and complicating things for __put_task_struct()).
>
> Argh... OK, so your subsequent series had pretty much the same thing.
> My apologies - still digging myself out from mail pile that had accumulated
> over two months ;-/

No worries. I really appreciate getting some detailed review. Some
things just take another set of eyes to spot.

Eric


2022-01-08 16:11:29

by Eric W. Biederman

Subject: Re: [PATCH 01/17] exit: Remove profile_task_exit & profile_munmap

Al Viro <[email protected]> writes:

> On Mon, Jan 03, 2022 at 03:32:56PM -0600, Eric W. Biederman wrote:
>> When I say remove I mean remove. All profile_task_exit and
>> profile_munmap do is call a blocking notifier chain. The helpers
>> profile_event_register and profile_event_unregister are not called
>> anywhere in the tree. Which means this is all dead code.
>>
>> So remove the dead code and make it easier to read do_exit.
>
> How about doing the same to profile_handoff_task() and
> task_handoff_register()/task_handoff_unregister(),
> while we are at it? Combined diff would be this:

A very good idea. I have added this incremental patch to my queue.

Eric

From: "Eric W. Biederman" <[email protected]>
Date: Sat, 8 Jan 2022 10:03:24 -0600
Subject: [PATCH] exit: Remove profile_handoff_task

All profile_handoff_task does is notify the task_free_notifier chain.
The helpers task_handoff_register and task_handoff_unregister are used
to add and delete entries from that chain and are never called.

So remove the dead code and make it much easier to read and reason
about __put_task_struct.

Suggested-by: Al Viro <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/profile.h | 19 -------------------
kernel/fork.c | 4 +---
kernel/profile.c | 23 -----------------------
3 files changed, 1 insertion(+), 45 deletions(-)

diff --git a/include/linux/profile.h b/include/linux/profile.h
index f7eb2b57d890..11db1ec516e2 100644
--- a/include/linux/profile.h
+++ b/include/linux/profile.h
@@ -61,14 +61,6 @@ static inline void profile_hit(int type, void *ip)
struct task_struct;
struct mm_struct;

-/* task is dead, free task struct ? Returns 1 if
- * the task was taken, 0 if the task should be freed.
- */
-int profile_handoff_task(struct task_struct * task);
-
-int task_handoff_register(struct notifier_block * n);
-int task_handoff_unregister(struct notifier_block * n);
-
#else

#define prof_on 0
@@ -93,17 +85,6 @@ static inline void profile_hit(int type, void *ip)
return;
}

-static inline int task_handoff_register(struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-static inline int task_handoff_unregister(struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-#define profile_handoff_task(a) (0)

#endif /* CONFIG_PROFILING */

diff --git a/kernel/fork.c b/kernel/fork.c
index 6f0293cb29c9..494539ecb6d3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -754,9 +754,7 @@ void __put_task_struct(struct task_struct *tsk)
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
sched_core_free(tsk);
-
- if (!profile_handoff_task(tsk))
- free_task(tsk);
+ free_task(tsk);
}
EXPORT_SYMBOL_GPL(__put_task_struct);

diff --git a/kernel/profile.c b/kernel/profile.c
index 9355cc934a96..37640a0bd8a3 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -133,29 +133,6 @@ int __ref profile_init(void)
return -ENOMEM;
}

-/* Profile event notifications */
-
-static ATOMIC_NOTIFIER_HEAD(task_free_notifier);
-
-int profile_handoff_task(struct task_struct *task)
-{
- int ret;
- ret = atomic_notifier_call_chain(&task_free_notifier, 0, task);
- return (ret == NOTIFY_OK) ? 1 : 0;
-}
-
-int task_handoff_register(struct notifier_block *n)
-{
- return atomic_notifier_chain_register(&task_free_notifier, n);
-}
-EXPORT_SYMBOL_GPL(task_handoff_register);
-
-int task_handoff_unregister(struct notifier_block *n)
-{
- return atomic_notifier_chain_unregister(&task_free_notifier, n);
-}
-EXPORT_SYMBOL_GPL(task_handoff_unregister);
-
#if defined(CONFIG_SMP) && defined(CONFIG_PROC_FS)
/*
* Each cpu has a pair of open-addressed hashtables for pending
--
2.29.2



2022-01-08 18:14:49

by Eric W. Biederman

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Dmitry Osipenko <[email protected]> writes:

> 05.01.2022 22:58, Eric W. Biederman writes:
>>
>> I have not yet been able to figure out how to run gst-plugin-scanner in
>> a way that triggers this. In truth I can't figure out how to
>> run gst-plugin-scanner in a useful way.
>>
>> I am going to set up some unit tests and see if I can reproduce your
>> hang another way, but if you could give me some more information on what
>> you are doing to trigger this I would appreciate it.
>
> Thanks, Eric. The distro is Arch Linux, but it's a development
> environment where I'm running latest GStreamer from git master. I'll try
> to figure out the reproduction steps and get back to you.

Thank you.

Until I can figure out why this is causing problems I have dropped the
following two patches from my queue:
signal: Make SIGKILL during coredumps an explicit special case
signal: Drop signals received after a fatal signal has been processed

I have replaced them with the following two patches that just do what
is needed for the rest of the code in the series:
signal: Have prepare_signal detect coredumps using signal->core_state
signal: Make coredump handling explicit in complete_signal

Perversely my failure to change the SIGKILL handling when coredumps are
happening proves to me that I need to change the SIGKILL handling when
coredumps are happening to make the code more maintainable.

Eric


2022-01-08 18:15:30

by Eric W. Biederman

Subject: [PATCH 1/2] signal: Have prepare_signal detect coredumps using signal->core_state


In preparation for removing the flag SIGNAL_GROUP_COREDUMP, change
prepare_signal to test signal->core_state instead of the flag
SIGNAL_GROUP_COREDUMP.

Both fields are protected by siglock and both live in signal_struct so
there are no real tradeoffs here, just a change to which field is
being tested.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/signal.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 8272cac5f429..f95a4423519d 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -906,8 +906,8 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
struct task_struct *t;
sigset_t flush;

- if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
- if (!(signal->flags & SIGNAL_GROUP_EXIT))
+ if ((signal->flags & SIGNAL_GROUP_EXIT) || signal->core_state) {
+ if (signal->core_state)
return sig == SIGKILL;
/*
* The process is in the middle of dying, nothing to do.
--
2.29.2
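
The branch selection before and after this change can be compared in a small user-space model. This is only a sketch under stated assumptions: the `toy_*` names and flag values are invented, and the model assumes that SIGNAL_GROUP_COREDUMP is set exactly while signal->core_state is non-NULL and that it is never set together with SIGNAL_GROUP_EXIT. Under those assumptions the old and new tests pick the same branch; the model also shows the one state where they would differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values only; not taken from the kernel headers. */
#define TOY_SIGNAL_GROUP_EXIT		0x1u
#define TOY_SIGNAL_GROUP_COREDUMP	0x2u

struct toy_signal {
	unsigned int flags;
	const void *core_state;	/* non-NULL while a coredump is in progress */
};

enum toy_branch {
	TOY_NOT_EXITING,	/* falls through to the rest of prepare_signal */
	TOY_DYING,		/* "in the middle of dying, nothing to do" */
	TOY_COREDUMPING,	/* "return sig == SIGKILL" */
};

/* Branch taken by the test on the '-' lines of the diff. */
enum toy_branch toy_old_branch(const struct toy_signal *s)
{
	if (s->flags & (TOY_SIGNAL_GROUP_EXIT | TOY_SIGNAL_GROUP_COREDUMP)) {
		if (!(s->flags & TOY_SIGNAL_GROUP_EXIT))
			return TOY_COREDUMPING;
		return TOY_DYING;
	}
	return TOY_NOT_EXITING;
}

/* Branch taken by the test on the '+' lines of the diff. */
enum toy_branch toy_new_branch(const struct toy_signal *s)
{
	if ((s->flags & TOY_SIGNAL_GROUP_EXIT) || s->core_state) {
		if (s->core_state)
			return TOY_COREDUMPING;
		return TOY_DYING;
	}
	return TOY_NOT_EXITING;
}

/* True if both versions pick the same branch for the given state. */
bool toy_agree(unsigned int flags, bool dumping)
{
	static const int dummy_core_state;
	struct toy_signal s = { flags, dumping ? &dummy_core_state : NULL };

	return toy_old_branch(&s) == toy_new_branch(&s);
}

/* Walk the three states allowed by the assumptions, plus the excluded one. */
void toy_check_all(void)
{
	assert(toy_agree(0, false));
	assert(toy_agree(TOY_SIGNAL_GROUP_EXIT, false));
	assert(toy_agree(TOY_SIGNAL_GROUP_COREDUMP, true));
	/* Excluded by assumption: EXIT set while core_state is non-NULL. */
	assert(!toy_agree(TOY_SIGNAL_GROUP_EXIT, true));
}
```

In the model the two versions disagree only when SIGNAL_GROUP_EXIT is set while core_state is still non-NULL; whether that state is reachable is outside what the sketch can say.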


2022-01-08 18:16:04

by Eric W. Biederman

Subject: [PATCH 2/2] signal: Make coredump handling explicit in complete_signal


Ever since commit 6cd8f0acae34 ("coredump: ensure that SIGKILL always
kills the dumping thread") it has been possible for a SIGKILL received
during a coredump to set SIGNAL_GROUP_EXIT and trigger a process
shutdown (for a second time).

Update the logic to explicitly allow coredumps so that coredumps can
set SIGNAL_GROUP_EXIT and shutdown like an ordinary process.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/signal.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index f95a4423519d..0706c1345a71 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1032,7 +1032,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
* then start taking the whole group down immediately.
*/
if (sig_fatal(p, sig) &&
- !(signal->flags & SIGNAL_GROUP_EXIT) &&
+ (signal->core_state || !(signal->flags & SIGNAL_GROUP_EXIT)) &&
!sigismember(&t->real_blocked, sig) &&
(sig == SIGKILL || !p->ptrace)) {
/*
--
2.29.2


2022-01-08 18:21:04

by Eric W. Biederman

Subject: Re: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads

Al Viro <[email protected]> writes:

> On Wed, Dec 08, 2021 at 02:25:31PM -0600, Eric W. Biederman wrote:
>> Today the rules are a bit iffy and arbitrary about which kernel
>> threads have struct kthread present. Both idle threads and threads
>> started with create_kthread want struct kthread present so that is
>> effectively all kernel threads. Make the rule that if PF_KTHREAD
>> is set and the task is running then struct kthread is present.
>>
>> This will allow the kernel thread code to use tsk->exit_code
>> with different semantics from ordinary processes.
>
> Getting rid of ->exit_code abuse is independent from this.
> I'm not saying that this change is a bad idea, but it's an
> independent thing. Simply turn these two failure exits
> into do_exit(0) in 06/10 and that's it. Then this one
> would get rid of if (!self) and the second of those two
> calls, but it won't be nailed to that point of queue.

That is a good point.

As this code has been in linux-next for a while, I am going to leave
the dependency in place in the interests of sending Linus tested code.

This change with the bit about which field points to struct kthread
seems like a good idea on its own merits.

Eric


2022-01-08 18:36:08

by Eric W. Biederman

Subject: Re: [PATCH 06/10] exit: Implement kthread_exit

Al Viro <[email protected]> writes:

> IMO the right way to handle that would be
> 1) turn these two do_exit() into do_exit(0), to reduce
> confusion
> 2) deal with all do_exit() in kthread payloads. Your
> name for the primitive is fine, IMO.
> 3) make that primitive pass the return value by way of
> a field in struct kthread, adjusting kthread_stop() accordingly
> and passing 0 to do_exit() in kthread_exit() itself.
>
> (2) is not as trivial as you seem to hope, though. Your patches
> in drivers/staging/rt*/ had papered over the problem in there,
> but hadn't really solved it.
>
> thread_exit() should've been shot, all right, but it really ought
> to have been complete_and_exit() there. The thing is, complete()
> + return does *not* guarantee that driver won't get unloaded before
> the thread terminates. Possibly freeing its .code and leaving
> a thread to resume running in there as soon as it regains CPU.
>
> The point of complete_and_exit() is that it's noreturn *and* in
> core kernel. So it can be safely used in a modular kthread,
> if paired with wait_for_completion() in or before module_exit.
> complete() + do_exit() (or complete + return as you've gotten
> there) doesn't give such guarantees at all.


I think we are mostly in agreement here.

There are kernel threads started by modules that do:
complete(...);
return 0;

That should be at a minimum calling complete_and_exit. Possibly should
be restructured to use kthread_stop().

Some of those users of the now removed thread_exit() in staging are
among the offenders.

However thread_exit() was implemented as:
#define thread_exit() complete_and_exit(NULL, 0)

Which does nothing with a completion, it was just a really funny way to
spell "do_exit(0)".

While I agree that digging through all of the kernel threads and finding
the ones that should be calling complete_and_exit is a fine idea, it is
a concern independent of these patches.

> I'm (re)crawling through that zoo right now, will post when
> I get more details.

Eric

2022-01-08 19:13:26

by Heiko Carstens

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

On Tue, Jan 04, 2022 at 01:47:05PM -0600, Eric W. Biederman wrote:
> Currently I suspect changing wait_event_uninterruptible to
> wait_event_killable, is causing problems.
>
> Or perhaps there is some reason tasks that have already entered do_exit
> need to have fatal_signal_pending set. (They will have
> fatal_signal_pending set up until they enter get_signal which calls
> do_group_exit which calls do_exit).
>
> Which is why I am trying to reproduce the reported failure so I can get
> the kernel to tell me what is going on. If this is not resolved quickly
> I won't send you this change, and I will pull it out of linux-next.

It would have been good if you would have removed this from linux-next
already.

Anyway, now I also had to spend quite some time to bisect why several test
suites just hang with linux-next. It's probably because of holidays that
you didn't get more bug reports.

On s390

- ltp
- elfutils selftests
- seccomp kernel selftests

hang with linux-next.

I bisected the problem to this patch using elfutils selftests:

git clone git://sourceware.org/git/elfutils.git
cd elfutils
autoreconf -fi
./configure --enable-maintainer-mode --disable-debuginfod
make -j $(nproc) > /dev/null
cd tests
make -j $(nproc) check

Note: I actually didn't verify if this also causes ltp+seccomp selftests
to hang. I just assume it is the case.

2022-01-08 22:44:39

by David Laight

Subject: RE: [PATCH 06/10] exit: Implement kthread_exit

From: Eric W. Biederman
> Sent: 08 January 2022 18:36
>
> Al Viro <[email protected]> writes:
>
> > IMO the right way to handle that would be
> > 1) turn these two do_exit() into do_exit(0), to reduce
> > confusion
> > 2) deal with all do_exit() in kthread payloads. Your
> > name for the primitive is fine, IMO.
> > 3) make that primitive pass the return value by way of
> > a field in struct kthread, adjusting kthread_stop() accordingly
> > and passing 0 to do_exit() in kthread_exit() itself.
> >
> > (2) is not as trivial as you seem to hope, though. Your patches
> > in drivers/staging/rt*/ had papered over the problem in there,
> > but hadn't really solved it.
> >
> > thread_exit() should've been shot, all right, but it really ought
> > to have been complete_and_exit() there. The thing is, complete()
> > + return does *not* guarantee that driver won't get unloaded before
> > the thread terminates. Possibly freeing its .code and leaving
> > a thread to resume running in there as soon as it regains CPU.
> >
> > The point of complete_and_exit() is that it's noreturn *and* in
> > core kernel. So it can be safely used in a modular kthread,
> > if paired with wait_for_completion() in or before module_exit.
> > complete() + do_exit() (or complete + return as you've gotten
> > there) doesn't give such guarantees at all.
>
>
> I think we are mostly in agreement here.
>
> There are kernel threads started by modules that do:
> complete(...);
> return 0;
>
> That should be at a minimum calling complete_and_exit. Possibly should
> be restructured to use kthread_stop().

There is also module_put_and_exit(0);
Which must have an implied THIS_MODULE.

David



2022-01-09 03:27:35

by Al Viro

Subject: Re: [PATCH 06/10] exit: Implement kthread_exit

On Sat, Jan 08, 2022 at 12:35:40PM -0600, Eric W. Biederman wrote:

> There are kernel threads started by modules that do:
> complete(...);
> return 0;
>
> That should be at a minimum calling complete_and_exit. Possibly should
> be restructured to use kthread_stop().
>
> Some of those users of the now removed thread_exit() in staging are
> among the offenders.
>
> However thread_exit() was implemented as:
> #define thread_exit() complete_and_exit(NULL, 0)
>
> Which does nothing with a completion, it was just a really funny way to
> spell "do_exit(0)".

Yes. And there's plenty of cargo-culting in that area.

> While I agree digging through all of the kernel threads and finding the
> ones that should be calling complete_and_exit is a fine idea. It is
> a concern independent of these patches.

BTW, could somebody explain how could this
	/*
	 * Prevent the kthread exits directly, and make sure when kthread_stop()
	 * is called to stop a kthread, it is still alive. If a kthread might be
	 * stopped by CACHE_SET_IO_DISABLE bit set, wait_for_kthread_stop() is
	 * necessary before the kthread returns.
	 */
	static inline void wait_for_kthread_stop(void)
	{
		while (!kthread_should_stop()) {
			set_current_state(TASK_INTERRUPTIBLE);
			schedule();
		}
	}

in drivers/md/bcache/bcache.h possibly avoid losing wakeups?

AFAICS, it can be called while in TASK_RUNNING. Suppose kthread_stop()
gets called just after the check for kthread_should_stop(). Our thread
is still in TASK_RUNNING; kthread_stop() sets the flag for the next
kthread_should_stop() to observe and does wake_up_process() to our
thread. Which does nothing. Now our thread goes into TASK_INTERRUPTIBLE
and calls schedule(). Sure, as soon as it gets woken up it'll call
kthread_should_stop(), get true from it and that's it. What's going
to wake it up, though?

The same goes for e.g. fs/btrfs/disk-io.c:cleaner_kthread():
	if (kthread_should_stop())
		return 0;
	if (!again) {
		set_current_state(TASK_INTERRUPTIBLE);
		schedule();
		__set_current_state(TASK_RUNNING);
	}
can't be right. Similar fun exists in e.g. fs/jfs, etc.

Am I missing something?
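
The window described above can be replayed in a toy user-space model. This is only a sketch: `toy_kthread`, `toy_kthread_stop()`, `toy_schedule_sleeps()` and the two scenario functions are invented for illustration and merely mimic what kthread_should_stop(), kthread_stop(), set_current_state() and schedule() do to the task state; no kernel API is called. It steps through the check-then-set ordering of the quoted loops and, for comparison, a set-then-check ordering.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one kthread; the names mirror the kernel's but this is NOT kernel code. */
enum toy_state { TOY_TASK_RUNNING, TOY_TASK_INTERRUPTIBLE };

struct toy_kthread {
	enum toy_state state;
	bool should_stop;
};

/* Model of kthread_stop(): set the flag, then wake the thread. */
static void toy_kthread_stop(struct toy_kthread *k)
{
	k->should_stop = true;
	k->state = TOY_TASK_RUNNING;	/* wake_up_process(): no effect if already running */
}

/* Model of schedule(): true means the task sleeps and needs a later wakeup. */
static bool toy_schedule_sleeps(const struct toy_kthread *k)
{
	return k->state != TOY_TASK_RUNNING;
}

/*
 * Check-then-set, with the stop request arriving between the check and
 * set_current_state(). Returns true if the thread ends up asleep with
 * nobody left to wake it.
 */
bool check_then_set_loses_wakeup(void)
{
	struct toy_kthread k = { TOY_TASK_RUNNING, false };

	if (k.should_stop)			/* kthread_should_stop(): false */
		return false;
	toy_kthread_stop(&k);			/* stop request races in here */
	k.state = TOY_TASK_INTERRUPTIBLE;	/* overwrites the wakeup */
	return toy_schedule_sleeps(&k);
}

/*
 * Set-then-check: the state is published before the flag is tested, so a
 * stop request is either seen by the check or resets the state to running.
 */
bool set_then_check_loses_wakeup(bool stop_before_check)
{
	struct toy_kthread k = { TOY_TASK_RUNNING, false };

	k.state = TOY_TASK_INTERRUPTIBLE;	/* set_current_state() first */
	if (stop_before_check)
		toy_kthread_stop(&k);
	if (k.should_stop) {			/* kthread_should_stop() */
		k.state = TOY_TASK_RUNNING;
		return false;
	}
	if (!stop_before_check)
		toy_kthread_stop(&k);		/* races in after the check instead */
	return toy_schedule_sleeps(&k);
}

/* Replays all three interleavings covered by the model. */
void toy_self_check(void)
{
	assert(check_then_set_loses_wakeup());
	assert(!set_then_check_loses_wakeup(true));
	assert(!set_then_check_loses_wakeup(false));
}
```

In the model the only difference between the two variants is whether the state is written before or after the flag is tested. It says nothing about memory ordering on real hardware, which the model does not attempt to cover.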

2022-01-10 15:01:03

by Eric W. Biederman

Subject: Re: [PATCH 06/10] exit: Implement kthread_exit

David Laight <[email protected]> writes:

> From: Eric W. Biederman
>> Sent: 08 January 2022 18:36
>>
>> Al Viro <[email protected]> writes:
>>
>> > IMO the right way to handle that would be
>> > 1) turn these two do_exit() into do_exit(0), to reduce
>> > confusion
>> > 2) deal with all do_exit() in kthread payloads. Your
>> > name for the primitive is fine, IMO.
>> > 3) make that primitive pass the return value by way of
>> > a field in struct kthread, adjusting kthread_stop() accordingly
>> > and passing 0 to do_exit() in kthread_exit() itself.
>> >
>> > (2) is not as trivial as you seem to hope, though. Your patches
>> > in drivers/staging/rt*/ had papered over the problem in there,
>> > but hadn't really solved it.
>> >
>> > thread_exit() should've been shot, all right, but it really ought
>> > to have been complete_and_exit() there. The thing is, complete()
>> > + return does *not* guarantee that driver won't get unloaded before
>> > the thread terminates. Possibly freeing its .code and leaving
>> > a thread to resume running in there as soon as it regains CPU.
>> >
>> > The point of complete_and_exit() is that it's noreturn *and* in
>> > core kernel. So it can be safely used in a modular kthread,
>> > if paired with wait_for_completion() in or before module_exit.
>> > complete() + do_exit() (or complete + return as you've gotten
>> > there) doesn't give such guarantees at all.
>>
>>
>> I think we are mostly in agreement here.
>>
>> There are kernel threads started by modules that do:
>> complete(...);
>> return 0;
>>
>> That should be at a minimum calling complete_and_exit. Possibly should
>> be restructured to use kthread_stop().
>
> There is also module_put_and_exit(0);
> Which must have an implied THIS_MODULE.

Later in the patch series I change
module_put_and_exit -> module_put_and_kthread_exit
complete_and_exit -> kthread_complete_and_exit

The problem that I understand Al was seeing was where people should
have been using complete_and_exit and were not.

Eric

2022-01-10 15:05:34

by Eric W. Biederman

Subject: Re: [PATCH 06/10] exit: Implement kthread_exit

Al Viro <[email protected]> writes:

> On Sat, Jan 08, 2022 at 12:35:40PM -0600, Eric W. Biederman wrote:
>
>> There are kernel threads started by modules that do:
>> complete(...);
>> return 0;
>>
>> That should be at a minimum calling complete_and_exit. Possibly should
>> be restructured to use kthread_stop().
>>
>> Some of those users of the now removed thread_exit() in staging are
>> among the offenders.
>>
>> However thread_exit() was implemented as:
>> #define thread_exit() complete_and_exit(NULL, 0)
>>
>> Which does nothing with a completion, it was just a really funny way to
>> spell "do_exit(0)".
>
> Yes. And there's plenty of cargo-culting in that area.
>
>> While I agree digging through all of the kernel threads and finding the
>> ones that should be calling complete_and_exit is a fine idea. It is
>> a concern independent of these patches.
>
> BTW, could somebody explain how could this
> /*
> * Prevent the kthread exits directly, and make sure when kthread_stop()
> * is called to stop a kthread, it is still alive. If a kthread might be
> * stopped by CACHE_SET_IO_DISABLE bit set, wait_for_kthread_stop() is
> * necessary before the kthread returns.
> */
> static inline void wait_for_kthread_stop(void)
> {
> while (!kthread_should_stop()) {
> set_current_state(TASK_INTERRUPTIBLE);
> schedule();
> }
> }
>
> in drivers/md/bcache/bcache.h possibly avoid losing wakeups?
>
> AFAICS, it can be called while in TASK_RUNNING. Suppose kthread_stop()
> gets called just after the check for kthread_should_stop(). Our thread
> is still in TASK_RUNNING; kthread_stop() sets the flag for the next
> kthread_should_stop() to observe and does wake_up_process() to our
> thread. Which does nothing. Now our thread goes into TASK_INTERRUPTIBLE
> and calls schedule(). Sure, as soon as it gets woken up it'll call
> kthread_should_stop(), get true from it and that's it. What's going
> to wake it up, though?
>
> The same goes for e.g. fs/btrfs/disk-io.c:cleaner_kthread():
> if (kthread_should_stop())
> return 0;
> if (!again) {
> set_current_state(TASK_INTERRUPTIBLE);
> schedule();
> __set_current_state(TASK_RUNNING);
> }
> can't be right. Similar fun exists in e.g. fs/jfs, etc.
>
> Am I missing something?

Those examples look as suspect to me as they do to you.

Eric


2022-01-10 15:27:13

by Geert Uytterhoeven

Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

On Mon, Jan 3, 2022 at 10:33 PM Eric W. Biederman <[email protected]> wrote:
> The generic function ptrace_report_syscall does a little more
> than syscall_trace on m68k. The function ptrace_report_syscall
> stops early if PT_PTRACED is not set, it sets ptrace_message,
> and returns the result of fatal_signal_pending.
>
> Setting ptrace_message to a passed in value of 0 is effectively not
> setting ptrace_message, making that additional work a noop.
>
> Returning the result of fatal_signal_pending and letting the caller
> ignore the result becomes a noop in this change.
>
> When a process is ptraced, the flag PT_PTRACED is always set in
> current->ptrace. Testing for PT_PTRACED in ptrace_report_syscall is
> just an optimization to fail early if the process is not ptraced.
> Later on in ptrace_notify, ptrace_stop will test current->ptrace under
> tasklist_lock and skip performing any work if the task is not ptraced.
>
> Cc: Geert Uytterhoeven <[email protected]>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

As this depends on the removal of a parameter from
ptrace_report_syscall() earlier in this series:
Acked-by: Geert Uytterhoeven <[email protected]>

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2022-01-10 16:20:13

by Al Viro

Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

On Mon, Jan 10, 2022 at 04:26:57PM +0100, Geert Uytterhoeven wrote:
> On Mon, Jan 3, 2022 at 10:33 PM Eric W. Biederman <[email protected]> wrote:
> > The generic function ptrace_report_syscall does a little more
> > than syscall_trace on m68k. The function ptrace_report_syscall
> > stops early if PT_PTRACED is not set, it sets ptrace_message,
> > and returns the result of fatal_signal_pending.
> >
> > Setting ptrace_message to a passed in value of 0 is effectively not
> > setting ptrace_message, making that additional work a noop.
> >
> > Returning the result of fatal_signal_pending and letting the caller
> > ignore the result becomes a noop in this change.
> >
> > When a process is ptraced, the flag PT_PTRACED is always set in
> > current->ptrace. Testing for PT_PTRACED in ptrace_report_syscall is
> > just an optimization to fail early if the process is not ptraced.
> > Later on in ptrace_notify, ptrace_stop will test current->ptrace under
> > tasklist_lock and skip performing any work if the task is not ptraced.
> >
> > Cc: Geert Uytterhoeven <[email protected]>
> > Signed-off-by: "Eric W. Biederman" <[email protected]>
>
> As this depends on the removal of a parameter from
> ptrace_report_syscall() earlier in this series:
> Acked-by: Geert Uytterhoeven <[email protected]>

FWIW, I would suggest taking it a bit further: make syscall_trace_enter()
and syscall_trace_leave() in m68k ptrace.c unconditional, replace the
calls of syscall_trace() in entry.S with syscall_trace_enter() and
syscall_trace_leave() resp. and remove syscall_trace().

Geert, do you see any problems with that? The only difference is that
current->ptrace_message would be set to 1 for ptrace stop on entry and
2 - on leave. Currently m68k just has it 0 all along.

It is user-visible (the whole point is to let the tracer see which
stop it is - entry or exit one), so somebody using PTRACE_GETEVENTMSG
on syscall stops would start seeing 1 or 2 instead of "0 all along".
That's how it works on all other architectures (including m68k-nommu),
and I doubt that anything in userland will get broken.

Behaviour of PTRACE_GETEVENTMSG for other stops (fork, etc.) remains
as-is, of course.
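
To make the user-visible side concrete, here is a minimal userspace sketch of how a tracer might interpret the value described above. The values 1 and 2 are taken from this message; the enum, the macro names and both helper names are hypothetical and local to the sketch. Only ptrace() and the PTRACE_GETEVENTMSG request come from <sys/ptrace.h>.

```c
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Values reported at syscall stops, per the discussion above.
 * The names are local to this sketch (assumption), not kernel names. */
#define SKETCH_MSG_SYSCALL_ENTRY 1UL
#define SKETCH_MSG_SYSCALL_EXIT  2UL

enum stop_kind { STOP_UNKNOWN, STOP_ENTRY, STOP_EXIT };

/* Map an event message to the kind of syscall stop it denotes. */
static enum stop_kind classify_syscall_stop(unsigned long msg)
{
	switch (msg) {
	case SKETCH_MSG_SYSCALL_ENTRY:
		return STOP_ENTRY;
	case SKETCH_MSG_SYSCALL_EXIT:
		return STOP_EXIT;
	default:
		/* e.g. 0, or a stale value left over from an earlier stop */
		return STOP_UNKNOWN;
	}
}

/* Fetch the event message of a tracee that is currently in a ptrace stop.
 * Returns 0 on success, -1 on error (errno is set by ptrace). */
static int read_stop_message(pid_t pid, unsigned long *msg)
{
	if (ptrace(PTRACE_GETEVENTMSG, pid, NULL, msg) == -1)
		return -1;
	return 0;
}
```

A tracer would call read_stop_message() after waitpid() reports a syscall stop and pass the result to classify_syscall_stop(). A tracer that never looks at the message at syscall stops should not notice the change.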

2022-01-10 16:25:25

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

On Mon, Jan 10, 2022 at 04:20:03PM +0000, Al Viro wrote:

> Geert, do you see any problems with that? The only difference is that
> current->ptrace_message would be set to 1 for ptrace stop on entry and
> 2 - on leave. Currently m68k just has it 0 all along.
>
> It is user-visible (the whole point is to let the tracer see which
> stop it is - entry or exit one), so somebody using PTRACE_GETEVENTMSG
> on syscall stops would start seeing 1 or 2 instead of "0 all along".
> That's how it works on all other architectures (including m68k-nommu),
> and I doubt that anything in userland will get broken.
>
> Behaviour of PTRACE_GETEVENTMSG for other stops (fork, etc.) remains
> as-is, of course.

Actually, the current behaviour is "report what the last PTRACE_GETEVENTMSG
has reported, whatever kind of stop that used to be for". So I very much
doubt that anything could break there.

2022-01-10 17:55:14

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

Hi Al,

CC Michael/m68k,

On Mon, Jan 10, 2022 at 5:20 PM Al Viro <[email protected]> wrote:
> On Mon, Jan 10, 2022 at 04:26:57PM +0100, Geert Uytterhoeven wrote:
> > On Mon, Jan 3, 2022 at 10:33 PM Eric W. Biederman <[email protected]> wrote:
> > > The generic function ptrace_report_syscall does a little more
> > > than syscall_trace on m68k. The function ptrace_report_syscall
> > > > stops early if PT_PTRACED is not set, it sets ptrace_message,
> > > and returns the result of fatal_signal_pending.
> > >
> > > Setting ptrace_message to a passed in value of 0 is effectively not
> > > setting ptrace_message, making that additional work a noop.
> > >
> > > Returning the result of fatal_signal_pending and letting the caller
> > > ignore the result becomes a noop in this change.
> > >
> > > When a process is ptraced, the flag PT_PTRACED is always set in
> > > current->ptrace. Testing for PT_PTRACED in ptrace_report_syscall is
> > > just an optimization to fail early if the process is not ptraced.
> > > Later on in ptrace_notify, ptrace_stop will test current->ptrace under
> > > tasklist_lock and skip performing any work if the task is not ptraced.
> > >
> > > Cc: Geert Uytterhoeven <[email protected]>
> > > Signed-off-by: "Eric W. Biederman" <[email protected]>
> >
> > As this depends on the removal of a parameter from
> > ptrace_report_syscall() earlier in this series:
> > Acked-by: Geert Uytterhoeven <[email protected]>
>
> FWIW, I would suggest taking it a bit further: make syscall_trace_enter()
> and syscall_trace_leave() in m68k ptrace.c unconditional, replace the
> calls of syscall_trace() in entry.S with syscall_trace_enter() and
> syscall_trace_leave() resp. and remove syscall_trace().
>
> Geert, do you see any problems with that? The only difference is that
> current->ptrace_message would be set to 1 for ptrace stop on entry and
> 2 - on leave. Currently m68k just has it 0 all along.
>
> It is user-visible (the whole point is to let the tracer see which
> stop it is - entry or exit one), so somebody using PTRACE_GETEVENTMSG
> on syscall stops would start seeing 1 or 2 instead of "0 all along".
> That's how it works on all other architectures (including m68k-nommu),
> and I doubt that anything in userland will get broken.
>
> Behaviour of PTRACE_GETEVENTMSG for other stops (fork, etc.) remains
> as-is, of course.

In fact Michael did so in "[PATCH v7 1/2] m68k/kernel - wire up
syscall_trace_enter/leave for m68k"[1], but that's still stuck...

[1] https://lore.kernel.org/r/[email protected]/

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2022-01-10 20:37:46

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

On Mon, Jan 10, 2022 at 06:54:57PM +0100, Geert Uytterhoeven wrote:

> In fact Michael did so in "[PATCH v7 1/2] m68k/kernel - wire up
> syscall_trace_enter/leave for m68k"[1], but that's still stuck...
>
> [1] https://lore.kernel.org/r/[email protected]/

Looks sane, but I'd split it in two - switch to calling syscall_trace_{enter,leave}
and then handling the return values...

The former would keep the current behaviour (modulo reporting enter vs. leave
via PTRACE_GETEVENTMSG), the latter would allow syscall number change by tracer
and/or handling of seccomp/audit/whatnot.

For exit+signal work the former would suffice, and IMO it would be a good idea
to put that one into a shared branch to be pulled both by seccomp and by signal
series. Would reduce the conflicts...

Objections?

2022-01-10 21:18:33

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

Al Viro <[email protected]> writes:

> On Mon, Jan 10, 2022 at 06:54:57PM +0100, Geert Uytterhoeven wrote:
>
>> In fact Michael did so in "[PATCH v7 1/2] m68k/kernel - wire up
>> syscall_trace_enter/leave for m68k"[1], but that's still stuck...
>>
>> [1] https://lore.kernel.org/r/[email protected]/
>
> Looks sane, but I'd split it in two - switch to calling syscall_trace_{enter,leave}
> and then handling the return values...
>
> The former would keep the current behaviour (modulo reporting enter vs. leave
> via PTRACE_GETEVENTMSG), the latter would allow syscall number change by tracer
> and/or handling of seccomp/audit/whatnot.
>
> For exit+signal work the former would suffice, and IMO it would be a good idea
> to put that one into a shared branch to be pulled both by seccomp and by signal
> series. Would reduce the conflicts...
>
> Objections?

I have the version that Geert ack'ed queued up for v5.17 in my
signal-for-v5.17 branch, along with a couple of other prior fixes in this
series of changes where it was clear they were just obviously correct
bug fixes. No need to delay the removal of profiling bits for example.

I would love to see the m68k perform syscall_trace_{enter,leave} but
just getting as far as ptrace_report_syscall will be enough to avoid any
dependencies on my side.

Eric

2022-01-10 23:00:14

by Olivier Langlois

[permalink] [raw]
Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

On Mon, 2022-01-10 at 15:11 -0600, Eric W. Biederman wrote:
>
>
> I have been able to confirm that changing wait_event_interruptible to
> wait_event_killable was the culprit.  Something about the way
> systemd-coredump handles coredumps is not compatible with
> wait_event_killable.

This is my experience too that systemd-coredump is doing something
unexpected. When I tested the patch:
https://lore.kernel.org/lkml/[email protected]/

to make sure that the patch worked, sending coredumps to systemd-
coredump was making systemd-coredump, well, core dump... Not very
useful...

Sending the dumps through a pipe to anything other than systemd-coredump
was working fine.


2022-01-11 01:34:01

by Michael Schmitz

[permalink] [raw]
Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

Hi Geert,

On 11.01.2022 at 06:54, Geert Uytterhoeven wrote:
> Hi Al,
>
> CC Michael/m68k,
>
> On Mon, Jan 10, 2022 at 5:20 PM Al Viro <[email protected]> wrote:
>> On Mon, Jan 10, 2022 at 04:26:57PM +0100, Geert Uytterhoeven wrote:
>>> On Mon, Jan 3, 2022 at 10:33 PM Eric W. Biederman <[email protected]> wrote:
>>>> The generic function ptrace_report_syscall does a little more
>>>> than syscall_trace on m68k. The function ptrace_report_syscall
>>>> stops early if PT_PTRACED is not set, it sets ptrace_message,
>>>> and returns the result of fatal_signal_pending.
>>>>
>>>> Setting ptrace_message to a passed in value of 0 is effectively not
>>>> setting ptrace_message, making that additional work a noop.
>>>>
>>>> Returning the result of fatal_signal_pending and letting the caller
>>>> ignore the result becomes a noop in this change.
>>>>
>>>> When a process is ptraced, the flag PT_PTRACED is always set in
>>>> current->ptrace. Testing for PT_PTRACED in ptrace_report_syscall is
>>>> just an optimization to fail early if the process is not ptraced.
>>>> Later on in ptrace_notify, ptrace_stop will test current->ptrace under
>>>> tasklist_lock and skip performing any work if the task is not ptraced.
>>>>
>>>> Cc: Geert Uytterhoeven <[email protected]>
>>>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>>>
>>> As this depends on the removal of a parameter from
>>> ptrace_report_syscall() earlier in this series:
>>> Acked-by: Geert Uytterhoeven <[email protected]>
>>
>> FWIW, I would suggest taking it a bit further: make syscall_trace_enter()
>> and syscall_trace_leave() in m68k ptrace.c unconditional, replace the
>> calls of syscall_trace() in entry.S with syscall_trace_enter() and
>> syscall_trace_leave() resp. and remove syscall_trace().
>>
>> Geert, do you see any problems with that? The only difference is that
>> current->ptrace_message would be set to 1 for ptrace stop on entry and
>> 2 - on leave. Currently m68k just has it 0 all along.
>>
>> It is user-visible (the whole point is to let the tracer see which
>> stop it is - entry or exit one), so somebody using PTRACE_GETEVENTMSG
>> on syscall stops would start seeing 1 or 2 instead of "0 all along".
>> That's how it works on all other architectures (including m68k-nommu),
>> and I doubt that anything in userland will get broken.
>>
>> Behaviour of PTRACE_GETEVENTMSG for other stops (fork, etc.) remains
>> as-is, of course.
>
> In fact Michael did so in "[PATCH v7 1/2] m68k/kernel - wire up
> syscall_trace_enter/leave for m68k"[1], but that's still stuck...
>
> [1] https://lore.kernel.org/r/[email protected]/

That patch (for reasons I never found out) did interact badly with
Christoph Hellwig's 'remove set_fs' patches (and Al's signal fixes which
Christoph's patches are based upon). Caused format errors under memory
stress tests quite reliably, on my 030 hardware.

Probably needs a fresh look - the signal return path got changed by Al's
patches IIRC, and I might have relied on offsets to data on the stack
that are no longer correct with these patches. Or there's a race between
the syscall trap and signal handling when returning from interrupt
context ...

Still school hols over here so I won't have much peace and quiet until
February.

Cheers,

Michael


>
> Gr{oetje,eeting}s,
>
> Geert
>
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
>
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
> -- Linus Torvalds
>

2022-01-11 08:59:46

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

08.01.2022 21:13, Eric W. Biederman wrote:
> Dmitry Osipenko <[email protected]> writes:
>
>> 05.01.2022 22:58, Eric W. Biederman wrote:
>>>
>>> I have not yet been able to figure out how to run gst-plugin-scanner in
>>> a way that triggers this. In truth I can't figure out how to
>>> run gst-plugin-scanner in a useful way.
>>>
>>> I am going to set up some unit tests and see if I can reproduce your
>>> hang another way, but if you could give me some more information on what
>>> you are doing to trigger this I would appreciate it.
>>
>> Thanks, Eric. The distro is Arch Linux, but it's a development
>> environment where I'm running latest GStreamer from git master. I'll try
>> to figure out the reproduction steps and get back to you.
>
> Thank you.
>
> Until I can figure out why this is causing problems I have dropped the
> following two patches from my queue:
> signal: Make SIGKILL during coredumps an explicit special case
> signal: Drop signals received after a fatal signal has been processed
>
> I have replaced them with the following two patches that just do what
> is needed for the rest of the code in the series:
> signal: Have prepare_signal detect coredumps using
> signal: Make coredump handling explicit in complete_signal
>
> Perversely my failure to change the SIGKILL handling when coredumps are
> happening proves to me that I need to change the SIGKILL handling when
> coredumps are happening to make the code more maintainable.

Eric, thank you again. I started to look at the reproduction steps and
haven't completed them yet. It turned out the problem affects only the older
NVIDIA Tegra2 Cortex-A9 CPU, which lacks support for the ARM NEON instruction
set, hence the problem isn't visible on x86 and other CPUs out of the
box. I'll need to check whether the problem could be simulated on all
arches or whether it's specific to VFP exception handling on ARM32.

2022-01-11 17:20:56

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Dmitry Osipenko <[email protected]> writes:

> 08.01.2022 21:13, Eric W. Biederman wrote:
>> Dmitry Osipenko <[email protected]> writes:
>>
>>> 05.01.2022 22:58, Eric W. Biederman wrote:
>>>>
>>>> I have not yet been able to figure out how to run gst-plugin-scanner in
>>>> a way that triggers this. In truth I can't figure out how to
>>>> run gst-plugin-scanner in a useful way.
>>>>
>>>> I am going to set up some unit tests and see if I can reproduce your
>>>> hang another way, but if you could give me some more information on what
>>>> you are doing to trigger this I would appreciate it.
>>>
>>> Thanks, Eric. The distro is Arch Linux, but it's a development
>>> environment where I'm running latest GStreamer from git master. I'll try
>>> to figure out the reproduction steps and get back to you.
>>
>> Thank you.
>>
>> Until I can figure out why this is causing problems I have dropped the
>> following two patches from my queue:
>> signal: Make SIGKILL during coredumps an explicit special case
>> signal: Drop signals received after a fatal signal has been processed
>>
>> I have replaced them with the following two patches that just do what
>> is needed for the rest of the code in the series:
>> signal: Have prepare_signal detect coredumps using
>> signal: Make coredump handling explicit in complete_signal
>>
>> Perversely my failure to change the SIGKILL handling when coredumps are
>> happening proves to me that I need to change the SIGKILL handling when
>> coredumps are happening to make the code more maintainable.
>
> Eric, thank you again. I started to look at the reproduction steps and
> haven't completed them yet. It turned out the problem affects only the older
> NVIDIA Tegra2 Cortex-A9 CPU, which lacks support for the ARM NEON instruction
> set, hence the problem isn't visible on x86 and other CPUs out of the
> box. I'll need to check whether the problem could be simulated on all
> arches or whether it's specific to VFP exception handling on ARM32.

It sounds like the gstreamer plugins only fail on certain hardware on
arm32, and things don't hang in coredumps unless the plugins fail.
That does make things tricky to minimize.

I have just verified that the known problematic code is not
in linux-next for Jan 11 2022.

If folks as they have time can double check linux-next and verify all is
well I would appreciate it. I don't expect that there are problems but
sometimes one problem hides another.

Eric

2022-01-11 17:28:35

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Olivier Langlois <[email protected]> writes:

> On Mon, 2022-01-10 at 15:11 -0600, Eric W. Biederman wrote:
>>
>>
>> I have been able to confirm that changing wait_event_interruptible to
>> wait_event_killable was the culprit.  Something about the way
>> systemd-coredump handles coredumps is not compatible with
>> wait_event_killable.
>
> This is my experience too that systemd-coredump is doing something
> unexpected. When I tested the patch:
> https://lore.kernel.org/lkml/[email protected]/
>
> to make sure that the patch worked, sending coredumps to systemd-
> coredump was making systemd-coredump, well, core dump... Not very
> useful...

Oh. Wow....

> Sending the dumps through a pipe to anything other than systemd-coredump
> was working fine.

Interesting.

I need to read through the pipe code and see how all of that works. For
writing directly to disk, the usual semantics are that only fatal signals
interrupt the write. Ordinary pipe code has different semantics, and I
suspect that is what is tripping things up.

As for systemd-coredump it does whatever it does and I suspect some
versions of systemd-coredump are simply not robust if a coredump stops
unexpectedly.

The good news is the pipe code is simple enough that it will be possible to
completely read through that code.

Eric



2022-01-11 18:51:55

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

"Eric W. Biederman" <[email protected]> writes:

> Olivier Langlois <[email protected]> writes:
>
>> On Mon, 2022-01-10 at 15:11 -0600, Eric W. Biederman wrote:
>>>
>>>
>>> I have been able to confirm that changing wait_event_interruptible to
>>> wait_event_killable was the culprit.  Something about the way
>>> systemd-coredump handles coredumps is not compatible with
>>> wait_event_killable.
>>
>> This is my experience too that systemd-coredump is doing something
>> unexpected. When I tested the patch:
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> to make sure that the patch worked, sending coredumps to systemd-
>> coredump was making systemd-coredump, well, core dump... Not very
>> useful...
>
> Oh. Wow....
>
>> Sending the dumps through a pipe to anything other than systemd-coredump
>> was working fine.
>
> Interesting.
>
> I need to read through the pipe code and see how all of that works. For
> writing directly to disk only ignoring killable interruptions are the
> usual semantics. Ordinary pipe code has different semantics, and I
> suspect that is what is tripping things up.
>
> As for systemd-coredump it does whatever it does and I suspect some
> versions of systemd-coredump are simply not robust if a coredump stops
> unexpectedly.
>
> The good news is the pipe code is simple enough, it will be possible to
> completely read through that code.

My bug, obvious in hindsight, is that "try_to_wake_up(TASK_INTERRUPTIBLE)"
does not work on a task that is sleeping in TASK_KILLABLE.
That looks fixable in wait_for_dump_helpers; it just won't be as easy
as changing wait_event_interruptible to wait_event_killable.
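
The mismatch can be sketched with the state bits alone. This is a userspace model, not kernel code: the numeric values are what I understand include/linux/sched.h to define around v5.16 and should be treated as assumptions, and wakeup_matches() is a hypothetical stand-in for the state check that try_to_wake_up() performs before waking a task.

```c
#include <stdbool.h>

/* Task state bits, assumed to match include/linux/sched.h around v5.16. */
#define TASK_INTERRUPTIBLE   0x0001u
#define TASK_UNINTERRUPTIBLE 0x0002u
#define TASK_WAKEKILL        0x0100u
#define TASK_KILLABLE        (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
#define TASK_NORMAL          (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)

/* Hypothetical model of the wakeup filter: a sleeper is only woken when
 * its current state shares at least one bit with the waker's mask. */
static bool wakeup_matches(unsigned int sleeper_state, unsigned int wake_mask)
{
	return (sleeper_state & wake_mask) != 0;
}
```

Under these assumptions a TASK_KILLABLE sleeper has no bit in common with a TASK_INTERRUPTIBLE wake mask, so an interruptible-only wakeup skips it, while a TASK_NORMAL wakeup would reach it.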

To prevent short pipe writes from causing short writes during a coredump
I believe all we need to do is handle -ERESTARTSYS with TIF_NOTIFY_SIGNAL.
Something like what I have below.

Until wait_for_dump_helpers is sorted out the coredump won't wait for
the dump helper the way it should, but otherwise things should work.

diff --git a/fs/coredump.c b/fs/coredump.c
index 7dece20b162b..0db1baf91420 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -796,6 +796,10 @@ static int __dump_emit(struct coredump_params *cprm, const void *addr, int nr)
 	if (dump_interrupted())
 		return 0;
 	n = __kernel_write(file, addr, nr, &pos);
+	while ((n == -ERESTARTSYS) && test_thread_flag(TIF_NOTIFY_SIGNAL)) {
+		tracehook_notify_signal();
+		n = __kernel_write(file, addr, nr, &pos);
+	}
 	if (n != nr)
 		return 0;
 	file->f_pos = pos;

Eric

2022-01-11 19:20:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

On Tue, Jan 11, 2022 at 10:51 AM Eric W. Biederman
<[email protected]> wrote:
>
> + while ((n == -ERESTARTSYS) && test_thread_flag(TIF_NOTIFY_SIGNAL)) {
> + tracehook_notify_signal();
> + n = __kernel_write(file, addr, nr, &pos);
> + }

This reads horribly wrongly to me.

That "tracehook_notify_signal()" thing *has* to be renamed before we
have anything like this that otherwise looks like "this will just loop
forever".

I'm pretty sure we've discussed that "tracehook" thing before - the
whole header file is misnamed, and most of the functions in there are
too.

As an ugly alternative, open-code it, so that it's clear that "yup,
that clears the TIF_NOTIFY_SIGNAL flag".

Linus
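
The termination argument can be made explicit with a small userspace model of the quoted loop. Everything here is hypothetical and for illustration only: struct model_task, model_write() and the other helper names are not kernel APIs, and MODEL_ERESTARTSYS merely mirrors the kernel-internal error code, which userspace errno.h does not provide.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the kernel-internal ERESTARTSYS value (assumption). */
#define MODEL_ERESTARTSYS 512

struct model_task {
	bool notify_signal;	/* stands in for TIF_NOTIFY_SIGNAL */
	int task_work_runs;	/* counts how often pending work was run */
};

/* Model of a write that bails out while a notification is pending. */
static long model_write(const struct model_task *t, size_t nr)
{
	if (t->notify_signal)
		return -MODEL_ERESTARTSYS;
	return (long)nr;
}

/* Open-coded form of the step in question: clearing the flag is visible
 * right here, which is what lets the retry loop below make progress. */
static void model_clear_notify_and_run_work(struct model_task *t)
{
	t->notify_signal = false;
	t->task_work_runs++;
}

/* Same shape as the proposed loop in __dump_emit(). */
static long model_dump_emit(struct model_task *t, size_t nr)
{
	long n = model_write(t, nr);

	while (n == -MODEL_ERESTARTSYS && t->notify_signal) {
		model_clear_notify_and_run_work(t);
		n = model_write(t, nr);
	}
	return n;
}
```

In this model the loop body runs at most once per pending notification. If the helper did not clear the flag, the condition would never change and the loop would spin forever.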

2022-01-11 22:43:13

by Finn Thain

[permalink] [raw]
Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

On Tue, 11 Jan 2022, Michael Schmitz wrote:

> On 11.01.2022 at 06:54, Geert Uytterhoeven wrote:
> > Hi Al,
> >
> > CC Michael/m68k,
> >
> > On Mon, Jan 10, 2022 at 5:20 PM Al Viro <[email protected]> wrote:
> >> On Mon, Jan 10, 2022 at 04:26:57PM +0100, Geert Uytterhoeven wrote:
> >>> On Mon, Jan 3, 2022 at 10:33 PM Eric W. Biederman <[email protected]>
> >>> wrote:
> >>>> The generic function ptrace_report_syscall does a little more
> >>>> than syscall_trace on m68k. The function ptrace_report_syscall
> > >>>> stops early if PT_PTRACED is not set, it sets ptrace_message,
> >>>> and returns the result of fatal_signal_pending.
> >>>>
> >>>> Setting ptrace_message to a passed in value of 0 is effectively not
> >>>> setting ptrace_message, making that additional work a noop.
> >>>>
> >>>> Returning the result of fatal_signal_pending and letting the caller
> >>>> ignore the result becomes a noop in this change.
> >>>>
> >>>> When a process is ptraced, the flag PT_PTRACED is always set in
> >>>> current->ptrace. Testing for PT_PTRACED in ptrace_report_syscall is
> >>>> just an optimization to fail early if the process is not ptraced.
> >>>> Later on in ptrace_notify, ptrace_stop will test current->ptrace under
> >>>> tasklist_lock and skip performing any work if the task is not ptraced.
> >>>>
> >>>> Cc: Geert Uytterhoeven <[email protected]>
> >>>> Signed-off-by: "Eric W. Biederman" <[email protected]>
> >>>
> >>> As this depends on the removal of a parameter from
> >>> ptrace_report_syscall() earlier in this series:
> >>> Acked-by: Geert Uytterhoeven <[email protected]>
> >>
> >> FWIW, I would suggest taking it a bit further: make syscall_trace_enter()
> >> and syscall_trace_leave() in m68k ptrace.c unconditional, replace the
> >> calls of syscall_trace() in entry.S with syscall_trace_enter() and
> >> syscall_trace_leave() resp. and remove syscall_trace().
> >>
> >> Geert, do you see any problems with that? The only difference is that
> >> current->ptrace_message would be set to 1 for ptrace stop on entry and
> >> 2 - on leave. Currently m68k just has it 0 all along.
> >>
> >> It is user-visible (the whole point is to let the tracer see which
> >> stop it is - entry or exit one), so somebody using PTRACE_GETEVENTMSG
> >> on syscall stops would start seeing 1 or 2 instead of "0 all along".
> >> That's how it works on all other architectures (including m68k-nommu),
> >> and I doubt that anything in userland will get broken.
> >>
> >> Behaviour of PTRACE_GETEVENTMSG for other stops (fork, etc.) remains
> >> as-is, of course.
> >
> > In fact Michael did so in "[PATCH v7 1/2] m68k/kernel - wire up
> > syscall_trace_enter/leave for m68k"[1], but that's still stuck...
> >
> > [1]
> > https://lore.kernel.org/r/[email protected]/
>
> That patch (for reasons I never found out) did interact badly with
> Christoph Hellwig's 'remove set_fs' patches (and Al's signal fixes which
> Christoph's patches are based upon). Caused format errors under memory
> stress tests quite reliably, on my 030 hardware.
>

Those patches have since been merged, BTW.

> Probably needs a fresh look - the signal return path got changed by Al's
> patches IIRC, and I might have relied on offsets to data on the stack
> that are no longer correct with these patches. Or there's a race between
> the syscall trap and signal handling when returning from interrupt
> context ...
>
> Still school hols over here so I won't have much peace and quiet until
> February.
>

So the patch works okay with Aranym 68040 but not Motorola 68030? Since
there is at least one known issue affecting both Motorola 68030 and Hatari
68030, perhaps this patch is not the problem. In any case, Al's suggestion
to split the patch into two may help in that testing two smaller patches
might narrow down the root cause.

2022-01-12 00:20:42

by Michael Schmitz

[permalink] [raw]
Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

Hi Finn,

On 12.01.2022 at 11:42, Finn Thain wrote:
> On Tue, 11 Jan 2022, Michael Schmitz wrote:
>>> In fact Michael did so in "[PATCH v7 1/2] m68k/kernel - wire up
>>> syscall_trace_enter/leave for m68k"[1], but that's still stuck...
>>>
>>> [1]
>>> https://lore.kernel.org/r/[email protected]/
>>
>> That patch (for reasons I never found out) did interact badly with
>> Christoph Hellwig's 'remove set_fs' patches (and Al's signal fixes which
>> Christoph's patches are based upon). Caused format errors under memory
>> stress tests quite reliably, on my 030 hardware.
>>
>
> Those patches have since been merged, BTW.

Yes, that's why I advised caution with mine.

>
>> Probably needs a fresh look - the signal return path got changed by Al's
>> patches IIRC, and I might have relied on offsets to data on the stack
>> that are no longer correct with these patches. Or there's a race between
>> the syscall trap and signal handling when returning from interrupt
>> context ...
>>
>> Still school hols over here so I won't have much peace and quiet until
>> February.
>>
>
> So the patch works okay with Aranym 68040 but not Motorola 68030? Since

Correct - I seem to recall we also tested those on your 040 and there
was no regression there, but I may be misremembering that.

> there is at least one known issue affecting both Motorola 68030 and Hatari
> 68030, perhaps this patch is not the problem. In any case, Al's suggestion

I hadn't ever made that connection, but it might be another explanation,
yes.

> to split the patch into two may help in that testing two smaller patches
> might narrow down the root cause.

That's certainly true.

What's the other reason these patches are still stuck, Geert? Did we
ever settle the dispute about what return code ought to abort a syscall
(in the seccomp context)?

Cheers,

Michael



2022-01-12 03:33:07

by Finn Thain

[permalink] [raw]
Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

On Wed, 12 Jan 2022, Michael Schmitz wrote:

>
> I seem to recall we also tested those on your 040 and there was no
> regression there, but I may be misremembering that.
>

I abandoned that regression testing exercise when unpatched mainline
kernels began failing on that machine. I'm in the process of setting up a
different 68040 machine.

2022-01-12 07:54:54

by Michael Schmitz

[permalink] [raw]
Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

Hi Finn,

On 12.01.2022 at 16:32, Finn Thain wrote:
> On Wed, 12 Jan 2022, Michael Schmitz wrote:
>
>>
>> I seem to recall we also tested those on your 040 and there was no
>> regression there, but I may be misremembering that.
>>
>
> I abandoned that regression testing exercise when unpatched mainline
> kernels began failing on that machine. I'm in the process of setting up a
> different 68040 machine.
>

Thanks for refreshing my memory!

Splitting my first patch as suggested by Al in order to defer handling
of the syscall_trace_enter() return code would achieve what Geert
suggested (eliminate m68k syscall_trace() altogether) without risk of
regression. This would need to replace Eric's patch 8.

Do you want me to send such a version based on my old patch series, or
would you rather prepare that yourself, Eric?

Cheers,

Michael



2022-01-12 07:55:47

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

Hi Michael,

On Wed, Jan 12, 2022 at 1:20 AM Michael Schmitz <[email protected]> wrote:
> On 12.01.2022 at 11:42, Finn Thain wrote:
> > On Tue, 11 Jan 2022, Michael Schmitz wrote:
> >>> In fact Michael did so in "[PATCH v7 1/2] m68k/kernel - wire up
> >>> syscall_trace_enter/leave for m68k"[1], but that's still stuck...
> >>>
> >>> [1]
> >>> https://lore.kernel.org/r/[email protected]/
> >>
> >> That patch (for reasons I never found out) did interact badly with
> >> Christoph Hellwig's 'remove set_fs' patches (and Al's signal fixes which
> >> Christoph's patches are based upon). Caused format errors under memory
> >> stress tests quite reliably, on my 030 hardware.
> >>
> >
> > Those patches have since been merged, BTW.
>
> Yes, that's why I advised caution with mine.
>
> >
> >> Probably needs a fresh look - the signal return path got changed by Al's
> >> patches IIRC, and I might have relied on offsets to data on the stack
> >> that are no longer correct with these patches. Or there's a race between
> >> the syscall trap and signal handling when returning from interrupt
> >> context ...
> >>
> >> Still school hols over here so I won't have much peace and quiet until
> >> February.
> >>
> >
> > So the patch works okay with Aranym 68040 but not Motorola 68030? Since
>
> Correct - I seem to recall we also tested those on your 040 and there
> was no regression there, but I may be misremembering that.
>
> > there is at least one known issue affecting both Motorola 68030 and Hatari
> > 68030, perhaps this patch is not the problem. In any case, Al's suggestion
>
> I hadn't ever made that connection, but it might be another explanation,
> yes.
>
> > to split the patch into two may help in that testing two smaller patches
> > might narrow down the root cause.
>
> That's certainly true.
>
> What's the other reason these patches are still stuck, Geert? Did we
> ever settle the dispute about what return code ought to abort a syscall
> (in the seccomp context)?

IIRC, some (self)tests were still failing?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2022-01-12 08:05:44

by Michael Schmitz

[permalink] [raw]
Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

Hi Geert,

On 12.01.2022 at 20:55, Geert Uytterhoeven wrote:
> Hi Michael,
>
>> What's the other reason these patches are still stuck, Geert? Did we
>> ever settle the dispute about what return code ought to abort a syscall
>> (in the seccomp context)?
>
> IIRC, some (self)tests were still failing?

Too true - but I don't think my way of building the testsuite was
entirely according to the book. And I'm not sure I ran the testsuite
with more than one of the return code options. In all honesty, I had
been waiting for Adrian Glaubitz to test the patches with his seccomp
library port instead of relying on the testsuite.

Still, reason enough to split off the removal of syscall_trace() from
the seccomp stuff if it helps with Eric's patch series.

Cheers,

Michael


>
> Gr{oetje,eeting}s,
>
> Geert
>
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
>
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
> -- Linus Torvalds
>

2022-01-15 07:38:28

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Linus Torvalds <[email protected]> writes:

> On Tue, Jan 11, 2022 at 10:51 AM Eric W. Biederman
> <[email protected]> wrote:
>>
>> + while ((n == -ERESTARTSYS) && test_thread_flag(TIF_NOTIFY_SIGNAL)) {
>> + tracehook_notify_signal();
>> + n = __kernel_write(file, addr, nr, &pos);
>> + }
>
> This reads horribly wrongly to me.
>
> That "tracehook_notify_signal()" thing *has* to be renamed before we
> have anything like this that otherwise looks like "this will just loop
> forever".
>
> I'm pretty sure we've discussed that "tracehook" thing before - the
> whole header file is misnamed, and most of the functions in there are
> too.
>
> As an ugly alternative, open-code it, so that it's clear that "yup,
> that clears the TIF_NOTIFY_SIGNAL flag".

A cleaner alternative looks to be modifying the pipe code to use
wake_up_XXX instead of wake_up_interruptible_XXX and then having code
that does pipe_write_killable instead of pipe_write_interruptible.

There is also a question of how all of this should interact with the
freezer, as I think changing from interruptible to killable means that
the coredumps become unfreezable.

I am busily simmering this on my back burner and I hope I can come up
with something sensible.

Eric
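
The readability concern with the quoted retry loop can be illustrated with
a small userspace model. This is only a sketch under stated assumptions:
every sketch_* name is hypothetical, the flag is a plain bool standing in
for TIF_NOTIFY_SIGNAL, and the write helper is a stand-in for
__kernel_write(). It shows that the loop terminates only because the
helper called in its body clears the flag, which is why an explicit name
(or open-coding the clear) makes the loop easier to trust.

```c
#include <stdbool.h>
#include <stddef.h>

/* Same numeric value the kernel uses internally for ERESTARTSYS. */
#define SKETCH_ERESTARTSYS 512

/* Hypothetical stand-in for the per-thread TIF_NOTIFY_SIGNAL flag. */
static bool sketch_notify_signal_flag;

/* Hypothetical stand-in for __kernel_write(): fails while the flag is set. */
static long sketch_write(size_t nr)
{
	if (sketch_notify_signal_flag)
		return -SKETCH_ERESTARTSYS;
	return (long)nr;
}

/*
 * Hypothetical, explicitly named replacement for the helper in the loop.
 * The name says what the loop depends on: the flag is cleared here.
 * (The kernel helper would also run pending task work at this point.)
 */
static void sketch_clear_notify_signal(void)
{
	sketch_notify_signal_flag = false;
}

/* Model of the retry loop; *retries reports how often the write was redone. */
static long sketch_dump_emit(size_t nr, int *retries)
{
	long n = sketch_write(nr);

	*retries = 0;
	while (n == -SKETCH_ERESTARTSYS && sketch_notify_signal_flag) {
		sketch_clear_notify_signal();
		n = sketch_write(nr);
		(*retries)++;
	}
	return n;
}
```

If sketch_clear_notify_signal() did not clear the flag, the loop condition
would never change and the loop would spin forever, which is the reading
the original helper name invites.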

2022-01-16 16:22:37

by Olivier Langlois

[permalink] [raw]
Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

On Fri, 2022-01-14 at 18:12 -0600, Eric W. Biederman wrote:
> Linus Torvalds <[email protected]> writes:
>
> > On Tue, Jan 11, 2022 at 10:51 AM Eric W. Biederman
> > <[email protected]> wrote:
> > >
> > > + while ((n == -ERESTARTSYS) && test_thread_flag(TIF_NOTIFY_SIGNAL)) {
> > > + tracehook_notify_signal();
> > > + n = __kernel_write(file, addr, nr, &pos);
> > > + }
> >
> > This reads horribly wrongly to me.
> >
> > That "tracehook_notify_signal()" thing *has* to be renamed before we
> > have anything like this that otherwise looks like "this will just loop
> > forever".
> >
> > I'm pretty sure we've discussed that "tracehook" thing before - the
> > whole header file is misnamed, and most of the functions in there are
> > too.
> >
> > As an ugly alternative, open-code it, so that it's clear that "yup,
> > that clears the TIF_NOTIFY_SIGNAL flag".
>
> A cleaner alternative looks like to modify the pipe code to use
> wake_up_XXX instead of wake_up_interruptible_XXX and then have code
> that does pipe_write_killable instead of pipe_write_interruptible.

Do not forget that the problem might not be limited to the pipe FS as
Oleg Nesterov pointed out here:

https://lore.kernel.org/io-uring/[email protected]/

This is why I did like your patch fixing __dump_emit. If the only
problem is the unclear name of the tracehook_notify_signal() function,
that should be addressed instead of trying to fix the problem in a
different way.
>
> There is also a question of how all of this should interact with the
> freezer, as I think changing from interruptible to killable means
> that
> the coredumps became unfreezable.
>
> I am busily simmering this on my back burner and I hope I can come up
> with something sensible.

IMHO, fixing the problem on the emit function side has the merit of
being future proof if something other than io_uring were to raise the
TIF_NOTIFY_SIGNAL flag in the future.

But I am wondering why no one commented on my proposal of cancelling
io_uring before generating the core dump, thereby stopping it from
flipping TIF_NOTIFY_SIGNAL while the core dump is generated.

Is there something wrong with my proposed approach?
https://lore.kernel.org/lkml/[email protected]/

It has flawlessly created many dozens of io_uring app core dumps for me
over the last months...

Olivier


2022-01-17 16:59:41

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead

On Fri, Jan 07, 2022 at 12:59:33PM -0600, Eric W. Biederman wrote:
> Assuming it won't be too much longer before the rest of the arches have
> set_fs/get_fs removed it looks like it makes sense to leave the
> force_uaccess_begin where it is, and just let force_uaccess_begin be
> removed when set_fs/get_fs are removed from the tree.
>
> Christoph does it look like the set_fs/get_fs removal work is going
> to stall indefinitely on some architectures? If so I think we want to
> find a way to get kernel threads to run with set_fs(USER_DS) on the
> stalled architectures. Otherwise I think we have a real hazard of
> introducing bugs that will only show up on the stalled architectures.

I really need help from the arch maintainers to finish the set_fs
removal. There have been very few arch maintainers helping with that
work (arm, arm64, parisc, m68k) in addition to the ones I did because
I have the test setups and knowledge. I'll send out another ping, but
for necrotic architectures like ia64 and sh I have very little hope.

2022-01-18 02:23:33

by Heiko Carstens

[permalink] [raw]
Subject: Re: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead


On Mon, Jan 17, 2022 at 12:05:41AM -0800, Christoph Hellwig wrote:
> On Fri, Jan 07, 2022 at 12:59:33PM -0600, Eric W. Biederman wrote:
> > Assuming it won't be too much longer before the rest of the arches have
> > set_fs/get_fs removed it looks like it makes sense to leave the
> > force_uaccess_begin where it is, and just let force_uaccess_begin be
> > removed when set_fs/get_fs are removed from the tree.
> >
> > Christoph does it look like the set_fs/get_fs removal work is going
> > to stall indefinitely on some architectures? If so I think we want to
> > find a way to get kernel threads to run with set_fs(USER_DS) on the
> > stalled architectures. Otherwise I think we have a real hazard of
> > introducing bugs that will only show up on the stalled architectures.
>
> I really need help from the arch maintainers to finish the set_fs
> removal. There have been very few arch maintainers helping with that
> work (arm, arm64, parisc, m68k) in addition to the ones I did because
> I have the test setups and knowledge. I'll send out another ping,

Just in case you missed it: s390 was converted with commit 87d598634521
("s390/mm: remove set_fs / rework address space handling").

2022-01-18 02:26:46

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead

On Mon, Jan 17, 2022 at 01:15:43PM +0100, Heiko Carstens wrote:
> > I really need help from the arch maintainers to finish the set_fs
> > removal. There have been very few arch maintainers helping with that
> > work (arm, arm64, parisc, m68k) in addition to the ones I did because
> > I have the test setups and knowledge. I'll send out another ping,
>
> Just in case you missed it: s390 was converted with commit 87d598634521
> ("s390/mm: remove set_fs / rework address space handling").

Sorry, I forgot about s390, which, as so often, was a model citizen here!

2022-01-18 02:27:29

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead

On Mon, Jan 17, 2022 at 9:05 AM Christoph Hellwig <[email protected]> wrote:
>
> On Fri, Jan 07, 2022 at 12:59:33PM -0600, Eric W. Biederman wrote:
> > Assuming it won't be too much longer before the rest of the arches have
> > set_fs/get_fs removed it looks like it makes sense to leave the
> > force_uaccess_begin where it is, and just let force_uaccess_begin be
> > removed when set_fs/get_fs are removed from the tree.
> >
> > Christoph does it look like the set_fs/get_fs removal work is going
> > to stall indefinitely on some architectures? If so I think we want to
> > find a way to get kernel threads to run with set_fs(USER_DS) on the
> > stalled architectures. Otherwise I think we have a real hazard of
> > introducing bugs that will only show up on the stalled architectures.
>
> I really need help from the arch maintainers to finish the set_fs
> removal. There have been very few arch maintainers helping with that
> work (arm, arm64, parisc, m68k) in addition to the ones I did because
> I have the test setups and knowledge. I'll send out another ping,
> for necrotic architectures like ia64 and sh I have very little hope.

I did a conversion of microblaze for fun at some point, and I think I never
sent that out. I haven't tested it, but if this looks correct to you and
Michal, it could serve as a model for other trivial conversions.

I also looked into converting ia64 and sh at the same time, but I can't
find those patches now, so I think they were never complete.

Arnd

2022-01-18 02:27:46

by Arnd Bergmann

[permalink] [raw]
Subject: [PATCH] microblaze: remove CONFIG_SET_FS

From: Arnd Bergmann <[email protected]>

I picked microblaze as one of the architectures that still
use set_fs() and converted it not to.

Link: https://lore.kernel.org/lkml/[email protected]om/
Signed-off-by: Arnd Bergmann <[email protected]>
---
This is an old patch I found after Christoph asked about
conversions for the remaining architectures. I have no idea
about the state of this patch, but there is a reasonable
chance that it works.
---
arch/microblaze/Kconfig | 1 -
arch/microblaze/include/asm/thread_info.h | 6 ---
arch/microblaze/include/asm/uaccess.h | 56 ++++++++++-------------
arch/microblaze/kernel/asm-offsets.c | 1 -
4 files changed, 25 insertions(+), 39 deletions(-)

diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index 59798e43cdb0..1fb1cec087b7 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -42,7 +42,6 @@ config MICROBLAZE
select CPU_NO_EFFICIENT_FFS
select MMU_GATHER_NO_RANGE
select SPARSE_IRQ
- select SET_FS
select ZONE_DMA
select TRACE_IRQFLAGS_SUPPORT

diff --git a/arch/microblaze/include/asm/thread_info.h b/arch/microblaze/include/asm/thread_info.h
index 44f5ca331862..a0ddd2a36fb9 100644
--- a/arch/microblaze/include/asm/thread_info.h
+++ b/arch/microblaze/include/asm/thread_info.h
@@ -56,17 +56,12 @@ struct cpu_context {
__u32 fsr;
};

-typedef struct {
- unsigned long seg;
-} mm_segment_t;
-
struct thread_info {
struct task_struct *task; /* main task structure */
unsigned long flags; /* low level flags */
unsigned long status; /* thread-synchronous flags */
__u32 cpu; /* current CPU */
__s32 preempt_count; /* 0 => preemptable,< 0 => BUG*/
- mm_segment_t addr_limit; /* thread address space */

struct cpu_context cpu_context;
};
@@ -80,7 +75,6 @@ struct thread_info {
.flags = 0, \
.cpu = 0, \
.preempt_count = INIT_PREEMPT_COUNT, \
- .addr_limit = KERNEL_DS, \
}

/* how to get the thread information struct from C */
diff --git a/arch/microblaze/include/asm/uaccess.h b/arch/microblaze/include/asm/uaccess.h
index d2a8ef9f8978..346fe4618b27 100644
--- a/arch/microblaze/include/asm/uaccess.h
+++ b/arch/microblaze/include/asm/uaccess.h
@@ -16,45 +16,20 @@
#include <asm/extable.h>
#include <linux/string.h>

-/*
- * On Microblaze the fs value is actually the top of the corresponding
- * address space.
- *
- * The fs value determines whether argument validity checking should be
- * performed or not. If get_fs() == USER_DS, checking is performed, with
- * get_fs() == KERNEL_DS, checking is bypassed.
- *
- * For historical reasons, these macros are grossly misnamed.
- *
- * For non-MMU arch like Microblaze, KERNEL_DS and USER_DS is equal.
- */
-# define MAKE_MM_SEG(s) ((mm_segment_t) { (s) })
-
-# define KERNEL_DS MAKE_MM_SEG(0xFFFFFFFF)
-# define USER_DS MAKE_MM_SEG(TASK_SIZE - 1)
-
-# define get_fs() (current_thread_info()->addr_limit)
-# define set_fs(val) (current_thread_info()->addr_limit = (val))
-# define user_addr_max() get_fs().seg
-
-# define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg)
-
static inline int access_ok(const void __user *addr, unsigned long size)
{
if (!size)
goto ok;

- if ((get_fs().seg < ((unsigned long)addr)) ||
- (get_fs().seg < ((unsigned long)addr + size - 1))) {
- pr_devel("ACCESS fail at 0x%08x (size 0x%x), seg 0x%08x\n",
- (__force u32)addr, (u32)size,
- (u32)get_fs().seg);
+ if ((((unsigned long)addr) > TASK_SIZE) ||
+ (((unsigned long)addr + size - 1) > TASK_SIZE)) {
+ pr_devel("ACCESS fail at 0x%08x (size 0x%x)\n",
+ (__force u32)addr, (u32)size);
return 0;
}
ok:
- pr_devel("ACCESS OK at 0x%08x (size 0x%x), seg 0x%08x\n",
- (__force u32)addr, (u32)size,
- (u32)get_fs().seg);
+ pr_devel("ACCESS OK at 0x%08x (size 0x%x)\n",
+ (__force u32)addr, (u32)size);
return 1;
}

@@ -280,6 +255,25 @@ extern long __user_bad(void);
__gu_err; \
})

+#define __get_kernel_nofault(dst, src, type, label) \
+{ \
+ type __user *p = (type __force __user *)(src); \
+ type data; \
+ if (__get_user(data, p)) \
+ goto label; \
+ *(type *)dst = data; \
+}
+
+#define __put_kernel_nofault(dst, src, type, label) \
+{ \
+ type __user *p = (type __force __user *)(dst); \
+ type data = *(type *)src; \
+ if (__put_user(data, p)) \
+ goto label; \
+}
+
+#define HAVE_GET_KERNEL_NOFAULT
+
static inline unsigned long
raw_copy_from_user(void *to, const void __user *from, unsigned long n)
{
diff --git a/arch/microblaze/kernel/asm-offsets.c b/arch/microblaze/kernel/asm-offsets.c
index b77dd188dec4..47ee409508b1 100644
--- a/arch/microblaze/kernel/asm-offsets.c
+++ b/arch/microblaze/kernel/asm-offsets.c
@@ -86,7 +86,6 @@ int main(int argc, char *argv[])
/* struct thread_info */
DEFINE(TI_TASK, offsetof(struct thread_info, task));
DEFINE(TI_FLAGS, offsetof(struct thread_info, flags));
- DEFINE(TI_ADDR_LIMIT, offsetof(struct thread_info, addr_limit));
DEFINE(TI_CPU_CONTEXT, offsetof(struct thread_info, cpu_context));
DEFINE(TI_PREEMPT_COUNT, offsetof(struct thread_info, preempt_count));
BLANK();
--
2.29.2
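
For readers following the access_ok() hunk, here is a minimal userspace
sketch of the same kind of range check. It is not the patch's code:
SKETCH_TASK_SIZE is a hypothetical limit (the real TASK_SIZE is defined
per architecture), and the sketch deliberately uses the stricter bound
addr + size <= limit and avoids computing addr + size directly, since
that sum can wrap around in 32-bit arithmetic. Both choices are
assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user address-space limit, chosen only for this sketch. */
#define SKETCH_TASK_SIZE UINT32_C(0xC0000000)

/*
 * Return true if [addr, addr + size) lies below SKETCH_TASK_SIZE.
 * The comparison is arranged so that no intermediate value can wrap.
 */
static bool sketch_access_ok(uint32_t addr, uint32_t size)
{
	if (size == 0)
		return true;	/* mirrors the "if (!size) goto ok" shortcut */
	if (addr > SKETCH_TASK_SIZE)
		return false;
	/* addr <= limit here, so the subtraction cannot underflow. */
	return size <= SKETCH_TASK_SIZE - addr;
}
```

With a naive "addr + size - 1 > limit" test, a call such as
sketch_access_ok(0x1000, 0xFFFFFFFF) would wrap to a small address and
pass; the subtraction form rejects it.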

2022-01-18 02:37:00

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Olivier Langlois <[email protected]> writes:

> On Fri, 2022-01-14 at 18:12 -0600, Eric W. Biederman wrote:
>> Linus Torvalds <[email protected]> writes:
>>
>> > On Tue, Jan 11, 2022 at 10:51 AM Eric W. Biederman
>> > <[email protected]> wrote:
>> > >
>> > > +       while ((n == -ERESTARTSYS) &&
>> > > test_thread_flag(TIF_NOTIFY_SIGNAL)) {
>> > > +               tracehook_notify_signal();
>> > > +               n = __kernel_write(file, addr, nr, &pos);
>> > > +       }
>> >
>> > This reads horribly wrongly to me.
>> >
>> > That "tracehook_notify_signal()" thing *has* to be renamed before
>> > we
>> > have anything like this that otherwise looks like "this will just
>> > loop
>> > forever".
>> >
>> > I'm pretty sure we've discussed that "tracehook" thing before - the
>> > whole header file is misnamed, and most of the functions in there
>> > are
>> > too.
>> >
>> > As an ugly alternative, open-code it, so that it's clear that "yup,
>> > that clears the TIF_NOTIFY_SIGNAL flag".
>>
>> A cleaner alternative looks like to modify the pipe code to use
>> wake_up_XXX instead of wake_up_interruptible_XXX and then have code
>> that does pipe_write_killable instead of pipe_write_interruptible.
>
> Do not forget that the problem might not be limited to the pipe FS as
> Oleg Nesterov pointed out here:
>
> https://lore.kernel.org/io-uring/[email protected]/
>
> This is why I did like your patch fixing __dump_emit. If the only
> problem is the tracehook_notify_signal() function unclear name, that
> should be addressed instead of trying to fix the problem in a different
> way.

It might be that the fix is to run a portion of the exit_to_userspace
loop that does:

if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
handle_signal_work(regs, ti_work);

I am deep in brainstorm mode trying to find something that comes out
clean.

Oleg is right that, while sleeps in filesystems need to be
uninterruptible to be POSIX compliant and otherwise compatible with
traditional unix behavior, NFS has not always provided that compatibility.


>> There is also a question of how all of this should interact with the
>> freezer, as I think changing from interruptible to killable means
>> that
>> the coredumps became unfreezable.
>>
>> I am busily simmering this on my back burner and I hope I can come up
>> with something sensible.
>
> IMHO, fixing the problem on the emit function side has the merit of
> being future proof if something else than io_uring in the future would
> raise the TIF_NOTIFY_SIGNAL flag
>
> but I am wondering why no one commented anything about my proposal of
> cancelling io_uring before generating the core dump therefore stopping
> it to flip TIF_NOTIFY_SIGNAL while the core dump is generated.
>
> Is there something wrong with my proposed approach?
> https://lore.kernel.org/lkml/[email protected]/
>
> It did flawlessly created many dozens of io_uring app core dumps in the
> last months for me...

From my perspective I am not at all convinced that io_uring is the only
culprit.

Beyond that the purpose of a coredump is to snapshot the process as it
is, before anything is shut down, so that someone can examine the coredump
and figure out what failed. Running around changing the state of the
process has a very real chance of hiding what is going wrong.

Further your change requires that there be a place for io_uring to clean
things up. Given that fundamentally that seems like the wrong thing to
me I am not interested in making it easy to do what looks like the wrong
thing.

All of this may be perfection being the enemy of the good (especially as
your io_uring magic happens as a special case in do_coredump). My work
in this area is to remove hacks so I can be convinced the code works
100% of the time so unfortunately I am not interested in picking up a
change that is only good enough. Someone else like Andrew Morton might
be.


None of that changes the fact that tracehook_notify_signal needs to be
renamed. That affects your approach and my proof of concept approach.
So renaming tracehook_notify_signal just needs to be done.

Eric

2022-01-18 02:58:10

by Eric W. Biederman

[permalink] [raw]
Subject: io_uring truncating coredumps


Subject updated to reflect the current discussion.

> Linus Torvalds <[email protected]> writes:

> But I really think it's wrong.
>
> You're trying to work around a problem the wrong way around. If a task
> is dead, and is dumping core, then signals just shouldn't matter in
> the first place, and thus the whole "TASK_INTERRUPTIBLE vs
> TASK_UNINTERRUPTIBLE" really shouldn't be an issue. The fact that it
> is an issue means there's something wrong in signaling, not in the
> pipe code.
>
> So I really think that's where the fix should be - on the signal delivery side.

Thinking about it from the perspective of not delivering the wake-ups,
fixing io_uring and coredumps in a non-hacky way looks comparatively
simple. The function task_work_add just needs to not wake anything up
after a process has started dying.

Something like the patch below.

The only tricky part I can see is making certain there are not any races
between task_work_add and do_coredump depending on task_work_add not
causing signal_pending to return true.

diff --git a/kernel/task_work.c b/kernel/task_work.c
index fad745c59234..5f941e377268 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -44,6 +44,9 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
work->next = head;
} while (cmpxchg(&task->task_works, head, work) != head);

+ if (notify && (task->signal->flags & SIGNAL_GROUP_EXIT))
+ return 0;
+
switch (notify) {
case TWA_NONE:
break;

Eric

2022-01-19 19:05:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: io_uring truncating coredumps

On Mon, Jan 17, 2022 at 8:47 PM Eric W. Biederman <[email protected]> wrote:
>
> Thinking about it from the perspective of not delivering the wake-ups
> fixing io_uring and coredumps in a non-hacky way looks comparatively
> simple. The function task_work_add just needs to not wake anything up
> after a process has started dying.
>
> Something like the patch below.

Hmm. Yes, I think this is the right direction.

That said, I think it should not add the work at all, and return
-ESRCH, the exact same way that it does for that work_exited
condition.

Because it's basically the same thing: the task is dead and shouldn't
do more work. In fact, task_work_run() is the thing that sets it to
&work_exited as it sees PF_EXITING, so it feels to me that THAT is
actually the issue here - we react to PF_EXITING too late. We react to
it *after* we've already added the work, and then we do that "no more
work" logic only after we've accepted those late work entries?

So my gut feel is that task_work_add() should just also test PF_EXITING.

And in fact, my gut feel is that PF_EXITING is too late anyway (it
happens after core-dumping, no?)

But I guess that thing may be on purpose, and maybe the act of dumping
core itself wants to do more work, and so that isn't an option?

So I don't think your patch is "right" as-is, and it all worries me,
but yes, I think this area is very much the questionable one.

I think that work stopping and the io_uring shutdown should probably
move earlier in the exit queue, but as mentioned above, maybe the work
addition boundary in particular really wants to be late because the
exit process itself still uses task works? ;(

Linus
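
The alternative suggested above (refuse the work outright instead of
queuing it silently) can be modelled in userspace. This is a sketch under
stated assumptions, not kernel code: the sketch_* types and fields are
hypothetical, the "exiting" bool stands in for whichever condition
(PF_EXITING or SIGNAL_GROUP_EXIT) ends up being tested, and the list
update is single-threaded, whereas the real task_work_add() uses cmpxchg
and has to deal with the races mentioned earlier in the thread.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct callback_head. */
struct sketch_work {
	struct sketch_work *next;
};

/* Hypothetical stand-in for the parts of task_struct that matter here. */
struct sketch_task {
	bool exiting;			/* task has started dying */
	struct sketch_work *works;	/* pending work list */
	int wakeups;			/* notifications delivered so far */
};

/*
 * Queue work for a task. A dying task gets neither the work entry nor a
 * wake-up, and the caller is told so via -ESRCH.
 */
static int sketch_task_work_add(struct sketch_task *task,
				struct sketch_work *work, bool notify)
{
	if (task->exiting)
		return -ESRCH;

	work->next = task->works;
	task->works = work;

	if (notify)
		task->wakeups++;
	return 0;
}
```

The difference from the posted patch is that the caller learns the work
was rejected and can clean up, rather than the work sitting on the list
of a task that will never be woken to run it.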

2022-01-20 20:28:14

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

11.01.2022 20:20, Eric W. Biederman wrote:
> Dmitry Osipenko <[email protected]> writes:
>
>> 08.01.2022 21:13, Eric W. Biederman wrote:
>>> Dmitry Osipenko <[email protected]> writes:
>>>
>>>> 05.01.2022 22:58, Eric W. Biederman wrote:
>>>>>
>>>>> I have not yet been able to figure out how to run gst-plugin-scanner in
>>>>> a way that triggers this yet. In truth I can't figure out how to
>>>>> run gst-plugin-scanner in a useful way.
>>>>>
>>>>> I am going to set up some unit tests and see if I can reproduce your
>>>>> hang another way, but if you could give me some more information on what
>>>>> you are doing to trigger this I would appreciate it.
>>>>
>>>> Thanks, Eric. The distro is Arch Linux, but it's a development
>>>> environment where I'm running latest GStreamer from git master. I'll try
>>>> to figure out the reproduction steps and get back to you.
>>>
>>> Thank you.
>>>
>>> Until I can figure out why this is causing problems I have dropped the
>>> following two patches from my queue:
>>> signal: Make SIGKILL during coredumps an explicit special case
>>> signal: Drop signals received after a fatal signal has been processed
>>>
>>> I have replaced them with the following two patches that just do what
>>> is needed for the rest of the code in the series:
>>> signal: Have prepare_signal detect coredumps using
>>> signal: Make coredump handling explicit in complete_signal
>>>
>>> Perversely my failure to change the SIGKILL handling when coredumps are
>>> happening proves to me that I need to change the SIGKILL handling when
>>> coredumps are happening to make the code more maintainable.
>>
>> Eric, thank you again. I started to look at the reproduction steps and
>> haven't completed it yet. Turned out the problem affects only older
>> NVIDIA Tegra2 Cortex-A9 CPU that lacks support of ARM NEON instructions
>> set, hence the problem isn't visible on x86 and other CPUs out of the
>> box. I'll need to check whether the problem could be simulated on all
>> arches or maybe it's specific to VFP exception handling of ARM32.
>
> It sounds like the gstreamer plugins only fail on certain hardware on
> arm32, and things don't hang in coredumps unless the plugins fail.
> That does make things tricky to minimize.
>
> I have just verified that the known problematic code is not
> in linux-next for Jan 11 2022.
>
> If folks as they have time can double check linux-next and verify all is
> well I would appreciate it. I don't expect that there are problems but
> sometimes one problem hides another.

Hello Eric,

I reproduced the trouble on x86_64.

Here are the reproduction steps, using ArchLinux and linux-next-20211224:

```
sudo pacman -S base-devel git mesa glu meson wget
git clone https://github.com/grate-driver/gstreamer.git
cd gstreamer
git checkout sigill
meson --prefix=/usr -Dgst-plugins-base:playback=enabled -Dgst-devtools:validate=disabled build
cd build
sudo ninja install
wget https://www.peach.themazzone.com/big_buck_bunny_720p_h264.mov
rm -r ~/.cache/gstreamer-1.0
gst-play-1.0 ./big_buck_bunny_720p_h264.mov
```

The SIGILL, thrown by [1], causes the hang. There is no hang using v5.16.1 kernel.

[1] https://github.com/grate-driver/gstreamer/commit/006f9a2ee6dcf7b31c9b5413815d6054d82a3b2f

2022-01-20 21:24:05

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Dmitry Osipenko <[email protected]> writes:

> 11.01.2022 20:20, Eric W. Biederman wrote:
>> Dmitry Osipenko <[email protected]> writes:
>>
>>> 08.01.2022 21:13, Eric W. Biederman wrote:
>>>> Dmitry Osipenko <[email protected]> writes:
>>>>
>>>>> 05.01.2022 22:58, Eric W. Biederman wrote:
>>>>>>
>>>>>> I have not yet been able to figure out how to run gst-plugin-scanner in
>>>>>> a way that triggers this yet. In truth I can't figure out how to
>>>>>> run gst-plugin-scanner in a useful way.
>>>>>>
>>>>>> I am going to set up some unit tests and see if I can reproduce your
>>>>>> hang another way, but if you could give me some more information on what
>>>>>> you are doing to trigger this I would appreciate it.
>>>>>
>>>>> Thanks, Eric. The distro is Arch Linux, but it's a development
>>>>> environment where I'm running latest GStreamer from git master. I'll try
>>>>> to figure out the reproduction steps and get back to you.
>>>>
>>>> Thank you.
>>>>
>>>> Until I can figure out why this is causing problems I have dropped the
>>>> following two patches from my queue:
>>>> signal: Make SIGKILL during coredumps an explicit special case
>>>> signal: Drop signals received after a fatal signal has been processed
>>>>
>>>> I have replaced them with the following two patches that just do what
>>>> is needed for the rest of the code in the series:
>>>> signal: Have prepare_signal detect coredumps using
>>>> signal: Make coredump handling explicit in complete_signal
>>>>
>>>> Perversely my failure to change the SIGKILL handling when coredumps are
>>>> happening proves to me that I need to change the SIGKILL handling when
>>>> coredumps are happening to make the code more maintainable.
>>>
>>> Eric, thank you again. I started to look at the reproduction steps and
>>> haven't completed it yet. Turned out the problem affects only older
>>> NVIDIA Tegra2 Cortex-A9 CPU that lacks support of ARM NEON instructions
>>> set, hence the problem isn't visible on x86 and other CPUs out of the
>>> box. I'll need to check whether the problem could be simulated on all
>>> arches or maybe it's specific to VFP exception handling of ARM32.
>>
>> It sounds like the gstreamer plugins only fail on certain hardware on
>> arm32, and things don't hang in coredumps unless the plugins fail.
>> That does make things tricky to minimize.
>>
>> I have just verified that the known problematic code is not
>> in linux-next for Jan 11 2022.
>>
>> If folks as they have time can double check linux-next and verify all is
>> well I would appreciate it. I don't expect that there are problems but
>> sometimes one problem hides another.
>
> Hello Eric,
>
> I reproduced the trouble on x86_64.
>
> Here are the reproduction steps, using ArchLinux and linux-next-20211224:
>
> ```
> sudo pacman -S base-devel git mesa glu meson wget
> git clone https://github.com/grate-driver/gstreamer.git
> cd gstreamer
> git checkout sigill
> meson --prefix=/usr -Dgst-plugins-base:playback=enabled -Dgst-devtools:validate=disabled build
> cd build
> sudo ninja install
> wget https://www.peach.themazzone.com/big_buck_bunny_720p_h264.mov
> rm -r ~/.cache/gstreamer-1.0
> gst-play-1.0 ./big_buck_bunny_720p_h264.mov
> ```
>
> The SIGILL, thrown by [1], causes the hang. There is no hang using v5.16.1 kernel.
>
> [1] https://github.com/grate-driver/gstreamer/commit/006f9a2ee6dcf7b31c9b5413815d6054d82a3b2f

Thank you.

I will verify this works before I add my updated version to
my signal-for-v5.18 branch.

Have you by any chance tried a newer version of linux-next without
commit fbc11520b58a ("signal: Make SIGKILL during coredumps an explicit
special case") in it?

If not I will double check that my pulling the commit out does not break
in the case you have documented.

Eric

2022-01-20 21:26:15

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

18.01.2022 20:52, Eric W. Biederman wrote:
> Dmitry Osipenko <[email protected]> writes:
>
>> 11.01.2022 20:20, Eric W. Biederman wrote:
>>> Dmitry Osipenko <[email protected]> writes:
>>>
>>>> 08.01.2022 21:13, Eric W. Biederman wrote:
>>>>> Dmitry Osipenko <[email protected]> writes:
>>>>>
>>>>>> 05.01.2022 22:58, Eric W. Biederman wrote:
>>>>>>>
>>>>>>> I have not yet been able to figure out how to run gst-plugin-scanner in
>>>>>>> a way that triggers this yet. In truth I can't figure out how to
>>>>>>> run gst-plugin-scanner in a useful way.
>>>>>>>
>>>>>>> I am going to set up some unit tests and see if I can reproduce your
>>>>>>> hang another way, but if you could give me some more information on what
>>>>>>> you are doing to trigger this I would appreciate it.
>>>>>>
>>>>>> Thanks, Eric. The distro is Arch Linux, but it's a development
>>>>>> environment where I'm running latest GStreamer from git master. I'll try
>>>>>> to figure out the reproduction steps and get back to you.
>>>>>
>>>>> Thank you.
>>>>>
>>>>> Until I can figure out why this is causing problems I have dropped the
>>>>> following two patches from my queue:
>>>>> signal: Make SIGKILL during coredumps an explicit special case
>>>>> signal: Drop signals received after a fatal signal has been processed
>>>>>
>>>>> I have replaced them with the following two patches that just do what
>>>>> is needed for the rest of the code in the series:
>>>>> signal: Have prepare_signal detect coredumps using
>>>>> signal: Make coredump handling explicit in complete_signal
>>>>>
>>>>> Perversely my failure to change the SIGKILL handling when coredumps are
>>>>> happening proves to me that I need to change the SIGKILL handling when
>>>>> coredumps are happening to make the code more maintainable.
>>>>
>>>> Eric, thank you again. I started to look at the reproduction steps and
>>>> haven't completed it yet. Turned out the problem affects only older
>>>> NVIDIA Tegra2 Cortex-A9 CPU that lacks support of ARM NEON instructions
>>>> set, hence the problem isn't visible on x86 and other CPUs out of the
>>>> box. I'll need to check whether the problem could be simulated on all
>>>> arches or maybe it's specific to VFP exception handling of ARM32.
>>>
>>> It sounds like the gstreamer plugins only fail on certain hardware on
>>> arm32, and things don't hang in coredumps unless the plugins fail.
>>> That does make things tricky to minimize.
>>>
>>> I have just verified that the known problematic code is not
>>> in linux-next for Jan 11 2022.
>>>
>>> If folks as they have time can double check linux-next and verify all is
>>> well I would appreciate it. I don't expect that there are problems but
>>> sometimes one problem hides another.
>>
>> Hello Eric,
>>
>> I reproduced the trouble on x86_64.
>>
>> Here are the reproduction steps, using ArchLinux and linux-next-20211224:
>>
>> ```
>> sudo pacman -S base-devel git mesa glu meson wget
>> git clone https://github.com/grate-driver/gstreamer.git
>> cd gstreamer
>> git checkout sigill
>> meson --prefix=/usr -Dgst-plugins-base:playback=enabled -Dgst-devtools:validate=disabled build
>> cd build
>> sudo ninja install
>> wget https://www.peach.themazzone.com/big_buck_bunny_720p_h264.mov
>> rm -r ~/.cache/gstreamer-1.0
>> gst-play-1.0 ./big_buck_bunny_720p_h264.mov
>> ```
>>
>> The SIGILL, thrown by [1], causes the hang. There is no hang using v5.16.1 kernel.
>>
>> [1] https://github.com/grate-driver/gstreamer/commit/006f9a2ee6dcf7b31c9b5413815d6054d82a3b2f
>
> Thank you.
>
> I will verify this works before I add my updated version to
> my signal-for-v5.18 branch.
>
> Have you by any chance tried a newer version of linux-next without
> commit fbc11520b58a ("signal: Make SIGKILL during coredumps an explicit
> special case") in it?
>
> If not I will double check that my pulling the commit out does not break
> in the case you have documented.

Recent linux-next works fine.