LinuxLists.cc - [PATCH 00/10] Removal of most do_exit calls

2021-12-08 20:18:13

by Eric W. Biederman

Subject: [PATCH 00/10] Removal of most do_exit calls

We have a lot of calls to do_exit that really don't want the semantics
of userspace calling pthread_exit, aka exit(2). Instead the interesting
semantics are those of the current task exiting.

This set of changes removes a dead reference to do_exit on s390,
adds a function make_task_dead and changes all of the oops
implementations to use it, and adds function kthread_exit and
changes all of the kthread exits to use it.

The short-term win of this set of changes is being able to move many
sanity checks that are only really interesting during an oops out of
do_exit, making it easier to see what do_exit is actually doing.

After this set of changes there is only about a big screenful of
do_exit calls left, making future changes much easier to review.

s390 folks. Can you please verify I read the s390 code correctly when
observing the reference to do_exit really is dead? I would really
appreciate it. I am not very familiar with s390.

This is on top of:
https://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git/ signal-for-v5.17

My plan is to apply these changes to my signal-for-v5.17 branch once
they have been reviewed. After that I can get to cleaning up where
signals, coredumps and the exit code meet.

Eric W. Biederman (10):
exit/s390: Remove dead reference to do_exit from copy_thread
exit: Add and use make_task_dead.
exit: Move oops specific logic from do_exit into make_task_dead
exit: Stop poorly open coding do_task_dead in make_task_dead
exit: Stop exporting do_exit
exit: Implement kthread_exit
exit: Rename module_put_and_exit to module_put_and_kthread_exit
exit: Rename complete_and_exit to kthread_complete_and_exit
kthread: Ensure struct kthread is present for all kthreads
exit/kthread: Move the exit code for kernel threads into struct kthread

arch/alpha/kernel/traps.c | 6 +-
arch/alpha/mm/fault.c | 2 +-
arch/arm/kernel/traps.c | 2 +-
arch/arm/mm/fault.c | 2 +-
arch/arm64/kernel/traps.c | 2 +-
arch/arm64/mm/fault.c | 2 +-
arch/csky/abiv1/alignment.c | 2 +-
arch/csky/kernel/traps.c | 2 +-
arch/csky/mm/fault.c | 2 +-
arch/h8300/kernel/traps.c | 2 +-
arch/h8300/mm/fault.c | 2 +-
arch/hexagon/kernel/traps.c | 2 +-
arch/ia64/kernel/mca_drv.c | 2 +-
arch/ia64/kernel/traps.c | 2 +-
arch/ia64/mm/fault.c | 2 +-
arch/m68k/kernel/traps.c | 2 +-
arch/m68k/mm/fault.c | 2 +-
arch/microblaze/kernel/exceptions.c | 4 +-
arch/mips/kernel/traps.c | 2 +-
arch/nds32/kernel/fpu.c | 2 +-
arch/nds32/kernel/traps.c | 8 +--
arch/nios2/kernel/traps.c | 4 +-
arch/openrisc/kernel/traps.c | 2 +-
arch/parisc/kernel/traps.c | 2 +-
arch/powerpc/kernel/traps.c | 8 +--
arch/riscv/kernel/traps.c | 2 +-
arch/riscv/mm/fault.c | 2 +-
arch/s390/kernel/dumpstack.c | 2 +-
arch/s390/kernel/nmi.c | 2 +-
arch/s390/kernel/process.c | 1 -
arch/sh/kernel/traps.c | 2 +-
arch/sparc/kernel/traps_32.c | 4 +-
arch/sparc/kernel/traps_64.c | 4 +-
arch/x86/entry/entry_32.S | 6 +-
arch/x86/entry/entry_64.S | 6 +-
arch/x86/kernel/dumpstack.c | 4 +-
arch/xtensa/kernel/traps.c | 2 +-
crypto/algboss.c | 4 +-
drivers/net/wireless/rsi/rsi_91x_coex.c | 2 +-
drivers/net/wireless/rsi/rsi_91x_main.c | 2 +-
drivers/net/wireless/rsi/rsi_91x_sdio_ops.c | 2 +-
drivers/net/wireless/rsi/rsi_91x_usb_ops.c | 2 +-
drivers/pnp/pnpbios/core.c | 6 +-
drivers/staging/rts5208/rtsx.c | 16 ++---
drivers/usb/atm/usbatm.c | 2 +-
drivers/usb/gadget/function/f_mass_storage.c | 2 +-
fs/cifs/connect.c | 2 +-
fs/exec.c | 2 +
fs/jffs2/background.c | 2 +-
fs/nfs/callback.c | 4 +-
fs/nfs/nfs4state.c | 2 +-
fs/nfsd/nfssvc.c | 2 +-
include/linux/kernel.h | 1 -
include/linux/kthread.h | 4 +-
include/linux/module.h | 6 +-
include/linux/sched/task.h | 1 +
kernel/exit.c | 88 ++++++++++++++--------------
kernel/fork.c | 4 ++
kernel/futex/core.c | 2 +-
kernel/kexec_core.c | 2 +-
kernel/kthread.c | 78 +++++++++++++++++-------
kernel/module.c | 6 +-
kernel/sched/core.c | 16 ++---
lib/kunit/try-catch.c | 4 +-
net/bluetooth/bnep/core.c | 2 +-
net/bluetooth/cmtp/core.c | 2 +-
net/bluetooth/hidp/core.c | 2 +-
tools/objtool/check.c | 8 ++-
68 files changed, 212 insertions(+), 173 deletions(-)

Eric

2021-12-08 20:26:17

by Eric W. Biederman

Subject: [PATCH 01/10] exit/s390: Remove dead reference to do_exit from copy_thread

My s390 assembly is not particularly good, so I have read the history
of the reference to do_exit in copy_thread and have been able to
verify that do_exit is not used.

The general argument is that s390 has been changed to use the generic
kernel_thread and kernel_execve and the generic versions do not call
do_exit. So it is strange to see a do_exit reference sitting there.

The history of the do_exit reference in s390's version of copy_thread
seems conclusive that the do_exit reference is something that lingers
and should have been removed several years ago.

Up through 8d19f15a60be ("[PATCH] s390 update (1/27): arch.") the
s390 code made a call to the exit(2) system call when a kernel thread
finished. Then kernel_thread_starter was added which branched
directly to the value in register 11 when the kernel thread finished.
The value in register 11 was set in kernel_thread to
"regs.gprs[11] = (unsigned long) do_exit"

In commit 37fe5d41f640 ("s390: fold kernel_thread_helper() into
ret_from_fork()") kernel_thread_starter was moved into entry.S and
entry64.S unchanged (except for the syntax differences between inline
assembly and the assembly file).

In commit f9a7e025dfc2 ("s390: switch to generic kernel_thread()") the
assignment to "gprs[11]" was moved into copy_thread from the old
kernel_thread. The helper kernel_thread_starter was still being used
and was still branching to "%r11" at the end.

In commit 30dcb0996e40 ("s390: switch to saner kernel_execve()
semantics") kernel_thread_starter was changed to unconditionally
branch to sysc_tracenogo instead of to %r11, which held the value of
do_exit. Unfortunately copy_thread was not updated to stop passing
do_exit in "gprs[11]".

In commit 56e62a737028 ("s390: convert to generic entry")
kernel_thread_starter was replaced by __ret_from_fork. And the code
still continued to pass do_exit in "gprs[11]" despite __ret_from_fork
not caring in the slightest.

Remove this dead reference to do_exit to make it clear that s390 is
not doing anything with do_exit in copy_thread.

Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Al Viro <[email protected]>
Fixes: 30dcb0996e40 ("s390: switch to saner kernel_execve() semantics")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
arch/s390/kernel/process.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index e8858b2de24b..71d86f73b02c 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -139,7 +139,6 @@ int copy_thread(unsigned long clone_flags, unsigned long new_stackp,
(unsigned long)__ret_from_fork;
frame->childregs.gprs[9] = new_stackp; /* function */
frame->childregs.gprs[10] = arg;
- frame->childregs.gprs[11] = (unsigned long)do_exit;
frame->childregs.orig_gpr2 = -1;
frame->childregs.last_break = 1;
return 0;
--
2.29.2

2021-12-08 20:26:21

by Eric W. Biederman

Subject: [PATCH 02/10] exit: Add and use make_task_dead.

There are two big uses of do_exit. The first is its designed use as
the guts of the exit(2) system call. The second use is to terminate
a task after something catastrophic has happened, like a NULL pointer
dereference in kernel code.

Add a function make_task_dead that is initially exactly the same as
do_exit to cover the cases where do_exit is called to handle
catastrophic failure. In time this can probably be reduced to just a
light wrapper around do_task_dead. For now keep it exactly the same so
that there will be no behavioral differences introducing this new
concept.

Replace all of the uses of do_exit that use it for catastrophic
task cleanup with make_task_dead to make it clear what the code
is doing.
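
As a rough illustration of the split described above, here is a minimal
user-space sketch, not kernel code. Every name in it (exit_record,
model_do_exit, model_sys_exit, model_make_task_dead) is hypothetical. The
reason field exists only so the model can show which entry point was used;
the patch itself records no such thing. The one detail taken from the
kernel is that the exit(2) path hands do_exit the status shifted into the
high byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
enum exit_reason {
	REASON_NONE,
	REASON_EXIT_SYSCALL,	/* the task asked to exit */
	REASON_CATASTROPHE,	/* an oops-like failure killed the task */
};

struct exit_record {
	enum exit_reason reason;
	long code;
};

/* Stand-in for do_exit(): the shared teardown, here it only stores the code. */
void model_do_exit(struct exit_record *rec, long code)
{
	assert(rec != NULL);
	rec->code = code;
}

/* Stand-in for the exit(2) entry point: status goes into the high byte. */
void model_sys_exit(struct exit_record *rec, int status)
{
	rec->reason = REASON_EXIT_SYSCALL;
	model_do_exit(rec, (status & 0xff) << 8);
}

/* Stand-in for make_task_dead() as added here: a thin wrapper for now. */
void model_make_task_dead(struct exit_record *rec, int signr)
{
	rec->reason = REASON_CATASTROPHE;
	model_do_exit(rec, signr);
}
```

Both entry points end up in the same teardown. The separate name only
makes the caller's intent visible, which is what allows oops-only logic to
be attached to the catastrophic path later without touching exit(2).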

As part of this, rename rewind_stack_do_exit to
rewind_stack_and_make_dead.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
arch/alpha/kernel/traps.c | 6 +++---
arch/alpha/mm/fault.c | 2 +-
arch/arm/kernel/traps.c | 2 +-
arch/arm/mm/fault.c | 2 +-
arch/arm64/kernel/traps.c | 2 +-
arch/arm64/mm/fault.c | 2 +-
arch/csky/abiv1/alignment.c | 2 +-
arch/csky/kernel/traps.c | 2 +-
arch/csky/mm/fault.c | 2 +-
arch/h8300/kernel/traps.c | 2 +-
arch/h8300/mm/fault.c | 2 +-
arch/hexagon/kernel/traps.c | 2 +-
arch/ia64/kernel/mca_drv.c | 2 +-
arch/ia64/kernel/traps.c | 2 +-
arch/ia64/mm/fault.c | 2 +-
arch/m68k/kernel/traps.c | 2 +-
arch/m68k/mm/fault.c | 2 +-
arch/microblaze/kernel/exceptions.c | 4 ++--
arch/mips/kernel/traps.c | 2 +-
arch/nds32/kernel/fpu.c | 2 +-
arch/nds32/kernel/traps.c | 8 ++++----
arch/nios2/kernel/traps.c | 4 ++--
arch/openrisc/kernel/traps.c | 2 +-
arch/parisc/kernel/traps.c | 2 +-
arch/powerpc/kernel/traps.c | 8 ++++----
arch/riscv/kernel/traps.c | 2 +-
arch/riscv/mm/fault.c | 2 +-
arch/s390/kernel/dumpstack.c | 2 +-
arch/s390/kernel/nmi.c | 2 +-
arch/sh/kernel/traps.c | 2 +-
arch/sparc/kernel/traps_32.c | 4 +---
arch/sparc/kernel/traps_64.c | 4 +---
arch/x86/entry/entry_32.S | 6 +++---
arch/x86/entry/entry_64.S | 6 +++---
arch/x86/kernel/dumpstack.c | 4 ++--
arch/xtensa/kernel/traps.c | 2 +-
include/linux/sched/task.h | 1 +
kernel/exit.c | 9 +++++++++
tools/objtool/check.c | 3 ++-
39 files changed, 63 insertions(+), 56 deletions(-)

diff --git a/arch/alpha/kernel/traps.c b/arch/alpha/kernel/traps.c
index 2ae34702456c..8a66fe544c69 100644
--- a/arch/alpha/kernel/traps.c
+++ b/arch/alpha/kernel/traps.c
@@ -190,7 +190,7 @@ die_if_kernel(char * str, struct pt_regs *regs, long err, unsigned long *r9_15)
local_irq_enable();
while (1);
}
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

#ifndef CONFIG_MATHEMU
@@ -575,7 +575,7 @@ do_entUna(void * va, unsigned long opcode, unsigned long reg,

printk("Bad unaligned kernel access at %016lx: %p %lx %lu\n",
pc, va, opcode, reg);
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);

got_exception:
/* Ok, we caught the exception, but we don't want it. Is there
@@ -630,7 +630,7 @@ do_entUna(void * va, unsigned long opcode, unsigned long reg,
local_irq_enable();
while (1);
}
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

/*
diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index eee5102c3d88..e9193d52222e 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -204,7 +204,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
printk(KERN_ALERT "Unable to handle kernel paging request at "
"virtual address %016lx\n", address);
die_if_kernel("Oops", regs, cause, (unsigned long*)regs - 16);
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);

/* We ran out of memory, or some other thing happened to us that
made us unable to handle the page fault gracefully. */
diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c
index 195dff58bafc..b4bd2e5f17c1 100644
--- a/arch/arm/kernel/traps.c
+++ b/arch/arm/kernel/traps.c
@@ -333,7 +333,7 @@ static void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
if (panic_on_oops)
panic("Fatal exception");
if (signr)
- do_exit(signr);
+ make_task_dead(signr);
}

/*
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index bc8779d54a64..bf1a0c618c49 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -111,7 +111,7 @@ static void die_kernel_fault(const char *msg, struct mm_struct *mm,
show_pte(KERN_ALERT, mm, addr);
die("Oops", regs, fsr);
bust_spinlocks(0);
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
}

/*
diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index 7b21213a570f..bdd456e4e7f4 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -235,7 +235,7 @@ void die(const char *str, struct pt_regs *regs, int err)
raw_spin_unlock_irqrestore(&die_lock, flags);

if (ret != NOTIFY_STOP)
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

static void arm64_show_signal(int signo, const char *str)
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 9ae24e3b72be..11a28cace2d2 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -302,7 +302,7 @@ static void die_kernel_fault(const char *msg, unsigned long addr,
show_pte(addr);
die("Oops", regs, esr);
bust_spinlocks(0);
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
}

#ifdef CONFIG_KASAN_HW_TAGS
diff --git a/arch/csky/abiv1/alignment.c b/arch/csky/abiv1/alignment.c
index cb2a0d94a144..5e2fb45d605c 100644
--- a/arch/csky/abiv1/alignment.c
+++ b/arch/csky/abiv1/alignment.c
@@ -294,7 +294,7 @@ void csky_alignment(struct pt_regs *regs)
__func__, opcode, rz, rx, imm, addr);
show_regs(regs);
bust_spinlocks(0);
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
}

force_sig_fault(SIGBUS, BUS_ADRALN, (void __user *)addr);
diff --git a/arch/csky/kernel/traps.c b/arch/csky/kernel/traps.c
index e5fbf8653a21..88a47035b925 100644
--- a/arch/csky/kernel/traps.c
+++ b/arch/csky/kernel/traps.c
@@ -109,7 +109,7 @@ void die(struct pt_regs *regs, const char *str)
if (panic_on_oops)
panic("Fatal exception");
if (ret != NOTIFY_STOP)
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

void do_trap(struct pt_regs *regs, int signo, int code, unsigned long addr)
diff --git a/arch/csky/mm/fault.c b/arch/csky/mm/fault.c
index 466ad949818a..7215a46b6b8e 100644
--- a/arch/csky/mm/fault.c
+++ b/arch/csky/mm/fault.c
@@ -67,7 +67,7 @@ static inline void no_context(struct pt_regs *regs, unsigned long addr)
pr_alert("Unable to handle kernel paging request at virtual "
"addr 0x%08lx, pc: 0x%08lx\n", addr, regs->pc);
die(regs, "Oops");
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
}

static inline void mm_fault_error(struct pt_regs *regs, unsigned long addr, vm_fault_t fault)
diff --git a/arch/h8300/kernel/traps.c b/arch/h8300/kernel/traps.c
index bdbe988d8dbc..3d4e0bde37ae 100644
--- a/arch/h8300/kernel/traps.c
+++ b/arch/h8300/kernel/traps.c
@@ -106,7 +106,7 @@ void die(const char *str, struct pt_regs *fp, unsigned long err)
dump(fp);

spin_unlock_irq(&die_lock);
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

static int kstack_depth_to_print = 24;
diff --git a/arch/h8300/mm/fault.c b/arch/h8300/mm/fault.c
index d4bc9c16f2df..0223528565dd 100644
--- a/arch/h8300/mm/fault.c
+++ b/arch/h8300/mm/fault.c
@@ -51,7 +51,7 @@ asmlinkage int do_page_fault(struct pt_regs *regs, unsigned long address,
printk(" at virtual address %08lx\n", address);
if (!user_mode(regs))
die("Oops", regs, error_code);
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);

return 1;
}
diff --git a/arch/hexagon/kernel/traps.c b/arch/hexagon/kernel/traps.c
index edfc35dafeb1..6dd6cf0ab711 100644
--- a/arch/hexagon/kernel/traps.c
+++ b/arch/hexagon/kernel/traps.c
@@ -214,7 +214,7 @@ int die(const char *str, struct pt_regs *regs, long err)
panic("Fatal exception");

oops_exit();
- do_exit(err);
+ make_task_dead(err);
return 0;
}

diff --git a/arch/ia64/kernel/mca_drv.c b/arch/ia64/kernel/mca_drv.c
index 5bfc79be4cef..23c203639a96 100644
--- a/arch/ia64/kernel/mca_drv.c
+++ b/arch/ia64/kernel/mca_drv.c
@@ -176,7 +176,7 @@ mca_handler_bh(unsigned long paddr, void *iip, unsigned long ipsr)
spin_unlock(&mca_bh_lock);

/* This process is about to be killed itself */
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
}

/**
diff --git a/arch/ia64/kernel/traps.c b/arch/ia64/kernel/traps.c
index e13cb905930f..753642366e12 100644
--- a/arch/ia64/kernel/traps.c
+++ b/arch/ia64/kernel/traps.c
@@ -85,7 +85,7 @@ die (const char *str, struct pt_regs *regs, long err)
if (panic_on_oops)
panic("Fatal exception");

- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
return 0;
}

diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 02de2e70c587..4796cccbf74f 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -259,7 +259,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
regs = NULL;
bust_spinlocks(0);
if (regs)
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
return;

out_of_memory:
diff --git a/arch/m68k/kernel/traps.c b/arch/m68k/kernel/traps.c
index 34d6458340b0..59fc63feb0dc 100644
--- a/arch/m68k/kernel/traps.c
+++ b/arch/m68k/kernel/traps.c
@@ -1131,7 +1131,7 @@ void die_if_kernel (char *str, struct pt_regs *fp, int nr)
pr_crit("%s: %08x\n", str, nr);
show_registers(fp);
add_taint(TAINT_DIE, LOCKDEP_NOW_UNRELIABLE);
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

asmlinkage void set_esp0(unsigned long ssp)
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index ef46e77e97a5..fcb3a0d8421c 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -48,7 +48,7 @@ int send_fault_sig(struct pt_regs *regs)
pr_alert("Unable to handle kernel access");
pr_cont(" at virtual address %p\n", addr);
die_if_kernel("Oops", regs, 0 /*error_code*/);
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
}

return 1;
diff --git a/arch/microblaze/kernel/exceptions.c b/arch/microblaze/kernel/exceptions.c
index 908788497b28..fd153d5fab98 100644
--- a/arch/microblaze/kernel/exceptions.c
+++ b/arch/microblaze/kernel/exceptions.c
@@ -44,10 +44,10 @@ void die(const char *str, struct pt_regs *fp, long err)
pr_warn("Oops: %s, sig: %ld\n", str, err);
show_regs(fp);
spin_unlock_irq(&die_lock);
- /* do_exit() should take care of panic'ing from an interrupt
+ /* make_task_dead() should take care of panic'ing from an interrupt
* context so we don't handle it here
*/
- do_exit(err);
+ make_task_dead(err);
}

/* for user application debugging */
diff --git a/arch/mips/kernel/traps.c b/arch/mips/kernel/traps.c
index d26b0fb8ea06..a486486b2355 100644
--- a/arch/mips/kernel/traps.c
+++ b/arch/mips/kernel/traps.c
@@ -422,7 +422,7 @@ void __noreturn die(const char *str, struct pt_regs *regs)
if (regs && kexec_should_crash(current))
crash_kexec(regs);

- do_exit(sig);
+ make_task_dead(sig);
}

extern struct exception_table_entry __start___dbe_table[];
diff --git a/arch/nds32/kernel/fpu.c b/arch/nds32/kernel/fpu.c
index 9edd7ed7d7bf..701c09a668de 100644
--- a/arch/nds32/kernel/fpu.c
+++ b/arch/nds32/kernel/fpu.c
@@ -223,7 +223,7 @@ inline void handle_fpu_exception(struct pt_regs *regs)
}
} else if (fpcsr & FPCSR_mskRIT) {
if (!user_mode(regs))
- do_exit(SIGILL);
+ make_task_dead(SIGILL);
si_signo = SIGILL;
}

diff --git a/arch/nds32/kernel/traps.c b/arch/nds32/kernel/traps.c
index ca75d475eda4..c0a8f3344fb9 100644
--- a/arch/nds32/kernel/traps.c
+++ b/arch/nds32/kernel/traps.c
@@ -141,7 +141,7 @@ void __noreturn die(const char *str, struct pt_regs *regs, int err)

bust_spinlocks(0);
spin_unlock_irq(&die_lock);
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

EXPORT_SYMBOL(die);
@@ -240,7 +240,7 @@ void unhandled_interruption(struct pt_regs *regs)
pr_emerg("unhandled_interruption\n");
show_regs(regs);
if (!user_mode(regs))
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
force_sig(SIGKILL);
}

@@ -251,7 +251,7 @@ void unhandled_exceptions(unsigned long entry, unsigned long addr,
addr, type);
show_regs(regs);
if (!user_mode(regs))
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
force_sig(SIGKILL);
}

@@ -278,7 +278,7 @@ void do_revinsn(struct pt_regs *regs)
pr_emerg("Reserved Instruction\n");
show_regs(regs);
if (!user_mode(regs))
- do_exit(SIGILL);
+ make_task_dead(SIGILL);
force_sig(SIGILL);
}

diff --git a/arch/nios2/kernel/traps.c b/arch/nios2/kernel/traps.c
index 596986a74a26..85ac49d64cf7 100644
--- a/arch/nios2/kernel/traps.c
+++ b/arch/nios2/kernel/traps.c
@@ -37,10 +37,10 @@ void die(const char *str, struct pt_regs *regs, long err)
show_regs(regs);
spin_unlock_irq(&die_lock);
/*
- * do_exit() should take care of panic'ing from an interrupt
+ * make_task_dead() should take care of panic'ing from an interrupt
* context so we don't handle it here
*/
- do_exit(err);
+ make_task_dead(err);
}

void _exception(int signo, struct pt_regs *regs, int code, unsigned long addr)
diff --git a/arch/openrisc/kernel/traps.c b/arch/openrisc/kernel/traps.c
index 0898cb159fac..0446a3c34372 100644
--- a/arch/openrisc/kernel/traps.c
+++ b/arch/openrisc/kernel/traps.c
@@ -212,7 +212,7 @@ void __noreturn die(const char *str, struct pt_regs *regs, long err)
__asm__ __volatile__("l.nop 1");
do {} while (1);
#endif
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

/* This is normally the 'Oops' routine */
diff --git a/arch/parisc/kernel/traps.c b/arch/parisc/kernel/traps.c
index b11fb26ce299..df2122c50d78 100644
--- a/arch/parisc/kernel/traps.c
+++ b/arch/parisc/kernel/traps.c
@@ -269,7 +269,7 @@ void die_if_kernel(char *str, struct pt_regs *regs, long err)
panic("Fatal exception");

oops_exit();
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

/* gdb uses break 4,8 */
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 11741703d26e..a08bb7cefdc5 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -245,7 +245,7 @@ static void oops_end(unsigned long flags, struct pt_regs *regs,

if (panic_on_oops)
panic("Fatal exception");
- do_exit(signr);
+ make_task_dead(signr);
}
NOKPROBE_SYMBOL(oops_end);

@@ -792,9 +792,9 @@ int machine_check_generic(struct pt_regs *regs)
void die_mce(const char *str, struct pt_regs *regs, long err)
{
/*
- * The machine check wants to kill the interrupted context, but
- * do_exit() checks for in_interrupt() and panics in that case, so
- * exit the irq/nmi before calling die.
+ * The machine check wants to kill the interrupted context,
+ * but make_task_dead() checks for in_interrupt() and panics
+ * in that case, so exit the irq/nmi before calling die.
*/
if (in_nmi())
nmi_exit();
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index 0daaa3e4630d..fe92e119e6a3 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -54,7 +54,7 @@ void die(struct pt_regs *regs, const char *str)
if (panic_on_oops)
panic("Fatal exception");
if (ret != NOTIFY_STOP)
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

void do_trap(struct pt_regs *regs, int signo, int code, unsigned long addr)
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index aa08dd2f8fae..42118bc728f9 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -31,7 +31,7 @@ static void die_kernel_fault(const char *msg, unsigned long addr,

bust_spinlocks(0);
die(regs, "Oops");
- do_exit(SIGKILL);
+ make_task_dead(SIGKILL);
}

static inline void no_context(struct pt_regs *regs, unsigned long addr)
diff --git a/arch/s390/kernel/dumpstack.c b/arch/s390/kernel/dumpstack.c
index 0681c55e831d..1e3233eb510a 100644
--- a/arch/s390/kernel/dumpstack.c
+++ b/arch/s390/kernel/dumpstack.c
@@ -224,5 +224,5 @@ void __noreturn die(struct pt_regs *regs, const char *str)
if (panic_on_oops)
panic("Fatal exception: panic_on_oops");
oops_exit();
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}
diff --git a/arch/s390/kernel/nmi.c b/arch/s390/kernel/nmi.c
index 20f8e1868853..a4d8c058dd27 100644
--- a/arch/s390/kernel/nmi.c
+++ b/arch/s390/kernel/nmi.c
@@ -175,7 +175,7 @@ void __s390_handle_mcck(void)
"malfunction (code 0x%016lx).\n", mcck.mcck_code);
printk(KERN_EMERG "mcck: task: %s, pid: %d.\n",
current->comm, current->pid);
- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}
}

diff --git a/arch/sh/kernel/traps.c b/arch/sh/kernel/traps.c
index cbe3201d4f21..01884054aeb2 100644
--- a/arch/sh/kernel/traps.c
+++ b/arch/sh/kernel/traps.c
@@ -57,7 +57,7 @@ void __noreturn die(const char *str, struct pt_regs *regs, long err)
if (panic_on_oops)
panic("Fatal exception");

- do_exit(SIGSEGV);
+ make_task_dead(SIGSEGV);
}

void die_if_kernel(const char *str, struct pt_regs *regs, long err)
diff --git a/arch/sparc/kernel/traps_32.c b/arch/sparc/kernel/traps_32.c
index 5630e5a395e0..179aabfa712e 100644
--- a/arch/sparc/kernel/traps_32.c
+++ b/arch/sparc/kernel/traps_32.c
@@ -86,9 +86,7 @@ void __noreturn die_if_kernel(char *str, struct pt_regs *regs)
}
printk("Instruction DUMP:");
instruction_dump ((unsigned long *) regs->pc);
- if(regs->psr & PSR_PS)
- do_exit(SIGKILL);
- do_exit(SIGSEGV);
+ make_task_dead((regs->psr & PSR_PS) ? SIGKILL : SIGSEGV);
}

void do_hw_interrupt(struct pt_regs *regs, unsigned long type)
diff --git a/arch/sparc/kernel/traps_64.c b/arch/sparc/kernel/traps_64.c
index 6863025ed56d..21077821f427 100644
--- a/arch/sparc/kernel/traps_64.c
+++ b/arch/sparc/kernel/traps_64.c
@@ -2559,9 +2559,7 @@ void __noreturn die_if_kernel(char *str, struct pt_regs *regs)
}
if (panic_on_oops)
panic("Fatal exception");
- if (regs->tstate & TSTATE_PRIV)
- do_exit(SIGKILL);
- do_exit(SIGSEGV);
+ make_task_dead((regs->tstate & TSTATE_PRIV)? SIGKILL : SIGSEGV);
}
EXPORT_SYMBOL(die_if_kernel);

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index ccb9d32768f3..7738fad6a85e 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1248,14 +1248,14 @@ SYM_CODE_START(asm_exc_nmi)
SYM_CODE_END(asm_exc_nmi)

.pushsection .text, "ax"
-SYM_CODE_START(rewind_stack_do_exit)
+SYM_CODE_START(rewind_stack_and_make_dead)
/* Prevent any naive code from trying to unwind to our caller. */
xorl %ebp, %ebp

movl PER_CPU_VAR(cpu_current_top_of_stack), %esi
leal -TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%esi), %esp

- call do_exit
+ call make_task_dead
1: jmp 1b
-SYM_CODE_END(rewind_stack_do_exit)
+SYM_CODE_END(rewind_stack_and_make_dead)
.popsection
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e38a4cf795d9..f09276457942 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1429,7 +1429,7 @@ SYM_CODE_END(ignore_sysret)
#endif

.pushsection .text, "ax"
-SYM_CODE_START(rewind_stack_do_exit)
+SYM_CODE_START(rewind_stack_and_make_dead)
UNWIND_HINT_FUNC
/* Prevent any naive code from trying to unwind to our caller. */
xorl %ebp, %ebp
@@ -1438,6 +1438,6 @@ SYM_CODE_START(rewind_stack_do_exit)
leaq -PTREGS_SIZE(%rax), %rsp
UNWIND_HINT_REGS

- call do_exit
-SYM_CODE_END(rewind_stack_do_exit)
+ call make_task_dead
+SYM_CODE_END(rewind_stack_and_make_dead)
.popsection
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index ea4fe192189d..53de044e5654 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -351,7 +351,7 @@ unsigned long oops_begin(void)
}
NOKPROBE_SYMBOL(oops_begin);

-void __noreturn rewind_stack_do_exit(int signr);
+void __noreturn rewind_stack_and_make_dead(int signr);

void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
{
@@ -386,7 +386,7 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
* reuse the task stack and that existing poisons are invalid.
*/
kasan_unpoison_task_stack(current);
- rewind_stack_do_exit(signr);
+ rewind_stack_and_make_dead(signr);
}
NOKPROBE_SYMBOL(oops_end);

diff --git a/arch/xtensa/kernel/traps.c b/arch/xtensa/kernel/traps.c
index 4b4dbeb2d612..9345007d474d 100644
--- a/arch/xtensa/kernel/traps.c
+++ b/arch/xtensa/kernel/traps.c
@@ -552,5 +552,5 @@ void __noreturn die(const char * str, struct pt_regs * regs, long err)
if (panic_on_oops)
panic("Fatal exception");

- do_exit(err);
+ make_task_dead(err);
}
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ba88a6987400..2d4bbd9c3278 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -59,6 +59,7 @@ extern void sched_post_fork(struct task_struct *p,
extern void sched_dead(struct task_struct *p);

void __noreturn do_task_dead(void);
+void __noreturn make_task_dead(int signr);

extern void proc_caches_init(void);

diff --git a/kernel/exit.c b/kernel/exit.c
index f702a6a63686..bfa513c5b227 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -884,6 +884,15 @@ void __noreturn do_exit(long code)
}
EXPORT_SYMBOL_GPL(do_exit);

+void __noreturn make_task_dead(int signr)
+{
+ /*
+ * Take the task off the cpu after something catastrophic has
+ * happened.
+ */
+ do_exit(signr);
+}
+
void complete_and_exit(struct completion *comp, long code)
{
if (comp)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 21735829b860..e6ab5687770b 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -168,6 +168,7 @@ static bool __dead_end_function(struct objtool_file *file, struct symbol *func,
"panic",
"do_exit",
"do_task_dead",
+ "make_task_dead",
"__module_put_and_exit",
"complete_and_exit",
"__reiserfs_panic",
@@ -175,7 +176,7 @@ static bool __dead_end_function(struct objtool_file *file, struct symbol *func,
"fortify_panic",
"usercopy_abort",
"machine_real_restart",
- "rewind_stack_do_exit",
+ "rewind_stack_and_make_dead",
"kunit_try_catch_throw",
"xen_start_kernel",
"cpu_bringup_and_idle",
--
2.29.2

2021-12-08 20:26:26

by Eric W. Biederman

Subject: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead

The beginning of do_exit has become cluttered and difficult to read as
it is filled with checks to handle things that can only happen when
the kernel is operating improperly.

Now that we have a dedicated function for cleaning up a task when the
kernel is operating improperly, move the checks there.
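
The order of the moved checks can be sketched as a small user-space
decision function. This is an illustrative model under stated assumptions,
not the kernel implementation: task_model, dead_action and
classify_dead_task are hypothetical names, the struct fields stand in for
in_interrupt(), tsk->pid, preempt_count() and PF_EXITING, and
PREEMPT_ENABLED is modeled as 0. The force_uaccess_begin() step has no
counterpart in the model and is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the state the moved checks look at. */
struct task_model {
	bool in_interrupt;	/* models in_interrupt() */
	int pid;		/* 0 models the idle task */
	int preempt_count;	/* nonzero models in_atomic() */
	bool exiting;		/* models tsk->flags & PF_EXITING */
};

enum dead_action {
	ACT_PANIC_INTERRUPT,	/* "Aiee, killing interrupt handler!" */
	ACT_PANIC_IDLE,		/* "Attempted to kill the idle task!" */
	ACT_PARK_FOREVER,	/* recursive fault: sleep until reboot */
	ACT_DO_EXIT,		/* state is sane enough to call do_exit */
};

/* Walks the checks in the same order as the patched make_task_dead(). */
enum dead_action classify_dead_task(struct task_model *t)
{
	assert(t != NULL);

	if (t->in_interrupt)
		return ACT_PANIC_INTERRUPT;
	if (t->pid == 0)
		return ACT_PANIC_IDLE;

	/* Models preempt_count_set(PREEMPT_ENABLED), assumed to be 0 here. */
	if (t->preempt_count != 0)
		t->preempt_count = 0;

	if (t->exiting)
		return ACT_PARK_FOREVER;

	return ACT_DO_EXIT;
}
```

Only the last outcome reaches do_exit, which is why do_exit itself no
longer needs any of these checks once they live in make_task_dead.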

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 78 ++++++++++++++++++++++-----------------------
kernel/futex/core.c | 2 +-
kernel/kexec_core.c | 2 +-
3 files changed, 41 insertions(+), 41 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index bfa513c5b227..d0ec6f6b41cb 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -735,36 +735,8 @@ void __noreturn do_exit(long code)
struct task_struct *tsk = current;
int group_dead;

- /*
- * We can get here from a kernel oops, sometimes with preemption off.
- * Start by checking for critical errors.
- * Then fix up important state like USER_DS and preemption.
- * Then do everything else.
- */
-
WARN_ON(blk_needs_flush_plug(tsk));

- if (unlikely(in_interrupt()))
- panic("Aiee, killing interrupt handler!");
- if (unlikely(!tsk->pid))
- panic("Attempted to kill the idle task!");
-
- /*
- * If do_exit is called because this processes oopsed, it's possible
- * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
- * continuing. Amongst other possible reasons, this is to prevent
- * mm_release()->clear_child_tid() from writing to a user-controlled
- * kernel address.
- */
- force_uaccess_begin();
-
- if (unlikely(in_atomic())) {
- pr_info("note: %s[%d] exited with preempt_count %d\n",
- current->comm, task_pid_nr(current),
- preempt_count());
- preempt_count_set(PREEMPT_ENABLED);
- }
-
profile_task_exit(tsk);
kcov_task_exit(tsk);

@@ -773,17 +745,6 @@ void __noreturn do_exit(long code)

validate_creds_for_do_exit(tsk);

- /*
- * We're taking recursive faults here in do_exit. Safest is to just
- * leave this task alone and wait for reboot.
- */
- if (unlikely(tsk->flags & PF_EXITING)) {
- pr_alert("Fixing recursive fault but reboot is needed!\n");
- futex_exit_recursive(tsk);
- set_current_state(TASK_UNINTERRUPTIBLE);
- schedule();
- }
-
io_uring_files_cancel();
exit_signals(tsk); /* sets PF_EXITING */

@@ -889,7 +850,46 @@ void __noreturn make_task_dead(int signr)
/*
* Take the task off the cpu after something catastrophic has
* happened.
+ *
+ * We can get here from a kernel oops, sometimes with preemption off.
+ * Start by checking for critical errors.
+ * Then fix up important state like USER_DS and preemption.
+ * Then do everything else.
*/
+ struct task_struct *tsk = current;
+
+ if (unlikely(in_interrupt()))
+ panic("Aiee, killing interrupt handler!");
+ if (unlikely(!tsk->pid))
+ panic("Attempted to kill the idle task!");
+
+ /*
+ * If make_task_dead is called because this processes oopsed, it's possible
+ * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
+ * continuing. Amongst other possible reasons, this is to prevent
+ * mm_release()->clear_child_tid() from writing to a user-controlled
+ * kernel address.
+ */
+ force_uaccess_begin();
+
+ if (unlikely(in_atomic())) {
+ pr_info("note: %s[%d] exited with preempt_count %d\n",
+ current->comm, task_pid_nr(current),
+ preempt_count());
+ preempt_count_set(PREEMPT_ENABLED);
+ }
+
+ /*
+ * We're taking recursive faults here in make_task_dead. Safest is to just
+ * leave this task alone and wait for reboot.
+ */
+ if (unlikely(tsk->flags & PF_EXITING)) {
+ pr_alert("Fixing recursive fault but reboot is needed!\n");
+ futex_exit_recursive(tsk);
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ schedule();
+ }
+
do_exit(signr);
}

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 25d8a88b32e5..39a1522865b5 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1044,7 +1044,7 @@ static void futex_cleanup(struct task_struct *tsk)
* actually finished the futex cleanup. The worst case for this is that the
* waiter runs through the wait loop until the state becomes visible.
*
- * This is called from the recursive fault handling path in do_exit().
+ * This is called from the recursive fault handling path in make_task_dead().
*
* This is best effort. Either the futex exit code has run already or
* not. If the OWNER_DIED bit has been set on the futex then the waiter can
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 5a5d192a89ac..68480f731192 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -81,7 +81,7 @@ int kexec_should_crash(struct task_struct *p)
if (crash_kexec_post_notifiers)
return 0;
/*
- * There are 4 panic() calls in do_exit() path, each of which
+ * There are 4 panic() calls in make_task_dead() path, each of which
* corresponds to each of these 4 conditions.
*/
if (in_interrupt() || !p->pid || is_global_init(p) || panic_on_oops)
--
2.29.2

2021-12-08 20:26:29

by Eric W. Biederman

Subject: [PATCH 04/10] exit: Stop poorly open coding do_task_dead in make_task_dead

When the kernel detects that it is oopsing or otherwise force killing a
task while that task is already exiting, the code poorly attempts to
permanently stop the task from scheduling.

I say poorly because it is possible for a task in TASK_UNINTERRUPTIBLE
to be woken up.

As it makes no sense for the task to continue, call do_task_dead
instead, which actually does the work and permanently removes the task
from the scheduler, guaranteeing the task will never be woken
up again.
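
The difference can be illustrated with a small userspace model. This is
only a sketch, not kernel code: model_task, model_wake_up and the MODEL_*
states are hypothetical names invented for illustration. The assumption
being modeled is that a wakeup acts on a sleeping task (the state left by
set_current_state(TASK_UNINTERRUPTIBLE) plus schedule()) but never on a
task that has been marked dead (the state left by do_task_dead()).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the scheduler states discussed above. */
enum model_state {
	MODEL_RUNNING,
	MODEL_UNINTERRUPTIBLE,	/* left by set_current_state() + schedule() */
	MODEL_DEAD,		/* left by do_task_dead() */
};

struct model_task {
	enum model_state state;
};

/*
 * Model of a wakeup: a sleeping task becomes runnable again, while a
 * dead task is never touched. Returns true if the task was woken.
 */
static bool model_wake_up(struct model_task *t)
{
	assert(t != NULL);
	if (t->state != MODEL_UNINTERRUPTIBLE)
		return false;
	t->state = MODEL_RUNNING;
	return true;
}
```

In this model a stray wakeup makes the sleeping task runnable again, so
it would resume in the middle of the recursive fault path, while the dead
task stays off the run queue no matter who tries to wake it.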

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index d0ec6f6b41cb..f975cd8a2ed8 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -886,8 +886,7 @@ void __noreturn make_task_dead(int signr)
if (unlikely(tsk->flags & PF_EXITING)) {
pr_alert("Fixing recursive fault but reboot is needed!\n");
futex_exit_recursive(tsk);
- set_current_state(TASK_UNINTERRUPTIBLE);
- schedule();
+ do_task_dead();
}

do_exit(signr);
--
2.29.2

2021-12-08 20:26:32

by Eric W. Biederman

Subject: [PATCH 05/10] exit: Stop exporting do_exit

Now that there are no more modular uses of do_exit remove the EXPORT_SYMBOL.

Suggested-by: Christoph Hellwig <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index f975cd8a2ed8..57afac845a0a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -843,7 +843,6 @@ void __noreturn do_exit(long code)
lockdep_free_task(tsk);
do_task_dead();
}
-EXPORT_SYMBOL_GPL(do_exit);

void __noreturn make_task_dead(int signr)
{
--
2.29.2

2021-12-08 20:26:38

by Eric W. Biederman

Subject: [PATCH 06/10] exit: Implement kthread_exit

The way the per task_struct exit_code is used by kernel threads is not
quite compatible with how it is used by userspace applications. The low
byte of the userspace exit_code value encodes the exit signal, while
kthreads just use the value as an int holding ordinary kernel function
exit status like -EPERM.

Add kthread_exit to clearly separate the two kinds of uses.
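
A rough userspace sketch of the mismatch follows. The helper names
(model_exit_code, model_term_signal, model_exit_status,
model_check_mismatch) are hypothetical; the shift mirrors the
(error_code&0xff)<<8 expression used by the exit syscall, and the 0x7f
mask assumes the usual wait-status layout in which the low 7 bits carry
the signal number.

```c
#include <assert.h>
#include <errno.h>

/* Mirrors the encoding used by the exit syscall: status in bits 8..15. */
static int model_exit_code(int error_code)
{
	return (error_code & 0xff) << 8;
}

/* The low 7 bits of a userspace exit_code carry the terminating signal. */
static int model_term_signal(int exit_code)
{
	return exit_code & 0x7f;
}

/* Bits 8..15 carry the status that was passed to exit. */
static int model_exit_status(int exit_code)
{
	return (exit_code >> 8) & 0xff;
}

/* A kthread-style result such as -EPERM does not fit that layout. */
static void model_check_mismatch(void)
{
	/* A userspace exit(3) decodes cleanly: no signal, status 3. */
	assert(model_term_signal(model_exit_code(3)) == 0);
	assert(model_exit_status(model_exit_code(3)) == 3);

	/* A raw -EPERM would be read as "killed by a signal". */
	assert(model_term_signal(-EPERM) != 0);
}
```

Decoding a raw negative errno with the userspace rules yields a bogus
signal number, which is the kind of confusion a separate kthread_exit is
meant to rule out.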

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/kthread.h | 1 +
kernel/kthread.c | 23 +++++++++++++++++++----
tools/objtool/check.c | 1 +
3 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 346b0f269161..22c43d419687 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -70,6 +70,7 @@ void *kthread_probe_data(struct task_struct *k);
int kthread_park(struct task_struct *k);
void kthread_unpark(struct task_struct *k);
void kthread_parkme(void);
+void kthread_exit(long result) __noreturn;

int kthreadd(void *unused);
extern struct task_struct *kthreadd_task;
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 7113003fab63..77b7c3f23f18 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -268,6 +268,21 @@ void kthread_parkme(void)
}
EXPORT_SYMBOL_GPL(kthread_parkme);

+/**
+ * kthread_exit - Cause the current kthread to return @result to kthread_stop().
+ * @result: The integer value to return to kthread_stop().
+ *
+ * While kthread_exit can be called directly, it exists so that
+ * functions which do some additional work in non-modular code such as
+ * module_put_and_kthread_exit can be implemented.
+ *
+ * Does not return.
+ */
+void __noreturn kthread_exit(long result)
+{
+ do_exit(result);
+}
+
static int kthread(void *_create)
{
static const struct sched_param param = { .sched_priority = 0 };
@@ -286,13 +301,13 @@ static int kthread(void *_create)
done = xchg(&create->done, NULL);
if (!done) {
kfree(create);
- do_exit(-EINTR);
+ kthread_exit(-EINTR);
}

if (!self) {
create->result = ERR_PTR(-ENOMEM);
complete(done);
- do_exit(-ENOMEM);
+ kthread_exit(-ENOMEM);
}

self->threadfn = threadfn;
@@ -326,7 +341,7 @@ static int kthread(void *_create)
__kthread_parkme(self);
ret = threadfn(data);
}
- do_exit(ret);
+ kthread_exit(ret);
}

/* called from kernel_clone() to get node information for about to be created task */
@@ -627,7 +642,7 @@ EXPORT_SYMBOL_GPL(kthread_park);
* instead of calling wake_up_process(): the thread will exit without
* calling threadfn().
*
- * If threadfn() may call do_exit() itself, the caller must ensure
+ * If threadfn() may call kthread_exit() itself, the caller must ensure
* task_struct can't go away.
*
* Returns the result of threadfn(), or %-EINTR if wake_up_process()
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index e6ab5687770b..90108fe5610d 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -168,6 +168,7 @@ static bool __dead_end_function(struct objtool_file *file, struct symbol *func,
"panic",
"do_exit",
"do_task_dead",
+ "kthread_exit",
"make_task_dead",
"__module_put_and_exit",
"complete_and_exit",
--
2.29.2

2021-12-08 20:26:40

by Eric W. Biederman

Subject: [PATCH 07/10] exit: Rename module_put_and_exit to module_put_and_kthread_exit

Update module_put_and_exit to call kthread_exit instead of do_exit.

Change the name to reflect this change in functionality. All of the
users of module_put_and_exit are causing the current kthread to exit
so this change makes it clear what is happening. There is no
functional change.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
crypto/algboss.c | 4 ++--
fs/cifs/connect.c | 2 +-
fs/nfs/callback.c | 4 ++--
fs/nfs/nfs4state.c | 2 +-
fs/nfsd/nfssvc.c | 2 +-
include/linux/module.h | 6 +++---
kernel/module.c | 6 +++---
net/bluetooth/bnep/core.c | 2 +-
net/bluetooth/cmtp/core.c | 2 +-
net/bluetooth/hidp/core.c | 2 +-
tools/objtool/check.c | 2 +-
11 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/crypto/algboss.c b/crypto/algboss.c
index 1814d2c5188a..eb5fe84efb83 100644
--- a/crypto/algboss.c
+++ b/crypto/algboss.c
@@ -67,7 +67,7 @@ static int cryptomgr_probe(void *data)
complete_all(&param->larval->completion);
crypto_alg_put(&param->larval->alg);
kfree(param);
- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
}

static int cryptomgr_schedule_probe(struct crypto_larval *larval)
@@ -190,7 +190,7 @@ static int cryptomgr_test(void *data)
crypto_alg_tested(param->driver, err);

kfree(param);
- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
}

static int cryptomgr_schedule_test(struct crypto_alg *alg)
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 82577a7a5bb1..39fbe9acbf51 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -1139,7 +1139,7 @@ cifs_demultiplex_thread(void *p)
}

memalloc_noreclaim_restore(noreclaim_flag);
- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
}

/*
diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
index 86d856de1389..3c86a559a321 100644
--- a/fs/nfs/callback.c
+++ b/fs/nfs/callback.c
@@ -93,7 +93,7 @@ nfs4_callback_svc(void *vrqstp)
svc_process(rqstp);
}
svc_exit_thread(rqstp);
- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
return 0;
}

@@ -137,7 +137,7 @@ nfs41_callback_svc(void *vrqstp)
}
}
svc_exit_thread(rqstp);
- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
return 0;
}

diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index ecc4594299d6..ea41af731978 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -2689,6 +2689,6 @@ static int nfs4_run_state_manager(void *ptr)
allow_signal(SIGKILL);
nfs4_state_manager(clp);
nfs_put_client(clp);
- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
return 0;
}
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 80431921e5d7..5ce9f14318c4 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -986,7 +986,7 @@ nfsd(void *vrqstp)

/* Release module */
mutex_unlock(&nfsd_mutex);
- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
return 0;
}

diff --git a/include/linux/module.h b/include/linux/module.h
index c9f1200b2312..f03be97e9ec1 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -595,9 +595,9 @@ int module_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
/* Look for this name: can be of form module:name. */
unsigned long module_kallsyms_lookup_name(const char *name);

-extern void __noreturn __module_put_and_exit(struct module *mod,
+extern void __noreturn __module_put_and_kthread_exit(struct module *mod,
long code);
-#define module_put_and_exit(code) __module_put_and_exit(THIS_MODULE, code)
+#define module_put_and_kthread_exit(code) __module_put_and_kthread_exit(THIS_MODULE, code)

#ifdef CONFIG_MODULE_UNLOAD
int module_refcount(struct module *mod);
@@ -790,7 +790,7 @@ static inline int unregister_module_notifier(struct notifier_block *nb)
return 0;
}

-#define module_put_and_exit(code) do_exit(code)
+#define module_put_and_kthread_exit(code) kthread_exit(code)

static inline void print_modules(void)
{
diff --git a/kernel/module.c b/kernel/module.c
index 84a9141a5e15..a3aa00bf270d 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -337,12 +337,12 @@ static inline void add_taint_module(struct module *mod, unsigned flag,
* A thread that wants to hold a reference to a module only while it
* is running can call this to safely exit. nfsd and lockd use this.
*/
-void __noreturn __module_put_and_exit(struct module *mod, long code)
+void __noreturn __module_put_and_kthread_exit(struct module *mod, long code)
{
module_put(mod);
- do_exit(code);
+ kthread_exit(code);
}
-EXPORT_SYMBOL(__module_put_and_exit);
+EXPORT_SYMBOL(__module_put_and_kthread_exit);

/* Find a module section: 0 means not found. */
static unsigned int find_sec(const struct load_info *info, const char *name)
diff --git a/net/bluetooth/bnep/core.c b/net/bluetooth/bnep/core.c
index c9add7753b9f..40baa6b7321a 100644
--- a/net/bluetooth/bnep/core.c
+++ b/net/bluetooth/bnep/core.c
@@ -535,7 +535,7 @@ static int bnep_session(void *arg)

up_write(&bnep_session_sem);
free_netdev(dev);
- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
return 0;
}

diff --git a/net/bluetooth/cmtp/core.c b/net/bluetooth/cmtp/core.c
index 0a2d78e811cf..9bfded6b74b3 100644
--- a/net/bluetooth/cmtp/core.c
+++ b/net/bluetooth/cmtp/core.c
@@ -323,7 +323,7 @@ static int cmtp_session(void *arg)
up_write(&cmtp_session_sem);

kfree(session);
- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
return 0;
}

diff --git a/net/bluetooth/hidp/core.c b/net/bluetooth/hidp/core.c
index 80848dfc01db..5940744a8cd8 100644
--- a/net/bluetooth/hidp/core.c
+++ b/net/bluetooth/hidp/core.c
@@ -1305,7 +1305,7 @@ static int hidp_session_thread(void *arg)
l2cap_unregister_user(session->conn, &session->user);
hidp_session_put(session);

- module_put_and_exit(0);
+ module_put_and_kthread_exit(0);
return 0;
}

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 90108fe5610d..120e9598c11a 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -170,7 +170,7 @@ static bool __dead_end_function(struct objtool_file *file, struct symbol *func,
"do_task_dead",
"kthread_exit",
"make_task_dead",
- "__module_put_and_exit",
+ "__module_put_and_kthread_exit",
"complete_and_exit",
"__reiserfs_panic",
"lbug_with_loc",
--
2.29.2

2021-12-08 20:26:46

by Eric W. Biederman

Subject: [PATCH 08/10] exit: Rename complete_and_exit to kthread_complete_and_exit

Update complete_and_exit to call kthread_exit instead of do_exit.

Change the name to reflect this change in functionality. All of the
users of complete_and_exit are causing the current kthread to exit so
this change makes it clear what is happening.

Move the implementation of kthread_complete_and_exit from
kernel/exit.c to kernel/kthread.c. As this function is kthread
specific it makes most sense for it to live with the kthread functions.

There is no functional change.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
drivers/net/wireless/rsi/rsi_91x_coex.c | 2 +-
drivers/net/wireless/rsi/rsi_91x_main.c | 2 +-
drivers/net/wireless/rsi/rsi_91x_sdio_ops.c | 2 +-
drivers/net/wireless/rsi/rsi_91x_usb_ops.c | 2 +-
drivers/pnp/pnpbios/core.c | 6 +++---
drivers/staging/rts5208/rtsx.c | 16 +++++++--------
drivers/usb/atm/usbatm.c | 2 +-
drivers/usb/gadget/function/f_mass_storage.c | 2 +-
fs/jffs2/background.c | 2 +-
include/linux/kernel.h | 1 -
include/linux/kthread.h | 1 +
kernel/exit.c | 9 ---------
kernel/kthread.c | 21 ++++++++++++++++++++
lib/kunit/try-catch.c | 4 ++--
tools/objtool/check.c | 2 +-
15 files changed, 43 insertions(+), 31 deletions(-)

diff --git a/drivers/net/wireless/rsi/rsi_91x_coex.c b/drivers/net/wireless/rsi/rsi_91x_coex.c
index a0c5d02ae88c..8a3d86897ea8 100644
--- a/drivers/net/wireless/rsi/rsi_91x_coex.c
+++ b/drivers/net/wireless/rsi/rsi_91x_coex.c
@@ -63,7 +63,7 @@ static void rsi_coex_scheduler_thread(struct rsi_common *common)
rsi_coex_sched_tx_pkts(coex_cb);
} while (atomic_read(&coex_cb->coex_tx_thread.thread_done) == 0);

- complete_and_exit(&coex_cb->coex_tx_thread.completion, 0);
+ kthread_complete_and_exit(&coex_cb->coex_tx_thread.completion, 0);
}

int rsi_coex_recv_pkt(struct rsi_common *common, u8 *msg)
diff --git a/drivers/net/wireless/rsi/rsi_91x_main.c b/drivers/net/wireless/rsi/rsi_91x_main.c
index f1bf71e6c608..c7f5cec5e446 100644
--- a/drivers/net/wireless/rsi/rsi_91x_main.c
+++ b/drivers/net/wireless/rsi/rsi_91x_main.c
@@ -260,7 +260,7 @@ static void rsi_tx_scheduler_thread(struct rsi_common *common)
if (common->init_done)
rsi_core_qos_processor(common);
} while (atomic_read(&common->tx_thread.thread_done) == 0);
- complete_and_exit(&common->tx_thread.completion, 0);
+ kthread_complete_and_exit(&common->tx_thread.completion, 0);
}

#ifdef CONFIG_RSI_COEX
diff --git a/drivers/net/wireless/rsi/rsi_91x_sdio_ops.c b/drivers/net/wireless/rsi/rsi_91x_sdio_ops.c
index 8ace1874e5cb..b2b47a0abcbf 100644
--- a/drivers/net/wireless/rsi/rsi_91x_sdio_ops.c
+++ b/drivers/net/wireless/rsi/rsi_91x_sdio_ops.c
@@ -75,7 +75,7 @@ void rsi_sdio_rx_thread(struct rsi_common *common)

rsi_dbg(INFO_ZONE, "%s: Terminated SDIO RX thread\n", __func__);
atomic_inc(&sdev->rx_thread.thread_done);
- complete_and_exit(&sdev->rx_thread.completion, 0);
+ kthread_complete_and_exit(&sdev->rx_thread.completion, 0);
}

/**
diff --git a/drivers/net/wireless/rsi/rsi_91x_usb_ops.c b/drivers/net/wireless/rsi/rsi_91x_usb_ops.c
index 4ffcdde1acb1..5130b0e72adc 100644
--- a/drivers/net/wireless/rsi/rsi_91x_usb_ops.c
+++ b/drivers/net/wireless/rsi/rsi_91x_usb_ops.c
@@ -56,6 +56,6 @@ void rsi_usb_rx_thread(struct rsi_common *common)
out:
rsi_dbg(INFO_ZONE, "%s: Terminated thread\n", __func__);
skb_queue_purge(&dev->rx_q);
- complete_and_exit(&dev->rx_thread.completion, 0);
+ kthread_complete_and_exit(&dev->rx_thread.completion, 0);
}

diff --git a/drivers/pnp/pnpbios/core.c b/drivers/pnp/pnpbios/core.c
index 669ef4700c1a..f7e86ae9f72f 100644
--- a/drivers/pnp/pnpbios/core.c
+++ b/drivers/pnp/pnpbios/core.c
@@ -160,7 +160,7 @@ static int pnp_dock_thread(void *unused)
* No dock to manage
*/
case PNP_FUNCTION_NOT_SUPPORTED:
- complete_and_exit(&unload_sem, 0);
+ kthread_complete_and_exit(&unload_sem, 0);
case PNP_SYSTEM_NOT_DOCKED:
d = 0;
break;
@@ -170,7 +170,7 @@ static int pnp_dock_thread(void *unused)
default:
pnpbios_print_status("pnp_dock_thread", status);
printk(KERN_WARNING "PnPBIOS: disabling dock monitoring.\n");
- complete_and_exit(&unload_sem, 0);
+ kthread_complete_and_exit(&unload_sem, 0);
}
if (d != docked) {
if (pnp_dock_event(d, &now) == 0) {
@@ -183,7 +183,7 @@ static int pnp_dock_thread(void *unused)
}
}
}
- complete_and_exit(&unload_sem, 0);
+ kthread_complete_and_exit(&unload_sem, 0);
}

static int pnpbios_get_resources(struct pnp_dev *dev)
diff --git a/drivers/staging/rts5208/rtsx.c b/drivers/staging/rts5208/rtsx.c
index 91fcf85e150a..5a58dac76c88 100644
--- a/drivers/staging/rts5208/rtsx.c
+++ b/drivers/staging/rts5208/rtsx.c
@@ -450,13 +450,13 @@ static int rtsx_control_thread(void *__dev)
* after the down() -- that's necessary for the thread-shutdown
* case.
*
- * complete_and_exit() goes even further than this -- it is safe in
- * the case that the thread of the caller is going away (not just
- * the structure) -- this is necessary for the module-remove case.
- * This is important in preemption kernels, which transfer the flow
- * of execution immediately upon a complete().
+ * kthread_complete_and_exit() goes even further than this --
+ * it is safe in the case that the thread of the caller is going away
+ * (not just the structure) -- this is necessary for the module-remove
+ * case. This is important in preemption kernels, which transfer the
+ * flow of execution immediately upon a complete().
*/
- complete_and_exit(&dev->control_exit, 0);
+ kthread_complete_and_exit(&dev->control_exit, 0);
}

static int rtsx_polling_thread(void *__dev)
@@ -501,7 +501,7 @@ static int rtsx_polling_thread(void *__dev)
mutex_unlock(&dev->dev_mutex);
}

- complete_and_exit(&dev->polling_exit, 0);
+ kthread_complete_and_exit(&dev->polling_exit, 0);
}

/*
@@ -682,7 +682,7 @@ static int rtsx_scan_thread(void *__dev)
/* Should we unbind if no devices were detected? */
}

- complete_and_exit(&dev->scanning_done, 0);
+ kthread_complete_and_exit(&dev->scanning_done, 0);
}

static void rtsx_init_options(struct rtsx_chip *chip)
diff --git a/drivers/usb/atm/usbatm.c b/drivers/usb/atm/usbatm.c
index da17be1ef64e..e3a49d837609 100644
--- a/drivers/usb/atm/usbatm.c
+++ b/drivers/usb/atm/usbatm.c
@@ -969,7 +969,7 @@ static int usbatm_do_heavy_init(void *arg)
instance->thread = NULL;
mutex_unlock(&instance->serialize);

- complete_and_exit(&instance->thread_exited, ret);
+ kthread_complete_and_exit(&instance->thread_exited, ret);
}

static int usbatm_heavy_init(struct usbatm_data *instance)
diff --git a/drivers/usb/gadget/function/f_mass_storage.c b/drivers/usb/gadget/function/f_mass_storage.c
index 752439690fda..46dd11dcb3a8 100644
--- a/drivers/usb/gadget/function/f_mass_storage.c
+++ b/drivers/usb/gadget/function/f_mass_storage.c
@@ -2547,7 +2547,7 @@ static int fsg_main_thread(void *common_)
up_write(&common->filesem);

/* Let fsg_unbind() know the thread has exited */
- complete_and_exit(&common->thread_notifier, 0);
+ kthread_complete_and_exit(&common->thread_notifier, 0);
}

diff --git a/fs/jffs2/background.c b/fs/jffs2/background.c
index 2b4d5013dc5d..6da92ecaf66d 100644
--- a/fs/jffs2/background.c
+++ b/fs/jffs2/background.c
@@ -161,5 +161,5 @@ static int jffs2_garbage_collect_thread(void *_c)
spin_lock(&c->erase_completion_lock);
c->gc_task = NULL;
spin_unlock(&c->erase_completion_lock);
- complete_and_exit(&c->gc_thread_exit, 0);
+ kthread_complete_and_exit(&c->gc_thread_exit, 0);
}
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 77755ac3e189..055eb203c00e 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -187,7 +187,6 @@ static inline void might_fault(void) { }
#endif

void do_exit(long error_code) __noreturn;
-void complete_and_exit(struct completion *, long) __noreturn;

extern int num_to_str(char *buf, int size,
unsigned long long num, unsigned int width);
diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 22c43d419687..d86a7e3b9a52 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -71,6 +71,7 @@ int kthread_park(struct task_struct *k);
void kthread_unpark(struct task_struct *k);
void kthread_parkme(void);
void kthread_exit(long result) __noreturn;
+void kthread_complete_and_exit(struct completion *, long) __noreturn;

int kthreadd(void *unused);
extern struct task_struct *kthreadd_task;
diff --git a/kernel/exit.c b/kernel/exit.c
index 57afac845a0a..6c4b04531f17 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -891,15 +891,6 @@ void __noreturn make_task_dead(int signr)
do_exit(signr);
}

-void complete_and_exit(struct completion *comp, long code)
-{
- if (comp)
- complete(comp);
-
- do_exit(code);
-}
-EXPORT_SYMBOL(complete_and_exit);
-
SYSCALL_DEFINE1(exit, int, error_code)
{
do_exit((error_code&0xff)<<8);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 77b7c3f23f18..4388d6694a7f 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -283,6 +283,27 @@ void __noreturn kthread_exit(long result)
do_exit(result);
}

+/**
+ * kthread_complete_and_exit - Exit the current kthread.
+ * @comp: Completion to complete
+ * @code: The integer value to return to kthread_stop().
+ *
+ * If present, complete @comp and then return code to kthread_stop().
+ *
+ * A kernel thread whose module may be removed after the completion of
+ * @comp can use this function exit safely.
+ *
+ * Does not return.
+ */
+void __noreturn kthread_complete_and_exit(struct completion *comp, long code)
+{
+ if (comp)
+ complete(comp);
+
+ kthread_exit(code);
+}
+EXPORT_SYMBOL(kthread_complete_and_exit);
+
static int kthread(void *_create)
{
static const struct sched_param param = { .sched_priority = 0 };
diff --git a/lib/kunit/try-catch.c b/lib/kunit/try-catch.c
index 0dd434e40487..be38a2c5ecc2 100644
--- a/lib/kunit/try-catch.c
+++ b/lib/kunit/try-catch.c
@@ -17,7 +17,7 @@
void __noreturn kunit_try_catch_throw(struct kunit_try_catch *try_catch)
{
try_catch->try_result = -EFAULT;
- complete_and_exit(try_catch->try_completion, -EFAULT);
+ kthread_complete_and_exit(try_catch->try_completion, -EFAULT);
}
EXPORT_SYMBOL_GPL(kunit_try_catch_throw);

@@ -27,7 +27,7 @@ static int kunit_generic_run_threadfn_adapter(void *data)

try_catch->try(try_catch->context);

- complete_and_exit(try_catch->try_completion, 0);
+ kthread_complete_and_exit(try_catch->try_completion, 0);
}

static unsigned long kunit_test_timeout(void)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 120e9598c11a..282273a1ffa5 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -171,7 +171,7 @@ static bool __dead_end_function(struct objtool_file *file, struct symbol *func,
"kthread_exit",
"make_task_dead",
"__module_put_and_kthread_exit",
- "complete_and_exit",
+ "kthread_complete_and_exit",
"__reiserfs_panic",
"lbug_with_loc",
"fortify_panic",
--
2.29.2

2021-12-08 20:26:53

by Eric W. Biederman

Subject: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads

Today the rules are a bit iffy and arbitrary about which kernel
threads have struct kthread present. Both idle threads and threads
started with create_kthread want struct kthread present so that is
effectively all kernel threads. Make the rule that if PF_KTHREAD is set
and the task is running then struct kthread is present.

This will allow the kernel thread code to use tsk->exit_code
with different semantics from ordinary processes.

To ensure that struct kthread is present for all
kernel threads move its allocation into copy_process.

Add a deallocation of struct kthread in exec for processes
that were kernel threads.

Move the allocation of struct kthread for the initial thread
earlier so that it is not repeated for each additional idle
thread.

Move the initialization of struct kthread into set_kthread_struct
so that the structure is always and reliably initialized.

Clear set_child_tid in free_kthread_struct to ensure the kthread
struct is reliably freed during exec. The function
free_kthread_struct does not need to clear vfork_done during exec as
exec_mm_release called from exec_mmap has already cleared vfork_done.
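
The allocate/free contract described above can be sketched in userspace
as follows. This is a simplified model, not the kernel implementation:
model_task, model_kthread, model_set_kthread_struct and
model_free_kthread_struct are hypothetical names, calloc() stands in for
kzalloc(), and the two flags stand in for the init_completion() calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct model_kthread {
	int exited_initialized;	/* stands in for init_completion(&exited) */
	int parked_initialized;	/* stands in for init_completion(&parked) */
};

struct model_task {
	void *set_child_tid;	/* reused to hold the kthread pointer */
};

/* Returns true on success, false if already set or allocation failed. */
static bool model_set_kthread_struct(struct model_task *p)
{
	struct model_kthread *kthread;

	assert(p != NULL);
	if (p->set_child_tid)	/* the patch warns once and fails here */
		return false;

	kthread = calloc(1, sizeof(*kthread));
	if (!kthread)
		return false;

	/* Initialize at allocation time so the struct is never half set up. */
	kthread->exited_initialized = 1;
	kthread->parked_initialized = 1;
	p->set_child_tid = kthread;
	return true;
}

/* Frees the struct and clears the pointer so no stale reference remains. */
static void model_free_kthread_struct(struct model_task *k)
{
	struct model_kthread *kthread = k->set_child_tid;

	k->set_child_tid = NULL;
	free(kthread);
}
```

Clearing the pointer in the free path is what lets a task that stops
being a kernel thread (the exec case) carry on without a dangling
reference, and it also makes a later set call succeed again in this
model.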

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/exec.c | 2 ++
include/linux/kthread.h | 2 +-
kernel/fork.c | 4 ++++
kernel/kthread.c | 31 ++++++++++++++-----------------
kernel/sched/core.c | 16 ++++++++--------
5 files changed, 29 insertions(+), 26 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 537d92c41105..59cac7c18178 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1307,6 +1307,8 @@ int begin_new_exec(struct linux_binprm * bprm)
*/
force_uaccess_begin();

+ if (me->flags & PF_KTHREAD)
+ free_kthread_struct(me);
me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
PF_NOFREEZE | PF_NO_SETAFFINITY);
flush_thread();
diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index d86a7e3b9a52..4f3433afb54b 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -33,7 +33,7 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
unsigned int cpu,
const char *namefmt);

-void set_kthread_struct(struct task_struct *p);
+bool set_kthread_struct(struct task_struct *p);

void kthread_set_per_cpu(struct task_struct *k, int cpu);
bool kthread_is_per_cpu(struct task_struct *k);
diff --git a/kernel/fork.c b/kernel/fork.c
index 3244cc56b697..04fa3e5d97af 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2118,6 +2118,10 @@ static __latent_entropy struct task_struct *copy_process(
p->io_context = NULL;
audit_set_context(p, NULL);
cgroup_fork(p);
+ if (p->flags & PF_KTHREAD) {
+ if (!set_kthread_struct(p))
+ goto bad_fork_cleanup_threadgroup_lock;
+ }
#ifdef CONFIG_NUMA
p->mempolicy = mpol_dup(p->mempolicy);
if (IS_ERR(p->mempolicy)) {
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 4388d6694a7f..8e5f44bed027 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -93,20 +93,27 @@ static inline struct kthread *__to_kthread(struct task_struct *p)
return kthread;
}

-void set_kthread_struct(struct task_struct *p)
+bool set_kthread_struct(struct task_struct *p)
{
struct kthread *kthread;

- if (__to_kthread(p))
- return;
+ if (WARN_ON_ONCE(to_kthread(p)))
+ return false;

kthread = kzalloc(sizeof(*kthread), GFP_KERNEL);
+ if (!kthread)
+ return false;
+
+ init_completion(&kthread->exited);
+ init_completion(&kthread->parked);
+ p->vfork_done = &kthread->exited;
+
/*
* We abuse ->set_child_tid to avoid the new member and because it
- * can't be wrongly copied by copy_process(). We also rely on fact
- * that the caller can't exec, so PF_KTHREAD can't be cleared.
+ * can't be wrongly copied by copy_process().
*/
p->set_child_tid = (__force void __user *)kthread;
+ return true;
}

void free_kthread_struct(struct task_struct *k)
@@ -114,13 +121,13 @@ void free_kthread_struct(struct task_struct *k)
struct kthread *kthread;

/*
- * Can be NULL if this kthread was created by kernel_thread()
- * or if kmalloc() in kthread() failed.
+ * Can be NULL if kmalloc() in set_kthread_struct() failed.
*/
kthread = to_kthread(k);
#ifdef CONFIG_BLK_CGROUP
WARN_ON_ONCE(kthread && kthread->blkcg_css);
#endif
+ k->set_child_tid = (__force void __user *)NULL;
kfree(kthread);
}

@@ -315,7 +322,6 @@ static int kthread(void *_create)
struct kthread *self;
int ret;

- set_kthread_struct(current);
self = to_kthread(current);

/* If user was SIGKILLed, I release the structure. */
@@ -325,17 +331,8 @@ static int kthread(void *_create)
kthread_exit(-EINTR);
}

- if (!self) {
- create->result = ERR_PTR(-ENOMEM);
- complete(done);
- kthread_exit(-ENOMEM);
- }
-
self->threadfn = threadfn;
self->data = data;
- init_completion(&self->exited);
- init_completion(&self->parked);
- current->vfork_done = &self->exited;

/*
* The new thread inherited kthreadd's priority and CPU mask. Reset
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3c9b0fda64ac..0404a8c572a1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8599,14 +8599,6 @@ void __init init_idle(struct task_struct *idle, int cpu)

__sched_fork(0, idle);

- /*
- * The idle task doesn't need the kthread struct to function, but it
- * is dressed up as a per-CPU kthread and thus needs to play the part
- * if we want to avoid special-casing it in code that deals with per-CPU
- * kthreads.
- */
- set_kthread_struct(idle);
-
raw_spin_lock_irqsave(&idle->pi_lock, flags);
raw_spin_rq_lock(rq);

@@ -9427,6 +9419,14 @@ void __init sched_init(void)
mmgrab(&init_mm);
enter_lazy_tlb(&init_mm, current);

+ /*
+ * The idle task doesn't need the kthread struct to function, but it
+ * is dressed up as a per-CPU kthread and thus needs to play the part
+ * if we want to avoid special-casing it in code that deals with per-CPU
+ * kthreads.
+ */
+ WARN_ON(set_kthread_struct(current));
+
/*
* Make us the idle thread. Technically, schedule() should not be
* called from this thread, however somewhere below it might be,
--
2.29.2

2021-12-08 20:26:55

by Eric W. Biederman

Subject: [PATCH 10/10] exit/kthread: Move the exit code for kernel threads into struct kthread

The exit code of kernel threads has different semantics than the
exit_code of userspace tasks. To avoid confusion and allow
the userspace implementation to change as needed move
the kernel thread exit code into struct kthread.
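
As a minimal sketch of the idea (a userspace model with hypothetical
names model_task, model_kthread, model_kthread_exit and
model_kthread_stop, none of which are kernel APIs), the result travels
through the kthread structure while exit_code is left with its userspace
meaning:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_kthread {
	int result;		/* kernel-style status such as 0 or -EINTR */
};

struct model_task {
	int exit_code;		/* keeps its userspace wait-status meaning */
	struct model_kthread kthread;
};

/* Models kthread_exit(): stash the result, then exit with code 0. */
static void model_kthread_exit(struct model_task *t, long result)
{
	assert(t != NULL);
	t->kthread.result = (int)result;
	t->exit_code = 0;
}

/* Models the tail of kthread_stop(): read the result, not exit_code. */
static int model_kthread_stop(const struct model_task *t)
{
	return t->kthread.result;
}
```

With this split a negative errno never lands in exit_code, so the
userspace encoding of exit_code can change without affecting what
kthread_stop() reports.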

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/kthread.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 8e5f44bed027..9c6c532047c4 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -52,6 +52,7 @@ struct kthread_create_info
struct kthread {
unsigned long flags;
unsigned int cpu;
+ int result;
int (*threadfn)(void *);
void *data;
mm_segment_t oldfs;
@@ -287,7 +288,9 @@ EXPORT_SYMBOL_GPL(kthread_parkme);
*/
void __noreturn kthread_exit(long result)
{
- do_exit(result);
+ struct kthread *kthread = to_kthread(current);
+ kthread->result = result;
+ do_exit(0);
}

/**
@@ -679,7 +682,7 @@ int kthread_stop(struct task_struct *k)
kthread_unpark(k);
wake_up_process(k);
wait_for_completion(&kthread->exited);
- ret = k->exit_code;
+ ret = kthread->result;
put_task_struct(k);

trace_sched_kthread_stop_ret(ret);
--
2.29.2

2021-12-12 17:49:14

by Heiko Carstens

Subject: Re: [PATCH 01/10] exit/s390: Remove dead reference to do_exit from copy_thread

On Wed, Dec 08, 2021 at 02:25:23PM -0600, Eric W. Biederman wrote:
> My s390 assembly is not particularly good so I have read the history
> of the reference to do_exit copy_thread and have been able to
> verify that do_exit is not used.
>
> The general argument is that s390 has been changed to use the generic
> kernel_thread and kernel_execve and the generic versions do not call
> do_exit. So it is strange to see a do_exit reference sitting there.
>
> The history of the do_exit reference in s390's version of copy_thread
> seems conclusive that the do_exit reference is something that lingers
> and should have been removed several years ago.
...
> Remove this dead reference to do_exit to make it clear that s390 is
> not doing anything with do_exit in copy_thread.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> arch/s390/kernel/process.c | 1 -
> 1 file changed, 1 deletion(-)

Applied to s390 tree. Just in case you want to apply this to your tree too:
Acked-by: Heiko Carstens <[email protected]>

2021-12-13 14:51:30

by Eric W. Biederman

Subject: Re: [PATCH 01/10] exit/s390: Remove dead reference to do_exit from copy_thread

Heiko Carstens <[email protected]> writes:

> On Wed, Dec 08, 2021 at 02:25:23PM -0600, Eric W. Biederman wrote:
>> My s390 assembly is not particularly good so I have read the history
>> of the reference to do_exit copy_thread and have been able to
>> verify that do_exit is not used.
>>
>> The general argument is that s390 has been changed to use the generic
>> kernel_thread and kernel_execve and the generic versions do not call
>> do_exit. So it is strange to see a do_exit reference sitting there.
>>
>> The history of the do_exit reference in s390's version of copy_thread
>> seems conclusive that the do_exit reference is something that lingers
>> and should have been removed several years ago.
> ...
>> Remove this dead reference to do_exit to make it clear that s390 is
>> not doing anything with do_exit in copy_thread.
>>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> arch/s390/kernel/process.c | 1 -
>> 1 file changed, 1 deletion(-)
>
> Applied to s390 tree. Just in case you want to apply this to your tree too:
> Acked-by: Heiko Carstens <[email protected]>

Thank you for looking at this and confirming that I had read the code
properly and that the do_exit reference was no longer used.

I will probably take this through my tree as well just so I don't have
that trailing do_exit reference.

At this point I will give things a bit more time for people to review or
say something about the other changes, and if there is no negative
feedback I think I will just apply the lot.

Eric

2021-12-13 22:51:08

by Eric W. Biederman

Subject: [PATCH 0/8] signal: Cleanup of the signal->flags

The special case of SIGKILL during coredumps is very fragile today and
while reading through the code I realized I have almost broken it twice.
So this simplifies that special case, removes SIGNAL_GROUP_COREDUMP
which has become unnecessary with the addition of signal->core_state,
and this removes the helper signal_group_exit which is misnamed and
is not used properly.

If you squint very hard there might be a user space visible difference
in behavior somewhere but I don't think there is one in practice.

These patches are on top of:
https://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git/ signal-for-v5.17

After these patches have been reviewed it is my plan to apply them to my
signal-for-v5.17 branch.

Eric W. Biederman (8):
signal: Make SIGKILL during coredumps an explicit special case
signal: Drop signals received after a fatal signal has been processed
signal: Have the oom killer detect coredumps using signal->core_state
signal: During coredumps set SIGNAL_GROUP_EXIT in zap_process
signal: Remove SIGNAL_GROUP_COREDUMP
coredump: Stop setting signal->group_exit_task
signal: Rename group_exit_task group_exec_task
signal: Remove the helper signal_group_exit

fs/coredump.c | 20 +++++++++-----------
fs/exec.c | 10 +++++-----
include/linux/sched/signal.h | 18 +++---------------
kernel/exit.c | 12 ++++++++----
kernel/signal.c | 24 ++++++++++++++++--------
mm/oom_kill.c | 2 +-
6 files changed, 42 insertions(+), 44 deletions(-)

Eric

2021-12-13 22:54:41

by Eric W. Biederman

Subject: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Simplify the code that allows SIGKILL during coredumps to terminate
the coredump. As far as I can tell I have avoided breaking it
by dumb luck.

Historically with all of the other threads stopping in exit_mm the
wants_signal loop in complete_signal would find the dumper task and
then complete_signal would wake the dumper task with signal_wake_up.

After moving the coredump_task_exit above the setting of PF_EXITING in
commit 92307383082d ("coredump: Don't perform any cleanups before
dumping core") wants_signal will consider all of the threads in a
multi-threaded process for waking up, not just the core dumping task.

Luckily complete_signal short-circuits SIGKILL during a coredump and
marks every thread with SIGKILL and signal_wake_up. This code is
arguably buggy, however, as it tries to skip creating a group exit when
one is already present, and it fails to notice that a coredump is in
progress.

Ever since commit 06af8679449d ("coredump: Limit what can interrupt
coredumps") was added dump_interrupted needs not just TIF_SIGPENDING
set on the dumper task but also SIGKILL set in its pending bitmap.
This means that if the code is ever fixed not to short-circuit and
kill a process after it has already been killed the special case
for SIGKILL during a coredump will be broken.

Sort all of this out by making the coredump special case more special,
and perform all of the work in prepare_signal and leave the rest of
the signal delivery path out of it.

In prepare_signal, when a coredumping process is sent SIGKILL, find
the task performing the coredump and use sigaddset and signal_wake_up
to ensure that task reports fatal_signal_pending.

Return false from prepare_signal to tell the rest of the signal
delivery path to ignore the signal.

Update wait_for_dump_helpers to perform a wait_event_killable wait
so that if signal_pending gets set spuriously the wait will not
be interrupted unless fatal_signal_pending is true.
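
The behaviour described above can be sketched as a small userspace
model. This is only an illustration under stated assumptions: every
model_* and MODEL_* name is hypothetical, the stop/continue handling in
prepare_signal is left out, and the final sig_ignored() check is
modelled as "not ignored".

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SIGKILL		9
#define MODEL_SIGTERM		15
#define MODEL_GROUP_EXIT	0x4u	/* stands in for SIGNAL_GROUP_EXIT */
#define MODEL_GROUP_COREDUMP	0x8u	/* stands in for SIGNAL_GROUP_COREDUMP */

/* Hypothetical stand-in for the dumper's task_struct. */
struct model_task {
	bool tif_sigpending;	/* models TIF_SIGPENDING */
	bool sigkill_queued;	/* models SIGKILL in pending.signal */
};

/* Hypothetical stand-in for signal_struct. */
struct model_signal {
	unsigned int flags;
	struct model_task *dumper;	/* non-NULL models core_state being set */
};

/* Models fatal_signal_pending(): both conditions must hold. */
bool model_fatal_signal_pending(const struct model_task *t)
{
	return t->tif_sigpending && t->sigkill_queued;
}

/*
 * Models the start of prepare_signal() after this patch. Returns false
 * when the rest of the delivery path should ignore the signal.
 */
bool model_prepare_signal(struct model_signal *sig, int signo)
{
	if (sig->flags & (MODEL_GROUP_EXIT | MODEL_GROUP_COREDUMP)) {
		if (sig->dumper) {
			if (signo == MODEL_SIGKILL) {
				sig->dumper->sigkill_queued = true; /* sigaddset */
				sig->dumper->tif_sigpending = true; /* signal_wake_up */
			}
			return false;
		}
	}
	return true;	/* simplification: treated as "not ignored" */
}
```

A killable wait in this model would only give up once
model_fatal_signal_pending() is true, so a spuriously set
tif_sigpending on its own does not interrupt the dumper.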

I have tested this and verified I did not break SIGKILL during
coredumps by accident (before or after this change). I actually
thought I had and I had to figure out what I had misread that kept
SIGKILL during coredumps working.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 4 ++--
kernel/signal.c | 11 +++++++++--
2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index a6b3c196cdef..7b91fb32dbb8 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -448,7 +448,7 @@ static void coredump_finish(bool core_dumped)
static bool dump_interrupted(void)
{
/*
- * SIGKILL or freezing() interrupt the coredumping. Perhaps we
+ * SIGKILL or freezing() interrupted the coredumping. Perhaps we
* can do try_to_freeze() and check __fatal_signal_pending(),
* but then we need to teach dump_write() to restart and clear
* TIF_SIGPENDING.
@@ -471,7 +471,7 @@ static void wait_for_dump_helpers(struct file *file)
* We actually want wait_event_freezable() but then we need
* to clear TIF_SIGPENDING and improve dump_interrupted().
*/
- wait_event_interruptible(pipe->rd_wait, pipe->readers == 1);
+ wait_event_killable(pipe->rd_wait, pipe->readers == 1);

pipe_lock(pipe);
pipe->readers--;
diff --git a/kernel/signal.c b/kernel/signal.c
index 8272cac5f429..7e305a8ec7c2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -907,8 +907,15 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
sigset_t flush;

if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
- if (!(signal->flags & SIGNAL_GROUP_EXIT))
- return sig == SIGKILL;
+ struct core_state *core_state = signal->core_state;
+ if (core_state) {
+ if (sig == SIGKILL) {
+ struct task_struct *dumper = core_state->dumper.task;
+ sigaddset(&dumper->pending.signal, SIGKILL);
+ signal_wake_up(dumper, 1);
+ }
+ return false;
+ }
/*
* The process is in the middle of dying, nothing to do.
*/
--
2.29.2

2021-12-13 22:54:44

by Eric W. Biederman

Subject: [PATCH 2/8] signal: Drop signals received after a fatal signal has been processed

In 403bad72b67d ("coredump: only SIGKILL should interrupt the
coredumping task") Oleg modified the kernel to drop all signals that
come in during a coredump except SIGKILL, and suggested that it might
be a good idea to generalize that to other cases after the process has
received a fatal signal.

Semantically it does not make sense to perform any signal delivery
after the process has already been killed.

When a signal is sent while a process is dying today the signal is
placed in the signal queue by __send_signal and a single task of the
process is woken up with signal_wake_up, if there are any tasks that
have not set PF_EXITING.

Take things one step further and have prepare_signal report that all
signals that come after a process has been killed should be ignored,
while retaining the historical exception of allowing SIGKILL to
interrupt coredumps.

Remove the SIGNAL_GROUP_EXIT test from complete_signal, as it is no
longer possible for signal processing to reach complete_signal when
SIGNAL_GROUP_EXIT is true.

Update the comment in fs/coredump.c to make it clear coredumps are
special in being able to receive SIGKILL.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 2 +-
kernel/signal.c | 5 ++---
2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 7b91fb32dbb8..a9c25f20118f 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -352,7 +352,7 @@ static int zap_process(struct task_struct *start, int exit_code, int flags)
struct task_struct *t;
int nr = 0;

- /* ignore all signals except SIGKILL, see prepare_signal() */
+ /* Allow SIGKILL, see prepare_signal() */
start->signal->flags = SIGNAL_GROUP_COREDUMP | flags;
start->signal->group_exit_code = exit_code;
start->signal->group_stop_count = 0;
diff --git a/kernel/signal.c b/kernel/signal.c
index 7e305a8ec7c2..cdccbacac685 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -914,11 +914,11 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
sigaddset(&dumper->pending.signal, SIGKILL);
signal_wake_up(dumper, 1);
}
- return false;
}
/*
- * The process is in the middle of dying, nothing to do.
+ * The process is in the middle of dying, drop the signal.
*/
+ return false;
} else if (sig_kernel_stop(sig)) {
/*
* This is a stop signal. Remove SIGCONT from all queues.
@@ -1039,7 +1039,6 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
* then start taking the whole group down immediately.
*/
if (sig_fatal(p, sig) &&
- !(signal->flags & SIGNAL_GROUP_EXIT) &&
!sigismember(&t->real_blocked, sig) &&
(sig == SIGKILL || !p->ptrace)) {
/*
--
2.29.2

2021-12-13 22:54:47

by Eric W. Biederman

Subject: [PATCH 3/8] signal: Have the oom killer detect coredumps using signal->core_state

In preparation for removing the flag SIGNAL_GROUP_COREDUMP change
__task_will_free_mem to test signal->core_state instead of the flag
SIGNAL_GROUP_COREDUMP.

Both fields are protected by siglock and both live in signal_struct so
there are no real tradeoffs here, just a change to which field is
being tested.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
mm/oom_kill.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ddabefcfb5a..5c92aad8ca1a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -793,7 +793,7 @@ static inline bool __task_will_free_mem(struct task_struct *task)
* coredump_task_exit(), so the oom killer cannot assume that
* the process will promptly exit and release memory.
*/
- if (sig->flags & SIGNAL_GROUP_COREDUMP)
+ if (sig->core_state)
return false;

if (sig->flags & SIGNAL_GROUP_EXIT)
--
2.29.2

2021-12-13 22:54:51

by Eric W. Biederman

Subject: [PATCH 4/8] signal: During coredumps set SIGNAL_GROUP_EXIT in zap_process

There are only a few places that test SIGNAL_GROUP_EXIT and
are not also already testing SIGNAL_GROUP_COREDUMP.

This will not affect the callers of signal_group_exit as zap_process
also sets group_exit_task so signal_group_exit will continue to return
true at the same times.

This does not affect wait_task_zombie as none of the threads
wind up in EXIT_ZOMBIE state during a coredump.

This does not affect oom_kill.c:__task_will_free_mem as
sig->core_state is tested and handled before SIGNAL_GROUP_EXIT is
tested for.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index a9c25f20118f..5e5a90de7be3 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -347,13 +347,13 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
return ispipe;
}

-static int zap_process(struct task_struct *start, int exit_code, int flags)
+static int zap_process(struct task_struct *start, int exit_code)
{
struct task_struct *t;
int nr = 0;

/* Allow SIGKILL, see prepare_signal() */
- start->signal->flags = SIGNAL_GROUP_COREDUMP | flags;
+ start->signal->flags = SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP;
start->signal->group_exit_code = exit_code;
start->signal->group_stop_count = 0;

@@ -378,7 +378,7 @@ static int zap_threads(struct task_struct *tsk,
if (!signal_group_exit(tsk->signal)) {
tsk->signal->core_state = core_state;
tsk->signal->group_exit_task = tsk;
- nr = zap_process(tsk, exit_code, 0);
+ nr = zap_process(tsk, exit_code);
clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
tsk->flags |= PF_DUMPCORE;
atomic_set(&core_state->nr_threads, nr);
--
2.29.2

2021-12-13 22:54:55

by Eric W. Biederman

Subject: [PATCH 6/8] coredump: Stop setting signal->group_exit_task

Currently the coredump code sets group_exit_task so that
signal_group_exit() will return true during a coredump. Now that the
coredump code always sets SIGNAL_GROUP_EXIT there is no longer a need
to set signal->group_exit_task.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 2c9d16d4b57a..ef56595a0d87 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -377,7 +377,6 @@ static int zap_threads(struct task_struct *tsk,
spin_lock_irq(&tsk->sighand->siglock);
if (!signal_group_exit(tsk->signal)) {
tsk->signal->core_state = core_state;
- tsk->signal->group_exit_task = tsk;
nr = zap_process(tsk, exit_code);
clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
tsk->flags |= PF_DUMPCORE;
@@ -426,7 +425,6 @@ static void coredump_finish(bool core_dumped)
spin_lock_irq(&current->sighand->siglock);
if (core_dumped && !__fatal_signal_pending(current))
current->signal->group_exit_code |= 0x80;
- current->signal->group_exit_task = NULL;
next = current->signal->core_state->dumper.next;
current->signal->core_state = NULL;
spin_unlock_irq(&current->sighand->siglock);
--
2.29.2

2021-12-13 22:54:58

by Eric W. Biederman

Subject: [PATCH 5/8] signal: Remove SIGNAL_GROUP_COREDUMP

After the previous cleanups "signal->core_state" is set whenever
SIGNAL_GROUP_COREDUMP is set and "signal->core_state" is tested
whenever the code wants to know if a coredump is in progress. The
remaining tests of SIGNAL_GROUP_COREDUMP also test to see if
SIGNAL_GROUP_EXIT is set. Similarly the only place that sets
SIGNAL_GROUP_COREDUMP also sets SIGNAL_GROUP_EXIT.

This makes SIGNAL_GROUP_COREDUMP unnecessary and redundant, so
stop setting SIGNAL_GROUP_COREDUMP, stop testing SIGNAL_GROUP_COREDUMP,
and remove its definition, making the code slightly simpler.

With the setting of SIGNAL_GROUP_COREDUMP gone coredump_finish no
longer needs to clear SIGNAL_GROUP_COREDUMP out of signal->flags
by setting SIGNAL_GROUP_EXIT.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 3 +--
include/linux/sched/signal.h | 3 +--
kernel/signal.c | 2 +-
3 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 5e5a90de7be3..2c9d16d4b57a 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -353,7 +353,7 @@ static int zap_process(struct task_struct *start, int exit_code)
int nr = 0;

/* Allow SIGKILL, see prepare_signal() */
- start->signal->flags = SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP;
+ start->signal->flags = SIGNAL_GROUP_EXIT;
start->signal->group_exit_code = exit_code;
start->signal->group_stop_count = 0;

@@ -427,7 +427,6 @@ static void coredump_finish(bool core_dumped)
if (core_dumped && !__fatal_signal_pending(current))
current->signal->group_exit_code |= 0x80;
current->signal->group_exit_task = NULL;
- current->signal->flags = SIGNAL_GROUP_EXIT;
next = current->signal->core_state->dumper.next;
current->signal->core_state = NULL;
spin_unlock_irq(&current->sighand->siglock);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index fa26d2a58413..ecc10e148799 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -256,7 +256,6 @@ struct signal_struct {
#define SIGNAL_STOP_STOPPED 0x00000001 /* job control stop in effect */
#define SIGNAL_STOP_CONTINUED 0x00000002 /* SIGCONT since WCONTINUED reap */
#define SIGNAL_GROUP_EXIT 0x00000004 /* group exit in progress */
-#define SIGNAL_GROUP_COREDUMP 0x00000008 /* coredump in progress */
/*
* Pending notifications to parent.
*/
@@ -272,7 +271,7 @@ struct signal_struct {
static inline void signal_set_stop_flags(struct signal_struct *sig,
unsigned int flags)
{
- WARN_ON(sig->flags & (SIGNAL_GROUP_EXIT|SIGNAL_GROUP_COREDUMP));
+ WARN_ON(sig->flags & SIGNAL_GROUP_EXIT);
sig->flags = (sig->flags & ~SIGNAL_STOP_MASK) | flags;
}

diff --git a/kernel/signal.c b/kernel/signal.c
index cdccbacac685..9eb3e2c1f9f7 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -906,7 +906,7 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
struct task_struct *t;
sigset_t flush;

- if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
+ if (signal->flags & SIGNAL_GROUP_EXIT) {
struct core_state *core_state = signal->core_state;
if (core_state) {
if (sig == SIGKILL) {
--
2.29.2

2021-12-13 22:55:01

by Eric W. Biederman

Subject: [PATCH 7/8] signal: Rename group_exit_task group_exec_task

The only remaining user of group_exit_task is exec. Rename the field
so that it is clear which part of the code uses it.

Update the comment above the definition of group_exec_task
to document how it is currently used.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/exec.c | 8 ++++----
include/linux/sched/signal.h | 12 ++++--------
kernel/exit.c | 4 ++--
3 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 59cac7c18178..9d2925811011 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1054,7 +1054,7 @@ static int de_thread(struct task_struct *tsk)
return -EAGAIN;
}

- sig->group_exit_task = tsk;
+ sig->group_exec_task = tsk;
sig->notify_count = zap_other_threads(tsk);
if (!thread_group_leader(tsk))
sig->notify_count--;
@@ -1082,7 +1082,7 @@ static int de_thread(struct task_struct *tsk)
write_lock_irq(&tasklist_lock);
/*
* Do this under tasklist_lock to ensure that
- * exit_notify() can't miss ->group_exit_task
+ * exit_notify() can't miss ->group_exec_task
*/
sig->notify_count = -1;
if (likely(leader->exit_state))
@@ -1149,7 +1149,7 @@ static int de_thread(struct task_struct *tsk)
release_task(leader);
}

- sig->group_exit_task = NULL;
+ sig->group_exec_task = NULL;
sig->notify_count = 0;

no_thread_group:
@@ -1162,7 +1162,7 @@ static int de_thread(struct task_struct *tsk)
killed:
/* protects against exit_notify() and __exit_signal() */
read_lock(&tasklist_lock);
- sig->group_exit_task = NULL;
+ sig->group_exec_task = NULL;
sig->notify_count = 0;
read_unlock(&tasklist_lock);
return -EAGAIN;
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index ecc10e148799..d3248aba5183 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -109,13 +109,9 @@ struct signal_struct {

/* thread group exit support */
int group_exit_code;
- /* overloaded:
- * - notify group_exit_task when ->count is equal to notify_count
- * - everyone except group_exit_task is stopped during signal delivery
- * of fatal signals, group_exit_task processes the signal.
- */
+ /* notify group_exec_task when notify_count is less or equal to 0 */
int notify_count;
- struct task_struct *group_exit_task;
+ struct task_struct *group_exec_task;

/* thread group stop support, overloads group_exit_code too */
int group_stop_count;
@@ -275,11 +271,11 @@ static inline void signal_set_stop_flags(struct signal_struct *sig,
sig->flags = (sig->flags & ~SIGNAL_STOP_MASK) | flags;
}

-/* If true, all threads except ->group_exit_task have pending SIGKILL */
+/* If true, all threads except ->group_exec_task have pending SIGKILL */
static inline int signal_group_exit(const struct signal_struct *sig)
{
return (sig->flags & SIGNAL_GROUP_EXIT) ||
- (sig->group_exit_task != NULL);
+ (sig->group_exec_task != NULL);
}

extern void flush_signals(struct task_struct *);
diff --git a/kernel/exit.c b/kernel/exit.c
index 6c4b04531f17..527c5e4430ae 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -116,7 +116,7 @@ static void __exit_signal(struct task_struct *tsk)
* then notify it:
*/
if (sig->notify_count > 0 && !--sig->notify_count)
- wake_up_process(sig->group_exit_task);
+ wake_up_process(sig->group_exec_task);

if (tsk == sig->curr_target)
sig->curr_target = next_thread(tsk);
@@ -697,7 +697,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)

/* mt-exec, de_thread() is waiting for group leader */
if (unlikely(tsk->signal->notify_count < 0))
- wake_up_process(tsk->signal->group_exit_task);
+ wake_up_process(tsk->signal->group_exec_task);
write_unlock_irq(&tasklist_lock);

list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
--
2.29.2

2021-12-13 22:55:04

by Eric W. Biederman

Subject: [PATCH 8/8] signal: Remove the helper signal_group_exit

This helper is misleading. It tests for an ongoing exec as well as
the process having received a fatal signal.

Sometimes it is appropriate to treat an on-going exec differently than
a process that is shutting down due to a fatal signal. In particular
taking the fast path out of exit_signals instead of retargeting
signals is not appropriate during exec, and not changing the exit
code in do_group_exit during exec.

Removing the helper makes it more obvious what is going on, as both
cases must be coded for explicitly.

While removing the helper, fix the two cases where I have observed that
using signal_group_exit produced the wrong result.

For the unset exit_code in do_group_exit during an exec I use 0 as I
think that is what group_exit_code has been set to most of the time.
During a thread group stop group_exit_code is set to the stop signal
and when the thread group receives SIGCONT group_exit_code is reset to
0.
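
The exit code selection described above can be sketched as a userspace
model. This is an illustration under assumptions, not the kernel
function: the model_* names are hypothetical, locking is omitted, and
the model does not record the chosen code back into the signal struct
the way the final else branch of do_group_exit does.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant parts of signal_struct. */
struct model_signal {
	bool group_exit;	/* models SIGNAL_GROUP_EXIT in signal->flags */
	bool exec_in_progress;	/* models signal->group_exec_task != NULL */
	int  group_exit_code;	/* only meaningful when group_exit is set */
};

/* Models which exit code do_group_exit() ends up using. */
int model_group_exit_code(const struct model_signal *sig, int requested)
{
	if (sig->group_exit)
		return sig->group_exit_code;	/* a group exit already chose it */
	if (sig->exec_in_progress)
		return 0;			/* exec case: code was never set */
	return requested;			/* this caller starts the exit */
}
```

With the old helper the first two cases were indistinguishable, so the
exec case read group_exit_code even though nothing had set it for an
exit.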

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 5 +++--
fs/exec.c | 2 +-
include/linux/sched/signal.h | 7 -------
kernel/exit.c | 8 ++++++--
kernel/signal.c | 8 +++++---
5 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index ef56595a0d87..09302a6a0d80 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -372,11 +372,12 @@ static int zap_process(struct task_struct *start, int exit_code)
static int zap_threads(struct task_struct *tsk,
struct core_state *core_state, int exit_code)
{
+ struct signal_struct *signal = tsk->signal;
int nr = -EAGAIN;

spin_lock_irq(&tsk->sighand->siglock);
- if (!signal_group_exit(tsk->signal)) {
- tsk->signal->core_state = core_state;
+ if (!(signal->flags & SIGNAL_GROUP_EXIT) && !signal->group_exec_task) {
+ signal->core_state = core_state;
nr = zap_process(tsk, exit_code);
clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
tsk->flags |= PF_DUMPCORE;
diff --git a/fs/exec.c b/fs/exec.c
index 9d2925811011..82db656ca709 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1045,7 +1045,7 @@ static int de_thread(struct task_struct *tsk)
* Kill all other threads in the thread group.
*/
spin_lock_irq(lock);
- if (signal_group_exit(sig)) {
+ if ((sig->flags & SIGNAL_GROUP_EXIT) || sig->group_exec_task) {
/*
* Another group action in progress, just
* return so that the signal is processed.
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index d3248aba5183..b6ecb9fc4cd2 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -271,13 +271,6 @@ static inline void signal_set_stop_flags(struct signal_struct *sig,
sig->flags = (sig->flags & ~SIGNAL_STOP_MASK) | flags;
}

-/* If true, all threads except ->group_exec_task have pending SIGKILL */
-static inline int signal_group_exit(const struct signal_struct *sig)
-{
- return (sig->flags & SIGNAL_GROUP_EXIT) ||
- (sig->group_exec_task != NULL);
-}
-
extern void flush_signals(struct task_struct *);
extern void ignore_signals(struct task_struct *);
extern void flush_signal_handlers(struct task_struct *, int force_default);
diff --git a/kernel/exit.c b/kernel/exit.c
index 527c5e4430ae..e7104f803be0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -907,15 +907,19 @@ do_group_exit(int exit_code)

BUG_ON(exit_code & 0x80); /* core dumps don't get here */

- if (signal_group_exit(sig))
+ if (sig->flags & SIGNAL_GROUP_EXIT)
exit_code = sig->group_exit_code;
+ else if (sig->group_exec_task)
+ exit_code = 0;
else if (!thread_group_empty(current)) {
struct sighand_struct *const sighand = current->sighand;

spin_lock_irq(&sighand->siglock);
- if (signal_group_exit(sig))
+ if (sig->flags & SIGNAL_GROUP_EXIT)
/* Another thread got here before we took the lock. */
exit_code = sig->group_exit_code;
+ else if (sig->group_exec_task)
+ exit_code = 0;
else {
sig->group_exit_code = exit_code;
sig->flags = SIGNAL_GROUP_EXIT;
diff --git a/kernel/signal.c b/kernel/signal.c
index 9eb3e2c1f9f7..860d844542b2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2392,7 +2392,8 @@ static bool do_signal_stop(int signr)
WARN_ON_ONCE(signr & ~JOBCTL_STOP_SIGMASK);

if (!likely(current->jobctl & JOBCTL_STOP_DEQUEUED) ||
- unlikely(signal_group_exit(sig)))
+ unlikely(sig->flags & SIGNAL_GROUP_EXIT) ||
+ unlikely(sig->group_exec_task))
return false;
/*
* There is no group stop already in progress. We must
@@ -2699,7 +2700,8 @@ bool get_signal(struct ksignal *ksig)
enum pid_type type;

/* Has this task already been marked for death? */
- if (signal_group_exit(signal)) {
+ if ((signal->flags & SIGNAL_GROUP_EXIT) ||
+ signal->group_exec_task) {
ksig->info.si_signo = signr = SIGKILL;
sigdelset(&current->pending.signal, SIGKILL);
trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
@@ -2955,7 +2957,7 @@ void exit_signals(struct task_struct *tsk)
*/
cgroup_threadgroup_change_begin(tsk);

- if (thread_group_empty(tsk) || signal_group_exit(tsk->signal)) {
+ if (thread_group_empty(tsk) || (tsk->signal->flags & SIGNAL_GROUP_EXIT)) {
tsk->flags |= PF_EXITING;
cgroup_threadgroup_change_end(tsk);
return;
--
2.29.2

2021-12-22 18:19:16

by Nathan Chancellor

Subject: Re: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads

Hi Eric,

On Wed, Dec 08, 2021 at 02:25:31PM -0600, Eric W. Biederman wrote:
> Today the rules are a bit iffy and arbitrary about which kernel
> threads have struct kthread present. Both idle threads and thread
> started with create_kthread want struct kthread present so that is
> effectively all kernel threads. Make the rule that if PF_KTHREAD
> and the task is running then struct kthread is present.
>
> This will allow the kernel thread code to using tsk->exit_code
> with different semantics from ordinary processes.
>
> To make ensure that struct kthread is present for all
> kernel threads move it's allocation into copy_process.
>
> Add a deallocation of struct kthread in exec for processes
> that were kernel threads.
>
> Move the allocation of struct kthread for the initial thread
> earlier so that it is not repeated for each additional idle
> thread.
>
> Move the initialization of struct kthread into set_kthread_struct
> so that the structure is always and reliably initailized.
>
> Clear set_child_tid in free_kthread_struct to ensure the kthread
> struct is reliably freed during exec. The function
> free_kthread_struct does not need to clear vfork_done during exec as
> exec_mm_release called from exec_mmap has already cleared vfork_done.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

This patch as commit 40966e316f86 ("kthread: Ensure struct kthread is
present for all kthreads") in -next causes an ARCH=arm
multi_v5_defconfig kernel to fail to boot in QEMU. I had to apply commit
6692c98c7df5 ("fork: Stop protecting back_fork_cleanup_cgroup_lock with
CONFIG_NUMA") to get it to build and I applied commit dd621ee0cf8e
("kthread: Warn about failed allocations for the init kthread") to avoid
the known runtime warning.

$ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- distclean multi_v5_defconfig all

$ qemu-system-arm \
-initrd rootfs.cpio \
-append earlycon \
-machine palmetto-bmc \
-no-reboot \
-dtb arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dtb \
-display none \
-kernel arch/arm/boot/zImage \
-m 512m \
-nodefaults \
-serial mon:stdio
qemu-system-arm: warning: nic ftgmac100.0 has no peer
qemu-system-arm: warning: nic ftgmac100.1 has no peer
Booting Linux on physical CPU 0x0
Linux version 5.16.0-rc1-00016-g40966e316f86-dirty (nathan@archlinux-ax161) (arm-linux-gnueabi-gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 PREEMPT Wed Dec 22 18:08:53 UTC 2021
CPU: ARM926EJ-S [41069265] revision 5 (ARMv5TEJ), cr=00093177
CPU: VIVT data cache, VIVT instruction cache
OF: fdt: Machine model: Palmetto BMC
earlycon: ns16550a0 at MMIO 0x1e784000 (options '')
printk: bootconsole [ns16550a0] enabled
Memory policy: Data cache writethrough
cma: Reserved 16 MiB at 0x5b000000
Zone ranges:
DMA [mem 0x0000000040000000-0x000000005edfffff]
Normal empty
HighMem [mem 0x000000005ee00000-0x000000005fffffff]
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x0000000040000000-0x000000005bffffff]
node 0: [mem 0x000000005c000000-0x000000005dffffff]
node 0: [mem 0x000000005e000000-0x000000005edfffff]
node 0: [mem 0x000000005ee00000-0x000000005fffffff]
Initmem setup node 0 [mem 0x0000000040000000-0x000000005fffffff]
Built 1 zonelists, mobility grouping on. Total pages: 130084
Kernel command line: earlycon
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
mem auto-init: stack:off, heap alloc:off, heap free:off
Memory: 433140K/524288K available (9628K kernel code, 2019K rwdata, 2368K rodata, 340K init, 661K bss, 74764K reserved, 16384K cma-reserved, 0K highmem)
SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
rcu: Preemptible hierarchical RCU implementation.
rcu: RCU event tracing is enabled.
Trampoline variant of Tasks RCU enabled.
rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
i2c controller registered, irq 16
random: get_random_bytes called from start_kernel+0x408/0x624 with crng_init=0
clocksource: FTTMR010-TIMER2: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635851949 ns
sched_clock: 32 bits at 24MHz, resolution 41ns, wraps every 89478484971ns
Switching to timer-based delay loop, resolution 41ns
Console: colour dummy device 80x30
printk: console [tty0] enabled
printk: bootconsole [ns16550a0] disabled

After that, it just hangs.

The rootfs is available at https://github.com/ClangBuiltLinux/boot-utils
in the images/arm folder.

If there is any more information that I can provide or changes to test,
please let me know.

Cheers,
Nathan

2021-12-22 18:31:11

by Eric W. Biederman

Subject: Re: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads

Nathan Chancellor <[email protected]> writes:

> Hi Eric,
>
> On Wed, Dec 08, 2021 at 02:25:31PM -0600, Eric W. Biederman wrote:
>> Today the rules are a bit iffy and arbitrary about which kernel
>> threads have struct kthread present. Both idle threads and thread
>> started with create_kthread want struct kthread present so that is
>> effectively all kernel threads. Make the rule that if PF_KTHREAD
>> and the task is running then struct kthread is present.
>>
>> This will allow the kernel thread code to using tsk->exit_code
>> with different semantics from ordinary processes.
>>
>> To make ensure that struct kthread is present for all
>> kernel threads move it's allocation into copy_process.
>>
>> Add a deallocation of struct kthread in exec for processes
>> that were kernel threads.
>>
>> Move the allocation of struct kthread for the initial thread
>> earlier so that it is not repeated for each additional idle
>> thread.
>>
>> Move the initialization of struct kthread into set_kthread_struct
>> so that the structure is always and reliably initailized.
>>
>> Clear set_child_tid in free_kthread_struct to ensure the kthread
>> struct is reliably freed during exec. The function
>> free_kthread_struct does not need to clear vfork_done during exec as
>> exec_mm_release called from exec_mmap has already cleared vfork_done.
>>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>
> This patch as commit 40966e316f86 ("kthread: Ensure struct kthread is
> present for all kthreads") in -next causes an ARCH=arm
> multi_v5_defconfig kernel to fail to boot in QEMU. I had to apply commit
> 6692c98c7df5 ("fork: Stop protecting back_fork_cleanup_cgroup_lock with
> CONFIG_NUMA") to get it to build and I applied commit dd621ee0cf8e
> ("kthread: Warn about failed allocations for the init kthread") to avoid
> the known runtime warning.
>
> $ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- distclean multi_v5_defconfig all
>
> $ qemu-system-arm \
> -initrd rootfs.cpio \
> -append earlycon \
> -machine palmetto-bmc \
> -no-reboot \
> -dtb arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dtb \
> -display none \
> -kernel arch/arm/boot/zImage \
> -m 512m \
> -nodefaults \
> -serial mon:stdio
> qemu-system-arm: warning: nic ftgmac100.0 has no peer
> qemu-system-arm: warning: nic ftgmac100.1 has no peer
> Booting Linux on physical CPU 0x0
> Linux version 5.16.0-rc1-00016-g40966e316f86-dirty (nathan@archlinux-ax161) (arm-linux-gnueabi-gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 PREEMPT Wed Dec 22 18:08:53 UTC 2021
> CPU: ARM926EJ-S [41069265] revision 5 (ARMv5TEJ), cr=00093177
> CPU: VIVT data cache, VIVT instruction cache
> OF: fdt: Machine model: Palmetto BMC
> earlycon: ns16550a0 at MMIO 0x1e784000 (options '')
> printk: bootconsole [ns16550a0] enabled
> Memory policy: Data cache writethrough
> cma: Reserved 16 MiB at 0x5b000000
> Zone ranges:
> DMA [mem 0x0000000040000000-0x000000005edfffff]
> Normal empty
> HighMem [mem 0x000000005ee00000-0x000000005fffffff]
> Movable zone start for each node
> Early memory node ranges
> node 0: [mem 0x0000000040000000-0x000000005bffffff]
> node 0: [mem 0x000000005c000000-0x000000005dffffff]
> node 0: [mem 0x000000005e000000-0x000000005edfffff]
> node 0: [mem 0x000000005ee00000-0x000000005fffffff]
> Initmem setup node 0 [mem 0x0000000040000000-0x000000005fffffff]
> Built 1 zonelists, mobility grouping on. Total pages: 130084
> Kernel command line: earlycon
> Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
> Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
> mem auto-init: stack:off, heap alloc:off, heap free:off
> Memory: 433140K/524288K available (9628K kernel code, 2019K rwdata, 2368K rodata, 340K init, 661K bss, 74764K reserved, 16384K cma-reserved, 0K highmem)
> SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
> rcu: Preemptible hierarchical RCU implementation.
> rcu: RCU event tracing is enabled.
> Trampoline variant of Tasks RCU enabled.
> rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
> NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
> i2c controller registered, irq 16
> random: get_random_bytes called from start_kernel+0x408/0x624 with crng_init=0
> clocksource: FTTMR010-TIMER2: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635851949 ns
> sched_clock: 32 bits at 24MHz, resolution 41ns, wraps every 89478484971ns
> Switching to timer-based delay loop, resolution 41ns
> Console: colour dummy device 80x30
> printk: console [tty0] enabled
> printk: bootconsole [ns16550a0] disabled
>
> After that, it just hangs.
>
> The rootfs is available at https://github.com/ClangBuiltLinux/boot-utils
> in the images/arm folder.
>
> If there is any more information that I can provide or changes to test,
> please let me know.

Well crap. I hate to hear my code is causing problems like this.

This is however a very good bug report, which I very much appreciate.

I think I have enough information. I will see if I can reproduce this
and track down what is happening.

Have you by any chance tried linux-next with just these changes backed
out?

Eric

2021-12-22 18:46:50

by Nathan Chancellor

Subject: Re: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads

On Wed, Dec 22, 2021 at 12:30:57PM -0600, Eric W. Biederman wrote:
> Nathan Chancellor <[email protected]> writes:
>
> > Hi Eric,
> >
> > On Wed, Dec 08, 2021 at 02:25:31PM -0600, Eric W. Biederman wrote:
> >> Today the rules are a bit iffy and arbitrary about which kernel
> >> threads have struct kthread present. Both idle threads and thread
> >> started with create_kthread want struct kthread present so that is
> >> effectively all kernel threads. Make the rule that if PF_KTHREAD
> >> and the task is running then struct kthread is present.
> >>
> >> This will allow the kernel thread code to using tsk->exit_code
> >> with different semantics from ordinary processes.
> >>
> >> To make ensure that struct kthread is present for all
> >> kernel threads move it's allocation into copy_process.
> >>
> >> Add a deallocation of struct kthread in exec for processes
> >> that were kernel threads.
> >>
> >> Move the allocation of struct kthread for the initial thread
> >> earlier so that it is not repeated for each additional idle
> >> thread.
> >>
> >> Move the initialization of struct kthread into set_kthread_struct
> >> so that the structure is always and reliably initailized.
> >>
> >> Clear set_child_tid in free_kthread_struct to ensure the kthread
> >> struct is reliably freed during exec. The function
> >> free_kthread_struct does not need to clear vfork_done during exec as
> >> exec_mm_release called from exec_mmap has already cleared vfork_done.
> >>
> >> Signed-off-by: "Eric W. Biederman" <[email protected]>
> >
> > This patch as commit 40966e316f86 ("kthread: Ensure struct kthread is
> > present for all kthreads") in -next causes an ARCH=arm
> > multi_v5_defconfig kernel to fail to boot in QEMU. I had to apply commit
> > 6692c98c7df5 ("fork: Stop protecting back_fork_cleanup_cgroup_lock with
> > CONFIG_NUMA") to get it to build and I applied commit dd621ee0cf8e
> > ("kthread: Warn about failed allocations for the init kthread") to avoid
> > the known runtime warning.
> >
> > $ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- distclean multi_v5_defconfig all
> >
> > $ qemu-system-arm \
> > -initrd rootfs.cpio \
> > -append earlycon \
> > -machine palmetto-bmc \
> > -no-reboot \
> > -dtb arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dtb \
> > -display none \
> > -kernel arch/arm/boot/zImage \
> > -m 512m \
> > -nodefaults \
> > -serial mon:stdio
> > qemu-system-arm: warning: nic ftgmac100.0 has no peer
> > qemu-system-arm: warning: nic ftgmac100.1 has no peer
> > Booting Linux on physical CPU 0x0
> > Linux version 5.16.0-rc1-00016-g40966e316f86-dirty (nathan@archlinux-ax161) (arm-linux-gnueabi-gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 PREEMPT Wed Dec 22 18:08:53 UTC 2021
> > CPU: ARM926EJ-S [41069265] revision 5 (ARMv5TEJ), cr=00093177
> > CPU: VIVT data cache, VIVT instruction cache
> > OF: fdt: Machine model: Palmetto BMC
> > earlycon: ns16550a0 at MMIO 0x1e784000 (options '')
> > printk: bootconsole [ns16550a0] enabled
> > Memory policy: Data cache writethrough
> > cma: Reserved 16 MiB at 0x5b000000
> > Zone ranges:
> > DMA [mem 0x0000000040000000-0x000000005edfffff]
> > Normal empty
> > HighMem [mem 0x000000005ee00000-0x000000005fffffff]
> > Movable zone start for each node
> > Early memory node ranges
> > node 0: [mem 0x0000000040000000-0x000000005bffffff]
> > node 0: [mem 0x000000005c000000-0x000000005dffffff]
> > node 0: [mem 0x000000005e000000-0x000000005edfffff]
> > node 0: [mem 0x000000005ee00000-0x000000005fffffff]
> > Initmem setup node 0 [mem 0x0000000040000000-0x000000005fffffff]
> > Built 1 zonelists, mobility grouping on. Total pages: 130084
> > Kernel command line: earlycon
> > Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
> > Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
> > mem auto-init: stack:off, heap alloc:off, heap free:off
> > Memory: 433140K/524288K available (9628K kernel code, 2019K rwdata, 2368K rodata, 340K init, 661K bss, 74764K reserved, 16384K cma-reserved, 0K highmem)
> > SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
> > rcu: Preemptible hierarchical RCU implementation.
> > rcu: RCU event tracing is enabled.
> > Trampoline variant of Tasks RCU enabled.
> > rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
> > NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
> > i2c controller registered, irq 16
> > random: get_random_bytes called from start_kernel+0x408/0x624 with crng_init=0
> > clocksource: FTTMR010-TIMER2: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635851949 ns
> > sched_clock: 32 bits at 24MHz, resolution 41ns, wraps every 89478484971ns
> > Switching to timer-based delay loop, resolution 41ns
> > Console: colour dummy device 80x30
> > printk: console [tty0] enabled
> > printk: bootconsole [ns16550a0] disabled
> >
> > After that, it just hangs.
> >
> > The rootfs is available at https://github.com/ClangBuiltLinux/boot-utils
> > in the images/arm folder.
> >
> > If there is any more information that I can provide or changes to test,
> > please let me know.
>
> Well crap. I hate to hear my code is causing problems like this.
>
> This is however a very good bug report, which I very much appreciate.
>
> I think I have enough information. I will see if I can reproduce this
> and track down what is happening.
>
> Have you by any chance tried linux-next with just these changes backed
> out?

Yes, if I back out the following commits on top of next-20211222, then
the kernel boots right up.

dd621ee0cf8e ("kthread: Warn about failed allocations for the init kthread")
ff8288ff475e ("fork: Rename bad_fork_cleanup_threadgroup_lock to bad_fork_cleanup_delayacct")
6692c98c7df5 ("fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMA")
1fb466dff904 ("objtool: Add a missing comma to avoid string concatenation")
5eb6f22823e0 ("exit/kthread: Fix the kerneldoc comment for kthread_complete_and_exit")
6b1248798eb6 ("exit/kthread: Move the exit code for kernel threads into struct kthread")
40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads")

Cheers,
Nathan

2021-12-22 23:25:24

by Eric W. Biederman

Subject: Re: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads

Nathan Chancellor <[email protected]> writes:

> On Wed, Dec 22, 2021 at 12:30:57PM -0600, Eric W. Biederman wrote:
>> Nathan Chancellor <[email protected]> writes:
>>
>> > Hi Eric,
>> >
>> > On Wed, Dec 08, 2021 at 02:25:31PM -0600, Eric W. Biederman wrote:
>> >> Today the rules are a bit iffy and arbitrary about which kernel
>> >> threads have struct kthread present. Both idle threads and thread
>> >> started with create_kthread want struct kthread present so that is
>> >> effectively all kernel threads. Make the rule that if PF_KTHREAD
>> >> and the task is running then struct kthread is present.
>> >>
>> >> This will allow the kernel thread code to using tsk->exit_code
>> >> with different semantics from ordinary processes.
>> >>
>> >> To make ensure that struct kthread is present for all
>> >> kernel threads move it's allocation into copy_process.
>> >>
>> >> Add a deallocation of struct kthread in exec for processes
>> >> that were kernel threads.
>> >>
>> >> Move the allocation of struct kthread for the initial thread
>> >> earlier so that it is not repeated for each additional idle
>> >> thread.
>> >>
>> >> Move the initialization of struct kthread into set_kthread_struct
>> >> so that the structure is always and reliably initailized.
>> >>
>> >> Clear set_child_tid in free_kthread_struct to ensure the kthread
>> >> struct is reliably freed during exec. The function
>> >> free_kthread_struct does not need to clear vfork_done during exec as
>> >> exec_mm_release called from exec_mmap has already cleared vfork_done.
>> >>
>> >> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> >
>> > This patch as commit 40966e316f86 ("kthread: Ensure struct kthread is
>> > present for all kthreads") in -next causes an ARCH=arm
>> > multi_v5_defconfig kernel to fail to boot in QEMU. I had to apply commit
>> > 6692c98c7df5 ("fork: Stop protecting back_fork_cleanup_cgroup_lock with
>> > CONFIG_NUMA") to get it to build and I applied commit dd621ee0cf8e
>> > ("kthread: Warn about failed allocations for the init kthread") to avoid
>> > the known runtime warning.
>> >
>> > $ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- distclean multi_v5_defconfig all
>> >
>> > $ qemu-system-arm \
>> > -initrd rootfs.cpio \
>> > -append earlycon \
>> > -machine palmetto-bmc \
>> > -no-reboot \
>> > -dtb arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dtb \
>> > -display none \
>> > -kernel arch/arm/boot/zImage \
>> > -m 512m \
>> > -nodefaults \
>> > -serial mon:stdio
>> > qemu-system-arm: warning: nic ftgmac100.0 has no peer
>> > qemu-system-arm: warning: nic ftgmac100.1 has no peer
>> > Booting Linux on physical CPU 0x0
>> > Linux version 5.16.0-rc1-00016-g40966e316f86-dirty (nathan@archlinux-ax161) (arm-linux-gnueabi-gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 PREEMPT Wed Dec 22 18:08:53 UTC 2021
>> > CPU: ARM926EJ-S [41069265] revision 5 (ARMv5TEJ), cr=00093177
>> > CPU: VIVT data cache, VIVT instruction cache
>> > OF: fdt: Machine model: Palmetto BMC
>> > earlycon: ns16550a0 at MMIO 0x1e784000 (options '')
>> > printk: bootconsole [ns16550a0] enabled
>> > Memory policy: Data cache writethrough
>> > cma: Reserved 16 MiB at 0x5b000000
>> > Zone ranges:
>> > DMA [mem 0x0000000040000000-0x000000005edfffff]
>> > Normal empty
>> > HighMem [mem 0x000000005ee00000-0x000000005fffffff]
>> > Movable zone start for each node
>> > Early memory node ranges
>> > node 0: [mem 0x0000000040000000-0x000000005bffffff]
>> > node 0: [mem 0x000000005c000000-0x000000005dffffff]
>> > node 0: [mem 0x000000005e000000-0x000000005edfffff]
>> > node 0: [mem 0x000000005ee00000-0x000000005fffffff]
>> > Initmem setup node 0 [mem 0x0000000040000000-0x000000005fffffff]
>> > Built 1 zonelists, mobility grouping on. Total pages: 130084
>> > Kernel command line: earlycon
>> > Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
>> > Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
>> > mem auto-init: stack:off, heap alloc:off, heap free:off
>> > Memory: 433140K/524288K available (9628K kernel code, 2019K rwdata, 2368K rodata, 340K init, 661K bss, 74764K reserved, 16384K cma-reserved, 0K highmem)
>> > SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
>> > rcu: Preemptible hierarchical RCU implementation.
>> > rcu: RCU event tracing is enabled.
>> > Trampoline variant of Tasks RCU enabled.
>> > rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
>> > NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
>> > i2c controller registered, irq 16
>> > random: get_random_bytes called from start_kernel+0x408/0x624 with crng_init=0
>> > clocksource: FTTMR010-TIMER2: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635851949 ns
>> > sched_clock: 32 bits at 24MHz, resolution 41ns, wraps every 89478484971ns
>> > Switching to timer-based delay loop, resolution 41ns
>> > Console: colour dummy device 80x30
>> > printk: console [tty0] enabled
>> > printk: bootconsole [ns16550a0] disabled
>> >
>> > After that, it just hangs.
>> >
>> > The rootfs is available at https://github.com/ClangBuiltLinux/boot-utils
>> > in the images/arm folder.
>> >
>> > If there is any more information that I can provide or changes to test,
>> > please let me know.

I have managed to reproduce the problem, fix it, and verify the fix.
Please see below.

Subject: [PATCH] kthread: Never put_user the set_child_tid address

Kernel threads abuse set_child_tid. Historically that has been fine
as set_child_tid was initialized after the kernel thread had been
forked. Unfortunately storing struct kthread in set_child_tid after
the thread is running makes struct kthread unusable for storing
result codes of the thread.

When set_child_tid is set to struct kthread during fork that results
in schedule_tail writing the thread id to the beginning of struct
kthread (if put_user does not realize it is a kernel address).

Solve this by skipping the put_user for all kthreads.

Reported-by: Nathan Chancellor <[email protected]>
Link: https://lkml.kernel.org/r/YcNsG0Lp94V13whH@archlinux-ax161
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee222b89c692..d8adbea77be1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4908,7 +4908,7 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 	finish_task_switch(prev);
 	preempt_enable();

-	if (current->set_child_tid)
+	if (!(current->flags & PF_KTHREAD) && current->set_child_tid)
 		put_user(task_pid_vnr(current), current->set_child_tid);

 	calculate_sigpending();
--
2.29.2

Eric
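
The failure mode described in the patch above can be modelled in userspace.
The following is a minimal sketch, not kernel code: struct kthread_model,
struct task_model, put_user_model() and the two schedule_tail_*() variants
are hypothetical stand-ins, and put_user_model() assumes a platform where
put_user does not reject kernel addresses. It shows the thread id landing
on the first member of the structure that set_child_tid points at, and the
PF_KTHREAD check preventing that.

```c
#include <assert.h>
#include <string.h>

/* Illustrative flag value marking a kernel thread. */
#define PF_KTHREAD 0x00200000u

/* Hypothetical, simplified stand-in for struct kthread. */
struct kthread_model {
	unsigned long flags;	/* first member: a stray write lands here */
	int result;
};

/* Hypothetical, simplified stand-in for struct task_struct. */
struct task_model {
	unsigned int flags;
	int pid;
	void *set_child_tid;	/* 'int __user *' in the kernel */
};

/* Models put_user() on a platform that accepts any address. */
static void put_user_model(int value, void *addr)
{
	memcpy(addr, &value, sizeof(value));
}

/* Behaviour before the fix: write the tid whenever the pointer is set. */
static void schedule_tail_before(struct task_model *t)
{
	if (t->set_child_tid)
		put_user_model(t->pid, t->set_child_tid);
}

/* Behaviour after the fix: never write the tid for kernel threads. */
static void schedule_tail_after(struct task_model *t)
{
	if (!(t->flags & PF_KTHREAD) && t->set_child_tid)
		put_user_model(t->pid, t->set_child_tid);
}

/* Returns 1 if the kthread bookkeeping is intact after 'tail', else 0. */
static int kthread_survives(void (*tail)(struct task_model *))
{
	struct kthread_model k = { .flags = 0, .result = 0 };
	struct task_model t = {
		.flags = PF_KTHREAD,
		.pid = 42,
		.set_child_tid = &k,	/* the abuse: points at struct kthread */
	};

	assert(tail != NULL);
	tail(&t);
	return k.flags == 0 && k.result == 0;
}
```

Calling kthread_survives(schedule_tail_before) reports the overwrite, while
kthread_survives(schedule_tail_after) leaves the structure untouched.
Ordinary tasks, which do not have PF_KTHREAD set, still get their tid
written in both variants.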

2021-12-23 00:37:29

by Nathan Chancellor

Subject: Re: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads

On Wed, Dec 22, 2021 at 05:22:45PM -0600, Eric W. Biederman wrote:
> Nathan Chancellor <[email protected]> writes:
>
> > On Wed, Dec 22, 2021 at 12:30:57PM -0600, Eric W. Biederman wrote:
> >> Nathan Chancellor <[email protected]> writes:
> >>
> >> > Hi Eric,
> >> >
> >> > On Wed, Dec 08, 2021 at 02:25:31PM -0600, Eric W. Biederman wrote:
> >> >> Today the rules are a bit iffy and arbitrary about which kernel
> >> >> threads have struct kthread present. Both idle threads and thread
> >> >> started with create_kthread want struct kthread present so that is
> >> >> effectively all kernel threads. Make the rule that if PF_KTHREAD
> >> >> and the task is running then struct kthread is present.
> >> >>
> >> >> This will allow the kernel thread code to using tsk->exit_code
> >> >> with different semantics from ordinary processes.
> >> >>
> >> >> To make ensure that struct kthread is present for all
> >> >> kernel threads move it's allocation into copy_process.
> >> >>
> >> >> Add a deallocation of struct kthread in exec for processes
> >> >> that were kernel threads.
> >> >>
> >> >> Move the allocation of struct kthread for the initial thread
> >> >> earlier so that it is not repeated for each additional idle
> >> >> thread.
> >> >>
> >> >> Move the initialization of struct kthread into set_kthread_struct
> >> >> so that the structure is always and reliably initailized.
> >> >>
> >> >> Clear set_child_tid in free_kthread_struct to ensure the kthread
> >> >> struct is reliably freed during exec. The function
> >> >> free_kthread_struct does not need to clear vfork_done during exec as
> >> >> exec_mm_release called from exec_mmap has already cleared vfork_done.
> >> >>
> >> >> Signed-off-by: "Eric W. Biederman" <[email protected]>
> >> >
> >> > This patch as commit 40966e316f86 ("kthread: Ensure struct kthread is
> >> > present for all kthreads") in -next causes an ARCH=arm
> >> > multi_v5_defconfig kernel to fail to boot in QEMU. I had to apply commit
> >> > 6692c98c7df5 ("fork: Stop protecting back_fork_cleanup_cgroup_lock with
> >> > CONFIG_NUMA") to get it to build and I applied commit dd621ee0cf8e
> >> > ("kthread: Warn about failed allocations for the init kthread") to avoid
> >> > the known runtime warning.
> >> >
> >> > $ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- distclean multi_v5_defconfig all
> >> >
> >> > $ qemu-system-arm \
> >> > -initrd rootfs.cpio \
> >> > -append earlycon \
> >> > -machine palmetto-bmc \
> >> > -no-reboot \
> >> > -dtb arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dtb \
> >> > -display none \
> >> > -kernel arch/arm/boot/zImage \
> >> > -m 512m \
> >> > -nodefaults \
> >> > -serial mon:stdio
> >> > qemu-system-arm: warning: nic ftgmac100.0 has no peer
> >> > qemu-system-arm: warning: nic ftgmac100.1 has no peer
> >> > Booting Linux on physical CPU 0x0
> >> > Linux version 5.16.0-rc1-00016-g40966e316f86-dirty (nathan@archlinux-ax161) (arm-linux-gnueabi-gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 PREEMPT Wed Dec 22 18:08:53 UTC 2021
> >> > CPU: ARM926EJ-S [41069265] revision 5 (ARMv5TEJ), cr=00093177
> >> > CPU: VIVT data cache, VIVT instruction cache
> >> > OF: fdt: Machine model: Palmetto BMC
> >> > earlycon: ns16550a0 at MMIO 0x1e784000 (options '')
> >> > printk: bootconsole [ns16550a0] enabled
> >> > Memory policy: Data cache writethrough
> >> > cma: Reserved 16 MiB at 0x5b000000
> >> > Zone ranges:
> >> > DMA [mem 0x0000000040000000-0x000000005edfffff]
> >> > Normal empty
> >> > HighMem [mem 0x000000005ee00000-0x000000005fffffff]
> >> > Movable zone start for each node
> >> > Early memory node ranges
> >> > node 0: [mem 0x0000000040000000-0x000000005bffffff]
> >> > node 0: [mem 0x000000005c000000-0x000000005dffffff]
> >> > node 0: [mem 0x000000005e000000-0x000000005edfffff]
> >> > node 0: [mem 0x000000005ee00000-0x000000005fffffff]
> >> > Initmem setup node 0 [mem 0x0000000040000000-0x000000005fffffff]
> >> > Built 1 zonelists, mobility grouping on. Total pages: 130084
> >> > Kernel command line: earlycon
> >> > Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
> >> > Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
> >> > mem auto-init: stack:off, heap alloc:off, heap free:off
> >> > Memory: 433140K/524288K available (9628K kernel code, 2019K rwdata, 2368K rodata, 340K init, 661K bss, 74764K reserved, 16384K cma-reserved, 0K highmem)
> >> > SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
> >> > rcu: Preemptible hierarchical RCU implementation.
> >> > rcu: RCU event tracing is enabled.
> >> > Trampoline variant of Tasks RCU enabled.
> >> > rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
> >> > NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
> >> > i2c controller registered, irq 16
> >> > random: get_random_bytes called from start_kernel+0x408/0x624 with crng_init=0
> >> > clocksource: FTTMR010-TIMER2: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635851949 ns
> >> > sched_clock: 32 bits at 24MHz, resolution 41ns, wraps every 89478484971ns
> >> > Switching to timer-based delay loop, resolution 41ns
> >> > Console: colour dummy device 80x30
> >> > printk: console [tty0] enabled
> >> > printk: bootconsole [ns16550a0] disabled
> >> >
> >> > After that, it just hangs.
> >> >
> >> > The rootfs is available at https://github.com/ClangBuiltLinux/boot-utils
> >> > in the images/arm folder.
> >> >
> >> > If there is any more information that I can provide or changes to test,
> >> > please let me know.
>
> I have managed to reproduce, fix and verify my fix, please
> see below.
>
>
> Subject: [PATCH] kthread: Never put_user the set_child_tid address
>
> Kernel threads abuse set_child_tid. Historically that has been fine
> as set_child_tid was initialized after the kernel thread had been
> forked. Unfortunately storing struct kthread in set_child_tid after
> the thread is running makes struct kthread being unusable for storing
> result codes of the thread.
>
> When set_child_tid is set to struct kthread during fork that results
> in schedule_tail writing the thread id to the beggining of struct
> kthread (if put_user does not realize it is a kernel address).
>
> Solve this by skipping the put_user for all kthreads.
>
> Reported-by: Nathan Chancellor <[email protected]>
> Link: https://lkml.kernel.org/r/YcNsG0Lp94V13whH@archlinux-ax161
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Thanks a lot for the quick fix. I can confirm that it resolves the
failure on my side.

Tested-by: Nathan Chancellor <[email protected]>

> ---
> kernel/sched/core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ee222b89c692..d8adbea77be1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4908,7 +4908,7 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
> finish_task_switch(prev);
> preempt_enable();
>
> - if (current->set_child_tid)
> + if (!(current->flags & PF_KTHREAD) && current->set_child_tid)
> put_user(task_pid_vnr(current), current->set_child_tid);
>
> calculate_sigpending();
> --
> 2.29.2
>
>
> Eric

2021-12-23 01:44:47

by Linus Torvalds

Subject: Re: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads

On Wed, Dec 22, 2021 at 3:25 PM Eric W. Biederman <[email protected]> wrote:
>
> Solve this by skipping the put_user for all kthreads.

Ugh.

While this fixes the problem, could we please just not mis-use that
'set_child_tid' as that kthread pointer any more?

It was always kind of hacky. I think a new pointer with the proper
'struct kthread *' type would be an improvement.

One of the "arguments" in the comment for re-using that set_child_tid
pointer was that 'fork()' used to not wrongly copy it, but your patch
literally now does that "allocate new kthread struct" at fork-time, so
that argument is actually bogus now.

Linus

2021-12-23 03:35:22

by Eric W. Biederman

Subject: Re: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads

Added a couple of people from the vhost thread.

Linus Torvalds <[email protected]> writes:

> On Wed, Dec 22, 2021 at 3:25 PM Eric W. Biederman <[email protected]> wrote:
>>
>> Solve this by skipping the put_user for all kthreads.
>
> Ugh.
>
> While this fixes the problem, could we please just not mis-use that
> 'set_child_tid' as that kthread pointer any more?
>
> It was always kind of hacky. I think a new pointer with the proper
> 'struct kthread *' type would be an improvement.
>
> One of the "arguments" in the comment for re-using that set_child_tid
> pointer was that 'fork()' used to not wrongly copy it, but your patch
> literally now does that "allocate new kthread struct" at fork-time, so
> that argument is actually bogus now.

I agree. I think I saw in the recent vhost patches that were
generalizing create_io_thread that the pf_io_worker field of
struct task_struct was being generalized as well.

If so I think it makes sense just to take that approach.

Just build some basic infrastructure that can be used for io_workers,
vhost_workers, and kthreads.

Eric

2021-12-23 05:19:23

by Eric W. Biederman

Subject: [PATCH] kthread: Generalize pf_io_worker so it can point to struct kthread

The point of using set_child_tid to hold the kthread pointer was that
it already did what is necessary. There are now restrictions on when
set_child_tid can be initialized and when set_child_tid can be used in
schedule_tail, which indicates that continuing to use set_child_tid
to hold the kthread pointer is a bad idea.

Instead of continuing to use the set_child_tid field of task_struct,
generalize the pf_io_worker field of task_struct and use it to hold
the kthread pointer.

Rename pf_io_worker (which is a void * pointer) to worker_private so
it can be used to store a kthread's struct kthread pointer. Update the
kthread code to store the kthread pointer in the worker_private field.
Remove the places where set_child_tid had to be dealt with carefully
because kthreads also used it.

Link: https://lkml.kernel.org/r/CAHk-=wgtFAA9SbVYg0gR1tqPMC17-NYcs0GQkaYg1bGhh1uJQQ@mail.gmail.com
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---

I looked again and the vhost_worker changes do not generalize
pf_io_worker, and as pf_io_worker is already a void * it is easy to
generalize. So I just did that.

Unless someone spots a problem I will add this to my signal-for-v5.17
branch in linux-next, as this seems much less error prone than using
set_child_tid.

fs/io-wq.c | 6 +++---
fs/io-wq.h | 2 +-
include/linux/sched.h | 4 ++--
kernel/fork.c | 8 +-------
kernel/kthread.c | 14 +++++---------
kernel/sched/core.c | 2 +-
6 files changed, 13 insertions(+), 23 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 88202de519f6..e4fc7384b40c 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -657,7 +657,7 @@ static int io_wqe_worker(void *data)
  */
 void io_wq_worker_running(struct task_struct *tsk)
 {
-	struct io_worker *worker = tsk->pf_io_worker;
+	struct io_worker *worker = tsk->worker_private;

 	if (!worker)
 		return;
@@ -675,7 +675,7 @@ void io_wq_worker_running(struct task_struct *tsk)
  */
 void io_wq_worker_sleeping(struct task_struct *tsk)
 {
-	struct io_worker *worker = tsk->pf_io_worker;
+	struct io_worker *worker = tsk->worker_private;

 	if (!worker)
 		return;
@@ -694,7 +694,7 @@ void io_wq_worker_sleeping(struct task_struct *tsk)
 static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
 			       struct task_struct *tsk)
 {
-	tsk->pf_io_worker = worker;
+	tsk->worker_private = worker;
 	worker->task = tsk;
 	set_cpus_allowed_ptr(tsk, wqe->cpu_mask);
 	tsk->flags |= PF_NO_SETAFFINITY;
diff --git a/fs/io-wq.h b/fs/io-wq.h
index 41bf37674a49..c7c23947cbcd 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -200,6 +200,6 @@ static inline void io_wq_worker_running(struct task_struct *tsk)
 static inline bool io_wq_current_is_worker(void)
 {
 	return in_task() && (current->flags & PF_IO_WORKER) &&
-		current->pf_io_worker;
+		current->worker_private;
 }
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78c351e35fec..52f2fdffa3ab 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -987,8 +987,8 @@ struct task_struct {
 	/* CLONE_CHILD_CLEARTID: */
 	int __user *clear_child_tid;

-	/* PF_IO_WORKER */
-	void *pf_io_worker;
+	/* PF_KTHREAD | PF_IO_WORKER */
+	void *worker_private;

 	u64 utime;
 	u64 stime;
diff --git a/kernel/fork.c b/kernel/fork.c
index 0816be1bb044..6f0293cb29c9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -950,7 +950,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->splice_pipe = NULL;
 	tsk->task_frag.page = NULL;
 	tsk->wake_q.next = NULL;
-	tsk->pf_io_worker = NULL;
+	tsk->worker_private = NULL;

 	account_kernel_stack(tsk, 1);

@@ -2032,12 +2032,6 @@ static __latent_entropy struct task_struct *copy_process(
 		siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
 	}

-	/*
-	 * This _must_ happen before we call free_task(), i.e. before we jump
-	 * to any of the bad_fork_* labels. This is to avoid freeing
-	 * p->set_child_tid which is (ab)used as a kthread's data pointer for
-	 * kernel threads (PF_KTHREAD).
-	 */
 	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? args->child_tid : NULL;
 	/*
 	 * Clear TID on mm_release()?
diff --git a/kernel/kthread.c b/kernel/kthread.c
index c14707d15341..261a3c3b9c6c 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -72,7 +72,7 @@ enum KTHREAD_BITS {
 static inline struct kthread *to_kthread(struct task_struct *k)
 {
 	WARN_ON(!(k->flags & PF_KTHREAD));
-	return (__force void *)k->set_child_tid;
+	return k->worker_private;
 }

 /*
@@ -80,7 +80,7 @@ static inline struct kthread *to_kthread(struct task_struct *k)
  *
  * Per construction; when:
  *
- * (p->flags & PF_KTHREAD) && p->set_child_tid
+ * (p->flags & PF_KTHREAD) && p->worker_private
  *
  * the task is both a kthread and struct kthread is persistent. However
  * PF_KTHREAD on it's own is not, kernel_thread() can exec() (See umh.c and
@@ -88,7 +88,7 @@ static inline struct kthread *to_kthread(struct task_struct *k)
  */
 static inline struct kthread *__to_kthread(struct task_struct *p)
 {
-	void *kthread = (__force void *)p->set_child_tid;
+	void *kthread = p->worker_private;
 	if (kthread && !(p->flags & PF_KTHREAD))
 		kthread = NULL;
 	return kthread;
@@ -109,11 +109,7 @@ bool set_kthread_struct(struct task_struct *p)
 	init_completion(&kthread->parked);
 	p->vfork_done = &kthread->exited;

-	/*
-	 * We abuse ->set_child_tid to avoid the new member and because it
-	 * can't be wrongly copied by copy_process().
-	 */
-	p->set_child_tid = (__force void __user *)kthread;
+	p->worker_private = kthread;
 	return true;
 }

@@ -128,7 +124,7 @@ void free_kthread_struct(struct task_struct *k)
 #ifdef CONFIG_BLK_CGROUP
 	WARN_ON_ONCE(kthread && kthread->blkcg_css);
 #endif
-	k->set_child_tid = (__force void __user *)NULL;
+	k->worker_private = NULL;
 	kfree(kthread);
 }

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d8adbea77be1..ee222b89c692 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4908,7 +4908,7 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
finish_task_switch(prev);
preempt_enable();

- if (!(current->flags & PF_KTHREAD) && current->set_child_tid)
+ if (current->set_child_tid)
put_user(task_pid_vnr(current), current->set_child_tid);

calculate_sigpending();
--
2.29.2

2021-12-23 17:20:51

by Linus Torvalds

Subject: Re: [PATCH] kthread: Generalize pf_io_worker so it can point to struct kthread

On Wed, Dec 22, 2021 at 9:19 PM Eric W. Biederman <[email protected]> wrote:
>
> Instead of continuing to use the set_child_tid field of task_struct
> generalize the pf_io_worker field of task_struct and use it to hold
> the kthread pointer.

Well that patch certainly looks like a nice cleanup to me. Thanks.

Linus

2022-01-03 21:30:35

by Eric W. Biederman

Subject: [PATCH 00/17] exit: Making task exiting a first class concept

The changes below contain some cleanups and the work to implement
first class asynchronous task exit. Most of the cleanups are necessary
for this work, but a couple of them (removing profile_task_exit and the
extra setting of PT_SEIZED in ptrace_attach) are included because I
stumbled over them and they are worth applying, but they aren't
interesting enough to me to be in their own patchset.

The core of this set of changes is the addition of
schedule_task_exit_locked. Ptrace is cleaned up to avoid a conflict in
task->exit_code. Then the existing task exit code is gradually moved
into the final shape of schedule_task_exit_locked.

This is the fundamental building block I need to fix alpha, m68k,
nios2 and any other architecture that does not always save all of
their registers except when entering into a ptrace context.

This is about half the work to allow coredump signals to use
short-circuit delivery.

With coredump signals available for short-circuit delivery the
SA_IMMUTABLE hack can be replaced by something clean.

The counting of the number of threads that have not been killed to
always set SIGNAL_GROUP_EXIT when a process exits and the coredump
signal short-circuit delivery is a foundation for updating the
SECCOMP_RET_KILL_THREAD implementation such that it can decide if it
should coredump without races.

I have most of those changes pretty much ready; I just need to get these
changes finalized and reviewed first. At this point they are looking like
v5.18 material.

These patches are on top of:
https://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git/ signal-for-v5.17

After these patches have been reviewed it is my plan to apply them to my
signal-for-v5.17 branch. Any and all feedback is welcome.

Eric W. Biederman (17):
exit: Remove profile_task_exit & profile_munmap
exit: Coredumps reach do_group_exit
exit: Fix the exit_code for wait_task_zombie
exit: Use the correct exit_code in /proc/<pid>/stat
taskstats: Cleanup the use of task->exit_code
ptrace: Remove second setting of PT_SEIZED in ptrace_attach
ptrace: Remove unused regs argument from ptrace_report_syscall
ptrace/m68k: Stop open coding ptrace_report_syscall
ptrace: Move setting/clearing ptrace_message into ptrace_stop
ptrace: Return the signal to continue with from ptrace_stop
ptrace: Separate task->ptrace_code out from task->exit_code
signal: Compute the process exit_code in get_signal
signal: Make individual tasks exiting a first class concept
signal: Remove zap_other_threads
signal: Add JOBCTL_WILL_EXIT to mark exiting tasks
signal: Record the exit_code when an exit is scheduled
signal: Always set SIGNAL_GROUP_EXIT on process exit

arch/m68k/kernel/ptrace.c | 12 +----
fs/coredump.c | 17 +++---
fs/exec.c | 12 +++--
fs/proc/array.c | 9 +++-
include/linux/profile.h | 26 ---------
include/linux/ptrace.h | 5 +-
include/linux/sched.h | 1 +
include/linux/sched/jobctl.h | 2 +
include/linux/sched/signal.h | 6 ++-
include/linux/tracehook.h | 21 ++++----
kernel/exit.c | 29 +++++-----
kernel/fork.c | 2 +
kernel/profile.c | 50 ------------------
kernel/ptrace.c | 14 +++--
kernel/signal.c | 122 +++++++++++++++++++++++--------------------
kernel/tsacct.c | 7 ++-
mm/mmap.c | 1 -
17 files changed, 134 insertions(+), 202 deletions(-)

2022-01-03 21:33:35

by Eric W. Biederman

Subject: [PATCH 01/17] exit: Remove profile_task_exit & profile_munmap

When I say remove I mean remove. All profile_task_exit and
profile_munmap do is call a blocking notifier chain. The helpers
profile_event_register and profile_event_unregister are not called
anywhere in the tree, which means this is all dead code.

So remove the dead code and make it easier to read do_exit.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/profile.h | 26 ---------------------
kernel/exit.c | 1 -
kernel/profile.c | 50 -----------------------------------------
mm/mmap.c | 1 -
4 files changed, 78 deletions(-)

diff --git a/include/linux/profile.h b/include/linux/profile.h
index fd18ca96f557..f7eb2b57d890 100644
--- a/include/linux/profile.h
+++ b/include/linux/profile.h
@@ -31,11 +31,6 @@ static inline int create_proc_profile(void)
}
#endif

-enum profile_type {
- PROFILE_TASK_EXIT,
- PROFILE_MUNMAP
-};
-
#ifdef CONFIG_PROFILING

extern int prof_on __read_mostly;
@@ -66,23 +61,14 @@ static inline void profile_hit(int type, void *ip)
struct task_struct;
struct mm_struct;

-/* task is in do_exit() */
-void profile_task_exit(struct task_struct * task);
-
/* task is dead, free task struct ? Returns 1 if
* the task was taken, 0 if the task should be freed.
*/
int profile_handoff_task(struct task_struct * task);

-/* sys_munmap */
-void profile_munmap(unsigned long addr);
-
int task_handoff_register(struct notifier_block * n);
int task_handoff_unregister(struct notifier_block * n);

-int profile_event_register(enum profile_type, struct notifier_block * n);
-int profile_event_unregister(enum profile_type, struct notifier_block * n);
-
#else

#define prof_on 0
@@ -117,19 +103,7 @@ static inline int task_handoff_unregister(struct notifier_block * n)
return -ENOSYS;
}

-static inline int profile_event_register(enum profile_type t, struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-static inline int profile_event_unregister(enum profile_type t, struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-#define profile_task_exit(a) do { } while (0)
#define profile_handoff_task(a) (0)
-#define profile_munmap(a) do { } while (0)

#endif /* CONFIG_PROFILING */

diff --git a/kernel/exit.c b/kernel/exit.c
index e7104f803be0..b5c35b520fda 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -737,7 +737,6 @@ void __noreturn do_exit(long code)

WARN_ON(blk_needs_flush_plug(tsk));

- profile_task_exit(tsk);
kcov_task_exit(tsk);

coredump_task_exit(tsk);
diff --git a/kernel/profile.c b/kernel/profile.c
index eb9c7f0f5ac5..9355cc934a96 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -135,14 +135,7 @@ int __ref profile_init(void)

/* Profile event notifications */

-static BLOCKING_NOTIFIER_HEAD(task_exit_notifier);
static ATOMIC_NOTIFIER_HEAD(task_free_notifier);
-static BLOCKING_NOTIFIER_HEAD(munmap_notifier);
-
-void profile_task_exit(struct task_struct *task)
-{
- blocking_notifier_call_chain(&task_exit_notifier, 0, task);
-}

int profile_handoff_task(struct task_struct *task)
{
@@ -151,11 +144,6 @@ int profile_handoff_task(struct task_struct *task)
return (ret == NOTIFY_OK) ? 1 : 0;
}

-void profile_munmap(unsigned long addr)
-{
- blocking_notifier_call_chain(&munmap_notifier, 0, (void *)addr);
-}
-
int task_handoff_register(struct notifier_block *n)
{
return atomic_notifier_chain_register(&task_free_notifier, n);
@@ -168,44 +156,6 @@ int task_handoff_unregister(struct notifier_block *n)
}
EXPORT_SYMBOL_GPL(task_handoff_unregister);

-int profile_event_register(enum profile_type type, struct notifier_block *n)
-{
- int err = -EINVAL;
-
- switch (type) {
- case PROFILE_TASK_EXIT:
- err = blocking_notifier_chain_register(
- &task_exit_notifier, n);
- break;
- case PROFILE_MUNMAP:
- err = blocking_notifier_chain_register(
- &munmap_notifier, n);
- break;
- }
-
- return err;
-}
-EXPORT_SYMBOL_GPL(profile_event_register);
-
-int profile_event_unregister(enum profile_type type, struct notifier_block *n)
-{
- int err = -EINVAL;
-
- switch (type) {
- case PROFILE_TASK_EXIT:
- err = blocking_notifier_chain_unregister(
- &task_exit_notifier, n);
- break;
- case PROFILE_MUNMAP:
- err = blocking_notifier_chain_unregister(
- &munmap_notifier, n);
- break;
- }
-
- return err;
-}
-EXPORT_SYMBOL_GPL(profile_event_unregister);
-
#if defined(CONFIG_SMP) && defined(CONFIG_PROC_FS)
/*
* Each cpu has a pair of open-addressed hashtables for pending
diff --git a/mm/mmap.c b/mm/mmap.c
index bfb0ea164a90..70318c2a47c3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2928,7 +2928,6 @@ EXPORT_SYMBOL(vm_munmap);
SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
{
addr = untagged_addr(addr);
- profile_munmap(addr);
return __vm_munmap(addr, len, true);
}

--
2.29.2

2022-01-03 21:33:37

by Eric W. Biederman

Subject: [PATCH 02/17] exit: Coredumps reach do_group_exit

The comment about coredumps not reaching do_group_exit and the
corresponding BUG_ON are bogus.

What happens and has happened for years is that get_signal calls
do_coredump (which sets SIGNAL_GROUP_EXIT and group_exit_code) and
then do_group_exit passing the signal number. Then do_group_exit
ignores the exit_code it is passed and uses signal->group_exit_code
from the coredump.

The comment and BUG_ON were correct when they were added during the
2.5 development cycle, but became obsolete and incorrect when
get_signal was changed to fall through to do_group_exit after
do_coredump in 2.6.10-rc2.

So remove the stale comment and BUG_ON.
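
As a rough illustration of the behaviour described above, here is a
userspace sketch of how a recorded group exit code takes precedence over
the argument passed in. The struct and function names are hypothetical
stand-ins, not kernel code, and only the first branch of do_group_exit is
modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's struct signal_struct. */
struct model_signal {
	bool group_exit;       /* stands in for SIGNAL_GROUP_EXIT */
	int  group_exit_code;  /* recorded earlier, e.g. by the coredump path */
};

/*
 * Models only the first branch of do_group_exit shown in the diff:
 * a recorded group exit code wins over the caller's argument.
 */
int model_group_exit_code(const struct model_signal *sig, int exit_code)
{
	if (sig->group_exit)
		return sig->group_exit_code;
	return exit_code;
}
```

With the flag set, a caller passing a bare signal number still ends up
with the recorded code, including the 0x80 core dump bit, which is why a
BUG_ON on the argument says nothing useful.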

Fixes: 63bd6144f191 ("[PATCH] Invalid BUG_ONs in signal.c")
History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index b5c35b520fda..34c43037450f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -904,8 +904,6 @@ do_group_exit(int exit_code)
{
struct signal_struct *sig = current->signal;

- BUG_ON(exit_code & 0x80); /* core dumps don't get here */
-
if (sig->flags & SIGNAL_GROUP_EXIT)
exit_code = sig->group_exit_code;
else if (sig->group_exec_task)
--
2.29.2

2022-01-03 21:33:42

by Eric W. Biederman

Subject: [PATCH 03/17] exit: Fix the exit_code for wait_task_zombie

The function wait_task_zombie is defined to always return the process,
not the thread, exit status. Unfortunately when process group exit support
was added to wait_task_zombie the WNOWAIT case was overlooked.

Usually tsk->exit_code and tsk->signal->group_exit_code will be in sync,
so fixing this bug probably has no effect in practice. But fix
it anyway so that people aren't scratching their heads about why
the two code paths are different.
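
For readers who want to see the two code paths from userspace, the
sketch below peeks at a zombie with WNOWAIT and then reaps it, using the
standard waitid(2) interface. The helper name peek_then_reap is made up
for this example, and a single-threaded child like this one would not
have shown the discrepancy; it only shows where the two statuses come
from.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits with `code`, peek at it with WNOWAIT (the
 * zombie stays around), then reap it for real.
 * Returns 0 and fills in both observed statuses, or -1 on error.
 */
int peek_then_reap(int code, int *peeked, int *reaped)
{
	siginfo_t info;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(code);

	/* WNOWAIT path: report the status but leave the child waitable. */
	if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) != 0)
		return -1;
	*peeked = info.si_status;

	/* Ordinary path: report the status and release the zombie. */
	if (waitid(P_PID, (id_t)pid, &info, WEXITED) != 0)
		return -1;
	*reaped = info.si_status;
	return 0;
}
```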

History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: 2c66151cbc2c ("[PATCH] sys_exit() threading improvements, BK-curr")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 34c43037450f..7121db37c411 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1011,7 +1011,8 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
return 0;

if (unlikely(wo->wo_flags & WNOWAIT)) {
- status = p->exit_code;
+ status = (p->signal->flags & SIGNAL_GROUP_EXIT)
+ ? p->signal->group_exit_code : p->exit_code;
get_task_struct(p);
read_unlock(&tasklist_lock);
sched_annotate_sleep();
--
2.29.2

2022-01-03 21:33:43

by Eric W. Biederman

Subject: [PATCH 04/17] exit: Use the correct exit_code in /proc/<pid>/stat

Ever since do_task_stat was modified to return process wide values instead
of per task values, the exit_code calculation has not been updated.
Update it now to return the process wide exit_code when it is requested
and available.
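
To make the affected value concrete: proc(5) documents exit_code as field
52 of /proc/<pid>/stat. The sketch below pulls that field out of one stat
line. The helper name stat_exit_code is hypothetical, and the parsing
assumes the documented layout where the comm field is wrapped in
parentheses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Extract field 52 (exit_code, per proc(5)) from one /proc/<pid>/stat line.
 * Returns 0 and stores the value in *out, or -1 if the line is too short.
 */
int stat_exit_code(const char *line, int *out)
{
	/* comm may itself contain spaces or ')', so search from the end. */
	const char *p = strrchr(line, ')');
	int field = 2;  /* pid and comm are fields 1 and 2 */

	if (!p)
		return -1;
	p++;
	while (*p) {
		while (*p == ' ')
			p++;
		if (*p == '\0' || *p == '\n')
			break;
		field++;
		if (field == 52) {
			*out = (int)strtol(p, NULL, 10);
			return 0;
		}
		while (*p && *p != ' ')
			p++;
	}
	return -1;
}
```

The value has the same encoding as a wait status, so a reader would
normally decode it further rather than print it raw.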

History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: bf719d26a5c1 ("[PATCH] distinct tgid/tid CPU usage")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/proc/array.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index ff869a66b34e..43a7abde9e42 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -468,6 +468,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
u64 cgtime, gtime;
unsigned long rsslim = 0;
unsigned long flags;
+ int exit_code = task->exit_code;

state = *get_task_state(task);
vsize = eip = esp = 0;
@@ -531,6 +532,9 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
maj_flt += sig->maj_flt;
thread_group_cputime_adjusted(task, &utime, &stime);
gtime += sig->gtime;
+
+ if (sig->flags & (SIGNAL_GROUP_EXIT | SIGNAL_STOP_STOPPED))
+ exit_code = sig->group_exit_code;
}

sid = task_session_nr_ns(task, ns);
@@ -630,7 +634,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
seq_puts(m, " 0 0 0 0 0 0 0");

if (permitted)
- seq_put_decimal_ll(m, " ", task->exit_code);
+ seq_put_decimal_ll(m, " ", exit_code);
else
seq_puts(m, " 0");

--
2.29.2

2022-01-03 21:33:49

by Eric W. Biederman

Subject: [PATCH 06/17] ptrace: Remove second setting of PT_SEIZED in ptrace_attach

The code is totally redundant; remove it.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/ptrace.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index f8589bf8d7dc..eea265082e97 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -419,8 +419,6 @@ static int ptrace_attach(struct task_struct *task, long request,
if (task->ptrace)
goto unlock_tasklist;

- if (seize)
- flags |= PT_SEIZED;
task->ptrace = flags;

ptrace_link(task, current);
--
2.29.2

2022-01-03 21:33:51

by Eric W. Biederman

Subject: [PATCH 05/17] taskstats: Cleanup the use of task->exit_code

In the function bacct_add_tsk the code reading task->exit_code was
introduced in commit f3cef7a99469 ("[PATCH] csa: basic accounting over
taskstats"), and it is not entirely clear what the taskstats interface
is trying to return, as only returning the exit_code of the first task
in a process doesn't make a lot of sense.

As best as I can figure the intent is to return task->exit_code after
a task exits. The field is returned with per task fields, so the
exit_code of the entire process is not wanted. Only the value of the
first task is returned so this is not a useful way to get the per task
ptrace stop code. The ordinary case of returning this value is
returning after a task exits, which also precludes use for getting
a ptrace value.

It is common for the first task of a process to also be the last
task of a process, so this field may have done something reasonable by
accident in testing.

Make ac_exitcode a reliable per task value by always returning it for
every exited task.

Setting ac_exitcode in a sensible manner makes it possible to continue
to provide this value going forward.
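
The behaviour change can be sketched with a small userspace model that
contrasts the old and new conditions. All names here (model_tsk,
model_stats, MODEL_AFORK, model_fill_old, model_fill_new) are
hypothetical stand-ins for the kernel structures and flags, chosen only
to mirror the shape of the diff.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_AFORK 0x01u  /* hypothetical stand-in for AFORK */

struct model_tsk {
	bool exiting;     /* stands in for PF_EXITING */
	bool leader;      /* stands in for thread_group_leader() */
	bool forknoexec;  /* stands in for PF_FORKNOEXEC */
	int  exit_code;
};

struct model_stats {
	int      ac_exitcode;
	unsigned ac_flag;
};

/* Before the patch: only the group leader reports an exit_code. */
void model_fill_old(const struct model_tsk *t, struct model_stats *s)
{
	if (t->leader) {
		s->ac_exitcode = t->exit_code;
		if (t->forknoexec)
			s->ac_flag |= MODEL_AFORK;
	}
}

/* After the patch: every exiting task reports its own exit_code. */
void model_fill_new(const struct model_tsk *t, struct model_stats *s)
{
	if (t->exiting)
		s->ac_exitcode = t->exit_code;
	if (t->leader && t->forknoexec)
		s->ac_flag |= MODEL_AFORK;
}
```

In this model a non-leader thread that exits with code 9 reports 0 under
the old condition and 9 under the new one.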

Cc: Balbir Singh <[email protected]>
Fixes: f3cef7a99469 ("[PATCH] csa: basic accounting over taskstats")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/tsacct.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index f00de83d0246..1d261fbe367b 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -38,11 +38,10 @@ void bacct_add_tsk(struct user_namespace *user_ns,
stats->ac_btime = clamp_t(time64_t, btime, 0, U32_MAX);
stats->ac_btime64 = btime;

- if (thread_group_leader(tsk)) {
+ if (tsk->flags & PF_EXITING)
stats->ac_exitcode = tsk->exit_code;
- if (tsk->flags & PF_FORKNOEXEC)
- stats->ac_flag |= AFORK;
- }
+ if (thread_group_leader(tsk) && (tsk->flags & PF_FORKNOEXEC))
+ stats->ac_flag |= AFORK;
if (tsk->flags & PF_SUPERPRIV)
stats->ac_flag |= ASU;
if (tsk->flags & PF_DUMPCORE)
--
2.29.2

2022-01-03 21:33:54

by Eric W. Biederman

Subject: [PATCH 07/17] ptrace: Remove unused regs argument from ptrace_report_syscall

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/tracehook.h | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 2564b7434b4d..88c007ab5ebc 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -54,8 +54,7 @@ struct linux_binprm;
/*
* ptrace report for syscall entry and exit looks identical.
*/
-static inline int ptrace_report_syscall(struct pt_regs *regs,
- unsigned long message)
+static inline int ptrace_report_syscall(unsigned long message)
{
int ptrace = current->ptrace;

@@ -102,7 +101,7 @@ static inline int ptrace_report_syscall(struct pt_regs *regs,
static inline __must_check int tracehook_report_syscall_entry(
struct pt_regs *regs)
{
- return ptrace_report_syscall(regs, PTRACE_EVENTMSG_SYSCALL_ENTRY);
+ return ptrace_report_syscall(PTRACE_EVENTMSG_SYSCALL_ENTRY);
}

/**
@@ -127,7 +126,7 @@ static inline void tracehook_report_syscall_exit(struct pt_regs *regs, int step)
if (step)
user_single_step_report(regs);
else
- ptrace_report_syscall(regs, PTRACE_EVENTMSG_SYSCALL_EXIT);
+ ptrace_report_syscall(PTRACE_EVENTMSG_SYSCALL_EXIT);
}

/**
--
2.29.2

2022-01-03 21:33:58

by Eric W. Biederman

Subject: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

The generic function ptrace_report_syscall does a little more
than syscall_trace on m68k. The function ptrace_report_syscall
stops early if PT_PTRACED is not set, it sets ptrace_message,
and returns the result of fatal_signal_pending.

Setting ptrace_message to a passed in value of 0 is effectively not
setting ptrace_message, making that additional work a noop.

Returning the result of fatal_signal_pending and letting the caller
ignore the result becomes a noop in this change.

When a process is ptraced, the flag PT_PTRACED is always set in
current->ptrace. Testing for PT_PTRACED in ptrace_report_syscall is
just an optimization to fail early if the process is not ptraced.
Later on in ptrace_notify, ptrace_stop will test current->ptrace under
tasklist_lock and skip performing any work if the task is not ptraced.
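
Seen from the tracer's side, the stop code built here (SIGTRAP, with 0x80
or'ed in when PT_TRACESYSGOOD is set) is what ptrace(2) documents for
PTRACE_O_TRACESYSGOOD. A minimal sketch of how a tracer might recognise
it in a wait status follows. The helper name is_syscall_stop is made up
for this example.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/wait.h>

/*
 * True when a waitpid() status describes a syscall stop reported with
 * PTRACE_O_TRACESYSGOOD, i.e. the stop signal is SIGTRAP with bit 7 set.
 */
bool is_syscall_stop(int status)
{
	return WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80);
}
```

A plain SIGTRAP stop does not match, which is how a tracer tells syscall
stops apart from other traps.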

Cc: Geert Uytterhoeven <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
arch/m68k/kernel/ptrace.c | 12 +-----------
1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/arch/m68k/kernel/ptrace.c b/arch/m68k/kernel/ptrace.c
index 94b3b274186d..aa3a0b8d07e9 100644
--- a/arch/m68k/kernel/ptrace.c
+++ b/arch/m68k/kernel/ptrace.c
@@ -273,17 +273,7 @@ long arch_ptrace(struct task_struct *child, long request,

asmlinkage void syscall_trace(void)
{
- ptrace_notify(SIGTRAP | ((current->ptrace & PT_TRACESYSGOOD)
- ? 0x80 : 0));
- /*
- * this isn't the same as continuing with a signal, but it will do
- * for normal use. strace only continues with a signal if the
- * stopping signal is not SIGTRAP. -brl
- */
- if (current->exit_code) {
- send_sig(current->exit_code, current, 1);
- current->exit_code = 0;
- }
+ ptrace_report_syscall(0);
}

#if defined(CONFIG_COLDFIRE) || !defined(CONFIG_MMU)
--
2.29.2

2022-01-03 21:34:02

by Eric W. Biederman

Subject: [PATCH 09/17] ptrace: Move setting/clearing ptrace_message into ptrace_stop

Today ptrace_message is easy to overlook as it is not a core part of
ptrace_stop. It has been overlooked so much that there are places
that set ptrace_message and don't clear it, and places that never set
it. So if you get an unlucky sequence of events the ptracer may be
able to read a ptrace_message that does not apply to the current
ptrace stop.

Move setting of ptrace_message into ptrace_stop so that it always gets
set before the stop, and always gets cleared after the stop. This
prevents nonsense from being reported to userspace and makes
ptrace_message more visible in the ptrace API so that kernel
developers can see it.
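
The set-before, clear-after discipline can be sketched in userspace as
follows. Everything here is a hypothetical model: struct model_task holds
only the one field of interest, and the callback stands in for the task
sleeping until the tracer resumes it.

```c
#include <assert.h>

/* Hypothetical userspace model of the one field this patch touches. */
struct model_task {
	unsigned long ptrace_message;
};

/* Callback standing in for "sleep until the tracer resumes us". */
typedef void (*model_wait_fn)(struct model_task *t, void *ctx);

/*
 * Models the discipline the patch gives ptrace_stop: the message is
 * published before the stop and cleared once the stop is over.
 */
void model_ptrace_stop(struct model_task *t, unsigned long message,
		       model_wait_fn wait_for_tracer, void *ctx)
{
	t->ptrace_message = message;
	wait_for_tracer(t, ctx);
	t->ptrace_message = 0;
}

/* Example tracer: records the message it could read during the stop. */
void model_record_message(struct model_task *t, void *ctx)
{
	*(unsigned long *)ctx = t->ptrace_message;
}
```

Because the stop itself owns the field, a stale value left over from an
earlier event can never be what the tracer reads.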

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/ptrace.h | 5 ++---
include/linux/tracehook.h | 6 ++----
kernel/signal.c | 19 +++++++++++--------
3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index 8aee2945ff08..06f27736c6f8 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -60,7 +60,7 @@ extern int ptrace_writedata(struct task_struct *tsk, char __user *src, unsigned
extern void ptrace_disable(struct task_struct *);
extern int ptrace_request(struct task_struct *child, long request,
unsigned long addr, unsigned long data);
-extern void ptrace_notify(int exit_code);
+extern void ptrace_notify(int exit_code, unsigned long message);
extern void __ptrace_link(struct task_struct *child,
struct task_struct *new_parent,
const struct cred *ptracer_cred);
@@ -155,8 +155,7 @@ static inline bool ptrace_event_enabled(struct task_struct *task, int event)
static inline void ptrace_event(int event, unsigned long message)
{
if (unlikely(ptrace_event_enabled(current, event))) {
- current->ptrace_message = message;
- ptrace_notify((event << 8) | SIGTRAP);
+ ptrace_notify((event << 8) | SIGTRAP, message);
} else if (event == PTRACE_EVENT_EXEC) {
/* legacy EXEC report via SIGTRAP */
if ((current->ptrace & (PT_PTRACED|PT_SEIZED)) == PT_PTRACED)
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 88c007ab5ebc..5e60af8a11fc 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -61,8 +61,7 @@ static inline int ptrace_report_syscall(unsigned long message)
if (!(ptrace & PT_PTRACED))
return 0;

- current->ptrace_message = message;
- ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
+ ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0), message);

/*
* this isn't the same as continuing with a signal, but it will do
@@ -74,7 +73,6 @@ static inline int ptrace_report_syscall(unsigned long message)
current->exit_code = 0;
}

- current->ptrace_message = 0;
return fatal_signal_pending(current);
}

@@ -143,7 +141,7 @@ static inline void tracehook_report_syscall_exit(struct pt_regs *regs, int step)
static inline void tracehook_signal_handler(int stepping)
{
if (stepping)
- ptrace_notify(SIGTRAP);
+ ptrace_notify(SIGTRAP, 0);
}

/**
diff --git a/kernel/signal.c b/kernel/signal.c
index 802acca0207b..75bb062d8534 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2197,7 +2197,8 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
* If we actually decide not to stop at all because the tracer
* is gone, we keep current->exit_code unless clear_code.
*/
-static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t *info)
+static void ptrace_stop(int exit_code, int why, int clear_code,
+ unsigned long message, kernel_siginfo_t *info)
__releases(&current->sighand->siglock)
__acquires(&current->sighand->siglock)
{
@@ -2243,6 +2244,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
*/
smp_wmb();

+ current->ptrace_message = message;
current->last_siginfo = info;
current->exit_code = exit_code;

@@ -2321,6 +2323,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
*/
spin_lock_irq(&current->sighand->siglock);
current->last_siginfo = NULL;
+ current->ptrace_message = 0;

/* LISTENING can be set only during STOP traps, clear it */
current->jobctl &= ~JOBCTL_LISTENING;
@@ -2333,7 +2336,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t
recalc_sigpending_tsk(current);
}

-static void ptrace_do_notify(int signr, int exit_code, int why)
+static void ptrace_do_notify(int signr, int exit_code, int why, unsigned long message)
{
kernel_siginfo_t info;

@@ -2344,17 +2347,17 @@ static void ptrace_do_notify(int signr, int exit_code, int why)
info.si_uid = from_kuid_munged(current_user_ns(), current_uid());

/* Let the debugger run. */
- ptrace_stop(exit_code, why, 1, &info);
+ ptrace_stop(exit_code, why, 1, message, &info);
}

-void ptrace_notify(int exit_code)
+void ptrace_notify(int exit_code, unsigned long message)
{
BUG_ON((exit_code & (0x7f | ~0xffff)) != SIGTRAP);
if (unlikely(current->task_works))
task_work_run();

spin_lock_irq(&current->sighand->siglock);
- ptrace_do_notify(SIGTRAP, exit_code, CLD_TRAPPED);
+ ptrace_do_notify(SIGTRAP, exit_code, CLD_TRAPPED, message);
spin_unlock_irq(&current->sighand->siglock);
}

@@ -2510,10 +2513,10 @@ static void do_jobctl_trap(void)
signr = SIGTRAP;
WARN_ON_ONCE(!signr);
ptrace_do_notify(signr, signr | (PTRACE_EVENT_STOP << 8),
- CLD_STOPPED);
+ CLD_STOPPED, 0);
} else {
WARN_ON_ONCE(!signr);
- ptrace_stop(signr, CLD_STOPPED, 0, NULL);
+ ptrace_stop(signr, CLD_STOPPED, 0, 0, NULL);
current->exit_code = 0;
}
}
@@ -2567,7 +2570,7 @@ static int ptrace_signal(int signr, kernel_siginfo_t *info, enum pid_type type)
* comment in dequeue_signal().
*/
current->jobctl |= JOBCTL_STOP_DEQUEUED;
- ptrace_stop(signr, CLD_TRAPPED, 0, info);
+ ptrace_stop(signr, CLD_TRAPPED, 0, 0, info);

/* We're back. Did the debugger cancel the sig? */
signr = current->exit_code;
--
2.29.2

2022-01-03 21:34:06

by Eric W. Biederman

Subject: [PATCH 10/17] ptrace: Return the signal to continue with from ptrace_stop

The signal a task should continue with after a ptrace stop is
inconsistently read, cleared, and sent. Solve this by reading and
clearing the signal to be sent in ptrace_stop.

In an ideal world everything except ptrace_signal would share a common
implementation of continuing with the signal, so ptracers could count
on the signal they ask to continue with actually being delivered. For
now retain bug compatibility and just return with the signal number
the ptracer requested the code continue with.
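
The read-and-clear pattern this patch centralises can be sketched in a
few lines of userspace C. The names model_task and
model_take_resume_signal are hypothetical; the real code does this inside
ptrace_stop while holding siglock.

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical userspace model; exit_code doubles as the resume signal. */
struct model_task {
	int exit_code;
};

/*
 * Read the signal the tracer asked to continue with and clear it in the
 * same place, so callers never have to touch the field themselves.
 */
int model_take_resume_signal(struct model_task *t)
{
	int signr = t->exit_code;

	t->exit_code = 0;
	return signr;
}
```

Callers then act on the return value alone, which is the shape
ptrace_report_syscall and ptrace_signal take in the diff.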

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/ptrace.h | 2 +-
include/linux/tracehook.h | 10 +++++-----
kernel/signal.c | 31 ++++++++++++++++++-------------
3 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index 06f27736c6f8..323c9950e705 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -60,7 +60,7 @@ extern int ptrace_writedata(struct task_struct *tsk, char __user *src, unsigned
extern void ptrace_disable(struct task_struct *);
extern int ptrace_request(struct task_struct *child, long request,
unsigned long addr, unsigned long data);
-extern void ptrace_notify(int exit_code, unsigned long message);
+extern int ptrace_notify(int exit_code, unsigned long message);
extern void __ptrace_link(struct task_struct *child,
struct task_struct *new_parent,
const struct cred *ptracer_cred);
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 5e60af8a11fc..2fd0bfe866c0 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -57,21 +57,21 @@ struct linux_binprm;
static inline int ptrace_report_syscall(unsigned long message)
{
int ptrace = current->ptrace;
+ int signr;

if (!(ptrace & PT_PTRACED))
return 0;

- ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0), message);
+ signr = ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0),
+ message);

/*
* this isn't the same as continuing with a signal, but it will do
* for normal use. strace only continues with a signal if the
* stopping signal is not SIGTRAP. -brl
*/
- if (current->exit_code) {
- send_sig(current->exit_code, current, 1);
- current->exit_code = 0;
- }
+ if (signr)
+ send_sig(signr, current, 1);

return fatal_signal_pending(current);
}
diff --git a/kernel/signal.c b/kernel/signal.c
index 75bb062d8534..9903ff12e581 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2194,15 +2194,17 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
* That makes it a way to test a stopped process for
* being ptrace-stopped vs being job-control-stopped.
*
- * If we actually decide not to stop at all because the tracer
- * is gone, we keep current->exit_code unless clear_code.
+ * Returns the signal the ptracer requested the code resume
+ * with. If the code did not stop because the tracer is gone,
+ * the stop signal remains unchanged unless clear_code.
*/
-static void ptrace_stop(int exit_code, int why, int clear_code,
+static int ptrace_stop(int exit_code, int why, int clear_code,
unsigned long message, kernel_siginfo_t *info)
__releases(&current->sighand->siglock)
__acquires(&current->sighand->siglock)
{
bool gstop_done = false;
+ bool read_code = true;

if (arch_ptrace_stop_needed()) {
/*
@@ -2311,8 +2313,9 @@ static void ptrace_stop(int exit_code, int why, int clear_code,

/* tasklist protects us from ptrace_freeze_traced() */
__set_current_state(TASK_RUNNING);
+ read_code = false;
if (clear_code)
- current->exit_code = 0;
+ exit_code = 0;
read_unlock(&tasklist_lock);
}

@@ -2322,8 +2325,10 @@ static void ptrace_stop(int exit_code, int why, int clear_code,
* any signal-sending on another CPU that wants to examine it.
*/
spin_lock_irq(&current->sighand->siglock);
+ if (read_code) exit_code = current->exit_code;
current->last_siginfo = NULL;
current->ptrace_message = 0;
+ current->exit_code = 0;

/* LISTENING can be set only during STOP traps, clear it */
current->jobctl &= ~JOBCTL_LISTENING;
@@ -2334,9 +2339,10 @@ static void ptrace_stop(int exit_code, int why, int clear_code,
* This sets TIF_SIGPENDING, but never clears it.
*/
recalc_sigpending_tsk(current);
+ return exit_code;
}

-static void ptrace_do_notify(int signr, int exit_code, int why, unsigned long message)
+static int ptrace_do_notify(int signr, int exit_code, int why, unsigned long message)
{
kernel_siginfo_t info;

@@ -2347,18 +2353,21 @@ static void ptrace_do_notify(int signr, int exit_code, int why, unsigned long me
info.si_uid = from_kuid_munged(current_user_ns(), current_uid());

/* Let the debugger run. */
- ptrace_stop(exit_code, why, 1, message, &info);
+ return ptrace_stop(exit_code, why, 1, message, &info);
}

-void ptrace_notify(int exit_code, unsigned long message)
+int ptrace_notify(int exit_code, unsigned long message)
{
+ int signr;
+
BUG_ON((exit_code & (0x7f | ~0xffff)) != SIGTRAP);
if (unlikely(current->task_works))
task_work_run();

spin_lock_irq(&current->sighand->siglock);
- ptrace_do_notify(SIGTRAP, exit_code, CLD_TRAPPED, message);
+ signr = ptrace_do_notify(SIGTRAP, exit_code, CLD_TRAPPED, message);
spin_unlock_irq(&current->sighand->siglock);
+ return signr;
}

/**
@@ -2517,7 +2526,6 @@ static void do_jobctl_trap(void)
} else {
WARN_ON_ONCE(!signr);
ptrace_stop(signr, CLD_STOPPED, 0, 0, NULL);
- current->exit_code = 0;
}
}

@@ -2570,15 +2578,12 @@ static int ptrace_signal(int signr, kernel_siginfo_t *info, enum pid_type type)
* comment in dequeue_signal().
*/
current->jobctl |= JOBCTL_STOP_DEQUEUED;
- ptrace_stop(signr, CLD_TRAPPED, 0, 0, info);
+ signr = ptrace_stop(signr, CLD_TRAPPED, 0, 0, info);

/* We're back. Did the debugger cancel the sig? */
- signr = current->exit_code;
if (signr == 0)
return signr;

- current->exit_code = 0;
-
/*
* Update the siginfo structure if the signal has
* changed. If the debugger wanted something
--
2.29.2
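
A rough userspace sketch of the calling convention this patch moves to,
in case it helps review: ptrace_stop hands the tracer's answer back as a
return value instead of leaving it in current->exit_code for each caller
to read and clear. Everything here (struct model_task, model_ptrace_stop,
the tracer_present and tracer_reply parameters) is a hypothetical model,
not kernel code; the sleep and all locking are left out.

```c
#include <stdbool.h>

/* Hypothetical userspace model; not kernel code. */
struct model_task {
	int exit_code;	/* stands in for current->exit_code */
};

/*
 * tracer_present: whether there is a tracer to stop for.
 * tracer_reply:   the value the tracer stores when it resumes the task.
 */
int model_ptrace_stop(struct model_task *t, int code, bool clear_code,
		      bool tracer_present, int tracer_reply)
{
	bool read_code = true;

	t->exit_code = code;
	if (tracer_present) {
		/* The real code sleeps here; the tracer overwrites the field. */
		t->exit_code = tracer_reply;
	} else {
		/* No stop happened, so there is nothing to read back. */
		read_code = false;
		if (clear_code)
			code = 0;
	}

	if (read_code)
		code = t->exit_code;
	t->exit_code = 0;	/* the field is always cleared on the way out */
	return code;
}
```

With this shape a caller such as ptrace_signal only needs the return
value, which is why the later hunks can drop the reads and resets of
current->exit_code.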

2022-01-03 21:34:17

by Eric W. Biederman

Subject: [PATCH 11/17] ptrace: Separate task->ptrace_code out from task->exit_code

A process can be marked for death by setting SIGNAL_GROUP_EXIT and
group_exit_code, long before do_exit is called. Unfortunately because
of PTRACE_EVENT_EXIT residing in do_exit this same tactic can not be
used for task death.

Correct this by adding a new task field task->ptrace_code that holds
the code for ptrace stops. This allows task->exit_code to be set to
the exit code long before the PTRACE_EVENT_EXIT ptrace stop.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/proc/array.c | 3 +++
include/linux/sched.h | 1 +
kernel/exit.c | 2 +-
kernel/ptrace.c | 12 ++++++------
kernel/signal.c | 18 +++++++++---------
5 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 43a7abde9e42..3042015c11ad 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -519,6 +519,9 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
cgtime = sig->cgtime;
rsslim = READ_ONCE(sig->rlim[RLIMIT_RSS].rlim_cur);

+ if (task_is_traced(task) && !(task->jobctl & JOBCTL_LISTENING))
+ exit_code = task->ptrace_code;
+
/* add up live thread stats at the group level */
if (whole) {
struct task_struct *t = task;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 52f2fdffa3ab..c3d732bf7833 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1174,6 +1174,7 @@ struct task_struct {
/* Ptrace state: */
unsigned long ptrace_message;
kernel_siginfo_t *last_siginfo;
+ int ptrace_code;

struct task_io_accounting ioac;
#ifdef CONFIG_PSI
diff --git a/kernel/exit.c b/kernel/exit.c
index 7121db37c411..aedefe5eb0eb 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1134,7 +1134,7 @@ static int *task_stopped_code(struct task_struct *p, bool ptrace)
{
if (ptrace) {
if (task_is_traced(p) && !(p->jobctl & JOBCTL_LISTENING))
- return &p->exit_code;
+ return &p->ptrace_code;
} else {
if (p->signal->flags & SIGNAL_STOP_STOPPED)
return &p->signal->group_exit_code;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index eea265082e97..8bbd73ab9a34 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -172,7 +172,7 @@ void __ptrace_unlink(struct task_struct *child)

static bool looks_like_a_spurious_pid(struct task_struct *task)
{
- if (task->exit_code != ((PTRACE_EVENT_EXEC << 8) | SIGTRAP))
+ if (task->ptrace_code != ((PTRACE_EVENT_EXEC << 8) | SIGTRAP))
return false;

if (task_pid_vnr(task) == task->ptrace_message)
@@ -573,7 +573,7 @@ static int ptrace_detach(struct task_struct *child, unsigned int data)
* tasklist_lock avoids the race with wait_task_stopped(), see
* the comment in ptrace_resume().
*/
- child->exit_code = data;
+ child->ptrace_code = data;
__ptrace_detach(current, child);
write_unlock_irq(&tasklist_lock);

@@ -863,11 +863,11 @@ static int ptrace_resume(struct task_struct *child, long request,
}

/*
- * Change ->exit_code and ->state under siglock to avoid the race
- * with wait_task_stopped() in between; a non-zero ->exit_code will
+ * Change ->ptrace_code and ->state under siglock to avoid the race
+ * with wait_task_stopped() in between; a non-zero ->ptrace_code will
* wrongly look like another report from tracee.
*
- * Note that we need siglock even if ->exit_code == data and/or this
+ * Note that we need siglock even if ->ptrace_code == data and/or this
* status was not reported yet, the new status must not be cleared by
* wait_task_stopped() after resume.
*
@@ -878,7 +878,7 @@ static int ptrace_resume(struct task_struct *child, long request,
need_siglock = data && !thread_group_empty(current);
if (need_siglock)
spin_lock_irq(&child->sighand->siglock);
- child->exit_code = data;
+ child->ptrace_code = data;
wake_up_state(child, __TASK_TRACED);
if (need_siglock)
spin_unlock_irq(&child->sighand->siglock);
diff --git a/kernel/signal.c b/kernel/signal.c
index 9903ff12e581..fd3c404de8b6 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2168,7 +2168,7 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
info.si_status = tsk->signal->group_exit_code & 0x7f;
break;
case CLD_TRAPPED:
- info.si_status = tsk->exit_code & 0x7f;
+ info.si_status = tsk->ptrace_code & 0x7f;
break;
default:
BUG();
@@ -2198,7 +2198,7 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
* with. If the code did not stop because the tracer is gone,
* the stop signal remains unchanged unless clear_code.
*/
-static int ptrace_stop(int exit_code, int why, int clear_code,
+static int ptrace_stop(int code, int why, int clear_code,
unsigned long message, kernel_siginfo_t *info)
__releases(&current->sighand->siglock)
__acquires(&current->sighand->siglock)
@@ -2248,7 +2248,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,

current->ptrace_message = message;
current->last_siginfo = info;
- current->exit_code = exit_code;
+ current->ptrace_code = code;

/*
* If @why is CLD_STOPPED, we're trapping to participate in a group
@@ -2315,7 +2315,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
__set_current_state(TASK_RUNNING);
read_code = false;
if (clear_code)
- exit_code = 0;
+ code = 0;
read_unlock(&tasklist_lock);
}

@@ -2325,10 +2325,10 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
* any signal-sending on another CPU that wants to examine it.
*/
spin_lock_irq(&current->sighand->siglock);
- if (read_code) exit_code = current->exit_code;
+ if (read_code) code = current->ptrace_code;
current->last_siginfo = NULL;
current->ptrace_message = 0;
- current->exit_code = 0;
+ current->ptrace_code = 0;

/* LISTENING can be set only during STOP traps, clear it */
current->jobctl &= ~JOBCTL_LISTENING;
@@ -2339,7 +2339,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
* This sets TIF_SIGPENDING, but never clears it.
*/
recalc_sigpending_tsk(current);
- return exit_code;
+ return code;
}

static int ptrace_do_notify(int signr, int exit_code, int why, unsigned long message)
@@ -2501,11 +2501,11 @@ static bool do_signal_stop(int signr)
*
* When PT_SEIZED, it's used for both group stop and explicit
* SEIZE/INTERRUPT traps. Both generate PTRACE_EVENT_STOP trap with
- * accompanying siginfo. If stopped, lower eight bits of exit_code contain
+ * accompanying siginfo. If stopped, lower eight bits of ptrace_code contain
* the stop signal; otherwise, %SIGTRAP.
*
* When !PT_SEIZED, it's used only for group stop trap with stop signal
- * number as exit_code and no siginfo.
+ * number as ptrace_code and no siginfo.
*
* CONTEXT:
* Must be called with @current->sighand->siglock held, which may be
--
2.29.2
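
For readers less familiar with the stop codes, here is a small standalone
sketch of the split this patch makes. The code reported at a ptrace stop
packs an event number above a signal number, as in the
(PTRACE_EVENT_EXEC << 8) | SIGTRAP test in looks_like_a_spurious_pid, and
now lives in its own field. The struct and the model_* helpers are
hypothetical; the two constants are redefined locally with the values
the Linux UAPI headers use, so that the sketch builds anywhere.

```c
/* Values as in the Linux UAPI headers, redefined to keep this standalone. */
#define MODEL_SIGTRAP		5
#define MODEL_PTRACE_EVENT_EXEC	4

/* Hypothetical userspace model of the two fields; not kernel code. */
struct model_task {
	int exit_code;		/* reserved for the real exit status */
	int ptrace_code;	/* what the tracer sees while the task is stopped */
};

/* Pack an event and a signal the way ptrace stop codes are built. */
int model_event_code(int event, int sig)
{
	return (event << 8) | sig;
}

/* The signal part, as used for si_status in do_notify_parent_cldstop. */
int model_stop_signal(int code)
{
	return code & 0x7f;
}

/* The event part: simply the inverse of model_event_code. */
int model_stop_event(int code)
{
	return (code >> 8) & 0xff;
}

/* Reporting a stop touches only ptrace_code. */
void model_report_stop(struct model_task *t, int code)
{
	t->ptrace_code = code;
}
```

The point of the separate field is visible in model_report_stop: an exit
code that was recorded earlier survives a later ptrace stop.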

2022-01-03 21:34:19

by Eric W. Biederman

Subject: [PATCH 12/17] signal: Compute the process exit_code in get_signal

In preparation for moving the work of sys_exit and sys_group_exit into
get_signal, compute exit_code in get_signal, make PF_SIGNALED depend on
the exit_code, and pass the exit_code to do_group_exit.

Anytime there is a group exit the exit_code may differ from the signal
number.

To match the historical precedent as best I can, make the exit_code 0
during exec. (The exit_code field would not have been set, but it
probably would have been left at a value of 0.)

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/signal.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index fd3c404de8b6..2a24cca00ca1 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2707,6 +2707,7 @@ bool get_signal(struct ksignal *ksig)
for (;;) {
struct k_sigaction *ka;
enum pid_type type;
+ int exit_code;

/* Has this task already been marked for death? */
if ((signal->flags & SIGNAL_GROUP_EXIT) ||
@@ -2716,6 +2717,10 @@ bool get_signal(struct ksignal *ksig)
trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
&sighand->action[SIGKILL - 1]);
recalc_sigpending();
+ if (signal->flags & SIGNAL_GROUP_EXIT)
+ exit_code = signal->group_exit_code;
+ else
+ exit_code = 0;
goto fatal;
}

@@ -2837,15 +2842,17 @@ bool get_signal(struct ksignal *ksig)
continue;
}

+ /*
+ * Anything else is fatal, maybe with a core dump.
+ */
+ exit_code = signr;
fatal:
spin_unlock_irq(&sighand->siglock);
if (unlikely(cgroup_task_frozen(current)))
cgroup_leave_frozen(true);

- /*
- * Anything else is fatal, maybe with a core dump.
- */
- current->flags |= PF_SIGNALED;
+ if (exit_code & 0x7f)
+ current->flags |= PF_SIGNALED;

if (sig_kernel_coredump(signr)) {
if (print_fatal_signals)
@@ -2873,7 +2880,7 @@ bool get_signal(struct ksignal *ksig)
/*
* Death signals, no core dump.
*/
- do_group_exit(ksig->info.si_signo);
+ do_group_exit(exit_code);
/* NOTREACHED */
}
spin_unlock_irq(&sighand->siglock);
--
2.29.2
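
A minimal sketch of the two decisions this patch adds to get_signal,
written as plain userspace C. The names (struct model_signal,
model_marked_exit_code, model_pf_signaled) are made up for illustration,
and the 3 << 8 value in the usage note assumes the usual convention that
exit_group(3) is stored as (3 & 0xff) << 8.

```c
#include <stdbool.h>

#define MODEL_SIGKILL 9	/* same value as SIGKILL on Linux */

/* Hypothetical stand-in for the two signal_struct fields involved. */
struct model_signal {
	bool group_exit;	/* SIGNAL_GROUP_EXIT is set */
	int  group_exit_code;
};

/* Exit code for a task that was already marked for death. */
int model_marked_exit_code(const struct model_signal *sig)
{
	/* During exec there is no group exit, so the code is 0. */
	return sig->group_exit ? sig->group_exit_code : 0;
}

/* PF_SIGNALED is set only if the low seven bits hold a signal number. */
bool model_pf_signaled(int exit_code)
{
	return (exit_code & 0x7f) != 0;
}
```

So a thread torn down because a sibling called exit_group(3) carries
0x300 and is not flagged as signaled, while one killed by SIGKILL
carries 9 and is.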

2022-01-03 21:34:22

by Eric W. Biederman

Subject: [PATCH 13/17] signal: Make individual tasks exiting a first class concept

Add a helper schedule_task_exit_locked that is equivalent to
asynchronously calling exit(2) except for not having an exit code.

This is a generalization of what happens in de_thread, zap_process,
prepare_signal, complete_signal, and zap_other_threads when individual
tasks are asked to shutdown.

The various code paths optimize away the calls to sigaddset and
signal_wake_up under different conditions. Neither sigaddset nor
signal_wake_up is needed if the task has already started running
do_exit, so skip the work if PF_POSTCOREDUMP is set, which is the
earliest point any of the original hand-rolled implementations used.

Update get_signal to detect either a signal group exit or a single task
exit by testing for __fatal_signal_pending. This works because all of
the tasks in group exits are killed with schedule_task_exit_locked.

For clarity the code in get_signal has been updated to call do_exit
instead of do_group_exit when a single task is exiting.

While schedule_task_exit_locked is a generalization of what happens in
prepare_signal, I do not change prepare_signal to use
schedule_task_exit_locked to deliver SIGKILL to a coredumping process.
This keeps all of the specialness of delivering a signal to a
coredumping process limited to prepare_signal and the coredump code
itself.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 7 ++-----
include/linux/sched/signal.h | 2 ++
kernel/signal.c | 36 +++++++++++++++++++++---------------
3 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 09302a6a0d80..9559e29daada 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -358,12 +358,9 @@ static int zap_process(struct task_struct *start, int exit_code)
start->signal->group_stop_count = 0;

for_each_thread(start, t) {
- task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
- if (t != current && !(t->flags & PF_POSTCOREDUMP)) {
- sigaddset(&t->pending.signal, SIGKILL);
- signal_wake_up(t, 1);
+ schedule_task_exit_locked(t);
+ if (t != current && !(t->flags & PF_POSTCOREDUMP))
nr++;
- }
}

return nr;
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index b6ecb9fc4cd2..7c62b7c29cc0 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -427,6 +427,8 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
signal_wake_up_state(t, resume ? __TASK_TRACED : 0);
}

+void schedule_task_exit_locked(struct task_struct *task);
+
void task_join_group_stop(struct task_struct *task);

#ifdef TIF_RESTORE_SIGMASK
diff --git a/kernel/signal.c b/kernel/signal.c
index 2a24cca00ca1..cbfb9020368e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1056,9 +1056,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
signal->group_stop_count = 0;
t = p;
do {
- task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
- sigaddset(&t->pending.signal, SIGKILL);
- signal_wake_up(t, 1);
+ schedule_task_exit_locked(t);
} while_each_thread(p, t);
return;
}
@@ -1363,6 +1361,16 @@ int force_sig_info(struct kernel_siginfo *info)
return force_sig_info_to_task(info, current, HANDLER_CURRENT);
}

+void schedule_task_exit_locked(struct task_struct *task)
+{
+ task_clear_jobctl_pending(task, JOBCTL_PENDING_MASK);
+ /* Only bother with threads that might be alive */
+ if (!(task->flags & PF_POSTCOREDUMP)) {
+ sigaddset(&task->pending.signal, SIGKILL);
+ signal_wake_up(task, 1);
+ }
+}
+
/*
* Nuke all other threads in the group.
*/
@@ -1374,16 +1382,9 @@ int zap_other_threads(struct task_struct *p)
p->signal->group_stop_count = 0;

while_each_thread(p, t) {
- task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
count++;
-
- /* Don't bother with already dead threads */
- if (t->exit_state)
- continue;
- sigaddset(&t->pending.signal, SIGKILL);
- signal_wake_up(t, 1);
+ schedule_task_exit_locked(t);
}
-
return count;
}

@@ -2706,12 +2707,12 @@ bool get_signal(struct ksignal *ksig)

for (;;) {
struct k_sigaction *ka;
+ bool group_exit = true;
enum pid_type type;
int exit_code;

/* Has this task already been marked for death? */
- if ((signal->flags & SIGNAL_GROUP_EXIT) ||
- signal->group_exec_task) {
+ if (__fatal_signal_pending(current)) {
ksig->info.si_signo = signr = SIGKILL;
sigdelset(&current->pending.signal, SIGKILL);
trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
@@ -2719,8 +2720,10 @@ bool get_signal(struct ksignal *ksig)
recalc_sigpending();
if (signal->flags & SIGNAL_GROUP_EXIT)
exit_code = signal->group_exit_code;
- else
+ else {
exit_code = 0;
+ group_exit = false;
+ }
goto fatal;
}

@@ -2880,7 +2883,10 @@ bool get_signal(struct ksignal *ksig)
/*
* Death signals, no core dump.
*/
- do_group_exit(exit_code);
+ if (group_exit)
+ do_group_exit(exit_code);
+ else
+ do_exit(exit_code);
/* NOTREACHED */
}
spin_unlock_irq(&sighand->siglock);
--
2.29.2
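
The helper and the new branch in get_signal can be modelled in a few
lines of userspace C. This is only a sketch under stated assumptions:
struct model_task, the woken flag, and the MODEL_PF_POSTCOREDUMP bit
value are invented for the example, and the real get_signal also removes
SIGKILL from the pending set, which the model skips.

```c
#include <stdbool.h>

#define MODEL_SIGKILL		9
#define MODEL_PF_POSTCOREDUMP	0x1u	/* placeholder bit, not the kernel's value */

enum model_exit_kind { MODEL_NO_EXIT, MODEL_TASK_EXIT, MODEL_GROUP_EXIT };

/* Hypothetical userspace model of the task state involved. */
struct model_task {
	unsigned int  flags;
	unsigned long pending;		/* one bit per signal number */
	unsigned long jobctl_pending;	/* stands in for JOBCTL_PENDING_MASK bits */
	bool          woken;		/* stands in for signal_wake_up() */
};

void model_schedule_task_exit(struct model_task *t)
{
	t->jobctl_pending = 0;
	/* Only bother with threads that might be alive. */
	if (!(t->flags & MODEL_PF_POSTCOREDUMP)) {
		t->pending |= 1UL << (MODEL_SIGKILL - 1);
		t->woken = true;
	}
}

bool model_fatal_signal_pending(const struct model_task *t)
{
	return ((t->pending >> (MODEL_SIGKILL - 1)) & 1UL) != 0;
}

/* The choice get_signal makes once the task is marked for death. */
enum model_exit_kind model_get_signal(const struct model_task *t,
				      bool signal_group_exit)
{
	if (!model_fatal_signal_pending(t))
		return MODEL_NO_EXIT;
	return signal_group_exit ? MODEL_GROUP_EXIT : MODEL_TASK_EXIT;
}
```

The per-thread SIGKILL bit does double duty here, as both the wakeup
reason and the "this task must exit" marker.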

2022-01-03 21:34:28

by Eric W. Biederman

Subject: [PATCH 14/17] signal: Remove zap_other_threads

The two callers of zap_other_threads want different things. The
function do_group_exit wants to set the exit code and it does not care
about the number of threads exiting. In de_thread the current thread
is not exiting so there is not really an exit code.

Since schedule_task_exit_locked factors out the tricky bits, stop
sharing the loop in zap_other_threads between de_thread and
do_group_exit.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/exec.c | 12 +++++++++---
include/linux/sched/signal.h | 1 -
kernel/exit.c | 9 ++++++++-
kernel/signal.c | 17 -----------------
4 files changed, 17 insertions(+), 22 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 82db656ca709..b9f646fddc51 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1037,6 +1037,7 @@ static int de_thread(struct task_struct *tsk)
struct signal_struct *sig = tsk->signal;
struct sighand_struct *oldsighand = tsk->sighand;
spinlock_t *lock = &oldsighand->siglock;
+ struct task_struct *t;

if (thread_group_empty(tsk))
goto no_thread_group;
@@ -1055,9 +1056,14 @@ static int de_thread(struct task_struct *tsk)
}

sig->group_exec_task = tsk;
- sig->notify_count = zap_other_threads(tsk);
- if (!thread_group_leader(tsk))
- sig->notify_count--;
+ sig->group_stop_count = 0;
+ sig->notify_count = 0;
+ __for_each_thread(sig, t) {
+ if (t == tsk)
+ continue;
+ sig->notify_count++;
+ schedule_task_exit_locked(t);
+ }

while (sig->notify_count) {
__set_current_state(TASK_KILLABLE);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 7c62b7c29cc0..eed54f9ea2fc 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -343,7 +343,6 @@ extern void force_sig(int);
extern void force_fatal_sig(int);
extern void force_exit_sig(int);
extern int send_sig(int, struct task_struct *, int);
-extern int zap_other_threads(struct task_struct *p);
extern struct sigqueue *sigqueue_alloc(void);
extern void sigqueue_free(struct sigqueue *);
extern int send_sigqueue(struct sigqueue *, struct pid *, enum pid_type);
diff --git a/kernel/exit.c b/kernel/exit.c
index aedefe5eb0eb..27bc0ccfea78 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -918,9 +918,16 @@ do_group_exit(int exit_code)
else if (sig->group_exec_task)
exit_code = 0;
else {
+ struct task_struct *t;
+
sig->group_exit_code = exit_code;
sig->flags = SIGNAL_GROUP_EXIT;
- zap_other_threads(current);
+ sig->group_stop_count = 0;
+ __for_each_thread(sig, t) {
+ if (t == current)
+ continue;
+ schedule_task_exit_locked(t);
+ }
}
spin_unlock_irq(&sighand->siglock);
}
diff --git a/kernel/signal.c b/kernel/signal.c
index cbfb9020368e..b0201e05be40 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1371,23 +1371,6 @@ void schedule_task_exit_locked(struct task_struct *task)
}
}

-/*
- * Nuke all other threads in the group.
- */
-int zap_other_threads(struct task_struct *p)
-{
- struct task_struct *t = p;
- int count = 0;
-
- p->signal->group_stop_count = 0;
-
- while_each_thread(p, t) {
- count++;
- schedule_task_exit_locked(t);
- }
- return count;
-}
-
struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
unsigned long *flags)
{
--
2.29.2

2022-01-03 21:34:31

by Eric W. Biederman

Subject: [PATCH 16/17] signal: Record the exit_code when an exit is scheduled

With ptrace_stop no longer using task->exit_code it is safe
to set task->exit_code when an exit is scheduled.

Use the bit JOBCTL_WILL_EXIT to detect when an exit is first scheduled
and only set exit_code the first time. Only use the code provided
to do_exit if the task has not yet been scheduled to exit.

In get_signal and do_group_exit, when JOBCTL_WILL_EXIT is set, read the
recorded exit_code from current->exit_code instead of assuming
exit_code will always be 0.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 2 +-
fs/exec.c | 2 +-
include/linux/sched/signal.h | 2 +-
kernel/exit.c | 12 ++++++++----
kernel/signal.c | 7 ++++---
5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 4e82ee51633d..c54b502bf648 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -357,7 +357,7 @@ static int zap_process(struct task_struct *start, int exit_code)
start->signal->group_stop_count = 0;

for_each_thread(start, t) {
- schedule_task_exit_locked(t);
+ schedule_task_exit_locked(t, exit_code);
if (t != current && !(t->flags & PF_POSTCOREDUMP))
nr++;
}
diff --git a/fs/exec.c b/fs/exec.c
index b9f646fddc51..3203605e54cb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1062,7 +1062,7 @@ static int de_thread(struct task_struct *tsk)
if (t == tsk)
continue;
sig->notify_count++;
- schedule_task_exit_locked(t);
+ schedule_task_exit_locked(t, 0);
}

while (sig->notify_count) {
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 989bb665f107..e8034ecaee84 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -426,7 +426,7 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
signal_wake_up_state(t, resume ? __TASK_TRACED : 0);
}

-void schedule_task_exit_locked(struct task_struct *task);
+void schedule_task_exit_locked(struct task_struct *task, int exit_code);

void task_join_group_stop(struct task_struct *task);

diff --git a/kernel/exit.c b/kernel/exit.c
index 7a7a0cbac28e..e95500e2d27c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -735,6 +735,11 @@ void __noreturn do_exit(long code)
struct task_struct *tsk = current;
int group_dead;

+ spin_lock_irq(&tsk->sighand->siglock);
+ schedule_task_exit_locked(tsk, code);
+ code = tsk->exit_code;
+ spin_unlock_irq(&tsk->sighand->siglock);
+
WARN_ON(blk_needs_flush_plug(tsk));

kcov_task_exit(tsk);
@@ -773,7 +778,6 @@ void __noreturn do_exit(long code)
tty_audit_exit();
audit_free(tsk);

- tsk->exit_code = code;
taskstats_exit(tsk, group_dead);

exit_mm();
@@ -907,7 +911,7 @@ do_group_exit(int exit_code)
if (sig->flags & SIGNAL_GROUP_EXIT)
exit_code = sig->group_exit_code;
else if (current->jobctl & JOBCTL_WILL_EXIT)
- exit_code = 0;
+ exit_code = current->exit_code;
else if (!thread_group_empty(current)) {
struct sighand_struct *const sighand = current->sighand;

@@ -916,7 +920,7 @@ do_group_exit(int exit_code)
/* Another thread got here before we took the lock. */
exit_code = sig->group_exit_code;
else if (current->jobctl & JOBCTL_WILL_EXIT)
- exit_code = 0;
+ exit_code = current->exit_code;
else {
struct task_struct *t;

@@ -926,7 +930,7 @@ do_group_exit(int exit_code)
__for_each_thread(sig, t) {
if (t == current)
continue;
- schedule_task_exit_locked(t);
+ schedule_task_exit_locked(t, exit_code);
}
}
spin_unlock_irq(&sighand->siglock);
diff --git a/kernel/signal.c b/kernel/signal.c
index 6179e34ce666..e8fac8a3c935 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1057,7 +1057,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
signal->group_stop_count = 0;
t = p;
do {
- schedule_task_exit_locked(t);
+ schedule_task_exit_locked(t, sig);
} while_each_thread(p, t);
return;
}
@@ -1362,11 +1362,12 @@ int force_sig_info(struct kernel_siginfo *info)
return force_sig_info_to_task(info, current, HANDLER_CURRENT);
}

-void schedule_task_exit_locked(struct task_struct *task)
+void schedule_task_exit_locked(struct task_struct *task, int exit_code)
{
if (!(task->jobctl & JOBCTL_WILL_EXIT)) {
task_clear_jobctl_pending(task, JOBCTL_PENDING_MASK);
task->jobctl |= JOBCTL_WILL_EXIT;
+ task->exit_code = exit_code;
signal_wake_up(task, 1);
}
}
@@ -2703,7 +2704,7 @@ bool get_signal(struct ksignal *ksig)
if (signal->flags & SIGNAL_GROUP_EXIT)
exit_code = signal->group_exit_code;
else {
- exit_code = 0;
+ exit_code = current->exit_code;
group_exit = false;
}
goto fatal;
--
2.29.2
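
The "first scheduled exit wins" rule from this patch, reduced to a
userspace sketch. struct model_task and the model_* functions are
hypothetical; only the bit position of JOBCTL_WILL_EXIT is taken from
the series, and locking and wakeups are omitted.

```c
#define MODEL_JOBCTL_WILL_EXIT (1UL << 31)	/* bit 31, as in jobctl.h */

/* Hypothetical userspace model; not kernel code. */
struct model_task {
	unsigned long jobctl;
	int exit_code;
};

/* Record the exit code only the first time an exit is scheduled. */
void model_schedule_task_exit(struct model_task *t, int exit_code)
{
	if (!(t->jobctl & MODEL_JOBCTL_WILL_EXIT)) {
		t->jobctl |= MODEL_JOBCTL_WILL_EXIT;
		t->exit_code = exit_code;
	}
}

/*
 * What do_exit does at entry after this patch: offer its own code,
 * then continue with whichever code was recorded first.
 */
int model_do_exit_code(struct model_task *t, int code)
{
	model_schedule_task_exit(t, code);
	return t->exit_code;
}
```

If another thread scheduled the exit with code 9 first, a later
do_exit(0) on that task still reports 9; a task nobody scheduled
reports the code it passed in.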

2022-01-03 21:34:34

by Eric W. Biederman

Subject: [PATCH 15/17] signal: Add JOBCTL_WILL_EXIT to mark exiting tasks

Mark tasks that need to exit with JOBCTL_WILL_EXIT instead of reusing
the per thread SIGKILL.

This removes the double meaning of the per thread SIGKILL and makes it
possible to detect when a task has already been scheduled for exiting
and to skip unnecessary work if the task is already scheduled to exit.

A jobctl flag was chosen for this purpose because jobctl changes are
protected by siglock, and updates are already careful not to change or
clear other bits in jobctl. Protection by a lock when changing the
value is necessary as JOBCTL_WILL_EXIT will not be limited to being
set by the current task. That task->jobctl is protected by siglock is
convenient as siglock is already held everywhere I want to set or reset
JOBCTL_WILL_EXIT.

Teach wants_signal and retarget_shared_pending to use
JOBCTL_WILL_EXIT to detect threads that have an exit pending and so
will not be processing any more signals.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 6 ++++--
include/linux/sched/jobctl.h | 2 ++
include/linux/sched/signal.h | 2 +-
kernel/exit.c | 4 ++--
kernel/signal.c | 19 +++++++++----------
5 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 9559e29daada..4e82ee51633d 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -352,7 +352,6 @@ static int zap_process(struct task_struct *start, int exit_code)
struct task_struct *t;
int nr = 0;

- /* Allow SIGKILL, see prepare_signal() */
start->signal->flags = SIGNAL_GROUP_EXIT;
start->signal->group_exit_code = exit_code;
start->signal->group_stop_count = 0;
@@ -376,9 +375,11 @@ static int zap_threads(struct task_struct *tsk,
if (!(signal->flags & SIGNAL_GROUP_EXIT) && !signal->group_exec_task) {
signal->core_state = core_state;
nr = zap_process(tsk, exit_code);
+ atomic_set(&core_state->nr_threads, nr);
+ /* Allow SIGKILL, see prepare_signal() */
clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
tsk->flags |= PF_DUMPCORE;
- atomic_set(&core_state->nr_threads, nr);
+ tsk->jobctl &= ~JOBCTL_WILL_EXIT;
}
spin_unlock_irq(&tsk->sighand->siglock);
return nr;
@@ -425,6 +426,7 @@ static void coredump_finish(bool core_dumped)
current->signal->group_exit_code |= 0x80;
next = current->signal->core_state->dumper.next;
current->signal->core_state = NULL;
+ current->jobctl |= JOBCTL_WILL_EXIT;
spin_unlock_irq(&current->sighand->siglock);

while ((curr = next) != NULL) {
diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
index fa067de9f1a9..9887d737ccfb 100644
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -19,6 +19,7 @@ struct task_struct;
#define JOBCTL_TRAPPING_BIT 21 /* switching to TRACED */
#define JOBCTL_LISTENING_BIT 22 /* ptracer is listening for events */
#define JOBCTL_TRAP_FREEZE_BIT 23 /* trap for cgroup freezer */
+#define JOBCTL_WILL_EXIT_BIT 31 /* task will exit */

#define JOBCTL_STOP_DEQUEUED (1UL << JOBCTL_STOP_DEQUEUED_BIT)
#define JOBCTL_STOP_PENDING (1UL << JOBCTL_STOP_PENDING_BIT)
@@ -28,6 +29,7 @@ struct task_struct;
#define JOBCTL_TRAPPING (1UL << JOBCTL_TRAPPING_BIT)
#define JOBCTL_LISTENING (1UL << JOBCTL_LISTENING_BIT)
#define JOBCTL_TRAP_FREEZE (1UL << JOBCTL_TRAP_FREEZE_BIT)
+#define JOBCTL_WILL_EXIT (1UL << JOBCTL_WILL_EXIT_BIT)

#define JOBCTL_TRAP_MASK (JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
#define JOBCTL_PENDING_MASK (JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index eed54f9ea2fc..989bb665f107 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -373,7 +373,7 @@ static inline int signal_pending(struct task_struct *p)

static inline int __fatal_signal_pending(struct task_struct *p)
{
- return unlikely(sigismember(&p->pending.signal, SIGKILL));
+ return unlikely(p->jobctl & JOBCTL_WILL_EXIT);
}

static inline int fatal_signal_pending(struct task_struct *p)
diff --git a/kernel/exit.c b/kernel/exit.c
index 27bc0ccfea78..7a7a0cbac28e 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -906,7 +906,7 @@ do_group_exit(int exit_code)

if (sig->flags & SIGNAL_GROUP_EXIT)
exit_code = sig->group_exit_code;
- else if (sig->group_exec_task)
+ else if (current->jobctl & JOBCTL_WILL_EXIT)
exit_code = 0;
else if (!thread_group_empty(current)) {
struct sighand_struct *const sighand = current->sighand;
@@ -915,7 +915,7 @@ do_group_exit(int exit_code)
if (sig->flags & SIGNAL_GROUP_EXIT)
/* Another thread got here before we took the lock. */
exit_code = sig->group_exit_code;
- else if (sig->group_exec_task)
+ else if (current->jobctl & JOBCTL_WILL_EXIT)
exit_code = 0;
else {
struct task_struct *t;
diff --git a/kernel/signal.c b/kernel/signal.c
index b0201e05be40..6179e34ce666 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -153,7 +153,8 @@ static inline bool has_pending_signals(sigset_t *signal, sigset_t *blocked)

static bool recalc_sigpending_tsk(struct task_struct *t)
{
- if ((t->jobctl & (JOBCTL_PENDING_MASK | JOBCTL_TRAP_FREEZE)) ||
+ if ((t->jobctl & (JOBCTL_PENDING_MASK | JOBCTL_TRAP_FREEZE |
+ JOBCTL_WILL_EXIT)) ||
PENDING(&t->pending, &t->blocked) ||
PENDING(&t->signal->shared_pending, &t->blocked) ||
cgroup_task_frozen(t)) {
@@ -911,7 +912,7 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
if (core_state) {
if (sig == SIGKILL) {
struct task_struct *dumper = core_state->dumper.task;
- sigaddset(&dumper->pending.signal, SIGKILL);
+ dumper->jobctl |= JOBCTL_WILL_EXIT;
signal_wake_up(dumper, 1);
}
}
@@ -985,7 +986,7 @@ static inline bool wants_signal(int sig, struct task_struct *p)
if (sigismember(&p->blocked, sig))
return false;

- if (p->flags & PF_EXITING)
+ if (p->jobctl & JOBCTL_WILL_EXIT)
return false;

if (sig == SIGKILL)
@@ -1363,10 +1364,9 @@ int force_sig_info(struct kernel_siginfo *info)

void schedule_task_exit_locked(struct task_struct *task)
{
- task_clear_jobctl_pending(task, JOBCTL_PENDING_MASK);
- /* Only bother with threads that might be alive */
- if (!(task->flags & PF_POSTCOREDUMP)) {
- sigaddset(&task->pending.signal, SIGKILL);
+ if (!(task->jobctl & JOBCTL_WILL_EXIT)) {
+ task_clear_jobctl_pending(task, JOBCTL_PENDING_MASK);
+ task->jobctl |= JOBCTL_WILL_EXIT;
signal_wake_up(task, 1);
}
}
@@ -2695,9 +2695,8 @@ bool get_signal(struct ksignal *ksig)
int exit_code;

/* Has this task already been marked for death? */
- if (__fatal_signal_pending(current)) {
+ if (current->jobctl & JOBCTL_WILL_EXIT) {
ksig->info.si_signo = signr = SIGKILL;
- sigdelset(&current->pending.signal, SIGKILL);
trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
&sighand->action[SIGKILL - 1]);
recalc_sigpending();
@@ -2935,7 +2934,7 @@ static void retarget_shared_pending(struct task_struct *tsk, sigset_t *which)

t = tsk;
while_each_thread(tsk, t) {
- if (t->flags & PF_EXITING)
+ if (t->jobctl & JOBCTL_WILL_EXIT)
continue;

if (!has_pending_signals(&retarget, &t->blocked))
--
2.29.2
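
A short sketch of how the new bit behaves next to the existing jobctl
bits, for anyone checking the masks. The bit positions are copied from
the jobctl.h hunk above; the model_* helpers are hypothetical, and
model_wants_signal covers only the first two checks of the real
wants_signal.

```c
#include <stdbool.h>

/* Bit positions as in the jobctl.h hunk of this patch. */
#define MODEL_JOBCTL_LISTENING		(1UL << 22)
#define MODEL_JOBCTL_TRAP_FREEZE	(1UL << 23)
#define MODEL_JOBCTL_WILL_EXIT		(1UL << 31)

/* __fatal_signal_pending after the patch: a test of one jobctl bit. */
bool model_fatal_signal_pending(unsigned long jobctl)
{
	return (jobctl & MODEL_JOBCTL_WILL_EXIT) != 0;
}

/* Set or clear the bit without disturbing the other jobctl bits. */
unsigned long model_mark_will_exit(unsigned long jobctl)
{
	return jobctl | MODEL_JOBCTL_WILL_EXIT;
}

unsigned long model_clear_will_exit(unsigned long jobctl)
{
	return jobctl & ~MODEL_JOBCTL_WILL_EXIT;
}

/* A task with an exit pending is never picked to handle a new signal. */
bool model_wants_signal(unsigned long jobctl, bool sig_blocked)
{
	if (sig_blocked)
		return false;
	if (jobctl & MODEL_JOBCTL_WILL_EXIT)
		return false;
	return true;
}
```

The clear helper mirrors what zap_threads does for the dumper task
(tsk->jobctl &= ~JOBCTL_WILL_EXIT) so that the coredump can proceed
before coredump_finish sets the bit again.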

2022-01-03 21:34:39

by Eric W. Biederman

Subject: [PATCH 17/17] signal: Always set SIGNAL_GROUP_EXIT on process exit

Track how many threads have not started exiting and when
the last thread starts exiting set SIGNAL_GROUP_EXIT.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/coredump.c | 4 ----
include/linux/sched/signal.h | 1 +
kernel/exit.c | 8 +-------
kernel/fork.c | 2 ++
kernel/signal.c | 10 +++++++---
5 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index c54b502bf648..029d0f98aa90 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -352,10 +352,6 @@ static int zap_process(struct task_struct *start, int exit_code)
struct task_struct *t;
int nr = 0;

- start->signal->flags = SIGNAL_GROUP_EXIT;
- start->signal->group_exit_code = exit_code;
- start->signal->group_stop_count = 0;
-
for_each_thread(start, t) {
schedule_task_exit_locked(t, exit_code);
if (t != current && !(t->flags & PF_POSTCOREDUMP))
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index e8034ecaee84..bd9435e934a1 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -94,6 +94,7 @@ struct signal_struct {
refcount_t sigcnt;
atomic_t live;
int nr_threads;
+ int quick_threads;
struct list_head thread_head;

wait_queue_head_t wait_chldexit; /* for wait4() */
diff --git a/kernel/exit.c b/kernel/exit.c
index e95500e2d27c..be867a12de65 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -924,14 +924,8 @@ do_group_exit(int exit_code)
else {
struct task_struct *t;

- sig->group_exit_code = exit_code;
- sig->flags = SIGNAL_GROUP_EXIT;
- sig->group_stop_count = 0;
- __for_each_thread(sig, t) {
- if (t == current)
- continue;
+ __for_each_thread(sig, t)
schedule_task_exit_locked(t, exit_code);
- }
}
spin_unlock_irq(&sighand->siglock);
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 6f0293cb29c9..d973189a4014 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1644,6 +1644,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
return -ENOMEM;

sig->nr_threads = 1;
+ sig->quick_threads = 1;
atomic_set(&sig->live, 1);
refcount_set(&sig->sigcnt, 1);

@@ -2383,6 +2384,7 @@ static __latent_entropy struct task_struct *copy_process(
__this_cpu_inc(process_counts);
} else {
current->signal->nr_threads++;
+ current->signal->quick_threads++;
atomic_inc(&current->signal->live);
refcount_inc(&current->signal->sigcnt);
task_join_group_stop(p);
diff --git a/kernel/signal.c b/kernel/signal.c
index e8fac8a3c935..9bd835fcb1dc 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1052,9 +1052,6 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
* running and doing things after a slower
* thread has the fatal signal pending.
*/
- signal->flags = SIGNAL_GROUP_EXIT;
- signal->group_exit_code = sig;
- signal->group_stop_count = 0;
t = p;
do {
schedule_task_exit_locked(t, sig);
@@ -1365,10 +1362,17 @@ int force_sig_info(struct kernel_siginfo *info)
void schedule_task_exit_locked(struct task_struct *task, int exit_code)
{
if (!(task->jobctl & JOBCTL_WILL_EXIT)) {
+ struct signal_struct *signal = task->signal;
task_clear_jobctl_pending(task, JOBCTL_PENDING_MASK);
task->jobctl |= JOBCTL_WILL_EXIT;
task->exit_code = exit_code;
signal_wake_up(task, 1);
+ signal->quick_threads--;
+ if (signal->quick_threads == 0) {
+ signal->flags = SIGNAL_GROUP_EXIT;
+ signal->group_exit_code = exit_code;
+ signal->group_stop_count = 0;
+ }
}
}

--
2.29.2
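
The counting done in the schedule_task_exit_locked hunk above can be modeled in userspace. This is only a sketch under assumptions: the model_* names and the flag value are made up for illustration, nothing here is kernel code, and locking and group_stop_count are left out. It shows that a task is counted once, and that the group exit state is set only when the last unmarked thread is marked.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; the flag value is arbitrary, not the kernel's. */
#define MODEL_SIGNAL_GROUP_EXIT 0x4u

struct model_signal {
	unsigned int flags;
	int group_exit_code;
	int quick_threads;	/* threads not yet marked to exit */
};

struct model_task {
	struct model_signal *signal;
	bool will_exit;		/* stands in for JOBCTL_WILL_EXIT */
	int exit_code;
};

/* Follows the shape of schedule_task_exit_locked() in the hunk above. */
void model_schedule_task_exit(struct model_task *task, int exit_code)
{
	struct model_signal *sig = task->signal;

	if (task->will_exit)
		return;		/* already marked: never counted twice */

	task->will_exit = true;
	task->exit_code = exit_code;

	if (--sig->quick_threads == 0) {
		/* Last unmarked thread: the whole group is now exiting. */
		sig->flags = MODEL_SIGNAL_GROUP_EXIT;
		sig->group_exit_code = exit_code;
	}
}

/* Marks every task in the array; reports whether the group flag got set. */
bool model_group_exit(struct model_task *tasks, size_t n, int exit_code)
{
	for (size_t i = 0; i < n; i++)
		model_schedule_task_exit(&tasks[i], exit_code);
	return n > 0 && (tasks[0].signal->flags & MODEL_SIGNAL_GROUP_EXIT) != 0;
}
```

In this model, marking one thread out of three leaves the flags untouched, and marking the same thread again does not change the count.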

2022-01-04 06:30:51

by Dmitry Osipenko

[permalink] [raw]

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

14.12.2021 01:53, Eric W. Biederman пишет:
> Simplify the code that allows SIGKILL during coredumps to terminate
> the coredump. As far as I can tell I have avoided breaking it
> by dumb luck.
>
> Historically with all of the other threads stopping in exit_mm the
> wants_signal loop in complete_signal would find the dumper task and
> then complete_signal would wake the dumper task with signal_wake_up.
>
> After moving the coredump_task_exit above the setting of PF_EXITING in
> commit 92307383082d ("coredump: Don't perform any cleanups before
> dumping core") wants_signal will consider all of the threads in a
> multi-threaded process for waking up, not just the core dumping task.
>
> Luckily complete_signal short-circuits SIGKILL during a coredump and
> marks every thread with SIGKILL and signal_wake_up. This code is
> arguably buggy, however, as it tries to skip creating a group exit when
> one is already present, and it fails to notice that a coredump is in
> progress.
>
> Ever since commit 06af8679449d ("coredump: Limit what can interrupt
> coredumps") was added, dump_interrupted needs not just TIF_SIGPENDING
> set on the dumper task but also SIGKILL set in its pending bitmap.
> This means that if the code is ever fixed not to short-circuit and
> kill a process after it has already been killed the special case
> for SIGKILL during a coredump will be broken.
>
> Sort all of this out by making the coredump special case more special,
> and perform all of the work in prepare_signal and leave the rest of
> the signal delivery path out of it.
>
> In prepare_signal when the process coredumping is sent SIGKILL find
> the task performing the coredump and use sigaddset and signal_wake_up
> to ensure that task reports fatal_signal_pending.
>
> Return false from prepare_signal to tell the rest of the signal
> delivery path to ignore the signal.
>
> Update wait_for_dump_helpers to perform a wait_event_killable wait
> so that if signal_pending gets set spuriously the wait will not
> be interrupted unless fatal_signal_pending is true.
>
> I have tested this and verified I did not break SIGKILL during
> coredumps by accident (before or after this change). I actually
> thought I had and I had to figure out what I had misread that kept
> SIGKILL during coredumps working.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/coredump.c | 4 ++--
> kernel/signal.c | 11 +++++++++--
> 2 files changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index a6b3c196cdef..7b91fb32dbb8 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -448,7 +448,7 @@ static void coredump_finish(bool core_dumped)
> static bool dump_interrupted(void)
> {
> /*
> - * SIGKILL or freezing() interrupt the coredumping. Perhaps we
> + * SIGKILL or freezing() interrupted the coredumping. Perhaps we
> * can do try_to_freeze() and check __fatal_signal_pending(),
> * but then we need to teach dump_write() to restart and clear
> * TIF_SIGPENDING.
> @@ -471,7 +471,7 @@ static void wait_for_dump_helpers(struct file *file)
> * We actually want wait_event_freezable() but then we need
> * to clear TIF_SIGPENDING and improve dump_interrupted().
> */
> - wait_event_interruptible(pipe->rd_wait, pipe->readers == 1);
> + wait_event_killable(pipe->rd_wait, pipe->readers == 1);
>
> pipe_lock(pipe);
> pipe->readers--;
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 8272cac5f429..7e305a8ec7c2 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -907,8 +907,15 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
> sigset_t flush;
>
> if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
> - if (!(signal->flags & SIGNAL_GROUP_EXIT))
> - return sig == SIGKILL;
> + struct core_state *core_state = signal->core_state;
> + if (core_state) {
> + if (sig == SIGKILL) {
> + struct task_struct *dumper = core_state->dumper.task;
> + sigaddset(&dumper->pending.signal, SIGKILL);
> + signal_wake_up(dumper, 1);
> + }
> + return false;
> + }
> /*
> * The process is in the middle of dying, nothing to do.
> */
>

Hi,

This patch breaks userspace; in particular it breaks gst-plugin-scanner
of GStreamer, which now hangs on next-20211224. IIUC, this tool builds a
registry of good/working GStreamer plugins by loading them and
blacklisting those that don't work (crash). Before the hang I see the
systemd-coredump process running and taking a snapshot of
gst-plugin-scanner, and then gst-plugin-scanner gets stuck.

Bisection points at this patch; reverting it restores
gst-plugin-scanner. Systemd-coredump still runs, but there is no hang
anymore and everything works properly as before.

I'm seeing this problem on ARM32 and haven't checked other arches.
Please fix, thanks in advance.

2022-01-04 07:38:08

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH 01/17] exit: Remove profile_task_exit & profile_munmap

Looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2022-01-04 16:18:49

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Dmitry Osipenko <[email protected]> writes:

> 14.12.2021 01:53, Eric W. Biederman пишет:
>> Simplify the code that allows SIGKILL during coredumps to terminate
>> the coredump. As far as I can tell I have avoided breaking it
>> by dumb luck.
>>
>> Historically with all of the other threads stopping in exit_mm the
>> wants_signal loop in complete_signal would find the dumper task and
>> then complete_signal would wake the dumper task with signal_wake_up.
>>
>> After moving the coredump_task_exit above the setting of PF_EXITING in
>> commit 92307383082d ("coredump: Don't perform any cleanups before
>> dumping core") wants_signal will consider all of the threads in a
>> multi-threaded process for waking up, not just the core dumping task.
>>
>> Luckily complete_signal short-circuits SIGKILL during a coredump and
>> marks every thread with SIGKILL and signal_wake_up. This code is
>> arguably buggy, however, as it tries to skip creating a group exit when
>> one is already present, and it fails to notice that a coredump is in
>> progress.
>>
>> Ever since commit 06af8679449d ("coredump: Limit what can interrupt
>> coredumps") was added, dump_interrupted needs not just TIF_SIGPENDING
>> set on the dumper task but also SIGKILL set in its pending bitmap.
>> This means that if the code is ever fixed not to short-circuit and
>> kill a process after it has already been killed the special case
>> for SIGKILL during a coredump will be broken.
>>
>> Sort all of this out by making the coredump special case more special,
>> and perform all of the work in prepare_signal and leave the rest of
>> the signal delivery path out of it.
>>
>> In prepare_signal when the process coredumping is sent SIGKILL find
>> the task performing the coredump and use sigaddset and signal_wake_up
>> to ensure that task reports fatal_signal_pending.
>>
>> Return false from prepare_signal to tell the rest of the signal
>> delivery path to ignore the signal.
>>
>> Update wait_for_dump_helpers to perform a wait_event_killable wait
>> so that if signal_pending gets set spuriously the wait will not
>> be interrupted unless fatal_signal_pending is true.
>>
>> I have tested this and verified I did not break SIGKILL during
>> coredumps by accident (before or after this change). I actually
>> thought I had and I had to figure out what I had misread that kept
>> SIGKILL during coredumps working.
>>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> fs/coredump.c | 4 ++--
>> kernel/signal.c | 11 +++++++++--
>> 2 files changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/coredump.c b/fs/coredump.c
>> index a6b3c196cdef..7b91fb32dbb8 100644
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -448,7 +448,7 @@ static void coredump_finish(bool core_dumped)
>> static bool dump_interrupted(void)
>> {
>> /*
>> - * SIGKILL or freezing() interrupt the coredumping. Perhaps we
>> + * SIGKILL or freezing() interrupted the coredumping. Perhaps we
>> * can do try_to_freeze() and check __fatal_signal_pending(),
>> * but then we need to teach dump_write() to restart and clear
>> * TIF_SIGPENDING.
>> @@ -471,7 +471,7 @@ static void wait_for_dump_helpers(struct file *file)
>> * We actually want wait_event_freezable() but then we need
>> * to clear TIF_SIGPENDING and improve dump_interrupted().
>> */
>> - wait_event_interruptible(pipe->rd_wait, pipe->readers == 1);
>> + wait_event_killable(pipe->rd_wait, pipe->readers == 1);
>>
>> pipe_lock(pipe);
>> pipe->readers--;
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 8272cac5f429..7e305a8ec7c2 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -907,8 +907,15 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
>> sigset_t flush;
>>
>> if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
>> - if (!(signal->flags & SIGNAL_GROUP_EXIT))
>> - return sig == SIGKILL;
>> + struct core_state *core_state = signal->core_state;
>> + if (core_state) {
>> + if (sig == SIGKILL) {
>> + struct task_struct *dumper = core_state->dumper.task;
>> + sigaddset(&dumper->pending.signal, SIGKILL);
>> + signal_wake_up(dumper, 1);
>> + }
>> + return false;
>> + }
>> /*
>> * The process is in the middle of dying, nothing to do.
>> */
>>
>
> Hi,
>
> This patch breaks userspace, in particular it breaks gst-plugin-scanner
> of GStreamer which hangs now on next-20211224. IIUC, this tool builds a
> registry of good/working GStreamer plugins by loading them and
> blacklisting those that don't work (crash). Before the hang I see
> systemd-coredump process running, taking snapshot of gst-plugin-scanner
> and then gst-plugin-scanner gets stuck.
>
> Bisection points at this patch, reverting it restores
> gst-plugin-scanner. Systemd-coredump still running, but there is no hang
> anymore and everything works properly as before.
>
> I'm seeing this problem on ARM32 and haven't checked other arches.
> Please fix, thanks in advance.

That is weird.

Doubly weird really because this should only change the case where
coredumps are interrupted by SIGKILL.

What distro are you running? I would like to match your setup as closely
as I can, so that I can reproduce the issue, figure out what is wrong,
and fix it.

Eric

2022-01-04 18:45:21

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

On Mon, Dec 13, 2021 at 2:54 PM Eric W. Biederman <[email protected]> wrote:
>
>
> if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
> - if (!(signal->flags & SIGNAL_GROUP_EXIT))
> - return sig == SIGKILL;
> + struct core_state *core_state = signal->core_state;
> + if (core_state) {

This change is very confusing.

Also, why does it do that 'signal->core_state->dumper.task', when we
already know that it's the same as 'signal->group_exit_task'?

The only thing that sets 'signal->core_state' also sets
'signal->group_exit_task', and the call chain has set both to the same
task.

So the code is odd and makes little sense.

But what's even more odd is how it

(a) sends the SIGKILL to somebody else

(b) does *NOT* send SIGKILL to itself

Now, (a) is explained in the commit message. The intent is to signal
the core dumper.

But (b) looks like a fundamental change in semantics. The target of
the SIGKILL is still running, might be in some loop in the kernel that
wants to be interrupted by a fatal signal, and you expressly disabled
the code that would send that fatal signal.

If I send SIGKILL to thread A, then that SIGKILL had *better* be
delivered. To thread A, which may be in a "mutex_lock_killable()" or
whatever else.

The fact that thread B may be in the process of trying to dump core
doesn't change that at all, as far as I can see.

So I think this patch is fundamentally buggy and wrong. Or at least
needs much more explanation of why you'd not send SIGKILL to the
target thread.

Linus

2022-01-04 19:47:21

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Linus Torvalds <[email protected]> writes:

> On Mon, Dec 13, 2021 at 2:54 PM Eric W. Biederman <[email protected]> wrote:
>>
>>
>> if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
>> - if (!(signal->flags & SIGNAL_GROUP_EXIT))
>> - return sig == SIGKILL;
>> + struct core_state *core_state = signal->core_state;
>> + if (core_state) {
>
> This change is very confusing.
>
> Also, why does it do that 'signal->core_state->dumper.task', when we
> already know that it's the same as 'signal->group_exit_task'?
>
> The only thing that sets 'signal->core_state' also sets
> 'signal->group_exit_task', and the call chain has set both to the same
> task.
>
> So the code is odd and makes little sense.

As you say signal->group_exit_task, and core_state->dumper.task point to
the same task. So it may be a little silly when viewed independently of
everything else to use core_state->dumper.task instead of
group_exit_task as it is an extra cache line dereference.

The thing is signal->group_exit_task is only used by coredumps currently
as a flag to tell signal_group_exit to return true. It is exec that
actually uses signal->group_exit_task in conjunction with
signal->notify_count to wake itself up.

Using a pointer as a flag and not for its value, and having different
semantics depending on who sets the pointer: all of that is weird enough
that I just want to make signal->group_exit_task go away.

By using core_state->dumper.task I was able to make
signal->group_exit_task exclusive to the exec case in the following
changes, and to rename it signal->group_exec_task so no one gets
confused what the field is for.

> But what's even more odd is how it
>
> (a) sends the SIGKILL to somebody else
>
> (b) does *NOT* send SIGKILL to itself
>
> Now, (a) is explained in the commit message. The intent is to signal
> the core dumper.

Which is a specific thread of the target process, and it is
the only thread of the target process still running.

> But (b) looks like a fundamental change in semantics. The target of
> the SIGKILL is still running, might be in some loop in the kernel that
> wants to be interrupted by a fatal signal, and you expressly disabled
> the code that would send that fatal signal.
>
> If I send SIGKILL to thread A, then that SIGKILL had *better* be
> delivered. To thread A, which may be in a "mutex_lock_killable()" or
> whatever else.
>
> The fact that thread B may be in the process of trying to dump core
> doesn't change that at all, as far as I can see.
>
> So I think this patch is fundamentally buggy and wrong. Or at least
> needs much more explanation of why you'd not send SIGKILL to the
> target thread.

If you look at zap_threads, you can observe that it takes the siglock,
sets SIGNAL_GROUP_COREDUMP, and sets signal->core_state, and that
zap_process makes SIGKILL pending in the per-task sigset and calls
signal_wake_up on every task.

This case in prepare_signal happens after that. After every task
has been told to die, and __fatal_signal_pending is true for all of
them if they have not reached do_exit yet.

If you look in zap_threads you will see that the core dumping thread
clears TIF_SIGPENDING, and in general makes fatal_signal_pending false
for itself. But keep in mind that this thread because it is dumping
core is already on the path to do_exit. It has already processed a
fatal signal.

So in the special case I only worry about the dumping task as it is the
only task after zap_threads that does not have fatal_signal_pending.

This is different than the ordinary case of delivering SIGKILL
where complete_signal makes SIGKILL pending in the per-task sigset
of every task in the process.
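
The state described above can be written down as a small userspace model. This is a sketch under stated assumptions, not kernel code: the cd_* names are hypothetical, the two booleans stand in for TIF_SIGPENDING and for SIGKILL in the per-task pending set, and locking is ignored. It shows that after the coredump starts only the dumper lacks fatal_signal_pending, so the special case only has to mark the dumper.

```c
#include <stdbool.h>
#include <stddef.h>

struct cd_task {
	bool sigpending;	/* stands in for TIF_SIGPENDING */
	bool sigkill_queued;	/* stands in for SIGKILL in pending.signal */
};

struct cd_process {
	struct cd_task *threads;
	size_t nr_threads;
	struct cd_task *dumper;	/* NULL when no coredump is in progress */
};

bool cd_fatal_signal_pending(const struct cd_task *t)
{
	return t->sigpending && t->sigkill_queued;
}

/* Simplified model of zap_threads()/zap_process() as described above. */
void cd_start_coredump(struct cd_process *p, struct cd_task *dumper)
{
	p->dumper = dumper;
	for (size_t i = 0; i < p->nr_threads; i++) {
		struct cd_task *t = &p->threads[i];

		if (t == dumper)
			continue;
		t->sigkill_queued = true;	/* sigaddset(..., SIGKILL) */
		t->sigpending = true;		/* signal_wake_up(t, 1) */
	}
	/* The dumper already processed its fatal signal. */
	dumper->sigpending = false;
}

/*
 * Model of the special case in prepare_signal for SIGKILL. The return
 * value says whether the rest of the delivery path should run.
 */
bool cd_prepare_sigkill(struct cd_process *p)
{
	if (p->dumper) {
		p->dumper->sigkill_queued = true;
		p->dumper->sigpending = true;
		return false;
	}
	return true;
}
```

In the model, every thread reports cd_fatal_signal_pending once cd_prepare_sigkill has run during a coredump, which is the property the special case relies on.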

Currently I suspect changing wait_event_interruptible to
wait_event_killable is causing problems.
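
The difference between the two waits can be sketched as predicates on a modeled task. This is a userspace illustration with hypothetical w_* names, assuming only that an interruptible wait is broken by signal_pending while a killable wait is broken only by fatal_signal_pending; it does not sleep or schedule anything.

```c
#include <stdbool.h>

struct w_task {
	bool sigpending;	/* stands in for TIF_SIGPENDING */
	bool sigkill_queued;	/* stands in for SIGKILL in pending.signal */
};

enum w_mode { W_INTERRUPTIBLE, W_KILLABLE };

bool w_signal_pending(const struct w_task *t)
{
	return t->sigpending;
}

bool w_fatal_signal_pending(const struct w_task *t)
{
	return t->sigpending && t->sigkill_queued;
}

/* Would a sleeping wait of the given kind be broken for this task? */
bool w_wait_interrupted(const struct w_task *t, enum w_mode mode)
{
	if (mode == W_INTERRUPTIBLE)
		return w_signal_pending(t);
	return w_fatal_signal_pending(t);
}
```

A task with only the pending flag set breaks the interruptible wait but not the killable one, so a dumper that never gets SIGKILL queued would keep waiting in the killable variant.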

Or perhaps there is some reason tasks that have already entered do_exit
need to have fatal_signal_pending set. (They will have
fatal_signal_pending set up until they enter get_signal which calls
do_group_exit which calls do_exit).

Which is why I am trying to reproduce the reported failure so I can get
the kernel to tell me what is going on. If this is not resolved quickly
I won't send you this change, and I will pull it out of linux-next.

Eric

2022-01-05 04:25:37

[permalink] [raw]

Subject: Re: [PATCH 01/10] exit/s390: Remove dead reference to do_exit from copy_thread

On Sun, Dec 12, 2021 at 06:48:56PM +0100, Heiko Carstens wrote:
> On Wed, Dec 08, 2021 at 02:25:23PM -0600, Eric W. Biederman wrote:
> > My s390 assembly is not particularly good so I have read the history
> > of the reference to do_exit copy_thread and have been able to
> > verify that do_exit is not used.
> >
> > The general argument is that s390 has been changed to use the generic
> > kernel_thread and kernel_execve and the generic versions do not call
> > do_exit. So it is strange to see a do_exit reference sitting there.
> >
> > The history of the do_exit reference in s390's version of copy_thread
> > seems conclusive that the do_exit reference is something that lingers
> > and should have been removed several years ago.
> ...
> > Remove this dead reference to do_exit to make it clear that s390 is
> > not doing anything with do_exit in copy_thread.
> >
> > Signed-off-by: "Eric W. Biederman" <[email protected]>
> > ---
> > arch/s390/kernel/process.c | 1 -
> > 1 file changed, 1 deletion(-)
>
> Applied to s390 tree. Just in case you want to apply this to your tree too:
> Acked-by: Heiko Carstens <[email protected]>

FWIW, this
frame->childregs.psw.addr =
(unsigned long)__ret_from_fork;
is also pointless. We do want psw.mask (if nothing else, __ret_from_fork()
that is called by ret_from_fork() will, in effect, check user_mode(task_pt_regs())).
But psw.addr is, AFAICS, pointless - the only way the callback is allowed to
return is after successful kernel_execve(), which would set psw.addr; moreover,
psw.addr is meaningless until that happens.

2022-01-05 05:01:48

[permalink] [raw]

Subject: Re: [PATCH 02/10] exit: Add and use make_task_dead.

On Wed, Dec 08, 2021 at 02:25:24PM -0600, Eric W. Biederman wrote:
> There are two big uses of do_exit. The first is its designed use to be
> the guts of the exit(2) system call. The second use is to terminate
> a task after something catastrophic has happened like a NULL pointer
> in kernel code.
>
> Add a function make_task_dead that is initially exactly the same as
> do_exit to cover the cases where do_exit is called to handle
> catastrophic failure. In time this can probably be reduced to just a
> light wrapper around do_task_dead. For now keep it exactly the same so
> that there will be no behavioral differences introducing this new
> concept.
>
> Replace all of the uses of do_exit that use it for catastrophic
> task cleanup with make_task_dead to make it clear what the code
> is doing.
>
> As part of this rename rewind_stack_do_exit to
> rewind_stack_and_make_dead.

Umm... What about .Linvalid_mask: in arch/xtensa/kernel/entry.S?
That's an obvious case for your make_task_dead().

2022-01-05 05:48:12

[permalink] [raw]

Subject: Re: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead

On Wed, Dec 08, 2021 at 02:25:25PM -0600, Eric W. Biederman wrote:
> The beginning of do_exit has become cluttered and difficult to read as
> it is filled with checks to handle things that can only happen when
> the kernel is operating improperly.
>
> Now that we have a dedicated function for cleaning up a task when the
> kernel is operating improperly move the checks there.

Umm... I would probably take profile_task_exit() crap out before that
point.
1) the damn thing is dead - nothing registers notifiers there
2) blocking_notifier_call_chain() is not a nice thing to do on oops...

I'll post a patch ripping the dead parts of kernel/profile.c out tomorrow
morning (there's also profile_handoff_task(), equally useless these days
and complicating things for __put_task_struct()).

> - /*
> - * If do_exit is called because this processes oopsed, it's possible
> - * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
> - * continuing. Amongst other possible reasons, this is to prevent
> - * mm_release()->clear_child_tid() from writing to a user-controlled
> - * kernel address.
> - */
> - force_uaccess_begin();

Are you sure about that one? It shouldn't matter, but... it's a potential
change for do_exit() from a kernel thread. As it is, we have that
force_uaccess_begin() for exiting threads and for kernel ones it's not
a no-op. I'm not concerned about attempted userland access after that
point for those, obviously, but I'm not sure you won't step into something
subtle here.

I would prefer to split that particular change off into a separate commit...

2022-01-05 05:58:42

[permalink] [raw]

Subject: Re: [PATCH 04/10] exit: Stop poorly open coding do_task_dead in make_task_dead

On Wed, Dec 08, 2021 at 02:25:26PM -0600, Eric W. Biederman wrote:
> When the kernel detects it is oopsing or otherwise force killing a task
> while it exits, the code poorly attempts to permanently stop the task
> from scheduling.
>
> I say poorly because it is possible for a task in TASK_UNINTERRUPTIBLE
> to be woken up.
>
> As it makes no sense for the task to continue, call do_task_dead
> instead, which actually does the work and permanently removes the task
> from the scheduler, guaranteeing the task will never be woken
> up again.

NAK. This is not all do_task_dead() leads to - see what finish_task_switch()
does upon seeing TASK_DEAD:
/* Task is done with its stack. */
put_task_stack(prev);
put_task_struct_rcu_user(prev);

Now take a look at the comment just before that check for PF_EXITING -
the point is to leave the task leaked, rather than proceeding with
freeing the sucker.

We are not going through the normal "turn zombie" motions, including
waking wait(2) callers up, etc. Going ahead and freeing it could
fuck the things up quite badly.
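
The distinction can be put into a toy model. This is a sketch with hypothetical m_* names, assuming only what is quoted above: the TASK_DEAD check in finish_task_switch() releases the stack and drops a reference, and a task parked in TASK_UNINTERRUPTIBLE is left alone. The single refs counter is a simplification of the real reference counting.

```c
#include <stdbool.h>

enum m_state { M_RUNNING, M_UNINTERRUPTIBLE, M_DEAD };

struct m_task {
	enum m_state state;
	int refs;		/* simplified stand-in for task references */
	bool stack_released;
};

/* Model of the TASK_DEAD handling in finish_task_switch() quoted above. */
void m_finish_task_switch(struct m_task *prev)
{
	if (prev->state == M_DEAD) {
		prev->stack_released = true;	/* put_task_stack(prev) */
		prev->refs--;			/* put_task_struct_rcu_user(prev) */
	}
}

/* Old code: set_current_state(TASK_UNINTERRUPTIBLE); schedule(); */
void m_park(struct m_task *t)
{
	t->state = M_UNINTERRUPTIBLE;
	m_finish_task_switch(t);
}

/* Proposed code: do_task_dead() */
void m_task_dead(struct m_task *t)
{
	t->state = M_DEAD;
	m_finish_task_switch(t);
}
```

In the model, m_park leaves the task intact (leaked but untouched), while m_task_dead starts tearing it down even though the normal exit steps never ran.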

> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> kernel/exit.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index d0ec6f6b41cb..f975cd8a2ed8 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -886,8 +886,7 @@ void __noreturn make_task_dead(int signr)
> if (unlikely(tsk->flags & PF_EXITING)) {
> pr_alert("Fixing recursive fault but reboot is needed!\n");
> futex_exit_recursive(tsk);
> - set_current_state(TASK_UNINTERRUPTIBLE);
> - schedule();
> + do_task_dead();
> }
>
> do_exit(signr);
> --
> 2.29.2
>

2022-01-05 06:02:29

[permalink] [raw]

Subject: Re: [PATCH 05/10] exit: Stop exporting do_exit

On Wed, Dec 08, 2021 at 02:25:27PM -0600, Eric W. Biederman wrote:
> Now that there are no more modular uses of do_exit remove the EXPORT_SYMBOL.
>
> Suggested-by: Christoph Hellwig <[email protected]>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> kernel/exit.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index f975cd8a2ed8..57afac845a0a 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -843,7 +843,6 @@ void __noreturn do_exit(long code)
> lockdep_free_task(tsk);
> do_task_dead();
> }
> -EXPORT_SYMBOL_GPL(do_exit);

"Now" in the commit message is misleading, AFAICS - there are no such users
in the mainline right now (and yes, that one could be moved all the way
up).

2022-01-05 20:02:38

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Dmitry Osipenko <[email protected]> writes:

> 14.12.2021 01:53, Eric W. Biederman пишет:
>> Simplify the code that allows SIGKILL during coredumps to terminate
>> the coredump. As far as I can tell I have avoided breaking it
>> by dumb luck.
>>
>> Historically with all of the other threads stopping in exit_mm the
>> wants_signal loop in complete_signal would find the dumper task and
>> then complete_signal would wake the dumper task with signal_wake_up.
>>
>> After moving the coredump_task_exit above the setting of PF_EXITING in
>> commit 92307383082d ("coredump: Don't perform any cleanups before
>> dumping core") wants_signal will consider all of the threads in a
>> multi-threaded process for waking up, not just the core dumping task.
>>
>> Luckily complete_signal short-circuits SIGKILL during a coredump and
>> marks every thread with SIGKILL and signal_wake_up. This code is
>> arguably buggy, however, as it tries to skip creating a group exit when
>> one is already present, and it fails to notice that a coredump is in
>> progress.
>>
>> Ever since commit 06af8679449d ("coredump: Limit what can interrupt
>> coredumps") was added, dump_interrupted needs not just TIF_SIGPENDING
>> set on the dumper task but also SIGKILL set in its pending bitmap.
>> This means that if the code is ever fixed not to short-circuit and
>> kill a process after it has already been killed the special case
>> for SIGKILL during a coredump will be broken.
>>
>> Sort all of this out by making the coredump special case more special,
>> and perform all of the work in prepare_signal and leave the rest of
>> the signal delivery path out of it.
>>
>> In prepare_signal when the process coredumping is sent SIGKILL find
>> the task performing the coredump and use sigaddset and signal_wake_up
>> to ensure that task reports fatal_signal_pending.
>>
>> Return false from prepare_signal to tell the rest of the signal
>> delivery path to ignore the signal.
>>
>> Update wait_for_dump_helpers to perform a wait_event_killable wait
>> so that if signal_pending gets set spuriously the wait will not
>> be interrupted unless fatal_signal_pending is true.
>>
>> I have tested this and verified I did not break SIGKILL during
>> coredumps by accident (before or after this change). I actually
>> thought I had and I had to figure out what I had misread that kept
>> SIGKILL during coredumps working.
>>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> fs/coredump.c | 4 ++--
>> kernel/signal.c | 11 +++++++++--
>> 2 files changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/coredump.c b/fs/coredump.c
>> index a6b3c196cdef..7b91fb32dbb8 100644
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -448,7 +448,7 @@ static void coredump_finish(bool core_dumped)
>> static bool dump_interrupted(void)
>> {
>> /*
>> - * SIGKILL or freezing() interrupt the coredumping. Perhaps we
>> + * SIGKILL or freezing() interrupted the coredumping. Perhaps we
>> * can do try_to_freeze() and check __fatal_signal_pending(),
>> * but then we need to teach dump_write() to restart and clear
>> * TIF_SIGPENDING.
>> @@ -471,7 +471,7 @@ static void wait_for_dump_helpers(struct file *file)
>> * We actually want wait_event_freezable() but then we need
>> * to clear TIF_SIGPENDING and improve dump_interrupted().
>> */
>> - wait_event_interruptible(pipe->rd_wait, pipe->readers == 1);
>> + wait_event_killable(pipe->rd_wait, pipe->readers == 1);
>>
>> pipe_lock(pipe);
>> pipe->readers--;
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 8272cac5f429..7e305a8ec7c2 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -907,8 +907,15 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
>> sigset_t flush;
>>
>> if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
>> - if (!(signal->flags & SIGNAL_GROUP_EXIT))
>> - return sig == SIGKILL;
>> + struct core_state *core_state = signal->core_state;
>> + if (core_state) {
>> + if (sig == SIGKILL) {
>> + struct task_struct *dumper = core_state->dumper.task;
>> + sigaddset(&dumper->pending.signal, SIGKILL);
>> + signal_wake_up(dumper, 1);
>> + }
>> + return false;
>> + }
>> /*
>> * The process is in the middle of dying, nothing to do.
>> */
>>
>
> Hi,
>
> This patch breaks userspace, in particular it breaks gst-plugin-scanner
> of GStreamer which hangs now on next-20211224. IIUC, this tool builds a
> registry of good/working GStreamer plugins by loading them and
> blacklisting those that don't work (crash). Before the hang I see
> systemd-coredump process running, taking snapshot of gst-plugin-scanner
> and then gst-plugin-scanner gets stuck.
>
> Bisection points at this patch, reverting it restores
> gst-plugin-scanner. Systemd-coredump still running, but there is no hang
> anymore and everything works properly as before.
>
> I'm seeing this problem on ARM32 and haven't checked other arches.
> Please fix, thanks in advance.

I have not yet been able to figure out how to run gst-plugin-scanner in
a way that triggers this. In truth I can't figure out how to
run gst-plugin-scanner in a useful way.

I am going to set up some unit tests and see if I can reproduce your
hang another way, but if you could give me some more information on what
you are doing to trigger this I would appreciate it.

Eric

2022-01-05 20:46:28

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [PATCH 02/10] exit: Add and use make_task_dead.

Al Viro <[email protected]> writes:

> On Wed, Dec 08, 2021 at 02:25:24PM -0600, Eric W. Biederman wrote:
>> There are two big uses of do_exit. The first is its designed use to be
>> the guts of the exit(2) system call. The second use is to terminate
>> a task after something catastrophic has happened like a NULL pointer
>> in kernel code.
>>
>> Add a function make_task_dead that is initially exactly the same as
>> do_exit to cover the cases where do_exit is called to handle
>> catastrophic failure. In time this can probably be reduced to just a
>> light wrapper around do_task_dead. For now keep it exactly the same so
>> that there will be no behavioral differences introducing this new
>> concept.
>>
>> Replace all of the uses of do_exit that use it for catastrophic
>> task cleanup with make_task_dead to make it clear what the code
>> is doing.
>>
>> As part of this rename rewind_stack_do_exit to
>> rewind_stack_and_make_dead.
>
> Umm... What about .Linvalid_mask: in arch/xtensa/kernel/entry.S?
> That's an obvious case for your make_task_dead().

Good catch.

Being in assembly it did not have anything after the name do_exit so it
hid from my regex "[^A-Za-z0-9_]do_exit[^A-Za-z0-9]". Thank you for
finding that.
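
The miss can be reproduced with POSIX regcomp/regexec. This is a small sketch: line_matches and PAT_ANCHORED are hypothetical names, the anchored pattern is just one possible alternative, and the sample lines in the note below are made up rather than copied from entry.S. It shows that the quoted pattern needs a character after the name, so a line ending in do_exit is not matched.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if line matches the extended regex, 0 if not, -1 on a bad pattern. */
int line_matches(const char *pattern, const char *line)
{
	regex_t re;
	int rc;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, line, 0, NULL, 0);
	regfree(&re);
	return rc == 0 ? 1 : 0;
}

/* The pattern quoted above: requires one character on each side of the name. */
const char *const PAT_QUOTED = "[^A-Za-z0-9_]do_exit[^A-Za-z0-9]";

/* Hypothetical alternative that also accepts start and end of line. */
const char *const PAT_ANCHORED = "(^|[^A-Za-z0-9_])do_exit([^A-Za-z0-9_]|$)";
```

With these, a C-style line such as "\tdo_exit(SIGSEGV);" matches both patterns, while an assembly-style line ending in the bare symbol, such as "\tcall\tdo_exit", matches only the anchored one.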

Skimming the surrounding code it looks like Linvalid_mask can only be
reached by buggy hardware or buggy kernel code. If userspace could
trigger the condition it would be a candidate for force_exit_sig.

I am a bit puzzled why die is not called, instead of die being
handrolled there.

xtensa folks any thoughts?

If not I will queue up a minimal patch to replace do_exit with
make_task_dead.

Eric

2022-01-05 21:39:16

by Dmitry Osipenko

[permalink] [raw]

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

05.01.2022 22:58, Eric W. Biederman пишет:
> Dmitry Osipenko <[email protected]> writes:
>
>> 14.12.2021 01:53, Eric W. Biederman пишет:
>>> Simplify the code that allows SIGKILL during coredumps to terminate
>>> the coredump. As far as I can tell I have avoided breaking it
>>> by dumb luck.
>>>
>>> Historically with all of the other threads stopping in exit_mm the
>>> wants_signal loop in complete_signal would find the dumper task and
>>> then complete_signal would wake the dumper task with signal_wake_up.
>>>
>>> After moving the coredump_task_exit above the setting of PF_EXITING in
>>> commit 92307383082d ("coredump: Don't perform any cleanups before
>>> dumping core") wants_signal will consider all of the threads in a
>>> multi-threaded process for waking up, not just the core dumping task.
>>>
>>> Luckily complete_signal short-circuits SIGKILL during a coredump and marks
>>> every thread with SIGKILL and signal_wake_up. This code is arguably
>>> buggy, however, as it tries to skip creating a group exit when one is
>>> already present, and it fails to notice that a coredump is in progress.
>>>
>>> Ever since commit 06af8679449d ("coredump: Limit what can interrupt
>>> coredumps") was added dump_interrupted needs not just TIF_SIGPENDING
>>> set on the dumper task but also SIGKILL set in its pending bitmap.
>>> This means that if the code is ever fixed not to short-circuit and
>>> kill a process after it has already been killed the special case
>>> for SIGKILL during a coredump will be broken.
>>>
>>> Sort all of this out by making the coredump special case more special,
>>> and perform all of the work in prepare_signal and leave the rest of
>>> the signal delivery path out of it.
>>>
>>> In prepare_signal when the process coredumping is sent SIGKILL find
>>> the task performing the coredump and use sigaddset and signal_wake_up
>>> to ensure that task reports fatal_signal_pending.
>>>
>>> Return false from prepare_signal to tell the rest of the signal
>>> delivery path to ignore the signal.
>>>
>>> Update wait_for_dump_helpers to perform a wait_event_killable wait
>>> so that if signal_pending gets set spuriously the wait will not
>>> be interrupted unless fatal_signal_pending is true.
>>>
>>> I have tested this and verified I did not break SIGKILL during
>>> coredumps by accident (before or after this change). I actually
>>> thought I had and I had to figure out what I had misread that kept
>>> SIGKILL during coredumps working.
>>>
>>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>>> ---
>>> fs/coredump.c | 4 ++--
>>> kernel/signal.c | 11 +++++++++--
>>> 2 files changed, 11 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>> index a6b3c196cdef..7b91fb32dbb8 100644
>>> --- a/fs/coredump.c
>>> +++ b/fs/coredump.c
>>> @@ -448,7 +448,7 @@ static void coredump_finish(bool core_dumped)
>>> static bool dump_interrupted(void)
>>> {
>>> /*
>>> - * SIGKILL or freezing() interrupt the coredumping. Perhaps we
>>> + * SIGKILL or freezing() interrupted the coredumping. Perhaps we
>>> * can do try_to_freeze() and check __fatal_signal_pending(),
>>> * but then we need to teach dump_write() to restart and clear
>>> * TIF_SIGPENDING.
>>> @@ -471,7 +471,7 @@ static void wait_for_dump_helpers(struct file *file)
>>> * We actually want wait_event_freezable() but then we need
>>> * to clear TIF_SIGPENDING and improve dump_interrupted().
>>> */
>>> - wait_event_interruptible(pipe->rd_wait, pipe->readers == 1);
>>> + wait_event_killable(pipe->rd_wait, pipe->readers == 1);
>>>
>>> pipe_lock(pipe);
>>> pipe->readers--;
>>> diff --git a/kernel/signal.c b/kernel/signal.c
>>> index 8272cac5f429..7e305a8ec7c2 100644
>>> --- a/kernel/signal.c
>>> +++ b/kernel/signal.c
>>> @@ -907,8 +907,15 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
>>> sigset_t flush;
>>>
>>> if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
>>> - if (!(signal->flags & SIGNAL_GROUP_EXIT))
>>> - return sig == SIGKILL;
>>> + struct core_state *core_state = signal->core_state;
>>> + if (core_state) {
>>> + if (sig == SIGKILL) {
>>> + struct task_struct *dumper = core_state->dumper.task;
>>> + sigaddset(&dumper->pending.signal, SIGKILL);
>>> + signal_wake_up(dumper, 1);
>>> + }
>>> + return false;
>>> + }
>>> /*
>>> * The process is in the middle of dying, nothing to do.
>>> */
>>>
>>
>> Hi,
>>
>> This patch breaks userspace, in particular it breaks gst-plugin-scanner
>> of GStreamer which hangs now on next-20211224. IIUC, this tool builds a
>> registry of good/working GStreamer plugins by loading them and
>> blacklisting those that don't work (crash). Before the hang I see
>> the systemd-coredump process running, taking a snapshot of
>> gst-plugin-scanner, and then gst-plugin-scanner gets stuck.
>>
>> Bisection points at this patch, reverting it restores
>> gst-plugin-scanner. Systemd-coredump is still running, but there is no
>> hang anymore and everything works properly as before.
>>
>> I'm seeing this problem on ARM32 and haven't checked other arches.
>> Please fix, thanks in advance.
>
>
> I have not yet been able to figure out how to run gst-plugin-scanner in
> a way that triggers this. In truth I can't figure out how to
> run gst-plugin-scanner in a useful way.
>
> I am going to set up some unit tests and see if I can reproduce your
> hang another way, but if you could give me some more information on what
> you are doing to trigger this I would appreciate it.

Thanks, Eric. The distro is Arch Linux, but it's a development
environment where I'm running latest GStreamer from git master. I'll try
to figure out the reproduction steps and get back to you.

2022-01-05 21:53:42

Subject: Re: [PATCH 02/10] exit: Add and use make_task_dead.

On Wed, Jan 05, 2022 at 02:46:10PM -0600, Eric W. Biederman wrote:
> Al Viro <[email protected]> writes:
>
> > On Wed, Dec 08, 2021 at 02:25:24PM -0600, Eric W. Biederman wrote:
> >> There are two big uses of do_exit. The first is its designed use as
> >> the guts of the exit(2) system call. The second use is to terminate
> >> a task after something catastrophic has happened like a NULL pointer
> >> dereference in kernel code.
> >>
> >> Add a function make_task_dead that is initially exactly the same as
> >> do_exit to cover the cases where do_exit is called to handle
> >> catastrophic failure. In time this can probably be reduced to just a
> >> light wrapper around do_task_dead. For now keep it exactly the same so
> >> that there will be no behavioral differences introducing this new
> >> concept.
> >>
> >> Replace all of the uses of do_exit that use it for catastrophic
> >> task cleanup with make_task_dead to make it clear what the code
> >> is doing.
> >>
> >> As part of this rename rewind_stack_do_exit to
> >> rewind_stack_and_make_dead.
> >
> > Umm... What about .Linvalid_mask: in arch/xtensa/kernel/entry.S?
> > That's an obvious case for your make_task_dead().
>
> Good catch.
>
> Being in assembly it did not have anything after the name do_exit so it
> hid from my regex "[^A-Za-z0-9_]do_exit[^A-Za-z0-9]". Thank you for
> finding that.

Umm... What's wrong with '\<do_exit\>'? Difference in catch:

missed 6
Documentation/trace/kprobes.rst:596:do_exit() case covered. do_execve() and do_fork() are not an issue.
arch/x86/entry/entry_32.S:1258: call do_exit
arch/x86/entry/entry_64.S:1440: call do_exit
arch/xtensa/kernel/entry.S:1436: abi_call do_exit
samples/bpf/test_cgrp2_tc.sh:114:do_exit() {
tools/testing/selftests/ftrace/test.d/kprobe/kprobe_multiprobe.tc:8:SYM2=do_exit

extra 3
arch/powerpc/mm/book3s64/radix_tlb.c:815:static void do_exit_flush_lazy_tlb(void *arg)
arch/powerpc/mm/book3s64/radix_tlb.c:830: smp_call_function_many(mm_cpumask(mm), do_exit_flush_lazy_tlb,
tools/perf/ui/browsers/hists.c:2847: act->fn = do_exit_browser;

Extra catch clearly contains nothing of interest (assuming it's not a result of a typo
in your regex in the first place - you seem to have omitted _ from the second set, and if
you add that back, these 3 hits go away). And missed 6... 3 are outside of the kernel
source proper, and the rest are all genuine. You've caught x86 ones (inside the
rewind_stack_do_exit variants) and missed the xtensa one...

\< and \> are GNUisms, but both git grep and grep (both on Linux and FreeBSD, at least)
handle them... Or use \bdo_exit\b, for that matter (Perlism instead of GNUism, matching
both the beginnings and ends of words)...
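
The gap between the two patterns can be reproduced outside the kernel tree. What follows is a minimal sketch, assuming POSIX regcomp()/regexec() with extended syntax. The helper name and the two pattern constants are illustrative only, and the whole-word pattern spells the boundaries out by hand because \< and \> are not guaranteed by POSIX regcomp().

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* The pattern quoted in the thread: it needs a character on both sides,
 * and its second set does not exclude '_'. */
static const char *NEEDS_NEIGHBOURS = "[^A-Za-z0-9_]do_exit[^A-Za-z0-9]";

/* Hand-written whole-word match: a boundary may also be the start or
 * the end of the line. */
static const char *WHOLE_WORD =
    "(^|[^A-Za-z0-9_])do_exit([^A-Za-z0-9_]|$)";

/* Returns 1 if pattern matches somewhere in line, 0 if it does not,
 * and -1 if the pattern fails to compile. */
static int line_matches(const char *pattern, const char *line)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && line != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Fed a line that ends in do_exit, as the xtensa call site does, the first pattern finds nothing while the second one matches. On the do_exit_flush_lazy_tlb line the results are reversed.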

2022-01-05 22:33:34

by Eric W. Biederman

Subject: Re: [PATCH 04/10] exit: Stop poorly open coding do_task_dead in make_task_dead

Al Viro <[email protected]> writes:

> On Wed, Dec 08, 2021 at 02:25:26PM -0600, Eric W. Biederman wrote:
>> When the kernel detects it is oopsing or otherwise force-killing a task
>> while it exits, the code poorly attempts to permanently stop the task
>> from scheduling.
>>
>> I say poorly because it is possible for a task in TASK_UNINTERRUPTIBLE
>> to be woken up.
>>
>> As it makes no sense for the task to continue, call do_task_dead
>> instead, which actually does the work and permanently removes the task
>> from the scheduler, guaranteeing the task will never be woken
>> up again.
>
> NAK. This is not all do_task_dead() leads to - see what finish_task_switch()
> does upon seeing TASK_DEAD:
> /* Task is done with its stack. */
> put_task_stack(prev);
> put_task_struct_rcu_user(prev);
>
>
> Now take a look at the comment just before that check for PF_EXITING -
> the point is to leave the task leaked, rather than proceeding with
> freeing the sucker.
>
> We are not going through the normal "turn zombie" motions, including
> waking wait(2) callers up, etc. Going ahead and freeing it could
> fuck the things up quite badly.

I believe I was thinking this task won't be reaped because release_task
can never be called. Which I admit depending on where we oops in
do_exit is not strictly true.

We can guarantee the leak with:

tsk->exit_state = EXIT_DEAD;
refcount_inc(&tsk->rcu_users);

It just feels wrong to me to have something dead and broken sticking around
the scheduler queue. Especially as something could come along and wake
it up and then what do we do.

Hmm. I think we want that tsk->exit_state = EXIT_DEAD regardless to
prevent it from being reaped and possibly causing more harm.

Eric
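
The effect of that extra reference can be modelled in userspace. This is a toy sketch, assuming a plain integer counter in place of refcount_t. struct toy_task, toy_put() and toy_make_leaked() are hypothetical names, not kernel APIs. It only illustrates why an unmatched increment keeps the final put from ever reaching the free path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the pieces of task_struct discussed above. */
struct toy_task {
    int rcu_users;   /* models tsk->rcu_users */
    bool dead;       /* models tsk->exit_state == EXIT_DEAD */
    bool freed;      /* set when the last reference is dropped */
};

/* Drop one reference; mark the task freed when the count reaches zero. */
static void toy_put(struct toy_task *t)
{
    assert(t->rcu_users > 0);
    if (--t->rcu_users == 0)
        t->freed = true;
}

/* Models the two statements from the mail: mark the task dead and take
 * a reference that nobody will ever drop. */
static void toy_make_leaked(struct toy_task *t)
{
    t->dead = true;
    t->rcu_users++;
}
```

After toy_make_leaked(), dropping every reference that existed beforehand still leaves the count at one, so the freed flag is never set.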

2022-01-05 22:36:24

by Eric W. Biederman

Subject: Re: [PATCH 05/10] exit: Stop exporting do_exit

Al Viro <[email protected]> writes:

> On Wed, Dec 08, 2021 at 02:25:27PM -0600, Eric W. Biederman wrote:
>> Now that there are no more modular uses of do_exit remove the EXPORT_SYMBOL.
>>
>> Suggested-by: Christoph Hellwig <[email protected]>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> kernel/exit.c | 1 -
>> 1 file changed, 1 deletion(-)
>>
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index f975cd8a2ed8..57afac845a0a 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -843,7 +843,6 @@ void __noreturn do_exit(long code)
>> lockdep_free_task(tsk);
>> do_task_dead();
>> }
>> -EXPORT_SYMBOL_GPL(do_exit);
>
> "Now" in the commit message is misleading, AFAICS - there's no such users
> in the mainline right now (and yes, that one could be moved all the way
> up).

Yes. I should have said: now there are few enough users of do_exit
that I can inspect the code and see there are no more modular users.

Or words to that effect.

Because honestly my make_task_dead change got rid of most of the callers
of do_exit.

Eric

2022-01-05 22:53:15

by Linus Torvalds

Subject: Re: [PATCH 02/10] exit: Add and use make_task_dead.

On Wed, Jan 5, 2022 at 1:53 PM Al Viro <[email protected]> wrote:
>
> On Wed, Jan 05, 2022 at 02:46:10PM -0600, Eric W. Biederman wrote:
> >
> > Being in assembly it did not have anything after the name do_exit so it
> > hid from my regex "[^A-Za-z0-9_]do_exit[^A-Za-z0-9]". Thank you for
> > finding that.
>
> Umm... What's wrong with '\<do_exit\>'?

Christ people, you both make it so complicated.

If you want to search for 'do_exit', just do

git grep -w do_exit

where that '-w' does exactly that "word boundary" thing.

I thought everybody knew about this, because it's such a common thing
to do - checking my shell history, more than a third of my "git grep"
uses use '-w', exactly because it's very convenient for identifier
lookup.

But yes, in more complex cases where you have other parts to the
pattern (ie you're not looking *just* for a single word), by all means
use '\<' and/or '\>'.

Linus

2022-01-05 23:35:14

Subject: Re: [PATCH 02/10] exit: Add and use make_task_dead.

On Wed, Jan 05, 2022 at 02:51:05PM -0800, Linus Torvalds wrote:
> On Wed, Jan 5, 2022 at 1:53 PM Al Viro <[email protected]> wrote:
> >
> > On Wed, Jan 05, 2022 at 02:46:10PM -0600, Eric W. Biederman wrote:
> > >
> > > Being in assembly it did not have anything after the name do_exit so it
> > > hid from my regex "[^A-Za-z0-9_]do_exit[^A-Za-z0-9]". Thank you for
> > > finding that.
> >
> > Umm... What's wrong with '\<do_exit\>'?
>
> Christ people, you both make it so complicated.
>
> If you want to search for 'do_exit', just do
>
> git grep -w do_exit
>
> where that '-w' does exactly that "word boundary" thing.

Sure.

> I thought everybody knew about this, because it's such a common thing
> to do - checking my shell history, more than a third of my "git grep"
> uses use '-w', exactly because it's very convenient for identifier
> lookup
>
> But yes, in more complex cases where you have other parts to the
> pattern (ie you're not looking *just* for a single word), by all means
> use '\<' and/or '\>'.

Yep. I wanted to make it clear that you really don't need that kind
of horrors ([^A-Za-z0-9_]); sure, on the ends of regex you just need
-w and that's it, but it's not needed in more convoluted cases either.

BTW, it doesn't have to be "have other parts of pattern" - IME the typical
case when -w is not enough is something like

git grep -n '\<wait_for_completion'
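
Why -w alone does not cover this case: the pattern anchors only the start of the identifier, so it should hit wait_for_completion_timeout() but not try_wait_for_completion(). A small sketch in plain C follows. starts_word_with() and is_word_char() are illustrative helpers, not something grep or the kernel provides.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Word characters for this purpose: letters, digits and underscore. */
static bool is_word_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* True if some word in line begins with prefix, which is what
 * '\<prefix' asks for: a word boundary on the left side only. */
static bool starts_word_with(const char *line, const char *prefix)
{
    size_t n = strlen(prefix);
    const char *p = line;

    assert(n > 0);
    while ((p = strstr(p, prefix)) != NULL) {
        if (p == line || !is_word_char(p[-1]))
            return true;
        p++;    /* mid-word hit: keep scanning the rest of the line */
    }
    return false;
}
```

With the prefix "wait_for_completion" this accepts a line calling wait_for_completion_timeout() and rejects one that only mentions try_wait_for_completion().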

2022-01-06 07:08:22

Subject: Re: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead

On Wed, Jan 05, 2022 at 05:48:08AM +0000, Al Viro wrote:
> On Wed, Dec 08, 2021 at 02:25:25PM -0600, Eric W. Biederman wrote:
> > The beginning of do_exit has become cluttered and difficult to read as
> > it is filled with checks to handle things that can only happen when
> > the kernel is operating improperly.
> >
> > Now that we have a dedicated function for cleaning up a task when the
> > kernel is operating improperly move the checks there.
>
> Umm... I would probably take profile_task_exit() crap out before that
> point.
> 1) the damn thing is dead - nothing registers notifiers there
> 2) blocking_notifier_call_chain() is not a nice thing to do on oops...
>
> I'll post a patch ripping the dead parts of kernel/profile.c out tomorrow
> morning (there's also profile_handoff_task(), equally useless these days
> and complicating things for __put_task_struct()).

Ugh... Forgot to post, sorry.

[PATCH] get rid of dead machinery in kernel/profile.c

Nothing is placed on the call chains in there, now that oprofile is
gone. Remove them, along with the hooks for calling them.

Signed-off-by: Al Viro <[email protected]>
---
diff --git a/include/linux/profile.h b/include/linux/profile.h
index fd18ca96f5574..88dfb0543ea63 100644
--- a/include/linux/profile.h
+++ b/include/linux/profile.h
@@ -63,26 +63,6 @@ static inline void profile_hit(int type, void *ip)
profile_hits(type, ip, 1);
}

-struct task_struct;
-struct mm_struct;
-
-/* task is in do_exit() */
-void profile_task_exit(struct task_struct * task);
-
-/* task is dead, free task struct ? Returns 1 if
- * the task was taken, 0 if the task should be freed.
- */
-int profile_handoff_task(struct task_struct * task);
-
-/* sys_munmap */
-void profile_munmap(unsigned long addr);
-
-int task_handoff_register(struct notifier_block * n);
-int task_handoff_unregister(struct notifier_block * n);
-
-int profile_event_register(enum profile_type, struct notifier_block * n);
-int profile_event_unregister(enum profile_type, struct notifier_block * n);
-
#else

#define prof_on 0
@@ -107,30 +87,6 @@ static inline void profile_hit(int type, void *ip)
return;
}

-static inline int task_handoff_register(struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-static inline int task_handoff_unregister(struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-static inline int profile_event_register(enum profile_type t, struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-static inline int profile_event_unregister(enum profile_type t, struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-#define profile_task_exit(a) do { } while (0)
-#define profile_handoff_task(a) (0)
-#define profile_munmap(a) do { } while (0)
-
#endif /* CONFIG_PROFILING */

#endif /* _LINUX_PROFILE_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index f702a6a63686e..5086a5e9d02de 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -765,7 +765,6 @@ void __noreturn do_exit(long code)
preempt_count_set(PREEMPT_ENABLED);
}

- profile_task_exit(tsk);
kcov_task_exit(tsk);

coredump_task_exit(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index 3244cc56b697d..496c0b6c8cb83 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -754,9 +754,7 @@ void __put_task_struct(struct task_struct *tsk)
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
sched_core_free(tsk);
-
- if (!profile_handoff_task(tsk))
- free_task(tsk);
+ free_task(tsk);
}
EXPORT_SYMBOL_GPL(__put_task_struct);

diff --git a/kernel/profile.c b/kernel/profile.c
index eb9c7f0f5ac52..37640a0bd8a3c 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -133,79 +133,6 @@ int __ref profile_init(void)
return -ENOMEM;
}

-/* Profile event notifications */
-
-static BLOCKING_NOTIFIER_HEAD(task_exit_notifier);
-static ATOMIC_NOTIFIER_HEAD(task_free_notifier);
-static BLOCKING_NOTIFIER_HEAD(munmap_notifier);
-
-void profile_task_exit(struct task_struct *task)
-{
- blocking_notifier_call_chain(&task_exit_notifier, 0, task);
-}
-
-int profile_handoff_task(struct task_struct *task)
-{
- int ret;
- ret = atomic_notifier_call_chain(&task_free_notifier, 0, task);
- return (ret == NOTIFY_OK) ? 1 : 0;
-}
-
-void profile_munmap(unsigned long addr)
-{
- blocking_notifier_call_chain(&munmap_notifier, 0, (void *)addr);
-}
-
-int task_handoff_register(struct notifier_block *n)
-{
- return atomic_notifier_chain_register(&task_free_notifier, n);
-}
-EXPORT_SYMBOL_GPL(task_handoff_register);
-
-int task_handoff_unregister(struct notifier_block *n)
-{
- return atomic_notifier_chain_unregister(&task_free_notifier, n);
-}
-EXPORT_SYMBOL_GPL(task_handoff_unregister);
-
-int profile_event_register(enum profile_type type, struct notifier_block *n)
-{
- int err = -EINVAL;
-
- switch (type) {
- case PROFILE_TASK_EXIT:
- err = blocking_notifier_chain_register(
- &task_exit_notifier, n);
- break;
- case PROFILE_MUNMAP:
- err = blocking_notifier_chain_register(
- &munmap_notifier, n);
- break;
- }
-
- return err;
-}
-EXPORT_SYMBOL_GPL(profile_event_register);
-
-int profile_event_unregister(enum profile_type type, struct notifier_block *n)
-{
- int err = -EINVAL;
-
- switch (type) {
- case PROFILE_TASK_EXIT:
- err = blocking_notifier_chain_unregister(
- &task_exit_notifier, n);
- break;
- case PROFILE_MUNMAP:
- err = blocking_notifier_chain_unregister(
- &munmap_notifier, n);
- break;
- }
-
- return err;
-}
-EXPORT_SYMBOL_GPL(profile_event_unregister);
-
#if defined(CONFIG_SMP) && defined(CONFIG_PROC_FS)
/*
* Each cpu has a pair of open-addressed hashtables for pending
diff --git a/mm/mmap.c b/mm/mmap.c
index bfb0ea164a90a..70318c2a47c39 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2928,7 +2928,6 @@ EXPORT_SYMBOL(vm_munmap);
SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
{
addr = untagged_addr(addr);
- profile_munmap(addr);
return __vm_munmap(addr, len, true);
}
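
Why the hook in __put_task_struct() is dead: with nothing registered on the chain, the call can only report NOTIFY_DONE, so profile_handoff_task() always returns 0 and free_task() always runs. Below is a toy model of that reasoning. The toy_ names are hypothetical, the chain walk is simplified (no stop-mask handling), and only the NOTIFY_DONE and NOTIFY_OK values are meant to mirror the kernel's.

```c
#include <assert.h>
#include <stddef.h>

#define NOTIFY_DONE 0x0000  /* no callback handled the event */
#define NOTIFY_OK   0x0001  /* a callback handled the event */

struct toy_notifier {
    int (*call)(void *data);
    struct toy_notifier *next;
};

/* Walk the chain; the result stays NOTIFY_DONE when the chain is empty. */
static int toy_call_chain(struct toy_notifier *head, void *data)
{
    int ret = NOTIFY_DONE;

    for (struct toy_notifier *n = head; n != NULL; n = n->next) {
        assert(n->call != NULL);
        ret = n->call(data);
    }
    return ret;
}

/* Same shape as the removed profile_handoff_task(). */
static int toy_handoff_task(struct toy_notifier *head, void *task)
{
    return toy_call_chain(head, task) == NOTIFY_OK ? 1 : 0;
}

/* Example callback that claims the task, for comparison. */
static int toy_take_task(void *data)
{
    (void)data;
    return NOTIFY_OK;
}
```

An empty chain makes toy_handoff_task() return 0 every time, which is why replacing the conditional with an unconditional free_task() does not change behaviour once nothing registers.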

2022-01-07 02:28:10

Subject: Re: [PATCH 06/10] exit: Implement kthread_exit

On Wed, Dec 08, 2021 at 02:25:28PM -0600, Eric W. Biederman wrote:

> +/**
> + * kthread_exit - Cause the current kthread return @result to kthread_stop().
> + * @result: The integer value to return to kthread_stop().
> + *
> + * While kthread_exit can be called directly, it exists so that
> + * functions which do some additional work in non-modular code such as
> + * module_put_and_kthread_exit can be implemented.
> + *
> + * Does not return.
> + */
> +void __noreturn kthread_exit(long result)
> +{
> + do_exit(result);
> +}

> static int kthread(void *_create)
> {
> static const struct sched_param param = { .sched_priority = 0 };
> @@ -286,13 +301,13 @@ static int kthread(void *_create)
> done = xchg(&create->done, NULL);
> if (!done) {
> kfree(create);
> - do_exit(-EINTR);
> + kthread_exit(-EINTR);

This do_exit(-EINTR) is pure cargo-culting; nobody will see that return
value, since by this point nobody could have looked at the task_struct
or pid. Look: we must have had
* __kthread_create_on_node() called
it has allocated a request (create), filled it, put it on
kthreadd's request list and woke kthreadd up.
* it went to wait (killably) for completion of create->done.
* it got a SIGKILL and had successfully replaced create->done
with NULL.
* since it has gotten non-NULL, it has buggered off (with -EINTR,
incidentally).
In the meanwhile, kthreadd had picked the request from its list and
successfully forked the child. Child had run kthread() up to the point
you'd quoted, i.e. it had observed create->done already NULL, freed create
and is now terminating itself.

Caller of kernel_thread() doesn't pass the pid to anyone, since the pid
must've been positive. kthread() does not store its pid or task_struct
reference anywhere on that path. __kthread_create_on_node() doesn't even
bother looking at anything other than create->done and it returns -EINTR.
Child couldn't be traced by anyone.

So how the hell could anyone look at the value we pass to do_exit() here?
Might as well use do_exit(0)...

> if (!self) {
> create->result = ERR_PTR(-ENOMEM);
> complete(done);
> - do_exit(-ENOMEM);
> + kthread_exit(-ENOMEM);

Ditto. We must have had
* __kthread_create_on_node() called
it has allocated a request (create), filled it, put it on
kthreadd's request list and woke kthreadd.
* it went to wait (killably) for completion of create->done.
* it did *NOT* get SIGKILL until after the child
successfully forked by kthreadd got through that xchg()
in kthread().
Either __kthread_create_on_node() hadn't gotten SIGKILL before our call
of complete(), or it has gotten one, observed NULL create->done and simply
proceeded to do wait_for_completion() on the same thing (it's its local
variable, so it knows what create->done used to point to). Either way,
__kthread_create_on_node() hits
task = create->result;
sees that it's ERR_PTR(-ENOMEM) and proceeds to fail with -ENOMEM.
Again, nothing is looking at the pid or task_struct of the child and
nothing could possibly observe its exit code.

> }
>
> self->threadfn = threadfn;
> @@ -326,7 +341,7 @@ static int kthread(void *_create)
> __kthread_parkme(self);
> ret = threadfn(data);
> }
> - do_exit(ret);
> + kthread_exit(ret);

This one, OTOH, is a different story. Here we have already hit the
completion and stopped the child. With __kthread_create_on_node()
having found and returned the task_struct of the child. Since that
point, somebody had already woken the child (using the pointer returned
by __kthread_create_on_node()).

What's more, the child's payload has already run to completion.
*NORMALLY* that means kthread_stop() called on it. And there we do the
following bit of nastiness:
get_task_struct(k);
...
mark it "should stop" and wake it up
...
wait_for_completion(&kthread->exited);
ret = k->exit_code;
put_task_struct(k);

kthread->exited is what k->vfork_done is left pointing to, so this
wait_for_completion() waits for that do_exit() in the kthread() and
we proceed to pick k->exit_code (using the fact that k has just
been pinned by us). Then kthread_stop() proceeds to return that
value.

Pardon me while I puke. The value being returned has nothing to do
with the things one could normally find in ->exit_code, for starters.
What's more, kthread->exited is a part of per-thread data structure,
and that same structure could bloody well be used to pass the damn
return value of threadfn(), instead of doing unnatural things with
->exit_code. And that data structure is not freed until free_task(k),
i.e. we can fetch from it whenever we can fetch k->exit_code.

Oh, and all other ways to stop the thread do not bother looking at
the exit code at all.

IMO the right way to handle that would be
1) turn these two do_exit() into do_exit(0), to reduce
confusion
2) deal with all do_exit() in kthread payloads. Your
name for the primitive is fine, IMO.
3) make that primitive pass the return value by way of
a field in struct kthread, adjusting kthread_stop() accordingly
and passing 0 to do_exit() in kthread_exit() itself.

(2) is not as trivial as you seem to hope, though. Your patches
in drivers/staging/rt*/ had papered over the problem in there,
but hadn't really solved it.

thread_exit() should've been shot, all right, but it really ought
to have been complete_and_exit() there. The thing is, complete()
+ return does *not* guarantee that driver won't get unloaded before
the thread terminates. Possibly freeing its .code and leaving
a thread to resume running in there as soon as it regains CPU.

The point of complete_and_exit() is that it's noreturn *and* in
core kernel. So it can be safely used in a modular kthread,
if paired with wait_for_completion() in or before module_exit.
complete() + do_exit() (or complete + return as you've gotten
there) doesn't give such guarantees at all.

I'm (re)crawling through that zoo right now, will post when
I get more details.
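
The argument about the first do_exit() turns on who wins the exchange on create->done. Here is a sequential userspace model, assuming C11 atomic_exchange() as a stand-in for the kernel's xchg(). struct toy_create, struct toy_completion and both helper names are hypothetical, and the real code runs the two sides on different tasks.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_completion { bool completed; };

struct toy_create {
    _Atomic(struct toy_completion *) done;  /* models create->done */
};

/* Creator side after a fatal signal: try to take 'done' back.
 * Returns -EINTR if it won the exchange and may leave at once,
 * 0 if the child got there first and the creator must keep waiting. */
static int toy_creator_killed(struct toy_create *create)
{
    assert(create != NULL);
    if (atomic_exchange(&create->done, NULL) != NULL)
        return -EINTR;
    return 0;
}

/* Child side at the top of kthread(): returns the completion it now
 * owns, or NULL if the creator has already taken it back and left. */
static struct toy_completion *toy_child_start(struct toy_create *create)
{
    assert(create != NULL);
    return atomic_exchange(&create->done, NULL);
}
```

Whichever side exchanges first gets the non-NULL pointer. When the creator wins, it has already returned -EINTR by the time the child sees NULL, which is the case where no one is left to look at the child's exit value.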

2022-01-07 03:22:18

Subject: Re: [PATCH 10/10] exit/kthread: Move the exit code for kernel threads into struct kthread

On Wed, Dec 08, 2021 at 02:25:32PM -0600, Eric W. Biederman wrote:
> The exit code of kernel threads has different semantics than the
> exit_code of userspace tasks. To avoid confusion and allow
> the userspace implementation to change as needed move
> the kernel thread exit code into struct kthread.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> kernel/kthread.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 8e5f44bed027..9c6c532047c4 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -52,6 +52,7 @@ struct kthread_create_info
> struct kthread {
> unsigned long flags;
> unsigned int cpu;
> + int result;
> int (*threadfn)(void *);
> void *data;
> mm_segment_t oldfs;
> @@ -287,7 +288,9 @@ EXPORT_SYMBOL_GPL(kthread_parkme);
> */
> void __noreturn kthread_exit(long result)
> {
> - do_exit(result);
> + struct kthread *kthread = to_kthread(current);
> + kthread->result = result;
> + do_exit(0);
> }
>
> /**
> @@ -679,7 +682,7 @@ int kthread_stop(struct task_struct *k)
> kthread_unpark(k);
> wake_up_process(k);
> wait_for_completion(&kthread->exited);
> - ret = k->exit_code;
> + ret = kthread->result;
> put_task_struct(k);
>
> trace_sched_kthread_stop_ret(ret);

Fine, except that you've turned the first two do_exit() in kthread() into
calls of kthread_exit(). If they are hit, you are screwed, especially
the second one - there you have an allocation failure for struct kthread,
so this will instantly oops on attempt to store into ->result.

See reply to your 6/10 regarding the difference between the last
call of do_exit() in kthread() and the first two of them. They
(the first two) should be simply do_exit(0); transmission of error
value happens differently and not in direction of kthread_stop().
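
The hazard can be shown with a userspace model of kthread_exit() as changed by this patch. This is a sketch only: struct toy_kthread and the toy_ functions are hypothetical, and the NULL check stands in for what, per the review above, would be an oops in the kernel when there is no struct kthread to store into.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields of struct kthread used here. */
struct toy_kthread {
    int result;     /* models kthread->result from the patch */
    bool exited;    /* models completion of kthread->exited */
};

/* Models kthread_exit() after 10/10: the value travels through the
 * per-thread structure. Returns false where the kernel version would
 * dereference a NULL pointer. */
static bool toy_kthread_exit(struct toy_kthread *self, long result)
{
    if (self == NULL)
        return false;
    self->result = (int)result;
    self->exited = true;
    return true;
}

/* Models the tail of kthread_stop(): read the value once the thread
 * has exited. */
static int toy_kthread_stop(const struct toy_kthread *k)
{
    assert(k != NULL && k->exited);
    return k->result;
}
```

The normal path hands the threadfn() return value to the stopper through the structure. The allocation-failure path has no structure at all, which is why those early exits need a plain do_exit(0) instead.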

2022-01-07 03:43:05

Subject: Re: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead

On Wed, Jan 05, 2022 at 05:48:08AM +0000, Al Viro wrote:
> On Wed, Dec 08, 2021 at 02:25:25PM -0600, Eric W. Biederman wrote:
> > The beginning of do_exit has become cluttered and difficult to read as
> > it is filled with checks to handle things that can only happen when
> > the kernel is operating improperly.
> >
> > Now that we have a dedicated function for cleaning up a task when the
> > kernel is operating improperly move the checks there.
>
> Umm... I would probably take profile_task_exit() crap out before that
> point.
> 1) the damn thing is dead - nothing registers notifiers there
> 2) blocking_notifier_call_chain() is not a nice thing to do on oops...
>
> I'll post a patch ripping the dead parts of kernel/profile.c out tomorrow
> morning (there's also profile_handoff_task(), equally useless these days
> and complicating things for __put_task_struct()).

Argh... OK, so your subsequent series had pretty much the same thing.
My apologies - still digging myself out from mail pile that had accumulated
over two months ;-/

2022-01-07 03:48:22

Subject: Re: [PATCH 01/17] exit: Remove profile_task_exit & profile_munmap

On Mon, Jan 03, 2022 at 03:32:56PM -0600, Eric W. Biederman wrote:
> When I say remove I mean remove. All profile_task_exit and
> profile_munmap do is call a blocking notifier chain. The helpers
> profile_event_register and profile_event_unregister are not called
> anywhere in the tree. Which means this is all dead code.
>
> So remove the dead code and make it easier to read do_exit.

How about doing the same to profile_handoff_task() and
task_handoff_register()/task_handoff_unregister(),
while we are at it? Combined diff would be this:

diff --git a/include/linux/profile.h b/include/linux/profile.h
index fd18ca96f5574..6aa64730298a0 100644
--- a/include/linux/profile.h
+++ b/include/linux/profile.h
@@ -31,11 +31,6 @@ static inline int create_proc_profile(void)
}
#endif

-enum profile_type {
- PROFILE_TASK_EXIT,
- PROFILE_MUNMAP
-};
-
#ifdef CONFIG_PROFILING

extern int prof_on __read_mostly;
@@ -63,26 +58,6 @@ static inline void profile_hit(int type, void *ip)
profile_hits(type, ip, 1);
}

-struct task_struct;
-struct mm_struct;
-
-/* task is in do_exit() */
-void profile_task_exit(struct task_struct * task);
-
-/* task is dead, free task struct ? Returns 1 if
- * the task was taken, 0 if the task should be freed.
- */
-int profile_handoff_task(struct task_struct * task);
-
-/* sys_munmap */
-void profile_munmap(unsigned long addr);
-
-int task_handoff_register(struct notifier_block * n);
-int task_handoff_unregister(struct notifier_block * n);
-
-int profile_event_register(enum profile_type, struct notifier_block * n);
-int profile_event_unregister(enum profile_type, struct notifier_block * n);
-
#else

#define prof_on 0
@@ -107,30 +82,6 @@ static inline void profile_hit(int type, void *ip)
return;
}

-static inline int task_handoff_register(struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-static inline int task_handoff_unregister(struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-static inline int profile_event_register(enum profile_type t, struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-static inline int profile_event_unregister(enum profile_type t, struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-#define profile_task_exit(a) do { } while (0)
-#define profile_handoff_task(a) (0)
-#define profile_munmap(a) do { } while (0)
-
#endif /* CONFIG_PROFILING */

#endif /* _LINUX_PROFILE_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index f702a6a63686e..5086a5e9d02de 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -765,7 +765,6 @@ void __noreturn do_exit(long code)
preempt_count_set(PREEMPT_ENABLED);
}

- profile_task_exit(tsk);
kcov_task_exit(tsk);

coredump_task_exit(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index 3244cc56b697d..496c0b6c8cb83 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -754,9 +754,7 @@ void __put_task_struct(struct task_struct *tsk)
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
sched_core_free(tsk);
-
- if (!profile_handoff_task(tsk))
- free_task(tsk);
+ free_task(tsk);
}
EXPORT_SYMBOL_GPL(__put_task_struct);

diff --git a/kernel/profile.c b/kernel/profile.c
index eb9c7f0f5ac52..37640a0bd8a3c 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -133,79 +133,6 @@ int __ref profile_init(void)
return -ENOMEM;
}

-/* Profile event notifications */
-
-static BLOCKING_NOTIFIER_HEAD(task_exit_notifier);
-static ATOMIC_NOTIFIER_HEAD(task_free_notifier);
-static BLOCKING_NOTIFIER_HEAD(munmap_notifier);
-
-void profile_task_exit(struct task_struct *task)
-{
- blocking_notifier_call_chain(&task_exit_notifier, 0, task);
-}
-
-int profile_handoff_task(struct task_struct *task)
-{
- int ret;
- ret = atomic_notifier_call_chain(&task_free_notifier, 0, task);
- return (ret == NOTIFY_OK) ? 1 : 0;
-}
-
-void profile_munmap(unsigned long addr)
-{
- blocking_notifier_call_chain(&munmap_notifier, 0, (void *)addr);
-}
-
-int task_handoff_register(struct notifier_block *n)
-{
- return atomic_notifier_chain_register(&task_free_notifier, n);
-}
-EXPORT_SYMBOL_GPL(task_handoff_register);
-
-int task_handoff_unregister(struct notifier_block *n)
-{
- return atomic_notifier_chain_unregister(&task_free_notifier, n);
-}
-EXPORT_SYMBOL_GPL(task_handoff_unregister);
-
-int profile_event_register(enum profile_type type, struct notifier_block *n)
-{
- int err = -EINVAL;
-
- switch (type) {
- case PROFILE_TASK_EXIT:
- err = blocking_notifier_chain_register(
- &task_exit_notifier, n);
- break;
- case PROFILE_MUNMAP:
- err = blocking_notifier_chain_register(
- &munmap_notifier, n);
- break;
- }
-
- return err;
-}
-EXPORT_SYMBOL_GPL(profile_event_register);
-
-int profile_event_unregister(enum profile_type type, struct notifier_block *n)
-{
- int err = -EINVAL;
-
- switch (type) {
- case PROFILE_TASK_EXIT:
- err = blocking_notifier_chain_unregister(
- &task_exit_notifier, n);
- break;
- case PROFILE_MUNMAP:
- err = blocking_notifier_chain_unregister(
- &munmap_notifier, n);
- break;
- }
-
- return err;
-}
-EXPORT_SYMBOL_GPL(profile_event_unregister);
-
#if defined(CONFIG_SMP) && defined(CONFIG_PROC_FS)
/*
* Each cpu has a pair of open-addressed hashtables for pending
diff --git a/mm/mmap.c b/mm/mmap.c
index bfb0ea164a90a..70318c2a47c39 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2928,7 +2928,6 @@ EXPORT_SYMBOL(vm_munmap);
SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
{
addr = untagged_addr(addr);
- profile_munmap(addr);
return __vm_munmap(addr, len, true);
}

2022-01-07 03:59:49

Subject: Re: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads

On Wed, Dec 08, 2021 at 02:25:31PM -0600, Eric W. Biederman wrote:
> Today the rules are a bit iffy and arbitrary about which kernel
> threads have struct kthread present. Both idle threads and thread
> started with create_kthread want struct kthread present so that is
> effectively all kernel threads. Make the rule that if PF_KTHREAD
> and the task is running then struct kthread is present.
>
> This will allow the kernel thread code to using tsk->exit_code
> with different semantics from ordinary processes.

Getting rid of ->exit_code abuse is independent from this.
I'm not saying that this change is a bad idea, but it's an
independent thing. Simply turn these two failure exits
into do_exit(0) in 06/10 and that's it. Then this one
would get rid of if (!self) and the second of those two
calls, but it won't be nailed to that point of queue.

2022-01-07 19:01:43

by Eric W. Biederman

Subject: Re: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead

Al Viro <[email protected]> writes:

> On Wed, Dec 08, 2021 at 02:25:25PM -0600, Eric W. Biederman wrote:
>> - /*
>> - * If do_exit is called because this processes oopsed, it's possible
>> - * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
>> - * continuing. Amongst other possible reasons, this is to prevent
>> - * mm_release()->clear_child_tid() from writing to a user-controlled
>> - * kernel address.
>> - */
>> - force_uaccess_begin();
>
> Are you sure about that one? It shouldn't matter, but... it's a potential
> change for do_exit() from a kernel thread. As it is, we have that
> force_uaccess_begin() for exiting threads and for kernel ones it's not
> a no-op. I'm not concerned about attempted userland access after that
> point for those, obviously, but I'm not sure you won't step into something
> subtle here.
>
> I would prefer to split that particular change off into a separate commit...

Thank you for catching that. I was leaning too much on the description
in the comment of why force_uaccess_begin is there.

Catching up on the state of set_fs/get_fs removal, it appears that a lot
of progress has been made: on many architectures set_fs/get_fs is
just gone, and force_uaccess_begin is a noop.

On architectures that still have set_fs/get_fs it appears all of the old
warts are present and kernel threads still run with set_fs(KERNEL_DS).

Assuming it won't be too much longer before the rest of the arches have
set_fs/get_fs removed it looks like it makes sense to leave the
force_uaccess_begin where it is, and just let force_uaccess_begin be
removed when set_fs/get_fs are removed from the tree.

Christoph does it look like the set_fs/get_fs removal work is going
to stall indefinitely on some architectures? If so I think we want to
find a way to get kernel threads to run with set_fs(USER_DS) on the
stalled architectures. Otherwise I think we have a real hazard of
introducing bugs that will only show up on the stalled architectures.

I finally understand why, when I updated set_child_tid in the kthread
code early in fork, x86 was fine but another architecture was not.

Eric

2022-01-07 19:02:31

by Eric W. Biederman

Subject: Re: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead

Al Viro <[email protected]> writes:

> On Wed, Jan 05, 2022 at 05:48:08AM +0000, Al Viro wrote:
>> On Wed, Dec 08, 2021 at 02:25:25PM -0600, Eric W. Biederman wrote:
>> > The beginning of do_exit has become cluttered and difficult to read as
>> > it is filled with checks to handle things that can only happen when
>> > the kernel is operating improperly.
>> >
>> > Now that we have a dedicated function for cleaning up a task when the
>> > kernel is operating improperly move the checks there.
>>
>> Umm... I would probably take profile_task_exit() crap out before that
>> point.
>> 1) the damn thing is dead - nothing registers notifiers there
>> 2) blocking_notifier_call_chain() is not a nice thing to do on oops...
>>
>> I'll post a patch ripping the dead parts of kernel/profile.c out tomorrow
>> morning (there's also profile_handoff_task(), equally useless these days
>> and complicating things for __put_task_struct()).
>
> Argh... OK, so your subsequent series had pretty much the same thing.
> My apologies - still digging myself out from mail pile that had accumulated
> over two months ;-/

No worries. I really appreciate getting some detailed review. Some
things just take another set of eyes to spot.

Eric

2022-01-08 16:11:29

by Eric W. Biederman

Subject: Re: [PATCH 01/17] exit: Remove profile_task_exit & profile_munmap

Al Viro <[email protected]> writes:

> On Mon, Jan 03, 2022 at 03:32:56PM -0600, Eric W. Biederman wrote:
>> When I say remove I mean remove. All profile_task_exit and
>> profile_munmap do is call a blocking notifier chain. The helpers
>> profile_task_register and profile_task_unregister are not called
>> anywhere in the tree. Which means this is all dead code.
>>
>> So remove the dead code and make it easier to read do_exit.
>
> How about doing the same to profile_handoff_task() and
> task_handoff_register()/task_handoff_unregister(),
> while we are at it? Combined diff would be this:

A very good idea. I have added this incremental patch to my queue.

Eric

From: "Eric W. Biederman" <[email protected]>
Date: Sat, 8 Jan 2022 10:03:24 -0600
Subject: [PATCH] exit: Remove profile_handoff_task

All profile_handoff_task does is notify the task_free_notifier chain.
The helpers task_handoff_register and task_handoff_unregister are used
to add and delete entries from that chain and are never called.

So remove the dead code and make it much easier to read and reason
about __put_task_struct.

Suggested-by: Al Viro <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/profile.h | 19 -------------------
kernel/fork.c | 4 +---
kernel/profile.c | 23 -----------------------
3 files changed, 1 insertion(+), 45 deletions(-)

diff --git a/include/linux/profile.h b/include/linux/profile.h
index f7eb2b57d890..11db1ec516e2 100644
--- a/include/linux/profile.h
+++ b/include/linux/profile.h
@@ -61,14 +61,6 @@ static inline void profile_hit(int type, void *ip)
struct task_struct;
struct mm_struct;

-/* task is dead, free task struct ? Returns 1 if
- * the task was taken, 0 if the task should be freed.
- */
-int profile_handoff_task(struct task_struct * task);
-
-int task_handoff_register(struct notifier_block * n);
-int task_handoff_unregister(struct notifier_block * n);
-
#else

#define prof_on 0
@@ -93,17 +85,6 @@ static inline void profile_hit(int type, void *ip)
return;
}

-static inline int task_handoff_register(struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-static inline int task_handoff_unregister(struct notifier_block * n)
-{
- return -ENOSYS;
-}
-
-#define profile_handoff_task(a) (0)

#endif /* CONFIG_PROFILING */

diff --git a/kernel/fork.c b/kernel/fork.c
index 6f0293cb29c9..494539ecb6d3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -754,9 +754,7 @@ void __put_task_struct(struct task_struct *tsk)
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
sched_core_free(tsk);
-
- if (!profile_handoff_task(tsk))
- free_task(tsk);
+ free_task(tsk);
}
EXPORT_SYMBOL_GPL(__put_task_struct);

diff --git a/kernel/profile.c b/kernel/profile.c
index 9355cc934a96..37640a0bd8a3 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -133,29 +133,6 @@ int __ref profile_init(void)
return -ENOMEM;
}

-/* Profile event notifications */
-
-static ATOMIC_NOTIFIER_HEAD(task_free_notifier);
-
-int profile_handoff_task(struct task_struct *task)
-{
- int ret;
- ret = atomic_notifier_call_chain(&task_free_notifier, 0, task);
- return (ret == NOTIFY_OK) ? 1 : 0;
-}
-
-int task_handoff_register(struct notifier_block *n)
-{
- return atomic_notifier_chain_register(&task_free_notifier, n);
-}
-EXPORT_SYMBOL_GPL(task_handoff_register);
-
-int task_handoff_unregister(struct notifier_block *n)
-{
- return atomic_notifier_chain_unregister(&task_free_notifier, n);
-}
-EXPORT_SYMBOL_GPL(task_handoff_unregister);
-
#if defined(CONFIG_SMP) && defined(CONFIG_PROC_FS)
/*
* Each cpu has a pair of open-addressed hashtables for pending
--
2.29.2
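
The reasoning in the commit message above can be checked with a small userspace model. This is only a sketch: call_chain() is a simplified stand-in for atomic_notifier_call_chain() (it ignores NOTIFY_STOP_MASK), task_free_chain and the model_/old_ names are hypothetical, and the NOTIFY_* values are assumed to match include/linux/notifier.h. With nothing registered, the handoff always reports "not taken", so the old tail of __put_task_struct always ended up calling free_task.

```c
#include <stddef.h>

/* Assumed to match include/linux/notifier.h; everything else is a model. */
#define NOTIFY_DONE 0x0000
#define NOTIFY_OK   0x0001

struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long action, void *data);
	struct notifier_block *next;
};

/* Hypothetical stand-in for task_free_notifier: nothing ever registers. */
static struct notifier_block *task_free_chain = NULL;

/* Simplified chain walk: returns the last callback's result, or
 * NOTIFY_DONE when the chain is empty. */
static int call_chain(struct notifier_block *head,
		      unsigned long action, void *data)
{
	int ret = NOTIFY_DONE;

	for (struct notifier_block *nb = head; nb; nb = nb->next)
		ret = nb->notifier_call(nb, action, data);
	return ret;
}

/* Same shape as the removed profile_handoff_task(). */
static int model_profile_handoff_task(void *task)
{
	int ret = call_chain(task_free_chain, 0, task);

	return (ret == NOTIFY_OK) ? 1 : 0;
}

/* Old tail of __put_task_struct(): free only if nobody took the task.
 * Returns 1 when free_task() would have been called. */
static int old_tail_frees(void *task)
{
	return !model_profile_handoff_task(task);
}
```

Since the chain can never be non-empty in the tree, the conditional collapses to the unconditional free_task(tsk) in the patch.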

2022-01-08 18:14:49

by Eric W. Biederman

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Dmitry Osipenko <[email protected]> writes:

> 05.01.2022 22:58, Eric W. Biederman пишет:
>>
>> I have not yet been able to figure out how to run gst-pluggin-scanner in
>> a way that triggers this yet. In truth I can't figure out how to
>> run gst-pluggin-scanner in a useful way.
>>
>> I am going to set up some unit tests and see if I can reproduce your
>> hang another way, but if you could give me some more information on what
>> you are doing to trigger this I would appreciate it.
>
> Thanks, Eric. The distro is Arch Linux, but it's a development
> environment where I'm running latest GStreamer from git master. I'll try
> to figure out the reproduction steps and get back to you.

Thank you.

Until I can figure out why this is causing problems I have dropped the
following two patches from my queue:
signal: Make SIGKILL during coredumps an explicit special case
signal: Drop signals received after a fatal signal has been processed

I have replaced them with the following two patches that just do what
is needed for the rest of the code in the series:
signal: Have prepare_signal detect coredumps using signal->core_state
signal: Make coredump handling explicit in complete_signal

Perversely my failure to change the SIGKILL handling when coredumps are
happening proves to me that I need to change the SIGKILL handling when
coredumps are happening to make the code more maintainable.

Eric

2022-01-08 18:15:30

by Eric W. Biederman

Subject: [PATCH 1/2] signal: Have prepare_signal detect coredumps using signal->core_state

In preparation for removing the flag SIGNAL_GROUP_COREDUMP, change
prepare_signal to test signal->core_state instead of the flag
SIGNAL_GROUP_COREDUMP.

Both fields are protected by siglock and both live in signal_struct so
there are no real tradeoffs here, just a change to which field is
being tested.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/signal.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 8272cac5f429..f95a4423519d 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -906,8 +906,8 @@ static bool prepare_signal(int sig, struct task_struct *p, bool force)
struct task_struct *t;
sigset_t flush;

- if (signal->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
- if (!(signal->flags & SIGNAL_GROUP_EXIT))
+ if ((signal->flags & SIGNAL_GROUP_EXIT) || signal->core_state) {
+ if (signal->core_state)
return sig == SIGKILL;
/*
* The process is in the middle of dying, nothing to do.
--
2.29.2
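
To see what the changed test does, here is a userspace sketch of just the "only SIGKILL gets through" decision before and after the hunk above. The struct, the function names and the flag values are illustrative assumptions, not the kernel definitions. The two predicates agree whenever the coredump flag mirrors core_state and SIGNAL_GROUP_EXIT is clear; they differ only when both group exit and core_state are set, and this sketch does not establish whether that state is reachable at this point in the series.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative values; only the distinctness of the bits matters here. */
#define SIGNAL_GROUP_EXIT     0x00000004u
#define SIGNAL_GROUP_COREDUMP 0x00000008u

struct core_state { int unused; };

/* Hypothetical cut-down signal_struct with the two fields being compared. */
struct signal_model {
	unsigned int flags;
	struct core_state *core_state;
};

/* Before: true when prepare_signal would "return sig == SIGKILL". */
static bool old_sigkill_only(const struct signal_model *s)
{
	if (s->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP)) {
		if (!(s->flags & SIGNAL_GROUP_EXIT))
			return true;
	}
	return false;
}

/* After: the same decision keyed off core_state instead of the flag. */
static bool new_sigkill_only(const struct signal_model *s)
{
	if ((s->flags & SIGNAL_GROUP_EXIT) || s->core_state) {
		if (s->core_state)
			return true;
	}
	return false;
}
```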

2022-01-08 18:16:04

by Eric W. Biederman

Subject: [PATCH 2/2] signal: Make coredump handling explicit in complete_signal

Ever since commit 6cd8f0acae34 ("coredump: ensure that SIGKILL always
kills the dumping thread") it has been possible for a SIGKILL received
during a coredump to set SIGNAL_GROUP_EXIT and trigger a process
shutdown (for a second time).

Update the logic to explicitly allow coredumps so that coredumps can
set SIGNAL_GROUP_EXIT and shutdown like an ordinary process.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/signal.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index f95a4423519d..0706c1345a71 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1032,7 +1032,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
* then start taking the whole group down immediately.
*/
if (sig_fatal(p, sig) &&
- !(signal->flags & SIGNAL_GROUP_EXIT) &&
+ (signal->core_state || !(signal->flags & SIGNAL_GROUP_EXIT)) &&
!sigismember(&t->real_blocked, sig) &&
(sig == SIGKILL || !p->ptrace)) {
/*
--
2.29.2

2022-01-08 18:21:04

by Eric W. Biederman

Subject: Re: [PATCH 09/10] kthread: Ensure struct kthread is present for all kthreads

Al Viro <[email protected]> writes:

> On Wed, Dec 08, 2021 at 02:25:31PM -0600, Eric W. Biederman wrote:
>> Today the rules are a bit iffy and arbitrary about which kernel
>> threads have struct kthread present. Both idle threads and thread
>> started with create_kthread want struct kthread present so that is
>> effectively all kernel threads. Make the rule that if PF_KTHREAD
>> and the task is running then struct kthread is present.
>>
>> This will allow the kernel thread code to using tsk->exit_code
>> with different semantics from ordinary processes.
>
> Getting rid of ->exit_code abuse is independent from this.
> I'm not saying that this change is a bad idea, but it's an
> independent thing. Simply turn these two failure exits
> into do_exit(0) in 06/10 and that's it. Then this one
> would get rid of if (!self) and the second of those two
> calls, but it won't be nailed to that point of queue.

That is a good point.

As this code has been in linux-next for a while, I am going to leave
the dependency in place in the interests of sending Linus tested code.

This change, with the bit about which field points to struct kthread,
seems like a good idea on its own merits.

Eric

2022-01-08 18:36:08

by Eric W. Biederman

Subject: Re: [PATCH 06/10] exit: Implement kthread_exit

Al Viro <[email protected]> writes:

> IMO the right way to handle that would be
> 1) turn these two do_exit() into do_exit(0), to reduce
> confusion
> 2) deal with all do_exit() in kthread payloads. Your
> name for the primitive is fine, IMO.
> 3) make that primitive pass the return value by way of
> a field in struct kthread, adjusting kthread_stop() accordingly
> and passing 0 to do_exit() in kthread_exit() itself.
>
> (2) is not as trivial as you seem to hope, though. Your patches
> in drivers/staging/rt*/ had papered over the problem in there,
> but hadn't really solved it.
>
> thread_exit() should've been shot, all right, but it really ought
> to have been complete_and_exit() there. The thing is, complete()
> + return does *not* guarantee that driver won't get unloaded before
> the thread terminates. Possibly freeing its .code and leaving
> a thread to resume running in there as soon as it regains CPU.
>
> The point of complete_and_exit() is that it's noreturn *and* in
> core kernel. So it can be safely used in a modular kthread,
> if paired with wait_for_completion() in or before module_exit.
> complete() + do_exit() (or complete + return as you've gotten
> there) doesn't give such guarantees at all.

I think we are mostly in agreement here.

There are kernel threads started by modules that do:
complete(...);
return 0;

That should be at a minimum calling complete_and_exit. Possibly should
be restructured to use kthread_stop().

Some of those users of the now removed thread_exit() in staging are
among the offenders.

However thread_exit() was implemented as:
#define thread_exit() complete_and_exit(NULL, 0)

Which does nothing with a completion, it was just a really funny way to
spell "do_exit(0)".

While I agree that digging through all of the kernel threads and finding
the ones that should be calling complete_and_exit is a fine idea, it is
a concern independent of these patches.

> I'm (re)crawling through that zoo right now, will post when
> I get more details.

Eric
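
As a sketch of why thread_exit() was only a funny spelling of do_exit(0): the model below mirrors the shape of complete_and_exit(), which signals the completion only when one is passed and then exits. All model_ names and the counters are hypothetical userspace stand-ins; in particular the real do_exit() never returns, while the stand-in just records the call.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct completion. */
struct completion { unsigned int done; };

static int exit_calls;            /* how many times "do_exit" ran */
static long last_exit_code = -1;  /* code passed to the last "do_exit" */

static void model_complete(struct completion *c)
{
	c->done++;
}

/* The real do_exit() is noreturn; this one only records the call. */
static void model_do_exit(long code)
{
	exit_calls++;
	last_exit_code = code;
}

/* Same shape as complete_and_exit(): the completion is optional. */
static void model_complete_and_exit(struct completion *comp, long code)
{
	if (comp)
		model_complete(comp);
	model_do_exit(code);
}

/* The staging macro quoted above, expressed against the model. */
#define model_thread_exit() model_complete_and_exit(NULL, 0)
```

With a NULL completion nothing is signalled, so nobody waiting in module_exit learns anything from it; the only effect is the exit with code 0.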

2022-01-08 19:13:26

by Heiko Carstens

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

On Tue, Jan 04, 2022 at 01:47:05PM -0600, Eric W. Biederman wrote:
> Currently I suspect changing wait_event_uninterruptible to
> wait_event_killable, is causing problems.
>
> Or perhaps there is some reason tasks that have already entered do_exit
> need to have fatal_signal_pending set. (The will have
> fatal_signal_pending set up until they enter get_signal which calls
> do_group_exit which calls do_exit).
>
> Which is why I am trying to reproduce the reported failure so I can get
> the kernel to tell me what is going on. If this is not resolved quickly
> I won't send you this change, and I will pull it out of linux-next.

It would have been good if you had removed this from linux-next
already.

Anyway, now I also had to spend quite some time to bisect why several test
suites just hang with linux-next. It's probably because of holidays that
you didn't get more bug reports.

On s390

- ltp
- elfutils selftests
- seccomp kernel selftests

hang with linux-next.

I bisected the problem to this patch using elfutils selftests:

git clone git://sourceware.org/git/elfutils.git
cd elfutils
autoreconf -fi
./configure --enable-maintainer-mode --disable-debuginfod
make -j $(nproc) > /dev/null
cd tests
make -j $(nproc) check

Note: I actually didn't verify if this also causes ltp+seccomp selftests
to hang. I just assume it is the case.

2022-01-08 22:44:39

by David Laight

Subject: RE: [PATCH 06/10] exit: Implement kthread_exit

From: Eric W. Biederman
> Sent: 08 January 2022 18:36
>
> Al Viro <[email protected]> writes:
>
> > IMO the right way to handle that would be
> > 1) turn these two do_exit() into do_exit(0), to reduce
> > confusion
> > 2) deal with all do_exit() in kthread payloads. Your
> > name for the primitive is fine, IMO.
> > 3) make that primitive pass the return value by way of
> > a field in struct kthread, adjusting kthread_stop() accordingly
> > and passing 0 to do_exit() in kthread_exit() itself.
> >
> > (2) is not as trivial as you seem to hope, though. Your patches
> > in drivers/staging/rt*/ had papered over the problem in there,
> > but hadn't really solved it.
> >
> > thread_exit() should've been shot, all right, but it really ought
> > to have been complete_and_exit() there. The thing is, complete()
> > + return does *not* guarantee that driver won't get unloaded before
> > the thread terminates. Possibly freeing its .code and leaving
> > a thread to resume running in there as soon as it regains CPU.
> >
> > The point of complete_and_exit() is that it's noreturn *and* in
> > core kernel. So it can be safely used in a modular kthread,
> > if paired with wait_for_completion() in or before module_exit.
> > complete() + do_exit() (or complete + return as you've gotten
> > there) doesn't give such guarantees at all.
>
>
> I think we are mostly in agreement here.
>
> There are kernel threads started by modules that do:
> complete(...);
> return 0;
>
> That should be at a minimum calling complete_and_exit. Possibly should
> be restructured to use kthread_stop().

There is also module_put_and_exit(0);
Which must have an implied THIS_MODULE.

David

2022-01-09 03:27:35

Subject: Re: [PATCH 06/10] exit: Implement kthread_exit

On Sat, Jan 08, 2022 at 12:35:40PM -0600, Eric W. Biederman wrote:

> There are kernel threads started by modules that do:
> complete(...);
> return 0;
>
> That should be at a minimum calling complete_and_exit. Possibly should
> be restructured to use kthread_stop().
>
> Some of those users of the now removed thread_exit() in staging are
> among the offenders.
>
> However thread_exit() was implemented as:
> #define thread_exit() complete_and_exit(NULL, 0)
>
> Which does nothing with a completion, it was just a really funny way to
> spell "do_exit(0)".

Yes. And there's plenty of cargo-culting in that area.

> While I agree digging through all of the kernel threads and finding the
> ones that should be calling complete_and_exit is a fine idea. It is
> a concern independent of these patches.

BTW, could somebody explain how this
/*
 * Prevent the kthread exits directly, and make sure when kthread_stop()
 * is called to stop a kthread, it is still alive. If a kthread might be
 * stopped by CACHE_SET_IO_DISABLE bit set, wait_for_kthread_stop() is
 * necessary before the kthread returns.
 */
static inline void wait_for_kthread_stop(void)
{
	while (!kthread_should_stop()) {
		set_current_state(TASK_INTERRUPTIBLE);
		schedule();
	}
}

in drivers/md/bcache/bcache.h could possibly avoid losing wakeups?

AFAICS, it can be called while in TASK_RUNNING. Suppose kthread_stop()
gets called just after the check for kthread_should_stop(). Our thread
is still in TASK_RUNNING; kthread_stop() sets the flag for the next
kthread_should_stop() to observe and does wake_up_process() to our
thread. Which does nothing. Now our thread goes into TASK_INTERRUPTIBLE
and calls schedule(). Sure, as soon as it gets woken up it'll call
kthread_should_stop(), get true from it and that's it. What's going
to wake it up, though?

The same goes for e.g. fs/btrfs/disk-io.c:cleaner_kthread():
if (kthread_should_stop())
return 0;
if (!again) {
set_current_state(TASK_INTERRUPTIBLE);
schedule();
__set_current_state(TASK_RUNNING);
}
can't be right. Similar fun exists in e.g. fs/jfs, etc.

Am I missing something?
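
The interleaving described above can be replayed deterministically in a single-threaded userspace model. This is a sketch under stated assumptions: task_model and every model_ name are hypothetical, model_kthread_stop() covers only "set the flag, then wake the task", and model_schedule_sleeps() captures only that schedule() does not block a task whose state is still running. The second function uses the conventional order (set the state first, then test the flag) for comparison.

```c
#include <stdbool.h>

enum model_state { MODEL_RUNNING, MODEL_INTERRUPTIBLE };

/* Hypothetical cut-down task: scheduler state plus the stop flag. */
struct task_model {
	enum model_state state;
	bool should_stop;
};

/* Set the flag, then wake: waking makes the task runnable and is a
 * no-op if it already is. */
static void model_kthread_stop(struct task_model *t)
{
	t->should_stop = true;
	t->state = MODEL_RUNNING;
}

/* schedule() only blocks a task whose state is not running. */
static bool model_schedule_sleeps(const struct task_model *t)
{
	return t->state != MODEL_RUNNING;
}

/* Quoted order: test the flag, then set the state. The stop lands in
 * between. Returns true if the task blocks with no wakeup pending. */
static bool check_then_set_sleeps(void)
{
	struct task_model t = { MODEL_RUNNING, false };

	if (t.should_stop)
		return false;
	model_kthread_stop(&t);          /* racing stop: wakeup is a no-op */
	t.state = MODEL_INTERRUPTIBLE;
	return model_schedule_sleeps(&t);
}

/* Conventional order: set the state, then test the flag. The stop can
 * land before or after the test; neither placement loses the wakeup. */
static bool set_then_check_sleeps(bool stop_before_check)
{
	struct task_model t = { MODEL_RUNNING, false };

	t.state = MODEL_INTERRUPTIBLE;
	if (stop_before_check)
		model_kthread_stop(&t);
	if (t.should_stop) {
		t.state = MODEL_RUNNING;
		return false;
	}
	if (!stop_before_check)
		model_kthread_stop(&t);  /* wakeup resets the state */
	return model_schedule_sleeps(&t);
}
```

In the first ordering the wakeup arrives while the task is still running and is then overwritten by the state change, which is the loss being asked about.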

2022-01-10 15:01:03

by Eric W. Biederman

Subject: Re: [PATCH 06/10] exit: Implement kthread_exit

David Laight <[email protected]> writes:

> From: Eric W. Biederman
>> Sent: 08 January 2022 18:36
>>
>> Al Viro <[email protected]> writes:
>>
>> > IMO the right way to handle that would be
>> > 1) turn these two do_exit() into do_exit(0), to reduce
>> > confusion
>> > 2) deal with all do_exit() in kthread payloads. Your
>> > name for the primitive is fine, IMO.
>> > 3) make that primitive pass the return value by way of
>> > a field in struct kthread, adjusting kthread_stop() accordingly
>> > and passing 0 to do_exit() in kthread_exit() itself.
>> >
>> > (2) is not as trivial as you seem to hope, though. Your patches
>> > in drivers/staging/rt*/ had papered over the problem in there,
>> > but hadn't really solved it.
>> >
>> > thread_exit() should've been shot, all right, but it really ought
>> > to have been complete_and_exit() there. The thing is, complete()
>> > + return does *not* guarantee that driver won't get unloaded before
>> > the thread terminates. Possibly freeing its .code and leaving
>> > a thread to resume running in there as soon as it regains CPU.
>> >
>> > The point of complete_and_exit() is that it's noreturn *and* in
>> > core kernel. So it can be safely used in a modular kthread,
>> > if paired with wait_for_completion() in or before module_exit.
>> > complete() + do_exit() (or complete + return as you've gotten
>> > there) doesn't give such guarantees at all.
>>
>>
>> I think we are mostly in agreement here.
>>
>> There are kernel threads started by modules that do:
>> complete(...);
>> return 0;
>>
>> That should be at a minimum calling complete_and_exit. Possibly should
>> be restructured to use kthread_stop().
>
> There is also module_put_and_exit(0);
> Which must have an implied THIS_MODULE.

Later in the patch series I change
module_put_and_exit -> module_put_and_kthread_exit
complete_and_exit -> kthread_complete_and_exit

The problem that I understand Al was seeing was where people should
have been using complete_and_exit and were not.

Eric

2022-01-10 15:05:34

by Eric W. Biederman

Subject: Re: [PATCH 06/10] exit: Implement kthread_exit

Al Viro <[email protected]> writes:

> On Sat, Jan 08, 2022 at 12:35:40PM -0600, Eric W. Biederman wrote:
>
>> There are kernel threads started by modules that do:
>> complete(...);
>> return 0;
>>
>> That should be at a minimum calling complete_and_exit. Possibly should
>> be restructured to use kthread_stop().
>>
>> Some of those users of the now removed thread_exit() in staging are
>> among the offenders.
>>
>> However thread_exit() was implemented as:
>> #define thread_exit() complete_and_exit(NULL, 0)
>>
>> Which does nothing with a completion, it was just a really funny way to
>> spell "do_exit(0)".
>
> Yes. And there's a plenty of cargo-culting in that area.
>
>> While I agree digging through all of the kernel threads and finding the
>> ones that should be calling complete_and_exit is a fine idea. It is
>> a concern independent of these patches.
>
> BTW, could somebody explain how could this
> /*
> * Prevent the kthread exits directly, and make sure when kthread_stop()
> * is called to stop a kthread, it is still alive. If a kthread might be
> * stopped by CACHE_SET_IO_DISABLE bit set, wait_for_kthread_stop() is
> * necessary before the kthread returns.
> */
> static inline void wait_for_kthread_stop(void)
> {
> while (!kthread_should_stop()) {
> set_current_state(TASK_INTERRUPTIBLE);
> schedule();
> }
> }
>
> in drivers/md/bcache/bcache.h possibly avoid losing wakeups?
>
> AFAICS, it can be called while in TASK_RUNNING. Suppose kthread_stop()
> gets called just after the check for kthread_should_stop(). Our thread
> is still in TASK_RUNNING; kthread_stop() sets the flag for the next
> kthread_should_stop() to observe and does wake_up_process() to our
> thread. Which does nothing. Now our thread goes into TASK_INTERRUPTIBLE
> and calls schedule(). Sure, as soon as it gets woken up it'll call
> kthread_should_stop(), get true from it and that's it. What's going
> to wake it up, though?
>
> The same goes for e.g. fs/btrfs/disk-io.c:cleaner_kthread():
> if (kthread_should_stop())
> return 0;
> if (!again) {
> set_current_state(TASK_INTERRUPTIBLE);
> schedule();
> __set_current_state(TASK_RUNNING);
> }
> can't be right. Similar fun exists in e.g. fs/jfs, etc.
>
> Am I missing something?

Those examples look as suspect to me as they do to you.

Eric

2022-01-10 15:27:13

by Geert Uytterhoeven

Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

On Mon, Jan 3, 2022 at 10:33 PM Eric W. Biederman <[email protected]> wrote:
> The generic function ptrace_report_syscall does a little more
> than syscall_trace on m68k. The function ptrace_report_syscall
> stops early if PT_TRACED is not set, it sets ptrace_message,
> and returns the result of fatal_signal_pending.
>
> Setting ptrace_message to a passed in value of 0 is effectively not
> setting ptrace_message, making that additional work a noop.
>
> Returning the result of fatal_signal_pending and letting the caller
> ignore the result becomes a noop in this change.
>
> When a process is ptraced, the flag PT_PTRACED is always set in
> current->ptrace. Testing for PT_PTRACED in ptrace_report_syscall is
> just an optimization to fail early if the process is not ptraced.
> Later on in ptrace_notify, ptrace_stop will test current->ptrace under
> tasklist_lock and skip performing any work if the task is not ptraced.
>
> Cc: Geert Uytterhoeven <[email protected]>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

As this depends on the removal of a parameter from
ptrace_report_syscall() earlier in this series:
Acked-by: Geert Uytterhoeven <[email protected]>

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2022-01-10 16:20:13

Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

On Mon, Jan 10, 2022 at 04:26:57PM +0100, Geert Uytterhoeven wrote:
> On Mon, Jan 3, 2022 at 10:33 PM Eric W. Biederman <[email protected]> wrote:
> > The generic function ptrace_report_syscall does a little more
> > than syscall_trace on m68k. The function ptrace_report_syscall
> > stops early if PT_TRACED is not set, it sets ptrace_message,
> > and returns the result of fatal_signal_pending.
> >
> > Setting ptrace_message to a passed in value of 0 is effectively not
> > setting ptrace_message, making that additional work a noop.
> >
> > Returning the result of fatal_signal_pending and letting the caller
> > ignore the result becomes a noop in this change.
> >
> > When a process is ptraced, the flag PT_PTRACED is always set in
> > current->ptrace. Testing for PT_PTRACED in ptrace_report_syscall is
> > just an optimization to fail early if the process is not ptraced.
> > Later on in ptrace_notify, ptrace_stop will test current->ptrace under
> > tasklist_lock and skip performing any work if the task is not ptraced.
> >
> > Cc: Geert Uytterhoeven <[email protected]>
> > Signed-off-by: "Eric W. Biederman" <[email protected]>
>
> As this depends on the removal of a parameter from
> ptrace_report_syscall() earlier in this series:
> Acked-by: Geert Uytterhoeven <[email protected]>

FWIW, I would suggest taking it a bit further: make syscall_trace_enter()
and syscall_trace_leave() in m68k ptrace.c unconditional, replace the
calls of syscall_trace() in entry.S with syscall_trace_enter() and
syscall_trace_leave() resp. and remove syscall_trace().

Geert, do you see any problems with that? The only difference is that
current->ptrace_message would be set to 1 for ptrace stop on entry and
2 - on leave. Currently m68k just has it 0 all along.

It is user-visible (the whole point is to let the tracer see which
stop it is - entry or exit one), so somebody using PTRACE_GETEVENTMSG
on syscall stops would start seeing 1 or 2 instead of "0 all along".
That's how it works on all other architectures (including m68k-nommu),
and I doubt that anything in userland will get broken.

Behaviour of PTRACE_GETEVENTMSG for other stops (fork, etc.) remains
as-is, of course.

2022-01-10 16:25:25

by Al Viro

Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

On Mon, Jan 10, 2022 at 04:20:03PM +0000, Al Viro wrote:

> Geert, do you see any problems with that? The only difference is that
> current->ptrace_message would be set to 1 for ptrace stop on entry and
> 2 - on leave. Currently m68k just has it 0 all along.
>
> It is user-visible (the whole point is to let the tracer see which
> stop it is - entry or exit one), so somebody using PTRACE_GETEVENTMSG
> on syscall stops would start seeing 1 or 2 instead of "0 all along".
> That's how it works on all other architectures (including m68k-nommu),
> and I doubt that anything in userland will get broken.
>
> Behaviour of PTRACE_GETEVENTMSG for other stops (fork, etc.) remains
> as-is, of course.

Actually, the current behaviour is "report what the last PTRACE_GETEVENTMSG
has reported, whatever kind of stop that used to be for". So I very much
doubt that anything could break there.
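
For illustration only, this is roughly how a tracer could tell the two syscall stops apart once m68k reports 1 and 2 like the other architectures. It is a userspace sketch, not part of any patch in this thread. The names classify_eventmsg(), read_stop_kind() and the EVENTMSG_* constants are made up for the example; only the values 1 and 2 come from the discussion above. Anything else is treated as unknown, which also covers the stale value m68k currently returns.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Values discussed above: 1 on syscall entry, 2 on syscall exit. */
#define EVENTMSG_SYSCALL_ENTRY 1UL
#define EVENTMSG_SYSCALL_EXIT  2UL

enum stop_kind { STOP_UNKNOWN, STOP_SYSCALL_ENTRY, STOP_SYSCALL_EXIT };

/* Map a PTRACE_GETEVENTMSG value to the kind of syscall stop. */
enum stop_kind classify_eventmsg(unsigned long msg)
{
	if (msg == EVENTMSG_SYSCALL_ENTRY)
		return STOP_SYSCALL_ENTRY;
	if (msg == EVENTMSG_SYSCALL_EXIT)
		return STOP_SYSCALL_EXIT;
	return STOP_UNKNOWN;	/* 0, a stale value, or another event's data */
}

/* Read the event message of a tracee that is currently stopped.
 * Returns STOP_UNKNOWN if the ptrace request fails. */
enum stop_kind read_stop_kind(pid_t pid)
{
	unsigned long msg = 0;

	assert(pid > 0);
	if (ptrace(PTRACE_GETEVENTMSG, pid, NULL, &msg) == -1)
		return STOP_UNKNOWN;
	return classify_eventmsg(msg);
}
```

A tracer would call read_stop_kind() after waitpid() reports a syscall stop. A tracer that already copes with STOP_UNKNOWN keeps working on kernels that still leave the message untouched.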

2022-01-10 17:55:14

by Geert Uytterhoeven

Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

Hi Al,

CC Michael/m68k,

On Mon, Jan 10, 2022 at 5:20 PM Al Viro <[email protected]> wrote:
> On Mon, Jan 10, 2022 at 04:26:57PM +0100, Geert Uytterhoeven wrote:
> > On Mon, Jan 3, 2022 at 10:33 PM Eric W. Biederman <[email protected]> wrote:
> > > The generic function ptrace_report_syscall does a little more
> > > than syscall_trace on m68k. The function ptrace_report_syscall
> > > > stops early if PT_PTRACED is not set, it sets ptrace_message,
> > > and returns the result of fatal_signal_pending.
> > >
> > > Setting ptrace_message to a passed in value of 0 is effectively not
> > > setting ptrace_message, making that additional work a noop.
> > >
> > > Returning the result of fatal_signal_pending and letting the caller
> > > ignore the result becomes a noop in this change.
> > >
> > > When a process is ptraced, the flag PT_PTRACED is always set in
> > > current->ptrace. Testing for PT_PTRACED in ptrace_report_syscall is
> > > just an optimization to fail early if the process is not ptraced.
> > > Later on in ptrace_notify, ptrace_stop will test current->ptrace under
> > > tasklist_lock and skip performing any work if the task is not ptraced.
> > >
> > > Cc: Geert Uytterhoeven <[email protected]>
> > > Signed-off-by: "Eric W. Biederman" <[email protected]>
> >
> > As this depends on the removal of a parameter from
> > ptrace_report_syscall() earlier in this series:
> > Acked-by: Geert Uytterhoeven <[email protected]>
>
> FWIW, I would suggest taking it a bit further: make syscall_trace_enter()
> and syscall_trace_leave() in m68k ptrace.c unconditional, replace the
> calls of syscall_trace() in entry.S with syscall_trace_enter() and
> syscall_trace_leave() resp. and remove syscall_trace().
>
> Geert, do you see any problems with that? The only difference is that
> current->ptrace_message would be set to 1 for ptrace stop on entry and
> 2 - on leave. Currently m68k just has it 0 all along.
>
> It is user-visible (the whole point is to let the tracer see which
> stop it is - entry or exit one), so somebody using PTRACE_GETEVENTMSG
> on syscall stops would start seeing 1 or 2 instead of "0 all along".
> That's how it works on all other architectures (including m68k-nommu),
> and I doubt that anything in userland will get broken.
>
> Behaviour of PTRACE_GETEVENTMSG for other stops (fork, etc.) remains
> as-is, of course.

In fact Michael did so in "[PATCH v7 1/2] m68k/kernel - wire up
syscall_trace_enter/leave for m68k"[1], but that's still stuck...

[1] https://lore.kernel.org/r/[email protected]/

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2022-01-10 20:37:46

by Al Viro

Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

On Mon, Jan 10, 2022 at 06:54:57PM +0100, Geert Uytterhoeven wrote:

> In fact Michael did so in "[PATCH v7 1/2] m68k/kernel - wire up
> syscall_trace_enter/leave for m68k"[1], but that's still stuck...
>
> [1] https://lore.kernel.org/r/[email protected]/

Looks sane, but I'd split it in two - switch to calling syscall_trace_{enter,leave}
and then handling the return values...

The former would keep the current behaviour (modulo reporting enter vs. leave
via PTRACE_GETEVENTMSG), the latter would allow syscall number change by tracer
and/or handling of seccomp/audit/whatnot.

For exit+signal work the former would suffice, and IMO it would be a good idea
to put that one into a shared branch to be pulled both by seccomp and by signal
series. Would reduce the conflicts...

Objections?

2022-01-10 21:18:33

by Eric W. Biederman

Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

Al Viro <[email protected]> writes:

> On Mon, Jan 10, 2022 at 06:54:57PM +0100, Geert Uytterhoeven wrote:
>
>> In fact Michael did so in "[PATCH v7 1/2] m68k/kernel - wire up
>> syscall_trace_enter/leave for m68k"[1], but that's still stuck...
>>
>> [1] https://lore.kernel.org/r/[email protected]/
>
> Looks sane, but I'd split it in two - switch to calling syscall_trace_{enter,leave}
> and then handling the return values...
>
> The former would keep the current behaviour (modulo reporting enter vs. leave
> via PTRACE_GETEVENTMSG), the latter would allow syscall number change by tracer
> and/or handling of seccomp/audit/whatnot.
>
> For exit+signal work the former would suffice, and IMO it would be a good idea
> to put that one into a shared branch to be pulled both by seccomp and by signal
> series. Would reduce the conflicts...
>
> Objections?

I have the version that Geert ack'ed queued up for v5.17 in my
signal-for-v5.17 branch, along with a couple of other prior fixes in this
series of changes where it was clear they were just obviously correct
bug fixes. No need to delay the removal of profiling bits, for example.

I would love to see m68k call syscall_trace_{enter,leave}, but
just getting as far as ptrace_report_syscall will be enough to avoid any
dependencies on my side.

Eric

2022-01-10 23:00:14

by Olivier Langlois

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

On Mon, 2022-01-10 at 15:11 -0600, Eric W. Biederman wrote:
>
>
> I have been able to confirm that changing wait_event_interruptible to
> wait_event_killable was the culprit. Something about the way
> systemd-coredump handles coredumps is not compatible with
> wait_event_killable.

My experience too is that systemd-coredump is doing something
unexpected. When I tested the patch:
https://lore.kernel.org/lkml/[email protected]/

to make sure that it worked, sending coredumps to systemd-coredump
was making systemd-coredump, well, core dump... Not very useful...

Sending the dumps through a pipe to anything other than systemd-coredump
was working fine.

2022-01-11 01:34:01

by Michael Schmitz

Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

Hi Geert,

On 11.01.2022 at 06:54, Geert Uytterhoeven wrote:
> Hi Al,
>
> CC Michael/m68k,
>
> On Mon, Jan 10, 2022 at 5:20 PM Al Viro <[email protected]> wrote:
>> On Mon, Jan 10, 2022 at 04:26:57PM +0100, Geert Uytterhoeven wrote:
>>> On Mon, Jan 3, 2022 at 10:33 PM Eric W. Biederman <[email protected]> wrote:
>>>> The generic function ptrace_report_syscall does a little more
>>>> than syscall_trace on m68k. The function ptrace_report_syscall
>>>> stops early if PT_PTRACED is not set, it sets ptrace_message,
>>>> and returns the result of fatal_signal_pending.
>>>>
>>>> Setting ptrace_message to a passed in value of 0 is effectively not
>>>> setting ptrace_message, making that additional work a noop.
>>>>
>>>> Returning the result of fatal_signal_pending and letting the caller
>>>> ignore the result becomes a noop in this change.
>>>>
>>>> When a process is ptraced, the flag PT_PTRACED is always set in
>>>> current->ptrace. Testing for PT_PTRACED in ptrace_report_syscall is
>>>> just an optimization to fail early if the process is not ptraced.
>>>> Later on in ptrace_notify, ptrace_stop will test current->ptrace under
>>>> tasklist_lock and skip performing any work if the task is not ptraced.
>>>>
>>>> Cc: Geert Uytterhoeven <[email protected]>
>>>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>>>
>>> As this depends on the removal of a parameter from
>>> ptrace_report_syscall() earlier in this series:
>>> Acked-by: Geert Uytterhoeven <[email protected]>
>>
>> FWIW, I would suggest taking it a bit further: make syscall_trace_enter()
>> and syscall_trace_leave() in m68k ptrace.c unconditional, replace the
>> calls of syscall_trace() in entry.S with syscall_trace_enter() and
>> syscall_trace_leave() resp. and remove syscall_trace().
>>
>> Geert, do you see any problems with that? The only difference is that
>> current->ptrace_message would be set to 1 for ptrace stop on entry and
>> 2 - on leave. Currently m68k just has it 0 all along.
>>
>> It is user-visible (the whole point is to let the tracer see which
>> stop it is - entry or exit one), so somebody using PTRACE_GETEVENTMSG
>> on syscall stops would start seeing 1 or 2 instead of "0 all along".
>> That's how it works on all other architectures (including m68k-nommu),
>> and I doubt that anything in userland will get broken.
>>
>> Behaviour of PTRACE_GETEVENTMSG for other stops (fork, etc.) remains
>> as-is, of course.
>
> In fact Michael did so in "[PATCH v7 1/2] m68k/kernel - wire up
> syscall_trace_enter/leave for m68k"[1], but that's still stuck...
>
> [1] https://lore.kernel.org/r/[email protected]/

That patch (for reasons I never found out) did interact badly with
Christoph Hellwig's 'remove set_fs' patches (and Al's signal fixes which
Christoph's patches are based upon). Caused format errors under memory
stress tests quite reliably, on my 030 hardware.

Probably needs a fresh look - the signal return path got changed by Al's
patches IIRC, and I might have relied on offsets to data on the stack
that are no longer correct with these patches. Or there's a race between
the syscall trap and signal handling when returning from interrupt
context ...

Still school hols over here so I won't have much peace and quiet until
February.

Cheers,

Michael

2022-01-11 08:59:46

by Dmitry Osipenko

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

08.01.2022 21:13, Eric W. Biederman wrote:
> Dmitry Osipenko <[email protected]> writes:
>
>> 05.01.2022 22:58, Eric W. Biederman wrote:
>>>
>>> I have not yet been able to figure out how to run gst-plugin-scanner in
>>> a way that triggers this. In truth I can't figure out how to
>>> run gst-plugin-scanner in a useful way.
>>>
>>> I am going to set up some unit tests and see if I can reproduce your
>>> hang another way, but if you could give me some more information on what
>>> you are doing to trigger this I would appreciate it.
>>
>> Thanks, Eric. The distro is Arch Linux, but it's a development
>> environment where I'm running latest GStreamer from git master. I'll try
>> to figure out the reproduction steps and get back to you.
>
> Thank you.
>
> Until I can figure out why this is causing problems I have dropped the
> following two patches from my queue:
> signal: Make SIGKILL during coredumps an explicit special case
> signal: Drop signals received after a fatal signal has been processed
>
> I have replaced them with the following two patches that just do what
> is needed for the rest of the code in the series:
> signal: Have prepare_signal detect coredumps using
> signal: Make coredump handling explicit in complete_signal
>
> Perversely my failure to change the SIGKILL handling when coredumps are
> happening proves to me that I need to change the SIGKILL handling when
> coredumps are happening to make the code more maintainable.

Eric, thank you again. I started to look at the reproduction steps and
haven't finished yet. It turns out the problem affects only the older
NVIDIA Tegra2 Cortex-A9 CPU, which lacks support for the ARM NEON
instruction set, hence the problem isn't visible on x86 and other CPUs
out of the box. I'll need to check whether the problem can be simulated
on all arches or whether it's specific to VFP exception handling on ARM32.

2022-01-11 17:20:56

by Eric W. Biederman

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Dmitry Osipenko <[email protected]> writes:

> 08.01.2022 21:13, Eric W. Biederman wrote:
>> Dmitry Osipenko <[email protected]> writes:
>>
>>> 05.01.2022 22:58, Eric W. Biederman wrote:
>>>>
>>>> I have not yet been able to figure out how to run gst-plugin-scanner in
>>>> a way that triggers this. In truth I can't figure out how to
>>>> run gst-plugin-scanner in a useful way.
>>>>
>>>> I am going to set up some unit tests and see if I can reproduce your
>>>> hang another way, but if you could give me some more information on what
>>>> you are doing to trigger this I would appreciate it.
>>>
>>> Thanks, Eric. The distro is Arch Linux, but it's a development
>>> environment where I'm running latest GStreamer from git master. I'll try
>>> to figure out the reproduction steps and get back to you.
>>
>> Thank you.
>>
>> Until I can figure out why this is causing problems I have dropped the
>> following two patches from my queue:
>> signal: Make SIGKILL during coredumps an explicit special case
>> signal: Drop signals received after a fatal signal has been processed
>>
>> I have replaced them with the following two patches that just do what
>> is needed for the rest of the code in the series:
>> signal: Have prepare_signal detect coredumps using
>> signal: Make coredump handling explicit in complete_signal
>>
>> Perversely my failure to change the SIGKILL handling when coredumps are
>> happening proves to me that I need to change the SIGKILL handling when
>> coredumps are happening to make the code more maintainable.
>
> Eric, thank you again. I started to look at the reproduction steps and
> haven't finished yet. It turns out the problem affects only the older
> NVIDIA Tegra2 Cortex-A9 CPU, which lacks support for the ARM NEON
> instruction set, hence the problem isn't visible on x86 and other CPUs
> out of the box. I'll need to check whether the problem can be simulated
> on all arches or whether it's specific to VFP exception handling on ARM32.

It sounds like the gstreamer plugins only fail on certain hardware on
arm32, and things don't hang in coredumps unless the plugins fail.
That does make things tricky to minimize.

I have just verified that the known problematic code is not
in linux-next for Jan 11 2022.

If folks can, as they have time, double check linux-next and verify all is
well, I would appreciate it. I don't expect that there are problems, but
sometimes one problem hides another.

Eric

2022-01-11 17:28:35

by Eric W. Biederman

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Olivier Langlois <[email protected]> writes:

> On Mon, 2022-01-10 at 15:11 -0600, Eric W. Biederman wrote:
>>
>>
>> I have been able to confirm that changing wait_event_interruptible to
>> wait_event_killable was the culprit. Something about the way
>> systemd-coredump handles coredumps is not compatible with
>> wait_event_killable.
>
> This is my experience too that systemd-coredump is doing something
> unexpected. When I tested the patch:
> https://lore.kernel.org/lkml/[email protected]/
>
> to make sure that the patch worked, sending coredumps to systemd-
> coredump was making systemd-coredump, well, core dump... Not very
> useful...

Oh. Wow....

> Sending the dumps through a pipe to anything other than systemd-coredump
> was working fine.

Interesting.

I need to read through the pipe code and see how all of that works. For
writing directly to disk, the usual semantics are to ignore everything
but killable interruptions. Ordinary pipe code has different semantics,
and I suspect that is what is tripping things up.

As for systemd-coredump it does whatever it does and I suspect some
versions of systemd-coredump are simply not robust if a coredump stops
unexpectedly.

The good news is the pipe code is simple enough, it will be possible to
completely read through that code.

Eric

2022-01-11 18:51:55

by Eric W. Biederman

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

"Eric W. Biederman" <[email protected]> writes:

> Olivier Langlois <[email protected]> writes:
>
>> On Mon, 2022-01-10 at 15:11 -0600, Eric W. Biederman wrote:
>>>
>>>
>>> I have been able to confirm that changing wait_event_interruptible to
>>> wait_event_killable was the culprit. Something about the way
>>> systemd-coredump handles coredumps is not compatible with
>>> wait_event_killable.
>>
>> This is my experience too that systemd-coredump is doing something
>> unexpected. When I tested the patch:
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> to make sure that the patch worked, sending coredumps to systemd-
>> coredump was making systemd-coredump, well, core dump... Not very
>> useful...
>
> Oh. Wow....
>
>> Sending the dumps through a pipe to anything other than systemd-coredump
>> was working fine.
>
> Interesting.
>
> I need to read through the pipe code and see how all of that works. For
> writing directly to disk, the usual semantics are to ignore everything
> but killable interruptions. Ordinary pipe code has different semantics,
> and I suspect that is what is tripping things up.
>
> As for systemd-coredump it does whatever it does and I suspect some
> versions of systemd-coredump are simply not robust if a coredump stops
> unexpectedly.
>
> The good news is the pipe code is simple enough, it will be possible to
> completely read through that code.

My bug, obvious in hindsight, is that "try_to_wake_up(TASK_INTERRUPTIBLE)"
does not work on a task that is sleeping in TASK_KILLABLE.
That looks fixable in wait_for_dump_helpers; it just won't be as easy
as changing wait_event_interruptible to wait_event_killable.

To prevent a short pipe write from causing short writes during a coredump
I believe all we need to do is handle -ERESTARTSYS with TIF_NOTIFY_SIGNAL.
Something like what I have below.

Until wait_for_dump_helpers is sorted out the coredump won't wait for
the dump helper the way it should, but otherwise things should work.

diff --git a/fs/coredump.c b/fs/coredump.c
index 7dece20b162b..0db1baf91420 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -796,6 +796,10 @@ static int __dump_emit(struct coredump_params *cprm, const void *addr, int nr)
 	if (dump_interrupted())
 		return 0;
 	n = __kernel_write(file, addr, nr, &pos);
+	while ((n == -ERESTARTSYS) && test_thread_flag(TIF_NOTIFY_SIGNAL)) {
+		tracehook_notify_signal();
+		n = __kernel_write(file, addr, nr, &pos);
+	}
 	if (n != nr)
 		return 0;
 	file->f_pos = pos;

Eric
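
To spell out why the wakeup is lost: a wakeup only takes effect when the sleeper's state has a bit in common with the mask passed to the wakeup. What follows is a standalone userspace model, not kernel code. wake_matches() and killable_sleeper_examples() are made-up helpers, and the state values are assumed to match include/linux/sched.h around v5.16.

```c
#include <assert.h>
#include <stdbool.h>

/* Task state bits, assumed to match include/linux/sched.h around v5.16. */
#define TASK_INTERRUPTIBLE   0x0001u
#define TASK_UNINTERRUPTIBLE 0x0002u
#define TASK_WAKEKILL        0x0100u
#define TASK_KILLABLE        (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
#define TASK_NORMAL          (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)

/* Model of the state test in try_to_wake_up(): the task is only woken
 * if its current state intersects the requested mask. */
bool wake_matches(unsigned int task_state, unsigned int wake_mask)
{
	return (task_state & wake_mask) != 0;
}

/* The cases relevant to the coredump wait. */
void killable_sleeper_examples(void)
{
	/* An interruptible sleeper is woken by an interruptible wakeup. */
	assert(wake_matches(TASK_INTERRUPTIBLE, TASK_INTERRUPTIBLE));
	/* A killable sleeper shares no bit with TASK_INTERRUPTIBLE. */
	assert(!wake_matches(TASK_KILLABLE, TASK_INTERRUPTIBLE));
	/* A wakeup with TASK_NORMAL reaches it. */
	assert(wake_matches(TASK_KILLABLE, TASK_NORMAL));
}
```

In this model, switching only the sleeping side to killable leaves every interruptible-only wakeup without a matching bit, which is consistent with the hang described above.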

2022-01-11 19:20:20

by Linus Torvalds

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

On Tue, Jan 11, 2022 at 10:51 AM Eric W. Biederman
<[email protected]> wrote:
>
> +	while ((n == -ERESTARTSYS) && test_thread_flag(TIF_NOTIFY_SIGNAL)) {
> +		tracehook_notify_signal();
> +		n = __kernel_write(file, addr, nr, &pos);
> +	}

This reads horribly wrongly to me.

That "tracehook_notify_signal()" thing *has* to be renamed before we
have anything like this that otherwise looks like "this will just loop
forever".

I'm pretty sure we've discussed that "tracehook" thing before - the
whole header file is misnamed, and most of the functions in there are
too.

As an ugly alternative, open-code it, so that it's clear that "yup,
that clears the TIF_NOTIFY_SIGNAL flag".

Linus
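
A userspace model of the retry loop may make the objection concrete. With the flag clear written out in the loop body, it is visible that each iteration removes the condition that caused the retry. Hidden behind a helper named tracehook_notify_signal(), it is not. Everything here is a stand-in: fake_kernel_write(), notify_signal_pending, dump_write_retry() and retry_example() are hypothetical names, ERESTARTSYS is defined locally with the kernel-internal value 512, and the real helper is assumed to do more than clear a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ERESTARTSYS 512	/* kernel-internal errno, defined here for the model */

/* Stand-in for TIF_NOTIFY_SIGNAL on the current thread. */
static bool notify_signal_pending;

/* Stand-in for __kernel_write(): fails with -ERESTARTSYS while the
 * notification flag is set, otherwise "writes" everything. */
static long fake_kernel_write(const void *addr, size_t nr)
{
	(void)addr;
	if (notify_signal_pending)
		return -ERESTARTSYS;
	return (long)nr;
}

/* Retry loop with the flag clear open-coded. */
long dump_write_retry(const void *addr, size_t nr)
{
	long n = fake_kernel_write(addr, nr);

	while (n == -ERESTARTSYS && notify_signal_pending) {
		notify_signal_pending = false;	/* what the helper hides */
		n = fake_kernel_write(addr, nr);
	}
	return n;
}

/* One interrupted write followed by a successful retry. */
void retry_example(void)
{
	char buf[8] = { 0 };

	notify_signal_pending = true;
	assert(dump_write_retry(buf, sizeof(buf)) == (long)sizeof(buf));
	assert(!notify_signal_pending);
}
```

If the flag were never cleared inside the loop, this model would spin forever on the first interrupted write, which is what the unpatched naming suggests at a glance.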

2022-01-11 22:43:13

by Finn Thain

Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

On Tue, 11 Jan 2022, Michael Schmitz wrote:

> On 11.01.2022 at 06:54, Geert Uytterhoeven wrote:
> > Hi Al,
> >
> > CC Michael/m68k,
> >
> > On Mon, Jan 10, 2022 at 5:20 PM Al Viro <[email protected]> wrote:
> >> On Mon, Jan 10, 2022 at 04:26:57PM +0100, Geert Uytterhoeven wrote:
> >>> On Mon, Jan 3, 2022 at 10:33 PM Eric W. Biederman <[email protected]>
> >>> wrote:
> >>>> The generic function ptrace_report_syscall does a little more
> >>>> than syscall_trace on m68k. The function ptrace_report_syscall
> >>>> stops early if PT_PTRACED is not set, it sets ptrace_message,
> >>>> and returns the result of fatal_signal_pending.
> >>>>
> >>>> Setting ptrace_message to a passed in value of 0 is effectively not
> >>>> setting ptrace_message, making that additional work a noop.
> >>>>
> >>>> Returning the result of fatal_signal_pending and letting the caller
> >>>> ignore the result becomes a noop in this change.
> >>>>
> >>>> When a process is ptraced, the flag PT_PTRACED is always set in
> >>>> current->ptrace. Testing for PT_PTRACED in ptrace_report_syscall is
> >>>> just an optimization to fail early if the process is not ptraced.
> >>>> Later on in ptrace_notify, ptrace_stop will test current->ptrace under
> >>>> tasklist_lock and skip performing any work if the task is not ptraced.
> >>>>
> >>>> Cc: Geert Uytterhoeven <[email protected]>
> >>>> Signed-off-by: "Eric W. Biederman" <[email protected]>
> >>>
> >>> As this depends on the removal of a parameter from
> >>> ptrace_report_syscall() earlier in this series:
> >>> Acked-by: Geert Uytterhoeven <[email protected]>
> >>
> >> FWIW, I would suggest taking it a bit further: make syscall_trace_enter()
> >> and syscall_trace_leave() in m68k ptrace.c unconditional, replace the
> >> calls of syscall_trace() in entry.S with syscall_trace_enter() and
> >> syscall_trace_leave() resp. and remove syscall_trace().
> >>
> >> Geert, do you see any problems with that? The only difference is that
> >> current->ptrace_message would be set to 1 for ptrace stop on entry and
> >> 2 - on leave. Currently m68k just has it 0 all along.
> >>
> >> It is user-visible (the whole point is to let the tracer see which
> >> stop it is - entry or exit one), so somebody using PTRACE_GETEVENTMSG
> >> on syscall stops would start seeing 1 or 2 instead of "0 all along".
> >> That's how it works on all other architectures (including m68k-nommu),
> >> and I doubt that anything in userland will get broken.
> >>
> >> Behaviour of PTRACE_GETEVENTMSG for other stops (fork, etc.) remains
> >> as-is, of course.
> >
> > In fact Michael did so in "[PATCH v7 1/2] m68k/kernel - wire up
> > syscall_trace_enter/leave for m68k"[1], but that's still stuck...
> >
> > [1]
> > https://lore.kernel.org/r/[email protected]/
>
> That patch (for reasons I never found out) did interact badly with
> Christoph Hellwig's 'remove set_fs' patches (and Al's signal fixes which
> Christoph's patches are based upon). Caused format errors under memory
> stress tests quite reliably, on my 030 hardware.
>

Those patches have since been merged, BTW.

> Probably needs a fresh look - the signal return path got changed by Al's
> patches IIRC, and I might have relied on offsets to data on the stack
> that are no longer correct with these patches. Or there's a race between
> the syscall trap and signal handling when returning from interrupt
> context ...
>
> Still school hols over here so I won't have much peace and quiet until
> February.
>

So the patch works okay with Aranym 68040 but not Motorola 68030? Since
there is at least one known issue affecting both Motorola 68030 and Hatari
68030, perhaps this patch is not the problem. In any case, Al's suggestion
to split the patch into two may help in that testing two smaller patches
might narrow down the root cause.

2022-01-12 00:20:42

by Michael Schmitz

Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

Hi Finn,

On 12.01.2022 at 11:42, Finn Thain wrote:
> On Tue, 11 Jan 2022, Michael Schmitz wrote:
>>> In fact Michael did so in "[PATCH v7 1/2] m68k/kernel - wire up
>>> syscall_trace_enter/leave for m68k"[1], but that's still stuck...
>>>
>>> [1]
>>> https://lore.kernel.org/r/[email protected]/
>>
>> That patch (for reasons I never found out) did interact badly with
>> Christoph Hellwig's 'remove set_fs' patches (and Al's signal fixes which
>> Christoph's patches are based upon). Caused format errors under memory
>> stress tests quite reliably, on my 030 hardware.
>>
>
> Those patches have since been merged, BTW.

Yes, that's why I advised caution with mine.

>
>> Probably needs a fresh look - the signal return path got changed by Al's
>> patches IIRC, and I might have relied on offsets to data on the stack
>> that are no longer correct with these patches. Or there's a race between
>> the syscall trap and signal handling when returning from interrupt
>> context ...
>>
>> Still school hols over here so I won't have much peace and quiet until
>> February.
>>
>
> So the patch works okay with Aranym 68040 but not Motorola 68030? Since

Correct - I seem to recall we also tested those on your 040 and there
was no regression there, but I may be misremembering that.

> there is at least one known issue affecting both Motorola 68030 and Hatari
> 68030, perhaps this patch is not the problem. In any case, Al's suggestion

I hadn't ever made that connection, but it might be another explanation,
yes.

> to split the patch into two may help in that testing two smaller patches
> might narrow down the root cause.

That's certainly true.

What's the other reason these patches are still stuck, Geert? Did we
ever settle the dispute about what return code ought to abort a syscall
(in the seccomp context)?

Cheers,

Michael

2022-01-12 03:33:07

by Finn Thain

Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

On Wed, 12 Jan 2022, Michael Schmitz wrote:

>
> I seem to recall we also tested those on your 040 and there was no
> regression there, but I may be misremembering that.
>

I abandoned that regression testing exercise when unpatched mainline
kernels began failing on that machine. I'm in the process of setting up a
different 68040 machine.

2022-01-12 07:54:54

by Michael Schmitz

Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

Hi Finn,

On 12.01.2022 at 16:32, Finn Thain wrote:
> On Wed, 12 Jan 2022, Michael Schmitz wrote:
>
>>
>> I seem to recall we also tested those on your 040 and there was no
>> regression there, but I may be misremembering that.
>>
>
> I abandoned that regression testing exercise when unpatched mainline
> kernels began failing on that machine. I'm in the process of setting up a
> different 68040 machine.
>

Thanks for refreshing my memory!

Splitting my first patch as suggested by Al in order to defer handling
of the syscall_trace_enter() return code would achieve what Geert
suggested (eliminate m68k syscall_trace() altogether) without risk of
regression. This would need to replace Eric's patch 8.

Do you want me to send such a version based on my old patch series, or
would you rather prepare that yourself, Eric?

Cheers,

Michael

2022-01-12 07:55:47

by Geert Uytterhoeven

Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

Hi Michael,

On Wed, Jan 12, 2022 at 1:20 AM Michael Schmitz <[email protected]> wrote:
> On 12.01.2022 at 11:42, Finn Thain wrote:
> > On Tue, 11 Jan 2022, Michael Schmitz wrote:
> >>> In fact Michael did so in "[PATCH v7 1/2] m68k/kernel - wire up
> >>> syscall_trace_enter/leave for m68k"[1], but that's still stuck...
> >>>
> >>> [1]
> >>> https://lore.kernel.org/r/[email protected]/
> >>
> >> That patch (for reasons I never found out) did interact badly with
> >> Christoph Hellwig's 'remove set_fs' patches (and Al's signal fixes which
> >> Christoph's patches are based upon). Caused format errors under memory
> >> stress tests quite reliably, on my 030 hardware.
> >>
> >
> > Those patches have since been merged, BTW.
>
> Yes, that's why I advised caution with mine.
>
> >
> >> Probably needs a fresh look - the signal return path got changed by Al's
> >> patches IIRC, and I might have relied on offsets to data on the stack
> >> that are no longer correct with these patches. Or there's a race between
> >> the syscall trap and signal handling when returning from interrupt
> >> context ...
> >>
> >> Still school hols over here so I won't have much peace and quiet until
> >> February.
> >>
> >
> > So the patch works okay with Aranym 68040 but not Motorola 68030? Since
>
> Correct - I seem to recall we also tested those on your 040 and there
> was no regression there, but I may be misremembering that.
>
> > there is at least one known issue affecting both Motorola 68030 and Hatari
> > 68030, perhaps this patch is not the problem. In any case, Al's suggestion
>
> I hadn't ever made that connection, but it might be another explanation,
> yes.
>
> > to split the patch into two may help in that testing two smaller patches
> > might narrow down the root cause.
>
> That's certainly true.
>
> What's the other reason these patches are still stuck, Geert? Did we
> ever settle the dispute about what return code ought to abort a syscall
> (in the seccomp context)?

IIRC, some (self)tests were still failing?

Gr{oetje,eeting}s,

Geert

2022-01-12 08:05:44

by Michael Schmitz

Subject: Re: [PATCH 08/17] ptrace/m68k: Stop open coding ptrace_report_syscall

Hi Geert,

On 12.01.2022 at 20:55, Geert Uytterhoeven wrote:
> Hi Michael,
>
>> What's the other reason these patches are still stuck, Geert? Did we
>> ever settle the dispute about what return code ought to abort a syscall
>> (in the seccomp context)?
>
> IIRC, some (self)tests were still failing?

Too true - but I don't think my way of building the testsuite was
entirely according to the book. And I'm not sure I ran the testsuite
with more than one of the return code options. In all honesty, I had
been waiting for Adrian Glaubitz to test the patches with his seccomp
library port instead of relying on the testsuite.

Still, reason enough to split off the removal of syscall_trace() from
the seccomp stuff if it helps with Eric's patch series.

Cheers,

Michael

2022-01-15 07:38:28

by Eric W. Biederman

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Linus Torvalds <[email protected]> writes:

> On Tue, Jan 11, 2022 at 10:51 AM Eric W. Biederman
> <[email protected]> wrote:
>>
>> +	while ((n == -ERESTARTSYS) && test_thread_flag(TIF_NOTIFY_SIGNAL)) {
>> +		tracehook_notify_signal();
>> +		n = __kernel_write(file, addr, nr, &pos);
>> +	}
>
> This reads horribly wrongly to me.
>
> That "tracehook_notify_signal()" thing *has* to be renamed before we
> have anything like this that otherwise looks like "this will just loop
> forever".
>
> I'm pretty sure we've discussed that "tracehook" thing before - the
> whole header file is misnamed, and most of the functions in there are
> too.
>
> As an ugly alternative, open-code it, so that it's clear that "yup,
> that clears the TIF_NOTIFY_SIGNAL flag".

A cleaner alternative looks to be modifying the pipe code to use
wake_up_XXX instead of wake_up_interruptible_XXX and then having code
that does pipe_write_killable instead of pipe_write_interruptible.

There is also a question of how all of this should interact with the
freezer, as I think changing from interruptible to killable means that
the coredumps become unfreezable.

I am busily simmering this on my back burner and I hope I can come up
with something sensible.

Eric
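
For readers following the objection to the quoted loop: it only terminates because the notify helper clears the flag that the loop condition tests. Below is a minimal user-space sketch of that control flow. It assumes nothing about the real kernel internals; `emit_task`, `emit_write`, `emit_notify_signal` and `emit_dump` are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define ERESTARTSYS 512	/* same numeric value the kernel uses internally */

/* Hypothetical stand-in for a task (not a kernel structure). */
struct emit_task {
	bool notify_signal;	/* models TIF_NOTIFY_SIGNAL */
	int pending_raises;	/* how many more times the flag gets raised */
	int handled;		/* how many notifications were handled */
};

/* Models the notify helper: handles the work and clears the flag. */
static void emit_notify_signal(struct emit_task *t)
{
	t->notify_signal = false;
	t->handled++;
}

/*
 * Models an interruptible write: another context may raise the flag
 * first, and the write fails with -ERESTARTSYS while the flag is set.
 */
static long emit_write(struct emit_task *t, long nr)
{
	if (t->pending_raises > 0) {
		t->pending_raises--;
		t->notify_signal = true;
	}
	if (t->notify_signal)
		return -ERESTARTSYS;
	return nr;
}

/* Same shape as the quoted loop: retry while the flag explains the failure. */
static long emit_dump(struct emit_task *t, long nr)
{
	long n = emit_write(t, nr);

	while (n == -ERESTARTSYS && t->notify_signal) {
		emit_notify_signal(t);	/* without this the loop would spin forever */
		n = emit_write(t, nr);
	}
	return n;
}
```

In this model the loop runs once per raise of the flag and then returns the byte count, which is the property the function name fails to make obvious.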

2022-01-16 16:22:37

by Olivier Langlois

[permalink] [raw]

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

On Fri, 2022-01-14 at 18:12 -0600, Eric W. Biederman wrote:
> Linus Torvalds <[email protected]> writes:
>
> > On Tue, Jan 11, 2022 at 10:51 AM Eric W. Biederman
> > <[email protected]> wrote:
> > >
> > > +?????? while ((n == -ERESTARTSYS) &&
> > > test_thread_flag(TIF_NOTIFY_SIGNAL)) {
> > > +?????????????? tracehook_notify_signal();
> > > +?????????????? n = __kernel_write(file, addr, nr, &pos);
> > > +?????? }
> >
> > This reads horribly wrongly to me.
> >
> > That "tracehook_notify_signal()" thing *has* to be renamed before
> > we
> > have anything like this that otherwise looks like "this will just
> > loop
> > forever".
> >
> > I'm pretty sure we've discussed that "tracehook" thing before - the
> > > whole header file is misnamed, and most of the functions in there
> > are
> > too.
> >
> > As an ugly alternative, open-code it, so that it's clear that "yup,
> > that clears the TIF_NOTIFY_SIGNAL flag".
>
> A cleaner alternative looks like to modify the pipe code to use
> wake_up_XXX instead of wake_up_interruptible_XXX and then have code
> that does pipe_write_killable instead of pipe_write_interruptible.

Do not forget that the problem might not be limited to the pipe FS as
Oleg Nesterov pointed out here:

https://lore.kernel.org/io-uring/[email protected]/

This is why I liked your patch fixing __dump_emit. If the only
problem is the unclear name of the tracehook_notify_signal() function,
that should be addressed instead of trying to fix the problem in a
different way.
>
> There is also a question of how all of this should interact with the
> freezer, as I think changing from interruptible to killable means
> that
> the coredumps became unfreezable.
>
> I am busily simmering this on my back burner and I hope I can come up
> with something sensible.

IMHO, fixing the problem on the emit function side has the merit of
being future-proof if something other than io_uring were to raise the
TIF_NOTIFY_SIGNAL flag in the future.

But I am wondering why no one has commented on my proposal of
cancelling io_uring before generating the core dump, thereby stopping
it from flipping TIF_NOTIFY_SIGNAL while the core dump is generated.

Is there something wrong with my proposed approach?
https://lore.kernel.org/lkml/[email protected]/

It has flawlessly created many dozens of io_uring app core dumps for me
over the last months...

Olivier

2022-01-17 16:59:41

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead

On Fri, Jan 07, 2022 at 12:59:33PM -0600, Eric W. Biederman wrote:
> Assuming it won't be too much longer before the rest of the arches have
> set_fs/get_fs removed it looks like it makes sense to leave the
> force_uaccess_begin where it is, and just let force_uaccess_begin be
> removed when set_fs/get_fs are removed from the tree.
>
> Christoph does it look like the set_fs/get_fs removal work is going
> to stall indefinitely on some architectures? If so I think we want to
> find a way to get kernel threads to run with set_fs(USER_DS) on the
> stalled architectures. Otherwise I think we have a real hazard of
> introducing bugs that will only show up on the stalled architectures.

I really need help from the arch maintainers to finish the set_fs
removal. There have been very few arch maintainers helping with that
work (arm, arm64, parisc, m68k) in addition to the ones I did because
I have the test setups and knowledge. I'll send out another ping, but
for necrotic architectures like ia64 and sh I have very little hope.

2022-01-18 02:23:33

by Heiko Carstens

[permalink] [raw]

Subject: Re: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead

On Mon, Jan 17, 2022 at 12:05:41AM -0800, Christoph Hellwig wrote:
> On Fri, Jan 07, 2022 at 12:59:33PM -0600, Eric W. Biederman wrote:
> > Assuming it won't be too much longer before the rest of the arches have
> > set_fs/get_fs removed it looks like it makes sense to leave the
> > force_uaccess_begin where it is, and just let force_uaccess_begin be
> > removed when set_fs/get_fs are removed from the tree.
> >
> > Christoph does it look like the set_fs/get_fs removal work is going
> > to stall indefinitely on some architectures? If so I think we want to
> > find a way to get kernel threads to run with set_fs(USER_DS) on the
> > stalled architectures. Otherwise I think we have a real hazard of
> > introducing bugs that will only show up on the stalled architectures.
>
> I really need help from the arch maintainers to finish the set_fs
> removal. There have been very few arch maintainers helping with that
> work (arm, arm64, parisc, m68k) in addition to the ones I did because
> I have the test setups and knowledge. I'll send out another ping,

Just in case you missed it: s390 was converted with commit 87d598634521
("s390/mm: remove set_fs / rework address space handling").

2022-01-18 02:26:46

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead

On Mon, Jan 17, 2022 at 01:15:43PM +0100, Heiko Carstens wrote:
> > I really need help from the arch maintainers to finish the set_fs
> > removal. There have been very few arch maintainers helping with that
> > work (arm, arm64, parisc, m68k) in addition to the ones I did because
> > I have the test setups and knowledge. I'll send out another ping,
>
> Just in case you missed it: s390 was converted with commit 87d598634521
> ("s390/mm: remove set_fs / rework address space handling").

Sorry, I forgot about s390, which, as so often, was a model citizen here!

2022-01-18 02:27:29

by Arnd Bergmann

[permalink] [raw]

Subject: Re: [PATCH 03/10] exit: Move oops specific logic from do_exit into make_task_dead

On Mon, Jan 17, 2022 at 9:05 AM Christoph Hellwig <[email protected]> wrote:
>
> On Fri, Jan 07, 2022 at 12:59:33PM -0600, Eric W. Biederman wrote:
> > Assuming it won't be too much longer before the rest of the arches have
> > set_fs/get_fs removed it looks like it makes sense to leave the
> > force_uaccess_begin where it is, and just let force_uaccess_begin be
> > removed when set_fs/get_fs are removed from the tree.
> >
> > Christoph does it look like the set_fs/get_fs removal work is going
> > to stall indefinitely on some architectures? If so I think we want to
> > find a way to get kernel threads to run with set_fs(USER_DS) on the
> > stalled architectures. Otherwise I think we have a real hazard of
> > introducing bugs that will only show up on the stalled architectures.
>
> I really need help from the arch maintainers to finish the set_fs
> removal. There have been very few arch maintainers helping with that
> work (arm, arm64, parisc, m68k) in addition to the ones I did because
> I have the test setups and knowledge. I'll send out another ping,
> for necrotic architectures like ia64 and sh I have very little hope.

I did a conversion of microblaze for fun at some point, and I think I never
sent that out. I haven't tested it, but if this looks correct to you and
Michal, it could serve as a model for other trivial conversions.

I also looked into converting ia64 and sh at the same time, but I can't
find those patches now, so I think they were never complete.

Arnd

2022-01-18 02:27:46

by Arnd Bergmann

[permalink] [raw]

Subject: [PATCH] microblaze: remove CONFIG_SET_FS

From: Arnd Bergmann <[email protected]>

I picked microblaze as one of the architectures that still
use set_fs() and converted it not to.

Link: https://lore.kernel.org/lkml/CAK8P3a22ntk5fTuk6xjh1pyS-eVbGo7zDQSVkn2VG1xgp01D9g@mail.gmail.com/
Signed-off-by: Arnd Bergmann <[email protected]>
---
This is an old patch I found after Christoph asked about
conversions for the remaining architectures. I have no idea
about the state of this patch, but there is a reasonable
chance that it works.
---
arch/microblaze/Kconfig | 1 -
arch/microblaze/include/asm/thread_info.h | 6 ---
arch/microblaze/include/asm/uaccess.h | 56 ++++++++++-------------
arch/microblaze/kernel/asm-offsets.c | 1 -
4 files changed, 25 insertions(+), 39 deletions(-)

diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index 59798e43cdb0..1fb1cec087b7 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -42,7 +42,6 @@ config MICROBLAZE
select CPU_NO_EFFICIENT_FFS
select MMU_GATHER_NO_RANGE
select SPARSE_IRQ
- select SET_FS
select ZONE_DMA
select TRACE_IRQFLAGS_SUPPORT

diff --git a/arch/microblaze/include/asm/thread_info.h b/arch/microblaze/include/asm/thread_info.h
index 44f5ca331862..a0ddd2a36fb9 100644
--- a/arch/microblaze/include/asm/thread_info.h
+++ b/arch/microblaze/include/asm/thread_info.h
@@ -56,17 +56,12 @@ struct cpu_context {
__u32 fsr;
};

-typedef struct {
- unsigned long seg;
-} mm_segment_t;
-
struct thread_info {
struct task_struct *task; /* main task structure */
unsigned long flags; /* low level flags */
unsigned long status; /* thread-synchronous flags */
__u32 cpu; /* current CPU */
__s32 preempt_count; /* 0 => preemptable,< 0 => BUG*/
- mm_segment_t addr_limit; /* thread address space */

struct cpu_context cpu_context;
};
@@ -80,7 +75,6 @@ struct thread_info {
.flags = 0, \
.cpu = 0, \
.preempt_count = INIT_PREEMPT_COUNT, \
- .addr_limit = KERNEL_DS, \
}

/* how to get the thread information struct from C */
diff --git a/arch/microblaze/include/asm/uaccess.h b/arch/microblaze/include/asm/uaccess.h
index d2a8ef9f8978..346fe4618b27 100644
--- a/arch/microblaze/include/asm/uaccess.h
+++ b/arch/microblaze/include/asm/uaccess.h
@@ -16,45 +16,20 @@
#include <asm/extable.h>
#include <linux/string.h>

-/*
- * On Microblaze the fs value is actually the top of the corresponding
- * address space.
- *
- * The fs value determines whether argument validity checking should be
- * performed or not. If get_fs() == USER_DS, checking is performed, with
- * get_fs() == KERNEL_DS, checking is bypassed.
- *
- * For historical reasons, these macros are grossly misnamed.
- *
- * For non-MMU arch like Microblaze, KERNEL_DS and USER_DS is equal.
- */
-# define MAKE_MM_SEG(s) ((mm_segment_t) { (s) })
-
-# define KERNEL_DS MAKE_MM_SEG(0xFFFFFFFF)
-# define USER_DS MAKE_MM_SEG(TASK_SIZE - 1)
-
-# define get_fs() (current_thread_info()->addr_limit)
-# define set_fs(val) (current_thread_info()->addr_limit = (val))
-# define user_addr_max() get_fs().seg
-
-# define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg)
-
static inline int access_ok(const void __user *addr, unsigned long size)
{
if (!size)
goto ok;

- if ((get_fs().seg < ((unsigned long)addr)) ||
- (get_fs().seg < ((unsigned long)addr + size - 1))) {
- pr_devel("ACCESS fail at 0x%08x (size 0x%x), seg 0x%08x\n",
- (__force u32)addr, (u32)size,
- (u32)get_fs().seg);
+ if ((((unsigned long)addr) > TASK_SIZE) ||
+ (((unsigned long)addr + size - 1) > TASK_SIZE)) {
+ pr_devel("ACCESS fail at 0x%08x (size 0x%x)\n",
+ (__force u32)addr, (u32)size);
return 0;
}
ok:
- pr_devel("ACCESS OK at 0x%08x (size 0x%x), seg 0x%08x\n",
- (__force u32)addr, (u32)size,
- (u32)get_fs().seg);
+ pr_devel("ACCESS OK at 0x%08x (size 0x%x)\n",
+ (__force u32)addr, (u32)size);
return 1;
}

@@ -280,6 +255,25 @@ extern long __user_bad(void);
__gu_err; \
})

+#define __get_kernel_nofault(dst, src, type, label) \
+{ \
+ type __user *p = (type __force __user *)(src); \
+ type data; \
+ if (__get_user(data, p)) \
+ goto label; \
+ *(type *)dst = data; \
+}
+
+#define __put_kernel_nofault(dst, src, type, label) \
+{ \
+ type __user *p = (type __force __user *)(dst); \
+ type data = *(type *)src; \
+ if (__put_user(data, p)) \
+ goto label; \
+}
+
+#define HAVE_GET_KERNEL_NOFAULT
+
static inline unsigned long
raw_copy_from_user(void *to, const void __user *from, unsigned long n)
{
diff --git a/arch/microblaze/kernel/asm-offsets.c b/arch/microblaze/kernel/asm-offsets.c
index b77dd188dec4..47ee409508b1 100644
--- a/arch/microblaze/kernel/asm-offsets.c
+++ b/arch/microblaze/kernel/asm-offsets.c
@@ -86,7 +86,6 @@ int main(int argc, char *argv[])
/* struct thread_info */
DEFINE(TI_TASK, offsetof(struct thread_info, task));
DEFINE(TI_FLAGS, offsetof(struct thread_info, flags));
- DEFINE(TI_ADDR_LIMIT, offsetof(struct thread_info, addr_limit));
DEFINE(TI_CPU_CONTEXT, offsetof(struct thread_info, cpu_context));
DEFINE(TI_PREEMPT_COUNT, offsetof(struct thread_info, preempt_count));
BLANK();
--
2.29.2
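
The core of the conversion is that access_ok() compares against a fixed limit instead of a per-thread addr_limit. Below is a user-space sketch of that comparison, assuming a 32-bit address space and an arbitrary limit. `MODEL_TASK_SIZE` and `model_access_ok` are hypothetical; the real TASK_SIZE is defined by the architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical limit chosen for illustration only. */
#define MODEL_TASK_SIZE 0xC0000000u

/*
 * Returns 1 when [addr, addr + size - 1] lies at or below the fixed
 * limit, 0 otherwise. A zero-sized access is always accepted, mirroring
 * the early "goto ok" in the patch. This sketch does not guard against
 * addr + size wrapping around the 32-bit range.
 */
static int model_access_ok(uint32_t addr, uint32_t size)
{
	if (!size)
		return 1;

	if (addr > MODEL_TASK_SIZE || addr + size - 1 > MODEL_TASK_SIZE)
		return 0;

	return 1;
}
```

With no addr_limit left to consult, the result depends only on the arguments, which is why the field and its asm-offsets entry can be dropped.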

2022-01-18 02:37:00

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Olivier Langlois <[email protected]> writes:

> On Fri, 2022-01-14 at 18:12 -0600, Eric W. Biederman wrote:
>> Linus Torvalds <[email protected]> writes:
>>
>> > On Tue, Jan 11, 2022 at 10:51 AM Eric W. Biederman
>> > <[email protected]> wrote:
>> > >
>> > > +       while ((n == -ERESTARTSYS) &&
>> > > test_thread_flag(TIF_NOTIFY_SIGNAL)) {
>> > > +               tracehook_notify_signal();
>> > > +               n = __kernel_write(file, addr, nr, &pos);
>> > > +       }
>> >
>> > This reads horribly wrongly to me.
>> >
>> > That "tracehook_notify_signal()" thing *has* to be renamed before
>> > we
>> > have anything like this that otherwise looks like "this will just
>> > loop
>> > forever".
>> >
>> > I'm pretty sure we've discussed that "tracehook" thing before - the
>> > whole header file is misnamed, and most of the functions in there
>> > are
>> > too.
>> >
>> > As an ugly alternative, open-code it, so that it's clear that "yup,
>> > that clears the TIF_NOTIFY_SIGNAL flag".
>>
>> A cleaner alternative looks like to modify the pipe code to use
>> wake_up_XXX instead of wake_up_interruptible_XXX and then have code
>> that does pipe_write_killable instead of pipe_write_interruptible.
>
> Do not forget that the problem might not be limited to the pipe FS as
> Oleg Nesterov pointed out here:
>
> https://lore.kernel.org/io-uring/[email protected]/
>
> This is why I did like your patch fixing __dump_emit. If the only
> problem is the tracehook_notify_signal() function unclear name, that
> should be addressed instead of trying to fix the problem in a different
> way.

It might be that the fix is to run a portion of the exit_to_userspace
loop that does:

	if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
		handle_signal_work(regs, ti_work);

I am deep in brainstorm mode trying to find something that comes out
clean.

Oleg is right that, while sleeps in filesystems need to be
uninterruptible to be POSIX compliant and otherwise compatible with
traditional unix behavior, NFS has not always provided that compatibility.

>> There is also a question of how all of this should interact with the
>> freezer, as I think changing from interruptible to killable means
>> that
>> the coredumps became unfreezable.
>>
>> I am busily simmering this on my back burner and I hope I can come up
>> with something sensible.
>
> IMHO, fixing the problem on the emit function side has the merit of
> being future proof if something else than io_uring in the future would
> raise the TIF_NOTIFY_SIGNAL flag
>
> but I am wondering why no one commented anything about my proposal of
> cancelling io_uring before generating the core dump therefore stopping
> it to flip TIF_NOTIFY_SIGNAL while the core dump is generated.
>
> Is there something wrong with my proposed approach?
> https://lore.kernel.org/lkml/[email protected]/
>
> It did flawlessly created many dozens of io_uring app core dumps in the
> last months for me...

From my perspective I am not at all convinced that io_uring is the only
culprit.

Beyond that the purpose of a coredump is to snapshot the process as it
is, before anything is shut down, so that someone can examine the coredump
and figure out what failed. Running around changing the state of the
process has a very real chance of hiding what is going wrong.

Further, your change requires that there be a place for io_uring to clean
things up. Given that fundamentally that seems like the wrong thing to
me, I am not interested in making it easy to do what looks like the wrong
thing.

All of this may be perfection being the enemy of the good (especially as
your io_uring magic happens as a special case in do_coredump). My work
in this area is to remove hacks so I can be convinced the code works
100% of the time, so unfortunately I am not interested in picking up a
change that is only good enough. Someone else like Andrew Morton might
be.

None of that changes the fact that tracehook_notify_signal needs to be
renamed. That affects your approach and my proof of concept approach.
So renaming tracehook_notify_signal just needs to be done.

Eric

2022-01-18 02:58:10

by Eric W. Biederman

[permalink] [raw]

Subject: io_uring truncating coredumps

Subject updated to reflect the current discussion.

> Linus Torvalds <[email protected]> writes:

> But I really think it's wrong.
>
> You're trying to work around a problem the wrong way around. If a task
> is dead, and is dumping core, then signals just shouldn't matter in
> the first place, and thus the whole "TASK_INTERRUPTIBLE vs
> TASK_UNINTERRUPTIBLE" really shouldn't be an issue. The fact that it
> is an issue means there's something wrong in signaling, not in the
> pipe code.
>
> So I really think that's where the fix should be - on the signal delivery side.

Thinking about it from the perspective of not delivering the wake-ups,
fixing io_uring and coredumps in a non-hacky way looks comparatively
simple. The function task_work_add just needs to not wake anything up
after a process has started dying.

Something like the patch below.

The only tricky part I can see is making certain there are not any races
between task_work_add and do_coredump depending on task_work_add not
causing signal_pending to return true.

diff --git a/kernel/task_work.c b/kernel/task_work.c
index fad745c59234..5f941e377268 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -44,6 +44,9 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
work->next = head;
} while (cmpxchg(&task->task_works, head, work) != head);

+ if (notify && (task->signal->flags & SIGNAL_GROUP_EXIT))
+ return 0;
+
switch (notify) {
case TWA_NONE:
break;

Eric
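
The idea in the patch above can be modelled in user space as: push the work item onto a lock-free list as before, but skip the wake-up once the group-exit flag is set. This is a sketch under stated assumptions, not the kernel code. `twa_task`, `twa_work`, `twa_task_init` and `twa_add` are hypothetical names, a plain counter stands in for the wake-up, and the race against do_coredump mentioned above is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical work item, linked through ->next like callback_head. */
struct twa_work {
	struct twa_work *next;
};

/* Hypothetical task model: a work list, an exit flag, a wake-up counter. */
struct twa_task {
	_Atomic(struct twa_work *) works;
	bool group_exit;	/* models SIGNAL_GROUP_EXIT being set */
	int wakeups;		/* counts wake-ups that would be delivered */
};

static void twa_task_init(struct twa_task *t)
{
	atomic_init(&t->works, NULL);
	t->group_exit = false;
	t->wakeups = 0;
}

/* Queue the work, then notify unless the task has started dying. */
static int twa_add(struct twa_task *t, struct twa_work *w, bool notify)
{
	struct twa_work *head = atomic_load(&t->works);

	/* On failure the compare-exchange reloads head, so just retry. */
	do {
		w->next = head;
	} while (!atomic_compare_exchange_weak(&t->works, &head, w));

	/* The added check: work stays queued, but nothing is woken up. */
	if (notify && t->group_exit)
		return 0;

	if (notify)
		t->wakeups++;
	return 0;
}
```

Note that in this model the work is still queued after the flag is set; only the notification is suppressed.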

2022-01-19 19:05:06

by Linus Torvalds

[permalink] [raw]

Subject: Re: io_uring truncating coredumps

On Mon, Jan 17, 2022 at 8:47 PM Eric W. Biederman <[email protected]> wrote:
>
> Thinking about it from the perspective of not delivering the wake-ups
> fixing io_uring and coredumps in a non-hacky way looks comparatively
> simple. The function task_work_add just needs to not wake anything up
> after a process has started dying.
>
> Something like the patch below.

Hmm. Yes, I think this is the right direction.

That said, I think it should not add the work at all, and return
-ESRCH, the exact same way that it does for that work_exited
condition.

Because it's basically the same thing: the task is dead and shouldn't
do more work. In fact, task_work_run() is the thing that sets it to
&work_exited as it sees PF_EXITING, so it feels to me that THAT is
actually the issue here - we react to PF_EXITING too late. We react to
it *after* we've already added the work, and then we do that "no more
work" logic only after we've accepted those late work entries?

So my gut feel is that task_work_add() should just also test PF_EXITING.

And in fact, my gut feel is that PF_EXITING is too late anyway (it
happens after core-dumping, no?)

But I guess that thing may be on purpose, and maybe the act of dumping
core itself wants to do more work, and so that isn't an option?

So I don't think your patch is "right" as-is, and it all worries me,
but yes, I think this area is very much the questionable one.

I think that work stopping and the io_uring shutdown should probably
move earlier in the exit queue, but as mentioned above, maybe the work
addition boundary in particular really wants to be late because the
exit process itself still uses task works? ;(

Linus
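
The alternative suggested in this reply, refusing the work outright instead of queueing it silently, can be sketched the same way. This is a single-threaded user-space model with hypothetical names (`strict_task`, `strict_work`, `strict_add`). It ignores the ordering and race questions raised above and leaves open which kernel flag the `exiting` field would correspond to.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical work item. */
struct strict_work {
	struct strict_work *next;
};

/* Hypothetical task model: a work list and a "no more work" flag. */
struct strict_task {
	struct strict_work *works;
	bool exiting;
};

/* Reject new work with -ESRCH once the task is marked as exiting. */
static int strict_add(struct strict_task *t, struct strict_work *w)
{
	if (t->exiting)
		return -ESRCH;	/* the caller must handle the refusal */

	w->next = t->works;
	t->works = w;
	return 0;
}
```

The difference from the previous approach is visible to callers: they get an error and the list is left untouched, so no late entries need to be dealt with afterwards.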

2022-01-20 20:28:14

by Dmitry Osipenko

[permalink] [raw]

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

11.01.2022 20:20, Eric W. Biederman wrote:
> Dmitry Osipenko <[email protected]> writes:
>
>> 08.01.2022 21:13, Eric W. Biederman wrote:
>>> Dmitry Osipenko <[email protected]> writes:
>>>
>>>> 05.01.2022 22:58, Eric W. Biederman wrote:
>>>>>
>>>>> I have not yet been able to figure out how to run gst-pluggin-scanner in
>>>>> a way that triggers this yet. In truth I can't figure out how to
>>>>> run gst-pluggin-scanner in a useful way.
>>>>>
>>>>> I am going to set up some unit tests and see if I can reproduce your
>>>>> hang another way, but if you could give me some more information on what
>>>>> you are doing to trigger this I would appreciate it.
>>>>
>>>> Thanks, Eric. The distro is Arch Linux, but it's a development
>>>> environment where I'm running latest GStreamer from git master. I'll try
>>>> to figure out the reproduction steps and get back to you.
>>>
>>> Thank you.
>>>
>>> Until I can figure out why this is causing problems I have dropped the
>>> following two patches from my queue:
>>> signal: Make SIGKILL during coredumps an explicit special case
>>> signal: Drop signals received after a fatal signal has been processed
>>>
>>> I have replaced them with the following two patches that just do what
>>> is needed for the rest of the code in the series:
>>> signal: Have prepare_signal detect coredumps using
>>> signal: Make coredump handling explicit in complete_signal
>>>
>>> Perversely my failure to change the SIGKILL handling when coredumps are
>>> happening proves to me that I need to change the SIGKILL handling when
>>> coredumps are happening to make the code more maintainable.
>>
>> Eric, thank you again. I started to look at the reproduction steps and
>> haven't completed it yet. Turned out the problem affects only older
>> NVIDIA Tegra2 Cortex-A9 CPU that lacks support of ARM NEON instructions
>> set, hence the problem isn't visible on x86 and other CPUs out of the
>> box. I'll need to check whether the problem could be simulated on all
>> arches or maybe it's specific to VFP exception handling of ARM32.
>
> It sounds like the gstreamer plugins only fail on certain hardware on
> arm32, and things don't hang in coredumps unless the plugins fail.
> That does make things tricky to minimize.
>
> I have just verified that the known problematic code is not
> in linux-next for Jan 11 2022.
>
> If folks as they have time can double check linux-next and verify all is
> well I would appreciate it. I don't expect that there are problems but
> sometimes one problem hides another.

Hello Eric,

I reproduced the trouble on x86_64.

Here are the reproduction steps, using ArchLinux and linux-next-20211224:

```
sudo pacman -S base-devel git mesa glu meson wget
git clone https://github.com/grate-driver/gstreamer.git
cd gstreamer
git checkout sigill
meson --prefix=/usr -Dgst-plugins-base:playback=enabled -Dgst-devtools:validate=disabled build
cd build
sudo ninja install
wget https://www.peach.themazzone.com/big_buck_bunny_720p_h264.mov
rm -r ~/.cache/gstreamer-1.0
gst-play-1.0 ./big_buck_bunny_720p_h264.mov
```

The SIGILL, thrown by [1], causes the hang. There is no hang with the v5.16.1 kernel.

[1] https://github.com/grate-driver/gstreamer/commit/006f9a2ee6dcf7b31c9b5413815d6054d82a3b2f

2022-01-20 21:24:05

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

Dmitry Osipenko <[email protected]> writes:

> 11.01.2022 20:20, Eric W. Biederman wrote:
>> Dmitry Osipenko <[email protected]> writes:
>>
>>> 08.01.2022 21:13, Eric W. Biederman wrote:
>>>> Dmitry Osipenko <[email protected]> writes:
>>>>
>>>>> 05.01.2022 22:58, Eric W. Biederman wrote:
>>>>>>
>>>>>> I have not yet been able to figure out how to run gst-pluggin-scanner in
>>>>>> a way that triggers this yet. In truth I can't figure out how to
>>>>>> run gst-pluggin-scanner in a useful way.
>>>>>>
>>>>>> I am going to set up some unit tests and see if I can reproduce your
>>>>>> hang another way, but if you could give me some more information on what
>>>>>> you are doing to trigger this I would appreciate it.
>>>>>
>>>>> Thanks, Eric. The distro is Arch Linux, but it's a development
>>>>> environment where I'm running latest GStreamer from git master. I'll try
>>>>> to figure out the reproduction steps and get back to you.
>>>>
>>>> Thank you.
>>>>
>>>> Until I can figure out why this is causing problems I have dropped the
>>>> following two patches from my queue:
>>>> signal: Make SIGKILL during coredumps an explicit special case
>>>> signal: Drop signals received after a fatal signal has been processed
>>>>
>>>> I have replaced them with the following two patches that just do what
>>>> is needed for the rest of the code in the series:
>>>> signal: Have prepare_signal detect coredumps using
>>>> signal: Make coredump handling explicit in complete_signal
>>>>
>>>> Perversely my failure to change the SIGKILL handling when coredumps are
>>>> happening proves to me that I need to change the SIGKILL handling when
>>>> coredumps are happening to make the code more maintainable.
>>>
>>> Eric, thank you again. I started to look at the reproduction steps and
>>> haven't completed it yet. Turned out the problem affects only older
>>> NVIDIA Tegra2 Cortex-A9 CPU that lacks support of ARM NEON instructions
>>> set, hence the problem isn't visible on x86 and other CPUs out of the
>>> box. I'll need to check whether the problem could be simulated on all
>>> arches or maybe it's specific to VFP exception handling of ARM32.
>>
>> It sounds like the gstreamer plugins only fail on certain hardware on
>> arm32, and things don't hang in coredumps unless the plugins fail.
>> That does make things tricky to minimize.
>>
>> I have just verified that the known problematic code is not
>> in linux-next for Jan 11 2022.
>>
>> If folks as they have time can double check linux-next and verify all is
>> well I would appreciate it. I don't expect that there are problems but
>> sometimes one problem hides another.
>
> Hello Eric,
>
> I reproduced the trouble on x86_64.
>
> Here are the reproduction steps, using ArchLinux and linux-next-20211224:
>
> ```
> sudo pacman -S base-devel git mesa glu meson wget
> git clone https://github.com/grate-driver/gstreamer.git
> cd gstreamer
> git checkout sigill
> meson --prefix=/usr -Dgst-plugins-base:playback=enabled -Dgst-devtools:validate=disabled build
> cd build
> sudo ninja install
> wget https://www.peach.themazzone.com/big_buck_bunny_720p_h264.mov
> rm -r ~/.cache/gstreamer-1.0
> gst-play-1.0 ./big_buck_bunny_720p_h264.mov
> ```
>
> The SIGILL, thrown by [1], causes the hang. There is no hang using v5.16.1 kernel.
>
> [1] https://github.com/grate-driver/gstreamer/commit/006f9a2ee6dcf7b31c9b5413815d6054d82a3b2f

Thank you.

I will verify this works before I add my updated version to
my signal-for-v5.18 branch.

Have you by any chance tried a newer version of linux-next without
commit fbc11520b58a ("signal: Make SIGKILL during coredumps an explicit
special case") in it?

If not I will double check that my pulling the commit out does not break
in the case you have documented.

Eric

2022-01-20 21:26:15

by Dmitry Osipenko

[permalink] [raw]

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

18.01.2022 20:52, Eric W. Biederman wrote:
> Dmitry Osipenko <[email protected]> writes:
>
>> 11.01.2022 20:20, Eric W. Biederman wrote:
>>> Dmitry Osipenko <[email protected]> writes:
>>>
>>>> 08.01.2022 21:13, Eric W. Biederman wrote:
>>>>> Dmitry Osipenko <[email protected]> writes:
>>>>>
>>>>>> 05.01.2022 22:58, Eric W. Biederman wrote:
>>>>>>>
>>>>>>> I have not yet been able to figure out how to run gst-pluggin-scanner in
>>>>>>> a way that triggers this yet. In truth I can't figure out how to
>>>>>>> run gst-pluggin-scanner in a useful way.
>>>>>>>
>>>>>>> I am going to set up some unit tests and see if I can reproduce your
>>>>>>> hang another way, but if you could give me some more information on what
>>>>>>> you are doing to trigger this I would appreciate it.
>>>>>>
>>>>>> Thanks, Eric. The distro is Arch Linux, but it's a development
>>>>>> environment where I'm running latest GStreamer from git master. I'll try
>>>>>> to figure out the reproduction steps and get back to you.
>>>>>
>>>>> Thank you.
>>>>>
>>>>> Until I can figure out why this is causing problems I have dropped the
>>>>> following two patches from my queue:
>>>>> signal: Make SIGKILL during coredumps an explicit special case
>>>>> signal: Drop signals received after a fatal signal has been processed
>>>>>
>>>>> I have replaced them with the following two patches that just do what
>>>>> is needed for the rest of the code in the series:
>>>>> signal: Have prepare_signal detect coredumps using
>>>>> signal: Make coredump handling explicit in complete_signal
>>>>>
>>>>> Perversely my failure to change the SIGKILL handling when coredumps are
>>>>> happening proves to me that I need to change the SIGKILL handling when
>>>>> coredumps are happening to make the code more maintainable.
>>>>
>>>> Eric, thank you again. I started to look at the reproduction steps and
>>>> haven't completed it yet. Turned out the problem affects only older
>>>> NVIDIA Tegra2 Cortex-A9 CPU that lacks support of ARM NEON instructions
>>>> set, hence the problem isn't visible on x86 and other CPUs out of the
>>>> box. I'll need to check whether the problem could be simulated on all
>>>> arches or maybe it's specific to VFP exception handling of ARM32.
>>>
>>> It sounds like the gstreamer plugins only fail on certain hardware on
>>> arm32, and things don't hang in coredumps unless the plugins fail.
>>> That does make things tricky to minimize.
>>>
>>> I have just verified that the known problematic code is not
>>> in linux-next for Jan 11 2022.
>>>
>>> If folks as they have time can double check linux-next and verify all is
>>> well I would appreciate it. I don't expect that there are problems but
>>> sometimes one problem hides another.
>>
>> Hello Eric,
>>
>> I reproduced the trouble on x86_64.
>>
>> Here are the reproduction steps, using ArchLinux and linux-next-20211224:
>>
>> ```
>> sudo pacman -S base-devel git mesa glu meson wget
>> git clone https://github.com/grate-driver/gstreamer.git
>> cd gstreamer
>> git checkout sigill
>> meson --prefix=/usr -Dgst-plugins-base:playback=enabled -Dgst-devtools:validate=disabled build
>> cd build
>> sudo ninja install
>> wget https://www.peach.themazzone.com/big_buck_bunny_720p_h264.mov
>> rm -r ~/.cache/gstreamer-1.0
>> gst-play-1.0 ./big_buck_bunny_720p_h264.mov
>> ```
>>
>> The SIGILL, thrown by [1], causes the hang. There is no hang using v5.16.1 kernel.
>>
>> [1] https://github.com/grate-driver/gstreamer/commit/006f9a2ee6dcf7b31c9b5413815d6054d82a3b2f
>
> Thank you.
>
> I will verify this works before I add my updated version to
> my signal-for-v5.18 branch.
>
> Have you by any chance tried a newer version of linux-next without
> commit fbc11520b58a ("signal: Make SIGKILL during coredumps an explicit
> special case") in it?
>
> If not I will double check that my pulling the commit out does not break
> in the case you have documented.

Recent linux-next works fine.

2022-01-26 21:44:05

by Olivier Langlois

[permalink] [raw]

Subject: Re: [PATCH 1/8] signal: Make SIGKILL during coredumps an explicit special case

On Mon, 2022-01-17 at 10:09 -0600, Eric W. Biederman wrote:
> Olivier Langlois <[email protected]> writes:
> From my perspective I am not at all convinced that io_uring is the
> only
> culprit.
>
> Beyond that the purpose of a coredump is to snapshot the process as
> it
> is, before anything is shutdown so that someone can examine the
> coredump
> and figure out what failed. Running around changing the state of the
> process has a very real chance of hiding what is going wrong.
>
> Further your change requires that there be a place for io_uring to
> clean
> things up. Given that fundamentally that seems like the wrong thing
> to
> me I am not interested in making it easy to what looks like the wrong
> thing.
>
> All of this may be perfection being the enemy of the good (especially
> as
> your io_uring magic happens as a special case in do_coredump). My
> work
> in this area is to remove hacks so I can be convinced the code works
> 100% of the time so unfortunately I am not interested in pick up a
> change that is only good enough. Someone else like Andrew Morton
> might
> be.
>
>
Fair enough.

You do bring good points but I am not so sure about the second one
considering that the coredump is meant to be a snapshot and if io_uring
still runs, the state may change as the dump is generated anyway.

I'll follow with interest what you finally come up with but my mindset
when I wrote the patch was that there does not seem to be any benefit
keeping io_uring active while coredumping and it has the potential to
create nasty issues.

I did stumble into a core file truncation problem.

Pavel got that when modifying io_uring code:
https://lore.kernel.org/all/[email protected]/

and I find it very likely that keeping io_uring active while coredumping
might create new nasty but subtle issues down the road...

Greetings,
Olivier

2022-02-09 14:11:51

by Michal Simek

[permalink] [raw]

Subject: Re: [PATCH] microblaze: remove CONFIG_SET_FS

On 2/9/22 14:52, Christoph Hellwig wrote:
> On Wed, Feb 09, 2022 at 02:50:32PM +0100, Michal Simek wrote:
>> I can't see any issue with the patch when I run it on real HW.
>> Tested-by: Michal Simek <[email protected]>
>>
>> Christoph: Is there any recommended test suite which I should run?
>
> No. For architectures that already didn't use set_fs internally
> there is nothing specific to test. Running some perf or backtrace
> tests might be useful to check if the non-faulting kernel helpers
> work properly.

Thanks for the confirmation. Once Arnd sends v2 with an updated commit message,
I will queue it for the next release.

Thanks,
Michal

--
Michal Simek, Ing. (M.Eng), OpenPGP -> KeyID: FE3D1F91
w: http://www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel - Xilinx Microblaze
Maintainer of Linux kernel - Xilinx Zynq ARM and ZynqMP ARM64 SoCs
U-Boot custodian - Xilinx Microblaze/Zynq/ZynqMP/Versal SoCs

2022-02-09 15:23:20

by Michal Simek

[permalink] [raw]

Subject: Re: [PATCH] microblaze: remove CONFIG_SET_FS

On 2/9/22 15:40, Arnd Bergmann wrote:
> On Wed, Feb 9, 2022 at 2:50 PM Michal Simek <[email protected]> wrote:
>>
>> Hi Arnd,
>>
>> po 17. 1. 2022 v 14:28 odesílatel Arnd Bergmann <[email protected]> napsal:
>>>
>>> From: Arnd Bergmann <[email protected]>
>>>
>>> I picked microblaze as one of the architectures that still
>>> use set_fs() and converted it not to.
>>
>> Can you please update the commit message because what is above is not
>> the right one?
>
> Ah, sorry about that. I think you can copy from the openrisc patch,
> see https://lore.kernel.org/lkml/[email protected]/

Please do it. You are the author of this patch and we should follow the process.
A link to the riscv commit would also be useful.
Thanks for this work and for bringing this to my attention.

Thanks,
Michal

2022-02-09 15:43:23

by Arnd Bergmann

[permalink] [raw]

Subject: Re: [PATCH] microblaze: remove CONFIG_SET_FS

On Wed, Feb 9, 2022 at 2:50 PM Michal Simek <[email protected]> wrote:
>
> Hi Arnd,
>
> po 17. 1. 2022 v 14:28 odesílatel Arnd Bergmann <[email protected]> napsal:
> >
> > From: Arnd Bergmann <[email protected]>
> >
> > I picked microblaze as one of the architectures that still
> > use set_fs() and converted it not to.
>
> Can you please update the commit message because what is above is not
> the right one?

Ah, sorry about that. I think you can copy from the openrisc patch,
see https://lore.kernel.org/lkml/[email protected]/

Arnd

2022-02-09 17:34:08

by Arnd Bergmann

[permalink] [raw]

Subject: Re: [PATCH] microblaze: remove CONFIG_SET_FS

On Wed, Feb 9, 2022 at 3:44 PM Michal Simek <[email protected]> wrote:
> On 2/9/22 15:40, Arnd Bergmann wrote:
> > On Wed, Feb 9, 2022 at 2:50 PM Michal Simek <[email protected]> wrote:
> >>
> >> Hi Arnd,
> >>
> >> po 17. 1. 2022 v 14:28 odesílatel Arnd Bergmann <[email protected]> napsal:
> >>>
> >>> From: Arnd Bergmann <[email protected]>
> >>>
> >>> I picked microblaze as one of the architectures that still
> >>> use set_fs() and converted it not to.
> >>
> >> Can you please update the commit message because what is above is not
> >> the right one?
> >
> > Ah, sorry about that. I think you can copy from the openrisc patch,
> > see https://lore.kernel.org/lkml/[email protected]/
>
> Please do it. You are the author of this patch and we should follow the process.

Done.

Looking at it again, I wonder if it would help to use the __get_kernel_nofault()
and __put_kernel_nofault() helpers as the default in
include/asm-generic/uaccess.h.

I see it's identical to the openrisc version and would probably be the same
for some of the other architectures that have no other use for set_fs().
That may help to do a bulk remove of set_fs for alpha, arc, csky, h8300,
hexagon, nds32, nios2, um and xtensa, leaving only ia64, sparc and sh.

Arnd

2022-02-09 18:34:27

by Michal Simek

[permalink] [raw]

Subject: Re: [PATCH] microblaze: remove CONFIG_SET_FS

Hi Arnd,

po 17. 1. 2022 v 14:28 odesílatel Arnd Bergmann <[email protected]> napsal:
>
> From: Arnd Bergmann <[email protected]>
>
> I picked microblaze as one of the architectures that still
> use set_fs() and converted it not to.

Can you please update the commit message because what is above is not
the right one?

I can't see any issue with the patch when I run it on real HW.
Tested-by: Michal Simek <[email protected]>

Christoph: Is there any recommended test suite which I should run?

Thanks,
Michal

--
Michal Simek, Ing. (M.Eng), OpenPGP -> KeyID: FE3D1F91
w: http://www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel - Xilinx Microblaze
Maintainer of Linux kernel - Xilinx Zynq ARM and ZynqMP ARM64 SoCs
U-Boot custodian - Xilinx Microblaze/Zynq/ZynqMP/Versal SoCs

2022-02-09 23:54:23

by Stafford Horne

[permalink] [raw]

Subject: Re: [PATCH] microblaze: remove CONFIG_SET_FS

On Wed, Feb 09, 2022 at 03:54:54PM +0100, Arnd Bergmann wrote:
> On Wed, Feb 9, 2022 at 3:44 PM Michal Simek <[email protected]> wrote:
> > On 2/9/22 15:40, Arnd Bergmann wrote:
> > > On Wed, Feb 9, 2022 at 2:50 PM Michal Simek <[email protected]> wrote:
> > >>
> > >> Hi Arnd,
> > >>
> > >> po 17. 1. 2022 v 14:28 odesílatel Arnd Bergmann <[email protected]> napsal:
> > >>>
> > >>> From: Arnd Bergmann <[email protected]>
> > >>>
> > >>> I picked microblaze as one of the architectures that still
> > >>> use set_fs() and converted it not to.
> > >>
> > >> Can you please update the commit message because what is above is not
> > >> the right one?
> > >
> > > Ah, sorry about that. I think you can copy from the openrisc patch,
> > > see https://lore.kernel.org/lkml/[email protected]/
> >
> > Please do it. You are the author of this patch and we should follow the process.
>
> Done.
>
> Looking at it again, I wonder if it would help to use the __get_kernel_nofault()
> and __get_kernel_nofault() helpers as the default in
> include/asm-generic/uaccess.h.

That would make sense. Perhaps also the __range_ok() function from OpenRISC
could move there as I think other architectures would also want to use that.

> I see it's identical to the openrisc version and would probably be the same
> for some of the other architectures that have no other use for
> set_fs(). That may
> help to do a bulk remove of set_fs for alpha, arc, csky, h8300, hexagon, nds32,
> nios2, um and extensa, leaving only ia64, sparc and sh.

If you could add it into include/asm-generic/uaccess.h I can test changing my
patch to use it.

-Stafford

2022-02-11 01:45:00

by Stafford Horne

[permalink] [raw]

Subject: Re: [PATCH] microblaze: remove CONFIG_SET_FS

On Thu, Feb 10, 2022 at 08:31:05AM +0900, Stafford Horne wrote:
> On Wed, Feb 09, 2022 at 03:54:54PM +0100, Arnd Bergmann wrote:
> > On Wed, Feb 9, 2022 at 3:44 PM Michal Simek <[email protected]> wrote:
> > > On 2/9/22 15:40, Arnd Bergmann wrote:
> > > > On Wed, Feb 9, 2022 at 2:50 PM Michal Simek <[email protected]> wrote:
> > > >>
> > > >> Hi Arnd,
> > > >>
> > > >> po 17. 1. 2022 v 14:28 odesílatel Arnd Bergmann <[email protected]> napsal:
> > > >>>
> > > >>> From: Arnd Bergmann <[email protected]>
> > > >>>
> > > >>> I picked microblaze as one of the architectures that still
> > > >>> use set_fs() and converted it not to.
> > > >>
> > > >> Can you please update the commit message because what is above is not
> > > >> the right one?
> > > >
> > > > Ah, sorry about that. I think you can copy from the openrisc patch,
> > > > see https://lore.kernel.org/lkml/[email protected]/
> > >
> > > Please do it. You are the author of this patch and we should follow the process.
> >
> > Done.
> >
> > Looking at it again, I wonder if it would help to use the __get_kernel_nofault()
> > and __get_kernel_nofault() helpers as the default in
> > include/asm-generic/uaccess.h.
>
> That would make sense. Perhaps also the __range_ok() function from OpenRISC
> could move there as I think other architectures would also want to use that.
>
> > I see it's identical to the openrisc version and would probably be the same
> > for some of the other architectures that have no other use for
> > set_fs(). That may
> > help to do a bulk remove of set_fs for alpha, arc, csky, h8300, hexagon, nds32,
> > nios2, um and extensa, leaving only ia64, sparc and sh.
>
> If you could add it into include/asm-generic/uaccess.h I can test changing my
> patch to use it.

Note, I would be happy to do the work to move these into include/asm-generic/uaccess.h.
But as I see it the existing include/asm-generic/uaccess.h is for NOMMU. How
should we go about having an MMU and NOMMU version? Should we move uaccess.h to
uaccess-nommu.h? Or add more ifdefs to uaccess.h?

-Stafford

2022-02-11 21:41:02

by Arnd Bergmann

[permalink] [raw]

Subject: Re: [PATCH] microblaze: remove CONFIG_SET_FS

On Fri, Feb 11, 2022 at 1:17 AM Stafford Horne <[email protected]> wrote:
> On Thu, Feb 10, 2022 at 08:31:05AM +0900, Stafford Horne wrote:

> > > Looking at it again, I wonder if it would help to use the __get_kernel_nofault()
> > > and __get_kernel_nofault() helpers as the default in
> > > include/asm-generic/uaccess.h.
> >
> > That would make sense. Perhaps also the __range_ok() function from OpenRISC
> > could move there as I think other architectures would also want to use that.

I have now uploaded a cleanup series to
https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=set_fs

This uses the same access_ok() function across almost all architectures,
with the exception of those that need something else, and then I went
further and killed off set_fs for everything other than ia64.

> > > I see it's identical to the openrisc version and would probably be the same
> > > for some of the other architectures that have no other use for
> > > set_fs(). That may
> > > help to do a bulk remove of set_fs for alpha, arc, csky, h8300, hexagon, nds32,
> > > nios2, um and extensa, leaving only ia64, sparc and sh.
> >
> > If you could add it into include/asm-generic/uaccess.h I can test changing my
> > patch to use it.
>
> Note, I would be happy to do the work to move these into include/asm-generic/uaccess.h.
> But as I see it the existing include/asm-generic/uaccess.h is for NOMMU. How
> should we go about having an MMU and NOMMU version? Should we move uaccess.h to
> uaccess-nommu.h? Or add more ifdefs to uaccess.h?

There are two parts of asm-generic/uaccess.h:

- the CONFIG_UACCESS_MEMCPY section is fundamentally limited to nommu
targets and cannot be shared. Similarly, targets with an MMU must use a custom
implementation to get the correct fixups.

- the put_user/get_user implementation is fairly dumb, you can use these to
avoid having your own ones, but you still need copy_{to,from}_user, and
a custom implementation tends to produce better code.

So chances are that you won't want to use either one. In my new branch,
I added the common helpers to linux/uaccess.h and asm-generic/access-ok.h,
respectively, both of which are used everywhere now.

Arnd

2022-02-14 10:02:13

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH] microblaze: remove CONFIG_SET_FS

On Fri, Feb 11, 2022 at 9:00 AM Arnd Bergmann <[email protected]> wrote:
>
> I have now uploaded a cleanup series to
> https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=set_fs
>
> This uses the same access_ok() function across almost all
> architectures, with the exception of those that need something else,
> and I then I went further and killed off set_fs for everything other
> than ia64.

Thanks, looks good to me.

Can you say why you didn't convert ia64? I don't see any set_fs() use
there, except for the unaligned handler, which looks trivial to
remove. It looks like the only reason for it is kernel-mode unaligned
exceptions, which we should just turn fatal, I suspect (they already
get logged).

And ia64 people could make the unaligned handling do the kernel mode
case in emulate_load/store_int() - it doesn't look *that* painful.

But maybe you noticed something else?

It would be really good to just be able to say that set_fs() no longer
exists at all.

Linus

2022-02-14 10:33:07

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH] microblaze: remove CONFIG_SET_FS

On Fri, Feb 11, 2022 at 09:46:03AM -0800, Linus Torvalds wrote:
> Can you say why you didn't convert ia64? I don't see any set_fs() use
> there, except for the unaligned handler, which looks trivial to
> remove. It looks like the only reason for it is kernel-mode unaligned
> exceptions, which we should just turn fatal, I suspect (they already
> get logged).
>
> And ia64 people could make the unaligned handling do the kernel mode
> case in emulate_load/store_int() - it doesn't look *that* painful.

Are there any ia64 people left? :)

2022-02-14 14:42:12

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH] microblaze: remove CONFIG_SET_FS

I like the series a lot.

Superficial comments:

for nds32 is there any good reason why __get_user / __put_user check
the address limit directly? Maybe we should unify this and make it work
like the other architectures.

With "uaccess: add generic __{get,put}_kernel_nofault" we should be able
to remove HAVE_GET_KERNEL_NOFAULT entirely and just check if the helpers
are already defined in linux/uaccess.h.

The new generic __access_ok, and the 3 fixed up versions early on
have a whole lot of superfluous braces.

2022-02-14 16:30:42

by Arnd Bergmann

[permalink] [raw]

Subject: Re: [PATCH] microblaze: remove CONFIG_SET_FS

On Fri, Feb 11, 2022 at 6:46 PM Linus Torvalds
<[email protected]> wrote:
> On Fri, Feb 11, 2022 at 9:00 AM Arnd Bergmann <[email protected]> wrote:
> >
> > I have now uploaded a cleanup series to
> > https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=set_fs
> >
> > This uses the same access_ok() function across almost all
> > architectures, with the exception of those that need something else,
> > and I then I went further and killed off set_fs for everything other
> > than ia64.
>
> Thanks, looks good to me.
>
> Can you say why you didn't convert ia64? I don't see any set_fs() use
> there, except for the unaligned handler, which looks trivial to
> remove. It looks like the only reason for it is kernel-mode unaligned
> exceptions, which we should just turn fatal, I suspect (they already
> get logged).
>
> And ia64 people could make the unaligned handling do the kernel mode
> case in emulate_load/store_int() - it doesn't look *that* painful.
>
> But maybe you noticed something else?
>
> It would be really good to just be able to say that set_fs() no longer
> exists at all.

I had previously gotten stuck at ia64, but gave it another go now
and uploaded an updated branch with ia64 taken care of and another
patch to clean up bits afterwards.

I only gave it light testing so far, mainly building the defconfig for every
architecture. I'll post the series once the build bots are happy with the
branch overall.

Arnd

2022-02-14 19:21:11

by Arnd Bergmann

[permalink] [raw]

Subject: Re: [PATCH] microblaze: remove CONFIG_SET_FS

On Mon, Feb 14, 2022 at 8:50 AM Christoph Hellwig <[email protected]> wrote:
>
> I like the series a lot.
>
> Superficial comments:
>
> for nds32 is there any good reason why __get_user / __set_user check
> the address limit directly? Maybe we should unify this and make it work
> like the other architectures.

I've done that now, and am glad I did, because I uncovered that
put_user() was actually missing the check that got added to __get_user()
by accident.

> With "uaccess: add generic __{get,put}_kernel_nofault" we should be able
> to remove HAVE_GET_KERNEL_NOFAULT entirely and just check if the helpers
> are already defined in linux/uaccess.h.

Good idea, changed now as well.

> The new generic __access_ok, and the 3 fixed up version early on
> have a whole lot of superflous braces.

I'd prefer to leave those in, the logic is complex enough that I'd
rather make sure this is completely obvious to readers.

I got a few build bot reports about missing __user annotations that were
uncovered by the added type checking in access_ok, fixed those
as well.

I'm doing another test build after the last changes, will send it out in a bit
if there are no further regressions.

Arnd

2022-03-09 02:11:43

by Eric W. Biederman

[permalink] [raw]

Subject: [PATCH 00/13] Removing tracehook.h

While working on cleaning up do_exit I have been having to deal with the
code in tracehook.h. Unfortunately the code in tracehook.h does not
make sense as organized.

This set of changes reorganizes things so that tracehook.h no longer
exists, and so that its current contents are organized in a fashion
that is a little easier to understand.

The biggest change is that I lean into the fact that get_signal
always calls task_work_run and remove the logic that tried to
be smart and decouple task_work_run and get_signal as it has proven
to not be effective.
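
As a rough illustration of that coupling, here is a userspace model in
plain C. It is a sketch only: every model_* name is hypothetical, the
structures are simplified stand-ins for the kernel's task_struct and
callback_head, and locking and atomics are left out entirely.

```c
#include <stddef.h>

/* Hypothetical stand-in for a queued piece of task work. */
struct model_work {
	struct model_work *next;
	void (*func)(struct model_work *work);
};

/* Hypothetical stand-in for the per-task state that matters here. */
struct model_task {
	struct model_work *task_works;	/* pending work, NULL when empty */
	int signal_pending;		/* simplified pending-signal flag */
};

/* Counter and callback used only to observe that work actually ran. */
static int model_runs;

static void model_count(struct model_work *work)
{
	(void)work;
	model_runs++;
}

/* Cheap check in the spirit of a task_work_pending() helper. */
static int model_task_work_pending(const struct model_task *t)
{
	return t->task_works != NULL;
}

static void model_task_work_add(struct model_task *t, struct model_work *w)
{
	w->next = t->task_works;
	t->task_works = w;
}

/* Drain every queued item, unlinking each one before calling it. */
static void model_task_work_run(struct model_task *t)
{
	while (t->task_works) {
		struct model_work *w = t->task_works;

		t->task_works = w->next;
		w->func(w);
	}
}

/*
 * get_signal-like entry point: pending task work is always drained first,
 * then the (simplified) signal state is consumed. Returns 1 if a signal
 * was taken, 0 otherwise.
 */
static int model_get_signal(struct model_task *t)
{
	if (model_task_work_pending(t))
		model_task_work_run(t);

	if (!t->signal_pending)
		return 0;

	t->signal_pending = 0;
	return 1;
}
```

In this model a caller never has to decide separately whether task work
needs running: entering the signal path is enough.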

This is a conservative change and I am not changing how things
like signal_pending operate (although it is probably justified).

A new header resume_user_mode.h is added to hold resume_user_mode_work
which was previously known as tracehook_notify_resume.

Eric W. Biederman (13):
ptrace: Move ptrace_report_syscall into ptrace.h
ptrace/arm: Rename tracehook_report_syscall report_syscall
ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
ptrace: Remove arch_syscall_{enter,exit}_tracehook
ptrace: Remove tracehook_signal_handler
task_work: Remove unnecessary include from posix_timers.h
task_work: Introduce task_work_pending
task_work: Call tracehook_notify_signal from get_signal on all architectures
task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
resume_user_mode: Move to resume_user_mode.h
tracehook: Remove tracehook.h

MAINTAINERS | 1 -
arch/Kconfig | 5 +-
arch/alpha/kernel/ptrace.c | 5 +-
arch/alpha/kernel/signal.c | 4 +-
arch/arc/kernel/ptrace.c | 5 +-
arch/arc/kernel/signal.c | 4 +-
arch/arm/kernel/ptrace.c | 12 +-
arch/arm/kernel/signal.c | 4 +-
arch/arm64/kernel/ptrace.c | 14 +--
arch/arm64/kernel/signal.c | 4 +-
arch/csky/kernel/ptrace.c | 5 +-
arch/csky/kernel/signal.c | 4 +-
arch/h8300/kernel/ptrace.c | 5 +-
arch/h8300/kernel/signal.c | 4 +-
arch/hexagon/kernel/process.c | 4 +-
arch/hexagon/kernel/signal.c | 1 -
arch/hexagon/kernel/traps.c | 6 +-
arch/ia64/kernel/process.c | 4 +-
arch/ia64/kernel/ptrace.c | 6 +-
arch/ia64/kernel/signal.c | 1 -
arch/m68k/kernel/ptrace.c | 6 +-
arch/m68k/kernel/signal.c | 4 +-
arch/microblaze/kernel/ptrace.c | 5 +-
arch/microblaze/kernel/signal.c | 4 +-
arch/mips/kernel/ptrace.c | 5 +-
arch/mips/kernel/signal.c | 4 +-
arch/nds32/include/asm/syscall.h | 2 +-
arch/nds32/kernel/ptrace.c | 5 +-
arch/nds32/kernel/signal.c | 4 +-
arch/nios2/kernel/ptrace.c | 5 +-
arch/nios2/kernel/signal.c | 4 +-
arch/openrisc/kernel/ptrace.c | 5 +-
arch/openrisc/kernel/signal.c | 4 +-
arch/parisc/kernel/ptrace.c | 7 +-
arch/parisc/kernel/signal.c | 4 +-
arch/powerpc/kernel/ptrace/ptrace.c | 8 +-
arch/powerpc/kernel/signal.c | 4 +-
arch/riscv/kernel/ptrace.c | 5 +-
arch/riscv/kernel/signal.c | 4 +-
arch/s390/include/asm/entry-common.h | 1 -
arch/s390/kernel/ptrace.c | 1 -
arch/s390/kernel/signal.c | 5 +-
arch/sh/kernel/ptrace_32.c | 5 +-
arch/sh/kernel/signal_32.c | 4 +-
arch/sparc/kernel/ptrace_32.c | 5 +-
arch/sparc/kernel/ptrace_64.c | 5 +-
arch/sparc/kernel/signal32.c | 1 -
arch/sparc/kernel/signal_32.c | 4 +-
arch/sparc/kernel/signal_64.c | 4 +-
arch/um/kernel/process.c | 4 +-
arch/um/kernel/ptrace.c | 5 +-
arch/x86/kernel/ptrace.c | 1 -
arch/x86/kernel/signal.c | 5 +-
arch/x86/mm/tlb.c | 1 +
arch/xtensa/kernel/ptrace.c | 5 +-
arch/xtensa/kernel/signal.c | 4 +-
block/blk-cgroup.c | 2 +-
fs/coredump.c | 1 -
fs/exec.c | 1 -
fs/io-wq.c | 6 +-
fs/io_uring.c | 11 +-
fs/proc/array.c | 1 -
fs/proc/base.c | 1 -
include/asm-generic/syscall.h | 2 +-
include/linux/entry-common.h | 47 +-------
include/linux/entry-kvm.h | 2 +-
include/linux/posix-timers.h | 1 -
include/linux/ptrace.h | 78 ++++++++++++
include/linux/resume_user_mode.h | 64 ++++++++++
include/linux/sched/signal.h | 17 +++
include/linux/task_work.h | 5 +
include/linux/tracehook.h | 226 -----------------------------------
include/uapi/linux/ptrace.h | 2 +-
kernel/entry/common.c | 19 +--
kernel/entry/kvm.c | 9 +-
kernel/exit.c | 3 +-
kernel/livepatch/transition.c | 1 -
kernel/seccomp.c | 1 -
kernel/signal.c | 23 ++--
kernel/task_work.c | 4 +-
kernel/time/posix-cpu-timers.c | 1 +
mm/memcontrol.c | 2 +-
security/apparmor/domain.c | 1 -
security/selinux/hooks.c | 1 -
84 files changed, 317 insertions(+), 462 deletions(-)

Eric

2022-03-09 17:00:38

by Eric W. Biederman

[permalink] [raw]

Subject: [PATCH 01/13] ptrace: Move ptrace_report_syscall into ptrace.h

Move ptrace_report_syscall from tracehook.h into ptrace.h where it
belongs.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/ptrace.h | 27 +++++++++++++++++++++++++++
include/linux/tracehook.h | 26 --------------------------
2 files changed, 27 insertions(+), 26 deletions(-)

diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index 8aee2945ff08..91b1074edb4c 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -413,4 +413,31 @@ static inline void user_single_step_report(struct pt_regs *regs)
 extern int task_current_syscall(struct task_struct *target, struct syscall_info *info);

 extern void sigaction_compat_abi(struct k_sigaction *act, struct k_sigaction *oact);
+
+/*
+ * ptrace report for syscall entry and exit looks identical.
+ */
+static inline int ptrace_report_syscall(unsigned long message)
+{
+	int ptrace = current->ptrace;
+
+	if (!(ptrace & PT_PTRACED))
+		return 0;
+
+	current->ptrace_message = message;
+	ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
+
+	/*
+	 * this isn't the same as continuing with a signal, but it will do
+	 * for normal use. strace only continues with a signal if the
+	 * stopping signal is not SIGTRAP. -brl
+	 */
+	if (current->exit_code) {
+		send_sig(current->exit_code, current, 1);
+		current->exit_code = 0;
+	}
+
+	current->ptrace_message = 0;
+	return fatal_signal_pending(current);
+}
 #endif
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 88c007ab5ebc..998bc3863559 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -51,32 +51,6 @@
 #include <linux/blk-cgroup.h>
 struct linux_binprm;

-/*
- * ptrace report for syscall entry and exit looks identical.
- */
-static inline int ptrace_report_syscall(unsigned long message)
-{
-	int ptrace = current->ptrace;
-
-	if (!(ptrace & PT_PTRACED))
-		return 0;
-
-	current->ptrace_message = message;
-	ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
-
-	/*
-	 * this isn't the same as continuing with a signal, but it will do
-	 * for normal use. strace only continues with a signal if the
-	 * stopping signal is not SIGTRAP. -brl
-	 */
-	if (current->exit_code) {
-		send_sig(current->exit_code, current, 1);
-		current->exit_code = 0;
-	}
-
-	current->ptrace_message = 0;
-	return fatal_signal_pending(current);
-}

 /**
  * tracehook_report_syscall_entry - task is about to attempt a system call
--
2.29.2
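
As an aside for readers of the moved helper, the way the reported stop
signal is formed can be modelled in userspace C as below. This is a sketch
under assumptions: the MODEL_PT_* values are illustrative placeholders and
not the kernel's real flag values, and both model_* functions are
hypothetical names.

```c
#include <signal.h>

/* Illustrative flag values only; the real PT_* constants are kernel-internal. */
#define MODEL_PT_PTRACED	0x1u
#define MODEL_PT_TRACESYSGOOD	0x8u

/*
 * Returns the signal number reported at a syscall stop, or 0 when the task
 * is not being traced and therefore reports nothing.
 */
static int model_syscall_stop_signal(unsigned int ptrace_flags)
{
	if (!(ptrace_flags & MODEL_PT_PTRACED))
		return 0;

	return SIGTRAP | ((ptrace_flags & MODEL_PT_TRACESYSGOOD) ? 0x80 : 0);
}

/* Tracer side: bit 0x80 on top of SIGTRAP marks a syscall stop. */
static int model_is_syscall_stop(int stop_signal)
{
	return stop_signal == (SIGTRAP | 0x80);
}
```

With the "sysgood" flag set, a tracer can tell a syscall stop apart from
an ordinary SIGTRAP by looking at bit 0x80; without it, both look alike.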

2022-03-09 17:00:52

by Eric W. Biederman

[permalink] [raw]

Subject: [PATCH 04/13] ptrace: Remove arch_syscall_{enter,exit}_tracehook

These functions are always one-to-one wrappers around
ptrace_report_syscall_entry and ptrace_report_syscall_exit.
So directly call the functions they are wrapping instead.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/entry-common.h | 43 ++----------------------------------
kernel/entry/common.c | 4 ++--
2 files changed, 4 insertions(+), 43 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index a670e9fba7a9..9efbdda61f7a 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -79,26 +79,6 @@ static __always_inline void arch_check_user_regs(struct pt_regs *regs);
static __always_inline void arch_check_user_regs(struct pt_regs *regs) {}
#endif

-/**
- * arch_syscall_enter_tracehook - Wrapper around tracehook_report_syscall_entry()
- * @regs: Pointer to currents pt_regs
- *
- * Returns: 0 on success or an error code to skip the syscall.
- *
- * Defaults to tracehook_report_syscall_entry(). Can be replaced by
- * architecture specific code.
- *
- * Invoked from syscall_enter_from_user_mode()
- */
-static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs);
-
-#ifndef arch_syscall_enter_tracehook
-static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs)
-{
- return ptrace_report_syscall_entry(regs);
-}
-#endif
-
/**
* enter_from_user_mode - Establish state when coming from user mode
*
@@ -157,7 +137,7 @@ void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
* It handles the following work items:
*
* 1) syscall_work flag dependent invocations of
- * arch_syscall_enter_tracehook(), __secure_computing(), trace_sys_enter()
+ * ptrace_report_syscall_entry(), __secure_computing(), trace_sys_enter()
* 2) Invocation of audit_syscall_entry()
*/
long syscall_enter_from_user_mode_work(struct pt_regs *regs, long syscall);
@@ -279,25 +259,6 @@ static __always_inline void arch_exit_to_user_mode(void) { }
*/
void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal);

-/**
- * arch_syscall_exit_tracehook - Wrapper around tracehook_report_syscall_exit()
- * @regs: Pointer to currents pt_regs
- * @step: Indicator for single step
- *
- * Defaults to tracehook_report_syscall_exit(). Can be replaced by
- * architecture specific code.
- *
- * Invoked from syscall_exit_to_user_mode()
- */
-static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step);
-
-#ifndef arch_syscall_exit_tracehook
-static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step)
-{
- ptrace_report_syscall_exit(regs, step);
-}
-#endif
-
/**
* exit_to_user_mode - Fixup state when exiting to user mode
*
@@ -347,7 +308,7 @@ void syscall_exit_to_user_mode_work(struct pt_regs *regs);
* - rseq syscall exit
* - audit
* - syscall tracing
- * - tracehook (single stepping)
+ * - ptrace (single stepping)
*
* 2) Preparatory work
* - Exit to user mode loop (common TIF handling). Invokes
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index f52e57c4d6d8..f0b1daa1e8da 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -59,7 +59,7 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,

/* Handle ptrace */
if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
- ret = arch_syscall_enter_tracehook(regs);
+ ret = ptrace_report_syscall_entry(regs);
if (ret || (work & SYSCALL_WORK_SYSCALL_EMU))
return -1L;
}
@@ -253,7 +253,7 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)

step = report_single_step(work);
if (step || work & SYSCALL_WORK_SYSCALL_TRACE)
- arch_syscall_exit_tracehook(regs, step);
+ ptrace_report_syscall_exit(regs, step);
}

/*
--
2.29.2

2022-03-09 17:01:15

by Eric W. Biederman

Subject: [PATCH 03/13] ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h

Rename tracehook_report_syscall_{entry,exit} to
ptrace_report_syscall_{entry,exit} and place them in ptrace.h.

There is no longer any generic tracehook infrastructure, so give
these ptrace-specific functions ptrace-specific names.
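
For readers who want to see the renamed helpers in isolation, the
following user-space sketch mirrors their control flow as it appears in
the ptrace.h hunk of this patch. The struct pt_regs layout, the stub
bodies of ptrace_report_syscall() and user_single_step_report(), and
the two recording variables are assumptions made for illustration; the
real functions stop the task and notify the tracer.

```c
/* User-space sketch only: this struct pt_regs is a stand-in, not the kernel's. */
struct pt_regs { long dummy; };

/* Values as defined in include/uapi/linux/ptrace.h. */
#define PTRACE_EVENTMSG_SYSCALL_ENTRY	1
#define PTRACE_EVENTMSG_SYSCALL_EXIT	2

/* Hypothetical bookkeeping so the stubs have an observable effect. */
static unsigned long last_message;
static int single_step_reports;

/* Stub: the real function stops for the tracer; here it records the message. */
static int ptrace_report_syscall(unsigned long message)
{
	last_message = message;
	return 0;
}

/* Stub: the real function reports a single-step trap to the tracer. */
static void user_single_step_report(struct pt_regs *regs)
{
	(void)regs;
	single_step_reports++;
}

/* Same shape as the helper added to ptrace.h by this patch. */
static inline int ptrace_report_syscall_entry(struct pt_regs *regs)
{
	(void)regs;
	return ptrace_report_syscall(PTRACE_EVENTMSG_SYSCALL_ENTRY);
}

/* A nonzero step replaces the syscall-exit stop with a single-step report. */
static inline void ptrace_report_syscall_exit(struct pt_regs *regs, int step)
{
	if (step)
		user_single_step_report(regs);
	else
		ptrace_report_syscall(PTRACE_EVENTMSG_SYSCALL_EXIT);
}
```

In this sketch an exit report with step set never touches the event
message, which matches the if/else in the patch.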

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
arch/Kconfig | 2 +-
arch/alpha/kernel/ptrace.c | 5 ++-
arch/arc/kernel/ptrace.c | 5 ++-
arch/arm/kernel/ptrace.c | 5 ++-
arch/arm64/kernel/ptrace.c | 7 ++--
arch/csky/kernel/ptrace.c | 5 ++-
arch/h8300/kernel/ptrace.c | 5 ++-
arch/hexagon/kernel/traps.c | 6 ++--
arch/ia64/kernel/ptrace.c | 4 +--
arch/m68k/kernel/ptrace.c | 6 ++--
arch/microblaze/kernel/ptrace.c | 5 ++-
arch/mips/kernel/ptrace.c | 5 ++-
arch/nds32/include/asm/syscall.h | 2 +-
arch/nds32/kernel/ptrace.c | 5 ++-
arch/nios2/kernel/ptrace.c | 5 ++-
arch/openrisc/kernel/ptrace.c | 5 ++-
arch/parisc/kernel/ptrace.c | 7 ++--
arch/powerpc/kernel/ptrace/ptrace.c | 8 ++---
arch/riscv/kernel/ptrace.c | 5 ++-
arch/sh/kernel/ptrace_32.c | 5 ++-
arch/sparc/kernel/ptrace_32.c | 5 ++-
arch/sparc/kernel/ptrace_64.c | 5 ++-
arch/um/kernel/ptrace.c | 5 ++-
arch/xtensa/kernel/ptrace.c | 5 ++-
include/asm-generic/syscall.h | 2 +-
include/linux/entry-common.h | 6 ++--
include/linux/ptrace.h | 51 +++++++++++++++++++++++++++++
include/linux/tracehook.h | 51 -----------------------------
include/uapi/linux/ptrace.h | 2 +-
kernel/entry/common.c | 1 +
30 files changed, 109 insertions(+), 126 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 678a80713b21..a517a949eb1d 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -217,7 +217,7 @@ config TRACE_IRQFLAGS_SUPPORT
# asm/syscall.h supplying asm-generic/syscall.h interface
# linux/regset.h user_regset interfaces
# CORE_DUMP_USE_REGSET #define'd in linux/elf.h
-# TIF_SYSCALL_TRACE calls tracehook_report_syscall_{entry,exit}
+# TIF_SYSCALL_TRACE calls ptrace_report_syscall_{entry,exit}
# TIF_NOTIFY_RESUME calls tracehook_notify_resume()
# signal delivery calls tracehook_signal_handler()
#
diff --git a/arch/alpha/kernel/ptrace.c b/arch/alpha/kernel/ptrace.c
index 8c43212ae38e..a1a239ea002d 100644
--- a/arch/alpha/kernel/ptrace.c
+++ b/arch/alpha/kernel/ptrace.c
@@ -15,7 +15,6 @@
#include <linux/user.h>
#include <linux/security.h>
#include <linux/signal.h>
-#include <linux/tracehook.h>
#include <linux/audit.h>

#include <linux/uaccess.h>
@@ -323,7 +322,7 @@ asmlinkage unsigned long syscall_trace_enter(void)
unsigned long ret = 0;
struct pt_regs *regs = current_pt_regs();
if (test_thread_flag(TIF_SYSCALL_TRACE) &&
- tracehook_report_syscall_entry(current_pt_regs()))
+ ptrace_report_syscall_entry(current_pt_regs()))
ret = -1UL;
audit_syscall_entry(regs->r0, regs->r16, regs->r17, regs->r18, regs->r19);
return ret ?: current_pt_regs()->r0;
@@ -334,5 +333,5 @@ syscall_trace_leave(void)
{
audit_syscall_exit(current_pt_regs());
if (test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(current_pt_regs(), 0);
+ ptrace_report_syscall_exit(current_pt_regs(), 0);
}
diff --git a/arch/arc/kernel/ptrace.c b/arch/arc/kernel/ptrace.c
index 883391977fdf..54b419ac8bda 100644
--- a/arch/arc/kernel/ptrace.c
+++ b/arch/arc/kernel/ptrace.c
@@ -4,7 +4,6 @@
*/

#include <linux/ptrace.h>
-#include <linux/tracehook.h>
#include <linux/sched/task_stack.h>
#include <linux/regset.h>
#include <linux/unistd.h>
@@ -258,7 +257,7 @@ long arch_ptrace(struct task_struct *child, long request,

asmlinkage int syscall_trace_entry(struct pt_regs *regs)
{
- if (tracehook_report_syscall_entry(regs))
+ if (ptrace_report_syscall_entry(regs))
return ULONG_MAX;

return regs->r8;
@@ -266,5 +265,5 @@ asmlinkage int syscall_trace_entry(struct pt_regs *regs)

asmlinkage void syscall_trace_exit(struct pt_regs *regs)
{
- tracehook_report_syscall_exit(regs, 0);
+ ptrace_report_syscall_exit(regs, 0);
}
diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
index e5aa3237853d..bfe88c6e60d5 100644
--- a/arch/arm/kernel/ptrace.c
+++ b/arch/arm/kernel/ptrace.c
@@ -22,7 +22,6 @@
#include <linux/hw_breakpoint.h>
#include <linux/regset.h>
#include <linux/audit.h>
-#include <linux/tracehook.h>
#include <linux/unistd.h>

#include <asm/syscall.h>
@@ -843,8 +842,8 @@ static void report_syscall(struct pt_regs *regs, enum ptrace_syscall_dir dir)
regs->ARM_ip = dir;

if (dir == PTRACE_SYSCALL_EXIT)
- tracehook_report_syscall_exit(regs, 0);
- else if (tracehook_report_syscall_entry(regs))
+ ptrace_report_syscall_exit(regs, 0);
+ else if (ptrace_report_syscall_entry(regs))
current_thread_info()->abi_syscall = -1;

regs->ARM_ip = ip;
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index b7845575f86f..230a47b9189e 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -27,7 +27,6 @@
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>
#include <linux/regset.h>
-#include <linux/tracehook.h>
#include <linux/elf.h>

#include <asm/compat.h>
@@ -1818,11 +1817,11 @@ static void report_syscall(struct pt_regs *regs, enum ptrace_syscall_dir dir)
regs->regs[regno] = dir;

if (dir == PTRACE_SYSCALL_ENTER) {
- if (tracehook_report_syscall_entry(regs))
+ if (ptrace_report_syscall_entry(regs))
forget_syscall(regs);
regs->regs[regno] = saved_reg;
} else if (!test_thread_flag(TIF_SINGLESTEP)) {
- tracehook_report_syscall_exit(regs, 0);
+ ptrace_report_syscall_exit(regs, 0);
regs->regs[regno] = saved_reg;
} else {
regs->regs[regno] = saved_reg;
@@ -1832,7 +1831,7 @@ static void report_syscall(struct pt_regs *regs, enum ptrace_syscall_dir dir)
* tracer modifications to the registers may have rewound the
* state machine.
*/
- tracehook_report_syscall_exit(regs, 1);
+ ptrace_report_syscall_exit(regs, 1);
}
}

diff --git a/arch/csky/kernel/ptrace.c b/arch/csky/kernel/ptrace.c
index 1a5f54e0d272..0f7e7b653c72 100644
--- a/arch/csky/kernel/ptrace.c
+++ b/arch/csky/kernel/ptrace.c
@@ -12,7 +12,6 @@
#include <linux/sched/task_stack.h>
#include <linux/signal.h>
#include <linux/smp.h>
-#include <linux/tracehook.h>
#include <linux/uaccess.h>
#include <linux/user.h>

@@ -321,7 +320,7 @@ long arch_ptrace(struct task_struct *child, long request,
asmlinkage int syscall_trace_enter(struct pt_regs *regs)
{
if (test_thread_flag(TIF_SYSCALL_TRACE))
- if (tracehook_report_syscall_entry(regs))
+ if (ptrace_report_syscall_entry(regs))
return -1;

if (secure_computing() == -1)
@@ -339,7 +338,7 @@ asmlinkage void syscall_trace_exit(struct pt_regs *regs)
audit_syscall_exit(regs);

if (test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(regs, 0);
+ ptrace_report_syscall_exit(regs, 0);

if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
trace_sys_exit(regs, syscall_get_return_value(current, regs));
diff --git a/arch/h8300/kernel/ptrace.c b/arch/h8300/kernel/ptrace.c
index a11db009d0ea..a9898b27b756 100644
--- a/arch/h8300/kernel/ptrace.c
+++ b/arch/h8300/kernel/ptrace.c
@@ -12,7 +12,6 @@
#include <linux/errno.h>
#include <linux/ptrace.h>
#include <linux/audit.h>
-#include <linux/tracehook.h>
#include <linux/regset.h>
#include <linux/elf.h>

@@ -174,7 +173,7 @@ asmlinkage long do_syscall_trace_enter(struct pt_regs *regs)
long ret = 0;

if (test_thread_flag(TIF_SYSCALL_TRACE) &&
- tracehook_report_syscall_entry(regs))
+ ptrace_report_syscall_entry(regs))
/*
* Tracing decided this syscall should not happen.
* We'll return a bogus call number to get an ENOSYS
@@ -196,5 +195,5 @@ asmlinkage void do_syscall_trace_leave(struct pt_regs *regs)

step = test_thread_flag(TIF_SINGLESTEP);
if (step || test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(regs, step);
+ ptrace_report_syscall_exit(regs, step);
}
diff --git a/arch/hexagon/kernel/traps.c b/arch/hexagon/kernel/traps.c
index 1240f038cce0..6447763ce5a9 100644
--- a/arch/hexagon/kernel/traps.c
+++ b/arch/hexagon/kernel/traps.c
@@ -14,7 +14,7 @@
#include <linux/kdebug.h>
#include <linux/syscalls.h>
#include <linux/signal.h>
-#include <linux/tracehook.h>
+#include <linux/ptrace.h>
#include <asm/traps.h>
#include <asm/vm_fault.h>
#include <asm/syscall.h>
@@ -348,7 +348,7 @@ void do_trap0(struct pt_regs *regs)

/* allow strace to catch syscall args */
if (unlikely(test_thread_flag(TIF_SYSCALL_TRACE) &&
- tracehook_report_syscall_entry(regs)))
+ ptrace_report_syscall_entry(regs)))
return; /* return -ENOSYS somewhere? */

/* Interrupts should be re-enabled for syscall processing */
@@ -386,7 +386,7 @@ void do_trap0(struct pt_regs *regs)

/* allow strace to get the syscall return state */
if (unlikely(test_thread_flag(TIF_SYSCALL_TRACE)))
- tracehook_report_syscall_exit(regs, 0);
+ ptrace_report_syscall_exit(regs, 0);

break;
case TRAP_DEBUG:
diff --git a/arch/ia64/kernel/ptrace.c b/arch/ia64/kernel/ptrace.c
index 6a1439eaa050..6af64aae087d 100644
--- a/arch/ia64/kernel/ptrace.c
+++ b/arch/ia64/kernel/ptrace.c
@@ -1217,7 +1217,7 @@ syscall_trace_enter (long arg0, long arg1, long arg2, long arg3,
struct pt_regs regs)
{
if (test_thread_flag(TIF_SYSCALL_TRACE))
- if (tracehook_report_syscall_entry(&regs))
+ if (ptrace_report_syscall_entry(&regs))
return -ENOSYS;

/* copy user rbs to kernel rbs */
@@ -1243,7 +1243,7 @@ syscall_trace_leave (long arg0, long arg1, long arg2, long arg3,

step = test_thread_flag(TIF_SINGLESTEP);
if (step || test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(&regs, step);
+ ptrace_report_syscall_exit(&regs, step);

/* copy user rbs to kernel rbs */
if (test_thread_flag(TIF_RESTORE_RSE))
diff --git a/arch/m68k/kernel/ptrace.c b/arch/m68k/kernel/ptrace.c
index aa3a0b8d07e9..a0c99fe3118e 100644
--- a/arch/m68k/kernel/ptrace.c
+++ b/arch/m68k/kernel/ptrace.c
@@ -19,7 +19,7 @@
#include <linux/ptrace.h>
#include <linux/user.h>
#include <linux/signal.h>
-#include <linux/tracehook.h>
+#include <linux/ptrace.h>

#include <linux/uaccess.h>
#include <asm/page.h>
@@ -282,13 +282,13 @@ asmlinkage int syscall_trace_enter(void)
int ret = 0;

if (test_thread_flag(TIF_SYSCALL_TRACE))
- ret = tracehook_report_syscall_entry(task_pt_regs(current));
+ ret = ptrace_report_syscall_entry(task_pt_regs(current));
return ret;
}

asmlinkage void syscall_trace_leave(void)
{
if (test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(task_pt_regs(current), 0);
+ ptrace_report_syscall_exit(task_pt_regs(current), 0);
}
#endif /* CONFIG_COLDFIRE */
diff --git a/arch/microblaze/kernel/ptrace.c b/arch/microblaze/kernel/ptrace.c
index badd286882ae..5234d0c1dcaa 100644
--- a/arch/microblaze/kernel/ptrace.c
+++ b/arch/microblaze/kernel/ptrace.c
@@ -33,7 +33,6 @@
#include <linux/elf.h>
#include <linux/audit.h>
#include <linux/seccomp.h>
-#include <linux/tracehook.h>

#include <linux/errno.h>
#include <asm/processor.h>
@@ -140,7 +139,7 @@ asmlinkage unsigned long do_syscall_trace_enter(struct pt_regs *regs)
secure_computing_strict(regs->r12);

if (test_thread_flag(TIF_SYSCALL_TRACE) &&
- tracehook_report_syscall_entry(regs))
+ ptrace_report_syscall_entry(regs))
/*
* Tracing decided this syscall should not happen.
* We'll return a bogus call number to get an ENOSYS
@@ -161,7 +160,7 @@ asmlinkage void do_syscall_trace_leave(struct pt_regs *regs)

step = test_thread_flag(TIF_SINGLESTEP);
if (step || test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(regs, step);
+ ptrace_report_syscall_exit(regs, step);
}

void ptrace_disable(struct task_struct *child)
diff --git a/arch/mips/kernel/ptrace.c b/arch/mips/kernel/ptrace.c
index db7c5be1d4a3..567aec4abac0 100644
--- a/arch/mips/kernel/ptrace.c
+++ b/arch/mips/kernel/ptrace.c
@@ -27,7 +27,6 @@
#include <linux/smp.h>
#include <linux/security.h>
#include <linux/stddef.h>
-#include <linux/tracehook.h>
#include <linux/audit.h>
#include <linux/seccomp.h>
#include <linux/ftrace.h>
@@ -1317,7 +1316,7 @@ asmlinkage long syscall_trace_enter(struct pt_regs *regs, long syscall)
current_thread_info()->syscall = syscall;

if (test_thread_flag(TIF_SYSCALL_TRACE)) {
- if (tracehook_report_syscall_entry(regs))
+ if (ptrace_report_syscall_entry(regs))
return -1;
syscall = current_thread_info()->syscall;
}
@@ -1376,7 +1375,7 @@ asmlinkage void syscall_trace_leave(struct pt_regs *regs)
trace_sys_exit(regs, regs_return_value(regs));

if (test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(regs, 0);
+ ptrace_report_syscall_exit(regs, 0);

user_enter();
}
diff --git a/arch/nds32/include/asm/syscall.h b/arch/nds32/include/asm/syscall.h
index 90aa56c94af1..04d55ce18d50 100644
--- a/arch/nds32/include/asm/syscall.h
+++ b/arch/nds32/include/asm/syscall.h
@@ -39,7 +39,7 @@ syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
*
* It's only valid to call this when @task is stopped for system
* call exit tracing (due to TIF_SYSCALL_TRACE or TIF_SYSCALL_AUDIT),
- * after tracehook_report_syscall_entry() returned nonzero to prevent
+ * after ptrace_report_syscall_entry() returned nonzero to prevent
* the system call from taking place.
*
* This rolls back the register state in @regs so it's as if the
diff --git a/arch/nds32/kernel/ptrace.c b/arch/nds32/kernel/ptrace.c
index d0eda870fbc2..6a6988cf689d 100644
--- a/arch/nds32/kernel/ptrace.c
+++ b/arch/nds32/kernel/ptrace.c
@@ -3,7 +3,6 @@

#include <linux/ptrace.h>
#include <linux/regset.h>
-#include <linux/tracehook.h>
#include <linux/elf.h>
#include <linux/sched/task_stack.h>

@@ -103,7 +102,7 @@ void user_disable_single_step(struct task_struct *child)
asmlinkage int syscall_trace_enter(struct pt_regs *regs)
{
if (test_thread_flag(TIF_SYSCALL_TRACE)) {
- if (tracehook_report_syscall_entry(regs))
+ if (ptrace_report_syscall_entry(regs))
forget_syscall(regs);
}
return regs->syscallno;
@@ -113,6 +112,6 @@ asmlinkage void syscall_trace_leave(struct pt_regs *regs)
{
int step = test_thread_flag(TIF_SINGLESTEP);
if (step || test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(regs, step);
+ ptrace_report_syscall_exit(regs, step);

}
diff --git a/arch/nios2/kernel/ptrace.c b/arch/nios2/kernel/ptrace.c
index a6ea9e1b4f61..cd62f310778b 100644
--- a/arch/nios2/kernel/ptrace.c
+++ b/arch/nios2/kernel/ptrace.c
@@ -15,7 +15,6 @@
#include <linux/regset.h>
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
-#include <linux/tracehook.h>
#include <linux/uaccess.h>
#include <linux/user.h>

@@ -134,7 +133,7 @@ asmlinkage int do_syscall_trace_enter(void)
int ret = 0;

if (test_thread_flag(TIF_SYSCALL_TRACE))
- ret = tracehook_report_syscall_entry(task_pt_regs(current));
+ ret = ptrace_report_syscall_entry(task_pt_regs(current));

return ret;
}
@@ -142,5 +141,5 @@ asmlinkage int do_syscall_trace_enter(void)
asmlinkage void do_syscall_trace_exit(void)
{
if (test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(task_pt_regs(current), 0);
+ ptrace_report_syscall_exit(task_pt_regs(current), 0);
}
diff --git a/arch/openrisc/kernel/ptrace.c b/arch/openrisc/kernel/ptrace.c
index 4d60ae2a12fa..b971740fc2aa 100644
--- a/arch/openrisc/kernel/ptrace.c
+++ b/arch/openrisc/kernel/ptrace.c
@@ -22,7 +22,6 @@
#include <linux/ptrace.h>
#include <linux/audit.h>
#include <linux/regset.h>
-#include <linux/tracehook.h>
#include <linux/elf.h>

#include <asm/thread_info.h>
@@ -159,7 +158,7 @@ asmlinkage long do_syscall_trace_enter(struct pt_regs *regs)
long ret = 0;

if (test_thread_flag(TIF_SYSCALL_TRACE) &&
- tracehook_report_syscall_entry(regs))
+ ptrace_report_syscall_entry(regs))
/*
* Tracing decided this syscall should not happen.
* We'll return a bogus call number to get an ENOSYS
@@ -181,5 +180,5 @@ asmlinkage void do_syscall_trace_leave(struct pt_regs *regs)

step = test_thread_flag(TIF_SINGLESTEP);
if (step || test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(regs, step);
+ ptrace_report_syscall_exit(regs, step);
}
diff --git a/arch/parisc/kernel/ptrace.c b/arch/parisc/kernel/ptrace.c
index 65de6c4c9354..96ef6a6b66e5 100644
--- a/arch/parisc/kernel/ptrace.c
+++ b/arch/parisc/kernel/ptrace.c
@@ -15,7 +15,6 @@
#include <linux/elf.h>
#include <linux/errno.h>
#include <linux/ptrace.h>
-#include <linux/tracehook.h>
#include <linux/user.h>
#include <linux/personality.h>
#include <linux/regset.h>
@@ -316,7 +315,7 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
long do_syscall_trace_enter(struct pt_regs *regs)
{
if (test_thread_flag(TIF_SYSCALL_TRACE)) {
- int rc = tracehook_report_syscall_entry(regs);
+ int rc = ptrace_report_syscall_entry(regs);

/*
* As tracesys_next does not set %r28 to -ENOSYS
@@ -327,7 +326,7 @@ long do_syscall_trace_enter(struct pt_regs *regs)
if (rc) {
/*
* A nonzero return code from
- * tracehook_report_syscall_entry() tells us
+ * ptrace_report_syscall_entry() tells us
* to prevent the syscall execution. Skip
* the syscall call and the syscall restart handling.
*
@@ -381,7 +380,7 @@ void do_syscall_trace_exit(struct pt_regs *regs)
#endif

if (stepping || test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(regs, stepping);
+ ptrace_report_syscall_exit(regs, stepping);
}

diff --git a/arch/powerpc/kernel/ptrace/ptrace.c b/arch/powerpc/kernel/ptrace/ptrace.c
index c43f77e2ac31..f394b0d6473f 100644
--- a/arch/powerpc/kernel/ptrace/ptrace.c
+++ b/arch/powerpc/kernel/ptrace/ptrace.c
@@ -16,7 +16,7 @@
*/

#include <linux/regset.h>
-#include <linux/tracehook.h>
+#include <linux/ptrace.h>
#include <linux/audit.h>
#include <linux/context_tracking.h>
#include <linux/syscalls.h>
@@ -263,12 +263,12 @@ long do_syscall_trace_enter(struct pt_regs *regs)
flags = read_thread_flags() & (_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE);

if (flags) {
- int rc = tracehook_report_syscall_entry(regs);
+ int rc = ptrace_report_syscall_entry(regs);

if (unlikely(flags & _TIF_SYSCALL_EMU)) {
/*
* A nonzero return code from
- * tracehook_report_syscall_entry() tells us to prevent
+ * ptrace_report_syscall_entry() tells us to prevent
* the syscall execution, but we are not going to
* execute it anyway.
*
@@ -334,7 +334,7 @@ void do_syscall_trace_leave(struct pt_regs *regs)

step = test_thread_flag(TIF_SINGLESTEP);
if (step || test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(regs, step);
+ ptrace_report_syscall_exit(regs, step);
}

void __init pt_regs_check(void);
diff --git a/arch/riscv/kernel/ptrace.c b/arch/riscv/kernel/ptrace.c
index a89243730153..793c7da0554b 100644
--- a/arch/riscv/kernel/ptrace.c
+++ b/arch/riscv/kernel/ptrace.c
@@ -17,7 +17,6 @@
#include <linux/regset.h>
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
-#include <linux/tracehook.h>

#define CREATE_TRACE_POINTS
#include <trace/events/syscalls.h>
@@ -241,7 +240,7 @@ long arch_ptrace(struct task_struct *child, long request,
__visible int do_syscall_trace_enter(struct pt_regs *regs)
{
if (test_thread_flag(TIF_SYSCALL_TRACE))
- if (tracehook_report_syscall_entry(regs))
+ if (ptrace_report_syscall_entry(regs))
return -1;

/*
@@ -266,7 +265,7 @@ __visible void do_syscall_trace_exit(struct pt_regs *regs)
audit_syscall_exit(regs);

if (test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(regs, 0);
+ ptrace_report_syscall_exit(regs, 0);

#ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
diff --git a/arch/sh/kernel/ptrace_32.c b/arch/sh/kernel/ptrace_32.c
index 5281685f6ad1..d417988d9770 100644
--- a/arch/sh/kernel/ptrace_32.c
+++ b/arch/sh/kernel/ptrace_32.c
@@ -20,7 +20,6 @@
#include <linux/io.h>
#include <linux/audit.h>
#include <linux/seccomp.h>
-#include <linux/tracehook.h>
#include <linux/elf.h>
#include <linux/regset.h>
#include <linux/hw_breakpoint.h>
@@ -456,7 +455,7 @@ long arch_ptrace(struct task_struct *child, long request,
asmlinkage long do_syscall_trace_enter(struct pt_regs *regs)
{
if (test_thread_flag(TIF_SYSCALL_TRACE) &&
- tracehook_report_syscall_entry(regs)) {
+ ptrace_report_syscall_entry(regs)) {
regs->regs[0] = -ENOSYS;
return -1;
}
@@ -484,5 +483,5 @@ asmlinkage void do_syscall_trace_leave(struct pt_regs *regs)

step = test_thread_flag(TIF_SINGLESTEP);
if (step || test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(regs, step);
+ ptrace_report_syscall_exit(regs, step);
}
diff --git a/arch/sparc/kernel/ptrace_32.c b/arch/sparc/kernel/ptrace_32.c
index 5318174a0268..e7db48acb838 100644
--- a/arch/sparc/kernel/ptrace_32.c
+++ b/arch/sparc/kernel/ptrace_32.c
@@ -21,7 +21,6 @@
#include <linux/signal.h>
#include <linux/regset.h>
#include <linux/elf.h>
-#include <linux/tracehook.h>

#include <linux/uaccess.h>
#include <asm/cacheflush.h>
@@ -439,9 +438,9 @@ asmlinkage int syscall_trace(struct pt_regs *regs, int syscall_exit_p)

if (test_thread_flag(TIF_SYSCALL_TRACE)) {
if (syscall_exit_p)
- tracehook_report_syscall_exit(regs, 0);
+ ptrace_report_syscall_exit(regs, 0);
else
- ret = tracehook_report_syscall_entry(regs);
+ ret = ptrace_report_syscall_entry(regs);
}

return ret;
diff --git a/arch/sparc/kernel/ptrace_64.c b/arch/sparc/kernel/ptrace_64.c
index 2b92155db8a5..86a7eb5c27ba 100644
--- a/arch/sparc/kernel/ptrace_64.c
+++ b/arch/sparc/kernel/ptrace_64.c
@@ -25,7 +25,6 @@
#include <linux/audit.h>
#include <linux/signal.h>
#include <linux/regset.h>
-#include <linux/tracehook.h>
#include <trace/syscall.h>
#include <linux/compat.h>
#include <linux/elf.h>
@@ -1095,7 +1094,7 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs)
user_exit();

if (test_thread_flag(TIF_SYSCALL_TRACE))
- ret = tracehook_report_syscall_entry(regs);
+ ret = ptrace_report_syscall_entry(regs);

if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
trace_sys_enter(regs, regs->u_regs[UREG_G1]);
@@ -1118,7 +1117,7 @@ asmlinkage void syscall_trace_leave(struct pt_regs *regs)
trace_sys_exit(regs, regs->u_regs[UREG_I0]);

if (test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(regs, 0);
+ ptrace_report_syscall_exit(regs, 0);

if (test_thread_flag(TIF_NOHZ))
user_enter();
diff --git a/arch/um/kernel/ptrace.c b/arch/um/kernel/ptrace.c
index b425f47bddbb..bfaf6ab1ac03 100644
--- a/arch/um/kernel/ptrace.c
+++ b/arch/um/kernel/ptrace.c
@@ -6,7 +6,6 @@
#include <linux/audit.h>
#include <linux/ptrace.h>
#include <linux/sched.h>
-#include <linux/tracehook.h>
#include <linux/uaccess.h>
#include <asm/ptrace-abi.h>

@@ -135,7 +134,7 @@ int syscall_trace_enter(struct pt_regs *regs)
if (!test_thread_flag(TIF_SYSCALL_TRACE))
return 0;

- return tracehook_report_syscall_entry(regs);
+ return ptrace_report_syscall_entry(regs);
}

void syscall_trace_leave(struct pt_regs *regs)
@@ -151,7 +150,7 @@ void syscall_trace_leave(struct pt_regs *regs)
if (!test_thread_flag(TIF_SYSCALL_TRACE))
return;

- tracehook_report_syscall_exit(regs, 0);
+ ptrace_report_syscall_exit(regs, 0);
/* force do_signal() --> is_syscall() */
if (ptraced & PT_PTRACED)
set_thread_flag(TIF_SIGPENDING);
diff --git a/arch/xtensa/kernel/ptrace.c b/arch/xtensa/kernel/ptrace.c
index bb3f4797d212..323c678a691f 100644
--- a/arch/xtensa/kernel/ptrace.c
+++ b/arch/xtensa/kernel/ptrace.c
@@ -26,7 +26,6 @@
#include <linux/security.h>
#include <linux/signal.h>
#include <linux/smp.h>
-#include <linux/tracehook.h>
#include <linux/uaccess.h>

#define CREATE_TRACE_POINTS
@@ -550,7 +549,7 @@ int do_syscall_trace_enter(struct pt_regs *regs)
regs->areg[2] = -ENOSYS;

if (test_thread_flag(TIF_SYSCALL_TRACE) &&
- tracehook_report_syscall_entry(regs)) {
+ ptrace_report_syscall_entry(regs)) {
regs->areg[2] = -ENOSYS;
regs->syscall = NO_SYSCALL;
return 0;
@@ -583,5 +582,5 @@ void do_syscall_trace_leave(struct pt_regs *regs)
step = test_thread_flag(TIF_SINGLESTEP);

if (step || test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(regs, step);
+ ptrace_report_syscall_exit(regs, step);
}
diff --git a/include/asm-generic/syscall.h b/include/asm-generic/syscall.h
index 81695eb02a12..5a80fe728dc8 100644
--- a/include/asm-generic/syscall.h
+++ b/include/asm-generic/syscall.h
@@ -44,7 +44,7 @@ int syscall_get_nr(struct task_struct *task, struct pt_regs *regs);
*
* It's only valid to call this when @task is stopped for system
* call exit tracing (due to %SYSCALL_WORK_SYSCALL_TRACE or
- * %SYSCALL_WORK_SYSCALL_AUDIT), after tracehook_report_syscall_entry()
+ * %SYSCALL_WORK_SYSCALL_AUDIT), after ptrace_report_syscall_entry()
* returned nonzero to prevent the system call from taking place.
*
* This rolls back the register state in @regs so it's as if the
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 2e2b8d6140ed..a670e9fba7a9 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -3,7 +3,7 @@
#define __LINUX_ENTRYCOMMON_H

#include <linux/static_call_types.h>
-#include <linux/tracehook.h>
+#include <linux/ptrace.h>
#include <linux/syscalls.h>
#include <linux/seccomp.h>
#include <linux/sched.h>
@@ -95,7 +95,7 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs
#ifndef arch_syscall_enter_tracehook
static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs)
{
- return tracehook_report_syscall_entry(regs);
+ return ptrace_report_syscall_entry(regs);
}
#endif

@@ -294,7 +294,7 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step);
#ifndef arch_syscall_exit_tracehook
static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step)
{
- tracehook_report_syscall_exit(regs, step);
+ ptrace_report_syscall_exit(regs, step);
}
#endif

diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index 91b1074edb4c..5310f43e4762 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -440,4 +440,55 @@ static inline int ptrace_report_syscall(unsigned long message)
current->ptrace_message = 0;
return fatal_signal_pending(current);
}
+
+/**
+ * ptrace_report_syscall_entry - task is about to attempt a system call
+ * @regs: user register state of current task
+ *
+ * This will be called if %SYSCALL_WORK_SYSCALL_TRACE or
+ * %SYSCALL_WORK_SYSCALL_EMU have been set, when the current task has just
+ * entered the kernel for a system call. Full user register state is
+ * available here. Changing the values in @regs can affect the system
+ * call number and arguments to be tried. It is safe to block here,
+ * preventing the system call from beginning.
+ *
+ * Returns zero normally, or nonzero if the calling arch code should abort
+ * the system call. That must prevent normal entry so no system call is
+ * made. If @task ever returns to user mode after this, its register state
+ * is unspecified, but should be something harmless like an %ENOSYS error
+ * return. It should preserve enough information so that syscall_rollback()
+ * can work (see asm-generic/syscall.h).
+ *
+ * Called without locks, just after entering kernel mode.
+ */
+static inline __must_check int ptrace_report_syscall_entry(
+ struct pt_regs *regs)
+{
+ return ptrace_report_syscall(PTRACE_EVENTMSG_SYSCALL_ENTRY);
+}
+
+/**
+ * ptrace_report_syscall_exit - task has just finished a system call
+ * @regs: user register state of current task
+ * @step: nonzero if simulating single-step or block-step
+ *
+ * This will be called if %SYSCALL_WORK_SYSCALL_TRACE has been set, when
+ * the current task has just finished an attempted system call. Full
+ * user register state is available here. It is safe to block here,
+ * preventing signals from being processed.
+ *
+ * If @step is nonzero, this report is also in lieu of the normal
+ * trap that would follow the system call instruction because
+ * user_enable_block_step() or user_enable_single_step() was used.
+ * In this case, %SYSCALL_WORK_SYSCALL_TRACE might not be set.
+ *
+ * Called without locks, just before checking for pending signals.
+ */
+static inline void ptrace_report_syscall_exit(struct pt_regs *regs, int step)
+{
+ if (step)
+ user_single_step_report(regs);
+ else
+ ptrace_report_syscall(PTRACE_EVENTMSG_SYSCALL_EXIT);
+}
#endif
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 998bc3863559..819e82ac09bd 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -52,57 +52,6 @@
struct linux_binprm;

-/**
- * tracehook_report_syscall_entry - task is about to attempt a system call
- * @regs: user register state of current task
- *
- * This will be called if %SYSCALL_WORK_SYSCALL_TRACE or
- * %SYSCALL_WORK_SYSCALL_EMU have been set, when the current task has just
- * entered the kernel for a system call. Full user register state is
- * available here. Changing the values in @regs can affect the system
- * call number and arguments to be tried. It is safe to block here,
- * preventing the system call from beginning.
- *
- * Returns zero normally, or nonzero if the calling arch code should abort
- * the system call. That must prevent normal entry so no system call is
- * made. If @task ever returns to user mode after this, its register state
- * is unspecified, but should be something harmless like an %ENOSYS error
- * return. It should preserve enough information so that syscall_rollback()
- * can work (see asm-generic/syscall.h).
- *
- * Called without locks, just after entering kernel mode.
- */
-static inline __must_check int tracehook_report_syscall_entry(
- struct pt_regs *regs)
-{
- return ptrace_report_syscall(PTRACE_EVENTMSG_SYSCALL_ENTRY);
-}
-
-/**
- * tracehook_report_syscall_exit - task has just finished a system call
- * @regs: user register state of current task
- * @step: nonzero if simulating single-step or block-step
- *
- * This will be called if %SYSCALL_WORK_SYSCALL_TRACE has been set, when
- * the current task has just finished an attempted system call. Full
- * user register state is available here. It is safe to block here,
- * preventing signals from being processed.
- *
- * If @step is nonzero, this report is also in lieu of the normal
- * trap that would follow the system call instruction because
- * user_enable_block_step() or user_enable_single_step() was used.
- * In this case, %SYSCALL_WORK_SYSCALL_TRACE might not be set.
- *
- * Called without locks, just before checking for pending signals.
- */
-static inline void tracehook_report_syscall_exit(struct pt_regs *regs, int step)
-{
- if (step)
- user_single_step_report(regs);
- else
- ptrace_report_syscall(PTRACE_EVENTMSG_SYSCALL_EXIT);
-}
-
/**
* tracehook_signal_handler - signal handler setup is complete
* @stepping: nonzero if debugger single-step or block-step in use
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index 3747bf816f9a..b7af92e07d1f 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -114,7 +114,7 @@ struct ptrace_rseq_configuration {

/*
* These values are stored in task->ptrace_message
- * by tracehook_report_syscall_* to describe the current syscall-stop.
+ * by ptrace_report_syscall_* to describe the current syscall-stop.
*/
#define PTRACE_EVENTMSG_SYSCALL_ENTRY 1
#define PTRACE_EVENTMSG_SYSCALL_EXIT 2
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index bad713684c2e..f52e57c4d6d8 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -2,6 +2,7 @@

#include <linux/context_tracking.h>
#include <linux/entry-common.h>
+#include <linux/tracehook.h>
#include <linux/highmem.h>
#include <linux/livepatch.h>
#include <linux/audit.h>
--
2.29.2

2022-03-09 17:01:21

by Eric W. Biederman

Subject: [PATCH 07/13] task_work: Introduce task_work_pending

Wrap the test of task->task_works in a helper function to make
it clear what is being tested.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/io_uring.c | 6 +++---
include/linux/task_work.h | 5 +++++
include/linux/tracehook.h | 4 ++--
kernel/signal.c | 4 ++--
kernel/task_work.c | 2 +-
5 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e54c4127422e..e85261079a78 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2590,7 +2590,7 @@ static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx)

static inline bool io_run_task_work(void)
{
- if (test_thread_flag(TIF_NOTIFY_SIGNAL) || current->task_works) {
+ if (test_thread_flag(TIF_NOTIFY_SIGNAL) || task_work_pending(current)) {
__set_current_state(TASK_RUNNING);
tracehook_notify_signal();
return true;
@@ -7602,7 +7602,7 @@ static int io_sq_thread(void *data)
}

prepare_to_wait(&sqd->wait, &wait, TASK_INTERRUPTIBLE);
- if (!io_sqd_events_pending(sqd) && !current->task_works) {
+ if (!io_sqd_events_pending(sqd) && !task_work_pending(current)) {
bool needs_sched = true;

list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
@@ -10321,7 +10321,7 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx,

hlist_for_each_entry(req, list, hash_node)
seq_printf(m, " op=%d, task_works=%d\n", req->opcode,
- req->task->task_works != NULL);
+ task_work_pending(req->task));
}

seq_puts(m, "CqOverflowList:\n");
diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index 5b8a93f288bb..897494b597ba 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -19,6 +19,11 @@ enum task_work_notify_mode {
TWA_SIGNAL,
};

+static inline bool task_work_pending(struct task_struct *task)
+{
+ return READ_ONCE(task->task_works);
+}
+
int task_work_add(struct task_struct *task, struct callback_head *twork,
enum task_work_notify_mode mode);

diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index b77bf4917196..fa834a22e86e 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -90,7 +90,7 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
* hlist_add_head(task->task_works);
*/
smp_mb__after_atomic();
- if (unlikely(current->task_works))
+ if (unlikely(task_work_pending(current)))
task_work_run();

#ifdef CONFIG_KEYS_REQUEST_CACHE
@@ -115,7 +115,7 @@ static inline void tracehook_notify_signal(void)
{
clear_thread_flag(TIF_NOTIFY_SIGNAL);
smp_mb__after_atomic();
- if (current->task_works)
+ if (task_work_pending(current))
task_work_run();
}

diff --git a/kernel/signal.c b/kernel/signal.c
index 0e0bd1c1068b..3b4cf25fb9b3 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2344,7 +2344,7 @@ static void ptrace_do_notify(int signr, int exit_code, int why)
void ptrace_notify(int exit_code)
{
BUG_ON((exit_code & (0x7f | ~0xffff)) != SIGTRAP);
- if (unlikely(current->task_works))
+ if (unlikely(task_work_pending(current)))
task_work_run();

spin_lock_irq(&current->sighand->siglock);
@@ -2626,7 +2626,7 @@ bool get_signal(struct ksignal *ksig)
struct signal_struct *signal = current->signal;
int signr;

- if (unlikely(current->task_works))
+ if (unlikely(task_work_pending(current)))
task_work_run();

/*
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 1698fbe6f0e1..cc6fccb0e24d 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -78,7 +78,7 @@ task_work_cancel_match(struct task_struct *task,
struct callback_head *work;
unsigned long flags;

- if (likely(!task->task_works))
+ if (likely(!task_work_pending(task)))
return NULL;
/*
* If cmpxchg() fails we continue without updating pprev.
--
2.29.2
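
As a reading aid for patch 07/13 above, here is a minimal userspace sketch of the idea behind task_work_pending: hide the raw pointer test behind a named predicate. Everything in it is an assumption made for illustration. `struct task`, `READ_ONCE_PTR` and `task_work_push` are hypothetical stand-ins, not kernel APIs, and the volatile cast only approximates what the kernel's READ_ONCE does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel types. */
struct callback_head {
	struct callback_head *next;
};

struct task {
	struct callback_head *task_works;	/* head of the pending work list */
};

/* Approximates READ_ONCE(): force exactly one load of the pointer. */
#define READ_ONCE_PTR(p) (*(struct callback_head * volatile *)&(p))

/* Named predicate replacing open-coded "task->task_works" tests. */
static inline bool task_work_pending(struct task *task)
{
	return READ_ONCE_PTR(task->task_works) != NULL;
}

/* Hypothetical helper: push one item onto the list (single-threaded only). */
static inline void task_work_push(struct task *task, struct callback_head *w)
{
	assert(w != NULL);
	w->next = task->task_works;
	task->task_works = w;
}
```

With a helper like this, call sites read as `if (task_work_pending(current))` and no longer depend on how the pending work is stored.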

2022-03-09 17:01:33

by Eric W. Biederman

Subject: [PATCH 05/13] ptrace: Remove tracehook_signal_handler

The two-line function tracehook_signal_handler is only called from
signal_delivered. Expand it inline in signal_delivered and remove it,
to make it easier to understand what is going on.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
arch/Kconfig | 1 -
include/linux/tracehook.h | 17 -----------------
kernel/signal.c | 3 ++-
3 files changed, 2 insertions(+), 19 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a517a949eb1d..6382520ef0a5 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -219,7 +219,6 @@ config TRACE_IRQFLAGS_SUPPORT
# CORE_DUMP_USE_REGSET #define'd in linux/elf.h
# TIF_SYSCALL_TRACE calls ptrace_report_syscall_{entry,exit}
# TIF_NOTIFY_RESUME calls tracehook_notify_resume()
-# signal delivery calls tracehook_signal_handler()
#
config HAVE_ARCH_TRACEHOOK
bool
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 819e82ac09bd..b77bf4917196 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -52,23 +52,6 @@
struct linux_binprm;

-/**
- * tracehook_signal_handler - signal handler setup is complete
- * @stepping: nonzero if debugger single-step or block-step in use
- *
- * Called by the arch code after a signal handler has been set up.
- * Register and stack state reflects the user handler about to run.
- * Signal mask changes have already been made.
- *
- * Called without locks, shortly before returning to user mode
- * (or handling more signals).
- */
-static inline void tracehook_signal_handler(int stepping)
-{
- if (stepping)
- ptrace_notify(SIGTRAP);
-}
-
/**
* set_notify_resume - cause tracehook_notify_resume() to be called
* @task: task that will call tracehook_notify_resume()
diff --git a/kernel/signal.c b/kernel/signal.c
index 38602738866e..0e0bd1c1068b 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2898,7 +2898,8 @@ static void signal_delivered(struct ksignal *ksig, int stepping)
set_current_blocked(&blocked);
if (current->sas_ss_flags & SS_AUTODISARM)
sas_ss_reset(current);
- tracehook_signal_handler(stepping);
+ if (stepping)
+ ptrace_notify(SIGTRAP);
}

void signal_setup_done(int failed, struct ksignal *ksig, int stepping)
--
2.29.2

2022-03-09 17:01:33

by Eric W. Biederman

Subject: [PATCH 08/13] task_work: Call tracehook_notify_signal from get_signal on all architectures

Always handle TIF_NOTIFY_SIGNAL in get_signal. With commit 35d0b389f3b2
("task_work: unconditionally run task_work from get_signal()") always
calling task_work_run, all of the work of tracehook_notify_signal is
already happening except clearing TIF_NOTIFY_SIGNAL.

Factor clear_notify_signal out of tracehook_notify_signal and use it in
get_signal so that get_signal only needs one call of task_work_run.

To keep the semantics in sync update xfer_to_guest_mode_work (which
does not call get_signal) to call tracehook_notify_signal if either
_TIF_SIGPENDING or _TIF_NOTIFY_SIGNAL is set.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
arch/s390/kernel/signal.c | 4 ++--
arch/x86/kernel/signal.c | 4 ++--
include/linux/entry-common.h | 2 +-
include/linux/tracehook.h | 9 +++++++--
kernel/entry/common.c | 12 ++----------
kernel/entry/kvm.c | 2 +-
kernel/signal.c | 14 +++-----------
7 files changed, 18 insertions(+), 29 deletions(-)

diff --git a/arch/s390/kernel/signal.c b/arch/s390/kernel/signal.c
index 307f5d99514d..ea9e5e8182cd 100644
--- a/arch/s390/kernel/signal.c
+++ b/arch/s390/kernel/signal.c
@@ -453,7 +453,7 @@ static void handle_signal(struct ksignal *ksig, sigset_t *oldset,
* stack-frames in one go after that.
*/

-void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal)
+void arch_do_signal_or_restart(struct pt_regs *regs)
{
struct ksignal ksig;
sigset_t *oldset = sigmask_to_save();
@@ -466,7 +466,7 @@ void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal)
current->thread.system_call =
test_pt_regs_flag(regs, PIF_SYSCALL) ? regs->int_code : 0;

- if (has_signal && get_signal(&ksig)) {
+ if (get_signal(&ksig)) {
/* Whee! Actually deliver the signal. */
if (current->thread.system_call) {
regs->int_code = current->thread.system_call;
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index ec71e06ae364..de3d5b5724d8 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -861,11 +861,11 @@ static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
* want to handle. Thus you cannot kill init even with a SIGKILL even by
* mistake.
*/
-void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal)
+void arch_do_signal_or_restart(struct pt_regs *regs)
{
struct ksignal ksig;

- if (has_signal && get_signal(&ksig)) {
+ if (get_signal(&ksig)) {
/* Whee! Actually deliver the signal. */
handle_signal(&ksig, regs);
return;
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 9efbdda61f7a..3537fd25f14e 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -257,7 +257,7 @@ static __always_inline void arch_exit_to_user_mode(void) { }
*
* Invoked from exit_to_user_mode_loop().
*/
-void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal);
+void arch_do_signal_or_restart(struct pt_regs *regs);

/**
* exit_to_user_mode - Fixup state when exiting to user mode
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index fa834a22e86e..b44a7820c468 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -106,6 +106,12 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
rseq_handle_notify_resume(NULL, regs);
}

+static inline void clear_notify_signal(void)
+{
+ clear_thread_flag(TIF_NOTIFY_SIGNAL);
+ smp_mb__after_atomic();
+}
+
/*
* called by exit_to_user_mode_loop() if ti_work & _TIF_NOTIFY_SIGNAL. This
* is currently used by TWA_SIGNAL based task_work, which requires breaking
@@ -113,8 +119,7 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
*/
static inline void tracehook_notify_signal(void)
{
- clear_thread_flag(TIF_NOTIFY_SIGNAL);
- smp_mb__after_atomic();
+ clear_notify_signal();
if (task_work_pending(current))
task_work_run();
}
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index f0b1daa1e8da..79eaf9b4b10d 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -139,15 +139,7 @@ void noinstr exit_to_user_mode(void)
}

/* Workaround to allow gradual conversion of architecture code */
-void __weak arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal) { }
-
-static void handle_signal_work(struct pt_regs *regs, unsigned long ti_work)
-{
- if (ti_work & _TIF_NOTIFY_SIGNAL)
- tracehook_notify_signal();
-
- arch_do_signal_or_restart(regs, ti_work & _TIF_SIGPENDING);
-}
+void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }

static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
unsigned long ti_work)
@@ -170,7 +162,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
klp_update_patch_state(current);

if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
- handle_signal_work(regs, ti_work);
+ arch_do_signal_or_restart(regs);

if (ti_work & _TIF_NOTIFY_RESUME)
tracehook_notify_resume(regs);
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 96d476e06c77..cabf36a489e4 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -8,7 +8,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
do {
int ret;

- if (ti_work & _TIF_NOTIFY_SIGNAL)
+ if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
tracehook_notify_signal();

if (ti_work & _TIF_SIGPENDING) {
diff --git a/kernel/signal.c b/kernel/signal.c
index 3b4cf25fb9b3..8632b88982c9 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2626,20 +2626,12 @@ bool get_signal(struct ksignal *ksig)
struct signal_struct *signal = current->signal;
int signr;

+ clear_notify_signal();
if (unlikely(task_work_pending(current)))
task_work_run();

- /*
- * For non-generic architectures, check for TIF_NOTIFY_SIGNAL so
- * that the arch handlers don't all have to do it. If we get here
- * without TIF_SIGPENDING, just exit after running signal work.
- */
- if (!IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
- if (test_thread_flag(TIF_NOTIFY_SIGNAL))
- tracehook_notify_signal();
- if (!task_sigpending(current))
- return false;
- }
+ if (!task_sigpending(current))
+ return false;

if (unlikely(uprobe_deny_signal()))
return false;
--
2.29.2
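
The order of operations that patch 08/13 gives the top of get_signal can be sketched in userspace as follows. This is a simplified model under stated assumptions, not the kernel implementation: `struct sig_task_model` and the `model_*` functions are hypothetical names, and plain fields stand in for the thread flags and the task_work list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-task state that get_signal() consults. */
struct sig_task_model {
	bool notify_signal;	/* stands in for TIF_NOTIFY_SIGNAL */
	bool sigpending;	/* stands in for TIF_SIGPENDING */
	int pending_work;	/* number of queued task_work items */
	int work_run;		/* number of items that have been run */
};

/* Runs every queued item, like task_work_run() draining the list. */
static void model_task_work_run(struct sig_task_model *t)
{
	t->work_run += t->pending_work;
	t->pending_work = 0;
}

/* Mirrors the first steps of get_signal() after the patch. */
static bool model_get_signal(struct sig_task_model *t)
{
	assert(t != NULL);

	t->notify_signal = false;	/* clear_notify_signal() */
	if (t->pending_work)		/* task_work_pending(current) */
		model_task_work_run(t);

	if (!t->sigpending)		/* !task_sigpending(current) */
		return false;

	/* The real function goes on to dequeue and return a signal. */
	return true;
}
```

In this model a task that arrives with only the notify flag set has its flag cleared and its work run, and the function reports that there is no signal to deliver. That is the case the arch code previously filtered out with the has_signal argument.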

2022-03-09 17:02:02

by Eric W. Biederman

Subject: [PATCH 10/13] signal: Move set_notify_signal and clear_notify_signal into sched/signal.h

The header tracehook.h is no place for code to live. The functions
set_notify_signal and clear_notify_signal are not about signals. They
are about interruptions that act like signals. The fundamental signal
primitives wind up calling set_notify_signal and clear_notify_signal,
which means they need to be maintained with the signal code.

Since set_notify_signal and clear_notify_signal must be maintained
with the signal subsystem move them into sched/signal.h and claim them
as part of the signal subsystem.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/sched/signal.h | 17 +++++++++++++++++
include/linux/tracehook.h | 17 -----------------
2 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index b6ecb9fc4cd2..3c8b34876744 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -349,6 +349,23 @@ extern void sigqueue_free(struct sigqueue *);
extern int send_sigqueue(struct sigqueue *, struct pid *, enum pid_type);
extern int do_sigaction(int, struct k_sigaction *, struct k_sigaction *);

+static inline void clear_notify_signal(void)
+{
+ clear_thread_flag(TIF_NOTIFY_SIGNAL);
+ smp_mb__after_atomic();
+}
+
+/*
+ * Called to break out of interruptible wait loops, and enter the
+ * exit_to_user_mode_loop().
+ */
+static inline void set_notify_signal(struct task_struct *task)
+{
+ if (!test_and_set_tsk_thread_flag(task, TIF_NOTIFY_SIGNAL) &&
+ !wake_up_state(task, TASK_INTERRUPTIBLE))
+ kick_process(task);
+}
+
static inline int restart_syscall(void)
{
set_tsk_thread_flag(current, TIF_SIGPENDING);
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index e5d676e841e3..1b7365aef8da 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -106,21 +106,4 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
rseq_handle_notify_resume(NULL, regs);
}

-static inline void clear_notify_signal(void)
-{
- clear_thread_flag(TIF_NOTIFY_SIGNAL);
- smp_mb__after_atomic();
-}
-
-/*
- * Called to break out of interruptible wait loops, and enter the
- * exit_to_user_mode_loop().
- */
-static inline void set_notify_signal(struct task_struct *task)
-{
- if (!test_and_set_tsk_thread_flag(task, TIF_NOTIFY_SIGNAL) &&
- !wake_up_state(task, TASK_INTERRUPTIBLE))
- kick_process(task);
-}
-
#endif /* <linux/tracehook.h> */
--
2.29.2
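
The set_notify_signal helper moved by patch 10/13 relies on test-and-set semantics: only the caller that flips the flag from clear to set tries to wake the task, and the task is kicked only if that wake-up found nothing to wake. A minimal userspace sketch of that logic, assuming C11 atomics in place of the kernel's thread-flag helpers; `struct notify_task_model` and the `model_*` functions are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the state touched by set_notify_signal(). */
struct notify_task_model {
	atomic_bool notify_signal;	/* stands in for TIF_NOTIFY_SIGNAL */
	bool sleeping;			/* true while in an interruptible sleep */
	int wakeups;			/* times the modelled wake-up woke the task */
	int kicks;			/* times the modelled kick_process() ran */
};

/* Returns true if the task was asleep and has now been woken. */
static bool model_wake_up_state(struct notify_task_model *t)
{
	if (!t->sleeping)
		return false;
	t->sleeping = false;
	t->wakeups++;
	return true;
}

static void model_set_notify_signal(struct notify_task_model *t)
{
	assert(t != NULL);

	/* atomic_exchange() returns the old value, like a test-and-set. */
	if (!atomic_exchange(&t->notify_signal, true) &&
	    !model_wake_up_state(t))
		t->kicks++;	/* task was already running: kick it instead */
}
```

Calling `model_set_notify_signal` twice in a row wakes or kicks the task only once, because the second call sees the flag already set and returns without doing anything further.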

2022-03-09 17:02:08

by Eric W. Biederman

Subject: [PATCH 13/13] tracehook: Remove tracehook.h

Now that all of the definitions have moved out of tracehook.h into
ptrace.h, sched/signal.h, and resume_user_mode.h, there is nothing left
in tracehook.h, so remove it.

Update the few files that depended upon tracehook.h to bring in
definitions so that they include the headers they need directly.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
MAINTAINERS | 1 -
arch/s390/include/asm/entry-common.h | 1 -
arch/s390/kernel/ptrace.c | 1 -
arch/s390/kernel/signal.c | 1 -
arch/x86/kernel/ptrace.c | 1 -
arch/x86/kernel/signal.c | 1 -
fs/coredump.c | 1 -
fs/exec.c | 1 -
fs/io-wq.c | 2 +-
fs/io_uring.c | 1 -
fs/proc/array.c | 1 -
fs/proc/base.c | 1 -
include/linux/tracehook.h | 56 ----------------------------
kernel/exit.c | 3 +-
kernel/livepatch/transition.c | 1 -
kernel/seccomp.c | 1 -
kernel/signal.c | 2 +-
security/apparmor/domain.c | 1 -
security/selinux/hooks.c | 1 -
19 files changed, 4 insertions(+), 74 deletions(-)
delete mode 100644 include/linux/tracehook.h

diff --git a/MAINTAINERS b/MAINTAINERS
index ea3e6c914384..2f16a23a26a2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -15623,7 +15623,6 @@ F: arch/*/ptrace*.c
F: include/asm-generic/syscall.h
F: include/linux/ptrace.h
F: include/linux/regset.h
-F: include/linux/tracehook.h
F: include/uapi/linux/ptrace.h
F: include/uapi/linux/ptrace.h
F: kernel/ptrace.c
diff --git a/arch/s390/include/asm/entry-common.h b/arch/s390/include/asm/entry-common.h
index 17aead80aadb..eabab24b71dd 100644
--- a/arch/s390/include/asm/entry-common.h
+++ b/arch/s390/include/asm/entry-common.h
@@ -5,7 +5,6 @@
#include <linux/sched.h>
#include <linux/audit.h>
#include <linux/randomize_kstack.h>
-#include <linux/tracehook.h>
#include <linux/processor.h>
#include <linux/uaccess.h>
#include <asm/timex.h>
diff --git a/arch/s390/kernel/ptrace.c b/arch/s390/kernel/ptrace.c
index 0ea3d02b378d..641fa36f6101 100644
--- a/arch/s390/kernel/ptrace.c
+++ b/arch/s390/kernel/ptrace.c
@@ -21,7 +21,6 @@
#include <linux/signal.h>
#include <linux/elf.h>
#include <linux/regset.h>
-#include <linux/tracehook.h>
#include <linux/seccomp.h>
#include <linux/compat.h>
#include <trace/syscall.h>
diff --git a/arch/s390/kernel/signal.c b/arch/s390/kernel/signal.c
index ea9e5e8182cd..8b7b5f80f722 100644
--- a/arch/s390/kernel/signal.c
+++ b/arch/s390/kernel/signal.c
@@ -25,7 +25,6 @@
#include <linux/tty.h>
#include <linux/personality.h>
#include <linux/binfmts.h>
-#include <linux/tracehook.h>
#include <linux/syscalls.h>
#include <linux/compat.h>
#include <asm/ucontext.h>
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 6d2244c94799..419768d7605e 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -13,7 +13,6 @@
#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/ptrace.h>
-#include <linux/tracehook.h>
#include <linux/user.h>
#include <linux/elf.h>
#include <linux/security.h>
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index de3d5b5724d8..e439eb14325f 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -18,7 +18,6 @@
#include <linux/kstrtox.h>
#include <linux/errno.h>
#include <linux/wait.h>
-#include <linux/tracehook.h>
#include <linux/unistd.h>
#include <linux/stddef.h>
#include <linux/personality.h>
diff --git a/fs/coredump.c b/fs/coredump.c
index 1c060c0a2d72..f54c5e316df3 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -31,7 +31,6 @@
#include <linux/tsacct_kern.h>
#include <linux/cn_proc.h>
#include <linux/audit.h>
-#include <linux/tracehook.h>
#include <linux/kmod.h>
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
diff --git a/fs/exec.c b/fs/exec.c
index 79f2c9483302..e23e2d430485 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -56,7 +56,6 @@
#include <linux/tsacct_kern.h>
#include <linux/cn_proc.h>
#include <linux/audit.h>
-#include <linux/tracehook.h>
#include <linux/kmod.h>
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
diff --git a/fs/io-wq.c b/fs/io-wq.c
index 8b9147873c2c..cb3cb1833ef6 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -13,7 +13,7 @@
#include <linux/slab.h>
#include <linux/rculist_nulls.h>
#include <linux/cpu.h>
-#include <linux/tracehook.h>
+#include <linux/task_work.h>
#include <linux/audit.h>
#include <uapi/linux/io_uring.h>

diff --git a/fs/io_uring.c b/fs/io_uring.c
index d5fbae1030f9..6c7eacc0ebd6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -78,7 +78,6 @@
#include <linux/task_work.h>
#include <linux/pagemap.h>
#include <linux/io_uring.h>
-#include <linux/tracehook.h>
#include <linux/audit.h>
#include <linux/security.h>

diff --git a/fs/proc/array.c b/fs/proc/array.c
index fd8b0c12b2cb..eb815759842c 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -88,7 +88,6 @@
#include <linux/pid_namespace.h>
#include <linux/prctl.h>
#include <linux/ptrace.h>
-#include <linux/tracehook.h>
#include <linux/string_helpers.h>
#include <linux/user_namespace.h>
#include <linux/fs_struct.h>
diff --git a/fs/proc/base.c b/fs/proc/base.c
index d654ce7150fd..01fb37ecc89f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -74,7 +74,6 @@
#include <linux/mount.h>
#include <linux/security.h>
#include <linux/ptrace.h>
-#include <linux/tracehook.h>
#include <linux/printk.h>
#include <linux/cache.h>
#include <linux/cgroup.h>
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
deleted file mode 100644
index 9f6b3fd1880a..000000000000
--- a/include/linux/tracehook.h
+++ /dev/null
@@ -1,56 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Tracing hooks
- *
- * Copyright (C) 2008-2009 Red Hat, Inc. All rights reserved.
- *
- * This file defines hook entry points called by core code where
- * user tracing/debugging support might need to do something. These
- * entry points are called tracehook_*(). Each hook declared below
- * has a detailed kerneldoc comment giving the context (locking et
- * al) from which it is called, and the meaning of its return value.
- *
- * Each function here typically has only one call site, so it is ok
- * to have some nontrivial tracehook_*() inlines. In all cases, the
- * fast path when no tracing is enabled should be very short.
- *
- * The purpose of this file and the tracehook_* layer is to consolidate
- * the interface that the kernel core and arch code uses to enable any
- * user debugging or tracing facility (such as ptrace). The interfaces
- * here are carefully documented so that maintainers of core and arch
- * code do not need to think about the implementation details of the
- * tracing facilities. Likewise, maintainers of the tracing code do not
- * need to understand all the calling core or arch code in detail, just
- * documented circumstances of each call, such as locking conditions.
- *
- * If the calling core code changes so that locking is different, then
- * it is ok to change the interface documented here. The maintainer of
- * core code changing should notify the maintainers of the tracing code
- * that they need to work out the change.
- *
- * Some tracehook_*() inlines take arguments that the current tracing
- * implementations might not necessarily use. These function signatures
- * are chosen to pass in all the information that is on hand in the
- * caller and might conceivably be relevant to a tracer, so that the
- * core code won't have to be updated when tracing adds more features.
- * If a call site changes so that some of those parameters are no longer
- * already on hand without extra work, then the tracehook_* interface
- * can change so there is no make-work burden on the core code. The
- * maintainer of core code changing should notify the maintainers of the
- * tracing code that they need to work out the change.
- */
-
-#ifndef _LINUX_TRACEHOOK_H
-#define _LINUX_TRACEHOOK_H 1
-
-#include <linux/sched.h>
-#include <linux/ptrace.h>
-#include <linux/security.h>
-#include <linux/task_work.h>
-#include <linux/memcontrol.h>
-#include <linux/blk-cgroup.h>
-struct linux_binprm;
-
-
-
-#endif /* <linux/tracehook.h> */
diff --git a/kernel/exit.c b/kernel/exit.c
index b00a25bb4ab9..9326d1f97fc7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -49,7 +49,8 @@
#include <linux/audit.h> /* for audit_free() */
#include <linux/resource.h>
#include <linux/task_io_accounting_ops.h>
-#include <linux/tracehook.h>
+#include <linux/blkdev.h>
+#include <linux/task_work.h>
#include <linux/fs_struct.h>
#include <linux/init_task.h>
#include <linux/perf_event.h>
diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index 5683ac0d2566..df808d97d84f 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -9,7 +9,6 @@

#include <linux/cpu.h>
#include <linux/stacktrace.h>
-#include <linux/tracehook.h>
#include "core.h"
#include "patch.h"
#include "transition.h"
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 4d8f44a17727..63198086ee83 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -39,7 +39,6 @@
#include <linux/pid.h>
#include <linux/ptrace.h>
#include <linux/capability.h>
-#include <linux/tracehook.h>
#include <linux/uaccess.h>
#include <linux/anon_inodes.h>
#include <linux/lockdep.h>
diff --git a/kernel/signal.c b/kernel/signal.c
index 8632b88982c9..c2dee5420567 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -32,7 +32,7 @@
#include <linux/signal.h>
#include <linux/signalfd.h>
#include <linux/ratelimit.h>
-#include <linux/tracehook.h>
+#include <linux/task_work.h>
#include <linux/capability.h>
#include <linux/freezer.h>
#include <linux/pid_namespace.h>
diff --git a/security/apparmor/domain.c b/security/apparmor/domain.c
index 583680f6cd81..a29e69d2c300 100644
--- a/security/apparmor/domain.c
+++ b/security/apparmor/domain.c
@@ -14,7 +14,6 @@
#include <linux/file.h>
#include <linux/mount.h>
#include <linux/syscalls.h>
-#include <linux/tracehook.h>
#include <linux/personality.h>
#include <linux/xattr.h>
#include <linux/user_namespace.h>
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 5b6895e4fc29..4d2cd6b9f6fc 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -25,7 +25,6 @@
#include <linux/kd.h>
#include <linux/kernel.h>
#include <linux/kernel_read_file.h>
-#include <linux/tracehook.h>
#include <linux/errno.h>
#include <linux/sched/signal.h>
#include <linux/sched/task.h>
--
2.29.2

2022-03-09 17:04:49

by Eric W. Biederman

Subject: [PATCH 11/13] resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume

Every architecture defines TIF_NOTIFY_RESUME so remove the unnecessary
ifdef.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/tracehook.h | 2 --
1 file changed, 2 deletions(-)

diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 1b7365aef8da..946404ebe10b 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -63,10 +63,8 @@ struct linux_binprm;
*/
static inline void set_notify_resume(struct task_struct *task)
{
-#ifdef TIF_NOTIFY_RESUME
if (!test_and_set_tsk_thread_flag(task, TIF_NOTIFY_RESUME))
kick_process(task);
-#endif
}

/**
--
2.29.2

2022-03-09 17:05:48

by Eric W. Biederman

Subject: [PATCH 02/13] ptrace/arm: Rename tracehook_report_syscall report_syscall

Make the arm and arm64 code more concise and less confusing by
renaming the architecture specific tracehook_report_syscall to
report_syscall.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
arch/arm/kernel/ptrace.c | 7 +++----
arch/arm64/kernel/ptrace.c | 7 +++----
2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
index 43b963ea4a0e..e5aa3237853d 100644
--- a/arch/arm/kernel/ptrace.c
+++ b/arch/arm/kernel/ptrace.c
@@ -831,8 +831,7 @@ enum ptrace_syscall_dir {
PTRACE_SYSCALL_EXIT,
};

-static void tracehook_report_syscall(struct pt_regs *regs,
- enum ptrace_syscall_dir dir)
+static void report_syscall(struct pt_regs *regs, enum ptrace_syscall_dir dir)
{
unsigned long ip;

@@ -856,7 +855,7 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs)
int scno;

if (test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
+ report_syscall(regs, PTRACE_SYSCALL_ENTER);

/* Do seccomp after ptrace; syscall may have changed. */
#ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
@@ -897,5 +896,5 @@ asmlinkage void syscall_trace_exit(struct pt_regs *regs)
trace_sys_exit(regs, regs_return_value(regs));

if (test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall(regs, PTRACE_SYSCALL_EXIT);
+ report_syscall(regs, PTRACE_SYSCALL_EXIT);
}
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 39dbdfdc38d3..b7845575f86f 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -1792,8 +1792,7 @@ enum ptrace_syscall_dir {
PTRACE_SYSCALL_EXIT,
};

-static void tracehook_report_syscall(struct pt_regs *regs,
- enum ptrace_syscall_dir dir)
+static void report_syscall(struct pt_regs *regs, enum ptrace_syscall_dir dir)
{
int regno;
unsigned long saved_reg;
@@ -1842,7 +1841,7 @@ int syscall_trace_enter(struct pt_regs *regs)
unsigned long flags = read_thread_flags();

if (flags & (_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE)) {
- tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
+ report_syscall(regs, PTRACE_SYSCALL_ENTER);
if (flags & _TIF_SYSCALL_EMU)
return NO_SYSCALL;
}
@@ -1870,7 +1869,7 @@ void syscall_trace_exit(struct pt_regs *regs)
trace_sys_exit(regs, syscall_get_return_value(current, regs));

if (flags & (_TIF_SYSCALL_TRACE | _TIF_SINGLESTEP))
- tracehook_report_syscall(regs, PTRACE_SYSCALL_EXIT);
+ report_syscall(regs, PTRACE_SYSCALL_EXIT);

rseq_syscall(regs);
}
--
2.29.2

2022-03-09 17:06:08

by Eric W. Biederman

Subject: [PATCH 09/13] task_work: Decouple TIF_NOTIFY_SIGNAL and task_work

There are a small handful of reasons besides pending signals that the
kernel might want to break out of interruptible sleeps. The flag
TIF_NOTIFY_SIGNAL and the helpers that set and clear TIF_NOTIFY_SIGNAL
provide the infrastructure for breaking out of interruptible
sleeps and entering the return to user space slow path for those
cases.

Expand tracehook_notify_signal inline in its callers and remove it,
which makes clear that TIF_NOTIFY_SIGNAL and task_work are separate
concepts.

Update the comment on set_notify_signal to more accurately describe
its purpose.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/io-wq.c | 4 +++-
fs/io_uring.c | 4 +++-
include/linux/tracehook.h | 15 ++-------------
kernel/entry/kvm.c | 7 +++++--
4 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index bb7f161bb19c..8b9147873c2c 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -515,7 +515,9 @@ static bool io_flush_signals(void)
{
if (unlikely(test_thread_flag(TIF_NOTIFY_SIGNAL))) {
__set_current_state(TASK_RUNNING);
- tracehook_notify_signal();
+ clear_notify_signal();
+ if (task_work_pending(current))
+ task_work_run();
return true;
}
return false;
diff --git a/fs/io_uring.c b/fs/io_uring.c
index e85261079a78..d5fbae1030f9 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2592,7 +2592,9 @@ static inline bool io_run_task_work(void)
{
if (test_thread_flag(TIF_NOTIFY_SIGNAL) || task_work_pending(current)) {
__set_current_state(TASK_RUNNING);
- tracehook_notify_signal();
+ clear_notify_signal();
+ if (task_work_pending(current))
+ task_work_run();
return true;
}

diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index b44a7820c468..e5d676e841e3 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -113,19 +113,8 @@ static inline void clear_notify_signal(void)
}

/*
- * called by exit_to_user_mode_loop() if ti_work & _TIF_NOTIFY_SIGNAL. This
- * is currently used by TWA_SIGNAL based task_work, which requires breaking
- * wait loops to ensure that task_work is noticed and run.
- */
-static inline void tracehook_notify_signal(void)
-{
- clear_notify_signal();
- if (task_work_pending(current))
- task_work_run();
-}
-
-/*
- * Called when we have work to process from exit_to_user_mode_loop()
+ * Called to break out of interruptible wait loops, and enter the
+ * exit_to_user_mode_loop().
*/
static inline void set_notify_signal(struct task_struct *task)
{
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index cabf36a489e4..3ab5f98988c3 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -8,8 +8,11 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
do {
int ret;

- if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
- tracehook_notify_signal();
+ if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) {
+ clear_notify_signal();
+ if (task_work_pending(current))
+ task_work_run();
+ }

if (ti_work & _TIF_SIGPENDING) {
kvm_handle_signal_exit(vcpu);
--
2.29.2

2022-03-09 17:06:18

by Eric W. Biederman

Subject: [PATCH 12/13] resume_user_mode: Move to resume_user_mode.h

Move set_notify_resume and tracehook_notify_resume into resume_user_mode.h.
While doing that rename tracehook_notify_resume to resume_user_mode_work.

Update all of the places that included tracehook.h for these functions to
include resume_user_mode.h instead.

Update all of the callers of tracehook_notify_resume to call
resume_user_mode_work.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
arch/Kconfig | 2 +-
arch/alpha/kernel/signal.c | 4 +-
arch/arc/kernel/signal.c | 4 +-
arch/arm/kernel/signal.c | 4 +-
arch/arm64/kernel/signal.c | 4 +-
arch/csky/kernel/signal.c | 4 +-
arch/h8300/kernel/signal.c | 4 +-
arch/hexagon/kernel/process.c | 4 +-
arch/hexagon/kernel/signal.c | 1 -
arch/ia64/kernel/process.c | 4 +-
arch/ia64/kernel/ptrace.c | 2 +-
arch/ia64/kernel/signal.c | 1 -
arch/m68k/kernel/signal.c | 4 +-
arch/microblaze/kernel/signal.c | 4 +-
arch/mips/kernel/signal.c | 4 +-
arch/nds32/kernel/signal.c | 4 +-
arch/nios2/kernel/signal.c | 4 +-
arch/openrisc/kernel/signal.c | 4 +-
arch/parisc/kernel/signal.c | 4 +-
arch/powerpc/kernel/signal.c | 4 +-
arch/riscv/kernel/signal.c | 4 +-
arch/sh/kernel/signal_32.c | 4 +-
arch/sparc/kernel/signal32.c | 1 -
arch/sparc/kernel/signal_32.c | 4 +-
arch/sparc/kernel/signal_64.c | 4 +-
arch/um/kernel/process.c | 4 +-
arch/xtensa/kernel/signal.c | 4 +-
block/blk-cgroup.c | 2 +-
include/linux/entry-kvm.h | 2 +-
include/linux/resume_user_mode.h | 64 ++++++++++++++++++++++++++++++++
include/linux/tracehook.h | 51 -------------------------
kernel/entry/common.c | 4 +-
kernel/entry/kvm.c | 2 +-
kernel/task_work.c | 2 +-
mm/memcontrol.c | 2 +-
35 files changed, 117 insertions(+), 107 deletions(-)
create mode 100644 include/linux/resume_user_mode.h

diff --git a/arch/Kconfig b/arch/Kconfig
index 6382520ef0a5..2e3979c3d66d 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -218,7 +218,7 @@ config TRACE_IRQFLAGS_SUPPORT
# linux/regset.h user_regset interfaces
# CORE_DUMP_USE_REGSET #define'd in linux/elf.h
# TIF_SYSCALL_TRACE calls ptrace_report_syscall_{entry,exit}
-# TIF_NOTIFY_RESUME calls tracehook_notify_resume()
+# TIF_NOTIFY_RESUME calls resume_user_mode_work()
#
config HAVE_ARCH_TRACEHOOK
bool
diff --git a/arch/alpha/kernel/signal.c b/arch/alpha/kernel/signal.c
index d8ed71d5bed3..6f47f256fe80 100644
--- a/arch/alpha/kernel/signal.c
+++ b/arch/alpha/kernel/signal.c
@@ -22,7 +22,7 @@
#include <linux/binfmts.h>
#include <linux/bitops.h>
#include <linux/syscalls.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>

#include <linux/uaccess.h>
#include <asm/sigcontext.h>
@@ -531,7 +531,7 @@ do_work_pending(struct pt_regs *regs, unsigned long thread_flags,
do_signal(regs, r0, r19);
r0 = 0;
} else {
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);
}
}
local_irq_disable();
diff --git a/arch/arc/kernel/signal.c b/arch/arc/kernel/signal.c
index cb2f88502baf..f748483628f2 100644
--- a/arch/arc/kernel/signal.c
+++ b/arch/arc/kernel/signal.c
@@ -49,7 +49,7 @@
#include <linux/personality.h>
#include <linux/uaccess.h>
#include <linux/syscalls.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>
#include <linux/sched/task_stack.h>

#include <asm/ucontext.h>
@@ -438,5 +438,5 @@ void do_notify_resume(struct pt_regs *regs)
* user mode
*/
if (test_thread_flag(TIF_NOTIFY_RESUME))
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);
}
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index c532a6041066..459abc5d1819 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -9,7 +9,7 @@
#include <linux/signal.h>
#include <linux/personality.h>
#include <linux/uaccess.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>
#include <linux/uprobes.h>
#include <linux/syscalls.h>

@@ -627,7 +627,7 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
} else if (thread_flags & _TIF_UPROBE) {
uprobe_notify_resume(regs);
} else {
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);
}
}
local_irq_disable();
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index d8aaf4b6f432..413c51de9d10 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -17,7 +17,7 @@
#include <linux/uaccess.h>
#include <linux/sizes.h>
#include <linux/string.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>
#include <linux/ratelimit.h>
#include <linux/syscalls.h>

@@ -941,7 +941,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags)
do_signal(regs);

if (thread_flags & _TIF_NOTIFY_RESUME)
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);

if (thread_flags & _TIF_FOREIGN_FPSTATE)
fpsimd_restore_current_state();
diff --git a/arch/csky/kernel/signal.c b/arch/csky/kernel/signal.c
index c7b763d2f526..7a3149a27e4d 100644
--- a/arch/csky/kernel/signal.c
+++ b/arch/csky/kernel/signal.c
@@ -3,7 +3,7 @@
#include <linux/signal.h>
#include <linux/uaccess.h>
#include <linux/syscalls.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>

#include <asm/traps.h>
#include <asm/ucontext.h>
@@ -265,5 +265,5 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
do_signal(regs);

if (thread_info_flags & _TIF_NOTIFY_RESUME)
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);
}
diff --git a/arch/h8300/kernel/signal.c b/arch/h8300/kernel/signal.c
index 75a1c36b105a..0716fc8a8ce2 100644
--- a/arch/h8300/kernel/signal.c
+++ b/arch/h8300/kernel/signal.c
@@ -39,7 +39,7 @@
#include <linux/personality.h>
#include <linux/tty.h>
#include <linux/binfmts.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>

#include <asm/setup.h>
#include <linux/uaccess.h>
@@ -283,5 +283,5 @@ asmlinkage void do_notify_resume(struct pt_regs *regs, u32 thread_info_flags)
do_signal(regs);

if (thread_info_flags & _TIF_NOTIFY_RESUME)
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);
}
diff --git a/arch/hexagon/kernel/process.c b/arch/hexagon/kernel/process.c
index 232dfd8956aa..ae3f728eeca0 100644
--- a/arch/hexagon/kernel/process.c
+++ b/arch/hexagon/kernel/process.c
@@ -14,7 +14,7 @@
#include <linux/tick.h>
#include <linux/uaccess.h>
#include <linux/slab.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>

/*
* Program thread launch. Often defined as a macro in processor.h,
@@ -178,7 +178,7 @@ int do_work_pending(struct pt_regs *regs, u32 thread_info_flags)
}

if (thread_info_flags & _TIF_NOTIFY_RESUME) {
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);
return 1;
}

diff --git a/arch/hexagon/kernel/signal.c b/arch/hexagon/kernel/signal.c
index 94cc7ff52dce..bcba31e9e0ae 100644
--- a/arch/hexagon/kernel/signal.c
+++ b/arch/hexagon/kernel/signal.c
@@ -7,7 +7,6 @@

#include <linux/linkage.h>
#include <linux/syscalls.h>
-#include <linux/tracehook.h>
#include <linux/sched/task_stack.h>

#include <asm/registers.h>
diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c
index 834df24a88f1..d7a256bd9d6b 100644
--- a/arch/ia64/kernel/process.c
+++ b/arch/ia64/kernel/process.c
@@ -32,7 +32,7 @@
#include <linux/delay.h>
#include <linux/kdebug.h>
#include <linux/utsname.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>
#include <linux/rcupdate.h>

#include <asm/cpu.h>
@@ -179,7 +179,7 @@ do_notify_resume_user(sigset_t *unused, struct sigscratch *scr, long in_syscall)

if (test_thread_flag(TIF_NOTIFY_RESUME)) {
local_irq_enable(); /* force interrupt enable */
- tracehook_notify_resume(&scr->pt);
+ resume_user_mode_work(&scr->pt);
}

/* copy user rbs to kernel rbs */
diff --git a/arch/ia64/kernel/ptrace.c b/arch/ia64/kernel/ptrace.c
index 6af64aae087d..a19acd9f5e1f 100644
--- a/arch/ia64/kernel/ptrace.c
+++ b/arch/ia64/kernel/ptrace.c
@@ -23,7 +23,7 @@
#include <linux/signal.h>
#include <linux/regset.h>
#include <linux/elf.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>

#include <asm/processor.h>
#include <asm/ptrace_offsets.h>
diff --git a/arch/ia64/kernel/signal.c b/arch/ia64/kernel/signal.c
index c1b299760bf7..51cf6a7ec158 100644
--- a/arch/ia64/kernel/signal.c
+++ b/arch/ia64/kernel/signal.c
@@ -12,7 +12,6 @@
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/ptrace.h>
-#include <linux/tracehook.h>
#include <linux/sched.h>
#include <linux/signal.h>
#include <linux/smp.h>
diff --git a/arch/m68k/kernel/signal.c b/arch/m68k/kernel/signal.c
index 338817d0cb3f..49533f65958a 100644
--- a/arch/m68k/kernel/signal.c
+++ b/arch/m68k/kernel/signal.c
@@ -43,7 +43,7 @@
#include <linux/tty.h>
#include <linux/binfmts.h>
#include <linux/extable.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>

#include <asm/setup.h>
#include <linux/uaccess.h>
@@ -1109,5 +1109,5 @@ void do_notify_resume(struct pt_regs *regs)
do_signal(regs);

if (test_thread_flag(TIF_NOTIFY_RESUME))
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);
}
diff --git a/arch/microblaze/kernel/signal.c b/arch/microblaze/kernel/signal.c
index 23e8a9336a29..561eb82d7af6 100644
--- a/arch/microblaze/kernel/signal.c
+++ b/arch/microblaze/kernel/signal.c
@@ -31,7 +31,7 @@
#include <linux/personality.h>
#include <linux/percpu.h>
#include <linux/linkage.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>
#include <asm/entry.h>
#include <asm/ucontext.h>
#include <linux/uaccess.h>
@@ -311,5 +311,5 @@ asmlinkage void do_notify_resume(struct pt_regs *regs, int in_syscall)
do_signal(regs, in_syscall);

if (test_thread_flag(TIF_NOTIFY_RESUME))
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);
}
diff --git a/arch/mips/kernel/signal.c b/arch/mips/kernel/signal.c
index 5bce782e694c..1a99f26bf99f 100644
--- a/arch/mips/kernel/signal.c
+++ b/arch/mips/kernel/signal.c
@@ -25,7 +25,7 @@
#include <linux/compiler.h>
#include <linux/syscalls.h>
#include <linux/uaccess.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>

#include <asm/abi.h>
#include <asm/asm.h>
@@ -916,7 +916,7 @@ asmlinkage void do_notify_resume(struct pt_regs *regs, void *unused,
do_signal(regs);

if (thread_info_flags & _TIF_NOTIFY_RESUME)
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);

user_enter();
}
diff --git a/arch/nds32/kernel/signal.c b/arch/nds32/kernel/signal.c
index 7e3ca430a223..551caef595cb 100644
--- a/arch/nds32/kernel/signal.c
+++ b/arch/nds32/kernel/signal.c
@@ -6,7 +6,7 @@
#include <linux/ptrace.h>
#include <linux/personality.h>
#include <linux/freezer.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>
#include <linux/uaccess.h>

#include <asm/cacheflush.h>
@@ -380,5 +380,5 @@ do_notify_resume(struct pt_regs *regs, unsigned int thread_flags)
do_signal(regs);

if (thread_flags & _TIF_NOTIFY_RESUME)
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);
}
diff --git a/arch/nios2/kernel/signal.c b/arch/nios2/kernel/signal.c
index 2009ae2d3c3b..530b60c99545 100644
--- a/arch/nios2/kernel/signal.c
+++ b/arch/nios2/kernel/signal.c
@@ -15,7 +15,7 @@
#include <linux/uaccess.h>
#include <linux/unistd.h>
#include <linux/personality.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>

#include <asm/ucontext.h>
#include <asm/cacheflush.h>
@@ -319,7 +319,7 @@ asmlinkage int do_notify_resume(struct pt_regs *regs)
return restart;
}
} else if (test_thread_flag(TIF_NOTIFY_RESUME))
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);

return 0;
}
diff --git a/arch/openrisc/kernel/signal.c b/arch/openrisc/kernel/signal.c
index 92c5b70740f5..80f69740c731 100644
--- a/arch/openrisc/kernel/signal.c
+++ b/arch/openrisc/kernel/signal.c
@@ -21,7 +21,7 @@
#include <linux/ptrace.h>
#include <linux/unistd.h>
#include <linux/stddef.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>

#include <asm/processor.h>
#include <asm/syscall.h>
@@ -309,7 +309,7 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
}
syscall = 0;
} else {
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);
}
}
local_irq_disable();
diff --git a/arch/parisc/kernel/signal.c b/arch/parisc/kernel/signal.c
index 46b1050640b8..2f7ebe9add20 100644
--- a/arch/parisc/kernel/signal.c
+++ b/arch/parisc/kernel/signal.c
@@ -22,7 +22,7 @@
#include <linux/errno.h>
#include <linux/wait.h>
#include <linux/ptrace.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>
#include <linux/unistd.h>
#include <linux/stddef.h>
#include <linux/compat.h>
@@ -602,5 +602,5 @@ void do_notify_resume(struct pt_regs *regs, long in_syscall)
do_signal(regs, in_syscall);

if (test_thread_flag(TIF_NOTIFY_RESUME))
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);
}
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index b93b87df499d..f7f8620663c7 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -9,7 +9,7 @@
* this archive for more details.
*/

-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>
#include <linux/signal.h>
#include <linux/uprobes.h>
#include <linux/key.h>
@@ -294,7 +294,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags)
}

if (thread_info_flags & _TIF_NOTIFY_RESUME)
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);
}

static unsigned long get_tm_stackpointer(struct task_struct *tsk)
diff --git a/arch/riscv/kernel/signal.c b/arch/riscv/kernel/signal.c
index c2d5ecbe5526..d80bf5896c6f 100644
--- a/arch/riscv/kernel/signal.c
+++ b/arch/riscv/kernel/signal.c
@@ -9,7 +9,7 @@
#include <linux/signal.h>
#include <linux/uaccess.h>
#include <linux/syscalls.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>
#include <linux/linkage.h>

#include <asm/ucontext.h>
@@ -317,5 +317,5 @@ asmlinkage __visible void do_notify_resume(struct pt_regs *regs,
do_signal(regs);

if (thread_info_flags & _TIF_NOTIFY_RESUME)
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);
}
diff --git a/arch/sh/kernel/signal_32.c b/arch/sh/kernel/signal_32.c
index dd3092911efa..90f495d35db2 100644
--- a/arch/sh/kernel/signal_32.c
+++ b/arch/sh/kernel/signal_32.c
@@ -25,7 +25,7 @@
#include <linux/personality.h>
#include <linux/binfmts.h>
#include <linux/io.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>
#include <asm/ucontext.h>
#include <linux/uaccess.h>
#include <asm/cacheflush.h>
@@ -503,5 +503,5 @@ asmlinkage void do_notify_resume(struct pt_regs *regs, unsigned int save_r0,
do_signal(regs, save_r0);

if (thread_info_flags & _TIF_NOTIFY_RESUME)
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);
}
diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
index 6cc124a3bb98..f9fe502b81c6 100644
--- a/arch/sparc/kernel/signal32.c
+++ b/arch/sparc/kernel/signal32.c
@@ -20,7 +20,6 @@
#include <linux/binfmts.h>
#include <linux/compat.h>
#include <linux/bitops.h>
-#include <linux/tracehook.h>

#include <linux/uaccess.h>
#include <asm/ptrace.h>
diff --git a/arch/sparc/kernel/signal_32.c b/arch/sparc/kernel/signal_32.c
index ffab16369bea..80c89b362d8b 100644
--- a/arch/sparc/kernel/signal_32.c
+++ b/arch/sparc/kernel/signal_32.c
@@ -19,7 +19,7 @@
#include <linux/smp.h>
#include <linux/binfmts.h> /* do_coredum */
#include <linux/bitops.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>

#include <linux/uaccess.h>
#include <asm/ptrace.h>
@@ -524,7 +524,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long orig_i0,
if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
do_signal(regs, orig_i0);
if (thread_info_flags & _TIF_NOTIFY_RESUME)
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);
}

asmlinkage int do_sys_sigstack(struct sigstack __user *ssptr,
diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c
index 2a78d2af1265..8b9fc76cd3e0 100644
--- a/arch/sparc/kernel/signal_64.c
+++ b/arch/sparc/kernel/signal_64.c
@@ -15,7 +15,7 @@
#include <linux/errno.h>
#include <linux/wait.h>
#include <linux/ptrace.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>
#include <linux/unistd.h>
#include <linux/mm.h>
#include <linux/tty.h>
@@ -552,7 +552,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long orig_i0, unsigned long
if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
do_signal(regs, orig_i0);
if (thread_info_flags & _TIF_NOTIFY_RESUME)
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);
user_enter();
}

diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 4a420778ed87..80504680be08 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -23,7 +23,7 @@
#include <linux/seq_file.h>
#include <linux/tick.h>
#include <linux/threads.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>
#include <asm/current.h>
#include <asm/mmu_context.h>
#include <linux/uaccess.h>
@@ -104,7 +104,7 @@ void interrupt_end(void)
test_thread_flag(TIF_NOTIFY_SIGNAL))
do_signal(regs);
if (test_thread_flag(TIF_NOTIFY_RESUME))
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);
}

int get_current_pid(void)
diff --git a/arch/xtensa/kernel/signal.c b/arch/xtensa/kernel/signal.c
index f6c949895b3e..6f68649e86ba 100644
--- a/arch/xtensa/kernel/signal.c
+++ b/arch/xtensa/kernel/signal.c
@@ -19,7 +19,7 @@
#include <linux/errno.h>
#include <linux/ptrace.h>
#include <linux/personality.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>
#include <linux/sched/task_stack.h>

#include <asm/ucontext.h>
@@ -511,5 +511,5 @@ void do_notify_resume(struct pt_regs *regs)
do_signal(regs);

if (test_thread_flag(TIF_NOTIFY_RESUME))
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);
}
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 650f7e27989f..4d8be1634bc6 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -28,7 +28,7 @@
#include <linux/atomic.h>
#include <linux/ctype.h>
#include <linux/blk-cgroup.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>
#include <linux/psi.h>
#include <linux/part_stat.h>
#include "blk.h"
diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 07c878d6e323..6813171afccb 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -3,7 +3,7 @@
#define __LINUX_ENTRYKVM_H

#include <linux/static_call_types.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>
#include <linux/syscalls.h>
#include <linux/seccomp.h>
#include <linux/sched.h>
diff --git a/include/linux/resume_user_mode.h b/include/linux/resume_user_mode.h
new file mode 100644
index 000000000000..285189454449
--- /dev/null
+++ b/include/linux/resume_user_mode.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef LINUX_RESUME_USER_MODE_H
+#define LINUX_RESUME_USER_MODE_H
+
+#include <linux/sched.h>
+#include <linux/task_work.h>
+#include <linux/memcontrol.h>
+#include <linux/blk-cgroup.h>
+
+/**
+ * set_notify_resume - cause resume_user_mode_work() to be called
+ * @task: task that will call resume_user_mode_work()
+ *
+ * Calling this arranges that @task will call resume_user_mode_work()
+ * before returning to user mode. If it's already running in user mode,
+ * it will enter the kernel and call resume_user_mode_work() soon.
+ * If it's blocked, it will not be woken.
+ */
+static inline void set_notify_resume(struct task_struct *task)
+{
+ if (!test_and_set_tsk_thread_flag(task, TIF_NOTIFY_RESUME))
+ kick_process(task);
+}
+
+
+/**
+ * resume_user_mode_work - Perform work before returning to user mode
+ * @regs: user-mode registers of @current task
+ *
+ * This is called when %TIF_NOTIFY_RESUME has been set. Now we are
+ * about to return to user mode, and the user state in @regs can be
+ * inspected or adjusted. The caller in arch code has cleared
+ * %TIF_NOTIFY_RESUME before the call. If the flag gets set again
+ * asynchronously, this will be called again before we return to
+ * user mode.
+ *
+ * Called without locks.
+ */
+static inline void resume_user_mode_work(struct pt_regs *regs)
+{
+ clear_thread_flag(TIF_NOTIFY_RESUME);
+ /*
+ * This barrier pairs with task_work_add()->set_notify_resume() after
+ * hlist_add_head(task->task_works);
+ */
+ smp_mb__after_atomic();
+ if (unlikely(task_work_pending(current)))
+ task_work_run();
+
+#ifdef CONFIG_KEYS_REQUEST_CACHE
+ if (unlikely(current->cached_requested_key)) {
+ key_put(current->cached_requested_key);
+ current->cached_requested_key = NULL;
+ }
+#endif
+
+ mem_cgroup_handle_over_high();
+ blkcg_maybe_throttle_current();
+
+ rseq_handle_notify_resume(NULL, regs);
+}
+
+#endif /* LINUX_RESUME_USER_MODE_H */
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 946404ebe10b..9f6b3fd1880a 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -52,56 +52,5 @@
struct linux_binprm;

-/**
- * set_notify_resume - cause tracehook_notify_resume() to be called
- * @task: task that will call tracehook_notify_resume()
- *
- * Calling this arranges that @task will call tracehook_notify_resume()
- * before returning to user mode. If it's already running in user mode,
- * it will enter the kernel and call tracehook_notify_resume() soon.
- * If it's blocked, it will not be woken.
- */
-static inline void set_notify_resume(struct task_struct *task)
-{
- if (!test_and_set_tsk_thread_flag(task, TIF_NOTIFY_RESUME))
- kick_process(task);
-}
-
-/**
- * tracehook_notify_resume - report when about to return to user mode
- * @regs: user-mode registers of @current task
- *
- * This is called when %TIF_NOTIFY_RESUME has been set. Now we are
- * about to return to user mode, and the user state in @regs can be
- * inspected or adjusted. The caller in arch code has cleared
- * %TIF_NOTIFY_RESUME before the call. If the flag gets set again
- * asynchronously, this will be called again before we return to
- * user mode.
- *
- * Called without locks.
- */
-static inline void tracehook_notify_resume(struct pt_regs *regs)
-{
- clear_thread_flag(TIF_NOTIFY_RESUME);
- /*
- * This barrier pairs with task_work_add()->set_notify_resume() after
- * hlist_add_head(task->task_works);
- */
- smp_mb__after_atomic();
- if (unlikely(task_work_pending(current)))
- task_work_run();
-
-#ifdef CONFIG_KEYS_REQUEST_CACHE
- if (unlikely(current->cached_requested_key)) {
- key_put(current->cached_requested_key);
- current->cached_requested_key = NULL;
- }
-#endif
-
- mem_cgroup_handle_over_high();
- blkcg_maybe_throttle_current();
-
- rseq_handle_notify_resume(NULL, regs);
-}

#endif /* <linux/tracehook.h> */
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 79eaf9b4b10d..a86823cad853 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -2,7 +2,7 @@

#include <linux/context_tracking.h>
#include <linux/entry-common.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>
#include <linux/highmem.h>
#include <linux/livepatch.h>
#include <linux/audit.h>
@@ -165,7 +165,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
arch_do_signal_or_restart(regs);

if (ti_work & _TIF_NOTIFY_RESUME)
- tracehook_notify_resume(regs);
+ resume_user_mode_work(regs);

/* Architecture specific TIF work */
arch_exit_to_user_mode_work(regs, ti_work);
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 3ab5f98988c3..9d09f489b60e 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -23,7 +23,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
schedule();

if (ti_work & _TIF_NOTIFY_RESUME)
- tracehook_notify_resume(NULL);
+ resume_user_mode_work(NULL);

ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);
if (ret)
diff --git a/kernel/task_work.c b/kernel/task_work.c
index cc6fccb0e24d..c59e1a49bc40 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -1,7 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/spinlock.h>
#include <linux/task_work.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>

static struct callback_head work_exited; /* all we need is ->next == NULL */

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 09d342c7cbd0..2aaa400f34d6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -59,7 +59,7 @@
#include <linux/oom.h>
#include <linux/lockdep.h>
#include <linux/file.h>
-#include <linux/tracehook.h>
+#include <linux/resume_user_mode.h>
#include <linux/psi.h>
#include <linux/seq_buf.h>
#include "internal.h"
--
2.29.2

2022-03-09 17:11:45

by Eric W. Biederman

Subject: [PATCH 06/13] task_work: Remove unnecessary include from posix_timers.h

Break a header file circular dependency by removing the unnecessary
include of task_work.h from posix_timers.h.

sched.h -> posix-timers.h
posix-timers.h -> task_work.h
task_work.h -> sched.h

Add missing includes of task_work.h to:
arch/x86/mm/tlb.c
kernel/time/posix-cpu-timers.c

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
arch/x86/mm/tlb.c | 1 +
include/linux/posix-timers.h | 1 -
kernel/time/posix-cpu-timers.c | 1 +
3 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a6cf56a14939..6eb4d91d5365 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -9,6 +9,7 @@
#include <linux/cpu.h>
#include <linux/debugfs.h>
#include <linux/sched/smt.h>
+#include <linux/task_work.h>

#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 5bbcd280bfd2..83539bb2f023 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -6,7 +6,6 @@
#include <linux/list.h>
#include <linux/alarmtimer.h>
#include <linux/timerqueue.h>
-#include <linux/task_work.h>

struct kernel_siginfo;
struct task_struct;
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 96b4e7810426..9190d9eb236d 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -15,6 +15,7 @@
#include <linux/workqueue.h>
#include <linux/compat.h>
#include <linux/sched/deadline.h>
+#include <linux/task_work.h>

#include "posix-timers.h"

--
2.29.2

2022-03-10 00:27:19

Subject: Re: [PATCH 05/13] ptrace: Remove tracehook_signal_handler

On Wed, Mar 09, 2022 at 10:24:46AM -0600, Eric W. Biederman wrote:
> The two line function tracehook_signal_handler is only called from
> signal_delivered. Expand it inline in signal_delivered and remove it.
> Just to make it easier to understand what is going on.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-03-10 00:31:18

Subject: Re: [PATCH 07/13] task_work: Introduce task_work_pending

On 3/9/22 4:24 PM, Eric W. Biederman wrote:
> Jens Axboe <[email protected]> writes:
>
>> On 3/9/22 9:24 AM, Eric W. Biederman wrote:
>>> diff --git a/include/linux/task_work.h b/include/linux/task_work.h
>>> index 5b8a93f288bb..897494b597ba 100644
>>> --- a/include/linux/task_work.h
>>> +++ b/include/linux/task_work.h
>>> @@ -19,6 +19,11 @@ enum task_work_notify_mode {
>>> TWA_SIGNAL,
>>> };
>>>
>>> +static inline bool task_work_pending(struct task_struct *task)
>>> +{
>>> + return READ_ONCE(task->task_works);
>>> +}
>>> +
>>
>> Most of the checks for this are on current, do we need READ_ONCE here?
>
> There is a non-current use in fs/io_uring in __io_uring_show_fdinfo
> and another in task_work_cancel_match.
>
> Beyond that there are quite a few writes that are not at all from
> current so even on current task->task_works can change if you look
> twice.
>
> So if only to keep it from making unwarranted assumptions I think
> READ_ONCE makes sense.
>
> Given that READ_ONCE is practically free I don't see where there is
> any harm in using it to document the kind of code we expect the compiler
> to generate.
>
> Looking a second time I see all of the other reads of task->task_works
> are already READ_ONCE in kernel/task_work.c, so really I think if we
> don't want READ_ONCE we need a big fat comment about why it is safe
> in a check like task_work_pending while it is needed everywhere
> else. At the moment I am not smart enough to write that comment.
>
> I will see about adding this bit of discussion in the commit comment to
> make it a little clearer why I am introducing READ_ONCE.

Fair enough, and doesn't warrant a current_tw_pending() helper in that
case either.

--
Jens Axboe

2022-03-10 00:32:01

by Eric W. Biederman

Subject: Re: [PATCH 07/13] task_work: Introduce task_work_pending

Jens Axboe <[email protected]> writes:

> On 3/9/22 9:24 AM, Eric W. Biederman wrote:
>> diff --git a/include/linux/task_work.h b/include/linux/task_work.h
>> index 5b8a93f288bb..897494b597ba 100644
>> --- a/include/linux/task_work.h
>> +++ b/include/linux/task_work.h
>> @@ -19,6 +19,11 @@ enum task_work_notify_mode {
>> TWA_SIGNAL,
>> };
>>
>> +static inline bool task_work_pending(struct task_struct *task)
>> +{
>> + return READ_ONCE(task->task_works);
>> +}
>> +
>
> Most of the checks for this are on current, do we need READ_ONCE here?

There is a non-current use in fs/io_uring in __io_uring_show_fdinfo
and another in task_work_cancel_match.

Beyond that there are quite a few writes that are not at all from
current so even on current task->task_works can change if you look
twice.

So if only to keep it from making unwarranted assumptions I think
READ_ONCE makes sense.

Given that READ_ONCE is practically free I don't see where there is
any harm in using it to document the kind of code we expect the compiler
to generate.

Looking a second time I see all of the other reads of task->task_works
are already READ_ONCE in kernel/task_work.c, so really I think if we
don't want READ_ONCE we need a big fat comment about why it is safe
in a check like task_work_pending while it is needed everywhere
else. At the moment I am not smart enough to write that comment.

I will see about adding this bit of discussion in the commit comment to
make it a little clearer why I am introducing READ_ONCE.

Eric
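
The READ_ONCE argument above can be illustrated with a userspace approximation. This is a sketch only: MODEL_READ_ONCE is a hypothetical simplification built on a volatile access (it relies on the GCC/Clang __typeof__ extension and is not the kernel macro), and struct task_model and struct callback_head_model are invented stand-ins for the kernel structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified stand-in for READ_ONCE(): one volatile load, so the
 * compiler may not cache or re-read the value. Assumption: GCC or
 * Clang, because of __typeof__.
 */
#define MODEL_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical models of struct callback_head and struct task_struct. */
struct callback_head_model {
	struct callback_head_model *next;
};

struct task_model {
	struct callback_head_model *task_works;	/* may be written by others */
};

/* Mirrors the shape of task_work_pending() from the quoted hunk. */
static inline bool model_task_work_pending(struct task_model *task)
{
	assert(task != NULL);
	return MODEL_READ_ONCE(task->task_works) != NULL;
}
```

The helper takes any task pointer, not only the current one, which matches the point that some callers check a task other than current.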

2022-03-10 03:37:40

Subject: Re: [PATCH 07/13] task_work: Introduce task_work_pending

On 3/9/22 9:24 AM, Eric W. Biederman wrote:
> diff --git a/include/linux/task_work.h b/include/linux/task_work.h
> index 5b8a93f288bb..897494b597ba 100644
> --- a/include/linux/task_work.h
> +++ b/include/linux/task_work.h
> @@ -19,6 +19,11 @@ enum task_work_notify_mode {
> TWA_SIGNAL,
> };
>
> +static inline bool task_work_pending(struct task_struct *task)
> +{
> + return READ_ONCE(task->task_works);
> +}
> +

Most of the checks for this are on current, do we need READ_ONCE here?

--
Jens Axboe

2022-03-10 04:43:20

Subject: Re: [PATCH 12/13] resume_user_mode: Move to resume_user_mode.h

On Wed, Mar 09, 2022 at 10:24:53AM -0600, Eric W. Biederman wrote:
> Move set_notify_resume and tracehook_notify_resume into resume_user_mode.h.
> While doing that rename tracehook_notify_resume to resume_user_mode_work.
>
> Update all of the places that included tracehook.h for these functions to
> include resume_user_mode.h instead.
>
> Update all of the callers of tracehook_notify_resume to call
> resume_user_mode_work.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-03-10 06:24:08

Subject: Re: [PATCH 08/13] task_work: Call tracehook_notify_signal from get_signal on all architectures

On Wed, Mar 09, 2022 at 10:24:49AM -0600, Eric W. Biederman wrote:
> Always handle TIF_NOTIFY_SIGNAL in get_signal. With commit 35d0b389f3b2
> ("task_work: unconditionally run task_work from get_signal()") always
> calling task_wofffffffrk_run all of the work of tracehook_notify_signal is

typo: cat on keyboard

> already happening except clearing TIF_NOTIFY_SIGNAL.
>
> Factor clear_notify_signal out of tracehook_notify_signal and use it in
> get_signal so that get_signal only needs one call of trask_work_run.

typo: trask -> task

>
> To keep the semantics in sync update xfer_to_guest_mode_work (which
> does not call get_signal) to call tracehook_notify_signal if either
> _TIF_SIGPENDING or _TIF_NOTIFY_SIGNAL.

I see three logical changes in this patch, I think?

- creation and use of clear_notify_signal()
- removal of handle_signal_work() and removal of
arch_do_signal_or_restart() has_signal arg
- something with get_signal() I don't understand yet:
- why is clear_notify_signal() added?
- why is tracehook_notify_signal() removed?

>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> arch/s390/kernel/signal.c | 4 ++--
> arch/x86/kernel/signal.c | 4 ++--
> include/linux/entry-common.h | 2 +-
> include/linux/tracehook.h | 9 +++++++--
> kernel/entry/common.c | 12 ++----------
> kernel/entry/kvm.c | 2 +-
> kernel/signal.c | 14 +++-----------
> 7 files changed, 18 insertions(+), 29 deletions(-)
>
> diff --git a/arch/s390/kernel/signal.c b/arch/s390/kernel/signal.c
> index 307f5d99514d..ea9e5e8182cd 100644
> --- a/arch/s390/kernel/signal.c
> +++ b/arch/s390/kernel/signal.c
> @@ -453,7 +453,7 @@ static void handle_signal(struct ksignal *ksig, sigset_t *oldset,
> * stack-frames in one go after that.
> */
>
> -void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal)
> +void arch_do_signal_or_restart(struct pt_regs *regs)
> {
> struct ksignal ksig;
> sigset_t *oldset = sigmask_to_save();
> @@ -466,7 +466,7 @@ void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal)
> current->thread.system_call =
> test_pt_regs_flag(regs, PIF_SYSCALL) ? regs->int_code : 0;
>
> - if (has_signal && get_signal(&ksig)) {

Right, the only caller of arch_do_signal_or_restart(),
handle_signal_work(), only happens after its caller has already checked
_TIF_SIGPENDING.

> + if (get_signal(&ksig)) {
> /* Whee! Actually deliver the signal. */
> if (current->thread.system_call) {
> regs->int_code = current->thread.system_call;
> diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
> index ec71e06ae364..de3d5b5724d8 100644
> --- a/arch/x86/kernel/signal.c
> +++ b/arch/x86/kernel/signal.c
> @@ -861,11 +861,11 @@ static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
> * want to handle. Thus you cannot kill init even with a SIGKILL even by
> * mistake.
> */
> -void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal)
> +void arch_do_signal_or_restart(struct pt_regs *regs)
> {
> struct ksignal ksig;
>
> - if (has_signal && get_signal(&ksig)) {
> + if (get_signal(&ksig)) {
> /* Whee! Actually deliver the signal. */
> handle_signal(&ksig, regs);
> return;
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index 9efbdda61f7a..3537fd25f14e 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -257,7 +257,7 @@ static __always_inline void arch_exit_to_user_mode(void) { }
> *
> * Invoked from exit_to_user_mode_loop().
> */
> -void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal);
> +void arch_do_signal_or_restart(struct pt_regs *regs);
>
> /**
> * exit_to_user_mode - Fixup state when exiting to user mode
> diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
> index fa834a22e86e..b44a7820c468 100644
> --- a/include/linux/tracehook.h
> +++ b/include/linux/tracehook.h
> @@ -106,6 +106,12 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
> rseq_handle_notify_resume(NULL, regs);
> }
>
> +static inline void clear_notify_signal(void)
> +{
> + clear_thread_flag(TIF_NOTIFY_SIGNAL);
> + smp_mb__after_atomic();
> +}
> +
> /*
> * called by exit_to_user_mode_loop() if ti_work & _TIF_NOTIFY_SIGNAL. This
> * is currently used by TWA_SIGNAL based task_work, which requires breaking
> @@ -113,8 +119,7 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
> */
> static inline void tracehook_notify_signal(void)
> {
> - clear_thread_flag(TIF_NOTIFY_SIGNAL);
> - smp_mb__after_atomic();
> + clear_notify_signal();
> if (task_work_pending(current))
> task_work_run();
> }
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index f0b1daa1e8da..79eaf9b4b10d 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -139,15 +139,7 @@ void noinstr exit_to_user_mode(void)
> }
>
> /* Workaround to allow gradual conversion of architecture code */
> -void __weak arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal) { }
> -
> -static void handle_signal_work(struct pt_regs *regs, unsigned long ti_work)
> -{
> - if (ti_work & _TIF_NOTIFY_SIGNAL)
> - tracehook_notify_signal();
> -
> - arch_do_signal_or_restart(regs, ti_work & _TIF_SIGPENDING);
> -}
> +void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
>
> static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> unsigned long ti_work)
> @@ -170,7 +162,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> klp_update_patch_state(current);
>
> if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
> - handle_signal_work(regs, ti_work);
> + arch_do_signal_or_restart(regs);
>
> if (ti_work & _TIF_NOTIFY_RESUME)
> tracehook_notify_resume(regs);
> diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
> index 96d476e06c77..cabf36a489e4 100644
> --- a/kernel/entry/kvm.c
> +++ b/kernel/entry/kvm.c
> @@ -8,7 +8,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
> do {
> int ret;
>
> - if (ti_work & _TIF_NOTIFY_SIGNAL)
> + if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
> tracehook_notify_signal();
>
> if (ti_work & _TIF_SIGPENDING) {
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 3b4cf25fb9b3..8632b88982c9 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2626,20 +2626,12 @@ bool get_signal(struct ksignal *ksig)
> struct signal_struct *signal = current->signal;
> int signr;
>
> + clear_notify_signal();

Why is this added?

> if (unlikely(task_work_pending(current)))
> task_work_run();
>
> - /*
> - * For non-generic architectures, check for TIF_NOTIFY_SIGNAL so
> - * that the arch handlers don't all have to do it. If we get here
> - * without TIF_SIGPENDING, just exit after running signal work.
> - */
> - if (!IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
> - if (test_thread_flag(TIF_NOTIFY_SIGNAL))
> - tracehook_notify_signal();

I don't see why this gets removed?

> - if (!task_sigpending(current))
> - return false;
> - }
> + if (!task_sigpending(current))
> + return false;
>
> if (unlikely(uprobe_deny_signal()))
> return false;
> --
> 2.29.2
>

--
Kees Cook

2022-03-10 07:22:31

Subject: Re: [PATCH 07/13] task_work: Introduce task_work_pending

On Wed, Mar 09, 2022 at 10:24:48AM -0600, Eric W. Biederman wrote:
> Wrap the test of task->task_works in a helper function to make
> it clear what is being tested.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-03-10 08:19:20

Subject: Re: [PATCH 10/13] signal: Move set_notify_signal and clear_notify_signal into sched/signal.h

On Wed, Mar 09, 2022 at 10:24:51AM -0600, Eric W. Biederman wrote:
> The header tracehook.h is no place for code to live. The functions
> set_notify_signal and clear_notify_signal are not about signals. They
> are about interruptions that act like signals. The fundamental signal
> primitives wind up calling set_notify_signal and clear_notify_signal.
> Which means they need to be maintained with the signal code.
>
> Since set_notify_signal and clear_notify_signal must be maintained
> with the signal subsystem move them into sched/signal.h and claim them
> as part of the signal subsystem.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-03-10 10:04:00

Subject: Re: [PATCH 04/13] ptrace: Remove arch_syscall_{enter,exit}_tracehook

On Wed, Mar 09, 2022 at 10:24:45AM -0600, Eric W. Biederman wrote:
> These functions are always one-to-one wrappers around
> ptrace_report_syscall_entry and ptrace_report_syscall_exit.
> So directly call the functions they are wrapping instead.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> include/linux/entry-common.h | 43 ++----------------------------------
> kernel/entry/common.c | 4 ++--
> 2 files changed, 4 insertions(+), 43 deletions(-)

nit: This should maybe talk about how this is just removing needless
indirection in the common entry code?

Regardless:

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-03-10 10:25:59

Subject: Re: [PATCH 00/13] Removing tracehook.h

On 3/8/22 5:13 PM, Eric W. Biederman wrote:
>
> While working on cleaning up do_exit I have been having to deal with the
> code in tracehook.h. Unfortunately the code in tracehook.h does not
> make sense as organized.
>
> This set of changes reorganizes things so that tracehook.h no longer
> exists, and so that its current contents are organized in a fashion
> that is a little easier to understand.
>
> The biggest change is that I lean into the fact that get_signal
> always calls task_work_run and remove the logic that tried to
> be smart and decouple task_work_run and get_signal as it has proven
> to not be effective.
>
> This is a conservative change and I am not changing how things
> like signal_pending operate (although it is probably justified).
>
> A new header resume_user_mode.h is added to hold resume_user_mode_work
> which was previously known as tracehook_notify_resume.

This is a nice cleanup. It did bother me adding the TIF_NOTIFY_SIGNAL
bits and work hooks to something unrelated, but that's where other
things resided then. This makes it a lot better.

--
Jens Axboe

2022-03-10 11:01:29

Subject: Re: [PATCH 09/13] task_work: Decouple TIF_NOTIFY_SIGNAL and task_work

On Wed, Mar 09, 2022 at 10:24:50AM -0600, Eric W. Biederman wrote:
> There are a small handful of reasons besides pending signals that the
> kernel might want to break out of interruptible sleeps. The flag
> TIF_NOTIFY_SIGNAL and the helpers that set and clear TIF_NOTIFY_SIGNAL
> provide the infrastructure for breaking out of interruptible
> sleeps and entering the return to user space slow path for those
> cases.
>
> Expand tracehook_notify_signal inline in its callers and remove it,
> which makes clear that TIF_NOTIFY_SIGNAL and task_work are separate
> concepts.
>
> Update the comment on set_notify_signal to more accurately describe
> its purpose.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-03-10 11:35:22

Subject: Re: [PATCH 06/13] task_work: Remove unnecessary include from posix_timers.h

On Wed, Mar 09, 2022 at 10:24:47AM -0600, Eric W. Biederman wrote:
> Break a header file circular dependency by removing the unnecessary
> include of task_work.h from posix_timers.h.
>
> sched.h -> posix-timers.h
> posix-timers.h -> task_work.h
> task_work.h -> sched.h
>
> Add missing includes of task_work.h to:
> arch/x86/mm/tlb.c
> kernel/time/posix-cpu-timers.c
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-03-10 11:42:24

Subject: Re: [PATCH 01/13] ptrace: Move ptrace_report_syscall into ptrace.h

On Wed, Mar 09, 2022 at 10:24:42AM -0600, Eric W. Biederman wrote:
> Move ptrace_report_syscall from tracehook.h into ptrace.h where it
> belongs.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Yes, as others have said, fixing the naming confusion alone is reason
enough. :)

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-03-10 12:41:07

Subject: Re: [PATCH 13/13] tracehook: Remove tracehook.h

On Wed, Mar 09, 2022 at 10:24:54AM -0600, Eric W. Biederman wrote:
> Now that all of the definitions have moved out of tracehook.h into
> ptrace.h, sched/signal.h, resume_user_mode.h there is nothing left in
> tracehook.h so remove it.
>
> Update the few files that were depending upon tracehook.h to bring in
> definitions to use the headers they need directly.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-03-10 14:53:24

by Linus Torvalds

Subject: Re: [PATCH 00/13] Removing tracehook.h

On Tue, Mar 8, 2022 at 4:16 PM Eric W. Biederman <[email protected]> wrote:
>
> While working on cleaning up do_exit I have been having to deal with the
> code in tracehook.h. Unfortunately the code in tracehook.h does not
> make sense as organized. [...]

Thanks, that odd naming has tripped me up several times, this looks
like an improvement.

Linus

2022-03-10 16:06:26

Subject: Re: [PATCH 11/13] resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume

On Wed, Mar 09, 2022 at 10:24:52AM -0600, Eric W. Biederman wrote:
> Every architecture defines TIF_NOTIFY_RESUME so remove the unnecessary
> ifdef.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-03-10 16:25:30

Subject: Re: [PATCH 02/13] ptrace/arm: Rename tracehook_report_syscall report_syscall

On Wed, Mar 09, 2022 at 10:24:43AM -0600, Eric W. Biederman wrote:
> Make the arm and arm64 code more concise and less confusing by
> renaming the architecture specific tracehook_report_syscall to
> report_syscall.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

As a person who is repeatedly stumped trying to find this function,
yes please.

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-03-10 23:27:28

Subject: Re: [PATCH 03/13] ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h

On Wed, Mar 09, 2022 at 10:24:44AM -0600, Eric W. Biederman wrote:
> Rename tracehook_report_syscall_{entry,exit} to
> ptrace_report_syscall_{entry,exit} and place them in ptrace.h
>
> There is no longer any generic tracehook infrastructure so make
> these ptrace specific functions ptrace specific.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-03-11 05:53:59

by Eric W. Biederman

Subject: Re: [PATCH 08/13] task_work: Call tracehook_notify_signal from get_signal on all architectures

Kees Cook <[email protected]> writes:

> On Wed, Mar 09, 2022 at 10:24:49AM -0600, Eric W. Biederman wrote:
>> Always handle TIF_NOTIFY_SIGNAL in get_signal. With commit 35d0b389f3b2
>> ("task_work: unconditionally run task_work from get_signal()") always
>> calling task_wofffffffrk_run all of the work of tracehook_notify_signal is
>
> typo: cat on keyboard
>
>> already happening except clearing TIF_NOTIFY_SIGNAL.
>>
>> Factor clear_notify_signal out of tracehook_notify_signal and use it in
>> get_signal so that get_signal only needs one call of trask_work_run.
>
> typo: trask -> task
>
>>
>> To keep the semantics in sync update xfer_to_guest_mode_work (which
>> does not call get_signal) to call tracehook_notify_signal if either
>> _TIF_SIGPENDING or _TIF_NOTIFY_SIGNAL.

First let me say thanks for the close look at this work.

> I see three logical changes in this patch, I think?
>
> - creation and use of clear_notify_signal()
> - removal of handle_signal_work() and removal of
> arch_do_signal_or_restart() has_signal arg
> - something with get_signal() I don't understand yet:
> - why is clear_notify_signal() added?
> - why is tracehook_notify_signal() removed?

The spoiler is that the change to get_signal is the logical change.
The rest of the changes follow from that change. Please see below.

The inline expansion of tracehook_notify_signal in get_signal and
in its other two callers in the next change is the only real kernel
internal API change in this series of changes.

The optimization that was tried with TIF_NOTIFY_SIGNAL and being able to
only call task_work_run() when TIF_NOTIFY_SIGNAL was set instead of when
get_signal was called failed, and caused a regression. The removal of
calling task_work_run from get_signal has been reverted but the rest
of the change has not been. So this change just removes the rest of
the failed optimization.

Please see below for my detailed description of the get_signal change.

I hope this helps.

>>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> arch/s390/kernel/signal.c | 4 ++--
>> arch/x86/kernel/signal.c | 4 ++--
>> include/linux/entry-common.h | 2 +-
>> include/linux/tracehook.h | 9 +++++++--
>> kernel/entry/common.c | 12 ++----------
>> kernel/entry/kvm.c | 2 +-
>> kernel/signal.c | 14 +++-----------
>> 7 files changed, 18 insertions(+), 29 deletions(-)
>>
>> diff --git a/arch/s390/kernel/signal.c b/arch/s390/kernel/signal.c
>> index 307f5d99514d..ea9e5e8182cd 100644
>> --- a/arch/s390/kernel/signal.c
>> +++ b/arch/s390/kernel/signal.c
>> @@ -453,7 +453,7 @@ static void handle_signal(struct ksignal *ksig, sigset_t *oldset,
>> * stack-frames in one go after that.
>> */
>>
>> -void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal)
>> +void arch_do_signal_or_restart(struct pt_regs *regs)
>> {
>> struct ksignal ksig;
>> sigset_t *oldset = sigmask_to_save();
>> @@ -466,7 +466,7 @@ void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal)
>> current->thread.system_call =
>> test_pt_regs_flag(regs, PIF_SYSCALL) ? regs->int_code : 0;
>>
>> - if (has_signal && get_signal(&ksig)) {
>
> Right, the only caller of arch_do_signal_or_restart(),
> handle_signal_work(), only happens after its caller has already checked
> _TIF_SIGPENDING.

It could be TIF_SIGPENDING or TIF_NOTIFY_SIGNAL. The work for
TIF_NOTIFY_SIGNAL has been moved unconditionally into get_signal.
So it no longer makes sense to care which flag has been set.

>> + if (get_signal(&ksig)) {
>> /* Whee! Actually deliver the signal. */
>> if (current->thread.system_call) {
>> regs->int_code = current->thread.system_call;
>> diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
>> index ec71e06ae364..de3d5b5724d8 100644
>> --- a/arch/x86/kernel/signal.c
>> +++ b/arch/x86/kernel/signal.c
>> @@ -861,11 +861,11 @@ static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
>> * want to handle. Thus you cannot kill init even with a SIGKILL even by
>> * mistake.
>> */
>> -void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal)
>> +void arch_do_signal_or_restart(struct pt_regs *regs)
>> {
>> struct ksignal ksig;
>>
>> - if (has_signal && get_signal(&ksig)) {
>> + if (get_signal(&ksig)) {
>> /* Whee! Actually deliver the signal. */
>> handle_signal(&ksig, regs);
>> return;
>> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
>> index 9efbdda61f7a..3537fd25f14e 100644
>> --- a/include/linux/entry-common.h
>> +++ b/include/linux/entry-common.h
>> @@ -257,7 +257,7 @@ static __always_inline void arch_exit_to_user_mode(void) { }
>> *
>> * Invoked from exit_to_user_mode_loop().
>> */
>> -void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal);
>> +void arch_do_signal_or_restart(struct pt_regs *regs);
>>
>> /**
>> * exit_to_user_mode - Fixup state when exiting to user mode
>> diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
>> index fa834a22e86e..b44a7820c468 100644
>> --- a/include/linux/tracehook.h
>> +++ b/include/linux/tracehook.h
>> @@ -106,6 +106,12 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
>> rseq_handle_notify_resume(NULL, regs);
>> }
>>
>> +static inline void clear_notify_signal(void)
>> +{
>> + clear_thread_flag(TIF_NOTIFY_SIGNAL);
>> + smp_mb__after_atomic();
>> +}
>> +
>> /*
>> * called by exit_to_user_mode_loop() if ti_work & _TIF_NOTIFY_SIGNAL. This
>> * is currently used by TWA_SIGNAL based task_work, which requires breaking
>> @@ -113,8 +119,7 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
>> */
>> static inline void tracehook_notify_signal(void)
>> {
>> - clear_thread_flag(TIF_NOTIFY_SIGNAL);
>> - smp_mb__after_atomic();
>> + clear_notify_signal();
>> if (task_work_pending(current))
>> task_work_run();
>> }
>> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
>> index f0b1daa1e8da..79eaf9b4b10d 100644
>> --- a/kernel/entry/common.c
>> +++ b/kernel/entry/common.c
>> @@ -139,15 +139,7 @@ void noinstr exit_to_user_mode(void)
>> }
>>
>> /* Workaround to allow gradual conversion of architecture code */
>> -void __weak arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal) { }
>> -
>> -static void handle_signal_work(struct pt_regs *regs, unsigned long ti_work)
>> -{
>> - if (ti_work & _TIF_NOTIFY_SIGNAL)
>> - tracehook_notify_signal();
>> -
>> - arch_do_signal_or_restart(regs, ti_work & _TIF_SIGPENDING);
>> -}

With the work of tracehook_notify_signal happening in get_signal (called
from arch_do_signal_or_restart) there is no longer a reason to call
tracehook_notify_signal on its own, or to remember whether it was
TIF_NOTIFY_SIGNAL or TIF_SIGPENDING which was set.
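
To illustrate the reasoning above, here is a small userspace model (a sketch only, not kernel code) comparing the old handle_signal_work path with the new direct call. The struct model_state, the flag bit values, and the helper names old_exit_path, new_exit_path, model_get_signal and run_pending_work are all hypothetical and exist only for this example.

```c
#include <stdbool.h>

/* Illustrative bit values only; the real ones are per-architecture. */
#define _TIF_SIGPENDING    (1UL << 0)
#define _TIF_NOTIFY_SIGNAL (1UL << 1)

/* Hypothetical model of the state the exit-to-user path looks at. */
struct model_state {
	unsigned long flags;  /* models the thread flags */
	int pending_work;     /* models queued task_work items */
	int work_done;        /* how many items have been run */
	int delivered;        /* how many signals were delivered */
};

static void run_pending_work(struct model_state *s)
{
	s->work_done += s->pending_work;
	s->pending_work = 0;
}

/* Models get_signal() after the patch: clear, run work, check pending. */
static bool model_get_signal(struct model_state *s)
{
	s->flags &= ~_TIF_NOTIFY_SIGNAL;
	run_pending_work(s);
	if (!(s->flags & _TIF_SIGPENDING))
		return false;
	s->flags &= ~_TIF_SIGPENDING; /* pretend the one signal is dequeued */
	return true;
}

/* Old shape: handle_signal_work() plus the has_signal argument. */
static void old_exit_path(struct model_state *s)
{
	unsigned long ti_work = s->flags;

	if (!(ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)))
		return;
	if (ti_work & _TIF_NOTIFY_SIGNAL) {
		/* tracehook_notify_signal() expanded inline */
		s->flags &= ~_TIF_NOTIFY_SIGNAL;
		run_pending_work(s);
	}
	bool has_signal = ti_work & _TIF_SIGPENDING;
	if (has_signal && model_get_signal(s))
		s->delivered++;
}

/* New shape: call straight through and let get_signal sort it out. */
static void new_exit_path(struct model_state *s)
{
	unsigned long ti_work = s->flags;

	if (!(ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)))
		return;
	if (model_get_signal(s))
		s->delivered++;
}
```

Within this simplified model both paths leave the flags, the work count and the delivery count in the same state for every combination of the two flags, which is why the caller no longer has to remember which flag was set.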

>> +void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
>>
>> static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>> unsigned long ti_work)
>> @@ -170,7 +162,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>> klp_update_patch_state(current);
>>
>> if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
>> - handle_signal_work(regs, ti_work);
>> + arch_do_signal_or_restart(regs);
>>
>> if (ti_work & _TIF_NOTIFY_RESUME)
>> tracehook_notify_resume(regs);
>> diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
>> index 96d476e06c77..cabf36a489e4 100644
>> --- a/kernel/entry/kvm.c
>> +++ b/kernel/entry/kvm.c
>> @@ -8,7 +8,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
>> do {
>> int ret;
>>
>> - if (ti_work & _TIF_NOTIFY_SIGNAL)
>> + if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
>> tracehook_notify_signal();
>>
>> if (ti_work & _TIF_SIGPENDING) {
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 3b4cf25fb9b3..8632b88982c9 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -2626,20 +2626,12 @@ bool get_signal(struct ksignal *ksig)
>> struct signal_struct *signal = current->signal;
>> int signr;
>>
>> + clear_notify_signal();
>
> Why is this added?

See below.
>
>> if (unlikely(task_work_pending(current)))
>> task_work_run();
>>
>> - /*
>> - * For non-generic architectures, check for TIF_NOTIFY_SIGNAL so
>> - * that the arch handlers don't all have to do it. If we get here
>> - * without TIF_SIGPENDING, just exit after running signal work.
>> - */
>> - if (!IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
>> - if (test_thread_flag(TIF_NOTIFY_SIGNAL))
>> - tracehook_notify_signal();
>
> I don't see why this gets removed?

This is the core of this patch; the rest
follows from it.

The definition of tracehook_notify_signal is:

static inline void tracehook_notify_signal(void)
{
	clear_notify_signal();
	if (task_work_pending(current))
		task_work_run();
}

Which means the only difference between:
	if (unlikely(task_work_pending(current)))
		task_work_run()

and tracehook_notify_signal is the work done by clear_notify_signal.
So I added the missing work and removed the fancy conditional.

There are only two architectures that define CONFIG_GENERIC_ENTRY:
x86 and s390. At some point the task_work_run was moved completely
out of get_signal (on x86 and s390) and it was assumed it was enough
to call tracehook_notify_signal when TIF_NOTIFY_SIGNAL was set.

That turned out to be false and kernel regressions were encountered. So
the call to task_work_run was added back to get_signal.

Which is where this change comes in. There is an unconditional call to
task_work_run() in get_signal, and a funny conditional call to
task_work_run().

So this change is just changing the kernel to unconditionally call
task_work_run() in get_signal(), as well as clearing TIF_NOTIFY_SIGNAL.

The result is that on the common path tracehook_notify_signal no longer
needs to be called.

My next change removes tracehook_notify_signal entirely. Which is
the other reason I add clear_notify_signal.
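
The equivalence described above can be checked with a small userspace model. This is only a sketch under stated assumptions: struct model_task and the functions get_signal_old and get_signal_new are hypothetical, the old variant models the !CONFIG_GENERIC_ENTRY branch, and the return value only says whether the function would go on to dequeue a signal.

```c
#include <stdbool.h>

/* Hypothetical model: plain fields stand in for thread flags and lists. */
struct model_task {
	bool notify_signal; /* models TIF_NOTIFY_SIGNAL */
	bool sigpending;    /* models TIF_SIGPENDING */
	int queued_work;    /* models a non-empty task->task_works list */
	int work_runs;      /* how many queued items have been run */
};

static void clear_notify_signal(struct model_task *t)
{
	t->notify_signal = false;
}

static bool task_work_pending(const struct model_task *t)
{
	return t->queued_work > 0;
}

static void task_work_run(struct model_task *t)
{
	t->work_runs += t->queued_work;
	t->queued_work = 0;
}

/* Start of get_signal() before the patch (non-generic-entry branch). */
static bool get_signal_old(struct model_task *t)
{
	if (task_work_pending(t))
		task_work_run(t);

	if (t->notify_signal) {
		/* tracehook_notify_signal() expanded inline */
		clear_notify_signal(t);
		if (task_work_pending(t))
			task_work_run(t);
	}
	return t->sigpending;
}

/* Start of get_signal() after the patch. */
static bool get_signal_new(struct model_task *t)
{
	clear_notify_signal(t);
	if (task_work_pending(t))
		task_work_run(t);
	return t->sigpending;
}
```

In this model the only work the old conditional added beyond the unconditional task_work_run was clearing the flag, so doing the clear up front gives the same end state with a single call to task_work_run.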

>
>> - if (!task_sigpending(current))
>> - return false;
>> - }
>> + if (!task_sigpending(current))
>> + return false;
>>
>> if (unlikely(uprobe_deny_signal()))
>> return false;
>> --
>> 2.29.2
>>

Eric

2022-03-11 14:38:29

Subject: Re: [PATCH 08/13] task_work: Call tracehook_notify_signal from get_signal on all architectures

On Thu, Mar 10, 2022 at 01:04:52PM -0600, Eric W. Biederman wrote:
> Kees Cook <[email protected]> writes:
>
> > On Wed, Mar 09, 2022 at 10:24:49AM -0600, Eric W. Biederman wrote:
> >> Always handle TIF_NOTIFY_SIGNAL in get_signal. With commit 35d0b389f3b2
> >> ("task_work: unconditionally run task_work from get_signal()") always
> >> calling task_wofffffffrk_run all of the work of tracehook_notify_signal is
> >
> > typo: cat on keyboard
> >
> >> already happening except clearing TIF_NOTIFY_SIGNAL.
> >>
> >> Factor clear_notify_signal out of tracehook_notify_signal and use it in
> >> get_signal so that get_signal only needs one call of trask_work_run.
> >
> > typo: trask -> task
> >
> >>
> >> To keep the semantics in sync update xfer_to_guest_mode_work (which
> >> does not call get_signal) to call tracehook_notify_signal if either
> >> _TIF_SIGPENDING or _TIF_NOTIFY_SIGNAL.
>
> First let me say thanks for the close look at this work.
>
> > I see three logical changes in this patch, I think?
> >
> > - creation and use of clear_notify_signal()
> > - removal of handle_signal_work() and removal of
> > arch_do_signal_or_restart() has_signal arg
> > - something with get_signal() I don't understand yet:
> > - why is clear_notify_signal() added?
> > - why is tracehook_notify_signal() removed?
>
>
> The spoiler is the change to get_signal is the logical change.
> The rest of the changes follow from that change. Please see below.
>
> The inline expansion of tracehook_notify_signal in get_signal and
> in it's other two callers in the next change is the only real kernel
> internal api change in this series of changes.
>
> The optimization that was tried with TIF_NOTIFY_SIGNAL and being able to
> only call task_work_run() when TIF_NOTIFY_SIGNAL was set instead of when
> get_signal was called failed, and caused a regression. The removal of
> calling task_work_run from get_signal has been reverted but the rest
> of the change had not been. So this change just removes the rest of
> the failed optimization.
>
> Please see below for my detailed description of the get_signal change.
>
> I hope this helps.

It does! Thanks very much for the additional details.

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-03-17 05:45:25

by Eric W. Biederman

Subject: [PATCH 0/2] ptrace: Making the ptrace changes atomic

While working on cleaning up the exit path I have had occasion to look
at what guarantees are provided for setting and reading the fields that
are provided in task_struct for ptrace. Namely exit_code,
ptrace_message, and last_siginfo. It turns out that as the ptrace
interface in the kernel was extended, the old existing interfaces
were just wrapped and not properly updated to handle
the new functionality. This led to races and inconsistencies.

This fixes the reason for the races and inconsistencies by moving the
work of maintaining the ptrace fields into ptrace_stop.

The inconsistency that results in some ptrace_stop points continuing
with a signal while others will not I have left alone as it appears to
be part of our userspace ABI, and changing that risks breaking
userspace.

Eric W. Biederman (2):
ptrace: Move setting/clearing ptrace_message into ptrace_stop
ptrace: Return the signal to continue with from ptrace_stop

include/linux/ptrace.h | 17 +++++++----------
include/uapi/linux/ptrace.h | 2 +-
kernel/signal.c | 40 ++++++++++++++++++++++++----------------
3 files changed, 32 insertions(+), 27 deletions(-)

Eric

2022-03-29 00:24:24

by Eric W. Biederman

Subject: [GIT PULL] ptrace: Cleanups for v5.18

Linus,

Please pull the ptrace-cleanups-for-v5.18 tag from the git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git ptrace-cleanups-for-v5.18

HEAD: dcbc65aac28360df5f5a3b613043ccc0e81da3cf ptrace: Remove duplicated include in ptrace.c

This set of changes removes tracehook.h, moves modification of all of
the ptrace fields inside of siglock to remove races, and adds a missing
permission check to ptrace.c.

The removal of tracehook.h is quite significant as it has been a major
source of confusion in recent years. Much of that confusion was
around task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled
making the semantics clearer).

For people who don't know, tracehook.h is a vestige of an attempt to
implement uprobes-like functionality that was never fully merged, and
was later superseded by uprobes when uprobes was merged. For many
years now we have been removing what tracehook functionality remained
a little bit at a time. To the point where now anything left in
tracehook.h is some weird strange thing that is difficult to understand.

Eric W. Biederman (15):
ptrace: Move ptrace_report_syscall into ptrace.h
ptrace/arm: Rename tracehook_report_syscall report_syscall
ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
ptrace: Remove arch_syscall_{enter,exit}_tracehook
ptrace: Remove tracehook_signal_handler
task_work: Remove unnecessary include from posix_timers.h
task_work: Introduce task_work_pending
task_work: Call tracehook_notify_signal from get_signal on all architectures
task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
resume_user_mode: Move to resume_user_mode.h
tracehook: Remove tracehook.h
ptrace: Move setting/clearing ptrace_message into ptrace_stop
ptrace: Return the signal to continue with from ptrace_stop

Jann Horn (1):
ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE

Yang Li (1):
ptrace: Remove duplicated include in ptrace.c

MAINTAINERS | 1 -
arch/Kconfig | 5 +-
arch/alpha/kernel/ptrace.c | 5 +-
arch/alpha/kernel/signal.c | 4 +-
arch/arc/kernel/ptrace.c | 5 +-
arch/arc/kernel/signal.c | 4 +-
arch/arm/kernel/ptrace.c | 12 +-
arch/arm/kernel/signal.c | 4 +-
arch/arm64/kernel/ptrace.c | 14 +--
arch/arm64/kernel/signal.c | 4 +-
arch/csky/kernel/ptrace.c | 5 +-
arch/csky/kernel/signal.c | 4 +-
arch/h8300/kernel/ptrace.c | 5 +-
arch/h8300/kernel/signal.c | 4 +-
arch/hexagon/kernel/process.c | 4 +-
arch/hexagon/kernel/signal.c | 1 -
arch/hexagon/kernel/traps.c | 6 +-
arch/ia64/kernel/process.c | 4 +-
arch/ia64/kernel/ptrace.c | 6 +-
arch/ia64/kernel/signal.c | 1 -
arch/m68k/kernel/ptrace.c | 5 +-
arch/m68k/kernel/signal.c | 4 +-
arch/microblaze/kernel/ptrace.c | 5 +-
arch/microblaze/kernel/signal.c | 4 +-
arch/mips/kernel/ptrace.c | 5 +-
arch/mips/kernel/signal.c | 4 +-
arch/nds32/include/asm/syscall.h | 2 +-
arch/nds32/kernel/ptrace.c | 5 +-
arch/nds32/kernel/signal.c | 4 +-
arch/nios2/kernel/ptrace.c | 5 +-
arch/nios2/kernel/signal.c | 4 +-
arch/openrisc/kernel/ptrace.c | 5 +-
arch/openrisc/kernel/signal.c | 4 +-
arch/parisc/kernel/ptrace.c | 7 +-
arch/parisc/kernel/signal.c | 4 +-
arch/powerpc/kernel/ptrace/ptrace.c | 8 +-
arch/powerpc/kernel/signal.c | 4 +-
arch/riscv/kernel/ptrace.c | 5 +-
arch/riscv/kernel/signal.c | 4 +-
arch/s390/include/asm/entry-common.h | 1 -
arch/s390/kernel/ptrace.c | 1 -
arch/s390/kernel/signal.c | 5 +-
arch/sh/kernel/ptrace_32.c | 5 +-
arch/sh/kernel/signal_32.c | 4 +-
arch/sparc/kernel/ptrace_32.c | 5 +-
arch/sparc/kernel/ptrace_64.c | 5 +-
arch/sparc/kernel/signal32.c | 1 -
arch/sparc/kernel/signal_32.c | 4 +-
arch/sparc/kernel/signal_64.c | 4 +-
arch/um/kernel/process.c | 4 +-
arch/um/kernel/ptrace.c | 5 +-
arch/x86/kernel/ptrace.c | 1 -
arch/x86/kernel/signal.c | 5 +-
arch/x86/mm/tlb.c | 1 +
arch/xtensa/kernel/ptrace.c | 5 +-
arch/xtensa/kernel/signal.c | 4 +-
block/blk-cgroup.c | 2 +-
fs/coredump.c | 1 -
fs/exec.c | 1 -
fs/io-wq.c | 6 +-
fs/io_uring.c | 11 +-
fs/proc/array.c | 1 -
fs/proc/base.c | 1 -
include/asm-generic/syscall.h | 2 +-
include/linux/entry-common.h | 47 +-------
include/linux/entry-kvm.h | 2 +-
include/linux/posix-timers.h | 1 -
include/linux/ptrace.h | 81 ++++++++++++-
include/linux/resume_user_mode.h | 64 ++++++++++
include/linux/sched/signal.h | 17 +++
include/linux/task_work.h | 5 +
include/linux/tracehook.h | 226 -----------------------------------
include/uapi/linux/ptrace.h | 2 +-
kernel/entry/common.c | 19 +--
kernel/entry/kvm.c | 9 +-
kernel/exit.c | 3 +-
kernel/livepatch/transition.c | 1 -
kernel/ptrace.c | 47 +++++---
kernel/seccomp.c | 1 -
kernel/signal.c | 62 +++++-----
kernel/task_work.c | 4 +-
kernel/time/posix-cpu-timers.c | 1 +
mm/memcontrol.c | 2 +-
security/apparmor/domain.c | 1 -
security/selinux/hooks.c | 1 -
85 files changed, 372 insertions(+), 495 deletions(-)

Signed-off-by: "Eric W. Biederman" <[email protected]>

2022-03-29 00:30:06

Subject: Re: [GIT PULL] ptrace: Cleanups for v5.18

On 3/28/22 5:56 PM, Eric W. Biederman wrote:
>
> Linus,
>
> Please pull the ptrace-cleanups-for-v5.18 tag from the git tree:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git ptrace-cleanups-for-v5.18
>
> HEAD: dcbc65aac28360df5f5a3b613043ccc0e81da3cf ptrace: Remove duplicated include in ptrace.c
>
> This set of changes removes tracehook.h, moves modification of all of
> the ptrace fields inside of siglock to remove races, and adds a missing
> permission check to ptrace.c.
>
> The removal of tracehook.h is quite significant as it has been a major
> source of confusion in recent years. Much of that confusion was
> around task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled
> making the semantics clearer).
>
> For people who don't know, tracehook.h is a vestige of an attempt to
> implement uprobes-like functionality that was never fully merged, and
> was later superseded by uprobes when uprobes was merged. For many
> years now we have been removing tracehook functionality a little
> bit at a time, to the point where anything left in tracehook.h is
> some weird strange thing that is difficult to understand.

FWIW, the notify/task_work/io_uring changes look good to me. Thanks for
cleaning this up, Eric.

--
Jens Axboe

2022-03-29 00:49:07

by Linus Torvalds

Subject: Re: [GIT PULL] ptrace: Cleanups for v5.18

On Mon, Mar 28, 2022 at 4:56 PM Eric W. Biederman <[email protected]> wrote:
>
> The removal of tracehook.h is quite significant as it has been a major
> source of confusion in recent years. Much of that confusion was
> around task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled
> making the semantics clearer).

Hmm. I love removing tracehook.h, but this looks like it hasn't been
in linux-next.

The header file changes mess with other changes, and we have

kernel/sched/fair.c:2884:9: error: implicit declaration of function
‘init_task_work’; did you mean ‘init_irq_work’?
[-Werror=implicit-function-declaration]
2884 | init_task_work(&p->numa_work, task_numa_work);
| ^~~~~~~~~~~~~~

as a result (also a few other things in that same file).

Now, this is trivial to fix - just add an include for
<linux/task_work.h> from that file - and that's the right thing to do
anyway.

But I'm a bit unhappy that this was either not tested in linux-next,
or if it was, I wasn't notified about the semantic conflict in the pull
request.

So I've pulled this, and fixed up things in my merge, but I'm a bit
worried that there might be other situations like this where some
header file is no longer included and it was included implicitly
before through that disgusting tracehook.h header..

I *hope* it was just the scheduler header file updates that ended up
having this effect, and nothing else is affected.

Let's see if the test robots start complaining about non-x86
architecture-specific stuff that I don't build test.

Linus

2022-03-29 00:49:20

by pr-tracker-bot

Subject: Re: [GIT PULL] ptrace: Cleanups for v5.18

The pull request you sent on Mon, 28 Mar 2022 18:56:46 -0500:

> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git ptrace-cleanups-for-v5.18

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/1930a6e739c4b4a654a69164dbe39e554d228915

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

2022-03-29 01:06:17

by Stephen Rothwell

Subject: Re: [GIT PULL] ptrace: Cleanups for v5.18

Hi Linus,

On Mon, 28 Mar 2022 17:33:52 -0700 Linus Torvalds <[email protected]> wrote:
>
> On Mon, Mar 28, 2022 at 4:56 PM Eric W. Biederman <[email protected]> wrote:
> >
> > The removal of tracehook.h is quite significant as it has been a major
> > source of confusion in recent years. Much of that confusion was
> > around task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled
> > making the semantics clearer).
>
> Hmm. I love removing tracehook.h, but this looks like it hasn't been
> in linux-next.

See https://lore.kernel.org/lkml/[email protected]/

--
Cheers,
Stephen Rothwell

2022-03-29 01:11:04

by Linus Torvalds

Subject: Re: [GIT PULL] ptrace: Cleanups for v5.18

On Mon, Mar 28, 2022 at 5:53 PM Stephen Rothwell <[email protected]> wrote:
>
> See https://lore.kernel.org/lkml/[email protected]/

Ok, so it was known, just not reported to me.

And the good news seems to be that at least there isn't anything
hiding elsewhere.

Linus

2022-03-29 05:46:10

by Linus Torvalds

Subject: Re: [GIT PULL] ptrace: Cleanups for v5.18

On Mon, Mar 28, 2022 at 8:38 PM Eric W. Biederman <[email protected]> wrote:
>
> Dumb question because this seems to be burning a few extra creativity
> points. Is there any way to create a signed tag and a branch with the
> same name?

Oh, absolutely.

But you may find it annoying, because as you noticed:

> Having a tag and a branch with the same name seems to completely confuse
> git and it just tells me no I won't push anything to another git tree,
> because what you are asking me to do is ambiguous.

Not at all.

git is not at all confused by the situation, git is in fact very aware
of it indeed.

But as git then tells you, exactly *because* it is aware that you have
picked the same name for both a branch and a tag, it will keep warning
you about the ambiguity of said name (but after warning, will
generally then preferentially use the tag of that name over the branch
of the same name).

So if you have both a branch and a tag named 'xyz', you generally need
to disambiguate them when you use them. That will make git happy,
because now it's not ambiguous any more.

(Technical detail: some git operations work on specific namespaces, so
"git branch -d xyz" should never remove a _tag_ called 'xyz', and as
such the name isn't ambiguous in the context of that git command)
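
A minimal sketch of that parenthetical, run in a throwaway repository.
The temporary directory, the demo identity and the name 'xyz' are all
assumptions made for illustration; the point is only that deleting the
branch leaves the tag of the same short name alone.

```shell
#!/bin/sh
set -e
# Throwaway identity so commits work in a clean environment (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git commit -q --allow-empty -m "base"

# A branch and a tag sharing the short name 'xyz'.
git branch xyz
git tag xyz

# 'git branch -d' only ever looks in refs/heads/, so this is not ambiguous.
git branch -q -d xyz

if git show-ref --verify --quiet refs/heads/xyz; then branch_left=yes; else branch_left=no; fi
if git show-ref --verify --quiet refs/tags/xyz; then tag_left=yes; else tag_left=no; fi
echo "branch left: $branch_left, tag left: $tag_left"
```

The same holds the other way around: "git tag -d xyz" only looks in
refs/tags/.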

And disambiguating them is simple, but I suspect you'll find it's
annoying enough that you simply don't want to use the same name for
tags and branches.

The full name of a tag 'x' is actually 'refs/tags/x', and the full
unambiguous name of a branch 'x' is 'refs/heads/x'.

So technically a tag and a branch can never _really_ have the same
name, because internally they have longer unambiguous names.

You would almost never actually use that full name, it's mostly an
internal git thing. Because even if you have ambiguous branch and tag
names, you'd then still shorten it to just 'tags/x' and 'heads/x'
respectively.
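
As a sketch of how the full names help with the push problem from the
question, assuming a throwaway repository and a local bare repository
standing in for the real remote (both hypothetical, as is the demo
identity), and an annotated rather than a signed tag so that no GPG key
is needed:

```shell
#!/bin/sh
set -e
# Throwaway identity so commits and tags work in a clean environment.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
git init -q --bare "$work/remote.git"   # stand-in for the real remote
git init -q "$work/repo"
cd "$work/repo"

git commit -q --allow-empty -m "first"
git commit -q --allow-empty -m "second"

# A branch and an annotated tag that share the short name 'xyz'.
git branch xyz
git tag -a -m "demo tag" xyz HEAD~1

# Spell out both sides of each refspec: nothing is left to guess.
git push -q "$work/remote.git" \
    refs/heads/xyz:refs/heads/xyz \
    refs/tags/xyz:refs/tags/xyz

# List what the remote ended up with (peeled tag entries suppressed).
git ls-remote --refs "$work/remote.git"
```

Pushing only the tag is the same command with just the refs/tags/xyz
refspec.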

Git has other "namespaces" for these refs, too, notably
'refs/remotes/'. In fact, I guess you could make up your own namespace
too if you really wanted, although I don't think anybody really has
ever had a good reason for it.

(There is also the non-namespaced special refs like HEAD and
FETCH_HEAD and MERGE_HEAD for "current branch", "what you fetched" and
"what you are merging")

So you *can* do this:

# create and check out the branch 'xyz'
$ git checkout -b xyz master

# create a tag called 'xyz', but to confuse everybody, make it
# point somewhere else
$ git tag xyz master~3

# look at what 'xyz' means:
$ git rev-parse xyz
warning: refname 'xyz' is ambiguous.
cffb2b72d3ed47f5093d128bd44d9ce136b6b5af

# Hmm. it warned about it being ambiguous
$ git rev-parse heads/xyz
1930a6e739c4b4a654a69164dbe39e554d228915

# Ok, it clearly took the tag, not the branch:
$ git rev-parse tags/xyz
cffb2b72d3ed47f5093d128bd44d9ce136b6b5af

so as you can see, yes, you can work with a tag and a branch that have
the same 'short name', but having to disambiguate them all the time
will likely just drive you mad.

And it's worth pointing out that the name does not imply a
relationship. So the branch called 'xyz' (ie refs/heads/xyz) has
absolutely *no* relationship to the tag called 'xyz' (ie
refs/tags/xyz) except for that ambiguous short name. So updating the
branch - by perhaps committing more to it - will in no way affect the
tag.
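
A small sketch of that independence, again in a throwaway repository
with a made-up identity (both assumptions for illustration): the tag
and the branch start at the same commit, and only the branch moves when
a new commit is made.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git commit -q --allow-empty -m "base"

# Check out branch 'xyz' and tag the very same commit as 'xyz'.
git checkout -q -b xyz
git tag xyz
tag_before=$(git rev-parse refs/tags/xyz)

# Committing advances refs/heads/xyz (HEAD points at it) and nothing else.
git commit -q --allow-empty -m "more work on the branch"
branch_after=$(git rev-parse refs/heads/xyz)
tag_after=$(git rev-parse refs/tags/xyz)

echo "tag before:   $tag_before"
echo "tag after:    $tag_after"
echo "branch after: $branch_after"
```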

Also note that branches and tags are both "just refs" as far as git is
concerned, but git *does* give some semantic meaning to the
namespaces.

So the branch namespace (ie those 'refs/heads/*' things) holds the refs
that get updated automatically as you make new commits.

In contrast, the tag namespace is something you *can* update, but it's
considered odd, and if you want to overwrite an existing tag you
generally need to do something special (eg "git tag -f" to force
overwriting the existing tag rather than being told that you already
have one).
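
A sketch of that behaviour, assuming a throwaway repository and a
hypothetical tag name 'v-demo': the plain "git tag" refuses to move an
existing tag, and "git tag -f" moves it.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git commit -q --allow-empty -m "first"
git commit -q --allow-empty -m "second"

# Tag the older commit.
git tag v-demo HEAD~1

# Re-using the name without -f is refused.
if git tag v-demo HEAD 2>/dev/null; then refused=no; else refused=yes; fi

# With -f the tag is moved to the new target.
git tag -f v-demo HEAD >/dev/null

echo "plain retag refused: $refused"
echo "tag now at: $(git rev-parse refs/tags/v-demo)"
```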

So in a very real way, to git a ref is a ref is a ref, with very
little to no real *technical* distinction. They are just ways to point
to the SHA1 hash of an object. But there are basically some common
semantic rules that are based on the namespaces, and all those git
operations that use certain namespaces by default.

Linus

2022-03-29 10:05:09

by Eric W. Biederman

Subject: Re: [GIT PULL] ptrace: Cleanups for v5.18

Linus Torvalds <[email protected]> writes:

> On Mon, Mar 28, 2022 at 4:56 PM Eric W. Biederman <[email protected]> wrote:
>>
>> The removal of tracehook.h is quite significant as it has been a major
>> source of confusion in recent years. Much of that confusion was
>> around task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled
>> making the semantics clearer).
>
> Hmm. I love removing tracehook.h, but this looks like it hasn't been
> in linux-next.
>
> The header file changes mess with other changes, and we have
>
> kernel/sched/fair.c:2884:9: error: implicit declaration of function
> ‘init_task_work’; did you mean ‘init_irq_work’?
> [-Werror=implicit-function-declaration]
> 2884 | init_task_work(&p->numa_work, task_numa_work);
> | ^~~~~~~~~~~~~~
>
> as a result (also a few other things in that same file).
>
> Now, this is trivial to fix - just add an include for
> <linux/task_work.h> from that file - and that's the right thing to do
> anyway.
>
> But I'm a bit unhappy that this was either not tested in linux-next,
> or if it was, I wasn't notified about the semantic conflict in the pull
> request.
>
> So I've pulled this, and fixed up things in my merge, but I'm a bit
> worried that there might be other situations like this where some
> header file is no longer included and it was included implicitly
> before through that disgusting tracehook.h header..
>
> I *hope* it was just the scheduler header file updates that ended up
> having this effect, and nothing else is affected.
>
> Let's see if the test robots start complaining about non-x86
> architecture-specific stuff that I don't build test.

Sorry for not mentioning that. I had tracked it down. It was
fundamentally in the scheduler headers changes removing an include of
task_work.h, so it didn't feel like there was anything I could do in my
tree. I asked Ingo if he could fix his tree and unfortunately forgot
about it.

For the record there were also a couple of other pretty trivial
conflicts: the removal of nds32, and a blk-cgroup header where a line
adjacent to the one I was changing was modified. But it thankfully
looks like none of those caused you any problems.

Sorry about all of that. I have been running pretty weak these
last couple of days as a cold has been running through the household.

Dumb question because this seems to be burning a few extra creativity
points. Is there any way to create a signed tag and a branch with the
same name? Or in general is there a good way to manage topic branches
and then tag them at the end before I send them?

Having a tag and a branch with the same name seems to completely confuse
git and it just tells me no I won't push anything to another git tree,
because what you are asking me to do is ambiguous. So now I am having
to come up with two names for each topic branch, even if I only push the
tags upstream.

I feel like there is a best practice on how to manage tags and topic
branches and I just haven't seen it yet.

Eric

2022-03-30 10:39:49

by Linus Torvalds

Subject: Re: [GIT PULL] ptrace: Cleanups for v5.18

On Mon, Mar 28, 2022 at 9:49 PM Linus Torvalds
<[email protected]> wrote:
>
> So if you have both a branch and a tag named 'xyz', you generally need
> to disambiguate them when you use them. That will make git happy,
> because now it's not ambiguous any more.

On a similar but very different issue: this is not the only kind of
naming ambiguity you can have.

For example, try this insane thing:

git tag Makefile

that creates a tag called 'Makefile' (pointing to your current HEAD, of course).

Now, guess what happens when you then do

git log Makefile

that's right - git once again notices that you are doing something ambiguous.

In fact, git will be *so* unhappy about this kind of ambiguity that
(unlike the 'tag vs branch' case) it will not prefer one version of
reality over the other, and simply consider this to be a fatal error,
and say

fatal: ambiguous argument 'Makefile': both revision and filename
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'

and as a result you will hopefully go "Oh, I didn't mean to have that
tag at all" and just remove the bogus tagname.

Because you probably made it by mistake.

But you *can* choose to say

git log Makefile --

or

git log refs/tags/Makefile

to make it clear that no, 'Makefile' is not the pathname in the
repository, you really mean the tag 'Makefile'.

Or use

git log -- Makefile

or

git log ./Makefile

to say "yeah, I meant the _pathname_ 'Makefile', not the tag".

Or, if you just like playing games, do

git log Makefile -- Makefile

if you want to track the history of the path 'Makefile' starting at
the tag 'Makefile'.

But don't actually do this for real. There's a reason git notices these things.

Using ambiguous branch names (whether ambiguous with tag-names or with
filenames) is a pain that is not worth it in practice. Git will
notice, and git will allow you to work around it, but it's just a
BadIdea(tm) despite that.

But you probably want to be aware of issues like this if you script
things around git, though.

That "--" form in particular is generally the one you want to use for
scripting, if you want to allow people to do anything they damn well
please. It's the "Give them rope" syntax.
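
A sketch of what that looks like in a script, using a hypothetical
helper called history_of and a throwaway repository that deliberately
has both a file and a tag named 'Makefile' (the helper name, repository
and identity are assumptions, not anything from git itself):

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# history_of REV PATH: one-line history of PATH starting at REV.
# The "--" means $1 is only ever read as a revision and $2 only as a path.
history_of() {
    git log --oneline "$1" -- "$2"
}

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

printf 'all:\n' > Makefile
git add Makefile
git commit -q -m "add Makefile"
git commit -q --allow-empty -m "unrelated change"

# The insane thing: a tag with the same name as a file.
git tag Makefile

# Commits reachable from the tag that touch the file: just the first one.
path_commits=$(history_of Makefile Makefile | wc -l)

# Everything reachable from the tag, no path limit: both commits.
tag_commits=$(git log --oneline Makefile -- | wc -l)

echo "touching the path: $path_commits, reachable from the tag: $tag_commits"
```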

Of course, a much more common reason for "--" for pathname separation,
and one you may already be familiar with, is that you want to see the
history of a pathname that is not *currently* a pathname, but was one
at some point in the past.

So in my current kernel tree, doing

$ git log arch/nds32/

will actually cause an error, because of "unknown revision or path not
in the working tree".

But if you really want to see the history of that now-deleted
architecture, you do

$ git log -- arch/nds32/

and it's all good.
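
The same effect can be reproduced in miniature. This is a sketch in a
throwaway repository, with a hypothetical directory 'arch-demo' standing
in for the removed architecture:

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Add a directory, then delete it again in a second commit.
mkdir arch-demo
echo "placeholder" > arch-demo/Kconfig
git add arch-demo
git commit -q -m "add arch-demo"
git rm -r -q arch-demo
git commit -q -m "remove arch-demo"

# Without "--" git cannot tell what 'arch-demo/' is supposed to be:
# it is neither a revision nor a path in the working tree.
if git log --oneline arch-demo/ >/dev/null 2>&1; then plain=ok; else plain=failed; fi

# With "--" it is unambiguously a path, and both commits show up.
deleted_history=$(git log --oneline -- arch-demo/ | wc -l)

echo "plain form: $plain, commits found with '--': $deleted_history"
```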

This concludes today's sermon on git,

Linus