2022-05-09 05:45:20

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 0/7] fork: Make init and umh ordinary tasks


Commit 40966e316f86 ("kthread: Ensure struct kthread is present for
all kthreads") caused init and the user mode helper threads that call
kernel_execve to have struct kthread allocated for them.

I believe my first patch in this series is enough to fix the bug
and is simple enough and obvious enough to be backportable.

The rest of the changes pass struct kernel_clone_args to clean things
up and cause the code to make sense.
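
For orientation, after the series the relevant part of struct
kernel_clone_args ends up looking roughly like this (a sketch assembled
from the sched/task.h hunks below; members the series does not touch
are elided):

struct kernel_clone_args {
        u64 flags;
        /* ... exit_signal, stack, stack_size, tls, set_tid, etc. elided ... */
        int io_thread;
        int kthread;            /* patch 1: allocate struct kthread for this task */
        int idle;               /* set by fork_idle() */
        int (*fn)(void *);      /* patch 4: kernel mode function to start with */
        void *fn_arg;           /* patch 4: argument passed to fn */
        struct cgroup *cgrp;
        struct css_set *cset;
};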

There is one rough spot in this change. In the init process, before the
user space init process is exec'd, there is a lot going on. I have found
that when async_schedule_domain is low on memory or has more than 32K
callers executing, do_populate_rootfs will now run in a user space
thread, making flush_delayed_fput meaningless and leaving __fput_sync
unusable. I solved this as I did in usermode_driver.c, with an added
explicit task_work_run. I point this out as I have seen some talk about
making the flushing of file handles more explicit.
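
For reference, the pattern mentioned above amounts to the following (a
minimal illustrative sketch of what patch 5 does at the end of
do_populate_rootfs, mirroring usermode_driver.c; the helper name here
is hypothetical):

#include <linux/file.h>
#include <linux/task_work.h>

static void __init flush_pending_fputs(void)    /* hypothetical helper */
{
        /* Put files queued on the global delayed-fput list (kthread case). */
        flush_delayed_fput();
        /*
         * When this runs in the init (user mode) task, fput() instead queues
         * work on current via task_work_add(..., TWA_RESUME), so run that
         * queue explicitly before the initramfs binaries are exec'd.
         */
        task_work_run();
}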

Eric W. Biederman (7):
kthread: Don't allocate kthread_struct for init and umh
fork: Pass struct kernel_clone_args into copy_thread
fork: Explicitly test for idle tasks in copy_thread
fork: Generalize PF_IO_WORKER handling
init: Deal with the init process being a user mode process
fork: Explicitly set PF_KTHREAD
fork: Stop allowing kthreads to call execve

arch/alpha/kernel/process.c | 13 ++++++------
arch/arc/kernel/process.c | 13 ++++++------
arch/arm/kernel/process.c | 12 ++++++-----
arch/arm64/kernel/process.c | 12 ++++++-----
arch/csky/kernel/process.c | 15 ++++++-------
arch/h8300/kernel/process.c | 10 ++++-----
arch/hexagon/kernel/process.c | 12 ++++++-----
arch/ia64/kernel/process.c | 15 +++++++------
arch/m68k/kernel/process.c | 12 ++++++-----
arch/microblaze/kernel/process.c | 12 ++++++-----
arch/mips/kernel/process.c | 13 ++++++------
arch/nios2/kernel/process.c | 12 ++++++-----
arch/openrisc/kernel/process.c | 12 ++++++-----
arch/parisc/kernel/process.c | 18 +++++++++-------
arch/powerpc/kernel/process.c | 15 +++++++------
arch/riscv/kernel/process.c | 12 ++++++-----
arch/s390/kernel/process.c | 12 ++++++-----
arch/sh/kernel/process_32.c | 12 ++++++-----
arch/sparc/kernel/process_32.c | 12 ++++++-----
arch/sparc/kernel/process_64.c | 12 ++++++-----
arch/um/kernel/process.c | 15 +++++++------
arch/x86/include/asm/fpu/sched.h | 2 +-
arch/x86/include/asm/switch_to.h | 8 +++----
arch/x86/kernel/fpu/core.c | 4 ++--
arch/x86/kernel/process.c | 18 +++++++++-------
arch/xtensa/kernel/process.c | 17 ++++++++-------
fs/exec.c | 8 ++++---
include/linux/sched/task.h | 8 +++++--
init/initramfs.c | 2 ++
init/main.c | 2 +-
kernel/fork.c | 46 +++++++++++++++++++++++++++++++++-------
kernel/umh.c | 6 +++---
32 files changed, 233 insertions(+), 159 deletions(-)

Eric


2022-05-09 08:14:51

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 4/7] fork: Generalize PF_IO_WORKER handling

Add fn and fn_arg members into struct kernel_clone_args and test for
them in copy_thread (instead of testing for PF_KTHREAD | PF_IO_WORKER).
This allows any task that wants to be a user space task that only runs
in kernel mode to use this functionality.

The code on x86 is an exception and still retains a PF_KTHREAD test,
because x86, unlike everything else, handles kthreads slightly
differently than user space tasks that start with a function.

The functions that created tasks that start with a function
have been updated to set ".fn" and ".fn_arg" instead of
".stack" and ".stack_size". These functions are fork_idle(),
create_io_thread(), kernel_thread(), and user_mode_thread().
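
The per-architecture hunks below all follow the same shape. Roughly,
and with placeholder thread-struct field names rather than any real
port's:

int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
        struct pt_regs *childregs = task_pt_regs(p);

        if (unlikely(args->fn)) {
                /*
                 * A task that starts by running a kernel function: zero the
                 * child's register frame and stash fn/fn_arg in callee-saved
                 * state for ret_from_kernel_thread to pick up.
                 */
                memset(childregs, 0, sizeof(struct pt_regs));
                p->thread.fn_slot  = (unsigned long)args->fn;       /* placeholder */
                p->thread.arg_slot = (unsigned long)args->fn_arg;   /* placeholder */
                p->thread.ret_addr = (unsigned long)ret_from_kernel_thread;
                return 0;
        }

        /*
         * Ordinary fork/clone of a user task: copy the parent's registers,
         * arrange a zero return value in the child, and honour args->stack
         * and args->tls as before.
         */
        *childregs = *current_pt_regs();
        return 0;
}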

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
arch/alpha/kernel/process.c | 7 +++----
arch/arc/kernel/process.c | 7 +++----
arch/arm/kernel/process.c | 7 +++----
arch/arm64/kernel/process.c | 7 +++----
arch/csky/kernel/process.c | 7 +++----
arch/h8300/kernel/process.c | 7 +++----
arch/hexagon/kernel/process.c | 7 +++----
arch/ia64/kernel/process.c | 6 +++---
arch/m68k/kernel/process.c | 7 +++----
arch/microblaze/kernel/process.c | 7 +++----
arch/mips/kernel/process.c | 7 +++----
arch/nios2/kernel/process.c | 7 +++----
arch/openrisc/kernel/process.c | 7 +++----
arch/parisc/kernel/process.c | 11 +++++------
arch/powerpc/kernel/process.c | 9 ++++-----
arch/riscv/kernel/process.c | 7 +++----
arch/s390/kernel/process.c | 7 +++----
arch/sh/kernel/process_32.c | 7 +++----
arch/sparc/kernel/process_32.c | 7 +++----
arch/sparc/kernel/process_64.c | 7 +++----
arch/um/kernel/process.c | 10 ++++------
arch/x86/include/asm/fpu/sched.h | 2 +-
arch/x86/include/asm/switch_to.h | 8 ++++----
arch/x86/kernel/fpu/core.c | 4 ++--
arch/x86/kernel/process.c | 13 ++++++-------
arch/xtensa/kernel/process.c | 11 +++++------
include/linux/sched/task.h | 2 ++
kernel/fork.c | 16 ++++++++--------
28 files changed, 95 insertions(+), 116 deletions(-)

diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 732e39217c7f..6cbba7370b4e 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -237,7 +237,6 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long usp = args->stack;
- unsigned long kthread_arg = args->stack_size;
unsigned long tls = args->tls;
extern void ret_from_fork(void);
extern void ret_from_kernel_thread(void);
@@ -251,13 +250,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
childti->pcb.ksp = (unsigned long) childstack;
childti->pcb.flags = 1; /* set FEN, clear everything else */

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(args->fn)) {
/* kernel thread */
memset(childstack, 0,
sizeof(struct switch_stack) + sizeof(struct pt_regs));
childstack->r26 = (unsigned long) ret_from_kernel_thread;
- childstack->r9 = usp; /* function */
- childstack->r10 = kthread_arg;
+ childstack->r9 = (unsigned long) args->fn;
+ childstack->r10 = (unsigned long) args->fn_arg;
childregs->hae = alpha_mv.hae_cache;
childti->pcb.usp = 0;
return 0;
diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c
index caf948ba647c..3369f0700702 100644
--- a/arch/arc/kernel/process.c
+++ b/arch/arc/kernel/process.c
@@ -166,7 +166,6 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long usp = args->stack;
- unsigned long kthread_arg = args->stack_size;
unsigned long tls = args->tls;
struct pt_regs *c_regs; /* child's pt_regs */
unsigned long *childksp; /* to unwind out of __switch_to() */
@@ -193,11 +192,11 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
childksp[0] = 0; /* fp */
childksp[1] = (unsigned long)ret_from_fork; /* blink */

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(args->fn)) {
memset(c_regs, 0, sizeof(struct pt_regs));

- c_callee->r13 = kthread_arg;
- c_callee->r14 = usp; /* function */
+ c_callee->r13 = (unsigned long)args->fn_arg;
+ c_callee->r14 = (unsigned long)args->fn;

return 0;
}
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 8e13b426dd26..3d9cace63884 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -242,7 +242,6 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long stack_start = args->stack;
- unsigned long stk_sz = args->stack_size;
unsigned long tls = args->tls;
struct thread_info *thread = task_thread_info(p);
struct pt_regs *childregs = task_pt_regs(p);
@@ -259,15 +258,15 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
thread->cpu_domain = get_domain();
#endif

- if (likely(!(p->flags & (PF_KTHREAD | PF_IO_WORKER)))) {
+ if (likely(!args->fn)) {
*childregs = *current_pt_regs();
childregs->ARM_r0 = 0;
if (stack_start)
childregs->ARM_sp = stack_start;
} else {
memset(childregs, 0, sizeof(struct pt_regs));
- thread->cpu_context.r4 = stk_sz;
- thread->cpu_context.r5 = stack_start;
+ thread->cpu_context.r4 = (unsigned long)args->fn_arg;
+ thread->cpu_context.r5 = (unsigned long)args->fn;
childregs->ARM_cpsr = SVC_MODE;
}
thread->cpu_context.pc = (unsigned long)ret_from_fork;
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index e002f6681c8d..d0ef05c661b0 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -320,7 +320,6 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long stack_start = args->stack;
- unsigned long stk_sz = args->stack_size;
unsigned long tls = args->tls;
struct pt_regs *childregs = task_pt_regs(p);

@@ -337,7 +336,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)

ptrauth_thread_init_kernel(p);

- if (likely(!(p->flags & (PF_KTHREAD | PF_IO_WORKER)))) {
+ if (likely(!args->fn)) {
*childregs = *current_pt_regs();
childregs->regs[0] = 0;

@@ -371,8 +370,8 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
memset(childregs, 0, sizeof(struct pt_regs));
childregs->pstate = PSR_MODE_EL1h | PSR_IL_BIT;

- p->thread.cpu_context.x19 = stack_start;
- p->thread.cpu_context.x20 = stk_sz;
+ p->thread.cpu_context.x19 = (unsigned long)args->fn;
+ p->thread.cpu_context.x20 = (unsigned long)args->fn_arg;
}
p->thread.cpu_context.pc = (unsigned long)ret_from_fork;
p->thread.cpu_context.sp = (unsigned long)childregs;
diff --git a/arch/csky/kernel/process.c b/arch/csky/kernel/process.c
index 7dba33d37e1a..9af49aea1c3b 100644
--- a/arch/csky/kernel/process.c
+++ b/arch/csky/kernel/process.c
@@ -34,7 +34,6 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long usp = args->stack;
- unsigned long kthread_arg = args->stack_size;
unsigned long tls = args->tls;
struct switch_stack *childstack;
struct pt_regs *childregs = task_pt_regs(p);
@@ -49,11 +48,11 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
/* setup thread.sp for switch_to !!! */
p->thread.sp = (unsigned long)childstack;

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(args->fn)) {
memset(childregs, 0, sizeof(struct pt_regs));
childstack->r15 = (unsigned long) ret_from_kernel_thread;
- childstack->r10 = kthread_arg;
- childstack->r9 = usp;
+ childstack->r10 = (unsigned long) args->fn_arg;
+ childstack->r9 = (unsigned long) args->fn;
childregs->sr = mfcr("psr");
} else {
*childregs = *(current_pt_regs());
diff --git a/arch/h8300/kernel/process.c b/arch/h8300/kernel/process.c
index 296ff831446e..5bfe0f8f54bf 100644
--- a/arch/h8300/kernel/process.c
+++ b/arch/h8300/kernel/process.c
@@ -108,16 +108,15 @@ void flush_thread(void)
int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long usp = args->stack;
- unsigned long topstk = args->stack_size;
struct pt_regs *childregs;

childregs = (struct pt_regs *) (THREAD_SIZE + task_stack_page(p)) - 1;

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(args->fn)) {
memset(childregs, 0, sizeof(struct pt_regs));
childregs->retpc = (unsigned long) ret_from_kernel_thread;
- childregs->er4 = topstk; /* arg */
- childregs->er5 = usp; /* fn */
+ childregs->er4 = (unsigned long) args->fn_arg;
+ childregs->er5 = (unsigned long) args->fn;
} else {
*childregs = *current_pt_regs();
childregs->er0 = 0;
diff --git a/arch/hexagon/kernel/process.c b/arch/hexagon/kernel/process.c
index f1c1f6f21941..f0552f98a7ba 100644
--- a/arch/hexagon/kernel/process.c
+++ b/arch/hexagon/kernel/process.c
@@ -54,7 +54,6 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long usp = args->stack;
- unsigned long arg = args->stack_size;
unsigned long tls = args->tls;
struct thread_info *ti = task_thread_info(p);
struct hexagon_switch_stack *ss;
@@ -76,11 +75,11 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
sizeof(*ss));
ss->lr = (unsigned long)ret_from_fork;
p->thread.switch_sp = ss;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(args->fn)) {
memset(childregs, 0, sizeof(struct pt_regs));
/* r24 <- fn, r25 <- arg */
- ss->r24 = usp;
- ss->r25 = arg;
+ ss->r24 = (unsigned long)args->fn;
+ ss->r25 = (unsigned long)args->fn_arg;
pt_set_kmode(childregs);
return 0;
}
diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c
index 8f010ae818bc..167b1765bea1 100644
--- a/arch/ia64/kernel/process.c
+++ b/arch/ia64/kernel/process.c
@@ -341,14 +341,14 @@ copy_thread(struct task_struct *p, const struct kernel_clone_args *args)

ia64_drop_fpu(p); /* don't pick up stale state from a CPU's fph */

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(args->fn)) {
if (unlikely(args->idle)) {
/* fork_idle() called us */
return 0;
}
memset(child_stack, 0, sizeof(*child_ptregs) + sizeof(*child_stack));
- child_stack->r4 = user_stack_base; /* payload */
- child_stack->r5 = user_stack_size; /* argument */
+ child_stack->r4 = (unsigned long) args->fn;
+ child_stack->r5 = (unsigned long) args->fn_arg;
/*
* Preserve PSR bits, except for bits 32-34 and 37-45,
* which we can't read.
diff --git a/arch/m68k/kernel/process.c b/arch/m68k/kernel/process.c
index 8ac575656fc4..221feb0269f1 100644
--- a/arch/m68k/kernel/process.c
+++ b/arch/m68k/kernel/process.c
@@ -142,7 +142,6 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long usp = args->stack;
- unsigned long arg = args->stack_size;
unsigned long tls = args->tls;
struct fork_frame {
struct switch_stack sw;
@@ -160,12 +159,12 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
*/
p->thread.fc = USER_DATA;

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(args->fn)) {
/* kernel thread */
memset(frame, 0, sizeof(struct fork_frame));
frame->regs.sr = PS_S;
- frame->sw.a3 = usp; /* function */
- frame->sw.d7 = arg;
+ frame->sw.a3 = (unsigned long)args->fn;
+ frame->sw.d7 = (unsigned long)args->fn_arg;
frame->sw.retpc = (unsigned long)ret_from_kernel_thread;
p->thread.usp = 0;
return 0;
diff --git a/arch/microblaze/kernel/process.c b/arch/microblaze/kernel/process.c
index b5f549125c6a..3c6241bcaea8 100644
--- a/arch/microblaze/kernel/process.c
+++ b/arch/microblaze/kernel/process.c
@@ -56,19 +56,18 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long usp = args->stack;
- unsigned long arg = args->stack_size;
unsigned long tls = args->tls;
struct pt_regs *childregs = task_pt_regs(p);
struct thread_info *ti = task_thread_info(p);

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(args->fn)) {
/* if we're creating a new kernel thread then just zeroing all
* the registers. That's OK for a brand new thread.*/
memset(childregs, 0, sizeof(struct pt_regs));
memset(&ti->cpu_context, 0, sizeof(struct cpu_context));
ti->cpu_context.r1 = (unsigned long)childregs;
- ti->cpu_context.r20 = (unsigned long)usp; /* fn */
- ti->cpu_context.r19 = (unsigned long)arg;
+ ti->cpu_context.r20 = (unsigned long)args->fn;
+ ti->cpu_context.r19 = (unsigned long)args->fn_arg;
childregs->pt_mode = 1;
local_save_flags(childregs->msr);
ti->cpu_context.msr = childregs->msr & ~MSR_IE;
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index a572d097b16b..35b912bce429 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -109,7 +109,6 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long usp = args->stack;
- unsigned long kthread_arg = args->stack_size;
unsigned long tls = args->tls;
struct thread_info *ti = task_thread_info(p);
struct pt_regs *childregs, *regs = current_pt_regs();
@@ -122,12 +121,12 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
/* Put the stack after the struct pt_regs. */
childksp = (unsigned long) childregs;
p->thread.cp0_status = (read_c0_status() & ~(ST0_CU2|ST0_CU1)) | ST0_KERNEL_CUMASK;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(args->fn)) {
/* kernel thread */
unsigned long status = p->thread.cp0_status;
memset(childregs, 0, sizeof(struct pt_regs));
- p->thread.reg16 = usp; /* fn */
- p->thread.reg17 = kthread_arg;
+ p->thread.reg16 = (unsigned long)args->fn;
+ p->thread.reg17 = (unsigned long)args->fn_arg;
p->thread.reg29 = childksp;
p->thread.reg31 = (unsigned long) ret_from_kernel_thread;
#if defined(CONFIG_CPU_R3000)
diff --git a/arch/nios2/kernel/process.c b/arch/nios2/kernel/process.c
index 98c4bfe972e0..29593b98567d 100644
--- a/arch/nios2/kernel/process.c
+++ b/arch/nios2/kernel/process.c
@@ -104,7 +104,6 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long usp = args->stack;
- unsigned long arg = args->stack_size;
unsigned long tls = args->tls;
struct pt_regs *childregs = task_pt_regs(p);
struct pt_regs *regs;
@@ -112,12 +111,12 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
struct switch_stack *childstack =
((struct switch_stack *)childregs) - 1;

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(args->fn)) {
memset(childstack, 0,
sizeof(struct switch_stack) + sizeof(struct pt_regs));

- childstack->r16 = usp; /* fn */
- childstack->r17 = arg;
+ childstack->r16 = (unsigned long) args->fn;
+ childstack->r17 = (unsigned long) args->fn_arg;
childstack->ra = (unsigned long) ret_from_kernel_thread;
childregs->estatus = STATUS_PIE;
childregs->sp = (unsigned long) childstack;
diff --git a/arch/openrisc/kernel/process.c b/arch/openrisc/kernel/process.c
index 486e46dd5883..d9697cc9bc4d 100644
--- a/arch/openrisc/kernel/process.c
+++ b/arch/openrisc/kernel/process.c
@@ -156,7 +156,6 @@ copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long usp = args->stack;
- unsigned long arg = args->stack_size;
unsigned long tls = args->tls;
struct pt_regs *userregs;
struct pt_regs *kregs;
@@ -175,10 +174,10 @@ copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
sp -= sizeof(struct pt_regs);
kregs = (struct pt_regs *)sp;

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(args->fn)) {
memset(kregs, 0, sizeof(struct pt_regs));
- kregs->gpr[20] = usp; /* fn, kernel thread */
- kregs->gpr[22] = arg;
+ kregs->gpr[20] = (unsigned long)args->fn;
+ kregs->gpr[22] = (unsigned long)args->fn_arg;
} else {
*userregs = *current_pt_regs();

diff --git a/arch/parisc/kernel/process.c b/arch/parisc/kernel/process.c
index 30a5874ca845..a6a2a558fc5b 100644
--- a/arch/parisc/kernel/process.c
+++ b/arch/parisc/kernel/process.c
@@ -210,7 +210,6 @@ copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long usp = args->stack;
- unsigned long kthread_arg = args->stack_size;
unsigned long tls = args->tls;
struct pt_regs *cregs = &(p->thread.regs);
void *stack = task_stack_page(p);
@@ -221,7 +220,7 @@ copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
extern void * const ret_from_kernel_thread;
extern void * const child_return;

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(args->fn)) {
/* kernel thread */
memset(cregs, 0, sizeof(struct pt_regs));
if (args->idle) /* idle thread */
@@ -236,12 +235,12 @@ copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
* ret_from_kernel_thread.
*/
#ifdef CONFIG_64BIT
- cregs->gr[27] = ((unsigned long *)usp)[3];
- cregs->gr[26] = ((unsigned long *)usp)[2];
+ cregs->gr[27] = ((unsigned long *)args->fn)[3];
+ cregs->gr[26] = ((unsigned long *)args->fn)[2];
#else
- cregs->gr[26] = usp;
+ cregs->gr[26] = (unsigned long) args->fn;
#endif
- cregs->gr[25] = kthread_arg;
+ cregs->gr[25] = (unsigned long) args->fn_arg;
} else {
/* user thread */
/* usp must be word aligned. This also prevents users from
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 3fd67c861d54..4f367bb68906 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1720,7 +1720,6 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long usp = args->stack;
- unsigned long kthread_arg = args->stack_size;
unsigned long tls = args->tls;
struct pt_regs *childregs, *kregs;
extern void ret_from_fork(void);
@@ -1738,18 +1737,18 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
/* Copy registers */
sp -= sizeof(struct pt_regs);
childregs = (struct pt_regs *) sp;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(args->fn)) {
/* kernel thread */
memset(childregs, 0, sizeof(struct pt_regs));
childregs->gpr[1] = sp + sizeof(struct pt_regs);
/* function */
- if (usp)
- childregs->gpr[14] = ppc_function_entry((void *)usp);
+ if (args->fn)
+ childregs->gpr[14] = ppc_function_entry((void *)args->fn);
#ifdef CONFIG_PPC64
clear_tsk_thread_flag(p, TIF_32BIT);
childregs->softe = IRQS_ENABLED;
#endif
- childregs->gpr[15] = kthread_arg;
+ childregs->gpr[15] = (unsigned long)args->fn_arg;
p->thread.regs = NULL; /* no user register state */
ti->flags |= _TIF_RESTOREALL;
f = ret_from_kernel_thread;
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index 334382731725..24efabdbc551 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -124,12 +124,11 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long usp = args->stack;
- unsigned long arg = args->stack_size;
unsigned long tls = args->tls;
struct pt_regs *childregs = task_pt_regs(p);

/* p->thread holds context to be restored by __switch_to() */
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(args->fn)) {
/* Kernel thread */
memset(childregs, 0, sizeof(struct pt_regs));
childregs->gp = gp_in_global;
@@ -137,8 +136,8 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
childregs->status = SR_PP | SR_PIE;

p->thread.ra = (unsigned long)ret_from_kernel_thread;
- p->thread.s[0] = usp; /* fn */
- p->thread.s[1] = arg;
+ p->thread.s[0] = (unsigned long)args->fn;
+ p->thread.s[1] = (unsigned long)args->fn_arg;
} else {
*childregs = *(current_pt_regs());
if (usp) /* User fork */
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index bb5daec39516..89949b9f3cf8 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -98,7 +98,6 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long new_stackp = args->stack;
- unsigned long arg = args->stack_size;
unsigned long tls = args->tls;
struct fake_frame
{
@@ -133,15 +132,15 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
frame->sf.gprs[9] = (unsigned long)frame;

/* Store access registers to kernel stack of new process. */
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(args->fn)) {
/* kernel thread */
memset(&frame->childregs, 0, sizeof(struct pt_regs));
frame->childregs.psw.mask = PSW_KERNEL_BITS | PSW_MASK_DAT |
PSW_MASK_IO | PSW_MASK_EXT | PSW_MASK_MCHECK;
frame->childregs.psw.addr =
(unsigned long)__ret_from_fork;
- frame->childregs.gprs[9] = new_stackp; /* function */
- frame->childregs.gprs[10] = arg;
+ frame->childregs.gprs[9] = (unsigned long)args->fn;
+ frame->childregs.gprs[10] = (unsigned long)args->fn_arg;
frame->childregs.orig_gpr2 = -1;
frame->childregs.last_break = 1;
return 0;
diff --git a/arch/sh/kernel/process_32.c b/arch/sh/kernel/process_32.c
index 6023399b1892..a808843375e7 100644
--- a/arch/sh/kernel/process_32.c
+++ b/arch/sh/kernel/process_32.c
@@ -96,7 +96,6 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long usp = args->stack;
- unsigned long arg = args->stack_size;
unsigned long tls = args->tls;
struct thread_info *ti = task_thread_info(p);
struct pt_regs *childregs;
@@ -117,11 +116,11 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)

childregs = task_pt_regs(p);
p->thread.sp = (unsigned long) childregs;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(args->fn)) {
memset(childregs, 0, sizeof(struct pt_regs));
p->thread.pc = (unsigned long) ret_from_kernel_thread;
- childregs->regs[4] = arg;
- childregs->regs[5] = usp;
+ childregs->regs[4] = (unsigned long) args->fn_arg;
+ childregs->regs[5] = (unsigned long) args->fn;
childregs->sr = SR_MD;
#if defined(CONFIG_SH_FPU)
childregs->sr |= SR_FD;
diff --git a/arch/sparc/kernel/process_32.c b/arch/sparc/kernel/process_32.c
index 80e6775e18c0..33b0215a4182 100644
--- a/arch/sparc/kernel/process_32.c
+++ b/arch/sparc/kernel/process_32.c
@@ -263,7 +263,6 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long sp = args->stack;
- unsigned long arg = args->stack_size;
unsigned long tls = args->tls;
struct thread_info *ti = task_thread_info(p);
struct pt_regs *childregs, *regs = current_pt_regs();
@@ -299,13 +298,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
ti->ksp = (unsigned long) new_stack;
p->thread.kregs = childregs;

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(args->fn)) {
extern int nwindows;
unsigned long psr;
memset(new_stack, 0, STACKFRAME_SZ + TRACEREG_SZ);
ti->kpc = (((unsigned long) ret_from_kernel_thread) - 0x8);
- childregs->u_regs[UREG_G1] = sp; /* function */
- childregs->u_regs[UREG_G2] = arg;
+ childregs->u_regs[UREG_G1] = (unsigned long) args->fn;
+ childregs->u_regs[UREG_G2] = (unsigned long) args->fn_arg;
psr = childregs->psr = get_psr();
ti->kpsr = psr | PSR_PIL;
ti->kwim = 1 << (((psr & PSR_CWP) + 1) % nwindows);
diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
index 38c46ca826d9..6335b698a4b4 100644
--- a/arch/sparc/kernel/process_64.c
+++ b/arch/sparc/kernel/process_64.c
@@ -568,7 +568,6 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long sp = args->stack;
- unsigned long arg = args->stack_size;
unsigned long tls = args->tls;
struct thread_info *t = task_thread_info(p);
struct pt_regs *regs = current_pt_regs();
@@ -587,12 +586,12 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
sizeof(struct sparc_stackf));
t->fpsaved[0] = 0;

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(args->fn)) {
memset(child_trap_frame, 0, child_stack_sz);
__thread_flag_byte_ptr(t)[TI_FLAG_BYTE_CWP] =
(current_pt_regs()->tstate + 1) & TSTATE_CWP;
- t->kregs->u_regs[UREG_G1] = sp; /* function */
- t->kregs->u_regs[UREG_G2] = arg;
+ t->kregs->u_regs[UREG_G1] = (unsigned long) args->fn;
+ t->kregs->u_regs[UREG_G2] = (unsigned long) args->fn_arg;
return 0;
}

diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index fd2d2361484d..181cc9aafb25 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -158,15 +158,13 @@ int copy_thread(struct task_struct * p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long sp = args->stack;
- unsigned long arg = args->stack_size;
unsigned long tls = args->tls;
void (*handler)(void);
- int kthread = current->flags & (PF_KTHREAD | PF_IO_WORKER);
int ret = 0;

p->thread = (struct thread_struct) INIT_THREAD;

- if (!kthread) {
+ if (!args->fn) {
memcpy(&p->thread.regs.regs, current_pt_regs(),
sizeof(p->thread.regs.regs));
PT_REGS_SET_SYSCALL_RETURN(&p->thread.regs, 0);
@@ -178,14 +176,14 @@ int copy_thread(struct task_struct * p, const struct kernel_clone_args *args)
arch_copy_thread(&current->thread.arch, &p->thread.arch);
} else {
get_safe_registers(p->thread.regs.regs.gp, p->thread.regs.regs.fp);
- p->thread.request.u.thread.proc = (int (*)(void *))sp;
- p->thread.request.u.thread.arg = (void *)arg;
+ p->thread.request.u.thread.proc = args->fn;
+ p->thread.request.u.thread.arg = args->fn_arg;
handler = new_thread_handler;
}

new_thread(task_stack_page(p), &p->thread.switch_buf, handler);

- if (!kthread) {
+ if (!args->fn) {
clear_flushed_tls(p);

/*
diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
index 99a8820e8cc4..b2486b2cbc6e 100644
--- a/arch/x86/include/asm/fpu/sched.h
+++ b/arch/x86/include/asm/fpu/sched.h
@@ -11,7 +11,7 @@

extern void save_fpregs_to_fpstate(struct fpu *fpu);
extern void fpu__drop(struct fpu *fpu);
-extern int fpu_clone(struct task_struct *dst, unsigned long clone_flags);
+extern int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal);
extern void fpu_flush_thread(void);

/*
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index b5f0d2ff47e4..c08eb0fdd11f 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -78,13 +78,13 @@ static inline void update_task_stack(struct task_struct *task)
}

static inline void kthread_frame_init(struct inactive_task_frame *frame,
- unsigned long fun, unsigned long arg)
+ int (*fun)(void *), void *arg)
{
- frame->bx = fun;
+ frame->bx = (unsigned long)fun;
#ifdef CONFIG_X86_32
- frame->di = arg;
+ frame->di = (unsigned long)arg;
#else
- frame->r12 = arg;
+ frame->r12 = (unsigned long)arg;
#endif
}

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index c049561f373a..fbade5a3975b 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -556,7 +556,7 @@ static inline void fpu_inherit_perms(struct fpu *dst_fpu)
}

/* Clone current's FPU state on fork */
-int fpu_clone(struct task_struct *dst, unsigned long clone_flags)
+int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
{
struct fpu *src_fpu = &current->thread.fpu;
struct fpu *dst_fpu = &dst->thread.fpu;
@@ -579,7 +579,7 @@ int fpu_clone(struct task_struct *dst, unsigned long clone_flags)
* No FPU state inheritance for kernel threads and IO
* worker threads.
*/
- if (dst->flags & (PF_KTHREAD | PF_IO_WORKER)) {
+ if (minimal) {
/* Clear out the minimal state */
memcpy(&dst_fpu->fpstate->regs, &init_fpstate.regs,
init_fpstate_copy_size());
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 0fce52b10dc4..d20eaad52a85 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -134,7 +134,6 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long sp = args->stack;
- unsigned long arg = args->stack_size;
unsigned long tls = args->tls;
struct inactive_task_frame *frame;
struct fork_frame *fork_frame;
@@ -172,13 +171,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
frame->flags = X86_EFLAGS_FIXED;
#endif

- fpu_clone(p, clone_flags);
+ fpu_clone(p, clone_flags, args->fn);

/* Kernel thread ? */
if (unlikely(p->flags & PF_KTHREAD)) {
p->thread.pkru = pkru_get_init_value();
memset(childregs, 0, sizeof(struct pt_regs));
- kthread_frame_init(frame, sp, arg);
+ kthread_frame_init(frame, args->fn, args->fn_arg);
return 0;
}

@@ -198,10 +197,10 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
task_user_gs(p) = get_user_gs(current_pt_regs());
#endif

- if (unlikely(p->flags & PF_IO_WORKER)) {
+ if (unlikely(args->fn)) {
/*
- * An IO thread is a user space thread, but it doesn't
- * return to ret_after_fork().
+ * A user space thread, but it doesn't return to
+ * ret_after_fork().
*
* In order to indicate that to tools like gdb,
* we reset the stack and instruction pointers.
@@ -211,7 +210,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
*/
childregs->sp = 0;
childregs->ip = 0;
- kthread_frame_init(frame, sp, arg);
+ kthread_frame_init(frame, args->fn, args->fn_arg);
return 0;
}

diff --git a/arch/xtensa/kernel/process.c b/arch/xtensa/kernel/process.c
index 15ce25073142..c3751cc88e5d 100644
--- a/arch/xtensa/kernel/process.c
+++ b/arch/xtensa/kernel/process.c
@@ -205,7 +205,6 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
unsigned long usp_thread_fn = args->stack;
- unsigned long thread_fn_arg = args->stack_size;
unsigned long tls = args->tls;
struct pt_regs *childregs = task_pt_regs(p);

@@ -226,7 +225,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
#error Unsupported Xtensa ABI
#endif

- if (!(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (!args->fn) {
struct pt_regs *regs = current_pt_regs();
unsigned long usp = usp_thread_fn ?
usp_thread_fn : regs->areg[1];
@@ -278,15 +277,15 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
* Window underflow will load registers from the
* spill slots on the stack on return from _switch_to.
*/
- SPILL_SLOT(childregs, 2) = usp_thread_fn;
- SPILL_SLOT(childregs, 3) = thread_fn_arg;
+ SPILL_SLOT(childregs, 2) = (unsigned long)args->fn;
+ SPILL_SLOT(childregs, 3) = (unsigned long)args->fn_arg;
#elif defined(__XTENSA_CALL0_ABI__)
/*
* a12 = thread_fn, a13 = thread_fn arg.
* _switch_to epilogue will load registers from the stack.
*/
- ((unsigned long *)p->thread.sp)[0] = usp_thread_fn;
- ((unsigned long *)p->thread.sp)[1] = thread_fn_arg;
+ ((unsigned long *)p->thread.sp)[0] = (unsigned long)args->fn;
+ ((unsigned long *)p->thread.sp)[1] = (unsigned long)args->fn_arg;
#else
#error Unsupported Xtensa ABI
#endif
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 3d6b99ce5408..505aaf9fe477 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -34,6 +34,8 @@ struct kernel_clone_args {
int io_thread;
int kthread;
int idle;
+ int (*fn)(void *);
+ void *fn_arg;
struct cgroup *cgrp;
struct css_set *cset;
};
diff --git a/kernel/fork.c b/kernel/fork.c
index 93d77ee921ff..8e17c3fbce42 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2555,8 +2555,8 @@ struct task_struct * __init fork_idle(int cpu)
struct task_struct *task;
struct kernel_clone_args args = {
.flags = CLONE_VM,
- .stack = (unsigned long)&idle_dummy,
- .stack_size = (unsigned long)NULL,
+ .fn = &idle_dummy,
+ .fn_arg = NULL,
.kthread = 1,
.idle = 1,
};
@@ -2589,8 +2589,8 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
.flags = ((lower_32_bits(flags) | CLONE_VM |
CLONE_UNTRACED) & ~CSIGNAL),
.exit_signal = (lower_32_bits(flags) & CSIGNAL),
- .stack = (unsigned long)fn,
- .stack_size = (unsigned long)arg,
+ .fn = fn,
+ .fn_arg = arg,
.io_thread = 1,
};

@@ -2694,8 +2694,8 @@ pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
.flags = ((lower_32_bits(flags) | CLONE_VM |
CLONE_UNTRACED) & ~CSIGNAL),
.exit_signal = (lower_32_bits(flags) & CSIGNAL),
- .stack = (unsigned long)fn,
- .stack_size = (unsigned long)arg,
+ .fn = fn,
+ .fn_arg = arg,
.kthread = 1,
};

@@ -2711,8 +2711,8 @@ pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags)
.flags = ((lower_32_bits(flags) | CLONE_VM |
CLONE_UNTRACED) & ~CSIGNAL),
.exit_signal = (lower_32_bits(flags) & CSIGNAL),
- .stack = (unsigned long)fn,
- .stack_size = (unsigned long)arg,
+ .fn = fn,
+ .fn_arg = arg,
};

return kernel_clone(&args);
--
2.35.3


2022-05-09 09:38:05

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 5/7] init: Deal with the init process being a user mode process

It is silly for user_mode_thread to leave PF_KTHREAD set
on the resulting task. Update the init process so that
it does not care if PF_KTHREAD is set or not.

Ensure do_populate_rootfs flushes all delayed fput work by calling
task_work_run. In the rare instance that async_schedule_domain calls
do_populate_rootfs synchronously, it is possible that do_populate_rootfs
will be called directly from the init process, at which point fput
will call "task_work_add(current, ..., TWA_RESUME)". The files on the
initramfs need to be completely put before we attempt to exec them
(which is before the code enters userspace), so call task_work_run
just in case there are any pending fput operations.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
init/initramfs.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/init/initramfs.c b/init/initramfs.c
index 2f3d96dc3db6..41e7857d510d 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -15,6 +15,7 @@
#include <linux/mm.h>
#include <linux/namei.h>
#include <linux/init_syscalls.h>
+#include <linux/task_work.h>
#include <linux/umh.h>

static ssize_t __init xwrite(struct file *file, const char *p, size_t count,
@@ -703,6 +704,7 @@ static void __init do_populate_rootfs(void *unused, async_cookie_t cookie)
initrd_end = 0;

flush_delayed_fput();
+ task_work_run();
}

static ASYNC_DOMAIN_EXCLUSIVE(initramfs_domain);
--
2.35.3


2022-05-09 09:49:13

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 1/7] kthread: Don't allocate kthread_struct for init and umh

If kthread_is_per_cpu runs concurrently with free_kthread_struct the
kthread_struct that was just freed may be read from.

This bug was introduced by commit 40966e316f86 ("kthread: Ensure
struct kthread is present for all kthreads"), when kthread_struct
started to be allocated for all tasks that have PF_KTHREAD set. This
in turn required the kthread_struct to be freed in kernel_execve and
violated the assumption that kthread_struct will have the same
lifetime as the task.

Looking a bit deeper, this only applies to callers of kernel_execve,
which is just the init process and the user mode helper processes.
These processes really don't want to be kernel threads but are for
historical reasons, mostly that copy_thread does not know how to pass
a kernel mode startup function to processes that do not have
PF_KTHREAD or PF_IO_WORKER set.

Solve this by not allocating kthread_struct for the init process and
the user mode helper processes.

This is done by adding a kthread member to struct kernel_clone_args,
setting kthread in fork_idle and kernel_thread, adding
user_mode_thread that works like kernel_thread except that it does not
set kthread, and in fork only allocating the kthread_struct if
.kthread is set.

I have looked at kernel/kthread.c and since commit 40966e316f86
("kthread: Ensure struct kthread is present for all kthreads") there
have been no assumptions added that to_kthread or __to_kthread will
not return NULL.

There are a few callers of to_kthread or __to_kthread that assume a
non-NULL struct kthread pointer will be returned. These functions are
kthread_data(), kthread_parkme(), kthread_exit(), kthread(),
kthread_park(), kthread_unpark(), and kthread_stop(). All of those
functions can reasonably be expected to be called when it is known
that a task is a kthread, so that assumption seems reasonable.
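
For context, the NULL-tolerant path looks roughly like this (a
simplified paraphrase of kernel/kthread.c from this era, not the
verbatim source); callers such as kthread_is_per_cpu() check for NULL,
while the functions listed above do not:

/* Returns NULL rather than warning when handed a non-kthread task. */
static inline struct kthread *__to_kthread(struct task_struct *p)
{
        void *kthread = p->worker_private;

        if (kthread && !(p->flags & PF_KTHREAD))
                kthread = NULL;
        return kthread;
}

bool kthread_is_per_cpu(struct task_struct *p)
{
        struct kthread *kthread = __to_kthread(p);

        if (!kthread)
                return false;
        return test_bit(KTHREAD_IS_PER_CPU, &kthread->flags);
}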

Cc: [email protected]
Fixes: 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads")
Reported-by: Максим Кутявин <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/exec.c | 6 ++++--
include/linux/sched/task.h | 2 ++
init/main.c | 2 +-
kernel/fork.c | 22 ++++++++++++++++++++--
kernel/umh.c | 6 +++---
5 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index e3e55d5e0be1..75eb6e0ee7b2 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1308,8 +1308,6 @@ int begin_new_exec(struct linux_binprm * bprm)
if (retval)
goto out_unlock;

- if (me->flags & PF_KTHREAD)
- free_kthread_struct(me);
me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
PF_NOFREEZE | PF_NO_SETAFFINITY);
flush_thread();
@@ -1955,6 +1953,10 @@ int kernel_execve(const char *kernel_filename,
int fd = AT_FDCWD;
int retval;

+ if (WARN_ON_ONCE((current->flags & PF_KTHREAD) &&
+ (current->worker_private)))
+ return -EINVAL;
+
filename = getname_kernel(kernel_filename);
if (IS_ERR(filename))
return PTR_ERR(filename);
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 719c9a6cac8d..4492266935dd 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -32,6 +32,7 @@ struct kernel_clone_args {
size_t set_tid_size;
int cgroup;
int io_thread;
+ int kthread;
struct cgroup *cgrp;
struct css_set *cset;
};
@@ -89,6 +90,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
struct task_struct *fork_idle(int);
struct mm_struct *copy_init_mm(void);
extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
+extern pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags);
extern long kernel_wait4(pid_t, int __user *, int, struct rusage *);
int kernel_wait(pid_t pid, int *stat);

diff --git a/init/main.c b/init/main.c
index 98182c3c2c4b..39baac0211c6 100644
--- a/init/main.c
+++ b/init/main.c
@@ -688,7 +688,7 @@ noinline void __ref rest_init(void)
* the init task will end up wanting to create kthreads, which, if
* we schedule it before we create kthreadd, will OOPS.
*/
- pid = kernel_thread(kernel_init, NULL, CLONE_FS);
+ pid = user_mode_thread(kernel_init, NULL, CLONE_FS);
/*
* Pin init on the boot CPU. Task migration is not properly working
* until sched_init_smp() has been run. It will set the allowed
diff --git a/kernel/fork.c b/kernel/fork.c
index 9796897560ab..27c5203750b4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2157,7 +2157,7 @@ static __latent_entropy struct task_struct *copy_process(
p->io_context = NULL;
audit_set_context(p, NULL);
cgroup_fork(p);
- if (p->flags & PF_KTHREAD) {
+ if (args->kthread) {
if (!set_kthread_struct(p))
goto bad_fork_cleanup_delayacct;
}
@@ -2548,7 +2548,8 @@ struct task_struct * __init fork_idle(int cpu)
{
struct task_struct *task;
struct kernel_clone_args args = {
- .flags = CLONE_VM,
+ .flags = CLONE_VM,
+ .kthread = 1,
};

task = copy_process(&init_struct_pid, 0, cpu_to_node(cpu), &args);
@@ -2679,6 +2680,23 @@ pid_t kernel_clone(struct kernel_clone_args *args)
* Create a kernel thread.
*/
pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
+{
+ struct kernel_clone_args args = {
+ .flags = ((lower_32_bits(flags) | CLONE_VM |
+ CLONE_UNTRACED) & ~CSIGNAL),
+ .exit_signal = (lower_32_bits(flags) & CSIGNAL),
+ .stack = (unsigned long)fn,
+ .stack_size = (unsigned long)arg,
+ .kthread = 1,
+ };
+
+ return kernel_clone(&args);
+}
+
+/*
+ * Create a user mode thread.
+ */
+pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags)
{
struct kernel_clone_args args = {
.flags = ((lower_32_bits(flags) | CLONE_VM |
diff --git a/kernel/umh.c b/kernel/umh.c
index 36c123360ab8..b989736e8707 100644
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -132,7 +132,7 @@ static void call_usermodehelper_exec_sync(struct subprocess_info *sub_info)

/* If SIGCLD is ignored do_wait won't populate the status. */
kernel_sigaction(SIGCHLD, SIG_DFL);
- pid = kernel_thread(call_usermodehelper_exec_async, sub_info, SIGCHLD);
+ pid = user_mode_thread(call_usermodehelper_exec_async, sub_info, SIGCHLD);
if (pid < 0)
sub_info->retval = pid;
else
@@ -171,8 +171,8 @@ static void call_usermodehelper_exec_work(struct work_struct *work)
* want to pollute current->children, and we need a parent
* that always ignores SIGCHLD to ensure auto-reaping.
*/
- pid = kernel_thread(call_usermodehelper_exec_async, sub_info,
- CLONE_PARENT | SIGCHLD);
+ pid = user_mode_thread(call_usermodehelper_exec_async, sub_info,
+ CLONE_PARENT | SIGCHLD);
if (pid < 0) {
sub_info->retval = pid;
umh_complete(sub_info);
--
2.35.3


2022-05-09 21:06:14

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH 0/7] fork: Make init and umh ordinary tasks

On Fri, May 06, 2022 at 09:11:36AM -0500, Eric W. Biederman wrote:
>
> In commit 40966e316f86 ("kthread: Ensure struct kthread is present for
> all kthreads") caused init and the user mode helper threads that call
> kernel_execve to have struct kthread allocated for them.
>
> I believe my first patch in this series is enough to fix the bug
> and is simple enough and obvious enough to be backportable.
>
> The rest of the changes pass struct kernel_clone_args to clean things
> up and cause the code to make sense.
>
> There is one rough spot in this change. In the init process before the
> user space init process is exec'd there is a lot going on. I have found
> when async_schedule_domain is low on memory or has more than 32K callers
> executing do_populate_rootfs will now run in a user space thread making
> flush_delayed_fput meaningless, and __fput_sync is unusable. I solved
> this as I did in usermode_driver.c with an added explicit task_work_run.
> I point this out as I have seen some talk about making flushing file
> handles more explicit.

Reverting the last 3 commits of the series fixed a boot crash.

1b2552cbdbe0 fork: Stop allowing kthreads to call execve
753550eb0ce1 fork: Explicitly set PF_KTHREAD
68d85f0a33b0 init: Deal with the init process being a user mode process

BUG: KASAN: null-ptr-deref in task_nr_scan_windows.isra.0
arch_atomic_long_read at ./include/linux/atomic/atomic-long.h:29
(inlined by) atomic_long_read at ./include/linux/atomic/atomic-instrumented.h:1266
(inlined by) get_mm_counter at ./include/linux/mm.h:1996
(inlined by) get_mm_rss at ./include/linux/mm.h:2049
(inlined by) task_nr_scan_windows at kernel/sched/fair.c:1123
Read of size 8 at addr 00000000000003d0 by task swapper/0/1

CPU: 72 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc6-next-20220509-dirty #29
Call trace:
dump_backtrace
show_stack
dump_stack_lvl
print_report
kasan_report
kasan_check_range
__kasan_check_read
task_nr_scan_windows.isra.0
task_scan_start
task_scan_min at /home/user/linux/kernel/sched/fair.c:1144
(inlined by) task_scan_start at /home/user/linux/kernel/sched/fair.c:1150
task_tick_fair
task_tick_numa at /home/user/linux/kernel/sched/fair.c:2944
(inlined by) task_tick_fair at /home/user/linux/kernel/sched/fair.c:11186
scheduler_tick
update_process_times
tick_periodic
tick_handle_periodic
arch_timer_handler_phys
handle_percpu_devid_irq
generic_handle_domain_irq
gic_handle_irq
call_on_irq_stack
do_interrupt_handler
el1_interrupt
el1h_64_irq_handler
el1h_64_irq
split_page
make_alloc_exact
alloc_pages_exact_nid
init_section_page_ext
page_ext_init
kernel_init_freeable
kernel_init
ret_from_fork
==================================================================
Disabling lock debugging due to kernel taint
Unable to handle kernel paging request at virtual address dfff80000000007a
KASAN: null-ptr-deref in range [0x00000000000003d0-0x00000000000003d7]
Mem abort info:
ESR = 0x0000000096000004
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x04: level 0 translation fault
Data abort info:
ISV = 0, ISS = 0x00000004
CM = 0, WnR = 0
[dfff80000000007a] address between user and kernel address ranges
Internal error: Oops: 96000004 [#1] PREEMPT SMP
Modules linked in:
CPU: 72 PID: 1 Comm: swapper/0 Tainted: G B 5.18.0-rc6-next-20220509-dirty #29
pstate: 404000c9 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : task_nr_scan_windows.isra.0
lr : task_nr_scan_windows.isra.0
sp : ffff800008487cb0
x29: ffff800008487cb0 x28: ffff07ff89728040 x27: 000000003bc47ee0
x26: ffff08367f088980 x25: 1fffe0fff12e525f x24: ffff07ff897292f8
x23: ffff07ff89728040 x22: 1fffe0fff12e5262 x21: 0000000000010000
x20: 00000000000003d0 x19: 0000000000000000 x18: ffffdd41783f7d1c
x17: 3d3d3d3d3d3d3d3d x16: 3d3d3d3d3d3d3d3d x15: 3d3d3d3d3d3d3d3d
x14: 3d3d3d3d3d3d3d3d x13: 746e696174206c65 x12: ffff7ba82f3b98b5
x11: 1ffffba82f3b98b4 x10: ffff7ba82f3b98b4 x9 : dfff800000000000
x8 : ffffdd4179dcc5a7 x7 : 0000000000000001 x6 : ffff7ba82f3b98b4
x5 : ffffdd4179dcc5a0 x4 : ffff7ba82f3b98b5 x3 : ffffdd4171de2b14
x2 : 0000000000000001 x1 : 000000000000007a x0 : dfff800000000000
Call trace:
task_nr_scan_windows.isra.0
task_scan_start
task_tick_fair
scheduler_tick
update_process_times
tick_periodic
tick_handle_periodic
arch_timer_handler_phys
handle_percpu_devid_irq
generic_handle_domain_irq
gic_handle_irq
call_on_irq_stack
do_interrupt_handler
el1_interrupt
el1h_64_irq_handler
el1h_64_irq
split_page
make_alloc_exact
alloc_pages_exact_nid
init_section_page_ext
page_ext_init
kernel_init_freeable
kernel_init
ret_from_fork
Code: d343fe81 d2d00000 f2fbffe0 53185eb5 (38e06820)

2022-05-09 23:58:57

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 0/7] fork: Make init and umh ordinary tasks

Qian Cai <[email protected]> writes:

> On Fri, May 06, 2022 at 09:11:36AM -0500, Eric W. Biederman wrote:
>>
>> In commit 40966e316f86 ("kthread: Ensure struct kthread is present for
>> all kthreads") caused init and the user mode helper threads that call
>> kernel_execve to have struct kthread allocated for them.
>>
>> I believe my first patch in this series is enough to fix the bug
>> and is simple enough and obvious enough to be backportable.
>>
>> The rest of the changes pass struct kernel_clone_args to clean things
>> up and cause the code to make sense.
>>
>> There is one rough spot in this change. In the init process before the
>> user space init process is exec'd there is a lot going on. I have found
>> when async_schedule_domain is low on memory or has more than 32K callers
>> executing do_populate_rootfs will now run in a user space thread making
>> flush_delayed_fput meaningless, and __fput_sync is unusable. I solved
>> this as I did in usermode_driver.c with an added explicit task_work_run.
>> I point this out as I have seen some talk about making flushing file
>> handles more explicit.
>
> Reverting the last 3 commits of the series fixed a boot crash.
>
> 1b2552cbdbe0 fork: Stop allowing kthreads to call execve
> 753550eb0ce1 fork: Explicitly set PF_KTHREAD
> 68d85f0a33b0 init: Deal with the init process being a user mode process

Hmm. It looks like I missed a little detail.

task_tick_fair
task_tick_numa
task_scan_start
task_scan_min
task_nr_scan_windows
p->mm

If I read this code right, task_tick_numa makes the assumption that only
tasks with PF_KTHREAD set don't have an mm.

This should fix the failure. For init we could possibly populate .mm
and not just .active_mm. For user mode helpers cloned from kernel
threads I don't think that is a realistic option. So I think this
is going to be the proper fix.

I believe this only happens when numa rebalancing happens at an
unfortunate moment.

Qian Cai can you test this?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d4bd299d67ab..db6f0df9d43e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2915,7 +2915,7 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
/*
* We don't care about NUMA placement if we don't have memory.
*/
- if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
+ if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
return;

/*


Eric

2022-05-10 20:45:48

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/7] kthread: Don't allocate kthread_struct for init and umh

Thomas Gleixner <[email protected]> writes:

> On Fri, May 06 2022 at 09:15, Eric W. Biederman wrote:
>> * the init task will end up wanting to create kthreads, which, if
>> * we schedule it before we create kthreadd, will OOPS.
>> */
>> - pid = kernel_thread(kernel_init, NULL, CLONE_FS);
>> + pid = user_mode_thread(kernel_init, NULL, CLONE_FS);
>
> So init does not have PF_KTHREAD set anymore, which causes this to go
> sideways with a NULL pointer dereference in get_mm_counter() on next:

Well not after the change above, but in a later patch yes.

Patch 1/7 really gets us back to the previous status quo, where
I introduced the breakage.

> get_mm_counter include/linux/mm.h:1996 [inline]
> get_mm_rss include/linux/mm.h:2049 [inline]
> task_nr_scan_windows.isra.0+0x23/0x120 kernel/sched/fair.c:1123
> task_scan_min kernel/sched/fair.c:1144 [inline]
> task_scan_start+0x6c/0x400 kernel/sched/fair.c:1150
> task_tick_numa kernel/sched/fair.c:2944 [inline]
> task_tick_fair+0xaeb/0xef0 kernel/sched/fair.c:11186
> scheduler_tick+0x20a/0x5e0 kernel/sched/core.c:5380
>
> https://lore.kernel.org/lkml/[email protected]
>
> because the fence in task_tick_numa():
>
> if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
> return;
>
> is no longer sufficient. It also needs to bail out if !curr->mm.

Agreed. I proposed a patch to do just that a little while ago.

> I'm worried that there are more of these issues lurking. Haven't looked
> yet.

I looked earlier and I missed this one. I am going to look again today,
along with applying the obvious fix to task_tick_numa.

I don't think there are many, but when the code has evolved into a shape
that is not easy to understand, things occasionally slip through as the
abstractions are made clearer. The reason to rework the code and make it
clear is that once the code has evolved to a point of many subtle
issues, making any change is brittle.

Eric


2022-05-10 22:45:20

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH 0/7] fork: Make init and umh ordinary tasks

On Mon, May 09, 2022 at 04:52:07PM -0500, Eric W. Biederman wrote:
> I believe this only happens when numa rebalancing happens at an
> unfortunate moment.
>
> Qian Cai can you test this?

Yes, this works fine so far.

2022-05-11 03:11:55

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 1/7] kthread: Don't allocate kthread_struct for init and umh

On Fri, May 06 2022 at 09:15, Eric W. Biederman wrote:
> * the init task will end up wanting to create kthreads, which, if
> * we schedule it before we create kthreadd, will OOPS.
> */
> - pid = kernel_thread(kernel_init, NULL, CLONE_FS);
> + pid = user_mode_thread(kernel_init, NULL, CLONE_FS);

So init does not have PF_KTHREAD set anymore, which causes this to go
sideways with a NULL pointer dereference in get_mm_counter() on next:

get_mm_counter include/linux/mm.h:1996 [inline]
get_mm_rss include/linux/mm.h:2049 [inline]
task_nr_scan_windows.isra.0+0x23/0x120 kernel/sched/fair.c:1123
task_scan_min kernel/sched/fair.c:1144 [inline]
task_scan_start+0x6c/0x400 kernel/sched/fair.c:1150
task_tick_numa kernel/sched/fair.c:2944 [inline]
task_tick_fair+0xaeb/0xef0 kernel/sched/fair.c:11186
scheduler_tick+0x20a/0x5e0 kernel/sched/core.c:5380

https://lore.kernel.org/lkml/[email protected]

because the fence in task_tick_numa():

if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
return;

is no longer sufficient. It also needs to bail out if !curr->mm.

I'm worried that there are more of these issues lurking. Haven't looked
yet.

Thanks,

tglx

2022-05-12 19:15:39

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 8/7] sched: Update task_tick_numa to ignore tasks without an mm


Qian Cai <[email protected]> wrote:
> Reverting the last 3 commits of the series fixed a boot crash.
>
> 1b2552cbdbe0 fork: Stop allowing kthreads to call execve
> 753550eb0ce1 fork: Explicitly set PF_KTHREAD
> 68d85f0a33b0 init: Deal with the init process being a user mode process
>
> BUG: KASAN: null-ptr-deref in task_nr_scan_windows.isra.0
> arch_atomic_long_read at ./include/linux/atomic/atomic-long.h:29
> (inlined by) atomic_long_read at ./include/linux/atomic/atomic-instrumented.h:1266
> (inlined by) get_mm_counter at ./include/linux/mm.h:1996
> (inlined by) get_mm_rss at ./include/linux/mm.h:2049
> (inlined by) task_nr_scan_windows at kernel/sched/fair.c:1123
> Read of size 8 at addr 00000000000003d0 by task swapper/0/1

With the change to init and the user mode helper processes so that they
no longer have PF_KTHREAD set before they call kernel_execve, the
PF_KTHREAD test in task_tick_numa became insufficient to detect all
tasks that have "->mm == NULL". Correct that by testing for
"->mm == NULL" directly.

Reported-by: Qian Cai <[email protected]>
Tested-by: Qian Cai <[email protected]>
Fixes: 1b2552cbdbe0 ("fork: Stop allowing kthreads to call execve")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d4bd299d67ab..db6f0df9d43e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2915,7 +2915,7 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
/*
* We don't care about NUMA placement if we don't have memory.
*/
- if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
+ if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
return;

/*
--
2.35.3


2022-05-14 03:13:40

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/7] kthread: Don't allocate kthread_struct for init and umh

"Eric W. Biederman" <[email protected]> writes:

> Thomas Gleixner <[email protected]> writes:
>
>> I'm worried that there are more of these issues lurking. Haven't looked
>> yet.
>
> I looked earlier and I missed this one. I am going to look again today,
> along with applying the obvious fix to task_tick_numa.

So I have just looked again and I don't see anything. There are a
couple of subtle issues on x86, especially with saving floating point
state, but as I read the code, copy_thread initializes things properly
so that code that doesn't touch floating point registers doesn't need
to do anything.

The important thing for lurking issues is that even if I missed
something, practically all of the uses of PF_KTHREAD are in x86 or
generic code, so any remaining problems should be flushed out quickly.

Eric