2021-11-21 17:50:05

by Mike Christie

[permalink] [raw]
Subject: [PATCH V5 00/10] Use copy_process/create_io_thread in vhost layer

The following patches made over Linus's tree, allow the vhost layer to do
a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
io_uring does a copy_process against its userspace app. This allows the
vhost layer's worker threads to inherit cgroups, namespaces, address
space, etc and this worker thread will also be accounted for against that
owner/parent process's RLIMIT_NPROC limit.

If you are not familiar with qemu and vhost here is more detailed
problem description:

Qemu will create vhost devices in the kernel which perform network, SCSI,
etc IO and management operations from worker threads created by the
kthread API. Because the kthread API does a copy_process on the kthreadd
thread, the vhost layer has to use kthread_use_mm to access the Qemu
thread's memory and cgroup_attach_task_all to add itself to the Qemu
thread's cgroups.

The problem with this approach is that we then have to add new functions/
args/functionality for every thing we want to inherit. I started doing
that here:

https://lkml.org/lkml/2021/6/23/1233

for the RLIMIT_NPROC check, but it seems it might be easier to just
inherit everything from the beginning, becuase I'd need to do something
like that patch several times.

V5:
- Handle kbuild errors by building patchset against current kernel that
has all deps merged. Also add patch to remove create_io_thread code as
it's not used anymore.
- Rebase patchset against current kernel and handle a new vm PF_IO_WORKER
case added in 5.16-rc1.
- Add PF_USER_WORKER flag so we can check it later after the initial
thread creation for the wake up, vm and singal cses.
- Added patch to auto reap the worker thread.
V4:
- Drop NO_SIG patch and replaced with Christian's SIG_IGN patch.
- Merged Christian's kernel_worker_flags_valid helpers into patch 5 that
added the new kernel worker functions.
- Fixed extra "i" issue.
- Added PF_USER_WORKER flag and added check that kernel_worker_start users
had that flag set. Also dropped patches that passed worker flags to
copy_thread and replaced with PF_USER_WORKER check.
V3:
- Add parentheses in p->flag and work_flags check in copy_thread.
- Fix check in arm/arm64 which was doing the reverse of other archs
where it did likely(!flags) instead of unlikely(flags).
V2:
- Rename kernel_copy_process to kernel_worker.
- Instead of exporting functions, make kernel_worker() a proper
function/API that does common work for the caller.
- Instead of adding new fields to kernel_clone_args for each option
make it flag based similar to CLONE_*.
- Drop unused completion struct in vhost.
- Fix compile warnings by merging vhost cgroup cleanup patch and
vhost conversion patch.





2021-11-21 17:50:09

by Mike Christie

[permalink] [raw]
Subject: [PATCH V5 02/10] fork/vm: Move common PF_IO_WORKER behavior to new flag

This adds a new flag, PF_USER_WORKER, that's used for behavior common to
to both PF_IO_WORKER and users like vhost which will use the new
kernel_worker helpers that will use the flag and are added later in this
patchset.

The common behavior PF_USER_WORKER covers is the initial frame and fpu
setup and the vm reclaim handling.

Signed-off-by: Mike Christie <[email protected]>
---
arch/alpha/kernel/process.c | 2 +-
arch/arc/kernel/process.c | 2 +-
arch/arm/kernel/process.c | 2 +-
arch/arm64/kernel/process.c | 2 +-
arch/csky/kernel/process.c | 2 +-
arch/h8300/kernel/process.c | 2 +-
arch/hexagon/kernel/process.c | 2 +-
arch/ia64/kernel/process.c | 2 +-
arch/m68k/kernel/process.c | 2 +-
arch/microblaze/kernel/process.c | 2 +-
arch/mips/kernel/process.c | 2 +-
arch/nds32/kernel/process.c | 2 +-
arch/nios2/kernel/process.c | 2 +-
arch/openrisc/kernel/process.c | 2 +-
arch/parisc/kernel/process.c | 2 +-
arch/powerpc/kernel/process.c | 2 +-
arch/riscv/kernel/process.c | 2 +-
arch/s390/kernel/process.c | 2 +-
arch/sh/kernel/process_32.c | 2 +-
arch/sparc/kernel/process_32.c | 2 +-
arch/sparc/kernel/process_64.c | 2 +-
arch/um/kernel/process.c | 2 +-
arch/x86/kernel/fpu/core.c | 4 ++--
arch/x86/kernel/process.c | 2 +-
arch/xtensa/kernel/process.c | 2 +-
include/linux/sched.h | 1 +
include/linux/sched/task.h | 1 +
kernel/fork.c | 5 ++++-
mm/vmscan.c | 4 ++--
29 files changed, 34 insertions(+), 29 deletions(-)

diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 5f8527081da9..f4759e4ee4a9 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -249,7 +249,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
childti->pcb.ksp = (unsigned long) childstack;
childti->pcb.flags = 1; /* set FEN, clear everything else */

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
/* kernel thread */
memset(childstack, 0,
sizeof(struct switch_stack) + sizeof(struct pt_regs));
diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c
index 8e90052f6f05..b409ecb1407f 100644
--- a/arch/arc/kernel/process.c
+++ b/arch/arc/kernel/process.c
@@ -191,7 +191,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
childksp[0] = 0; /* fp */
childksp[1] = (unsigned long)ret_from_fork; /* blink */

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(c_regs, 0, sizeof(struct pt_regs));

c_callee->r13 = kthread_arg;
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index d47159f3791c..44603062d661 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -251,7 +251,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
thread->cpu_domain = get_domain();
#endif

- if (likely(!(p->flags & (PF_KTHREAD | PF_IO_WORKER)))) {
+ if (likely(!(p->flags & (PF_KTHREAD | PF_USER_WORKER)))) {
*childregs = *current_pt_regs();
childregs->ARM_r0 = 0;
if (stack_start)
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index aacf2f5559a8..b4c6c41847be 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -333,7 +333,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,

ptrauth_thread_init_kernel(p);

- if (likely(!(p->flags & (PF_KTHREAD | PF_IO_WORKER)))) {
+ if (likely(!(p->flags & (PF_KTHREAD | PF_USER_WORKER)))) {
*childregs = *current_pt_regs();
childregs->regs[0] = 0;

diff --git a/arch/csky/kernel/process.c b/arch/csky/kernel/process.c
index 3d0ca22cd0e2..509f2bfe4ace 100644
--- a/arch/csky/kernel/process.c
+++ b/arch/csky/kernel/process.c
@@ -49,7 +49,7 @@ int copy_thread(unsigned long clone_flags,
/* setup thread.sp for switch_to !!! */
p->thread.sp = (unsigned long)childstack;

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(childregs, 0, sizeof(struct pt_regs));
childstack->r15 = (unsigned long) ret_from_kernel_thread;
childstack->r10 = kthread_arg;
diff --git a/arch/h8300/kernel/process.c b/arch/h8300/kernel/process.c
index 8833fa4f5d51..050aca44ba6d 100644
--- a/arch/h8300/kernel/process.c
+++ b/arch/h8300/kernel/process.c
@@ -112,7 +112,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,

childregs = (struct pt_regs *) (THREAD_SIZE + task_stack_page(p)) - 1;

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(childregs, 0, sizeof(struct pt_regs));
childregs->retpc = (unsigned long) ret_from_kernel_thread;
childregs->er4 = topstk; /* arg */
diff --git a/arch/hexagon/kernel/process.c b/arch/hexagon/kernel/process.c
index 232dfd8956aa..40f8294c6c7c 100644
--- a/arch/hexagon/kernel/process.c
+++ b/arch/hexagon/kernel/process.c
@@ -73,7 +73,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
sizeof(*ss));
ss->lr = (unsigned long)ret_from_fork;
p->thread.switch_sp = ss;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(childregs, 0, sizeof(struct pt_regs));
/* r24 <- fn, r25 <- arg */
ss->r24 = usp;
diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c
index 834df24a88f1..29015ebdcf1d 100644
--- a/arch/ia64/kernel/process.c
+++ b/arch/ia64/kernel/process.c
@@ -338,7 +338,7 @@ copy_thread(unsigned long clone_flags, unsigned long user_stack_base,

ia64_drop_fpu(p); /* don't pick up stale state from a CPU's fph */

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
if (unlikely(!user_stack_base)) {
/* fork_idle() called us */
return 0;
diff --git a/arch/m68k/kernel/process.c b/arch/m68k/kernel/process.c
index a6030dbaa089..cbb693012b5e 100644
--- a/arch/m68k/kernel/process.c
+++ b/arch/m68k/kernel/process.c
@@ -157,7 +157,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
*/
p->thread.fc = USER_DATA;

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
/* kernel thread */
memset(frame, 0, sizeof(struct fork_frame));
frame->regs.sr = PS_S;
diff --git a/arch/microblaze/kernel/process.c b/arch/microblaze/kernel/process.c
index 5e2b91c1e8ce..de1da9900f7e 100644
--- a/arch/microblaze/kernel/process.c
+++ b/arch/microblaze/kernel/process.c
@@ -59,7 +59,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
struct pt_regs *childregs = task_pt_regs(p);
struct thread_info *ti = task_thread_info(p);

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
/* if we're creating a new kernel thread then just zeroing all
* the registers. That's OK for a brand new thread.*/
memset(childregs, 0, sizeof(struct pt_regs));
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index cbff1b974f88..6f1ed337cd41 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -120,7 +120,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
/* Put the stack after the struct pt_regs. */
childksp = (unsigned long) childregs;
p->thread.cp0_status = (read_c0_status() & ~(ST0_CU2|ST0_CU1)) | ST0_KERNEL_CUMASK;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
/* kernel thread */
unsigned long status = p->thread.cp0_status;
memset(childregs, 0, sizeof(struct pt_regs));
diff --git a/arch/nds32/kernel/process.c b/arch/nds32/kernel/process.c
index 49fab9e39cbf..dba91dd1e289 100644
--- a/arch/nds32/kernel/process.c
+++ b/arch/nds32/kernel/process.c
@@ -156,7 +156,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,

memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context));

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(childregs, 0, sizeof(struct pt_regs));
/* kernel thread fn */
p->thread.cpu_context.r6 = stack_start;
diff --git a/arch/nios2/kernel/process.c b/arch/nios2/kernel/process.c
index f8ea522a1588..eabf3452e5e2 100644
--- a/arch/nios2/kernel/process.c
+++ b/arch/nios2/kernel/process.c
@@ -109,7 +109,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
struct switch_stack *childstack =
((struct switch_stack *)childregs) - 1;

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(childstack, 0,
sizeof(struct switch_stack) + sizeof(struct pt_regs));

diff --git a/arch/openrisc/kernel/process.c b/arch/openrisc/kernel/process.c
index 3c0c91bcdcba..aa110383cfa1 100644
--- a/arch/openrisc/kernel/process.c
+++ b/arch/openrisc/kernel/process.c
@@ -172,7 +172,7 @@ copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
sp -= sizeof(struct pt_regs);
kregs = (struct pt_regs *)sp;

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(kregs, 0, sizeof(struct pt_regs));
kregs->gpr[20] = usp; /* fn, kernel thread */
kregs->gpr[22] = arg;
diff --git a/arch/parisc/kernel/process.c b/arch/parisc/kernel/process.c
index ea3d83b6fb62..a76120e30eb4 100644
--- a/arch/parisc/kernel/process.c
+++ b/arch/parisc/kernel/process.c
@@ -197,7 +197,7 @@ copy_thread(unsigned long clone_flags, unsigned long usp,
extern void * const ret_from_kernel_thread;
extern void * const child_return;

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
/* kernel thread */
memset(cregs, 0, sizeof(struct pt_regs));
if (!usp) /* idle thread */
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 406d7ee9e322..0b06b7809816 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1700,7 +1700,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
/* Copy registers */
sp -= sizeof(struct pt_regs);
childregs = (struct pt_regs *) sp;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
/* kernel thread */
memset(childregs, 0, sizeof(struct pt_regs));
childregs->gpr[1] = sp + sizeof(struct pt_regs);
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index 03ac3aa611f5..8deeb94eb51e 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -125,7 +125,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
struct pt_regs *childregs = task_pt_regs(p);

/* p->thread holds context to be restored by __switch_to() */
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
/* Kernel thread */
memset(childregs, 0, sizeof(struct pt_regs));
childregs->gp = gp_in_global;
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index e8858b2de24b..5ce90a23b1eb 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -130,7 +130,7 @@ int copy_thread(unsigned long clone_flags, unsigned long new_stackp,
frame->sf.gprs[9] = (unsigned long)frame;

/* Store access registers to kernel stack of new process. */
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
/* kernel thread */
memset(&frame->childregs, 0, sizeof(struct pt_regs));
frame->childregs.psw.mask = PSW_KERNEL_BITS | PSW_MASK_DAT |
diff --git a/arch/sh/kernel/process_32.c b/arch/sh/kernel/process_32.c
index 1c28e3cddb60..0506a739b0a8 100644
--- a/arch/sh/kernel/process_32.c
+++ b/arch/sh/kernel/process_32.c
@@ -114,7 +114,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,

childregs = task_pt_regs(p);
p->thread.sp = (unsigned long) childregs;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(childregs, 0, sizeof(struct pt_regs));
p->thread.pc = (unsigned long) ret_from_kernel_thread;
childregs->regs[4] = arg;
diff --git a/arch/sparc/kernel/process_32.c b/arch/sparc/kernel/process_32.c
index 2dc0bf9fe62e..5386e56b5e6c 100644
--- a/arch/sparc/kernel/process_32.c
+++ b/arch/sparc/kernel/process_32.c
@@ -296,7 +296,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
ti->ksp = (unsigned long) new_stack;
p->thread.kregs = childregs;

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
extern int nwindows;
unsigned long psr;
memset(new_stack, 0, STACKFRAME_SZ + TRACEREG_SZ);
diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
index f5b2cac8669f..6058b3966f71 100644
--- a/arch/sparc/kernel/process_64.c
+++ b/arch/sparc/kernel/process_64.c
@@ -594,7 +594,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
sizeof(struct sparc_stackf));
t->fpsaved[0] = 0;

- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(child_trap_frame, 0, child_stack_sz);
__thread_flag_byte_ptr(t)[TI_FLAG_BYTE_CWP] =
(current_pt_regs()->tstate + 1) & TSTATE_CWP;
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 82107373ac7e..2cb57aefacbe 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -157,7 +157,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
unsigned long arg, struct task_struct * p, unsigned long tls)
{
void (*handler)(void);
- int kthread = current->flags & (PF_KTHREAD | PF_IO_WORKER);
+ int kthread = current->flags & (PF_KTHREAD | PF_USER_WORKER);
int ret = 0;

p->thread = (struct thread_struct) INIT_THREAD;
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 8ea306b1bf8e..f3fc1b1db999 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -485,10 +485,10 @@ int fpu_clone(struct task_struct *dst, unsigned long clone_flags)
set_tsk_thread_flag(dst, TIF_NEED_FPU_LOAD);

/*
- * No FPU state inheritance for kernel threads and IO
+ * No FPU state inheritance for kernel threads and user
* worker threads.
*/
- if (dst->flags & (PF_KTHREAD | PF_IO_WORKER)) {
+ if (dst->flags & (PF_KTHREAD | PF_USER_WORKER)) {
/* Clear out the minimal state */
memcpy(&dst_fpu->fpstate->regs, &init_fpstate.regs,
init_fpstate_copy_size());
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index e9ee8b526319..3120a1aa37f3 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -195,7 +195,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
task_user_gs(p) = get_user_gs(current_pt_regs());
#endif

- if (unlikely(p->flags & PF_IO_WORKER)) {
+ if (unlikely(p->flags & PF_USER_WORKER)) {
/*
* An IO thread is a user space thread, but it doesn't
* return to ret_after_fork().
diff --git a/arch/xtensa/kernel/process.c b/arch/xtensa/kernel/process.c
index bd80df890b1e..00d81668ead4 100644
--- a/arch/xtensa/kernel/process.c
+++ b/arch/xtensa/kernel/process.c
@@ -224,7 +224,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp_thread_fn,
#error Unsupported Xtensa ABI
#endif

- if (!(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (!(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
struct pt_regs *regs = current_pt_regs();
unsigned long usp = usp_thread_fn ?
usp_thread_fn : regs->areg[1];
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78c351e35fec..5ca45456a77a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1683,6 +1683,7 @@ extern struct pid *cad_pid;
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
+#define PF_USER_WORKER 0x01000000 /* Kernel thread cloned from userspace thread */
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MEMALLOC_PIN 0x10000000 /* Allocation context constrained to zones which allow long term pinning. */
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index c688e1d2e3e3..1ad98e31d5bc 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -19,6 +19,7 @@ struct css_set;
#define CLONE_LEGACY_FLAGS 0xffffffffULL

#define KERN_WORKER_IO BIT(0)
+#define KERN_WORKER_USER BIT(1)

struct kernel_clone_args {
u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index 58dcbf7dd28d..8f6cd9581e2e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2032,6 +2032,9 @@ static __latent_entropy struct task_struct *copy_process(
siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
}

+ if (args->worker_flags & KERN_WORKER_USER)
+ p->flags |= PF_USER_WORKER;
+
/*
* This _must_ happen before we call free_task(), i.e. before we jump
* to any of the bad_fork_* labels. This is to avoid freeing
@@ -2524,7 +2527,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
.exit_signal = (lower_32_bits(flags) & CSIGNAL),
.stack = (unsigned long)fn,
.stack_size = (unsigned long)arg,
- .worker_flags = KERN_WORKER_IO,
+ .worker_flags = KERN_WORKER_IO | KERN_WORKER_USER,
};

return copy_process(NULL, 0, node, &args);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..f504ff88a09d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1028,12 +1028,12 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
DEFINE_WAIT(wait);

/*
- * Do not throttle IO workers, kthreads other than kswapd or
+ * Do not throttle user workers, kthreads other than kswapd or
* workqueues. They may be required for reclaim to make
* forward progress (e.g. journalling workqueues or kthreads).
*/
if (!current_is_kswapd() &&
- current->flags & (PF_IO_WORKER|PF_KTHREAD))
+ current->flags & (PF_USER_WORKER|PF_KTHREAD))
return;

/*
--
2.25.1


2021-11-21 17:50:14

by Mike Christie

[permalink] [raw]
Subject: [PATCH V5 05/10] signal: Perfom autoreap for PF_USER_WORKER

Userspace doesn't know about PF_USER_WORKER threads, so it can't do wait
to clean them up. For cases like where qemu will do dynamic/hot add/remove
of vhost devices, then we need to auto reap the thread like was done for
the kthread case, because qemu does not know what API the kernel/vhost
layer is using.

This has us do autoreaping for these threads similar to when the parent
ignores SIGCHLD and for kthreads.

Signed-off-by: Mike Christie <[email protected]>
---
kernel/signal.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 7c4b7ae714d4..07e7d6f9bf66 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2049,9 +2049,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)

psig = tsk->parent->sighand;
spin_lock_irqsave(&psig->siglock, flags);
- if (!tsk->ptrace && sig == SIGCHLD &&
+ if (!tsk->ptrace && (tsk->flags & PF_USER_WORKER || (sig == SIGCHLD &&
(psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
- (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
+ (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))))) {
/*
* We are exiting and our parent doesn't care. POSIX.1
* defines special semantics for setting SIGCHLD to SIG_IGN
--
2.25.1


2021-11-21 17:50:16

by Mike Christie

[permalink] [raw]
Subject: [PATCH V5 01/10] fork: Make IO worker options flag based

This patchset adds a couple new options to kernel_clone_args for IO thread
like/related users. Instead of adding new fields to kernel_clone_args for
each option, this moves us to a flags based approach by first converting
io_thread.

Signed-off-by: Mike Christie <[email protected]>
Suggested-by: Christian Brauner <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/sched/task.h | 4 +++-
kernel/fork.c | 4 ++--
2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ba88a6987400..c688e1d2e3e3 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -18,8 +18,11 @@ struct css_set;
/* All the bits taken by the old clone syscall. */
#define CLONE_LEGACY_FLAGS 0xffffffffULL

+#define KERN_WORKER_IO BIT(0)
+
struct kernel_clone_args {
u64 flags;
+ u32 worker_flags;
int __user *pidfd;
int __user *child_tid;
int __user *parent_tid;
@@ -31,7 +34,6 @@ struct kernel_clone_args {
/* Number of elements in *set_tid */
size_t set_tid_size;
int cgroup;
- int io_thread;
struct cgroup *cgrp;
struct css_set *cset;
};
diff --git a/kernel/fork.c b/kernel/fork.c
index 3244cc56b697..58dcbf7dd28d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2023,7 +2023,7 @@ static __latent_entropy struct task_struct *copy_process(
p = dup_task_struct(current, node);
if (!p)
goto fork_out;
- if (args->io_thread) {
+ if (args->worker_flags & KERN_WORKER_IO) {
/*
* Mark us an IO worker, and block any signal that isn't
* fatal or STOP
@@ -2524,7 +2524,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
.exit_signal = (lower_32_bits(flags) & CSIGNAL),
.stack = (unsigned long)fn,
.stack_size = (unsigned long)arg,
- .io_thread = 1,
+ .worker_flags = KERN_WORKER_IO,
};

return copy_process(NULL, 0, node, &args);
--
2.25.1


2021-11-21 17:50:18

by Mike Christie

[permalink] [raw]
Subject: [PATCH V5 09/10] vhost: move worker thread fields to new struct

This is just a prep patch. It moves the worker related fields to a new
vhost_worker struct and moves the code around to create some helpers that
will be used in the next patch.

Signed-off-by: Mike Christie <[email protected]>
Reviewed-by: Stefan Hajnoczi <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/vhost/vhost.c | 98 ++++++++++++++++++++++++++++---------------
drivers/vhost/vhost.h | 11 +++--
2 files changed, 72 insertions(+), 37 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 59edb5a1ffe2..c9a1f706989c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -263,8 +263,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
* sure it was not in the list.
* test_and_set_bit() implies a memory barrier.
*/
- llist_add(&work->node, &dev->work_list);
- wake_up_process(dev->worker);
+ llist_add(&work->node, &dev->worker->work_list);
+ wake_up_process(dev->worker->task);
}
}
EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -272,7 +272,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue);
/* A lockless hint for busy polling code to exit the loop */
bool vhost_has_work(struct vhost_dev *dev)
{
- return !llist_empty(&dev->work_list);
+ return dev->worker && !llist_empty(&dev->worker->work_list);
}
EXPORT_SYMBOL_GPL(vhost_has_work);

@@ -343,7 +343,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,

static int vhost_worker(void *data)
{
- struct vhost_dev *dev = data;
+ struct vhost_worker *worker = data;
+ struct vhost_dev *dev = worker->dev;
struct vhost_work *work, *work_next;
struct llist_node *node;

@@ -358,7 +359,7 @@ static int vhost_worker(void *data)
break;
}

- node = llist_del_all(&dev->work_list);
+ node = llist_del_all(&worker->work_list);
if (!node)
schedule();

@@ -368,7 +369,7 @@ static int vhost_worker(void *data)
llist_for_each_entry_safe(work, work_next, node, node) {
clear_bit(VHOST_WORK_QUEUED, &work->flags);
__set_current_state(TASK_RUNNING);
- kcov_remote_start_common(dev->kcov_handle);
+ kcov_remote_start_common(worker->kcov_handle);
work->fn(work);
kcov_remote_stop();
if (need_resched())
@@ -487,7 +488,6 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->byte_weight = byte_weight;
dev->use_worker = use_worker;
dev->msg_handler = msg_handler;
- init_llist_head(&dev->work_list);
init_waitqueue_head(&dev->wait);
INIT_LIST_HEAD(&dev->read_list);
INIT_LIST_HEAD(&dev->pending_list);
@@ -579,10 +579,60 @@ static void vhost_detach_mm(struct vhost_dev *dev)
dev->mm = NULL;
}

+static void vhost_worker_free(struct vhost_dev *dev)
+{
+ struct vhost_worker *worker = dev->worker;
+
+ if (!worker)
+ return;
+
+ dev->worker = NULL;
+ WARN_ON(!llist_empty(&worker->work_list));
+ kthread_stop(worker->task);
+ kfree(worker);
+}
+
+static int vhost_worker_create(struct vhost_dev *dev)
+{
+ struct vhost_worker *worker;
+ struct task_struct *task;
+ int ret;
+
+ worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
+ if (!worker)
+ return -ENOMEM;
+
+ dev->worker = worker;
+ worker->dev = dev;
+ worker->kcov_handle = kcov_common_handle();
+ init_llist_head(&worker->work_list);
+
+ task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
+ if (IS_ERR(task)) {
+ ret = PTR_ERR(task);
+ goto free_worker;
+ }
+
+ worker->task = task;
+ wake_up_process(task); /* avoid contributing to loadavg */
+
+ ret = vhost_attach_cgroups(dev);
+ if (ret)
+ goto stop_worker;
+
+ return 0;
+
+stop_worker:
+ kthread_stop(worker->task);
+free_worker:
+ kfree(worker);
+ dev->worker = NULL;
+ return ret;
+}
+
/* Caller should have device mutex */
long vhost_dev_set_owner(struct vhost_dev *dev)
{
- struct task_struct *worker;
int err;

/* Is there an owner already? */
@@ -593,36 +643,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)

vhost_attach_mm(dev);

- dev->kcov_handle = kcov_common_handle();
if (dev->use_worker) {
- worker = kthread_create(vhost_worker, dev,
- "vhost-%d", current->pid);
- if (IS_ERR(worker)) {
- err = PTR_ERR(worker);
- goto err_worker;
- }
-
- dev->worker = worker;
- wake_up_process(worker); /* avoid contributing to loadavg */
-
- err = vhost_attach_cgroups(dev);
+ err = vhost_worker_create(dev);
if (err)
- goto err_cgroup;
+ goto err_worker;
}

err = vhost_dev_alloc_iovecs(dev);
if (err)
- goto err_cgroup;
+ goto err_iovecs;

return 0;
-err_cgroup:
- if (dev->worker) {
- kthread_stop(dev->worker);
- dev->worker = NULL;
- }
+err_iovecs:
+ vhost_worker_free(dev);
err_worker:
vhost_detach_mm(dev);
- dev->kcov_handle = 0;
err_mm:
return err;
}
@@ -712,12 +747,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
dev->iotlb = NULL;
vhost_clear_msg(dev);
wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
- WARN_ON(!llist_empty(&dev->work_list));
- if (dev->worker) {
- kthread_stop(dev->worker);
- dev->worker = NULL;
- dev->kcov_handle = 0;
- }
+ vhost_worker_free(dev);
vhost_detach_mm(dev);
}
EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 638bb640d6b4..102ce25e4e13 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -25,6 +25,13 @@ struct vhost_work {
unsigned long flags;
};

+struct vhost_worker {
+ struct task_struct *task;
+ struct llist_head work_list;
+ struct vhost_dev *dev;
+ u64 kcov_handle;
+};
+
/* Poll a file (eventfd or socket) */
/* Note: there's nothing vhost specific about this structure. */
struct vhost_poll {
@@ -148,8 +155,7 @@ struct vhost_dev {
struct vhost_virtqueue **vqs;
int nvqs;
struct eventfd_ctx *log_ctx;
- struct llist_head work_list;
- struct task_struct *worker;
+ struct vhost_worker *worker;
struct vhost_iotlb *umem;
struct vhost_iotlb *iotlb;
spinlock_t iotlb_lock;
@@ -159,7 +165,6 @@ struct vhost_dev {
int iov_limit;
int weight;
int byte_weight;
- u64 kcov_handle;
bool use_worker;
int (*msg_handler)(struct vhost_dev *dev,
struct vhost_iotlb_msg *msg);
--
2.25.1


2021-11-21 17:50:21

by Mike Christie

[permalink] [raw]
Subject: [PATCH V5 07/10] io_uring: switch to kernel_worker

Convert io_uring and io-wq to use kernel_worker.

Signed-off-by: Mike Christie <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
fs/io-wq.c | 15 ++++++++-------
fs/io_uring.c | 11 +++++------
2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 88202de519f6..b39cf045d6ce 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -70,6 +70,9 @@ struct io_worker {

#define IO_WQ_NR_HASH_BUCKETS (1u << IO_WQ_HASH_ORDER)

+#define IO_WQ_CLONE_FLAGS (CLONE_FS | CLONE_FILES | CLONE_SIGHAND | \
+ CLONE_THREAD | CLONE_IO)
+
struct io_wqe_acct {
unsigned nr_workers;
unsigned max_workers;
@@ -600,13 +603,9 @@ static int io_wqe_worker(void *data)
struct io_wqe *wqe = worker->wqe;
struct io_wq *wq = wqe->wq;
bool last_timeout = false;
- char buf[TASK_COMM_LEN];

worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);

- snprintf(buf, sizeof(buf), "iou-wrk-%d", wq->task->pid);
- set_task_comm(current, buf);
-
audit_alloc_kernel(current);

while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
@@ -704,7 +703,7 @@ static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
list_add_tail_rcu(&worker->all_list, &wqe->all_list);
worker->flags |= IO_WORKER_F_FREE;
raw_spin_unlock(&wqe->lock);
- wake_up_new_task(tsk);
+ kernel_worker_start(tsk, "iou-wrk-%d", wqe->wq->task->pid);
}

static bool io_wq_work_match_all(struct io_wq_work *work, void *data)
@@ -734,7 +733,8 @@ static void create_worker_cont(struct callback_head *cb)
worker = container_of(cb, struct io_worker, create_work);
clear_bit_unlock(0, &worker->create_state);
wqe = worker->wqe;
- tsk = create_io_thread(io_wqe_worker, worker, wqe->node);
+ tsk = kernel_worker(io_wqe_worker, worker, wqe->node,
+ IO_WQ_CLONE_FLAGS, KERN_WORKER_IO);
if (!IS_ERR(tsk)) {
io_init_new_worker(wqe, worker, tsk);
io_worker_release(worker);
@@ -801,7 +801,8 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
if (index == IO_WQ_ACCT_BOUND)
worker->flags |= IO_WORKER_F_BOUND;

- tsk = create_io_thread(io_wqe_worker, worker, wqe->node);
+ tsk = kernel_worker(io_wqe_worker, worker, wqe->node, IO_WQ_CLONE_FLAGS,
+ KERN_WORKER_IO);
if (!IS_ERR(tsk)) {
io_init_new_worker(wqe, worker, tsk);
} else if (!io_should_retry_thread(PTR_ERR(tsk))) {
diff --git a/fs/io_uring.c b/fs/io_uring.c
index b07196b4511c..d01eef85cc03 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7431,12 +7431,8 @@ static int io_sq_thread(void *data)
struct io_sq_data *sqd = data;
struct io_ring_ctx *ctx;
unsigned long timeout = 0;
- char buf[TASK_COMM_LEN];
DEFINE_WAIT(wait);

- snprintf(buf, sizeof(buf), "iou-sqp-%d", sqd->task_pid);
- set_task_comm(current, buf);
-
if (sqd->sq_cpu != -1)
set_cpus_allowed_ptr(current, cpumask_of(sqd->sq_cpu));
else
@@ -8665,6 +8661,8 @@ static __cold int io_sq_offload_create(struct io_ring_ctx *ctx,
fdput(f);
}
if (ctx->flags & IORING_SETUP_SQPOLL) {
+ unsigned long flags = CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
+ CLONE_THREAD | CLONE_IO;
struct task_struct *tsk;
struct io_sq_data *sqd;
bool attached;
@@ -8710,7 +8708,8 @@ static __cold int io_sq_offload_create(struct io_ring_ctx *ctx,

sqd->task_pid = current->pid;
sqd->task_tgid = current->tgid;
- tsk = create_io_thread(io_sq_thread, sqd, NUMA_NO_NODE);
+ tsk = kernel_worker(io_sq_thread, sqd, NUMA_NO_NODE,
+ flags, KERN_WORKER_IO);
if (IS_ERR(tsk)) {
ret = PTR_ERR(tsk);
goto err_sqpoll;
@@ -8718,7 +8717,7 @@ static __cold int io_sq_offload_create(struct io_ring_ctx *ctx,

sqd->thread = tsk;
ret = io_uring_alloc_task_context(tsk, ctx);
- wake_up_new_task(tsk);
+ kernel_worker_start(tsk, "iou-sqp-%d", sqd->task_pid);
if (ret)
goto err;
} else if (p->flags & IORING_SETUP_SQ_AFF) {
--
2.25.1


2021-11-21 17:50:23

by Mike Christie

[permalink] [raw]
Subject: [PATCH V5 03/10] fork: add KERNEL_WORKER flag to not dup/clone files

Each vhost device gets a thread that is used to perform IO and management
operations. Instead of a thread that is accessing a device, the thread is
part of the device, so when it calls the kernel_worker() function added in
the next patch we can't dup or clone the parent's files/FDS because it
would do an extra increment on ourself.

Later, when we do:

Qemu process exits:
do_exit -> exit_files -> put_files_struct -> close_files

we would leak the device's resources because of that extra refcount
on the fd or file_struct.

This patch adds a no_files option so these worker threads can prevent
taking an extra refcount on themselves.

Signed-off-by: Mike Christie <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/sched/task.h | 1 +
kernel/fork.c | 11 +++++++++--
2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 1ad98e31d5bc..47ede41b19c5 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -20,6 +20,7 @@ struct css_set;

#define KERN_WORKER_IO BIT(0)
#define KERN_WORKER_USER BIT(1)
+#define KERN_WORKER_NO_FILES BIT(2)

struct kernel_clone_args {
u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index 8f6cd9581e2e..a1ba423eec4d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1529,7 +1529,8 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
return 0;
}

-static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
+static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
+ int no_files)
{
struct files_struct *oldf, *newf;
int error = 0;
@@ -1541,6 +1542,11 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
if (!oldf)
goto out;

+ if (no_files) {
+ tsk->files = NULL;
+ goto out;
+ }
+
if (clone_flags & CLONE_FILES) {
atomic_inc(&oldf->count);
goto out;
@@ -2179,7 +2185,8 @@ static __latent_entropy struct task_struct *copy_process(
retval = copy_semundo(clone_flags, p);
if (retval)
goto bad_fork_cleanup_security;
- retval = copy_files(clone_flags, p);
+ retval = copy_files(clone_flags, p,
+ args->worker_flags & KERN_WORKER_NO_FILES);
if (retval)
goto bad_fork_cleanup_semundo;
retval = copy_fs(clone_flags, p);
--
2.25.1


2021-11-21 17:50:24

by Mike Christie

[permalink] [raw]
Subject: [PATCH V5 08/10] fork: remove create_io_thread

create_io_thread is not used anymore so remove it.

Signed-off-by: Mike Christie <[email protected]>
---
include/linux/sched/task.h | 1 -
kernel/fork.c | 22 ----------------------
2 files changed, 23 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 2188be3a3142..313fb8c825ae 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -89,7 +89,6 @@ extern void exit_files(struct task_struct *);
extern void exit_itimers(struct signal_struct *);

extern pid_t kernel_clone(struct kernel_clone_args *kargs);
-struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
unsigned long clone_flags, u32 worker_flags);
__printf(2, 3)
diff --git a/kernel/fork.c b/kernel/fork.c
index 3729abafbdf9..1b44d048ec16 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2521,28 +2521,6 @@ struct mm_struct *copy_init_mm(void)
return dup_mm(NULL, &init_mm);
}

-/*
- * This is like kernel_clone(), but shaved down and tailored to just
- * creating io_uring workers. It returns a created task, or an error pointer.
- * The returned task is inactive, and the caller must fire it up through
- * wake_up_new_task(p). All signals are blocked in the created task.
- */
-struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
-{
- unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
- CLONE_IO;
- struct kernel_clone_args args = {
- .flags = ((lower_32_bits(flags) | CLONE_VM |
- CLONE_UNTRACED) & ~CSIGNAL),
- .exit_signal = (lower_32_bits(flags) & CSIGNAL),
- .stack = (unsigned long)fn,
- .stack_size = (unsigned long)arg,
- .worker_flags = KERN_WORKER_IO | KERN_WORKER_USER,
- };
-
- return copy_process(NULL, 0, node, &args);
-}
-
static bool kernel_worker_flags_valid(struct kernel_clone_args *kargs)
{
/* Verify that no unknown flags are passed along. */
--
2.25.1


2021-11-21 17:50:27

by Mike Christie

[permalink] [raw]
Subject: [PATCH V5 04/10] fork: Add KERNEL_WORKER flag to ignore signals

From: Christian Brauner <[email protected]>

Since this is mirroring kthread's sig ignore api introduced in

commit 10ab825bdef8 ("change kernel threads to ignore signals instead of
blocking them")

this patch adds an option flag, KERNEL_WORKER_SIG_IGN, handled in
copy_process() after copy_sighand() and copy_signals() to ease the
transition from kthreads to this new api.

Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Mike Christie <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/sched/task.h | 1 +
kernel/fork.c | 3 +++
2 files changed, 4 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 47ede41b19c5..ef3a8e7f7ed9 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -21,6 +21,7 @@ struct css_set;
#define KERN_WORKER_IO BIT(0)
#define KERN_WORKER_USER BIT(1)
#define KERN_WORKER_NO_FILES BIT(2)
+#define KERN_WORKER_SIG_IGN BIT(3)

struct kernel_clone_args {
u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index a1ba423eec4d..6e6050d588ae 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2211,6 +2211,9 @@ static __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;

+ if (args->worker_flags & KERN_WORKER_SIG_IGN)
+ ignore_signals(p);
+
stackleak_task_init(p);

if (pid != &init_struct_pid) {
--
2.25.1


2021-11-21 17:50:30

by Mike Christie

[permalink] [raw]
Subject: [PATCH V5 06/10] fork: add helper to clone a process

The vhost layer has similar requirements as io_uring where its worker
threads need to access the userspace thread's memory, want to inherit the
parents's cgroups and namespaces, and be checked against the parent's
RLIMITs. Right now, the vhost layer uses the kthread API which has
kthread_use_mm for mem access, and those threads can use
cgroup_attach_task_all for v1 cgroups, but there are no helpers for the
other items.

This adds a helper to clone a process so we can inherit everything we
want in one call. It's a more generic version of create_io_thread which
will be used by the vhost layer and io_uring in later patches in this set.

[added flag validation code from Christian Brauner's SIG_IGN patch]
Signed-off-by: Mike Christie <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/sched/task.h | 4 +++
kernel/fork.c | 71 ++++++++++++++++++++++++++++++++++++++
2 files changed, 75 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ef3a8e7f7ed9..2188be3a3142 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -90,6 +90,10 @@ extern void exit_itimers(struct signal_struct *);

extern pid_t kernel_clone(struct kernel_clone_args *kargs);
struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
+struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
+ unsigned long clone_flags, u32 worker_flags);
+__printf(2, 3)
+void kernel_worker_start(struct task_struct *tsk, const char namefmt[], ...);
struct task_struct *fork_idle(int);
struct mm_struct *copy_init_mm(void);
extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/kernel/fork.c b/kernel/fork.c
index 6e6050d588ae..3729abafbdf9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2543,6 +2543,77 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
return copy_process(NULL, 0, node, &args);
}

+static bool kernel_worker_flags_valid(struct kernel_clone_args *kargs)
+{
+ /* Verify that no unknown flags are passed along. */
+ if (kargs->worker_flags & ~(KERN_WORKER_IO | KERN_WORKER_USER |
+ KERN_WORKER_NO_FILES | KERN_WORKER_SIG_IGN))
+ return false;
+
+ /*
+ * If we're ignoring all signals don't allow sharing struct sighand and
+ * don't bother clearing signal handlers.
+ */
+ if ((kargs->flags & (CLONE_SIGHAND | CLONE_CLEAR_SIGHAND)) &&
+ (kargs->worker_flags & KERN_WORKER_SIG_IGN))
+ return false;
+
+ return true;
+}
+
+/**
+ * kernel_worker - create a copy of a process to be used by the kernel
+ * @fn: thread stack
+ * @arg: data to be passed to fn
+ * @node: numa node to allocate task from
+ * @clone_flags: CLONE flags
+ * @worker_flags: KERN_WORKER flags
+ *
+ * This returns a created task, or an error pointer. The returned task is
+ * inactive, and the caller must fire it up through kernel_worker_start(). If
+ * this is an PF_IO_WORKER all singals but KILL and STOP are blocked.
+ */
+struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
+ unsigned long clone_flags, u32 worker_flags)
+{
+ struct kernel_clone_args args = {
+ .flags = ((lower_32_bits(clone_flags) | CLONE_VM |
+ CLONE_UNTRACED) & ~CSIGNAL),
+ .exit_signal = (lower_32_bits(clone_flags) & CSIGNAL),
+ .stack = (unsigned long)fn,
+ .stack_size = (unsigned long)arg,
+ .worker_flags = KERN_WORKER_USER | worker_flags,
+ };
+
+ if (!kernel_worker_flags_valid(&args))
+ return ERR_PTR(-EINVAL);
+
+ return copy_process(NULL, 0, node, &args);
+}
+EXPORT_SYMBOL_GPL(kernel_worker);
+
+/**
+ * kernel_worker_start - Start a task created with kernel_worker
+ * @tsk: task to wake up
+ * @namefmt: printf-style format string for the thread name
+ * @arg: arguments for @namefmt
+ */
+void kernel_worker_start(struct task_struct *tsk, const char namefmt[], ...)
+{
+ char name[TASK_COMM_LEN];
+ va_list args;
+
+ WARN_ON(!(tsk->flags & PF_USER_WORKER));
+
+ va_start(args, namefmt);
+ vsnprintf(name, sizeof(name), namefmt, args);
+ set_task_comm(tsk, name);
+ va_end(args);
+
+ wake_up_new_task(tsk);
+}
+EXPORT_SYMBOL_GPL(kernel_worker_start);
+
/*
* Ok, this is the main fork-routine.
*
--
2.25.1


2021-11-21 17:52:12

by Mike Christie

[permalink] [raw]
Subject: [PATCH V5 10/10] vhost: use kernel_worker to check RLIMITs

For vhost workers we use the kthread API which inherit's its values from
and checks against the kthreadd thread. This results in the wrong RLIMITs
being checked. This patch has us use the kernel_copy_process function
which will inherit its values/checks from the thread that owns the device.

Signed-off-by: Mike Christie <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>

---
drivers/vhost/vhost.c | 65 +++++++++++++++----------------------------
drivers/vhost/vhost.h | 7 ++++-
2 files changed, 28 insertions(+), 44 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c9a1f706989c..9aa04fcdf210 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -22,7 +22,6 @@
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/kthread.h>
-#include <linux/cgroup.h>
#include <linux/module.h>
#include <linux/sort.h>
#include <linux/sched/mm.h>
@@ -344,17 +343,14 @@ static void vhost_vq_reset(struct vhost_dev *dev,
static int vhost_worker(void *data)
{
struct vhost_worker *worker = data;
- struct vhost_dev *dev = worker->dev;
struct vhost_work *work, *work_next;
struct llist_node *node;

- kthread_use_mm(dev->mm);
-
for (;;) {
/* mb paired w/ kthread_stop */
set_current_state(TASK_INTERRUPTIBLE);

- if (kthread_should_stop()) {
+ if (test_bit(VHOST_WORKER_FLAG_STOP, &worker->flags)) {
__set_current_state(TASK_RUNNING);
break;
}
@@ -376,8 +372,9 @@ static int vhost_worker(void *data)
schedule();
}
}
- kthread_unuse_mm(dev->mm);
- return 0;
+
+ complete(worker->exit_done);
+ do_exit(0);
}

static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
@@ -517,31 +514,6 @@ long vhost_dev_check_owner(struct vhost_dev *dev)
}
EXPORT_SYMBOL_GPL(vhost_dev_check_owner);

-struct vhost_attach_cgroups_struct {
- struct vhost_work work;
- struct task_struct *owner;
- int ret;
-};
-
-static void vhost_attach_cgroups_work(struct vhost_work *work)
-{
- struct vhost_attach_cgroups_struct *s;
-
- s = container_of(work, struct vhost_attach_cgroups_struct, work);
- s->ret = cgroup_attach_task_all(s->owner, current);
-}
-
-static int vhost_attach_cgroups(struct vhost_dev *dev)
-{
- struct vhost_attach_cgroups_struct attach;
-
- attach.owner = current;
- vhost_work_init(&attach.work, vhost_attach_cgroups_work);
- vhost_work_queue(dev, &attach.work);
- vhost_work_dev_flush(dev);
- return attach.ret;
-}
-
/* Caller should have device mutex */
bool vhost_dev_has_owner(struct vhost_dev *dev)
{
@@ -579,6 +551,16 @@ static void vhost_detach_mm(struct vhost_dev *dev)
dev->mm = NULL;
}

+static void vhost_worker_stop(struct vhost_worker *worker)
+{
+ DECLARE_COMPLETION_ONSTACK(exit_done);
+
+ worker->exit_done = &exit_done;
+ set_bit(VHOST_WORKER_FLAG_STOP, &worker->flags);
+ wake_up_process(worker->task);
+ wait_for_completion(worker->exit_done);
+}
+
static void vhost_worker_free(struct vhost_dev *dev)
{
struct vhost_worker *worker = dev->worker;
@@ -588,7 +570,7 @@ static void vhost_worker_free(struct vhost_dev *dev)

dev->worker = NULL;
WARN_ON(!llist_empty(&worker->work_list));
- kthread_stop(worker->task);
+ vhost_worker_stop(worker);
kfree(worker);
}

@@ -603,27 +585,24 @@ static int vhost_worker_create(struct vhost_dev *dev)
return -ENOMEM;

dev->worker = worker;
- worker->dev = dev;
worker->kcov_handle = kcov_common_handle();
init_llist_head(&worker->work_list);

- task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
+ /*
+ * vhost used to use the kthread API which ignores all signals by
+ * default and the drivers expect this behavior.
+ */
+ task = kernel_worker(vhost_worker, worker, NUMA_NO_NODE, CLONE_FS,
+ KERN_WORKER_NO_FILES | KERN_WORKER_SIG_IGN);
if (IS_ERR(task)) {
ret = PTR_ERR(task);
goto free_worker;
}

worker->task = task;
- wake_up_process(task); /* avoid contributing to loadavg */
-
- ret = vhost_attach_cgroups(dev);
- if (ret)
- goto stop_worker;
-
+ kernel_worker_start(task, "vhost-%d", current->pid);
return 0;

-stop_worker:
- kthread_stop(worker->task);
free_worker:
kfree(worker);
dev->worker = NULL;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 102ce25e4e13..09748694cb66 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -25,11 +25,16 @@ struct vhost_work {
unsigned long flags;
};

+enum {
+ VHOST_WORKER_FLAG_STOP,
+};
+
struct vhost_worker {
struct task_struct *task;
+ struct completion *exit_done;
struct llist_head work_list;
- struct vhost_dev *dev;
u64 kcov_handle;
+ unsigned long flags;
};

/* Poll a file (eventfd or socket) */
--
2.25.1


2021-11-21 18:17:16

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH V5 07/10] io_uring: switch to kernel_worker

On 11/21/21 10:49 AM, Mike Christie wrote:
> Convert io_uring and io-wq to use kernel_worker.

I don't like the kernel_worker name, that implies it's always giving you
a kernel thread or kthread. That's not the io_uring use case, it's
really just a thread off the original task that just happens to never
exit to userspace.

Can we do a better name? At least io_thread doesn't imply that.

Also I do think this deserves a bit more commit message, there's really
nothing in here.

--
Jens Axboe


2021-11-22 08:08:23

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH V5 02/10] fork/vm: Move common PF_IO_WORKER behavior to new flag

On Sun, Nov 21, 2021 at 6:49 PM Mike Christie
<[email protected]> wrote:
> This adds a new flag, PF_USER_WORKER, that's used for behavior common to
> to both PF_IO_WORKER and users like vhost which will use the new
> kernel_worker helpers that will use the flag and are added later in this
> patchset.
>
> The common behavior PF_USER_WORKER covers is the initial frame and fpu
> setup and the vm reclaim handling.
>
> Signed-off-by: Mike Christie <[email protected]>

> arch/m68k/kernel/process.c | 2 +-

Acked-by: Geert Uytterhoeven <[email protected]>

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2021-11-22 10:02:40

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH V5 07/10] io_uring: switch to kernel_worker

On Sun, Nov 21, 2021 at 11:17:11AM -0700, Jens Axboe wrote:
> On 11/21/21 10:49 AM, Mike Christie wrote:
> > Convert io_uring and io-wq to use kernel_worker.
>
> I don't like the kernel_worker name, that implies it's always giving you
> a kernel thread or kthread. That's not the io_uring use case, it's
> really just a thread off the original task that just happens to never
> exit to userspace.
>
> Can we do a better name? At least io_thread doesn't imply that.

Yeah, I had thought about that as well and at first had kernel_uworker()
locally but wasn't convinced. Maybe we should just make it
create_user_worker()?

Christian

2021-11-22 14:20:10

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH V5 07/10] io_uring: switch to kernel_worker

On 11/22/21 3:02 AM, Christian Brauner wrote:
> On Sun, Nov 21, 2021 at 11:17:11AM -0700, Jens Axboe wrote:
>> On 11/21/21 10:49 AM, Mike Christie wrote:
>>> Convert io_uring and io-wq to use kernel_worker.
>>
>> I don't like the kernel_worker name, that implies it's always giving you
>> a kernel thread or kthread. That's not the io_uring use case, it's
>> really just a thread off the original task that just happens to never
>> exit to userspace.
>>
>> Can we do a better name? At least io_thread doesn't imply that.
>
> Yeah, I had thought about that as well and at first had kernel_uworker()
> locally but wasn't convinced. Maybe we should just make it
> create_user_worker()?

That's better, or maybe even create_user_inkernel_thread() or something?
Pretty long, though... I'd be fine with create_user_worker().

--
Jens Axboe


2021-11-22 16:48:43

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH V5 07/10] io_uring: switch to kernel_worker

On 11/22/21 8:20 AM, Jens Axboe wrote:
> On 11/22/21 3:02 AM, Christian Brauner wrote:
>> On Sun, Nov 21, 2021 at 11:17:11AM -0700, Jens Axboe wrote:
>>> On 11/21/21 10:49 AM, Mike Christie wrote:
>>>> Convert io_uring and io-wq to use kernel_worker.
>>>
>>> I don't like the kernel_worker name, that implies it's always giving you
>>> a kernel thread or kthread. That's not the io_uring use case, it's
>>> really just a thread off the original task that just happens to never
>>> exit to userspace.
>>>
>>> Can we do a better name? At least io_thread doesn't imply that.
>>
>> Yeah, I had thought about that as well and at first had kernel_uworker()
>> locally but wasn't convinced. Maybe we should just make it
>> create_user_worker()?
>
> That's better, or maybe even create_user_inkernel_thread() or something?
> Pretty long, though... I'd be fine with create_user_worker().
>

Ok, I'll do:

create_user_worker()
start_user_worker()

since you guys agree. It will also match the PF flag naming.

I'll also add more details to the commit message you requested.

2021-11-23 14:23:21

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH V5 07/10] io_uring: switch to kernel_worker

On Mon, Nov 22, 2021 at 10:47:28AM -0600, [email protected] wrote:
> On 11/22/21 8:20 AM, Jens Axboe wrote:
> > On 11/22/21 3:02 AM, Christian Brauner wrote:
> >> On Sun, Nov 21, 2021 at 11:17:11AM -0700, Jens Axboe wrote:
> >>> On 11/21/21 10:49 AM, Mike Christie wrote:
> >>>> Convert io_uring and io-wq to use kernel_worker.
> >>>
> >>> I don't like the kernel_worker name, that implies it's always giving you
> >>> a kernel thread or kthread. That's not the io_uring use case, it's
> >>> really just a thread off the original task that just happens to never
> >>> exit to userspace.
> >>>
> >>> Can we do a better name? At least io_thread doesn't imply that.
> >>
> >> Yeah, I had thought about that as well and at first had kernel_uworker()
> >> locally but wasn't convinced. Maybe we should just make it
> >> create_user_worker()?
> >
> > That's better, or maybe even create_user_inkernel_thread() or something?
> > Pretty long, though... I'd be fine with create_user_worker().
> >
>
> Ok, I'll do:
>
> create_user_worker()
> start_user_worker()
>
> since you guys agree. It will also match the PF flag naming.
>
> I'll also add more details to the commit message you requested.

Thanks!