The following patches, made over Linus's tree, allow the vhost layer to
do a copy_process on the thread that does the VHOST_SET_OWNER ioctl, like
how io_uring does a copy_process against its userspace app. This allows
the vhost layer's worker threads to inherit cgroups, namespaces, address
space, etc., and the worker threads are also accounted against the
owner/parent process's RLIMIT_NPROC limit.
If you are not familiar with qemu and vhost, here is a more detailed
problem description:
Qemu will create vhost devices in the kernel which perform network, SCSI,
etc. IO and management operations from worker threads created by the
kthread API. Because the kthread API does a copy_process on the kthreadd
thread, the vhost layer has to use kthread_use_mm to access the Qemu
thread's memory and cgroup_attach_task_all to add itself to the Qemu
thread's cgroups.
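As a rough sketch of that pattern (heavily simplified; my_dev,
my_dev_has_work and the error handling are placeholders, not the real
vhost names):

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/cgroup.h>
#include <linux/sched.h>
#include <linux/sched/mm.h>

struct my_dev {
	struct task_struct *owner;
	struct task_struct *worker;
	struct mm_struct *mm;
	/* work list, locks, ... */
};

static bool my_dev_has_work(struct my_dev *dev)
{
	return false;		/* stands in for the real work-pending check */
}

static int my_worker_fn(void *data)
{
	struct my_dev *dev = data;

	/* Explicitly borrow the owner's mm and join its cgroups. */
	kthread_use_mm(dev->mm);
	cgroup_attach_task_all(dev->owner, current);

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (kthread_should_stop()) {
			__set_current_state(TASK_RUNNING);
			break;
		}
		if (!my_dev_has_work(dev)) {
			schedule();
			continue;
		}
		__set_current_state(TASK_RUNNING);
		/* run queued work, touching guest memory via dev->mm */
	}

	kthread_unuse_mm(dev->mm);
	return 0;
}

/* Called from the owner (Qemu) thread, e.g. on VHOST_SET_OWNER. */
static int my_dev_set_owner(struct my_dev *dev)
{
	dev->owner = current;
	dev->mm = current->mm;
	mmgrab(dev->mm);

	dev->worker = kthread_create(my_worker_fn, dev, "vhost-%d",
				     current->pid);
	if (IS_ERR(dev->worker)) {
		mmdrop(dev->mm);
		return PTR_ERR(dev->worker);
	}

	wake_up_process(dev->worker);
	return 0;
}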
The problem with this approach is that we then have to add new functions/
args/functionality for everything we want to inherit. I started doing
that here:
https://lkml.org/lkml/2021/6/23/1233
for the RLIMIT_NPROC check, but it seems it might be easier to just
inherit everything from the beginning, because I'd need to do something
like that patch several times.
V8:
- Fix kzalloc GFP use.
- Fix email subject version number.
V7:
- Drop generic user_worker_* helpers and replace with vhost_task specific
ones.
- Drop autoreap patch. Use kernel_wait4 instead.
- Fix issue where vhost.ko could be removed while the worker function is
still running.
V6:
- Rename kernel_worker to user_worker and fix prefixes.
- Add better patch descriptions.
V5:
- Handle kbuild errors by building patchset against current kernel that
has all deps merged. Also add patch to remove create_io_thread code as
it's not used anymore.
- Rebase patchset against current kernel and handle a new vm PF_IO_WORKER
case added in 5.16-rc1.
- Add PF_USER_WORKER flag so we can check it later after the initial
thread creation for the wake up, vm and signal cases.
- Added patch to auto reap the worker thread.
V4:
- Drop NO_SIG patch and replaced with Christian's SIG_IGN patch.
- Merged Christian's kernel_worker_flags_valid helpers into patch 5 that
added the new kernel worker functions.
- Fixed extra "i" issue.
- Added PF_USER_WORKER flag and added check that kernel_worker_start users
had that flag set. Also dropped patches that passed worker flags to
copy_thread and replaced with PF_USER_WORKER check.
V3:
- Add parentheses in p->flag and work_flags check in copy_thread.
- Fix check in arm/arm64 which was doing the reverse of other archs
where it did likely(!flags) instead of unlikely(flags).
V2:
- Rename kernel_copy_process to kernel_worker.
- Instead of exporting functions, make kernel_worker() a proper
function/API that does common work for the caller.
- Instead of adding new fields to kernel_clone_args for each option
make it flag based similar to CLONE_*.
- Drop unused completion struct in vhost.
- Fix compile warnings by merging vhost cgroup cleanup patch and
vhost conversion patch.
This adds a new flag, PF_USER_WORKER, that's used for behavior common to
both PF_IO_WORKER and users like vhost, which will use a new helper
instead of create_io_thread because they require different behavior.
The common behavior PF_USER_WORKER covers is the initial frame and FPU
setup and the vm reclaim handling.
Signed-off-by: Mike Christie <[email protected]>
[For m68k changes]
Acked-by: Geert Uytterhoeven <[email protected]>
---
arch/alpha/kernel/process.c | 2 +-
arch/arc/kernel/process.c | 2 +-
arch/arm/kernel/process.c | 2 +-
arch/arm64/kernel/process.c | 2 +-
arch/csky/kernel/process.c | 2 +-
arch/h8300/kernel/process.c | 2 +-
arch/hexagon/kernel/process.c | 2 +-
arch/ia64/kernel/process.c | 2 +-
arch/m68k/kernel/process.c | 2 +-
arch/microblaze/kernel/process.c | 2 +-
arch/mips/kernel/process.c | 2 +-
arch/nds32/kernel/process.c | 2 +-
arch/nios2/kernel/process.c | 2 +-
arch/openrisc/kernel/process.c | 2 +-
arch/parisc/kernel/process.c | 2 +-
arch/powerpc/kernel/process.c | 2 +-
arch/riscv/kernel/process.c | 2 +-
arch/s390/kernel/process.c | 2 +-
arch/sh/kernel/process_32.c | 2 +-
arch/sparc/kernel/process_32.c | 2 +-
arch/sparc/kernel/process_64.c | 2 +-
arch/um/kernel/process.c | 2 +-
arch/x86/kernel/fpu/core.c | 4 ++--
arch/x86/kernel/process.c | 2 +-
arch/xtensa/kernel/process.c | 2 +-
include/linux/sched.h | 1 +
include/linux/sched/task.h | 3 ++-
kernel/fork.c | 6 +++++-
mm/vmscan.c | 4 ++--
29 files changed, 36 insertions(+), 30 deletions(-)
diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 5f8527081da9..f4759e4ee4a9 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -249,7 +249,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
childti->pcb.ksp = (unsigned long) childstack;
childti->pcb.flags = 1; /* set FEN, clear everything else */
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
/* kernel thread */
memset(childstack, 0,
sizeof(struct switch_stack) + sizeof(struct pt_regs));
diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c
index 8e90052f6f05..b409ecb1407f 100644
--- a/arch/arc/kernel/process.c
+++ b/arch/arc/kernel/process.c
@@ -191,7 +191,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
childksp[0] = 0; /* fp */
childksp[1] = (unsigned long)ret_from_fork; /* blink */
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(c_regs, 0, sizeof(struct pt_regs));
c_callee->r13 = kthread_arg;
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index d47159f3791c..44603062d661 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -251,7 +251,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
thread->cpu_domain = get_domain();
#endif
- if (likely(!(p->flags & (PF_KTHREAD | PF_IO_WORKER)))) {
+ if (likely(!(p->flags & (PF_KTHREAD | PF_USER_WORKER)))) {
*childregs = *current_pt_regs();
childregs->ARM_r0 = 0;
if (stack_start)
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 5369e649fa79..2528fb9650ef 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -334,7 +334,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
ptrauth_thread_init_kernel(p);
- if (likely(!(p->flags & (PF_KTHREAD | PF_IO_WORKER)))) {
+ if (likely(!(p->flags & (PF_KTHREAD | PF_USER_WORKER)))) {
*childregs = *current_pt_regs();
childregs->regs[0] = 0;
diff --git a/arch/csky/kernel/process.c b/arch/csky/kernel/process.c
index 3d0ca22cd0e2..509f2bfe4ace 100644
--- a/arch/csky/kernel/process.c
+++ b/arch/csky/kernel/process.c
@@ -49,7 +49,7 @@ int copy_thread(unsigned long clone_flags,
/* setup thread.sp for switch_to !!! */
p->thread.sp = (unsigned long)childstack;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(childregs, 0, sizeof(struct pt_regs));
childstack->r15 = (unsigned long) ret_from_kernel_thread;
childstack->r10 = kthread_arg;
diff --git a/arch/h8300/kernel/process.c b/arch/h8300/kernel/process.c
index 8833fa4f5d51..050aca44ba6d 100644
--- a/arch/h8300/kernel/process.c
+++ b/arch/h8300/kernel/process.c
@@ -112,7 +112,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
childregs = (struct pt_regs *) (THREAD_SIZE + task_stack_page(p)) - 1;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(childregs, 0, sizeof(struct pt_regs));
childregs->retpc = (unsigned long) ret_from_kernel_thread;
childregs->er4 = topstk; /* arg */
diff --git a/arch/hexagon/kernel/process.c b/arch/hexagon/kernel/process.c
index 232dfd8956aa..40f8294c6c7c 100644
--- a/arch/hexagon/kernel/process.c
+++ b/arch/hexagon/kernel/process.c
@@ -73,7 +73,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
sizeof(*ss));
ss->lr = (unsigned long)ret_from_fork;
p->thread.switch_sp = ss;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(childregs, 0, sizeof(struct pt_regs));
/* r24 <- fn, r25 <- arg */
ss->r24 = usp;
diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c
index 834df24a88f1..29015ebdcf1d 100644
--- a/arch/ia64/kernel/process.c
+++ b/arch/ia64/kernel/process.c
@@ -338,7 +338,7 @@ copy_thread(unsigned long clone_flags, unsigned long user_stack_base,
ia64_drop_fpu(p); /* don't pick up stale state from a CPU's fph */
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
if (unlikely(!user_stack_base)) {
/* fork_idle() called us */
return 0;
diff --git a/arch/m68k/kernel/process.c b/arch/m68k/kernel/process.c
index a6030dbaa089..cbb693012b5e 100644
--- a/arch/m68k/kernel/process.c
+++ b/arch/m68k/kernel/process.c
@@ -157,7 +157,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
*/
p->thread.fc = USER_DATA;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
/* kernel thread */
memset(frame, 0, sizeof(struct fork_frame));
frame->regs.sr = PS_S;
diff --git a/arch/microblaze/kernel/process.c b/arch/microblaze/kernel/process.c
index 5e2b91c1e8ce..de1da9900f7e 100644
--- a/arch/microblaze/kernel/process.c
+++ b/arch/microblaze/kernel/process.c
@@ -59,7 +59,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
struct pt_regs *childregs = task_pt_regs(p);
struct thread_info *ti = task_thread_info(p);
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
/* if we're creating a new kernel thread then just zeroing all
* the registers. That's OK for a brand new thread.*/
memset(childregs, 0, sizeof(struct pt_regs));
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index cbff1b974f88..6f1ed337cd41 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -120,7 +120,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
/* Put the stack after the struct pt_regs. */
childksp = (unsigned long) childregs;
p->thread.cp0_status = (read_c0_status() & ~(ST0_CU2|ST0_CU1)) | ST0_KERNEL_CUMASK;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
/* kernel thread */
unsigned long status = p->thread.cp0_status;
memset(childregs, 0, sizeof(struct pt_regs));
diff --git a/arch/nds32/kernel/process.c b/arch/nds32/kernel/process.c
index 49fab9e39cbf..dba91dd1e289 100644
--- a/arch/nds32/kernel/process.c
+++ b/arch/nds32/kernel/process.c
@@ -156,7 +156,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context));
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(childregs, 0, sizeof(struct pt_regs));
/* kernel thread fn */
p->thread.cpu_context.r6 = stack_start;
diff --git a/arch/nios2/kernel/process.c b/arch/nios2/kernel/process.c
index f8ea522a1588..eabf3452e5e2 100644
--- a/arch/nios2/kernel/process.c
+++ b/arch/nios2/kernel/process.c
@@ -109,7 +109,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
struct switch_stack *childstack =
((struct switch_stack *)childregs) - 1;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(childstack, 0,
sizeof(struct switch_stack) + sizeof(struct pt_regs));
diff --git a/arch/openrisc/kernel/process.c b/arch/openrisc/kernel/process.c
index 3c0c91bcdcba..aa110383cfa1 100644
--- a/arch/openrisc/kernel/process.c
+++ b/arch/openrisc/kernel/process.c
@@ -172,7 +172,7 @@ copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
sp -= sizeof(struct pt_regs);
kregs = (struct pt_regs *)sp;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(kregs, 0, sizeof(struct pt_regs));
kregs->gpr[20] = usp; /* fn, kernel thread */
kregs->gpr[22] = arg;
diff --git a/arch/parisc/kernel/process.c b/arch/parisc/kernel/process.c
index ea3d83b6fb62..a76120e30eb4 100644
--- a/arch/parisc/kernel/process.c
+++ b/arch/parisc/kernel/process.c
@@ -197,7 +197,7 @@ copy_thread(unsigned long clone_flags, unsigned long usp,
extern void * const ret_from_kernel_thread;
extern void * const child_return;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
/* kernel thread */
memset(cregs, 0, sizeof(struct pt_regs));
if (!usp) /* idle thread */
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 984813a4d5dc..02f91fcc1417 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1736,7 +1736,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
/* Copy registers */
sp -= sizeof(struct pt_regs);
childregs = (struct pt_regs *) sp;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
/* kernel thread */
memset(childregs, 0, sizeof(struct pt_regs));
childregs->gpr[1] = sp + sizeof(struct pt_regs);
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index 03ac3aa611f5..8deeb94eb51e 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -125,7 +125,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
struct pt_regs *childregs = task_pt_regs(p);
/* p->thread holds context to be restored by __switch_to() */
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
/* Kernel thread */
memset(childregs, 0, sizeof(struct pt_regs));
childregs->gp = gp_in_global;
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 71d86f73b02c..51169bb351d4 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -130,7 +130,7 @@ int copy_thread(unsigned long clone_flags, unsigned long new_stackp,
frame->sf.gprs[9] = (unsigned long)frame;
/* Store access registers to kernel stack of new process. */
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
/* kernel thread */
memset(&frame->childregs, 0, sizeof(struct pt_regs));
frame->childregs.psw.mask = PSW_KERNEL_BITS | PSW_MASK_DAT |
diff --git a/arch/sh/kernel/process_32.c b/arch/sh/kernel/process_32.c
index 1c28e3cddb60..0506a739b0a8 100644
--- a/arch/sh/kernel/process_32.c
+++ b/arch/sh/kernel/process_32.c
@@ -114,7 +114,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
childregs = task_pt_regs(p);
p->thread.sp = (unsigned long) childregs;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(childregs, 0, sizeof(struct pt_regs));
p->thread.pc = (unsigned long) ret_from_kernel_thread;
childregs->regs[4] = arg;
diff --git a/arch/sparc/kernel/process_32.c b/arch/sparc/kernel/process_32.c
index 2dc0bf9fe62e..5386e56b5e6c 100644
--- a/arch/sparc/kernel/process_32.c
+++ b/arch/sparc/kernel/process_32.c
@@ -296,7 +296,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
ti->ksp = (unsigned long) new_stack;
p->thread.kregs = childregs;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
extern int nwindows;
unsigned long psr;
memset(new_stack, 0, STACKFRAME_SZ + TRACEREG_SZ);
diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
index f5b2cac8669f..6058b3966f71 100644
--- a/arch/sparc/kernel/process_64.c
+++ b/arch/sparc/kernel/process_64.c
@@ -594,7 +594,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
sizeof(struct sparc_stackf));
t->fpsaved[0] = 0;
- if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(child_trap_frame, 0, child_stack_sz);
__thread_flag_byte_ptr(t)[TI_FLAG_BYTE_CWP] =
(current_pt_regs()->tstate + 1) & TSTATE_CWP;
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 4a420778ed87..25dd496fd820 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -158,7 +158,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
unsigned long arg, struct task_struct * p, unsigned long tls)
{
void (*handler)(void);
- int kthread = current->flags & (PF_KTHREAD | PF_IO_WORKER);
+ int kthread = current->flags & (PF_KTHREAD | PF_USER_WORKER);
int ret = 0;
p->thread = (struct thread_struct) INIT_THREAD;
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 8dea01ffc5c1..c5b34fa8bb0c 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -579,10 +579,10 @@ int fpu_clone(struct task_struct *dst, unsigned long clone_flags)
set_tsk_thread_flag(dst, TIF_NEED_FPU_LOAD);
/*
- * No FPU state inheritance for kernel threads and IO
+ * No FPU state inheritance for kernel threads and user
* worker threads.
*/
- if (dst->flags & (PF_KTHREAD | PF_IO_WORKER)) {
+ if (dst->flags & (PF_KTHREAD | PF_USER_WORKER)) {
/* Clear out the minimal state */
memcpy(&dst_fpu->fpstate->regs, &init_fpstate.regs,
init_fpstate_copy_size());
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 81d8ef036637..b0e58fe55b31 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -195,7 +195,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
task_user_gs(p) = get_user_gs(current_pt_regs());
#endif
- if (unlikely(p->flags & PF_IO_WORKER)) {
+ if (unlikely(p->flags & PF_USER_WORKER)) {
/*
* An IO thread is a user space thread, but it doesn't
* return to ret_after_fork().
diff --git a/arch/xtensa/kernel/process.c b/arch/xtensa/kernel/process.c
index bd80df890b1e..00d81668ead4 100644
--- a/arch/xtensa/kernel/process.c
+++ b/arch/xtensa/kernel/process.c
@@ -224,7 +224,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp_thread_fn,
#error Unsupported Xtensa ABI
#endif
- if (!(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ if (!(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
struct pt_regs *regs = current_pt_regs();
unsigned long usp = usp_thread_fn ?
usp_thread_fn : regs->areg[1];
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75ba8aa60248..fc45e692f6a2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1690,6 +1690,7 @@ extern struct pid *cad_pid;
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
+#define PF_USER_WORKER 0x01000000 /* Kernel thread cloned from userspace thread */
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MEMALLOC_PIN 0x10000000 /* Allocation context constrained to zones which allow long term pinning. */
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 0bf95966ae7d..0467b3ce1c94 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -18,7 +18,8 @@ struct css_set;
/* All the bits taken by the old clone syscall. */
#define CLONE_LEGACY_FLAGS 0xffffffffULL
-#define USER_WORKER_IO BIT(0)
+#define USER_WORKER BIT(0)
+#define USER_WORKER_IO BIT(1)
struct kernel_clone_args {
u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9273fd81a329..1f6e7184b31f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1998,6 +1998,10 @@ static __latent_entropy struct task_struct *copy_process(
p = dup_task_struct(current, node);
if (!p)
goto fork_out;
+
+ if (args->worker_flags & USER_WORKER)
+ p->flags |= PF_USER_WORKER;
+
if (args->worker_flags & USER_WORKER_IO) {
/*
* Mark us an IO worker, and block any signal that isn't
@@ -2497,7 +2501,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
.exit_signal = (lower_32_bits(flags) & CSIGNAL),
.stack = (unsigned long)fn,
.stack_size = (unsigned long)arg,
- .worker_flags = USER_WORKER_IO,
+ .worker_flags = USER_WORKER | USER_WORKER_IO,
};
return copy_process(NULL, 0, node, &args);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 090bfb605ecf..bb85b454ff07 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1061,12 +1061,12 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
DEFINE_WAIT(wait);
/*
- * Do not throttle IO workers, kthreads other than kswapd or
+ * Do not throttle user workers, kthreads other than kswapd or
* workqueues. They may be required for reclaim to make
* forward progress (e.g. journalling workqueues or kthreads).
*/
if (!current_is_kswapd() &&
- current->flags & (PF_IO_WORKER|PF_KTHREAD))
+ current->flags & (PF_USER_WORKER|PF_KTHREAD))
return;
/*
--
2.25.1
Qemu will create vhost devices in the kernel which perform network, SCSI,
etc. IO and management operations from worker threads created by the
kthread API. Because the kthread API does a copy_process on the kthreadd
thread, the vhost layer has to use kthread_use_mm to access the Qemu
thread's memory and cgroup_attach_task_all to add itself to the Qemu
thread's cgroups, and the worker thread bypasses the Qemu process's
RLIMIT_NPROC limit.
This patch adds a new struct vhost_task which can be used instead of
kthreads. It allows the vhost layer to use copy_process and inherit
the userspace process's mm and cgroups, and the task is accounted
against the userspace process's RLIMIT_NPROC limit.
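For illustration, here is a minimal usage sketch of the API added below
(this is not part of the patch; my_ctx and its work helpers are made-up
names):

#include <linux/numa.h>
#include <linux/sched.h>
#include <linux/sched/vhost_task.h>

struct my_ctx {
	struct vhost_task *vtsk;
	/* driver work state ... */
};

/* stand-ins for the driver's real work handling */
static bool my_ctx_has_work(struct my_ctx *ctx) { return false; }
static void my_ctx_run_work(struct my_ctx *ctx) { }

static int my_worker(void *data)
{
	struct my_ctx *ctx = data;

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (vhost_task_should_stop(ctx->vtsk)) {
			__set_current_state(TASK_RUNNING);
			break;
		}
		if (!my_ctx_has_work(ctx)) {
			schedule();
			continue;
		}
		__set_current_state(TASK_RUNNING);
		my_ctx_run_work(ctx);
	}
	return 0;
}

/* Called from the owner's context, e.g. the VHOST_SET_OWNER ioctl. */
static int my_ctx_start(struct my_ctx *ctx)
{
	ctx->vtsk = vhost_task_create(my_worker, ctx, NUMA_NO_NODE);
	if (!ctx->vtsk)
		return -ENOMEM;
	vhost_task_start(ctx->vtsk, "vhost-%d", current->pid);
	return 0;
}

static void my_ctx_stop(struct my_ctx *ctx)
{
	/* wakes the task, waits for it to exit and then reaps and frees it */
	vhost_task_stop(ctx->vtsk);
	ctx->vtsk = NULL;
}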
Signed-off-by: Mike Christie <[email protected]>
---
MAINTAINERS | 2 +
drivers/vhost/Kconfig | 5 ++
include/linux/sched/vhost_task.h | 23 ++++++
kernel/Makefile | 1 +
kernel/vhost_task.c | 123 +++++++++++++++++++++++++++++++
5 files changed, 154 insertions(+)
create mode 100644 include/linux/sched/vhost_task.h
create mode 100644 kernel/vhost_task.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 3e461db9cd91..4a3d9541e3cc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20477,7 +20477,9 @@ L: [email protected]
L: [email protected]
S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
+F: kernel/vhost_task.c
F: drivers/vhost/
+F: include/linux/sched/vhost_task.h
F: include/linux/vhost_iotlb.h
F: include/uapi/linux/vhost.h
diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 587fbae06182..b455d9ab6f3d 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -13,9 +13,14 @@ config VHOST_RING
This option is selected by any driver which needs to access
the host side of a virtio ring.
+config VHOST_TASK
+ bool
+ default n
+
config VHOST
tristate
select VHOST_IOTLB
+ select VHOST_TASK
help
This option is selected by any driver which needs to access
the core of vhost.
diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
new file mode 100644
index 000000000000..50d02a25d37b
--- /dev/null
+++ b/include/linux/sched/vhost_task.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_VHOST_TASK_H
+#define _LINUX_VHOST_TASK_H
+
+#include <linux/completion.h>
+
+struct task_struct;
+
+struct vhost_task {
+ int (*fn)(void *data);
+ void *data;
+ struct completion exited;
+ unsigned long flags;
+ struct task_struct *task;
+};
+
+struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, int node);
+__printf(2, 3)
+void vhost_task_start(struct vhost_task *vtsk, const char namefmt[], ...);
+void vhost_task_stop(struct vhost_task *vtsk);
+bool vhost_task_should_stop(struct vhost_task *vtsk);
+
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 56f4ee97f328..d82f388082b8 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -15,6 +15,7 @@ obj-y = fork.o exec_domain.o panic.o \
obj-$(CONFIG_USERMODE_DRIVER) += usermode_driver.o
obj-$(CONFIG_MODULES) += kmod.o
obj-$(CONFIG_MULTIUSER) += groups.o
+obj-$(CONFIG_VHOST_TASK) += vhost_task.o
ifdef CONFIG_FUNCTION_TRACER
# Do not trace internal ftrace files
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
new file mode 100644
index 000000000000..4ab1c195bc76
--- /dev/null
+++ b/kernel/vhost_task.c
@@ -0,0 +1,123 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2021 Oracle Corporation
+ */
+#include <linux/slab.h>
+#include <linux/completion.h>
+#include <linux/sched/task.h>
+#include <linux/sched/vhost_task.h>
+#include <linux/sched/signal.h>
+
+enum vhost_task_flags {
+ VHOST_TASK_FLAGS_STOP,
+};
+
+static void vhost_task_fn(void *data)
+{
+ struct vhost_task *vtsk = data;
+ int ret;
+
+ ret = vtsk->fn(vtsk->data);
+ complete(&vtsk->exited);
+ do_exit(ret);
+}
+
+/**
+ * vhost_task_stop - stop a vhost_task
+ * @vtsk: vhost_task to stop
+ *
+ * Callers must call vhost_task_should_stop and return from their worker
+ * function when it returns true.
+ */
+void vhost_task_stop(struct vhost_task *vtsk)
+{
+ pid_t pid = vtsk->task->pid;
+
+ set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
+ wake_up_process(vtsk->task);
+ /*
+ * Make sure vhost_task_fn is no longer accessing the vhost_task before
+ * freeing it below. If userspace crashed or exited without closing,
+ * then the vhost_task->task could already be marked dead so
+ * kernel_wait will return early.
+ */
+ wait_for_completion(&vtsk->exited);
+ /*
+ * If we are just closing/removing a device and the parent process is
+ * not exiting then reap the task.
+ */
+ kernel_wait4(pid, NULL, __WCLONE, NULL);
+ kfree(vtsk);
+}
+EXPORT_SYMBOL_GPL(vhost_task_stop);
+
+/**
+ * vhost_task_should_stop - should the vhost task return from the work function
+ */
+bool vhost_task_should_stop(struct vhost_task *vtsk)
+{
+ return test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
+}
+EXPORT_SYMBOL_GPL(vhost_task_should_stop);
+
+/**
+ * vhost_task_create - create a copy of a process to be used by the kernel
+ * @fn: worker function to run
+ * @arg: data to be passed to fn
+ * @node: numa node to allocate task from
+ *
+ * This returns a specialized task for use by the vhost layer or NULL on
+ * failure. The returned task is inactive, and the caller must fire it up
+ * through vhost_task_start().
+ */
+struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, int node)
+{
+ struct kernel_clone_args args = {
+ .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+ .exit_signal = 0,
+ .stack = (unsigned long)vhost_task_fn,
+ .worker_flags = USER_WORKER | USER_WORKER_NO_FILES |
+ USER_WORKER_SIG_IGN,
+ };
+ struct vhost_task *vtsk;
+ struct task_struct *tsk;
+
+ vtsk = kzalloc(sizeof(*vtsk), GFP_KERNEL);
+ if (!vtsk)
+ return NULL;
+
+ init_completion(&vtsk->exited);
+ vtsk->data = arg;
+ vtsk->fn = fn;
+ args.stack_size = (unsigned long)vtsk;
+
+ tsk = copy_process(NULL, 0, node, &args);
+ if (IS_ERR(tsk)) {
+ kfree(vtsk);
+ return NULL;
+ }
+
+ vtsk->task = tsk;
+
+ return vtsk;
+}
+EXPORT_SYMBOL_GPL(vhost_task_create);
+
+/**
+ * vhost_task_start - start a vhost_task created with vhost_task_create
+ * @vtsk: vhost_task to wake up
+ * @namefmt: printf-style format string for the thread name
+ */
+void vhost_task_start(struct vhost_task *vtsk, const char namefmt[], ...)
+{
+ char name[TASK_COMM_LEN];
+ va_list args;
+
+ va_start(args, namefmt);
+ vsnprintf(name, sizeof(name), namefmt, args);
+ set_task_comm(vtsk->task, name);
+ va_end(args);
+
+ wake_up_new_task(vtsk->task);
+}
+EXPORT_SYMBOL_GPL(vhost_task_start);
--
2.25.1
From: Christian Brauner <[email protected]>
Since commit 10ab825bdef8 ("change kernel threads to ignore signals
instead of blocking them"), kthreads have ignored signals by default, and
the vhost layer has never had a need to change that. This patch adds an
option flag, USER_WORKER_SIG_IGN, handled in copy_process() after
copy_sighand() and copy_signals(), so that the vhost_tasks added in the
following patches, which are based directly on task_structs instead of
kthreads, can continue to ignore signals.
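For reference, callers opt in through kernel_clone_args->worker_flags. The
vhost_task patch earlier in this series passes the flag roughly as below
(illustrative only; it assumes copy_process() is visible to the caller,
as it is for kernel/vhost_task.c):

#include <linux/sched/task.h>

static struct task_struct *create_sig_ign_worker(int (*fn)(void *),
						 void *arg, int node)
{
	struct kernel_clone_args args = {
		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
		.exit_signal	= 0,
		/* fn/arg ride in stack/stack_size, as for kernel threads */
		.stack		= (unsigned long)fn,
		.stack_size	= (unsigned long)arg,
		.worker_flags	= USER_WORKER | USER_WORKER_NO_FILES |
				  USER_WORKER_SIG_IGN,
	};

	/* copy_process() will call ignore_signals() on the new task */
	return copy_process(NULL, 0, node, &args);
}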
Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Mike Christie <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/sched/task.h | 1 +
kernel/fork.c | 3 +++
2 files changed, 4 insertions(+)
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 706ca3dc3423..c001d0fa1426 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -21,6 +21,7 @@ struct css_set;
#define USER_WORKER BIT(0)
#define USER_WORKER_IO BIT(1)
#define USER_WORKER_NO_FILES BIT(2)
+#define USER_WORKER_SIG_IGN BIT(3)
struct kernel_clone_args {
u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index 81a1096d6ee9..6a04860fa1ae 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2185,6 +2185,9 @@ static __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;
+ if (args->worker_flags & USER_WORKER_SIG_IGN)
+ ignore_signals(p);
+
stackleak_task_init(p);
if (pid != &init_struct_pid) {
--
2.25.1
This is just a prep patch. It moves the worker-related fields to a new
vhost_worker struct and moves the code around to create some helpers that
will be used in the next patch.
Signed-off-by: Mike Christie <[email protected]>
Reviewed-by: Stefan Hajnoczi <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/vhost/vhost.c | 98 ++++++++++++++++++++++++++++---------------
drivers/vhost/vhost.h | 11 +++--
2 files changed, 72 insertions(+), 37 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 59edb5a1ffe2..c9a1f706989c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -263,8 +263,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
* sure it was not in the list.
* test_and_set_bit() implies a memory barrier.
*/
- llist_add(&work->node, &dev->work_list);
- wake_up_process(dev->worker);
+ llist_add(&work->node, &dev->worker->work_list);
+ wake_up_process(dev->worker->task);
}
}
EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -272,7 +272,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue);
/* A lockless hint for busy polling code to exit the loop */
bool vhost_has_work(struct vhost_dev *dev)
{
- return !llist_empty(&dev->work_list);
+ return dev->worker && !llist_empty(&dev->worker->work_list);
}
EXPORT_SYMBOL_GPL(vhost_has_work);
@@ -343,7 +343,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
static int vhost_worker(void *data)
{
- struct vhost_dev *dev = data;
+ struct vhost_worker *worker = data;
+ struct vhost_dev *dev = worker->dev;
struct vhost_work *work, *work_next;
struct llist_node *node;
@@ -358,7 +359,7 @@ static int vhost_worker(void *data)
break;
}
- node = llist_del_all(&dev->work_list);
+ node = llist_del_all(&worker->work_list);
if (!node)
schedule();
@@ -368,7 +369,7 @@ static int vhost_worker(void *data)
llist_for_each_entry_safe(work, work_next, node, node) {
clear_bit(VHOST_WORK_QUEUED, &work->flags);
__set_current_state(TASK_RUNNING);
- kcov_remote_start_common(dev->kcov_handle);
+ kcov_remote_start_common(worker->kcov_handle);
work->fn(work);
kcov_remote_stop();
if (need_resched())
@@ -487,7 +488,6 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->byte_weight = byte_weight;
dev->use_worker = use_worker;
dev->msg_handler = msg_handler;
- init_llist_head(&dev->work_list);
init_waitqueue_head(&dev->wait);
INIT_LIST_HEAD(&dev->read_list);
INIT_LIST_HEAD(&dev->pending_list);
@@ -579,10 +579,60 @@ static void vhost_detach_mm(struct vhost_dev *dev)
dev->mm = NULL;
}
+static void vhost_worker_free(struct vhost_dev *dev)
+{
+ struct vhost_worker *worker = dev->worker;
+
+ if (!worker)
+ return;
+
+ dev->worker = NULL;
+ WARN_ON(!llist_empty(&worker->work_list));
+ kthread_stop(worker->task);
+ kfree(worker);
+}
+
+static int vhost_worker_create(struct vhost_dev *dev)
+{
+ struct vhost_worker *worker;
+ struct task_struct *task;
+ int ret;
+
+ worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
+ if (!worker)
+ return -ENOMEM;
+
+ dev->worker = worker;
+ worker->dev = dev;
+ worker->kcov_handle = kcov_common_handle();
+ init_llist_head(&worker->work_list);
+
+ task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
+ if (IS_ERR(task)) {
+ ret = PTR_ERR(task);
+ goto free_worker;
+ }
+
+ worker->task = task;
+ wake_up_process(task); /* avoid contributing to loadavg */
+
+ ret = vhost_attach_cgroups(dev);
+ if (ret)
+ goto stop_worker;
+
+ return 0;
+
+stop_worker:
+ kthread_stop(worker->task);
+free_worker:
+ kfree(worker);
+ dev->worker = NULL;
+ return ret;
+}
+
/* Caller should have device mutex */
long vhost_dev_set_owner(struct vhost_dev *dev)
{
- struct task_struct *worker;
int err;
/* Is there an owner already? */
@@ -593,36 +643,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
vhost_attach_mm(dev);
- dev->kcov_handle = kcov_common_handle();
if (dev->use_worker) {
- worker = kthread_create(vhost_worker, dev,
- "vhost-%d", current->pid);
- if (IS_ERR(worker)) {
- err = PTR_ERR(worker);
- goto err_worker;
- }
-
- dev->worker = worker;
- wake_up_process(worker); /* avoid contributing to loadavg */
-
- err = vhost_attach_cgroups(dev);
+ err = vhost_worker_create(dev);
if (err)
- goto err_cgroup;
+ goto err_worker;
}
err = vhost_dev_alloc_iovecs(dev);
if (err)
- goto err_cgroup;
+ goto err_iovecs;
return 0;
-err_cgroup:
- if (dev->worker) {
- kthread_stop(dev->worker);
- dev->worker = NULL;
- }
+err_iovecs:
+ vhost_worker_free(dev);
err_worker:
vhost_detach_mm(dev);
- dev->kcov_handle = 0;
err_mm:
return err;
}
@@ -712,12 +747,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
dev->iotlb = NULL;
vhost_clear_msg(dev);
wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
- WARN_ON(!llist_empty(&dev->work_list));
- if (dev->worker) {
- kthread_stop(dev->worker);
- dev->worker = NULL;
- dev->kcov_handle = 0;
- }
+ vhost_worker_free(dev);
vhost_detach_mm(dev);
}
EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 638bb640d6b4..102ce25e4e13 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -25,6 +25,13 @@ struct vhost_work {
unsigned long flags;
};
+struct vhost_worker {
+ struct task_struct *task;
+ struct llist_head work_list;
+ struct vhost_dev *dev;
+ u64 kcov_handle;
+};
+
/* Poll a file (eventfd or socket) */
/* Note: there's nothing vhost specific about this structure. */
struct vhost_poll {
@@ -148,8 +155,7 @@ struct vhost_dev {
struct vhost_virtqueue **vqs;
int nvqs;
struct eventfd_ctx *log_ctx;
- struct llist_head work_list;
- struct task_struct *worker;
+ struct vhost_worker *worker;
struct vhost_iotlb *umem;
struct vhost_iotlb *iotlb;
spinlock_t iotlb_lock;
@@ -159,7 +165,6 @@ struct vhost_dev {
int iov_limit;
int weight;
int byte_weight;
- u64 kcov_handle;
bool use_worker;
int (*msg_handler)(struct vhost_dev *dev,
struct vhost_iotlb_msg *msg);
--
2.25.1
Hi Eric and Christian,
The second patch now has a conflict with Linus's tree. I was going to
rebase, but I wanted to ping you guys first to see if you have any
comments or issues.
On 2/9/22 5:29 PM, Mike Christie wrote:
> The following patches, made over Linus's tree, allow the vhost layer to
> do a copy_process on the thread that does the VHOST_SET_OWNER ioctl, like
> how io_uring does a copy_process against its userspace app. This allows
> the vhost layer's worker threads to inherit cgroups, namespaces, address
> space, etc., and the worker threads are also accounted against the
> owner/parent process's RLIMIT_NPROC limit.
>
> If you are not familiar with qemu and vhost, here is a more detailed
> problem description:
>
> Qemu will create vhost devices in the kernel which perform network, SCSI,
> etc. IO and management operations from worker threads created by the
> kthread API. Because the kthread API does a copy_process on the kthreadd
> thread, the vhost layer has to use kthread_use_mm to access the Qemu
> thread's memory and cgroup_attach_task_all to add itself to the Qemu
> thread's cgroups.
>
> The problem with this approach is that we then have to add new functions/
> args/functionality for everything we want to inherit. I started doing
> that here:
>
> https://lkml.org/lkml/2021/6/23/1233
>
> for the RLIMIT_NPROC check, but it seems it might be easier to just
> inherit everything from the beginning, because I'd need to do something
> like that patch several times.
>
> V8:
> - Fix kzalloc GFP use.
> - Fix email subject version number.
> V7:
> - Drop generic user_worker_* helpers and replace with vhost_task specific
> ones.
> - Drop autoreap patch. Use kernel_wait4 instead.
> - Fix issue where vhost.ko could be removed while the worker function is
> still running.
> V6:
> - Rename kernel_worker to user_worker and fix prefixes.
> - Add better patch descriptions.
> V5:
> - Handle kbuild errors by building patchset against current kernel that
> has all deps merged. Also add patch to remove create_io_thread code as
> it's not used anymore.
> - Rebase patchset against current kernel and handle a new vm PF_IO_WORKER
> case added in 5.16-rc1.
> - Add PF_USER_WORKER flag so we can check it later after the initial
> thread creation for the wake up, vm and signal cases.
> - Added patch to auto reap the worker thread.
> V4:
> - Drop NO_SIG patch and replaced with Christian's SIG_IGN patch.
> - Merged Christian's kernel_worker_flags_valid helpers into patch 5 that
> added the new kernel worker functions.
> - Fixed extra "i" issue.
> - Added PF_USER_WORKER flag and added check that kernel_worker_start users
> had that flag set. Also dropped patches that passed worker flags to
> copy_thread and replaced with PF_USER_WORKER check.
> V3:
> - Add parentheses in p->flag and work_flags check in copy_thread.
> - Fix check in arm/arm64 which was doing the reverse of other archs
> where it did likely(!flags) instead of unlikely(flags).
> V2:
> - Rename kernel_copy_process to kernel_worker.
> - Instead of exporting functions, make kernel_worker() a proper
> function/API that does common work for the caller.
> - Instead of adding new fields to kernel_clone_args for each option
> make it flag based similar to CLONE_*.
> - Drop unused completion struct in vhost.
> - Fix compile warnings by merging vhost cgroup cleanup patch and
> vhost conversion patch.