The following patches were made over Linus's tree and apply over next. They
allow the vhost layer to use copy_process instead of using
workqueue_structs to create worker threads for VM's devices.
Details:
Qemu will create vhost devices in the kernel which perform network or SCSI,
IO and perform management operations from worker threads created with the
kthread API. Because the kthread API does a copy_process on the kthreadd
thread, the vhost layer has to use kthread_use_mm to access the Qemu
thread's memory and cgroup_attach_task_all to add itself to the Qemu
thread's cgroups.
The patches allow the vhost layer to do a copy_process from the thread that
does the VHOST_SET_OWNER ioctl like how io_uring does a copy_process against
its userspace thread. This allows the vhost layer's worker threads to inherit
cgroups, namespaces, address space, etc. This worker thread will also be
accounted for against that owner/parent process's RLIMIT_NPROC limit which
will prevent malicious users from creating VMs with almost unlimited threads
when these patches are used:
https://lore.kernel.org/all/[email protected]/
which allow us to create a worker thread per N virtqueues.
V12:
- Change how new fields were added to kernel_clone_args so they don't
unnecessarily expand the size of the struct.
- Use named bitfields and make kthread and io_thread work similarly as
the new fields.
- Allow copy_process users to pass in the name of the new task and
convert kthreads and vhost_tasks.
V11:
- Rebase.
V10:
- Eric's cleanup patches my vhost flush cleanup patches are merged
upstream, so rebase against Linus's tree which has everything.
V9:
- Rebase against Eric's kthread-cleanups-for-v5.19 branch. Drop patches
no longer needed due to kernel clone arg and pf io worker patches in that
branch.
V8:
- Fix kzalloc GFP use.
- Fix email subject version number.
V7:
- Drop generic user_worker_* helpers and replace with vhost_task specific
ones.
- Drop autoreap patch. Use kernel_wait4 instead.
- Fix issue where vhost.ko could be removed while the worker function is
still running.
V6:
- Rename kernel_worker to user_worker and fix prefixes.
- Add better patch descriptions.
V5:
- Handle kbuild errors by building patchset against current kernel that
has all deps merged. Also add patch to remove create_io_thread code as
it's not used anymore.
- Rebase patchset against current kernel and handle a new vm PF_IO_WORKER
case added in 5.16-rc1.
- Add PF_USER_WORKER flag so we can check it later after the initial
thread creation for the wake up, vm and singal cses.
- Added patch to auto reap the worker thread.
V4:
- Drop NO_SIG patch and replaced with Christian's SIG_IGN patch.
- Merged Christian's kernel_worker_flags_valid helpers into patch 5 that
added the new kernel worker functions.
- Fixed extra "i" issue.
- Added PF_USER_WORKER flag and added check that kernel_worker_start users
had that flag set. Also dropped patches that passed worker flags to
copy_thread and replaced with PF_USER_WORKER check.
V3:
- Add parentheses in p->flag and work_flags check in copy_thread.
- Fix check in arm/arm64 which was doing the reverse of other archs
where it did likely(!flags) instead of unlikely(flags).
V2:
- Rename kernel_copy_process to kernel_worker.
- Instead of exporting functions, make kernel_worker() a proper
function/API that does common work for the caller.
- Instead of adding new fields to kernel_clone_args for each option
make it flag based similar to CLONE_*.
- Drop unused completion struct in vhost.
- Fix compile warnings by merging vhost cgroup cleanup patch and
vhost conversion patch.
Remove csky's kernel_thread declaration because it's not needed.
Signed-off-by: Mike Christie <[email protected]>
---
arch/csky/include/asm/processor.h | 2 --
1 file changed, 2 deletions(-)
diff --git a/arch/csky/include/asm/processor.h b/arch/csky/include/asm/processor.h
index ea75d72dea86..e487a46d1c37 100644
--- a/arch/csky/include/asm/processor.h
+++ b/arch/csky/include/asm/processor.h
@@ -72,8 +72,6 @@ struct task_struct;
/* Prepare to copy thread state - unlazy all lazy status */
#define prepare_to_copy(tsk) do { } while (0)
-extern int kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
-
unsigned long __get_wchan(struct task_struct *p);
#define KSTK_EIP(tsk) (task_pt_regs(tsk)->pc)
--
2.25.1
The next patch adds helpers like create_io_thread, but for use by the
vhost layer. There are several functions, so they are in their own file
instead of cluttering up fork.c. This patch allows that new file to
call copy_process.
Signed-off-by: Mike Christie <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
---
include/linux/sched/task.h | 2 ++
kernel/fork.c | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 00c54bfac0b5..537cbf9a2ade 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -93,6 +93,8 @@ extern void exit_files(struct task_struct *);
extern void exit_itimers(struct task_struct *);
extern pid_t kernel_clone(struct kernel_clone_args *kargs);
+struct task_struct *copy_process(struct pid *pid, int trace, int node,
+ struct kernel_clone_args *args);
struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
struct task_struct *fork_idle(int);
extern pid_t kernel_thread(int (*fn)(void *), void *arg, const char *name,
diff --git a/kernel/fork.c b/kernel/fork.c
index 244aae6c2395..8faf9d0adb3b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2013,7 +2013,7 @@ static void rv_task_fork(struct task_struct *p)
* parts of the process environment (as per the clone
* flags). The actual kick-off is left to the caller.
*/
-static __latent_entropy struct task_struct *copy_process(
+__latent_entropy struct task_struct *copy_process(
struct pid *pid,
int trace,
int node,
--
2.25.1
Each vhost device gets a thread that is used to perform IO and management
operations. Instead of a thread that is accessing a device, the thread is
part of the device, so when it creates a thread using a helper based on
copy_process we can't dup or clone the parent's files/FDS because it
would do an extra increment on ourself.
Later, when we do:
Qemu process exits:
do_exit -> exit_files -> put_files_struct -> close_files
we would leak the device's resources because of that extra refcount
on the fd or file_struct.
This patch adds a no_files option so these worker threads can prevent
taking an extra refcount on themselves.
Signed-off-by: Mike Christie <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
---
include/linux/sched/task.h | 1 +
kernel/fork.c | 10 ++++++++--
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 2950e83d5382..4f816048794f 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -28,6 +28,7 @@ struct kernel_clone_args {
u32 kthread:1;
u32 io_thread:1;
u32 user_worker:1;
+ u32 no_files:1;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index 0dec38276363..258163ea5cd2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1624,7 +1624,8 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
return 0;
}
-static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
+static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
+ int no_files)
{
struct files_struct *oldf, *newf;
int error = 0;
@@ -1636,6 +1637,11 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
if (!oldf)
goto out;
+ if (no_files) {
+ tsk->files = NULL;
+ goto out;
+ }
+
if (clone_flags & CLONE_FILES) {
atomic_inc(&oldf->count);
goto out;
@@ -2256,7 +2262,7 @@ static __latent_entropy struct task_struct *copy_process(
retval = copy_semundo(clone_flags, p);
if (retval)
goto bad_fork_cleanup_security;
- retval = copy_files(clone_flags, p);
+ retval = copy_files(clone_flags, p, args->no_files);
if (retval)
goto bad_fork_cleanup_semundo;
retval = copy_fs(clone_flags, p);
--
2.25.1
Qemu will create vhost devices in the kernel which perform network, SCSI,
etc IO and management operations from worker threads created by the
kthread API. Because the kthread API does a copy_process on the kthreadd
thread, the vhost layer has to use kthread_use_mm to access the Qemu
thread's memory and cgroup_attach_task_all to add itself to the Qemu
thread's cgroups, and it bypasses the RLIMIT_NPROC limit which can result
in VMs creating more threads than the admin expected.
This patch adds a new struct vhost_task which can be used instead of
kthreads. They allow the vhost layer to use copy_process and inherit
the userspace process's mm and cgroups, the task is accounted for
under the userspace's nproc count and can be seen in its process tree,
and other features like namespaces work and are inherited by default.
Signed-off-by: Mike Christie <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
---
MAINTAINERS | 2 +
drivers/vhost/Kconfig | 5 ++
include/linux/sched/vhost_task.h | 23 ++++++
kernel/Makefile | 1 +
kernel/vhost_task.c | 117 +++++++++++++++++++++++++++++++
5 files changed, 148 insertions(+)
create mode 100644 include/linux/sched/vhost_task.h
create mode 100644 kernel/vhost_task.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 39ff1a717625..f379ec9209ab 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22130,7 +22130,9 @@ L: [email protected]
L: [email protected]
S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
+F: kernel/vhost_task.c
F: drivers/vhost/
+F: include/linux/sched/vhost_task.h
F: include/linux/vhost_iotlb.h
F: include/uapi/linux/vhost.h
diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 587fbae06182..b455d9ab6f3d 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -13,9 +13,14 @@ config VHOST_RING
This option is selected by any driver which needs to access
the host side of a virtio ring.
+config VHOST_TASK
+ bool
+ default n
+
config VHOST
tristate
select VHOST_IOTLB
+ select VHOST_TASK
help
This option is selected by any driver which needs to access
the core of vhost.
diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
new file mode 100644
index 000000000000..6123c10b99cf
--- /dev/null
+++ b/include/linux/sched/vhost_task.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_VHOST_TASK_H
+#define _LINUX_VHOST_TASK_H
+
+#include <linux/completion.h>
+
+struct task_struct;
+
+struct vhost_task {
+ int (*fn)(void *data);
+ void *data;
+ struct completion exited;
+ unsigned long flags;
+ struct task_struct *task;
+};
+
+struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
+ const char *name);
+void vhost_task_start(struct vhost_task *vtsk);
+void vhost_task_stop(struct vhost_task *vtsk);
+bool vhost_task_should_stop(struct vhost_task *vtsk);
+
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 10ef068f598d..6fc72b3afbde 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -15,6 +15,7 @@ obj-y = fork.o exec_domain.o panic.o \
obj-$(CONFIG_USERMODE_DRIVER) += usermode_driver.o
obj-$(CONFIG_MODULES) += kmod.o
obj-$(CONFIG_MULTIUSER) += groups.o
+obj-$(CONFIG_VHOST_TASK) += vhost_task.o
ifdef CONFIG_FUNCTION_TRACER
# Do not trace internal ftrace files
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
new file mode 100644
index 000000000000..4b8aff160640
--- /dev/null
+++ b/kernel/vhost_task.c
@@ -0,0 +1,117 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2021 Oracle Corporation
+ */
+#include <linux/slab.h>
+#include <linux/completion.h>
+#include <linux/sched/task.h>
+#include <linux/sched/vhost_task.h>
+#include <linux/sched/signal.h>
+
+enum vhost_task_flags {
+ VHOST_TASK_FLAGS_STOP,
+};
+
+static int vhost_task_fn(void *data)
+{
+ struct vhost_task *vtsk = data;
+ int ret;
+
+ ret = vtsk->fn(vtsk->data);
+ complete(&vtsk->exited);
+ do_exit(ret);
+}
+
+/**
+ * vhost_task_stop - stop a vhost_task
+ * @vtsk: vhost_task to stop
+ *
+ * Callers must call vhost_task_should_stop and return from their worker
+ * function when it returns true;
+ */
+void vhost_task_stop(struct vhost_task *vtsk)
+{
+ pid_t pid = vtsk->task->pid;
+
+ set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
+ wake_up_process(vtsk->task);
+ /*
+ * Make sure vhost_task_fn is no longer accessing the vhost_task before
+ * freeing it below. If userspace crashed or exited without closing,
+ * then the vhost_task->task could already be marked dead so
+ * kernel_wait will return early.
+ */
+ wait_for_completion(&vtsk->exited);
+ /*
+ * If we are just closing/removing a device and the parent process is
+ * not exiting then reap the task.
+ */
+ kernel_wait4(pid, NULL, __WCLONE, NULL);
+ kfree(vtsk);
+}
+EXPORT_SYMBOL_GPL(vhost_task_stop);
+
+/**
+ * vhost_task_should_stop - should the vhost task return from the work function
+ * @vtsk: vhost_task to stop
+ */
+bool vhost_task_should_stop(struct vhost_task *vtsk)
+{
+ return test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
+}
+EXPORT_SYMBOL_GPL(vhost_task_should_stop);
+
+/**
+ * vhost_task_create - create a copy of a process to be used by the kernel
+ * @fn: thread stack
+ * @arg: data to be passed to fn
+ * @name: the thread's name
+ *
+ * This returns a specialized task for use by the vhost layer or NULL on
+ * failure. The returned task is inactive, and the caller must fire it up
+ * through vhost_task_start().
+ */
+struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
+ const char *name)
+{
+ struct kernel_clone_args args = {
+ .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+ .exit_signal = 0,
+ .fn = vhost_task_fn,
+ .name = name,
+ .user_worker = 1,
+ .no_files = 1,
+ .ignore_signals = 1,
+ };
+ struct vhost_task *vtsk;
+ struct task_struct *tsk;
+
+ vtsk = kzalloc(sizeof(*vtsk), GFP_KERNEL);
+ if (!vtsk)
+ return ERR_PTR(-ENOMEM);
+ init_completion(&vtsk->exited);
+ vtsk->data = arg;
+ vtsk->fn = fn;
+
+ args.fn_arg = vtsk;
+
+ tsk = copy_process(NULL, 0, NUMA_NO_NODE, &args);
+ if (IS_ERR(tsk)) {
+ kfree(vtsk);
+ return NULL;
+ }
+
+ vtsk->task = tsk;
+ return vtsk;
+}
+EXPORT_SYMBOL_GPL(vhost_task_create);
+
+/**
+ * vhost_task_start - start a vhost_task created with vhost_task_create
+ * @vtsk: vhost_task to wake up
+ */
+void vhost_task_start(struct vhost_task *vtsk)
+{
+ wake_up_new_task(vtsk->task);
+}
+EXPORT_SYMBOL_GPL(vhost_task_start);
--
2.25.1
Since:
commit 10ab825bdef8 ("change kernel threads to ignore signals instead of
blocking them")
kthreads have been ignoring signals by default, and the vhost layer has
never had a need to change that. This patch adds an option flag,
USER_WORKER_SIG_IGN, handled in copy_process() after copy_sighand()
and copy_signals() so vhost_tasks added in the next patches can continue
to ignore singals.
Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Mike Christie <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
---
include/linux/sched/task.h | 1 +
kernel/fork.c | 3 +++
2 files changed, 4 insertions(+)
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 4f816048794f..00c54bfac0b5 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,6 +29,7 @@ struct kernel_clone_args {
u32 io_thread:1;
u32 user_worker:1;
u32 no_files:1;
+ u32 ignore_signals:1;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index 258163ea5cd2..244aae6c2395 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2287,6 +2287,9 @@ static __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;
+ if (args->ignore_signals)
+ ignore_signals(p);
+
stackleak_task_init(p);
if (pid != &init_struct_pid) {
--
2.25.1
This is just a prep patch. It moves the worker related fields to a new
vhost_worker struct and moves the code around to create some helpers that
will be used in the next patch.
Signed-off-by: Mike Christie <[email protected]>
Reviewed-by: Stefan Hajnoczi <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/vhost/vhost.c | 98 ++++++++++++++++++++++++++++---------------
drivers/vhost/vhost.h | 11 +++--
2 files changed, 72 insertions(+), 37 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 43c9770b86e5..60282fe5c338 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -255,8 +255,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
* sure it was not in the list.
* test_and_set_bit() implies a memory barrier.
*/
- llist_add(&work->node, &dev->work_list);
- wake_up_process(dev->worker);
+ llist_add(&work->node, &dev->worker->work_list);
+ wake_up_process(dev->worker->task);
}
}
EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -264,7 +264,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue);
/* A lockless hint for busy polling code to exit the loop */
bool vhost_has_work(struct vhost_dev *dev)
{
- return !llist_empty(&dev->work_list);
+ return dev->worker && !llist_empty(&dev->worker->work_list);
}
EXPORT_SYMBOL_GPL(vhost_has_work);
@@ -335,7 +335,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
static int vhost_worker(void *data)
{
- struct vhost_dev *dev = data;
+ struct vhost_worker *worker = data;
+ struct vhost_dev *dev = worker->dev;
struct vhost_work *work, *work_next;
struct llist_node *node;
@@ -350,7 +351,7 @@ static int vhost_worker(void *data)
break;
}
- node = llist_del_all(&dev->work_list);
+ node = llist_del_all(&worker->work_list);
if (!node)
schedule();
@@ -360,7 +361,7 @@ static int vhost_worker(void *data)
llist_for_each_entry_safe(work, work_next, node, node) {
clear_bit(VHOST_WORK_QUEUED, &work->flags);
__set_current_state(TASK_RUNNING);
- kcov_remote_start_common(dev->kcov_handle);
+ kcov_remote_start_common(worker->kcov_handle);
work->fn(work);
kcov_remote_stop();
if (need_resched())
@@ -479,7 +480,6 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->byte_weight = byte_weight;
dev->use_worker = use_worker;
dev->msg_handler = msg_handler;
- init_llist_head(&dev->work_list);
init_waitqueue_head(&dev->wait);
INIT_LIST_HEAD(&dev->read_list);
INIT_LIST_HEAD(&dev->pending_list);
@@ -571,10 +571,60 @@ static void vhost_detach_mm(struct vhost_dev *dev)
dev->mm = NULL;
}
+static void vhost_worker_free(struct vhost_dev *dev)
+{
+ struct vhost_worker *worker = dev->worker;
+
+ if (!worker)
+ return;
+
+ dev->worker = NULL;
+ WARN_ON(!llist_empty(&worker->work_list));
+ kthread_stop(worker->task);
+ kfree(worker);
+}
+
+static int vhost_worker_create(struct vhost_dev *dev)
+{
+ struct vhost_worker *worker;
+ struct task_struct *task;
+ int ret;
+
+ worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
+ if (!worker)
+ return -ENOMEM;
+
+ dev->worker = worker;
+ worker->dev = dev;
+ worker->kcov_handle = kcov_common_handle();
+ init_llist_head(&worker->work_list);
+
+ task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
+ if (IS_ERR(task)) {
+ ret = PTR_ERR(task);
+ goto free_worker;
+ }
+
+ worker->task = task;
+ wake_up_process(task); /* avoid contributing to loadavg */
+
+ ret = vhost_attach_cgroups(dev);
+ if (ret)
+ goto stop_worker;
+
+ return 0;
+
+stop_worker:
+ kthread_stop(worker->task);
+free_worker:
+ kfree(worker);
+ dev->worker = NULL;
+ return ret;
+}
+
/* Caller should have device mutex */
long vhost_dev_set_owner(struct vhost_dev *dev)
{
- struct task_struct *worker;
int err;
/* Is there an owner already? */
@@ -585,36 +635,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
vhost_attach_mm(dev);
- dev->kcov_handle = kcov_common_handle();
if (dev->use_worker) {
- worker = kthread_create(vhost_worker, dev,
- "vhost-%d", current->pid);
- if (IS_ERR(worker)) {
- err = PTR_ERR(worker);
- goto err_worker;
- }
-
- dev->worker = worker;
- wake_up_process(worker); /* avoid contributing to loadavg */
-
- err = vhost_attach_cgroups(dev);
+ err = vhost_worker_create(dev);
if (err)
- goto err_cgroup;
+ goto err_worker;
}
err = vhost_dev_alloc_iovecs(dev);
if (err)
- goto err_cgroup;
+ goto err_iovecs;
return 0;
-err_cgroup:
- if (dev->worker) {
- kthread_stop(dev->worker);
- dev->worker = NULL;
- }
+err_iovecs:
+ vhost_worker_free(dev);
err_worker:
vhost_detach_mm(dev);
- dev->kcov_handle = 0;
err_mm:
return err;
}
@@ -705,12 +740,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
dev->iotlb = NULL;
vhost_clear_msg(dev);
wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
- WARN_ON(!llist_empty(&dev->work_list));
- if (dev->worker) {
- kthread_stop(dev->worker);
- dev->worker = NULL;
- dev->kcov_handle = 0;
- }
+ vhost_worker_free(dev);
vhost_detach_mm(dev);
}
EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 790b296271f1..3a98c4f06b5a 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -25,6 +25,13 @@ struct vhost_work {
unsigned long flags;
};
+struct vhost_worker {
+ struct task_struct *task;
+ struct llist_head work_list;
+ struct vhost_dev *dev;
+ u64 kcov_handle;
+};
+
/* Poll a file (eventfd or socket) */
/* Note: there's nothing vhost specific about this structure. */
struct vhost_poll {
@@ -147,8 +154,7 @@ struct vhost_dev {
struct vhost_virtqueue **vqs;
int nvqs;
struct eventfd_ctx *log_ctx;
- struct llist_head work_list;
- struct task_struct *worker;
+ struct vhost_worker *worker;
struct vhost_iotlb *umem;
struct vhost_iotlb *iotlb;
spinlock_t iotlb_lock;
@@ -158,7 +164,6 @@ struct vhost_dev {
int iov_limit;
int weight;
int byte_weight;
- u64 kcov_handle;
bool use_worker;
int (*msg_handler)(struct vhost_dev *dev, u32 asid,
struct vhost_iotlb_msg *msg);
--
2.25.1
For vhost workers we use the kthread API which inherit's its values from
and checks against the kthreadd thread. This results in the wrong RLIMITs
being checked, so while tools like libvirt try to control the number of
threads based on the nproc rlimit setting we can end up creating more
threads than the user wanted.
This patch has us use the vhost_task helpers which will inherit its
values/checks from the thread that owns the device similar to if we did
a clone in userspace. The vhost threads will now be counted in the nproc
rlimits. And we get features like cgroups and mm sharing automatically,
so we can remove those calls.
Signed-off-by: Mike Christie <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
---
drivers/vhost/vhost.c | 60 ++++++++++---------------------------------
drivers/vhost/vhost.h | 4 +--
2 files changed, 15 insertions(+), 49 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 60282fe5c338..7282e0545cb2 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -22,11 +22,11 @@
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/kthread.h>
-#include <linux/cgroup.h>
#include <linux/module.h>
#include <linux/sort.h>
#include <linux/sched/mm.h>
#include <linux/sched/signal.h>
+#include <linux/sched/vhost_task.h>
#include <linux/interval_tree_generic.h>
#include <linux/nospec.h>
#include <linux/kcov.h>
@@ -256,7 +256,7 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
* test_and_set_bit() implies a memory barrier.
*/
llist_add(&work->node, &dev->worker->work_list);
- wake_up_process(dev->worker->task);
+ wake_up_process(dev->worker->vtsk->task);
}
}
EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -336,17 +336,14 @@ static void vhost_vq_reset(struct vhost_dev *dev,
static int vhost_worker(void *data)
{
struct vhost_worker *worker = data;
- struct vhost_dev *dev = worker->dev;
struct vhost_work *work, *work_next;
struct llist_node *node;
- kthread_use_mm(dev->mm);
-
for (;;) {
/* mb paired w/ kthread_stop */
set_current_state(TASK_INTERRUPTIBLE);
- if (kthread_should_stop()) {
+ if (vhost_task_should_stop(worker->vtsk)) {
__set_current_state(TASK_RUNNING);
break;
}
@@ -368,7 +365,7 @@ static int vhost_worker(void *data)
schedule();
}
}
- kthread_unuse_mm(dev->mm);
+
return 0;
}
@@ -509,31 +506,6 @@ long vhost_dev_check_owner(struct vhost_dev *dev)
}
EXPORT_SYMBOL_GPL(vhost_dev_check_owner);
-struct vhost_attach_cgroups_struct {
- struct vhost_work work;
- struct task_struct *owner;
- int ret;
-};
-
-static void vhost_attach_cgroups_work(struct vhost_work *work)
-{
- struct vhost_attach_cgroups_struct *s;
-
- s = container_of(work, struct vhost_attach_cgroups_struct, work);
- s->ret = cgroup_attach_task_all(s->owner, current);
-}
-
-static int vhost_attach_cgroups(struct vhost_dev *dev)
-{
- struct vhost_attach_cgroups_struct attach;
-
- attach.owner = current;
- vhost_work_init(&attach.work, vhost_attach_cgroups_work);
- vhost_work_queue(dev, &attach.work);
- vhost_dev_flush(dev);
- return attach.ret;
-}
-
/* Caller should have device mutex */
bool vhost_dev_has_owner(struct vhost_dev *dev)
{
@@ -580,14 +552,15 @@ static void vhost_worker_free(struct vhost_dev *dev)
dev->worker = NULL;
WARN_ON(!llist_empty(&worker->work_list));
- kthread_stop(worker->task);
+ vhost_task_stop(worker->vtsk);
kfree(worker);
}
static int vhost_worker_create(struct vhost_dev *dev)
{
struct vhost_worker *worker;
- struct task_struct *task;
+ struct vhost_task *vtsk;
+ char name[TASK_COMM_LEN];
int ret;
worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
@@ -595,27 +568,20 @@ static int vhost_worker_create(struct vhost_dev *dev)
return -ENOMEM;
dev->worker = worker;
- worker->dev = dev;
worker->kcov_handle = kcov_common_handle();
init_llist_head(&worker->work_list);
+ snprintf(name, sizeof(name), "vhost-%d", current->pid);
- task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
- if (IS_ERR(task)) {
- ret = PTR_ERR(task);
+ vtsk = vhost_task_create(vhost_worker, worker, name);
+ if (!vtsk) {
+ ret = -ENOMEM;
goto free_worker;
}
- worker->task = task;
- wake_up_process(task); /* avoid contributing to loadavg */
-
- ret = vhost_attach_cgroups(dev);
- if (ret)
- goto stop_worker;
-
+ worker->vtsk = vtsk;
+ vhost_task_start(vtsk);
return 0;
-stop_worker:
- kthread_stop(worker->task);
free_worker:
kfree(worker);
dev->worker = NULL;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 3a98c4f06b5a..ceb590aaf418 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -16,6 +16,7 @@
#include <linux/irqbypass.h>
struct vhost_work;
+struct vhost_task;
typedef void (*vhost_work_fn_t)(struct vhost_work *work);
#define VHOST_WORK_QUEUED 1
@@ -26,9 +27,8 @@ struct vhost_work {
};
struct vhost_worker {
- struct task_struct *task;
+ struct vhost_task *vtsk;
struct llist_head work_list;
- struct vhost_dev *dev;
u64 kcov_handle;
};
--
2.25.1
On Fri, Mar 10, 2023 at 2:04 PM Mike Christie
<[email protected]> wrote:
>
> The following patches were made over Linus's tree and apply over next. They
> allow the vhost layer to use copy_process instead of using
> workqueue_structs to create worker threads for VM's devices.
Ok, all these patches looked fine to me from a quick scan - nothing
that I reacted to as objectionable, and several of them looked like
nice cleanups.
The only one I went "Why do you do it that way" for was in 10/11
(entirely internal to vhost, so I don't feel too strongly about this)
how you made "struct vhost_worker" be a pointer in "struct vhost_dev".
It _looks_ to me like it could just have been an embedded structure
rather than a separate allocation.
IOW, why do
vhost_dev->worker
instead of doing
vhost_dev.worker
and just having it all in the same allocation?
Not a big deal. Maybe you wanted the 'test if worker pointer is NULL'
code to stay around, and basically use that pointer as a flag too. Or
maybe there is some other reason you want to keep that separate..
Linus
On 3/11/23 11:21 AM, Linus Torvalds wrote:
> On Fri, Mar 10, 2023 at 2:04 PM Mike Christie
> <[email protected]> wrote:
>>
>> The following patches were made over Linus's tree and apply over next. They
>> allow the vhost layer to use copy_process instead of using
>> workqueue_structs to create worker threads for VM's devices.
>
> Ok, all these patches looked fine to me from a quick scan - nothing
> that I reacted to as objectionable, and several of them looked like
> nice cleanups.
>
> The only one I went "Why do you do it that way" for was in 10/11
> (entirely internal to vhost, so I don't feel too strongly about this)
> how you made "struct vhost_worker" be a pointer in "struct vhost_dev".
>
> It _looks_ to me like it could just have been an embedded structure
> rather than a separate allocation.
>
> IOW, why do
>
> vhost_dev->worker
>
> instead of doing
>
> vhost_dev.worker
>
> and just having it all in the same allocation?
>
> Not a big deal. Maybe you wanted the 'test if worker pointer is NULL'
> code to stay around, and basically use that pointer as a flag too. Or
> maybe there is some other reason you want to keep that separate..
>
There were 2 reasons:
1. Yeah, we needed a flag to indicate that the worker was not setup
for the cases like where userspace just opens the dev then closes it
without doing the IOCTL that does vhost_dev_set_owner.
2. I could have handled #1 by embedding the worker in the vhost_dev
and then just testing worker.vtsk. However, I have a followup patchset
that allows us to create multiple worker threads per device. For
that patchset I then do:
- if (vhost_dev->worker)
+ if (vhost_dev->workers)
so I think it just saved me some typing.
On Sat, Mar 11, 2023 at 09:21:14AM -0800, Linus Torvalds wrote:
> On Fri, Mar 10, 2023 at 2:04 PM Mike Christie
> <[email protected]> wrote:
> >
> > The following patches were made over Linus's tree and apply over next. They
> > allow the vhost layer to use copy_process instead of using
> > workqueue_structs to create worker threads for VM's devices.
>
> Ok, all these patches looked fine to me from a quick scan - nothing
> that I reacted to as objectionable, and several of them looked like
> nice cleanups.
>
> The only one I went "Why do you do it that way" for was in 10/11
> (entirely internal to vhost, so I don't feel too strongly about this)
> how you made "struct vhost_worker" be a pointer in "struct vhost_dev".
>
> It _looks_ to me like it could just have been an embedded structure
> rather than a separate allocation.
>
> IOW, why do
>
> vhost_dev->worker
>
> instead of doing
>
> vhost_dev.worker
>
> and just having it all in the same allocation?
>
> Not a big deal. Maybe you wanted the 'test if worker pointer is NULL'
> code to stay around, and basically use that pointer as a flag too. Or
> maybe there is some other reason you want to keep that separate..
>
> Linus
I agree with Linus here, slightly better embedded, but no huge deal.
Which tree is this going on?
If not mine here's my ack:
Acked-by: Michael S. Tsirkin <[email protected]>
--
MST
From: Christian Brauner (Microsoft) <[email protected]>
On Fri, 10 Mar 2023 16:03:21 -0600, Mike Christie wrote:
> The following patches were made over Linus's tree and apply over next. They
> allow the vhost layer to use copy_process instead of using
> workqueue_structs to create worker threads for VM's devices.
>
> Details:
> Qemu will create vhost devices in the kernel which perform network or SCSI,
> IO and perform management operations from worker threads created with the
> kthread API. Because the kthread API does a copy_process on the kthreadd
> thread, the vhost layer has to use kthread_use_mm to access the Qemu
> thread's memory and cgroup_attach_task_all to add itself to the Qemu
> thread's cgroups.
>
> [...]
I've picked this up now and it should show up in -next tomorrow.
Thanks for the patience,
tree: [email protected]:pub/scm/linux/kernel/git/brauner/linux kernel.user_worker
[01/11] csky: Remove kernel_thread declaration
commit: e0a98139c162af9601ffa8d6db88dbe745f64b3c
[02/11] kernel: Allow a kernel thread's name to be set in copy_process
commit: cf587db2ee0261c74d04f61f39783db88a0b65e4
[03/11] kthread: Pass in the thread's name during creation
commit: 73e0c116594d99f807754b15e474690635a87249
[04/11] kernel: Make io_thread and kthread bit fields
commit: c81cc5819faf5dd77124f5086aa654482281ac37
[05/11] fork/vm: Move common PF_IO_WORKER behavior to new flag
commit: 54e6842d0775ba76db65cbe21311c3ca466e663d
[06/11] fork: add kernel_clone_args flag to not dup/clone files
commit: 11f3f500ec8a75c96087f3bed87aa2b1c5de7498
[07/11] fork: Add kernel_clone_args flag to ignore signals
commit: 094717586bf71ac20ae3b240d2654d826634b21e
[08/11] fork: allow kernel code to call copy_process
commit: 89c8e98d8cfb0656dbeb648572df5b13e372247d
[09/11] vhost_task: Allow vhost layer to use copy_process
commit: 77feab3c4156229511b378f15c0a16574d9c08ea
[10/11] vhost: move worker thread fields to new struct
commit: d45e2b73ead0daec350680fbf20144a2d3670186
[11/11] vhost: use vhost_tasks for worker threads
commit: 5ab18f4b061ef24a71eea9ffafebd1a82ae2f514