2023-02-02 23:25:51

by Mike Christie

Subject: [PATCH v11 0/8] Use copy_process in vhost layer

The following patches were made over Linus's tree. They allow the vhost
layer to use copy_process instead of using workqueue_structs to create
worker threads for VM's devices.

Eric, Michael Tsirkin, the vhost maintainer, has ACK'd the patches, so
we are just waiting on you. I haven't received any more comments after
several postings, and the last reply from you was a year ago, on Jan 8th
*2022*. Are you ok with these patches, and can we merge them?

Details:
Qemu will create vhost devices in the kernel which perform network, SCSI,
etc IO and management operations from worker threads created with the
kthread API. Because the kthread API does a copy_process on the kthreadd
thread, the vhost layer has to use kthread_use_mm to access the Qemu
thread's memory and cgroup_attach_task_all to add itself to the Qemu
thread's cgroups.
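
For reference, the kthread-based pattern that this series replaces looks
roughly like this (a simplified sketch based on the current vhost code in
patches 7-8 below; error handling omitted):

	/* in the VHOST_SET_OWNER ioctl path today: */
	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
	wake_up_process(worker);
	vhost_attach_cgroups(dev);	/* manually join the owner's cgroups */

	/* and in the worker thread itself: */
	kthread_use_mm(dev->mm);	/* manually borrow the owner's mm */
	...
	kthread_unuse_mm(dev->mm);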

The patches allow the vhost layer to do a copy_process from the thread that
does the VHOST_SET_OWNER ioctl like how io_uring does a copy_process against
its userspace thread. This allows the vhost layer's worker threads to inherit
cgroups, namespaces, address space, etc. This worker thread will also be
accounted for against that owner/parent process's RLIMIT_NPROC limit which
will prevent malicious users from creating VMs with almost unlimited threads
when these patches are used:

https://lore.kernel.org/all/[email protected]/

which allow us to create a worker thread per N virtqueues.

V11:
- Rebase.
V10:
- Eric's cleanup patches and my vhost flush cleanup patches are merged
upstream, so rebase against Linus's tree which has everything.
V9:
- Rebase against Eric's kthread-cleanups-for-v5.19 branch. Drop patches
no longer needed due to kernel clone arg and pf io worker patches in that
branch.
V8:
- Fix kzalloc GFP use.
- Fix email subject version number.
V7:
- Drop generic user_worker_* helpers and replace with vhost_task specific
ones.
- Drop autoreap patch. Use kernel_wait4 instead.
- Fix issue where vhost.ko could be removed while the worker function is
still running.
V6:
- Rename kernel_worker to user_worker and fix prefixes.
- Add better patch descriptions.
V5:
- Handle kbuild errors by building patchset against current kernel that
has all deps merged. Also add patch to remove create_io_thread code as
it's not used anymore.
- Rebase patchset against current kernel and handle a new vm PF_IO_WORKER
case added in 5.16-rc1.
- Add PF_USER_WORKER flag so we can check it later after the initial
thread creation for the wake up, vm and signal cases.
- Added patch to auto reap the worker thread.
V4:
- Drop NO_SIG patch and replace it with Christian's SIG_IGN patch.
- Merged Christian's kernel_worker_flags_valid helpers into patch 5 that
added the new kernel worker functions.
- Fixed extra "i" issue.
- Added PF_USER_WORKER flag and added check that kernel_worker_start users
had that flag set. Also dropped patches that passed worker flags to
copy_thread and replaced with PF_USER_WORKER check.
V3:
- Add parentheses in p->flags and work_flags check in copy_thread.
- Fix check in arm/arm64 which was doing the reverse of other archs
where it did likely(!flags) instead of unlikely(flags).
V2:
- Rename kernel_copy_process to kernel_worker.






2023-02-02 23:25:55

by Mike Christie

Subject: [PATCH v11 4/8] fork: Add USER_WORKER flag to ignore signals

From: Christian Brauner <[email protected]>

Since:

commit 10ab825bdef8 ("change kernel threads to ignore signals instead of
blocking them")

kthreads have been ignoring signals by default, and the vhost layer has
never had a need to change that. This patch adds an option flag,
USER_WORKER_SIG_IGN, handled in copy_process() after copy_sighand()
and copy_signals(), so the vhost_tasks added in the next patches can
continue to ignore signals.

Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Mike Christie <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
---
include/linux/sched/task.h | 1 +
kernel/fork.c | 3 +++
2 files changed, 4 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 18e614591c24..ce6240a006cf 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -21,6 +21,7 @@ struct css_set;
 #define USER_WORKER		BIT(0)
 #define USER_WORKER_IO		BIT(1)
 #define USER_WORKER_NO_FILES	BIT(2)
+#define USER_WORKER_SIG_IGN	BIT(3)
 
 struct kernel_clone_args {
 	u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index bb98b48bc35c..55c77de45271 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2287,6 +2287,9 @@ static __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_io;
 
+	if (args->worker_flags & USER_WORKER_SIG_IGN)
+		ignore_signals(p);
+
 	stackleak_task_init(p);
 
 	if (pid != &init_struct_pid) {
--
2.25.1


2023-02-02 23:25:59

by Mike Christie

Subject: [PATCH v11 2/8] fork/vm: Move common PF_IO_WORKER behavior to new flag

This adds a new flag, PF_USER_WORKER, that's used for behavior common
to both PF_IO_WORKER and users like vhost which will use a new helper
instead of create_io_thread because they require different behavior for
operations like signal handling.

The common behavior PF_USER_WORKER covers is the vm reclaim handling.

Signed-off-by: Mike Christie <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
---
include/linux/sched.h | 2 +-
include/linux/sched/task.h | 3 ++-
kernel/fork.c | 4 ++++
mm/vmscan.c | 4 ++--
4 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 853d08f7562b..2ca9269332c1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1723,7 +1723,7 @@ extern struct pid *cad_pid;
 #define PF_MEMALLOC		0x00000800	/* Allocating memory */
 #define PF_NPROC_EXCEEDED	0x00001000	/* set_user() noticed that RLIMIT_NPROC was exceeded */
 #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
-#define PF__HOLE__00004000	0x00004000
+#define PF_USER_WORKER		0x00004000	/* Kernel thread cloned from userspace thread */
 #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
 #define PF__HOLE__00010000	0x00010000
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index a759ce5aa603..dfc585e0373c 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -18,7 +18,8 @@ struct css_set;
 /* All the bits taken by the old clone syscall. */
 #define CLONE_LEGACY_FLAGS 0xffffffffULL
 
-#define USER_WORKER_IO		BIT(0)
+#define USER_WORKER		BIT(0)
+#define USER_WORKER_IO		BIT(1)
 
 struct kernel_clone_args {
 	u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index b030aefba26c..77d2c527e917 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2100,6 +2100,10 @@ static __latent_entropy struct task_struct *copy_process(
 	p->flags &= ~PF_KTHREAD;
 	if (args->kthread)
 		p->flags |= PF_KTHREAD;
+
+	if (args->worker_flags & USER_WORKER)
+		p->flags |= PF_USER_WORKER;
+
 	if (args->worker_flags & USER_WORKER_IO) {
 		/*
 		 * Mark us an IO worker, and block any signal that isn't
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd6637fcd8f9..54de4adb91cf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1141,12 +1141,12 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
 	DEFINE_WAIT(wait);
 
 	/*
-	 * Do not throttle IO workers, kthreads other than kswapd or
+	 * Do not throttle user workers, kthreads other than kswapd or
 	 * workqueues. They may be required for reclaim to make
 	 * forward progress (e.g. journalling workqueues or kthreads).
 	 */
 	if (!current_is_kswapd() &&
-	    current->flags & (PF_IO_WORKER|PF_KTHREAD)) {
+	    current->flags & (PF_USER_WORKER|PF_KTHREAD)) {
 		cond_resched();
 		return;
 	}
--
2.25.1


2023-02-02 23:26:03

by Mike Christie

Subject: [PATCH v11 3/8] fork: add USER_WORKER flag to not dup/clone files

Each vhost device gets a thread that is used to perform IO and management
operations. Instead of being a thread that accesses a device, the thread is
part of the device, so when it is created with a helper based on
copy_process we can't dup or clone the parent's files/FDs, because that
would take an extra reference on ourselves.

Later, when the Qemu process exits:

do_exit -> exit_files -> put_files_struct -> close_files

we would leak the device's resources because of that extra refcount
on the fd or files_struct.

This patch adds a no_files option so these worker threads can avoid
taking an extra refcount on themselves.

Signed-off-by: Mike Christie <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
---
include/linux/sched/task.h | 1 +
kernel/fork.c | 11 +++++++++--
2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index dfc585e0373c..18e614591c24 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -20,6 +20,7 @@ struct css_set;

#define USER_WORKER BIT(0)
#define USER_WORKER_IO BIT(1)
+#define USER_WORKER_NO_FILES BIT(2)

struct kernel_clone_args {
u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index 77d2c527e917..bb98b48bc35c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1624,7 +1624,8 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
 	return 0;
 }
 
-static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
+static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
+		      int no_files)
 {
 	struct files_struct *oldf, *newf;
 	int error = 0;
@@ -1636,6 +1637,11 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
 	if (!oldf)
 		goto out;
 
+	if (no_files) {
+		tsk->files = NULL;
+		goto out;
+	}
+
 	if (clone_flags & CLONE_FILES) {
 		atomic_inc(&oldf->count);
 		goto out;
@@ -2255,7 +2261,8 @@ static __latent_entropy struct task_struct *copy_process(
 	retval = copy_semundo(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_security;
-	retval = copy_files(clone_flags, p);
+	retval = copy_files(clone_flags, p,
+			    args->worker_flags & USER_WORKER_NO_FILES);
 	if (retval)
 		goto bad_fork_cleanup_semundo;
 	retval = copy_fs(clone_flags, p);
--
2.25.1


2023-02-02 23:26:06

by Mike Christie

Subject: [PATCH v11 1/8] fork: Make IO worker options flag based

This patchset adds a couple of new options to kernel_clone_args for the
vhost layer, which is going to work like PF_IO_WORKER but will differ
enough that we will need to add several fields to kernel_clone_args. This
patch moves us to a flags-based approach for these types of users.

Signed-off-by: Mike Christie <[email protected]>
Suggested-by: Christian Brauner <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
---
include/linux/sched/task.h | 4 +++-
kernel/fork.c | 4 ++--
2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 357e0068497c..a759ce5aa603 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -18,8 +18,11 @@ struct css_set;
 /* All the bits taken by the old clone syscall. */
 #define CLONE_LEGACY_FLAGS 0xffffffffULL
 
+#define USER_WORKER_IO		BIT(0)
+
 struct kernel_clone_args {
 	u64 flags;
+	u32 worker_flags;
 	int __user *pidfd;
 	int __user *child_tid;
 	int __user *parent_tid;
@@ -31,7 +34,6 @@ struct kernel_clone_args {
 	/* Number of elements in *set_tid */
 	size_t set_tid_size;
 	int cgroup;
-	int io_thread;
 	int kthread;
 	int idle;
 	int (*fn)(void *);
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..b030aefba26c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2100,7 +2100,7 @@ static __latent_entropy struct task_struct *copy_process(
 	p->flags &= ~PF_KTHREAD;
 	if (args->kthread)
 		p->flags |= PF_KTHREAD;
-	if (args->io_thread) {
+	if (args->worker_flags & USER_WORKER_IO) {
 		/*
 		 * Mark us an IO worker, and block any signal that isn't
 		 * fatal or STOP
@@ -2623,7 +2623,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
 		.fn		= fn,
 		.fn_arg		= arg,
-		.io_thread	= 1,
+		.worker_flags	= USER_WORKER_IO,
 	};
 
 	return copy_process(NULL, 0, node, &args);
--
2.25.1


2023-02-02 23:26:10

by Mike Christie

Subject: [PATCH v11 6/8] vhost_task: Allow vhost layer to use copy_process

Qemu will create vhost devices in the kernel which perform network, SCSI,
etc IO and management operations from worker threads created by the
kthread API. Because the kthread API does a copy_process on the kthreadd
thread, the vhost layer has to use kthread_use_mm to access the Qemu
thread's memory and cgroup_attach_task_all to add itself to the Qemu
thread's cgroups, and it bypasses the RLIMIT_NPROC limit which can result
in VMs creating more threads than the admin expected.

This patch adds a new struct vhost_task which can be used instead of
kthreads. It allows the vhost layer to use copy_process and inherit
the userspace process's mm and cgroups; the task is accounted for
under the userspace process's nproc count and can be seen in its process
tree, and other features like namespaces work and are inherited by default.
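
A minimal usage sketch of the API added here (my_worker_fn and my_data are
placeholder names; the real consumer is wired up in the last patch of this
series):

	struct vhost_task *vtsk;

	vtsk = vhost_task_create(my_worker_fn, my_data, NUMA_NO_NODE);
	if (!vtsk)
		return -ENOMEM;
	vhost_task_start(vtsk, "my-worker-%d", current->pid);

	/*
	 * my_worker_fn() polls vhost_task_should_stop() and returns when
	 * it reads true. Teardown then wakes and reaps the task:
	 */
	vhost_task_stop(vtsk);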

Signed-off-by: Mike Christie <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
---
MAINTAINERS | 2 +
drivers/vhost/Kconfig | 5 ++
include/linux/sched/vhost_task.h | 23 ++++++
kernel/Makefile | 1 +
kernel/vhost_task.c | 122 +++++++++++++++++++++++++++++++
5 files changed, 153 insertions(+)
create mode 100644 include/linux/sched/vhost_task.h
create mode 100644 kernel/vhost_task.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 8a5c25c20d00..5f7a3b3af7aa 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22125,7 +22125,9 @@ L:	[email protected]
 L:	[email protected]
 S:	Maintained
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
+F:	kernel/vhost_task.c
 F:	drivers/vhost/
+F:	include/linux/sched/vhost_task.h
 F:	include/linux/vhost_iotlb.h
 F:	include/uapi/linux/vhost.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 587fbae06182..b455d9ab6f3d 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -13,9 +13,14 @@ config VHOST_RING
 	  This option is selected by any driver which needs to access
 	  the host side of a virtio ring.
 
+config VHOST_TASK
+	bool
+	default n
+
 config VHOST
 	tristate
 	select VHOST_IOTLB
+	select VHOST_TASK
 	help
 	  This option is selected by any driver which needs to access
 	  the core of vhost.
diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
new file mode 100644
index 000000000000..50d02a25d37b
--- /dev/null
+++ b/include/linux/sched/vhost_task.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_VHOST_TASK_H
+#define _LINUX_VHOST_TASK_H
+
+#include <linux/completion.h>
+
+struct task_struct;
+
+struct vhost_task {
+	int (*fn)(void *data);
+	void *data;
+	struct completion exited;
+	unsigned long flags;
+	struct task_struct *task;
+};
+
+struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, int node);
+__printf(2, 3)
+void vhost_task_start(struct vhost_task *vtsk, const char namefmt[], ...);
+void vhost_task_stop(struct vhost_task *vtsk);
+bool vhost_task_should_stop(struct vhost_task *vtsk);
+
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 10ef068f598d..6fc72b3afbde 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -15,6 +15,7 @@ obj-y     = fork.o exec_domain.o panic.o \
 obj-$(CONFIG_USERMODE_DRIVER) += usermode_driver.o
 obj-$(CONFIG_MODULES) += kmod.o
 obj-$(CONFIG_MULTIUSER) += groups.o
+obj-$(CONFIG_VHOST_TASK) += vhost_task.o

ifdef CONFIG_FUNCTION_TRACER
# Do not trace internal ftrace files
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
new file mode 100644
index 000000000000..517dd166bb2b
--- /dev/null
+++ b/kernel/vhost_task.c
@@ -0,0 +1,122 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2021 Oracle Corporation
+ */
+#include <linux/slab.h>
+#include <linux/completion.h>
+#include <linux/sched/task.h>
+#include <linux/sched/vhost_task.h>
+#include <linux/sched/signal.h>
+
+enum vhost_task_flags {
+	VHOST_TASK_FLAGS_STOP,
+};
+
+static int vhost_task_fn(void *data)
+{
+	struct vhost_task *vtsk = data;
+	int ret;
+
+	ret = vtsk->fn(vtsk->data);
+	complete(&vtsk->exited);
+	do_exit(ret);
+}
+
+/**
+ * vhost_task_stop - stop a vhost_task
+ * @vtsk: vhost_task to stop
+ *
+ * Callers must call vhost_task_should_stop and return from their worker
+ * function when it returns true.
+ */
+void vhost_task_stop(struct vhost_task *vtsk)
+{
+	pid_t pid = vtsk->task->pid;
+
+	set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
+	wake_up_process(vtsk->task);
+	/*
+	 * Make sure vhost_task_fn is no longer accessing the vhost_task before
+	 * freeing it below. If userspace crashed or exited without closing,
+	 * then the vhost_task->task could already be marked dead so
+	 * kernel_wait4 will return early.
+	 */
+	wait_for_completion(&vtsk->exited);
+	/*
+	 * If we are just closing/removing a device and the parent process is
+	 * not exiting then reap the task.
+	 */
+	kernel_wait4(pid, NULL, __WCLONE, NULL);
+	kfree(vtsk);
+}
+EXPORT_SYMBOL_GPL(vhost_task_stop);
+
+/**
+ * vhost_task_should_stop - should the vhost task return from the work function
+ */
+bool vhost_task_should_stop(struct vhost_task *vtsk)
+{
+	return test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
+}
+EXPORT_SYMBOL_GPL(vhost_task_should_stop);
+
+/**
+ * vhost_task_create - create a copy of a process to be used by the kernel
+ * @fn: worker function to run in the new task
+ * @arg: data to be passed to fn
+ * @node: numa node to allocate task from
+ *
+ * This returns a specialized task for use by the vhost layer or NULL on
+ * failure. The returned task is inactive, and the caller must fire it up
+ * through vhost_task_start().
+ */
+struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, int node)
+{
+	struct kernel_clone_args args = {
+		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+		.exit_signal	= 0,
+		.worker_flags	= USER_WORKER | USER_WORKER_NO_FILES |
+				  USER_WORKER_SIG_IGN,
+		.fn		= vhost_task_fn,
+	};
+	struct vhost_task *vtsk;
+	struct task_struct *tsk;
+
+	vtsk = kzalloc(sizeof(*vtsk), GFP_KERNEL);
+	if (!vtsk)
+		return NULL;
+	init_completion(&vtsk->exited);
+	vtsk->data = arg;
+	vtsk->fn = fn;
+
+	args.fn_arg = vtsk;
+
+	tsk = copy_process(NULL, 0, node, &args);
+	if (IS_ERR(tsk)) {
+		kfree(vtsk);
+		return NULL;
+	}
+
+	vtsk->task = tsk;
+	return vtsk;
+}
+EXPORT_SYMBOL_GPL(vhost_task_create);
+
+/**
+ * vhost_task_start - start a vhost_task created with vhost_task_create
+ * @vtsk: vhost_task to wake up
+ * @namefmt: printf-style format string for the thread name
+ */
+void vhost_task_start(struct vhost_task *vtsk, const char namefmt[], ...)
+{
+	char name[TASK_COMM_LEN];
+	va_list args;
+
+	va_start(args, namefmt);
+	vsnprintf(name, sizeof(name), namefmt, args);
+	set_task_comm(vtsk->task, name);
+	va_end(args);
+
+	wake_up_new_task(vtsk->task);
+}
+EXPORT_SYMBOL_GPL(vhost_task_start);
--
2.25.1


2023-02-02 23:26:12

by Mike Christie

Subject: [PATCH v11 5/8] fork: allow kernel code to call copy_process

The next patch adds helpers like create_io_thread, but for use by the
vhost layer. There are several functions, so they are in their own file
instead of cluttering up fork.c. This patch allows that new file to
call copy_process.

Signed-off-by: Mike Christie <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
---
include/linux/sched/task.h | 2 ++
kernel/fork.c | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ce6240a006cf..b0e43a1fd21d 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -94,6 +94,8 @@ extern void exit_files(struct task_struct *);
 extern void exit_itimers(struct task_struct *);
 
 extern pid_t kernel_clone(struct kernel_clone_args *kargs);
+struct task_struct *copy_process(struct pid *pid, int trace, int node,
+				 struct kernel_clone_args *args);
 struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
 struct task_struct *fork_idle(int);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/kernel/fork.c b/kernel/fork.c
index 55c77de45271..93e545b08205 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2013,7 +2013,7 @@ static void rv_task_fork(struct task_struct *p)
  * parts of the process environment (as per the clone
  * flags). The actual kick-off is left to the caller.
  */
-static __latent_entropy struct task_struct *copy_process(
+__latent_entropy struct task_struct *copy_process(
 					struct pid *pid,
 					int trace,
 					int node,
--
2.25.1


2023-02-02 23:26:17

by Mike Christie

Subject: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

For vhost workers we use the kthread API, which inherits its values from
and checks against the kthreadd thread. This results in the wrong RLIMITs
being checked, so while tools like libvirt try to control the number of
threads based on the nproc rlimit setting, we can end up creating more
threads than the user wanted.

This patch has us use the vhost_task helpers, which will inherit their
values/checks from the thread that owns the device, similar to if we did
a clone in userspace. The vhost threads will now be counted in the nproc
rlimits. And we get features like cgroups and mm sharing automatically,
so we can remove those calls.

Signed-off-by: Mike Christie <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
---
drivers/vhost/vhost.c | 58 ++++++++-----------------------------------
drivers/vhost/vhost.h | 4 +--
2 files changed, 13 insertions(+), 49 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 74378d241f8d..d3c7c37b69a7 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -22,11 +22,11 @@
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include <linux/kthread.h>
-#include <linux/cgroup.h>
 #include <linux/module.h>
 #include <linux/sort.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/signal.h>
+#include <linux/sched/vhost_task.h>
 #include <linux/interval_tree_generic.h>
 #include <linux/nospec.h>
 #include <linux/kcov.h>
@@ -256,7 +256,7 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 		 * test_and_set_bit() implies a memory barrier.
 		 */
 		llist_add(&work->node, &dev->worker->work_list);
-		wake_up_process(dev->worker->task);
+		wake_up_process(dev->worker->vtsk->task);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -336,17 +336,14 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 static int vhost_worker(void *data)
 {
 	struct vhost_worker *worker = data;
-	struct vhost_dev *dev = worker->dev;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
 
-	kthread_use_mm(dev->mm);
-
 	for (;;) {
 		/* mb paired w/ kthread_stop */
 		set_current_state(TASK_INTERRUPTIBLE);
 
-		if (kthread_should_stop()) {
+		if (vhost_task_should_stop(worker->vtsk)) {
 			__set_current_state(TASK_RUNNING);
 			break;
 		}
@@ -368,7 +365,7 @@ static int vhost_worker(void *data)
 			schedule();
 		}
 	}
-	kthread_unuse_mm(dev->mm);
+
 	return 0;
 }
 
@@ -509,31 +506,6 @@ long vhost_dev_check_owner(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_check_owner);
 
-struct vhost_attach_cgroups_struct {
-	struct vhost_work work;
-	struct task_struct *owner;
-	int ret;
-};
-
-static void vhost_attach_cgroups_work(struct vhost_work *work)
-{
-	struct vhost_attach_cgroups_struct *s;
-
-	s = container_of(work, struct vhost_attach_cgroups_struct, work);
-	s->ret = cgroup_attach_task_all(s->owner, current);
-}
-
-static int vhost_attach_cgroups(struct vhost_dev *dev)
-{
-	struct vhost_attach_cgroups_struct attach;
-
-	attach.owner = current;
-	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
-	vhost_work_queue(dev, &attach.work);
-	vhost_dev_flush(dev);
-	return attach.ret;
-}
-
 /* Caller should have device mutex */
 bool vhost_dev_has_owner(struct vhost_dev *dev)
 {
@@ -580,14 +552,14 @@ static void vhost_worker_free(struct vhost_dev *dev)
 
 	dev->worker = NULL;
 	WARN_ON(!llist_empty(&worker->work_list));
-	kthread_stop(worker->task);
+	vhost_task_stop(worker->vtsk);
 	kfree(worker);
 }
 
 static int vhost_worker_create(struct vhost_dev *dev)
 {
 	struct vhost_worker *worker;
-	struct task_struct *task;
+	struct vhost_task *vtsk;
 	int ret;
 
 	worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
@@ -595,27 +567,19 @@ static int vhost_worker_create(struct vhost_dev *dev)
 		return -ENOMEM;
 
 	dev->worker = worker;
-	worker->dev = dev;
 	worker->kcov_handle = kcov_common_handle();
 	init_llist_head(&worker->work_list);
 
-	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
-	if (IS_ERR(task)) {
-		ret = PTR_ERR(task);
+	vtsk = vhost_task_create(vhost_worker, worker, NUMA_NO_NODE);
+	if (!vtsk) {
+		ret = -ENOMEM;
 		goto free_worker;
 	}
 
-	worker->task = task;
-	wake_up_process(task); /* avoid contributing to loadavg */
-
-	ret = vhost_attach_cgroups(dev);
-	if (ret)
-		goto stop_worker;
-
+	worker->vtsk = vtsk;
+	vhost_task_start(vtsk, "vhost-%d", current->pid);
 	return 0;
 
-stop_worker:
-	kthread_stop(worker->task);
 free_worker:
 	kfree(worker);
 	dev->worker = NULL;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 2f6beab93784..3af59c65025e 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -16,6 +16,7 @@
 #include <linux/irqbypass.h>
 
 struct vhost_work;
+struct vhost_task;
 typedef void (*vhost_work_fn_t)(struct vhost_work *work);
 
 #define VHOST_WORK_QUEUED	1
@@ -26,9 +27,8 @@ struct vhost_work {
 };
 
 struct vhost_worker {
-	struct task_struct *task;
+	struct vhost_task *vtsk;
 	struct llist_head work_list;
-	struct vhost_dev *dev;
 	u64 kcov_handle;
 };

--
2.25.1


2023-02-02 23:28:52

by Mike Christie

Subject: [PATCH v11 7/8] vhost: move worker thread fields to new struct

This is just a prep patch. It moves the worker-related fields to a new
vhost_worker struct and moves the code around to create some helpers that
will be used in the next patch.

Signed-off-by: Mike Christie <[email protected]>
Reviewed-by: Stefan Hajnoczi <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/vhost/vhost.c | 98 ++++++++++++++++++++++++++++---------------
drivers/vhost/vhost.h | 11 +++--
2 files changed, 72 insertions(+), 37 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index cbe72bfd2f1f..74378d241f8d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -255,8 +255,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 		 * sure it was not in the list.
 		 * test_and_set_bit() implies a memory barrier.
 		 */
-		llist_add(&work->node, &dev->work_list);
-		wake_up_process(dev->worker);
+		llist_add(&work->node, &dev->worker->work_list);
+		wake_up_process(dev->worker->task);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -264,7 +264,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue);
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
-	return !llist_empty(&dev->work_list);
+	return dev->worker && !llist_empty(&dev->worker->work_list);
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
@@ -335,7 +335,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 
 static int vhost_worker(void *data)
 {
-	struct vhost_dev *dev = data;
+	struct vhost_worker *worker = data;
+	struct vhost_dev *dev = worker->dev;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
 
@@ -350,7 +351,7 @@ static int vhost_worker(void *data)
 			break;
 		}
 
-		node = llist_del_all(&dev->work_list);
+		node = llist_del_all(&worker->work_list);
 		if (!node)
 			schedule();
 
@@ -360,7 +361,7 @@ static int vhost_worker(void *data)
 		llist_for_each_entry_safe(work, work_next, node, node) {
 			clear_bit(VHOST_WORK_QUEUED, &work->flags);
 			__set_current_state(TASK_RUNNING);
-			kcov_remote_start_common(dev->kcov_handle);
+			kcov_remote_start_common(worker->kcov_handle);
 			work->fn(work);
 			kcov_remote_stop();
 			if (need_resched())
@@ -479,7 +480,6 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->byte_weight = byte_weight;
 	dev->use_worker = use_worker;
 	dev->msg_handler = msg_handler;
-	init_llist_head(&dev->work_list);
 	init_waitqueue_head(&dev->wait);
 	INIT_LIST_HEAD(&dev->read_list);
 	INIT_LIST_HEAD(&dev->pending_list);
@@ -571,10 +571,60 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 	dev->mm = NULL;
 }
 
+static void vhost_worker_free(struct vhost_dev *dev)
+{
+	struct vhost_worker *worker = dev->worker;
+
+	if (!worker)
+		return;
+
+	dev->worker = NULL;
+	WARN_ON(!llist_empty(&worker->work_list));
+	kthread_stop(worker->task);
+	kfree(worker);
+}
+
+static int vhost_worker_create(struct vhost_dev *dev)
+{
+	struct vhost_worker *worker;
+	struct task_struct *task;
+	int ret;
+
+	worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
+	if (!worker)
+		return -ENOMEM;
+
+	dev->worker = worker;
+	worker->dev = dev;
+	worker->kcov_handle = kcov_common_handle();
+	init_llist_head(&worker->work_list);
+
+	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
+	if (IS_ERR(task)) {
+		ret = PTR_ERR(task);
+		goto free_worker;
+	}
+
+	worker->task = task;
+	wake_up_process(task); /* avoid contributing to loadavg */
+
+	ret = vhost_attach_cgroups(dev);
+	if (ret)
+		goto stop_worker;
+
+	return 0;
+
+stop_worker:
+	kthread_stop(worker->task);
+free_worker:
+	kfree(worker);
+	dev->worker = NULL;
+	return ret;
+}
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
-	struct task_struct *worker;
 	int err;
 
 	/* Is there an owner already? */
@@ -585,36 +635,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 
 	vhost_attach_mm(dev);
 
-	dev->kcov_handle = kcov_common_handle();
 	if (dev->use_worker) {
-		worker = kthread_create(vhost_worker, dev,
-					"vhost-%d", current->pid);
-		if (IS_ERR(worker)) {
-			err = PTR_ERR(worker);
-			goto err_worker;
-		}
-
-		dev->worker = worker;
-		wake_up_process(worker); /* avoid contributing to loadavg */
-
-		err = vhost_attach_cgroups(dev);
+		err = vhost_worker_create(dev);
 		if (err)
-			goto err_cgroup;
+			goto err_worker;
 	}
 
 	err = vhost_dev_alloc_iovecs(dev);
 	if (err)
-		goto err_cgroup;
+		goto err_iovecs;
 
 	return 0;
-err_cgroup:
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-	}
+err_iovecs:
+	vhost_worker_free(dev);
 err_worker:
 	vhost_detach_mm(dev);
-	dev->kcov_handle = 0;
 err_mm:
 	return err;
 }
@@ -704,12 +739,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	dev->iotlb = NULL;
 	vhost_clear_msg(dev);
 	wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
-	WARN_ON(!llist_empty(&dev->work_list));
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-		dev->kcov_handle = 0;
-	}
+	vhost_worker_free(dev);
 	vhost_detach_mm(dev);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d9109107af08..2f6beab93784 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -25,6 +25,13 @@ struct vhost_work {
 	unsigned long flags;
 };
 
+struct vhost_worker {
+	struct task_struct *task;
+	struct llist_head work_list;
+	struct vhost_dev *dev;
+	u64 kcov_handle;
+};
+
 /* Poll a file (eventfd or socket) */
 /* Note: there's nothing vhost specific about this structure. */
 struct vhost_poll {
@@ -147,8 +154,7 @@ struct vhost_dev {
 	struct vhost_virtqueue **vqs;
 	int nvqs;
 	struct eventfd_ctx *log_ctx;
-	struct llist_head work_list;
-	struct task_struct *worker;
+	struct vhost_worker *worker;
 	struct vhost_iotlb *umem;
 	struct vhost_iotlb *iotlb;
 	spinlock_t iotlb_lock;
@@ -158,7 +164,6 @@ struct vhost_dev {
 	int iov_limit;
 	int weight;
 	int byte_weight;
-	u64 kcov_handle;
 	bool use_worker;
 	int (*msg_handler)(struct vhost_dev *dev, u32 asid,
 			   struct vhost_iotlb_msg *msg);
--
2.25.1


2023-02-03 00:14:39

by Linus Torvalds

Subject: Re: [PATCH v11 1/8] fork: Make IO worker options flag based

On Thu, Feb 2, 2023 at 3:25 PM Mike Christie
<[email protected]> wrote:
>
>  struct kernel_clone_args {
>  	u64 flags;
> +	u32 worker_flags;
>  	int __user *pidfd;
>  	int __user *child_tid;
>  	int __user *parent_tid;

Minor nit: please put this next to "exit_signal".

As it is, you've put a new 32-bit field in between two 64-bit fields
and are generating extra pointless padding.

We have that padding by "exit_signal" already, so let's just use it.
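
Concretely, the placement above gives something like this on a 64-bit
build (a sketch; offsets assume the usual 8-byte pointer alignment):

        struct kernel_clone_args {
                u64 flags;              /* offset 0,  size 8 */
                u32 worker_flags;       /* offset 8,  size 4 */
                                        /* 4 bytes of padding */
                int __user *pidfd;      /* offset 16, size 8 */
                ...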

Also, I like moving those flags to a "flags" field, but can we please
make it consistent? We have that "args->kthread" field too, which is
100% analogous to args->io_thread.

So don't make a bit field for io_thread, and then not do the same for kthread.

Finally, why isn't this all just a bitfield - every single case would
seem to prefer something like

        if (args->user_worker) ..

instead of

        if (args->worker_flags & USER_WORKER)

which would seem to make everything simpler still?
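
(For example, with named bitfields the struct could simply grow entries
like the following - a sketch, not a concrete patch:

        struct kernel_clone_args {
                ...
                u32 kthread:1;
                u32 io_thread:1;
                u32 user_worker:1;
                ...
        };

and each test then reads "if (args->user_worker)".)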

Linus

2023-02-03 00:16:37

by Linus Torvalds

Subject: Re: [PATCH v11 3/8] fork: add USER_WORKER flag to not dup/clone files

On Thu, Feb 2, 2023 at 3:25 PM Mike Christie
<[email protected]> wrote:
>
> -	retval = copy_files(clone_flags, p);
> +	retval = copy_files(clone_flags, p,
> +			    args->worker_flags & USER_WORKER_NO_FILES);

Just to hit the previous email comment home, adding just another
bitfield case would have made this patch simpler, and this would just
be

retval = copy_files(clone_flags, p, args->no_files);

which seems more legible too.

Linus

2023-02-03 00:20:23

by Linus Torvalds

Subject: Re: [PATCH v11 4/8] fork: Add USER_WORKER flag to ignore signals

On Thu, Feb 2, 2023 at 3:25 PM Mike Christie
<[email protected]> wrote:
>
> +	if (args->worker_flags & USER_WORKER_SIG_IGN)
> +		ignore_signals(p);

Same comment as for the other case.

There are real reasons to avoid bitfields:

- you can't pass addresses to them around

- it's easier to read or assign multiple fields in one go

- they are horrible for ABI issues due to the exact bit ordering and
padding being very subtle

but none of those issues are relevant here, where it's a kernel-internal ABI.

All these use-cases seem to actually be testing one bit at a time, and
the "assignments" are structure initializers for which named bitfields
are actually perfect and just make the initializer more legible.

Linus

2023-02-03 00:43:41

by Linus Torvalds

Subject: Re: [PATCH v11 6/8] vhost_task: Allow vhost layer to use copy_process

On Thu, Feb 2, 2023 at 3:25 PM Mike Christie
<[email protected]> wrote:
>
> +/**
> + * vhost_task_start - start a vhost_task created with vhost_task_create
> + * @vtsk: vhost_task to wake up
> + * @namefmt: printf-style format string for the thread name
> + */
> +void vhost_task_start(struct vhost_task *vtsk, const char namefmt[], ...)
> +{
> +	char name[TASK_COMM_LEN];
> +	va_list args;
> +
> +	va_start(args, namefmt);
> +	vsnprintf(name, sizeof(name), namefmt, args);
> +	set_task_comm(vtsk->task, name);
> +	va_end(args);
> +
> +	wake_up_new_task(vtsk->task);
> +}

Ok, I like this more than what we do for the IO workers - they set
their own names themselves once they start running, rather than have
the creator do it like this.

At the same time, my reaction to this was "why do we need to go
through that temporary 'name[]' buffer at all?"

And I think this patch is very much correct to do so, because
"copy_thread()" has already exposed the new thread to the rest of the
world, even though it hasn't actually started running yet.

So I think this is all doing the right thing, and I like how it does
it better than what io_uring does, BUT...

It does make me think that maybe we should make that task name
handling part of copy_process(), and simply create the task name
before we need this careful set_task_comm() with a temporary buffer.

Because if we just did it in copy_process() before the new task has
been exposed anywhere, we could just do it as

        if (args->name)
                vsnprintf(tsk->comm, TASK_COMM_LEN, "%s-%d", args->name, tsk->pid);

or something like that.

Not a big deal, it was just me reacting to this patch with "do we
really need set_task_comm() when we're creating the task?"

Linus

2023-02-05 16:06:33

by Mike Christie

Subject: Re: [PATCH v11 4/8] fork: Add USER_WORKER flag to ignore signals

On 2/2/23 6:19 PM, Linus Torvalds wrote:
> On Thu, Feb 2, 2023 at 3:25 PM Mike Christie
> <[email protected]> wrote:
>>
>> +	if (args->worker_flags & USER_WORKER_SIG_IGN)
>> +		ignore_signals(p);
>
> Same comment as for the other case.
>
> There are real reasons to avoid bitfields:
>
> - you can't pass addresses to them around
>
> - it's easier to read or assign multiple fields in one go
>
> - they are horrible for ABI issues due to the exact bit ordering and
> padding being very subtle
>
> but none of those issues are relevant here, where it's a kernel-internal ABI.
>
> All these use-cases seem to actually be testing one bit at a time, and
> the "assignments" are structure initializers for which named bitfields
> are actually perfect and just make the initializer more legible.
>

Thanks for the comments. I see what you mean and have fixed those instances and
updated kthread as well.


2023-02-07 08:19:50

by Christian Brauner

Subject: Re: [PATCH v11 0/8] Use copy_process in vhost layer

On Thu, Feb 02, 2023 at 05:25:09PM -0600, Mike Christie wrote:
> The following patches were made over Linus's tree. They allow the vhost
> layer to use copy_process instead of using workqueue_structs to create
> worker threads for VM's devices.

Thanks for keeping at this, Mike.
I can pick this up once you resend with the requested changes.

2023-05-05 13:58:00

by Nicolas Dichtel

Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On 03/02/2023 at 00:25, Mike Christie wrote:
> For vhost workers we use the kthread API, which inherits its values from
> and checks against the kthreadd thread. This results in the wrong RLIMITs
> being checked, so while tools like libvirt try to control the number of
> threads based on the nproc rlimit setting, we can end up creating more
> threads than the user wanted.
>
> This patch has us use the vhost_task helpers, which will inherit their
> values/checks from the thread that owns the device, similar to if we did
> a clone in userspace. The vhost threads will now be counted in the nproc
> rlimits. And we get features like cgroups and mm sharing automatically,
> so we can remove those calls.
>
> Signed-off-by: Mike Christie <[email protected]>
> Acked-by: Michael S. Tsirkin <[email protected]>

I have a question about (a side effect of?) this patch. The output of the 'ps'
command has changed. Here is an example:

Before:
$ ps
  PID TTY          TIME CMD
  598 ttyS0    00:00:00 login
  640 ttyS0    00:00:00 bash
 8880 ttyS0    00:00:06 example:2
 9389 ttyS0    00:00:00 ps
$ ps a
  PID TTY      STAT   TIME COMMAND
  598 ttyS0    Ss     0:00 /bin/login -p --
  602 tty1     Ss+    0:00 /sbin/agetty -o -p -- \u --noclear tty1 linux
  640 ttyS0    S      0:00 /bin/bash -li
 8880 ttyS0    SLl    0:10 /usr/bin/example
 9396 ttyS0    R+     0:00 ps a
$ pgrep -f example
8880

After:
$ ps
  PID TTY          TIME CMD
  538 ttyS0    00:00:00 login
  574 ttyS0    00:00:00 bash
 8275 ttyS0    00:03:28 example:2
 8285 ttyS0    00:00:00 vhost-8275
 8295 ttyS0    00:00:00 vhost-8275
 8299 ttyS0    00:00:00 vhost-8275
 9054 ttyS0    00:00:00 ps
$ ps a
  PID TTY      STAT   TIME COMMAND
  538 ttyS0    Ss     0:00 /bin/login -p --
  540 tty1     Ss+    0:00 /sbin/agetty -o -p -- \u --noclear tty1 linux
  574 ttyS0    S      0:00 /bin/bash -li
 8275 ttyS0    SLl    3:28 /usr/bin/example
 8285 ttyS0    SL     0:00 /usr/bin/example
 8295 ttyS0    SL     0:00 /usr/bin/example
 8299 ttyS0    SL     0:00 /usr/bin/example
 9055 ttyS0    R+     0:00 ps a
$ pgrep -f example
8275
8285
8295
8299

Is this an intended behavior?
This breaks some of our scripts.


Regards,
Nicolas

2023-05-05 18:35:03

by Linus Torvalds

Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On Fri, May 5, 2023 at 6:40 AM Nicolas Dichtel
<[email protected]> wrote:
>
> Is this an intended behavior?
> This breaks some of our scripts.

It doesn't just break your scripts (which counts as a regression), I
think it's really wrong.

The worker threads should show up as threads of the thing that started
them, not as processes.

So they should show up in 'ps' only when one of the "show threads" flag is set.

But I suspect the fix is trivial: the virtio code should likely use
CLONE_THREAD for the copy_process() it does.

It should look more like "create_io_thread()" than "copy_process()", I think.

For example, do virtio worker threads really want their own signals
and files? That sounds wrong. create_io_thread() uses all of

CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_IO

to share much more of the context with the process it is actually run within.

Christian? Mike?

Linus

2023-05-05 22:59:43

by Mike Christie

Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On 5/5/23 1:22 PM, Linus Torvalds wrote:
> On Fri, May 5, 2023 at 6:40 AM Nicolas Dichtel
> <[email protected]> wrote:
>>
>> Is this an intended behavior?
>> This breaks some of our scripts.
>
> It doesn't just break your scripts (which counts as a regression), I
> think it's really wrong.
>
> The worker threads should show up as threads of the thing that started
> them, not as processes.
>
> So they should show up in 'ps' only when one of the "show threads" flag is set.
>
> But I suspect the fix is trivial: the virtio code should likely use
> CLONE_THREAD for the copy_process() it does.
>
> It should look more like "create_io_thread()" than "copy_process()", I think.
>
> For example, do virtio worker threads really want their own signals
> and files? That sounds wrong. create_io_thread() uses all of
>
> CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_IO
>
> to share much more of the context with the process it is actually run within.
>

For the vhost tasks and the CLONE flags:

1. I didn't use CLONE_FILES in the vhost task patches because you are right
and we didn't need our own. We needed it to work like kthreads where there
are no files, so I set the kernel_clone_args.no_files bit to have copy_files
not do a dup or clone (task->files is NULL).

2. vhost tasks didn't use CLONE_SIGHAND, because userspace apps like qemu use
signals for management operations. But, the vhost thread's worker functions
assume signals are ignored like they were with kthreads. So if they were doing
IO and got a signal like a SIGHUP they might return early and fail from whatever
network/block function they were calling. And currently the parent like qemu
handles something like a SIGSTOP by shutting everything down by calling into
the vhost interface to remove the device.

So similar to files, I used the kernel_clone_args.ignore_signals bit so
copy_process has the vhost thread use its own signal handler that just
ignores signals.

3. I didn't use CLONE_THREAD because before my patches you could do
"ps -u root" and see all the vhost threads. If we use CLONE_THREAD, then we
can only see them when we do something like "ps -T -p $parent" like you mentioned
above. I guess I messed up and did the reverse and thought it would be a
regression if "ps -u root" no longer showed the vhost threads.

If it's ok to change the behavior of "ps -u root", then we can do this patch:
(Nicolas, I confirmed it fixes the 'ps a' case, but couldn't replicate the 'ps'
case. If you could test the ps only case or give me info on what /usr/bin/example
was doing I can replicate and test here):


diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..eb9ffc58e211 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2269,8 +2269,14 @@ __latent_entropy struct task_struct *copy_process(
 	/*
 	 * Thread groups must share signals as well, and detached threads
 	 * can only be started up within the thread group.
+	 *
+	 * A user worker's parent thread will normally have a signal handler
+	 * that performs management operations, but the worker will not,
+	 * because the parent will handle the signal and then use a worker-
+	 * specific interface to manage the thread and related resources.
 	 */
-	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
+	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND) &&
+	    !args->user_worker && !args->ignore_signals)
 		return ERR_PTR(-EINVAL);
 
 	/*
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..3700c21ea39d 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -75,7 +78,8 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
 				     const char *name)
 {
 	struct kernel_clone_args args = {
-		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+		.flags		= CLONE_FS | CLONE_THREAD | CLONE_VM |
+				  CLONE_UNTRACED,
 		.exit_signal	= 0,
 		.fn		= vhost_task_fn,
 		.name		= name,


2023-05-06 02:20:23

by Linus Torvalds

Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On Fri, May 5, 2023 at 3:38 PM Mike Christie
<[email protected]> wrote:
>
> If it's ok to change the behavior of "ps -u root", then we can do this patch:

I think this is the right thing to do.

Making the user worker threads show up as threads with the vhost
process as the parent really seems like a much better model, and more
accurate.

Yes, they used to show up as random kernel threads, and you'd see them
as such (not just for "ps -u root", but simply also with just a normal
"ps ax" kind of thing). But that isn't all that helpful, and it's
really just annoying to see our kernel threads in "ps ax" output, and
I've often wished we didn't do that (think of all the random
"kworker/xyz-kcryptd" etc things that show up).

So I think showing them as the threaded children of the vhost process
is much nicer, and probably the best option.

Because I don't think anything is going to get the *old* behavior of
showing them as the '[vhost-xyz]' system threads (or whatever the old
output ended up being in 'ps ax'), but hopefully nothing wants that
horror anyway.

At a minimum, the parenting is fundamentally going to look different
in the new model.

Linus

2023-05-08 17:21:53

by Christian Brauner

Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On Fri, May 05, 2023 at 05:37:40PM -0500, Mike Christie wrote:
> On 5/5/23 1:22 PM, Linus Torvalds wrote:
> > On Fri, May 5, 2023 at 6:40 AM Nicolas Dichtel
> > <[email protected]> wrote:
> >>
> >> Is this an intended behavior?
> >> This breaks some of our scripts.
> >
> > It doesn't just break your scripts (which counts as a regression), I
> > think it's really wrong.
> >
> > The worker threads should show up as threads of the thing that started
> > them, not as processes.
> >
> > So they should show up in 'ps' only when one of the "show threads" flag is set.
> >
> > But I suspect the fix is trivial: the virtio code should likely use
> > CLONE_THREAD for the copy_process() it does.
> >
> > It should look more like "create_io_thread()" than "copy_process()", I think.
> >
> > For example, do virtio worker threads really want their own signals
> > and files? That sounds wrong. create_io_thread() uses all of
> >
> > CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_IO
> >
> > to share much more of the context with the process it is actually run within.
> >
>
> For the vhost tasks and the CLONE flags:
>
> 1. I didn't use CLONE_FILES in the vhost task patches because you are right
> and we didn't need our own. We needed it to work like kthreads where there
> are no files, so I set the kernel_clone_args.no_files bit to have copy_files
> not do a dup or clone (task->files is NULL).
>
> 2. vhost tasks didn't use CLONE_SIGHAND, because userspace apps like qemu use
> signals for management operations. But, the vhost thread's worker functions
> assume signals are ignored like they were with kthreads. So if they were doing
> IO and got a signal like a SIGHUP they might return early and fail from whatever
> network/block function they were calling. And currently the parent like qemu
> handles something like a SIGSTOP by shutting everything down by calling into
> the vhost interface to remove the device.
>
> So similar to files, I used the kernel_clone_args.ignore_signals bit so
> copy_process has the vhost thread use its own signal handler that just
> ignores signals.
>
> 3. I didn't use CLONE_THREAD because before my patches you could do
> "ps -u root" and see all the vhost threads. If we use CLONE_THREAD, then we
> can only see them when we do something like "ps -T -p $parent" like you mentioned
> above. I guess I messed up and did the reverse and thought it would be a
> regression if "ps -u root" no longer showed the vhost threads.
>
> If it's ok to change the behavior of "ps -u root", then we can do this patch:
> (Nicolas, I confirmed it fixes the 'ps a' case, but couldn't replicate the 'ps'
> case. If you could test the ps only case or give me info on what /usr/bin/example
> was doing I can replicate and test here):
>
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index ed4e01daccaa..eb9ffc58e211 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2269,8 +2269,14 @@ __latent_entropy struct task_struct *copy_process(
>  	/*
>  	 * Thread groups must share signals as well, and detached threads
>  	 * can only be started up within the thread group.
> +	 *
> +	 * A user worker's parent thread will normally have a signal handler
> +	 * that performs management operations, but the worker will not,
> +	 * because the parent will handle the signal and then use a worker-
> +	 * specific interface to manage the thread and related resources.
>  	 */
> -	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
> +	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND) &&
> +	    !args->user_worker && !args->ignore_signals)
>  		return ERR_PTR(-EINVAL);

I'm currently traveling due to LSFMM, so my responses will be delayed
this week. I'm not yet clear on whether CLONE_THREAD without
CLONE_SIGHAND is safe. If there's code that assumes that
$task_in_threadgroup->sighand->siglock always covers all threads in the
threadgroup, then this change would break that assumption?

>
>  	/*
> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
> index b7cbd66f889e..3700c21ea39d 100644
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -75,7 +78,8 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
>  				     const char *name)
>  {
>  	struct kernel_clone_args args = {
> -		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
> +		.flags		= CLONE_FS | CLONE_THREAD | CLONE_VM |
> +				  CLONE_UNTRACED,
>  	.exit_signal	= 0,
>  	.fn		= vhost_task_fn,
>  	.name		= name,

2023-05-09 08:25:02

by Nicolas Dichtel

Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On 06/05/2023 at 00:37, Mike Christie wrote:
[snip]
> (Nicolas, I confirmed it fixes the 'ps a' case, but couldn't replicate the 'ps'
> case. If you could test the ps only case or give me info on what /usr/bin/example
> was doing I can replicate and test here):
With your patch:
$ ps a
  PID TTY      STAT   TIME COMMAND
  191 ttyS0    Ss     0:00 /bin/sh -li
 1255 ttyS0    SLl    0:53 /usr/bin/example
 1742 ttyS0    R+     0:00 ps a
$ ps
  PID TTY          TIME CMD
  191 ttyS0    00:00:00 sh
 1743 ttyS0    00:00:00 ps

This fixes the regression on our side, but now, 'example' is not displayed
anymore with 'ps'.

Thank you,
Nicolas

2023-05-09 08:40:30

by Nicolas Dichtel

Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On 09/05/2023 at 10:09, Nicolas Dichtel wrote:
> On 06/05/2023 at 00:37, Mike Christie wrote:
> [snip]
>> (Nicolas, I confirmed it fixes the 'ps a' case, but couldn't replicate the 'ps'
>> case. If you could test the ps only case or give me info on what /usr/bin/example
>> was doing I can replicate and test here):
> With your patch:
> $ ps a
>   PID TTY      STAT   TIME COMMAND
>   191 ttyS0    Ss     0:00 /bin/sh -li
>  1255 ttyS0    SLl    0:53 /usr/bin/example
>  1742 ttyS0    R+     0:00 ps a
> $ ps
>   PID TTY          TIME CMD
>   191 ttyS0    00:00:00 sh
>  1743 ttyS0    00:00:00 ps
Sorry, this is wrong, here is the right screenshot:
$ ps
  PID TTY          TIME CMD
  538 ttyS0    00:00:00 login
  573 ttyS0    00:00:00 bash
 8282 ttyS0    00:00:04 example:2
 8825 ttyS0    00:00:00 ps
$ ps a
  PID TTY      STAT   TIME COMMAND
  538 ttyS0    Ss     0:00 /bin/login -p --
  540 tty1     Ss+    0:00 /sbin/agetty -o -p -- \u --noclear tty1 linux
  573 ttyS0    S      0:00 /bin/bash -li
 8282 ttyS0    RLl    0:05 /usr/bin/example
 8829 ttyS0    R+     0:00 ps a

It fixes the issue.


Thank you,
Nicolas

2023-05-13 13:16:24

by Thorsten Leemhuis

Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

[CCing the regression list]

On 06.05.23 00:37, Mike Christie wrote:
> On 5/5/23 1:22 PM, Linus Torvalds wrote:
>> On Fri, May 5, 2023 at 6:40 AM Nicolas Dichtel
>> <[email protected]> wrote:
>>>
>>> Is this an intended behavior?
>>> This breaks some of our scripts.

Jumping in here, as I found another problem with that patch: it broke
s2idle on my laptop when a qemu-kvm VM is running, as freezing user
space processes now fails for me:

```
[ 195.442949] PM: suspend entry (s2idle)
[ 195.641271] Filesystems sync: 0.198 seconds
[ 195.833828] Freezing user space processes
[ 215.841084] Freezing user space processes failed after 20.007 seconds (1 tasks refusing to freeze, wq_busy=0):
[ 215.841255] task:vhost-3221 state:R stack:0 pid:3250 ppid:3221 flags:0x00004006
[ 215.841264] Call Trace:
[ 215.841266] <TASK>
[ 215.841270] ? update_rq_clock+0x39/0x270
[ 215.841283] ? _raw_spin_unlock+0x19/0x40
[ 215.841290] ? __schedule+0x3f/0x1510
[ 215.841296] ? sysvec_apic_timer_interrupt+0xaf/0xd0
[ 215.841306] ? schedule+0x61/0xe0
[ 215.841313] ? vhost_worker+0x87/0xb0 [vhost]
[ 215.841329] ? vhost_task_fn+0x1a/0x30
[ 215.841336] ? __pfx_vhost_task_fn+0x10/0x10
[ 215.841341] ? ret_from_fork+0x2c/0x50
[ 215.841352] </TASK>
[ 215.841936] OOM killer enabled.
[ 215.841938] Restarting tasks ... done.
[ 215.844204] random: crng reseeded on system resumption
[ 215.957095] PM: suspend exit
[ 215.957185] PM: suspend entry (s2idle)
[ 215.967646] Filesystems sync: 0.010 seconds
[ 215.971326] Freezing user space processes
[ 235.974400] Freezing user space processes failed after 20.003 seconds (1 tasks refusing to freeze, wq_busy=0):
[ 235.974574] task:vhost-3221 state:R stack:0 pid:3250 ppid:3221 flags:0x00004806
[ 235.974583] Call Trace:
[ 235.974586] <TASK>
[ 235.974593] ? __schedule+0x184/0x1510
[ 235.974605] ? sysvec_apic_timer_interrupt+0xaf/0xd0
[ 235.974616] ? schedule+0x61/0xe0
[ 235.974624] ? vhost_worker+0x87/0xb0 [vhost]
[ 235.974648] ? vhost_task_fn+0x1a/0x30
[ 235.974656] ? __pfx_vhost_task_fn+0x10/0x10
[ 235.974662] ? ret_from_fork+0x2c/0x50
[ 235.974673] </TASK>
[ 235.975190] OOM killer enabled.
[ 235.975192] Restarting tasks ... done.
[ 235.978131] random: crng reseeded on system resumption
[ 236.091219] PM: suspend exit
```

After running into the problem I booted 6.3.1-rc1 again and there s2idle
still worked. Didn't do a bisection, just looked at the vhost commits
during the latest merge window; 6e890c5d502 ("vhost: use vhost_tasks for
worker threads") looked suspicious, so I reverted it on top of latest
mainline and then things worked again. Through a search on lore I arrived
in this thread and found the patch below from Mike. Gave it a try on top of
latest mainline, but it didn't help.

Ciao, Thorsten

> [...]
> If it's ok to change the behavior of "ps -u root", then we can do this patch:
> (Nicolas, I confirmed it fixes the 'ps a' case, but couldn't replicate the 'ps'
> case. If you could test the ps only case or give me info on what /usr/bin/example
> was doing I can replicate and test here):
>
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index ed4e01daccaa..eb9ffc58e211 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2269,8 +2269,14 @@ __latent_entropy struct task_struct *copy_process(
>  	/*
>  	 * Thread groups must share signals as well, and detached threads
>  	 * can only be started up within the thread group.
> +	 *
> +	 * A user worker's parent thread will normally have a signal handler
> +	 * that performs management operations, but the worker will not,
> +	 * because the parent will handle the signal and then use a worker-
> +	 * specific interface to manage the thread and related resources.
>  	 */
> -	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
> +	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND) &&
> +	    !args->user_worker && !args->ignore_signals)
>  		return ERR_PTR(-EINVAL);
>
>  	/*
> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
> index b7cbd66f889e..3700c21ea39d 100644
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -75,7 +78,8 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
>  				     const char *name)
>  {
>  	struct kernel_clone_args args = {
> -		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
> +		.flags		= CLONE_FS | CLONE_THREAD | CLONE_VM |
> +				  CLONE_UNTRACED,
>  	.exit_signal	= 0,
>  	.fn		= vhost_task_fn,
>  	.name		= name,

2023-05-13 15:30:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On Sat, May 13, 2023 at 7:39 AM Thorsten Leemhuis <[email protected]> wrote:
>
> Jumping in here, as I found another problem with that patch: it broke
> s2idle on my laptop when a qemu-kvm VM is running, as freezing user
> space processes now fails for me:

Hmm. kthreads have PF_NOFREEZE by default, which is probably the reason.

Adding

current->flags |= PF_NOFREEZE;

to the vhost_task setup might just fix it, but it feels a bit off.

The way io_uring does this is to do

        if (signal_pending(current)) {
                struct ksignal ksig;

                if (!get_signal(&ksig))
                        continue;
                break;
        }

in the main loop, which ends up handling the freezer situation too.
But it should handle things like SIGSTOP etc as well, and also exit on
actual signals.

I get the feeling that the whole "vhost_task_should_stop()" logic
should have the exact logic above, and basically make those threads
killable as well.

Hmm?

Linus
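
(For readers following along: a minimal sketch of what folding the io_uring-style
loop above into the vhost task's main function might look like. The
vhost_task_fn()/struct vhost_task details here are illustrative assumptions,
not the actual implementation.)

```c
/* Sketch only: an io_uring-style signal check in the worker loop. */
static int vhost_task_fn(void *data)
{
        struct vhost_task *vtsk = data; /* assumed fields: fn, data, exited */

        for (;;) {
                /* get_signal() dequeues pending signals and also
                 * participates in the freezer, so suspend can freeze
                 * this task here; it returns true for a fatal signal
                 * such as SIGKILL. */
                if (signal_pending(current)) {
                        struct ksignal ksig;

                        if (get_signal(&ksig))
                                break;
                }

                /* Assumed: fn() runs queued work, returns whether it did any. */
                if (!vtsk->fn(vtsk->data)) {
                        set_current_state(TASK_INTERRUPTIBLE);
                        schedule();
                }
        }

        complete(&vtsk->exited);        /* let the stop path proceed */
        do_exit(0);
}
```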

2023-05-15 14:32:26

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On Sat, May 13, 2023 at 10:08:04AM -0500, Linus Torvalds wrote:
> On Sat, May 13, 2023 at 7:39 AM Thorsten Leemhuis <[email protected]> wrote:
> >
> > Jumping in here, as I found another problem with that patch: it broke
> > s2idle on my laptop when a qemu-kvm VM is running, as freezing user
> > space processes now fails for me:
>
> Hmm. kthreads have PF_NOFREEZE by default, which is probably the reason.
>
> Adding
>
> current->flags |= PF_NOFREEZE;
>
> to the vhost_task setup might just fix it, but it feels a bit off.
>
> The way io_uring does this is to do
>
> if (signal_pending(current)) {
> struct ksignal ksig;
>
> if (!get_signal(&ksig))
> continue;
> break;
> }
>
> in the main loop, which ends up handling the freezer situation too.
> But it should handle things like SIGSTOP etc as well, and also exit on
> actual signals.
>
> I get the feeling that the whole "vhost_task_should_stop()" logic
> should have the exact logic above, and basically make those threads
> killable as well.
>
> Hmm?

I'm still trying to catch up after LSFMM with everything that's happened
on the fs side so coming back to this thread with a fresh set of eyes is
difficult. Sorry about the delay here.

So we seem to have two immediate issues:
(1) The current logic breaks ps output because vhost creates helper
processes instead of threads. The suggested patch by Mike was to
make them proper threads again but somehow special threads in the
sense that they don't unshare signal handlers. The latter part is
possibly broken and seems hacky. (That's earlier in the thread.)
(2) Freezing of vhost tasks fails. (This mail.)

So I think we will be able to address (1) and (2) by making vhost tasks
proper threads and blocking every signal except for SIGKILL and SIGSTOP
and then having vhost handle get_signal() - as you mentioned - the same
way io_uring already does. We should also remove the ignore_signals
thing completely imho. I don't think we ever want to do this with user
workers.

@Mike, can you get a patch ready ideally this week so we can get this
fixed soon?

2023-05-15 15:57:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On Mon, May 15, 2023 at 7:23 AM Christian Brauner <[email protected]> wrote:
>
> So I think we will be able to address (1) and (2) by making vhost tasks
> proper threads and blocking every signal except for SIGKILL and SIGSTOP
> and then having vhost handle get_signal() - as you mentioned - the same
> way io_uring already does. We should also remove the ignore_signals
> thing completely imho. I don't think we ever want to do this with user
> workers.

Right. That's what IO_URING does:

        if (args->io_thread) {
                /*
                 * Mark us an IO worker, and block any signal that isn't
                 * fatal or STOP
                 */
                p->flags |= PF_IO_WORKER;
                siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
        }

and I really think that vhost should basically do exactly what io_uring does.

Not because io_uring fundamentally got this right - but simply because
io_uring had almost all the same bugs (and then some), and what the
io_uring worker threads ended up doing was to basically zoom in on
"this works".

And it zoomed in on it largely by just going for "make it look as much
as possible as a real user thread", because every time the kernel
thread did something different, it just caused problems.

So I think the patch should just look something like the attached.
Mike, can you test this on whatever vhost test-suite?

I did consider getting rid of ".ignore_signals" entirely, and instead
just keying the "block signals" behavior off the ".user_worker" flag.
But this approach doesn't seem wrong either, and I don't think it's
wrong to make the create_io_thread() function say that
".ignore_signals = 1" thing explicitly, rather than key it off the
".io_thread" flag.

Jens/Christian - comments?

Slightly related to this all: I think vhost should also do
CLONE_FILES, and get rid of the whole ".no_files" thing. Again, if
vhost doesn't use any files, it shouldn't matter, and looking
different just to be different is wrong. But if vhost doesn't use any
files, the current situation shouldn't be a bug either.

Linus

2023-05-15 15:59:37

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On 5/15/23 9:44 AM, Linus Torvalds wrote:
> On Mon, May 15, 2023 at 7:23 AM Christian Brauner <[email protected]> wrote:
>>
>> So I think we will be able to address (1) and (2) by making vhost tasks
>> proper threads and blocking every signal except for SIGKILL and SIGSTOP
>> and then having vhost handle get_signal() - as you mentioned - the same
>> way io_uring already does. We should also remove the ignore_signals
>> thing completely imho. I don't think we ever want to do this with user
>> workers.
>
> Right. That's what IO_URING does:
>
> if (args->io_thread) {
> /*
> * Mark us an IO worker, and block any signal that isn't
> * fatal or STOP
> */
> p->flags |= PF_IO_WORKER;
> siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
> }
>
> and I really think that vhost should basically do exactly what io_uring does.
>
> Not because io_uring fundamentally got this right - but simply because
> io_uring had almost all the same bugs (and then some), and what the
> io_uring worker threads ended up doing was to basically zoom in on
> "this works".
>
> And it zoomed in on it largely by just going for "make it look as much
> as possible as a real user thread", because every time the kernel
> thread did something different, it just caused problems.

This is exactly what I told Christian in a private chat too - we went
through all of that, and this is what works. KISS.

> So I think the patch should just look something like the attached.
> Mike, can you test this on whatever vhost test-suite?

Seems like that didn't get attached...

> I did consider getting rid of ".ignore_signals" entirely, and instead
> just keying the "block signals" behavior off the ".user_worker" flag.
> But this approach doesn't seem wrong either, and I don't think it's
> wrong to make the create_io_thread() function say that
> ".ignore_signals = 1" thing explicitly, rather than key it off the
> ".io_thread" flag.
>
> Jens/Christian - comments?
>
> Slightly related to this all: I think vhost should also do
> CLONE_FILES, and get rid of the whole ".no_files" thing. Again, if
> vhost doesn't use any files, it shouldn't matter, and looking
> different just to be different is wrong. But if vhost doesn't use any
> files, the current situation shouldn't be a bug either.

Only potential downside is that it does make file references more
expensive for other syscalls, since you now have a shared file table.
But probably not something to worry about here?

--
Jens Axboe


2023-05-15 16:10:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On Mon, May 15, 2023 at 8:52 AM Jens Axboe <[email protected]> wrote:
>
> Only potential downside is that it does make file references more
> expensive for other syscalls, since you now have a shared file table.
> But probably not something to worry about here?

Would the vhost worker's user processes ever be otherwise single-threaded?

I'd *assume* that a vhost user is already doing its own threads. But
maybe that's a completely bogus assumption. I don't actually use any
of this, so...

Because you are obviously 100% right that if you're otherwise
single-threaded, then a CLONE_FILES kernel helper thread will cause
the extra cost for file descriptor lookup/free due to all the race
prevention.

Linus
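
(Background for the cost being discussed: the fd lookup fast path can skip
reference counting only while the file table has a single user. A simplified
sketch in the spirit of __fget_light() in fs/file.c, not the verbatim kernel
code:)

```c
/* Simplified: why a CLONE_FILES helper thread makes every fd lookup in
 * the (otherwise single-threaded) parent pay for refcounting. */
static struct file *fget_light_sketch(unsigned int fd)
{
        struct files_struct *files = current->files;

        if (atomic_read(&files->count) == 1) {
                /* Sole user of the table: nobody can close the fd under
                 * us, so return the file without touching its refcount. */
                return files_lookup_fd_raw(files, fd);
        }

        /* Shared table: fall back to the fully refcounted lookup. The
         * real code tracks which case applies via struct fd so the
         * caller knows whether it must fput() afterwards. */
        return fget(fd);
}
```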

2023-05-15 16:14:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On Mon, May 15, 2023 at 8:52 AM Jens Axboe <[email protected]> wrote:
>
> On 5/15/23 9:44 AM, Linus Torvalds wrote:
> >
> > So I think the patch should just look something like the attached.
> > Mike, can you test this on whatever vhost test-suite?
>
> Seems like that didn't get attached...

Blush. I decided to build-test it, and then forgot to attach it. Here.

Linus


Attachments:
patch.diff (1.67 kB)

2023-05-15 17:35:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On Mon, May 15, 2023 at 8:54 AM Linus Torvalds
<[email protected]> wrote:
>
> Blush. I decided to build-test it, and then forgot to attach it. Here.

Btw, if this tests out good and we end up doing this, I think we
should also just rename that '.ignore_signals' bitfield to
'.block_signals' to actually match what it does.

But that's an entirely cosmetic thing just to clarify things. The
patch would look almost identical, apart from the new name (and the
small additional parts to rename the two existing users that weren't
touched by that patch - the header file and the vhost use-case).

Linus

2023-05-15 22:50:53

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On 5/15/23 10:44 AM, Linus Torvalds wrote:
> On Mon, May 15, 2023 at 7:23 AM Christian Brauner <[email protected]> wrote:
>>
>> So I think we will be able to address (1) and (2) by making vhost tasks
>> proper threads and blocking every signal except for SIGKILL and SIGSTOP
>> and then having vhost handle get_signal() - as you mentioned - the same
>> way io_uring already does. We should also remove the ignore_signals
>> thing completely imho. I don't think we ever want to do this with user
>> workers.
>
> Right. That's what IO_URING does:
>
> if (args->io_thread) {
> /*
> * Mark us an IO worker, and block any signal that isn't
> * fatal or STOP
> */
> p->flags |= PF_IO_WORKER;
> siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
> }
>
> and I really think that vhost should basically do exactly what io_uring does.
>
> Not because io_uring fundamentally got this right - but simply because
> io_uring had almost all the same bugs (and then some), and what the
> io_uring worker threads ended up doing was to basically zoom in on
> "this works".
>
> And it zoomed in on it largely by just going for "make it look as much
> as possible as a real user thread", because every time the kernel
> thread did something different, it just caused problems.
>
> So I think the patch should just look something like the attached.
> Mike, can you test this on whatever vhost test-suite?

I tried that approach already and it doesn't work because io_uring and vhost
differ in that vhost drivers implement a device where each device has a vhost_task
and the drivers have a file_operations for the device. When the vhost_task's
parent gets a signal like SIGKILL, then it will exit and call into the vhost
driver's file_operations->release function. At this time, we need to do cleanup
like flush the device which uses the vhost_task. There is also the case where if
the vhost_task gets a SIGKILL, we can just exit from under the vhost layer.

io_uring has a similar cleanup issue where the core kernel code can't do
exit for the io thread, but it only has that one point that it has to worry
about so when it gets SIGKILL it can clean itself up then exit.

So the patch in the other mail hits an issue where vhost_worker() can get into
a tight loop hammering the CPU due to the pending SIGKILL signal.

The vhost layer really doesn't want any signals and wants to work like kthreads
for that case. To make it really simple can we do something like this where it
separates user and io worker behavior where the major diff is how they handle
signals and exit. I also included a fix for the freeze case:



diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..e0f5ac90a228 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,6 @@ struct kernel_clone_args {
u32 io_thread:1;
u32 user_worker:1;
u32 no_files:1;
- u32 ignore_signals:1;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..fd2970b598b2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2336,8 +2336,15 @@ __latent_entropy struct task_struct *copy_process(
p->flags &= ~PF_KTHREAD;
if (args->kthread)
p->flags |= PF_KTHREAD;
- if (args->user_worker)
+ if (args->user_worker) {
+ /*
+ * User workers are similar to io_threads, but they do not
+ * support signals and cleanup is driven via another kernel
+ * interface so even SIGKILL is blocked.
+ */
p->flags |= PF_USER_WORKER;
+ siginitsetinv(&p->blocked, 0);
+ }
if (args->io_thread) {
/*
* Mark us an IO worker, and block any signal that isn't
@@ -2517,8 +2524,8 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;

- if (args->ignore_signals)
- ignore_signals(p);
+ if (args->user_worker)
+ p->flags |= PF_NOFREEZE;

stackleak_task_init(p);

@@ -2860,7 +2867,6 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
.fn = fn,
.fn_arg = arg,
.io_thread = 1,
- .user_worker = 1,
};

return copy_process(NULL, 0, node, &args);
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..f2f1e5ef44b2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -995,6 +995,19 @@ static inline bool wants_signal(int sig, struct task_struct *p)
return task_curr(p) || !task_sigpending(p);
}

+static void try_set_pending_sigkill(struct task_struct *t)
+{
+ /*
+ * User workers don't support signals and their exit is driven through
+ * their kernel layer, so do not send them SIGKILL.
+ */
+ if (t->flags & PF_USER_WORKER)
+ return;
+
+ sigaddset(&t->pending.signal, SIGKILL);
+ signal_wake_up(t, 1);
+}
+
static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
{
struct signal_struct *signal = p->signal;
@@ -1055,8 +1068,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
t = p;
do {
task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
- sigaddset(&t->pending.signal, SIGKILL);
- signal_wake_up(t, 1);
+ try_set_pending_sigkill(t);
} while_each_thread(p, t);
return;
}
@@ -1373,8 +1385,7 @@ int zap_other_threads(struct task_struct *p)
/* Don't bother with already dead threads */
if (t->exit_state)
continue;
- sigaddset(&t->pending.signal, SIGKILL);
- signal_wake_up(t, 1);
+ try_set_pending_sigkill(t);
}

return count;
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..2d8d3ebaec4d 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -75,13 +75,13 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
const char *name)
{
struct kernel_clone_args args = {
- .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+ .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM |
+ CLONE_THREAD | CLONE_SIGHAND,
.exit_signal = 0,
.fn = vhost_task_fn,
.name = name,
.user_worker = 1,
.no_files = 1,
- .ignore_signals = 1,
};
struct vhost_task *vtsk;
struct task_struct *tsk;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d257916f39e5..255a2147e5c1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1207,12 +1207,12 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
DEFINE_WAIT(wait);

/*
- * Do not throttle user workers, kthreads other than kswapd or
+ * Do not throttle IO/user workers, kthreads other than kswapd or
* workqueues. They may be required for reclaim to make
* forward progress (e.g. journalling workqueues or kthreads).
*/
if (!current_is_kswapd() &&
- current->flags & (PF_USER_WORKER|PF_KTHREAD)) {
+ current->flags & (PF_USER_WORKER|PF_IO_WORKER|PF_KTHREAD)) {
cond_resched();
return;
}
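
(Why that tight loop happens, in schematic form. This is illustrative
pseudo-kernel code, not vhost source; work_available() and handle_work() are
invented names:)

```c
/* A TASK_INTERRUPTIBLE task with a pending signal is kept runnable,
 * so a worker loop that never dequeues the signal via get_signal()
 * spins instead of sleeping. */
for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (work_available()) {
                __set_current_state(TASK_RUNNING);
                handle_work();
        } else {
                /* With SIGKILL pending forever, schedule() returns
                 * immediately and this loop burns a CPU at 100%. */
                schedule();
        }
}
```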


2023-05-15 23:14:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On Mon, May 15, 2023 at 3:23 PM Mike Christie
<[email protected]> wrote:
>
> The vhost layer really doesn't want any signals and wants to work like kthreads
> for that case. To make it really simple can we do something like this where it
> separates user and io worker behavior where the major diff is how they handle
> signals and exit. I also included a fix for the freeze case:

I don't love the SIGKILL special case, but I also don't find this
deeply offensive. So if this is what it takes, I'm ok with it.

I wonder if we could make that special case simply check for "is
SIGKILL blocked" instead? No normal case will cause that, and it means
that a PF_USER_WORKER thread could decide per-thread what it wants to
do wrt SIGKILL.

Christian? And I guess we should Cc: Oleg too, since the signal parts
are an area he's familiar with and has worked on. Eric Biederman has
already been on the list and has also been involved.

Oleg: see

https://lore.kernel.org/lkml/[email protected]/

for the context here.

Linus

2023-05-16 04:19:26

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On 5/15/23 5:54 PM, Linus Torvalds wrote:
> On Mon, May 15, 2023 at 3:23 PM Mike Christie
> <[email protected]> wrote:
>>
>> The vhost layer really doesn't want any signals and wants to work like kthreads
>> for that case. To make it really simple can we do something like this where it
>> separates user and io worker behavior where the major diff is how they handle
>> signals and exit. I also included a fix for the freeze case:
>
> I don't love the SIGKILL special case, but I also don't find this
> deeply offensive. So if this is what it takes, I'm ok with it.
>
> I wonder if we could make that special case simply check for "is
> SIGKILL blocked" instead? No normal case will cause that, and it means

Yeah, it's doable. Updated below.

> that a PF_USER_WORKER thread could decide per-thread what it wants to
> do wrt SIGKILL.
>
> Christian? And I guess we should Cc: Oleg too, since the signal parts
> are an area he's familiar with and has worked on. Eric Biederman has
> already been on the list and has also been involved.
>
> Oleg: see
>
> https://lore.kernel.org/lkml/[email protected]/
>
> for the context here.

Oleg and Christian,


Below is an updated patch that doesn't check for PF_USER_WORKER in the
signal.c code and instead will check if we have blocked the signal.




diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..e0f5ac90a228 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,6 @@ struct kernel_clone_args {
u32 io_thread:1;
u32 user_worker:1;
u32 no_files:1;
- u32 ignore_signals:1;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..fd2970b598b2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2336,8 +2336,15 @@ __latent_entropy struct task_struct *copy_process(
p->flags &= ~PF_KTHREAD;
if (args->kthread)
p->flags |= PF_KTHREAD;
- if (args->user_worker)
+ if (args->user_worker) {
+ /*
+ * User workers are similar to io_threads, but they do not
+ * support signals and cleanup is driven via another kernel
+ * interface so even SIGKILL is blocked.
+ */
p->flags |= PF_USER_WORKER;
+ siginitsetinv(&p->blocked, 0);
+ }
if (args->io_thread) {
/*
* Mark us an IO worker, and block any signal that isn't
@@ -2517,8 +2524,8 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;

- if (args->ignore_signals)
- ignore_signals(p);
+ if (args->user_worker)
+ p->flags |= PF_NOFREEZE;

stackleak_task_init(p);

@@ -2860,7 +2867,6 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
.fn = fn,
.fn_arg = arg,
.io_thread = 1,
- .user_worker = 1,
};

return copy_process(NULL, 0, node, &args);
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..bc7e26072437 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -995,6 +995,19 @@ static inline bool wants_signal(int sig, struct task_struct *p)
return task_curr(p) || !task_sigpending(p);
}

+static void try_set_pending_sigkill(struct task_struct *t)
+{
+ /*
+ * User workers don't support signals and their exit is driven through
+ * their kernel layer, so by default block even SIGKILL.
+ */
+ if (sigismember(&t->blocked, SIGKILL))
+ return;
+
+ sigaddset(&t->pending.signal, SIGKILL);
+ signal_wake_up(t, 1);
+}
+
static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
{
struct signal_struct *signal = p->signal;
@@ -1055,8 +1068,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
t = p;
do {
task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
- sigaddset(&t->pending.signal, SIGKILL);
- signal_wake_up(t, 1);
+ try_set_pending_sigkill(t);
} while_each_thread(p, t);
return;
}
@@ -1373,8 +1385,7 @@ int zap_other_threads(struct task_struct *p)
/* Don't bother with already dead threads */
if (t->exit_state)
continue;
- sigaddset(&t->pending.signal, SIGKILL);
- signal_wake_up(t, 1);
+ try_set_pending_sigkill(t);
}

return count;
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..2d8d3ebaec4d 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -75,13 +75,13 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
const char *name)
{
struct kernel_clone_args args = {
- .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+ .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM |
+ CLONE_THREAD | CLONE_SIGHAND,
.exit_signal = 0,
.fn = vhost_task_fn,
.name = name,
.user_worker = 1,
.no_files = 1,
- .ignore_signals = 1,
};
struct vhost_task *vtsk;
struct task_struct *tsk;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d257916f39e5..255a2147e5c1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1207,12 +1207,12 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
DEFINE_WAIT(wait);

/*
- * Do not throttle user workers, kthreads other than kswapd or
+ * Do not throttle IO/user workers, kthreads other than kswapd or
* workqueues. They may be required for reclaim to make
* forward progress (e.g. journalling workqueues or kthreads).
*/
if (!current_is_kswapd() &&
- current->flags & (PF_USER_WORKER|PF_KTHREAD)) {
+ current->flags & (PF_USER_WORKER|PF_IO_WORKER|PF_KTHREAD)) {
cond_resched();
return;
}



2023-05-16 08:53:17

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On Mon, May 15, 2023 at 05:23:12PM -0500, Mike Christie wrote:
> On 5/15/23 10:44 AM, Linus Torvalds wrote:
> > On Mon, May 15, 2023 at 7:23 AM Christian Brauner <[email protected]> wrote:
> >>
> >> So I think we will be able to address (1) and (2) by making vhost tasks
> >> proper threads and blocking every signal except for SIGKILL and SIGSTOP
> >> and then having vhost handle get_signal() - as you mentioned - the same
> >> way io_uring already does. We should also remove the ignore_signals
> >> thing completely imho. I don't think we ever want to do this with user
> >> workers.
> >
> > Right. That's what IO_URING does:
> >
> > if (args->io_thread) {
> > /*
> > * Mark us an IO worker, and block any signal that isn't
> > * fatal or STOP
> > */
> > p->flags |= PF_IO_WORKER;
> > siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
> > }
> >
> > and I really think that vhost should basically do exactly what io_uring does.
> >
> > Not because io_uring fundamentally got this right - but simply because
> > io_uring had almost all the same bugs (and then some), and what the
> > io_uring worker threads ended up doing was to basically zoom in on
> > "this works".
> >
> > And it zoomed in on it largely by just going for "make it look as much
> > as possible as a real user thread", because every time the kernel
> > thread did something different, it just caused problems.
> >
> > So I think the patch should just look something like the attached.
> > Mike, can you test this on whatever vhost test-suite?
>
> I tried that approach already and it doesn't work because io_uring and vhost
> differ in that vhost drivers implement a device where each device has a vhost_task
> and the drivers have a file_operations for the device. When the vhost_task's
> parent gets a signal like SIGKILL, then it will exit and call into the vhost
> driver's file_operations->release function. At this time, we need to do cleanup

But that's no reason why the vhost worker couldn't just be allowed to
exit on SIGKILL cleanly similar to io_uring. That's just describing the
current architecture which isn't a necessity afaict. And the helper
thread could e.g., crash.

> like flush the device which uses the vhost_task. There is also the case where if
> the vhost_task gets a SIGKILL, we can just exit from under the vhost layer.

In a way I really don't like the patch below. Because this should be
solvable by adapting vhost workers. Right now, vhost is coming from a
kthread model and we ported it to a user worker model and the whole
point of this excercise has been that the workers behave more like
regular userspace processes. So my tendency is to not massage kernel
signal handling to now also include a special case for user workers in
addition to kthreads. That's just the wrong way around and then vhost
could've just stuck with kthreads in the first place.

So I'm fine with skipping over the freezing case for now but SIGKILL
should be handled imho. Only init and kthreads should get the luxury of
ignoring SIGKILL.

So, I'm afraid I'm asking some work of you here, but how feasible would a
model be where vhost_worker(), similar to io_wq_worker(), gracefully
handles SIGKILL? Yes, I see there's

net.c: .release = vhost_net_release
scsi.c: .release = vhost_scsi_release
test.c: .release = vhost_test_release
vdpa.c: .release = vhost_vdpa_release
vsock.c: .release = virtio_transport_release
vsock.c: .release = vhost_vsock_dev_release

but that means you have all the basic logic in place and all of those
drivers also support the VHOST_RESET_OWNER ioctl which also stops the
vhost worker. I'm confident that a lot of this can be leveraged to just
clean up on SIGKILL.

So it feels like this should be achievable by adding a callback to
struct vhost_worker that gets called when vhost_worker() gets SIGKILL
and that all the users of vhost workers are forced to implement.

Yes, it is more work but I think that's the right thing to do and not to
complicate our signal handling.

Worst case, if this can't be done fast enough, we'll have to revert the
vhost parts. I think the user worker parts are mostly sane and are
useful.
Thoughts?
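
(A sketch of the callback shape proposed above; the struct layout and the
names vhost_worker_ops, kill, and vhost_run_work are hypothetical, invented
for illustration:)

```c
/* Hypothetical: each vhost driver supplies a hook the worker runs when
 * get_signal() reports a fatal signal, so the task can flush its work
 * and exit cleanly instead of ignoring SIGKILL. */
struct vhost_worker_ops {
        void (*kill)(struct vhost_worker *worker);      /* flush + release */
};

static int vhost_worker(void *data)
{
        struct vhost_worker *worker = data;

        for (;;) {
                if (signal_pending(current)) {
                        struct ksignal ksig;

                        if (get_signal(&ksig)) {
                                worker->ops->kill(worker);      /* driver cleanup */
                                break;
                        }
                }
                vhost_run_work(worker);         /* illustrative work loop */
        }
        return 0;
}
```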

2023-05-16 13:34:47

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On 05/15, Mike Christie wrote:
>
> Oleg and Christian,
>
>
> Below is an updated patch that doesn't check for PF_USER_WORKER in the
> signal.c code and instead will check if we have blocked the signal.

Looks like I need to read the whole series... will try tomorrow.

> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2336,8 +2336,15 @@ __latent_entropy struct task_struct *copy_process(
> p->flags &= ~PF_KTHREAD;
> if (args->kthread)
> p->flags |= PF_KTHREAD;
> - if (args->user_worker)
> + if (args->user_worker) {
> + /*
> > + * User workers are similar to io_threads, but they do not
> + * support signals and cleanup is driven via another kernel
> + * interface so even SIGKILL is blocked.
> + */
> p->flags |= PF_USER_WORKER;
> + siginitsetinv(&p->blocked, 0);

I never liked the fact that io-threads block the signals; this adds
another precedent... OK, this needs another discussion.

> +static void try_set_pending_sigkill(struct task_struct *t)
> +{
> + /*
> + * User workers don't support signals and their exit is driven through
> + * their kernel layer, so by default block even SIGKILL.
> + */
> + if (sigismember(&t->blocked, SIGKILL))
> + return;
> +
> + sigaddset(&t->pending.signal, SIGKILL);
> + signal_wake_up(t, 1);
> +}

So why do you need this? To avoid fatal_signal_pending() or signal_pending()?

In the latter case this change is not enough.

Oleg.


2023-05-16 13:52:57

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On 05/15, Mike Christie wrote:
>
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -75,13 +75,13 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
> const char *name)
> {
> struct kernel_clone_args args = {
> - .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM,
> + .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM |
> + CLONE_THREAD | CLONE_SIGHAND,

I am looking at 6/8 on https://lore.kernel.org/lkml/ ...

With this change, kernel_wait4() in vhost_task_stop() won't work?

Oleg.
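
(Context for the question, as a simplified-from-memory sketch of the
vhost_task_stop() shape in that series, not the verbatim code: the worker was
reaped with kernel_wait4(), which only works for a child process. With
CLONE_THREAD the worker becomes a thread of the caller and is no longer
visible to the wait family, so this call can never reap it.)

```c
/* Simplified sketch, not the verbatim series code. */
void vhost_task_stop(struct vhost_task *vtsk)
{
        pid_t pid = vtsk->task->pid;

        set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
        wake_up_process(vtsk->task);
        /* Reap the worker; breaks if the worker is a CLONE_THREAD
         * sibling rather than a child process. */
        kernel_wait4(pid, NULL, __WCLONE, NULL);
        kfree(vtsk);
}
```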


2023-05-16 14:19:59

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few templates
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 05.05.23 15:40, Nicolas Dichtel wrote:
> On 03/02/2023 at 00:25, Mike Christie wrote:
>> For vhost workers we use the kthread API which inherits its values from
>> and checks against the kthreadd thread. This results in the wrong RLIMITs
>> being checked, so while tools like libvirt try to control the number of
>> threads based on the nproc rlimit setting we can end up creating more
>> threads than the user wanted.
>
> I have a question about (a side effect of?) this patch. The output of the 'ps'
> command has changed. Here is an example:
> [...]

Thanks for the report. This is already dealt with, but to be sure the
issue doesn't fall through the cracks unnoticed, I'm adding it to
regzbot, the Linux kernel regression tracking bot:

#regzbot ^introduced 6e890c5d502
#regzbot title vhost: ps output changed and suspend fails when VMs are
running
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

2023-05-16 16:07:59

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

Linus Torvalds <[email protected]> writes:

> On Mon, May 15, 2023 at 3:23 PM Mike Christie
> <[email protected]> wrote:
>>
>> The vhost layer really doesn't want any signals and wants to work like kthreads
>> for that case. To make it really simple can we do something like this where it
>> separates user and io worker behavior where the major diff is how they handle
>> signals and exit. I also included a fix for the freeze case:
>
> I don't love the SIGKILL special case, but I also don't find this
> deeply offensive. So if this is what it takes, I'm ok with it.
>
> I wonder if we could make that special case simply check for "is
> SIGKILL blocked" instead? No normal case will cause that, and it means
> that a PF_USER_WORKER thread could decide per-thread what it wants to
> do wrt SIGKILL.

A kernel thread can block SIGKILL and that is supported.

For a thread that is part of a process you can't block SIGKILL when the
task is part of a user mode process.

There is this bit in complete_signal when SIGKILL is delivered to any
thread in the process.


                /*
                 * Start a group exit and wake everybody up.
                 * This way we don't have other threads
                 * running and doing things after a slower
                 * thread has the fatal signal pending.
                 */
                signal->flags = SIGNAL_GROUP_EXIT;
                signal->group_exit_code = sig;
                signal->group_stop_count = 0;
                t = p;
                do {
                        task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
                        sigaddset(&t->pending.signal, SIGKILL);
                        signal_wake_up(t, 1);
                } while_each_thread(p, t);

For clarity, that sigaddset(&t->pending.signal, SIGKILL) really isn't
setting SIGKILL pending; it is part of the short circuit delivery logic,
and that sigaddset(SIGKILL) is just setting a flag to tell the process
it needs to die.


The important part of that code is that SIGNAL_GROUP_EXIT gets set.
That indicates the entire process is being torn down.

Where this becomes important is that exit_notify and release_task work
together to ensure that the first thread in the process (a user space
thread that can not block SIGKILL) will not send SIGCHLD to its parent
process until every thread in the process has exited.

The delay_group_leader() check in wait_consider_task, part of wait(2),
works the same way.

Having been through this with io_uring, the threads really need to call
get_signal() to handle that case.


This is pretty much why I said at the outset that they needed to decide
if they were going to implement a thread or if they were going to be a
process. Changing the decision from a process to a thread is fine, but
in that case the vhost logic needs to act like a process, just
like io_uring does.


> Christian? And I guess we should Cc: Oleg too, since the signal parts
> are an area he's familiar with and has worked on. Eric Biederman has
> already been on the list and has also been involved.

>
> Oleg: see
>
> https://lore.kernel.org/lkml/[email protected]/
>
> for the context here.

Eric
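
(The helpers Eric is alluding to, lightly simplified from
include/linux/sched/signal.h: the SIGKILL bit that complete_signal() sets
during short-circuit delivery is exactly the bit fatal_signal_pending()
tests.)

```c
static inline int __fatal_signal_pending(struct task_struct *p)
{
        /* The bit set by complete_signal()'s while_each_thread() loop. */
        return unlikely(sigismember(&p->pending.signal, SIGKILL));
}

static inline int fatal_signal_pending(struct task_struct *p)
{
        return task_sigpending(p) && __fatal_signal_pending(p);
}
```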


2023-05-16 16:32:14

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On 5/16/23 3:39 AM, Christian Brauner wrote:
> On Mon, May 15, 2023 at 05:23:12PM -0500, Mike Christie wrote:
>> On 5/15/23 10:44 AM, Linus Torvalds wrote:
>>> On Mon, May 15, 2023 at 7:23 AM Christian Brauner <[email protected]> wrote:
>>>>
>>>> So I think we will be able to address (1) and (2) by making vhost tasks
>>>> proper threads and blocking every signal except for SIGKILL and SIGSTOP
>>>> and then having vhost handle get_signal() - as you mentioned - the same
>>>> way io_uring already does. We should also remove the ignore_signals
>>>> thing completely imho. I don't think we ever want to do this with user
>>>> workers.
>>>
>>> Right. That's what IO_URING does:
>>>
>>> if (args->io_thread) {
>>> /*
>>> * Mark us an IO worker, and block any signal that isn't
>>> * fatal or STOP
>>> */
>>> p->flags |= PF_IO_WORKER;
>>> siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
>>> }
>>>
>>> and I really think that vhost should basically do exactly what io_uring does.
>>>
>>> Not because io_uring fundamentally got this right - but simply because
>>> io_uring had almost all the same bugs (and then some), and what the
>>> io_uring worker threads ended up doing was to basically zoom in on
>>> "this works".
>>>
>>> And it zoomed in on it largely by just going for "make it look as much
>>> as possible as a real user thread", because every time the kernel
>>> thread did something different, it just caused problems.
>>>
>>> So I think the patch should just look something like the attached.
>>> Mike, can you test this on whatever vhost test-suite?
>>
>> I tried that approach already and it doesn't work because io_uring and vhost
>> differ in that vhost drivers implement a device where each device has a vhost_task
>> and the drivers have a file_operations for the device. When the vhost_task's
>> parent gets a signal like SIGKILL, then it will exit and call into the vhost
>> driver's file_operations->release function. At this time, we need to do cleanup
>
> But that's no reason why the vhost worker couldn't just be allowed to
> exit on SIGKILL cleanly similar to io_uring. That's just describing the
> current architecture which isn't a necessity afaict. And the helper
> thread could e.g., crash.
>
>> like flush the device which uses the vhost_task. There is also the case where if
>> the vhost_task gets a SIGKILL, we can just exit from under the vhost layer.
>
> In a way I really don't like the patch below. Because this should be
> solvable by adapting vhost workers. Right now, vhost is coming from a
> kthread model and we ported it to a user worker model and the whole
> point of this exercise has been that the workers behave more like
> regular userspace processes. So my tendency is to not massage kernel
> signal handling to now also include a special case for user workers in
> addition to kthreads. That's just the wrong way around and then vhost
> could've just stuck with kthreads in the first place.

I would have preferred that :) Maybe let's take a step back and revisit
that decision to make sure it was right. The vhost layer wants:

1. inherit cgroups.
2. share mm.
3. no signals
4. to not show up as an extra process like in Nicolas's bug.
5. have its worker threads counted under its parent's nproc limit.

We can do 1 - 4 today with kthreads. Can we do #5 with kthreads? My first
attempt which passed around the creds to use for kthreads or exported a
helper for the nproc accounting was not liked and we eventually ended up
here.

Is this hybrid user/kernel thread/task still the right way to go, or is
it better to use kthreads and add some way to handle #5?


>
> So I'm fine with skipping over the freezing case for now but SIGKILL
> should be handled imho. Only init and kthreads should get the luxury of
> ignoring SIGKILL.
>
> So, I'm afraid I'm asking some work of you here, but how feasible would a
> model be where vhost_worker(), similar to io_wq_worker(), gracefully
> handles SIGKILL? Yes, I see there's
>
> net.c: .release = vhost_net_release
> scsi.c: .release = vhost_scsi_release
> test.c: .release = vhost_test_release
> vdpa.c: .release = vhost_vdpa_release
> vsock.c: .release = virtio_transport_release
> vsock.c: .release = vhost_vsock_dev_release
>
> but that means you have all the basic logic in place and all of those
> drivers also support the VHOST_RESET_OWNER ioctl which also stops the
> vhost worker. I'm confident that a lot of this can be leveraged to just
> clean up on SIGKILL.

We can do this, but the issue I'm worried about is that right now if there
is queued/running IO and userspace escalates to SIGKILL, then the vhost layer
will still complete those IOs. If we now allow SIGKILL on the vhost thread,
then those IOs might fail.

If we get a SIGKILL, I can modify vhost_worker() so that it temporarily
ignores the signal and allows IO/flushes/whatever-operations to complete
at that level. However, we could hit issues where, when vhost_worker()
calls into the drivers listed above, and those drivers call into whatever
kernel layer they use, that layer might do

        if (signal_pending(current))
                return failure;

and we now fail.

If we say that since we got a SIGKILL, then failing is acceptable behavior
now, I can code what you are requesting.
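
(One shape the "temporarily ignore it and keep flushing" idea could take; a
hypothetical sketch, with the killed bookkeeping and the
vhost_run_work_list() return value invented for illustration:)

```c
/* Hypothetical drain-then-exit behavior. */
static int vhost_worker(void *data)
{
        struct vhost_worker *worker = data;
        bool killed = false;

        for (;;) {
                if (!killed && signal_pending(current)) {
                        struct ksignal ksig;

                        /* Dequeue the fatal signal so lower layers'
                         * signal_pending() checks stop failing in-flight
                         * IO, then keep running until drained. */
                        killed = get_signal(&ksig);
                }

                /* Assume this returns false once no work is queued. */
                if (!vhost_run_work_list(worker) && killed)
                        break;
        }
        return 0;
}
```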




2023-05-16 16:46:47

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On Tue, May 16, 2023 at 11:24:48AM -0500, Mike Christie wrote:
> On 5/16/23 3:39 AM, Christian Brauner wrote:
> > On Mon, May 15, 2023 at 05:23:12PM -0500, Mike Christie wrote:
> >> On 5/15/23 10:44 AM, Linus Torvalds wrote:
> >>> On Mon, May 15, 2023 at 7:23 AM Christian Brauner <[email protected]> wrote:
> >>>>
> >>>> So I think we will be able to address (1) and (2) by making vhost tasks
> >>>> proper threads and blocking every signal except for SIGKILL and SIGSTOP
> >>>> and then having vhost handle get_signal() - as you mentioned - the same
> >>>> way io_uring already does. We should also remove the ignore_signals
> >>>> thing completely imho. I don't think we ever want to do this with user
> >>>> workers.
> >>>
> >>> Right. That's what IO_URING does:
> >>>
> >>> if (args->io_thread) {
> >>> /*
> >>> * Mark us an IO worker, and block any signal that isn't
> >>> * fatal or STOP
> >>> */
> >>> p->flags |= PF_IO_WORKER;
> >>> siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
> >>> }
> >>>
> >>> and I really think that vhost should basically do exactly what io_uring does.
> >>>
> >>> Not because io_uring fundamentally got this right - but simply because
> >>> io_uring had almost all the same bugs (and then some), and what the
> >>> io_uring worker threads ended up doing was to basically zoom in on
> >>> "this works".
> >>>
> >>> And it zoomed in on it largely by just going for "make it look as much
> >>> as possible as a real user thread", because every time the kernel
> >>> thread did something different, it just caused problems.
> >>>
> >>> So I think the patch should just look something like the attached.
> >>> Mike, can you test this on whatever vhost test-suite?
> >>
> >> I tried that approach already and it doesn't work because io_uring and vhost
> >> differ in that vhost drivers implement a device where each device has a vhost_task
> >> and the drivers have a file_operations for the device. When the vhost_task's
> >> parent gets a signal like SIGKILL, then it will exit and call into the vhost
> >> driver's file_operations->release function. At this time, we need to do cleanup
> >
> > But that's no reason why the vhost worker couldn't just be allowed to
> > exit on SIGKILL cleanly similar to io_uring. That's just describing the
> > current architecture which isn't a necessity afaict. And the helper
> > thread could e.g., crash.
> >
> >> like flush the device which uses the vhost_task. There is also the case where if
> >> the vhost_task gets a SIGKILL, we can just exit from under the vhost layer.
> >
> > In a way I really don't like the patch below. Because this should be
> > solvable by adapting vhost workers. Right now, vhost is coming from a
> > kthread model and we ported it to a user worker model and the whole
> > point of this exercise has been that the workers behave more like
> > regular userspace processes. So my tendency is to not massage kernel
> > signal handling to now also include a special case for user workers in
> > addition to kthreads. That's just the wrong way around and then vhost
> > could've just stuck with kthreads in the first place.
>
> I would have preferred that :) Maybe let's take a step back and revisit
> that decision to make sure it was right. The vhost layer wants:
>
> 1. inherit cgroups.
> 2. share mm.
> 3. no signals
> 4. to not show up as an extra process like in Nicolas's bug.
> 5. have its worker threads counted under its parent's nproc limit.
>
> We can do 1 - 4 today with kthreads. Can we do #5 with kthreads? My first
> attempt which passed around the creds to use for kthreads or exported a
> helper for the nproc accounting was not liked and we eventually ended up
> here.
>
> Is this hybrid user/kernel thread/task still the right way to go, or is
> it better to use kthreads and add some way to handle #5?

I think the io_uring model makes a lot more sense for vhost than the
current approach.

>
>
> >
> > So I'm fine with skipping over the freezing case for now but SIGKILL
> > should be handled imho. Only init and kthreads should get the luxury of
> > ignoring SIGKILL.
> >
> > So, I'm afraid I'm asking some work of you here, but how feasible would a
> > model be where vhost_worker(), similar to io_wq_worker(), gracefully
> > handles SIGKILL? Yes, I see there's
> >
> > net.c: .release = vhost_net_release
> > scsi.c: .release = vhost_scsi_release
> > test.c: .release = vhost_test_release
> > vdpa.c: .release = vhost_vdpa_release
> > vsock.c: .release = virtio_transport_release
> > vsock.c: .release = vhost_vsock_dev_release
> >
> > but that means you have all the basic logic in place and all of those
> > drivers also support the VHOST_RESET_OWNER ioctl which also stops the
> > vhost worker. I'm confident that a lot of this can be leveraged to just
> > clean up on SIGKILL.
>
> We can do this, but the issue I'm worried about is that right now if there
> is queued/running IO and userspace escalates to SIGKILL, then the vhost layer
> will still complete those IOs. If we now allow SIGKILL on the vhost thread,
> then those IOs might fail.
>
> If we get a SIGKILL, I can modify vhost_worker() so that it temporarily
> ignores the signal and allows IO/flushes/whatever-operations to complete
> at that level. However, we could hit issues where, when vhost_worker()

It's really not that different from io_uring though which also flushes
out remaining io, no? This seems to basically line up with what
io_wq_worker() does.

> calls into the drivers listed above, and those drivers call into whatever
> kernel layer they use, that layer might do
>
> if (signal_pending(current))
> return failure;
>
> and we now fail.
>
> If we say that since we got a SIGKILL, then failing is acceptable behavior
> now, I can code what you are requesting.

I think this is fine but I don't maintain vhost and we'd need their
opinion.

2023-05-16 18:43:33

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On 05/16, Eric W. Biederman wrote:
>
> A kernel thread can block SIGKILL and that is supported.
>
> For a thread that is part of a process you can't block SIGKILL when the
> task is part of a user mode process.

Or SIGSTOP. Another thread can call do_signal_stop()->signal_wake_up/etc.

> There is this bit in complete_signal when SIGKILL is delivered to any
> thread in the process.
>
> t = p;
> do {
> task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
> sigaddset(&t->pending.signal, SIGKILL);
> signal_wake_up(t, 1);
> } while_each_thread(p, t);

That is why the latest version adds try_set_pending_sigkill(). No, no,
it is not that I think this is a good idea.

> For clarity, that sigaddset(&t->pending.signal, SIGKILL) really isn't
> setting SIGKILL pending,

Hmm. it does? Nevermind.

> The important part of that code is that SIGNAL_GROUP_EXIT gets set.
> That indicates the entire process is being torn down.

Yes. and the same is true for io-thread even if it calls get_signal()
and dequeues SIGKILL and clears TIF_SIGPENDING.

> but in that case the vhost logic needs to act like a process, just
> like io_uring does.

confused... create_io_thread() creates a sub-thread too?

Although I never understood this logic. I can't even understand the usage
of lower_32_bits() in create_io_thread().

Oleg.


2023-05-16 20:16:02

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

Oleg Nesterov <[email protected]> writes:

> On 05/16, Eric W. Biederman wrote:
>>
>> A kernel thread can block SIGKILL and that is supported.
>>
>> For a thread that is part of a process you can't block SIGKILL when the
>> task is part of a user mode process.
>
> Or SIGSTOP. Another thread can call do_signal_stop()->signal_wake_up/etc.

Yes, ignoring SIGSTOP leads to the same kind of rendezvous issues as
SIGKILL.

>> There is this bit in complete_signal when SIGKILL is delivered to any
>> thread in the process.
>>
>> t = p;
>> do {
>> task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
>> sigaddset(&t->pending.signal, SIGKILL);
>> signal_wake_up(t, 1);
>> } while_each_thread(p, t);
>
> That is why the latest version adds try_set_pending_sigkill(). No, no,
> it is not that I think this is a good idea.

I see that try_set_pending_sigkill in the patch now.

That try_set_pending_sigkill just keeps the process from reporting
that it has exited, and can extend the process exit indefinitely.

SIGNAL_GROUP_EXIT has already been set, so the KILL signal was
already delivered and the process is exiting.

>> For clarity, that sigaddset(&t->pending.signal, SIGKILL) really isn't
>> setting SIGKILL pending,
>
> Hmm. it does? Nevermind.

The point is that what try_set_pending_sigkill in the patch is doing is
keeping the "you are dead exit now" flag, from being set.

That flag is what fatal_signal_pending always tests, because we can only
know if a fatal signal is pending if we have performed short circuit
delivery on the signal.

The result is that the effects of the change are mostly what people expect.
The difference is that the semantics being changed aren't what people think
they are.

AKA process exit is being ignored for the thread, not that SIGKILL is
being blocked.

>> The important part of that code is that SIGNAL_GROUP_EXIT gets set.
>> That indicates the entire process is being torn down.
>
> Yes. and the same is true for io-thread even if it calls get_signal()
> and dequeues SIGKILL and clears TIF_SIGPENDING.
>
>> but in that case the vhost logic needs to act like a process, just
>> like io_uring does.
>
> confused... create_io_thread() creates a sub-thread too?

Yes, create_io_thread() creates an ordinary user space thread that never
runs any code in user space.

> Although I never understood this logic. I can't even understand the usage
> of lower_32_bits() in create_io_thread().

As far as I can tell lower_32_bits(flags) is just defensive programming
that copies the code in clone. The code could just as easily have said
u32 flags, or have just populated .flags directly. Then .exit_signal
could have been set to 0. Later copy_process will set .exit_signal = -1
because CLONE_THREAD is set.

The reason for adding create_io_thread calling copy_process, as I recall,
was so that the new task does not start automatically. This allows
functions like io_init_new_worker to initialize the new task without
races and then call wake_up_new_task.

Eric
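
(The pattern Eric describes, condensed from io_uring's worker setup;
simplified, with error handling trimmed:)

```c
/* create_io_thread() -> copy_process() returns the new task without
 * scheduling it, so setup can finish race-free before the wakeup. */
tsk = create_io_thread(io_wq_worker, worker, NUMA_NO_NODE);
if (IS_ERR(tsk))
        return PTR_ERR(tsk);

set_task_comm(tsk, "iou-wrk");  /* safe: tsk is not running yet */
worker->task = tsk;
wake_up_new_task(tsk);          /* only now may io_wq_worker() run */
```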


2023-05-17 17:25:03

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On 05/16, Eric W. Biederman wrote:
>
> Oleg Nesterov <[email protected]> writes:
>
> >> There is this bit in complete_signal when SIGKILL is delivered to any
> >> thread in the process.
> >>
> >> t = p;
> >> do {
> >> task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
> >> sigaddset(&t->pending.signal, SIGKILL);
> >> signal_wake_up(t, 1);
> >> } while_each_thread(p, t);
> >
> > That is why the latest version adds try_set_pending_sigkill(). No, no,
> > it is not that I think this is a good idea.
>
> I see that try_set_pending_sigkill in the patch now.
>
> That try_set_pending_sigkill just keeps the process from reporting
> that it has exited, and can extend the process exit indefinitely.
>
> SIGNAL_GROUP_EXIT has already been set, so the KILL signal was
> already delivered and the process is exiting.

Agreed, that is why I said I don't think try_set_pending_sigkill() is
a good idea.

And again, the same is true for the threads created by
create_io_thread(). get_signal() from io_uring/ can dequeue a pending
SIGKILL and return, but that is all.

> >> For clarity, that sigaddset(&t->pending.signal, SIGKILL) really isn't
> >> setting SIGKILL pending,
> >
> > Hmm. it does? Nevermind.
>
> The point is that what try_set_pending_sigkill in the patch is doing is
> keeping the "you are dead exit now" flag, from being set.
>
> That flag is what fatal_signal_pending always tests, because we can only
> know if a fatal signal is pending if we have performed short circuit
> delivery on the signal.
>
> The result is that the effects of the change are mostly what people expect.
> The difference is that the semantics being changed aren't what people think
> they are.
>
> AKA process exit is being ignored for the thread, not that SIGKILL is
> being blocked.

Sorry, I don't understand. I just tried to say that
sigaddset(&t->pending.signal, SIGKILL) really sets SIGKILL pending.
Nevermind.

> > Although I never understood this logic.

I meant I never really liked how io-threads play with signals,

> > I can't even understand the usage
> > of lower_32_bits() in create_io_thread().
>
> As far as I can tell lower_32_bits(flags) is just defensive programming

Cough. But this is ugly. Or I missed something.

> or have just populated .flags directly.

Exactly,

> Then .exit_signal
> could have been set to 0.

Exactly.

-------------------------------------------------------------------------------
OK. It doesn't matter. I tried to read the whole thread and got lost.

IIUC, Mike is going to send the next version? So I think we can delay
the further discussions until then.

Oleg.


2023-05-17 18:32:03

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On 5/17/23 12:09 PM, Oleg Nesterov wrote:
> IIUC, Mike is going to send the next version? So I think we can delay
> the further discussions until then.

Yeah, I'm working on a version that supports signals so it will be easier
to discuss with the vhost devs and you, Christian and Eric.

2023-05-26 09:20:05

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On 16.05.23 16:06, Linux regression tracking #adding (Thorsten Leemhuis)
wrote:
> On 05.05.23 15:40, Nicolas Dichtel wrote:
>> On 03/02/2023 at 00:25, Mike Christie wrote:
>>> For vhost workers we use the kthread API which inherits its values from
>>> and checks against the kthreadd thread. This results in the wrong RLIMITs
>>> being checked, so while tools like libvirt try to control the number of
>>> threads based on the nproc rlimit setting we can end up creating more
>>> threads than the user wanted.
>>
>> I have a question about (a side effect of?) this patch. The output of the 'ps'
>> command has changed. Here is an example:
>> [...]
>
> Thanks for the report. This is already dealt with, but to be sure the
> issue doesn't fall through the cracks unnoticed, I'm adding it to
> regzbot, the Linux kernel regression tracking bot:
>
> #regzbot ^introduced 6e890c5d502
> #regzbot title vhost: ps output changed and suspend fails when VMs are
> running
> #regzbot ignore-activity

#regzbot monitor:
https://lore.kernel.org/all/[email protected]/
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.


2023-06-02 11:46:31

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

[TLDR: This mail is primarily relevant for Linux regression tracking. A
change or fix related to the regression discussed in this thread was
posted or applied, but it did not use a Link: tag to point to the
report, as Linus and the documentation call for. Things happen, no
worries -- but now the regression tracking bot needs to be told manually
about the fix. See link in footer if these mails annoy you.]

On 16.05.23 16:06, Linux regression tracking #adding (Thorsten Leemhuis)
wrote:
> On 05.05.23 15:40, Nicolas Dichtel wrote:
>> On 03/02/2023 at 00:25, Mike Christie wrote:
>>> For vhost workers we use the kthread API which inherits its values from
>>> and checks against the kthreadd thread. This results in the wrong RLIMITs
>>> being checked, so while tools like libvirt try to control the number of
>>> threads based on the nproc rlimit setting we can end up creating more
>>> threads than the user wanted.
>>
>> I have a question about (a side effect of?) this patch. The output of the 'ps'
>> command has changed. Here is an example:
>> [...]
>
> Thanks for the report. This is already dealt with, but to be sure the
> issue doesn't fall through the cracks unnoticed, I'm adding it to
> regzbot, the Linux kernel regression tracking bot:
>
> #regzbot ^introduced 6e890c5d502

#regzbot fix: f9010dbdce911ee1f1af1398a24b1f9f992e0080
#regzbot ignore-activity

BTW, if anyone cares: just checked, it fixes my suspend problem. Thx
everyone!

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

2023-07-20 13:29:32

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
> For vhost workers we use the kthread API which inherits its values from
> and checks against the kthreadd thread. This results in the wrong RLIMITs
> being checked, so while tools like libvirt try to control the number of
> threads based on the nproc rlimit setting we can end up creating more
> threads than the user wanted.
>
> This patch has us use the vhost_task helpers which will inherit its
> values/checks from the thread that owns the device similar to if we did
> a clone in userspace. The vhost threads will now be counted in the nproc
> rlimits. And we get features like cgroups and mm sharing automatically,
> so we can remove those calls.
>
> Signed-off-by: Mike Christie <[email protected]>
> Acked-by: Michael S. Tsirkin <[email protected]>



Hi Mike,
So this seems to have caused a measurable regression in networking
performance (about 30%). Take a look here, and there's a zip file
with detailed measurements attached:

https://bugzilla.redhat.com/show_bug.cgi?id=2222603


Could you take a look please?
You can also ask the reporter questions there assuming you
have or can create a (free) account.



> ---
> drivers/vhost/vhost.c | 58 ++++++++-----------------------------------
> drivers/vhost/vhost.h | 4 +--
> 2 files changed, 13 insertions(+), 49 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 74378d241f8d..d3c7c37b69a7 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -22,11 +22,11 @@
> #include <linux/slab.h>
> #include <linux/vmalloc.h>
> #include <linux/kthread.h>
> -#include <linux/cgroup.h>
> #include <linux/module.h>
> #include <linux/sort.h>
> #include <linux/sched/mm.h>
> #include <linux/sched/signal.h>
> +#include <linux/sched/vhost_task.h>
> #include <linux/interval_tree_generic.h>
> #include <linux/nospec.h>
> #include <linux/kcov.h>
> @@ -256,7 +256,7 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
> * test_and_set_bit() implies a memory barrier.
> */
> llist_add(&work->node, &dev->worker->work_list);
> - wake_up_process(dev->worker->task);
> + wake_up_process(dev->worker->vtsk->task);
> }
> }
> EXPORT_SYMBOL_GPL(vhost_work_queue);
> @@ -336,17 +336,14 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> static int vhost_worker(void *data)
> {
> struct vhost_worker *worker = data;
> - struct vhost_dev *dev = worker->dev;
> struct vhost_work *work, *work_next;
> struct llist_node *node;
>
> - kthread_use_mm(dev->mm);
> -
> for (;;) {
> /* mb paired w/ kthread_stop */
> set_current_state(TASK_INTERRUPTIBLE);
>
> - if (kthread_should_stop()) {
> + if (vhost_task_should_stop(worker->vtsk)) {
> __set_current_state(TASK_RUNNING);
> break;
> }
> @@ -368,7 +365,7 @@ static int vhost_worker(void *data)
> schedule();
> }
> }
> - kthread_unuse_mm(dev->mm);
> +
> return 0;
> }
>
> @@ -509,31 +506,6 @@ long vhost_dev_check_owner(struct vhost_dev *dev)
> }
> EXPORT_SYMBOL_GPL(vhost_dev_check_owner);
>
> -struct vhost_attach_cgroups_struct {
> - struct vhost_work work;
> - struct task_struct *owner;
> - int ret;
> -};
> -
> -static void vhost_attach_cgroups_work(struct vhost_work *work)
> -{
> - struct vhost_attach_cgroups_struct *s;
> -
> - s = container_of(work, struct vhost_attach_cgroups_struct, work);
> - s->ret = cgroup_attach_task_all(s->owner, current);
> -}
> -
> -static int vhost_attach_cgroups(struct vhost_dev *dev)
> -{
> - struct vhost_attach_cgroups_struct attach;
> -
> - attach.owner = current;
> - vhost_work_init(&attach.work, vhost_attach_cgroups_work);
> - vhost_work_queue(dev, &attach.work);
> - vhost_dev_flush(dev);
> - return attach.ret;
> -}
> -
> /* Caller should have device mutex */
> bool vhost_dev_has_owner(struct vhost_dev *dev)
> {
> @@ -580,14 +552,14 @@ static void vhost_worker_free(struct vhost_dev *dev)
>
> dev->worker = NULL;
> WARN_ON(!llist_empty(&worker->work_list));
> - kthread_stop(worker->task);
> + vhost_task_stop(worker->vtsk);
> kfree(worker);
> }
>
> static int vhost_worker_create(struct vhost_dev *dev)
> {
> struct vhost_worker *worker;
> - struct task_struct *task;
> + struct vhost_task *vtsk;
> int ret;
>
> worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
> @@ -595,27 +567,19 @@ static int vhost_worker_create(struct vhost_dev *dev)
> return -ENOMEM;
>
> dev->worker = worker;
> - worker->dev = dev;
> worker->kcov_handle = kcov_common_handle();
> init_llist_head(&worker->work_list);
>
> - task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
> - if (IS_ERR(task)) {
> - ret = PTR_ERR(task);
> + vtsk = vhost_task_create(vhost_worker, worker, NUMA_NO_NODE);
> + if (!vtsk) {
> + ret = -ENOMEM;
> goto free_worker;
> }
>
> - worker->task = task;
> - wake_up_process(task); /* avoid contributing to loadavg */
> -
> - ret = vhost_attach_cgroups(dev);
> - if (ret)
> - goto stop_worker;
> -
> + worker->vtsk = vtsk;
> + vhost_task_start(vtsk, "vhost-%d", current->pid);
> return 0;
>
> -stop_worker:
> - kthread_stop(worker->task);
> free_worker:
> kfree(worker);
> dev->worker = NULL;
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 2f6beab93784..3af59c65025e 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -16,6 +16,7 @@
> #include <linux/irqbypass.h>
>
> struct vhost_work;
> +struct vhost_task;
> typedef void (*vhost_work_fn_t)(struct vhost_work *work);
>
> #define VHOST_WORK_QUEUED 1
> @@ -26,9 +27,8 @@ struct vhost_work {
> };
>
> struct vhost_worker {
> - struct task_struct *task;
> + struct vhost_task *vtsk;
> struct llist_head work_list;
> - struct vhost_dev *dev;
> u64 kcov_handle;
> };
>
> --
> 2.25.1
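
Condensing the conversion above: the separate cgroup-attach step disappears
because the worker is now cloned from the owner thread, and the lifecycle
reduces to a handful of vhost_task calls. A sketch assembled from the hunks
(error labels, kcov setup, and locking trimmed, so illustration rather than
compilable code):

	/* Creation, from the thread doing VHOST_SET_OWNER: */
	vtsk = vhost_task_create(vhost_worker, worker, NUMA_NO_NODE);
	if (!vtsk)		/* returns NULL on failure, not an ERR_PTR */
		return -ENOMEM;
	worker->vtsk = vtsk;
	vhost_task_start(vtsk, "vhost-%d", current->pid);

	/* In vhost_worker(), replacing kthread_should_stop(): */
	if (vhost_task_should_stop(worker->vtsk))
		break;

	/* Teardown, replacing kthread_stop(): */
	vhost_task_stop(worker->vtsk);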


2023-07-23 04:33:11

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On 7/20/23 8:06 AM, Michael S. Tsirkin wrote:
> On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
>> For vhost workers we use the kthread API which inherits its values from
>> and checks against the kthreadd thread. This results in the wrong RLIMITs
>> being checked, so while tools like libvirt try to control the number of
>> threads based on the nproc rlimit setting we can end up creating more
>> threads than the user wanted.
>>
>> This patch has us use the vhost_task helpers which will inherit its
>> values/checks from the thread that owns the device similar to if we did
>> a clone in userspace. The vhost threads will now be counted in the nproc
>> rlimits. And we get features like cgroups and mm sharing automatically,
>> so we can remove those calls.
>>
>> Signed-off-by: Mike Christie <[email protected]>
>> Acked-by: Michael S. Tsirkin <[email protected]>
>
>
> Hi Mike,
> So this seems to have caused a measurable regression in networking
> performance (about 30%). Take a look here, and there's a zip file
> with detailed measurements attached:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=2222603
>
>
> Could you take a look please?
> You can also ask reporter questions there assuming you
> have or can create a (free) account.
>

Sorry for the late reply. I just got home from vacation.

The account creation link seems to be down. I keep getting an
"unable to establish SMTP connection to bz-exim-prod port 25" error.

Can you give me Quan's email?

I think I can replicate the problem. I just need some extra info from Quan:

1. Just double check that they are using RHEL 9 on the host running the VMs.
2. The kernel config
3. Any tuning that was done. Is tuned running in guest and/or host running the
VMs and what profile is being used in each.
4. Number of vCPUs and virtqueues being used.
5. Can they dump the contents of:

/sys/kernel/debug/sched

and

sysctl -a

on the host running the VMs.

6. With the 6.4 kernel, can they also run a quick test and tell me if they set
the scheduler to batch:

ps -T -o comm,pid,tid $QEMU_THREAD

then for each vhost thread do:

chrt -b -p 0 $VHOST_THREAD

Does that end up increasing perf? When I do this I see throughput go up by
around 50% vs 6.3 when sessions were 16 or more (16 was the number of vCPUs
and virtqueues per net device in the VM). Note that I'm not saying that is a fix.
It's just a difference I noticed when running some other tests.
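
For reference, "chrt -b -p 0 <tid>" simply switches the given thread to
SCHED_BATCH at priority 0. A minimal standalone C equivalent might look like
the sketch below (the program and its usage are illustrative, not part of
the test procedure; the TID comes from the ps -T output above):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Move one thread to SCHED_BATCH, like chrt -b -p 0 <tid>. */
int main(int argc, char **argv)
{
	struct sched_param sp = { .sched_priority = 0 };

	if (argc != 2) {
		fprintf(stderr, "usage: %s <tid>\n", argv[0]);
		return 1;
	}

	/* On Linux this call acts on a single thread, so passing a vhost
	 * worker TID retunes just that worker. */
	if (sched_setscheduler((pid_t)atoi(argv[1]), SCHED_BATCH, &sp)) {
		perror("sched_setscheduler");
		return 1;
	}
	return 0;
}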

2023-07-23 10:24:09

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On Sat, Jul 22, 2023 at 11:03:29PM -0500, [email protected] wrote:
> On 7/20/23 8:06 AM, Michael S. Tsirkin wrote:
> > On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
> >> For vhost workers we use the kthread API which inherits its values from
> >> and checks against the kthreadd thread. This results in the wrong RLIMITs
> >> being checked, so while tools like libvirt try to control the number of
> >> threads based on the nproc rlimit setting we can end up creating more
> >> threads than the user wanted.
> >>
> >> This patch has us use the vhost_task helpers which will inherit its
> >> values/checks from the thread that owns the device similar to if we did
> >> a clone in userspace. The vhost threads will now be counted in the nproc
> >> rlimits. And we get features like cgroups and mm sharing automatically,
> >> so we can remove those calls.
> >>
> >> Signed-off-by: Mike Christie <[email protected]>
> >> Acked-by: Michael S. Tsirkin <[email protected]>
> >
> >
> > Hi Mike,
> > So this seems to have caused a measurable regression in networking
> > performance (about 30%). Take a look here, and there's a zip file
> > with detailed measurements attached:
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=2222603
> >
> >
> > Could you take a look please?
> > You can also ask reporter questions there assuming you
> > have or can create a (free) account.
> >
>
> Sorry for the late reply. I just got home from vacation.
>
> The account creation link seems to be down. I keep getting an
> "unable to establish SMTP connection to bz-exim-prod port 25" error.
>
> Can you give me Quan's email?

Thanks for getting back! I asked whether it's ok to share the email.
For now pasted your request in the bugzilla.

> I think I can replicate the problem. I just need some extra info from Quan:
>
> 1. Just double check that they are using RHEL 9 on the host running the VMs.
> 2. The kernel config
> 3. Any tuning that was done. Is tuned running in guest and/or host running the
> VMs and what profile is being used in each.
> 4. Number of vCPUs and virtqueues being used.
> 5. Can they dump the contents of:
>
> /sys/kernel/debug/sched
>
> and
>
> sysctl -a
>
> on the host running the VMs.
>
> 6. With the 6.4 kernel, can they also run a quick test and tell me if they set
> the scheduler to batch:
>
> ps -T -o comm,pid,tid $QEMU_THREAD
>
> then for each vhost thread do:
>
> chrt -b -p 0 $VHOST_THREAD
>
> Does that end up increasing perf? When I do this I see throughput go up by
> around 50% vs 6.3 when sessions were 16 or more (16 was the number of vCPUs
> and virtqueues per net device in the VM). Note that I'm not saying that is a fix.
> It's just a difference I noticed when running some other tests.


2023-08-10 19:53:02

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On Sat, Jul 22, 2023 at 11:03:29PM -0500, [email protected] wrote:
> On 7/20/23 8:06 AM, Michael S. Tsirkin wrote:
> > On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
> >> For vhost workers we use the kthread API which inherits its values from
> >> and checks against the kthreadd thread. This results in the wrong RLIMITs
> >> being checked, so while tools like libvirt try to control the number of
> >> threads based on the nproc rlimit setting we can end up creating more
> >> threads than the user wanted.
> >>
> >> This patch has us use the vhost_task helpers which will inherit its
> >> values/checks from the thread that owns the device similar to if we did
> >> a clone in userspace. The vhost threads will now be counted in the nproc
> >> rlimits. And we get features like cgroups and mm sharing automatically,
> >> so we can remove those calls.
> >>
> >> Signed-off-by: Mike Christie <[email protected]>
> >> Acked-by: Michael S. Tsirkin <[email protected]>
> >
> >
> > Hi Mike,
> > So this seems to have caused a measurable regression in networking
> > performance (about 30%). Take a look here, and there's a zip file
> > with detailed measurements attached:
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=2222603
> >
> >
> > Could you take a look please?
> > You can also ask reporter questions there assuming you
> > have or can create a (free) account.
> >
>
> Sorry for the late reply. I just got home from vacation.
>
> The account creation link seems to be down. I keep getting an
> "unable to establish SMTP connection to bz-exim-prod port 25" error.
>
> Can you give me Quan's email?
>
> I think I can replicate the problem. I just need some extra info from Quan:
>
> 1. Just double check that they are using RHEL 9 on the host running the VMs.
> 2. The kernel config
> 3. Any tuning that was done. Is tuned running in guest and/or host running the
> VMs and what profile is being used in each.
> 4. Number of vCPUs and virtqueues being used.
> 5. Can they dump the contents of:
>
> /sys/kernel/debug/sched
>
> and
>
> sysctl -a
>
> on the host running the VMs.
>
> 6. With the 6.4 kernel, can they also run a quick test and tell me if they set
> the scheduler to batch:
>
> ps -T -o comm,pid,tid $QEMU_THREAD
>
> then for each vhost thread do:
>
> chrt -b -p 0 $VHOST_THREAD
>
> Does that end up increasing perf? When I do this I see throughput go up by
> around 50% vs 6.3 when sessions were 16 or more (16 was the number of vCPUs
> and virtqueues per net device in the VM). Note that I'm not saying that is a fix.
> It's just a difference I noticed when running some other tests.


Mike, I'm unsure what to do at this point. Regressions are not nice
but if the kernel is released with the new userspace API we won't
be able to revert. So what's the plan?

--
MST


2023-08-11 20:15:06

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On 8/10/23 1:57 PM, Michael S. Tsirkin wrote:
> On Sat, Jul 22, 2023 at 11:03:29PM -0500, [email protected] wrote:
>> On 7/20/23 8:06 AM, Michael S. Tsirkin wrote:
>>> On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
>>>> For vhost workers we use the kthread API which inherits its values from
>>>> and checks against the kthreadd thread. This results in the wrong RLIMITs
>>>> being checked, so while tools like libvirt try to control the number of
>>>> threads based on the nproc rlimit setting we can end up creating more
>>>> threads than the user wanted.
>>>>
>>>> This patch has us use the vhost_task helpers which will inherit its
>>>> values/checks from the thread that owns the device similar to if we did
>>>> a clone in userspace. The vhost threads will now be counted in the nproc
>>>> rlimits. And we get features like cgroups and mm sharing automatically,
>>>> so we can remove those calls.
>>>>
>>>> Signed-off-by: Mike Christie <[email protected]>
>>>> Acked-by: Michael S. Tsirkin <[email protected]>
>>>
>>>
>>> Hi Mike,
>>> So this seems to have caused a measurable regression in networking
>>> performance (about 30%). Take a look here, and there's a zip file
>>> with detailed measurements attached:
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=2222603
>>>
>>>
>>> Could you take a look please?
>>> You can also ask reporter questions there assuming you
>>> have or can create a (free) account.
>>>
>>
>> Sorry for the late reply. I just got home from vacation.
>>
>> The account creation link seems to be down. I keep getting an
>> "unable to establish SMTP connection to bz-exim-prod port 25" error.
>>
>> Can you give me Quan's email?
>>
>> I think I can replicate the problem. I just need some extra info from Quan:
>>
>> 1. Just double check that they are using RHEL 9 on the host running the VMs.
>> 2. The kernel config
>> 3. Any tuning that was done. Is tuned running in guest and/or host running the
>> VMs and what profile is being used in each.
>> 4. Number of vCPUs and virtqueues being used.
>> 5. Can they dump the contents of:
>>
>> /sys/kernel/debug/sched
>>
>> and
>>
>> sysctl -a
>>
>> on the host running the VMs.
>>
>> 6. With the 6.4 kernel, can they also run a quick test and tell me if they set
>> the scheduler to batch:
>>
>> ps -T -o comm,pid,tid $QEMU_THREAD
>>
>> then for each vhost thread do:
>>
>> chrt -b -p 0 $VHOST_THREAD
>>
>> Does that end up increasing perf? When I do this I see throughput go up by
>> around 50% vs 6.3 when sessions were 16 or more (16 was the number of vCPUs
>> and virtqueues per net device in the VM). Note that I'm not saying that is a fix.
>> It's just a difference I noticed when running some other tests.
>
>
> Mike I'm unsure what to do at this point. Regressions are not nice
> but if the kernel is released with the new userspace api we won't
> be able to revert. So what's the plan?
>

I'm sort of stumped. I still can't replicate the problem out of the box. 6.3 and
6.4 perform the same for me. I've tried your setup and settings with different
combos of things like tuned and irqbalance.

I can sort of force the issue. In 6.4, the vhost thread inherits its settings
from the parent thread. In 6.3, the vhost thread inherits from kthreadd and we
would then reset the sched settings. So in 6.4 if I just tune the parent differently
I can cause different performance. If we want the 6.3 behavior we can do the patch
below.

However, I don't think you guys are hitting this because you are just running
qemu from the normal shell and were not doing anything fancy with the sched
settings.


diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index da35e5b7f047..f2c2638d1106 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -2,6 +2,7 @@
/*
* Copyright (C) 2021 Oracle Corporation
*/
+#include <uapi/linux/sched/types.h>
#include <linux/slab.h>
#include <linux/completion.h>
#include <linux/sched/task.h>
@@ -22,9 +23,16 @@ struct vhost_task {

static int vhost_task_fn(void *data)
{
+ static const struct sched_param param = { .sched_priority = 0 };
struct vhost_task *vtsk = data;
bool dead = false;

+ /*
+ * Don't inherit the parent's sched info, so we maintain compat from
+ * when we used kthreads and it reset this info.
+ */
+ sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
+
for (;;) {
bool did_work;
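
If it helps with testing the patch, whether the reset took effect can be
checked from userspace by reading the worker's policy back. A small sketch
along the same lines as the chrt experiment earlier (again illustrative;
the TID comes from ps -T on the qemu process):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* With the patch above, a vhost worker should report SCHED_OTHER (the
 * userspace name for SCHED_NORMAL) regardless of how the parent was tuned. */
int main(int argc, char **argv)
{
	int policy;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <tid>\n", argv[0]);
		return 1;
	}

	policy = sched_getscheduler((pid_t)atoi(argv[1]));
	if (policy < 0) {
		perror("sched_getscheduler");
		return 1;
	}

	printf("%s\n", policy == SCHED_OTHER ? "SCHED_OTHER" :
		       policy == SCHED_BATCH ? "SCHED_BATCH" :
		       policy == SCHED_FIFO  ? "SCHED_FIFO"  :
		       policy == SCHED_RR    ? "SCHED_RR"    : "other");
	return 0;
}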


2023-08-13 20:47:32

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On Fri, Aug 11, 2023 at 01:51:36PM -0500, Mike Christie wrote:
> On 8/10/23 1:57 PM, Michael S. Tsirkin wrote:
> > On Sat, Jul 22, 2023 at 11:03:29PM -0500, [email protected] wrote:
> >> On 7/20/23 8:06 AM, Michael S. Tsirkin wrote:
> >>> On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
> >>>> For vhost workers we use the kthread API which inherits its values from
> >>>> and checks against the kthreadd thread. This results in the wrong RLIMITs
> >>>> being checked, so while tools like libvirt try to control the number of
> >>>> threads based on the nproc rlimit setting we can end up creating more
> >>>> threads than the user wanted.
> >>>>
> >>>> This patch has us use the vhost_task helpers which will inherit its
> >>>> values/checks from the thread that owns the device similar to if we did
> >>>> a clone in userspace. The vhost threads will now be counted in the nproc
> >>>> rlimits. And we get features like cgroups and mm sharing automatically,
> >>>> so we can remove those calls.
> >>>>
> >>>> Signed-off-by: Mike Christie <[email protected]>
> >>>> Acked-by: Michael S. Tsirkin <[email protected]>
> >>>
> >>>
> >>> Hi Mike,
> >>> So this seems to have caused a measurable regression in networking
> >>> performance (about 30%). Take a look here, and there's a zip file
> >>> with detailed measurements attached:
> >>>
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=2222603
> >>>
> >>>
> >>> Could you take a look please?
> >>> You can also ask reporter questions there assuming you
> >>> have or can create a (free) account.
> >>>
> >>
> >> Sorry for the late reply. I just got home from vacation.
> >>
> >> The account creation link seems to be down. I keep getting an
> >> "unable to establish SMTP connection to bz-exim-prod port 25" error.
> >>
> >> Can you give me Quan's email?
> >>
> >> I think I can replicate the problem. I just need some extra info from Quan:
> >>
> >> 1. Just double check that they are using RHEL 9 on the host running the VMs.
> >> 2. The kernel config
> >> 3. Any tuning that was done. Is tuned running in guest and/or host running the
> >> VMs and what profile is being used in each.
> >> 4. Number of vCPUs and virtqueues being used.
> >> 5. Can they dump the contents of:
> >>
> >> /sys/kernel/debug/sched
> >>
> >> and
> >>
> >> sysctl -a
> >>
> >> on the host running the VMs.
> >>
> >> 6. With the 6.4 kernel, can they also run a quick test and tell me if they set
> >> the scheduler to batch:
> >>
> >> ps -T -o comm,pid,tid $QEMU_THREAD
> >>
> >> then for each vhost thread do:
> >>
> >> chrt -b -p 0 $VHOST_THREAD
> >>
> >> Does that end up increasing perf? When I do this I see throughput go up by
> >> around 50% vs 6.3 when sessions were 16 or more (16 was the number of vCPUs
> >> and virtqueues per net device in the VM). Note that I'm not saying that is a fix.
> >> It's just a difference I noticed when running some other tests.
> >
> >
> > Mike I'm unsure what to do at this point. Regressions are not nice
> > but if the kernel is released with the new userspace api we won't
> > be able to revert. So what's the plan?
> >
>
> I'm sort of stumped. I still can't replicate the problem out of the box. 6.3 and
> 6.4 perform the same for me. I've tried your setup and settings and with different
> combos of using things like tuned and irqbalance.
>
> I can sort of force the issue. In 6.4, the vhost thread inherits its settings
> from the parent thread. In 6.3, the vhost thread inherits from kthreadd and we
> would then reset the sched settings. So in 6.4 if I just tune the parent differently
> I can cause different performance. If we want the 6.3 behavior we can do the patch
> below.
>
> However, I don't think you guys are hitting this because you are just running
> qemu from the normal shell and were not doing anything fancy with the sched
> settings.
>
>
> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
> index da35e5b7f047..f2c2638d1106 100644
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -2,6 +2,7 @@
> /*
> * Copyright (C) 2021 Oracle Corporation
> */
> +#include <uapi/linux/sched/types.h>
> #include <linux/slab.h>
> #include <linux/completion.h>
> #include <linux/sched/task.h>
> @@ -22,9 +23,16 @@ struct vhost_task {
>
> static int vhost_task_fn(void *data)
> {
> + static const struct sched_param param = { .sched_priority = 0 };
> struct vhost_task *vtsk = data;
> bool dead = false;
>
> + /*
> + * Don't inherit the parent's sched info, so we maintain compat from
> + * when we used kthreads and it reset this info.
> + */
> + sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
> +
> for (;;) {
> bool did_work;
>
>
>

Yes, seems unlikely; still, attach this to bugzilla so it can be
tested?

And what will help you debug? Any traces to enable?

Also, wasn't there another issue with a non-standard config?
Maybe if we fix that it will by chance fix this one too?



2023-08-14 06:12:05

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

On 8/13/23 2:01 PM, Michael S. Tsirkin wrote:
> On Fri, Aug 11, 2023 at 01:51:36PM -0500, Mike Christie wrote:
>> On 8/10/23 1:57 PM, Michael S. Tsirkin wrote:
>>> On Sat, Jul 22, 2023 at 11:03:29PM -0500, [email protected] wrote:
>>>> On 7/20/23 8:06 AM, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
>>>>>> For vhost workers we use the kthread API which inherits its values from
>>>>>> and checks against the kthreadd thread. This results in the wrong RLIMITs
>>>>>> being checked, so while tools like libvirt try to control the number of
>>>>>> threads based on the nproc rlimit setting we can end up creating more
>>>>>> threads than the user wanted.
>>>>>>
>>>>>> This patch has us use the vhost_task helpers which will inherit its
>>>>>> values/checks from the thread that owns the device similar to if we did
>>>>>> a clone in userspace. The vhost threads will now be counted in the nproc
>>>>>> rlimits. And we get features like cgroups and mm sharing automatically,
>>>>>> so we can remove those calls.
>>>>>>
>>>>>> Signed-off-by: Mike Christie <[email protected]>
>>>>>> Acked-by: Michael S. Tsirkin <[email protected]>
>>>>>
>>>>>
>>>>> Hi Mike,
>>>>> So this seems to have caused a measurable regression in networking
>>>>> performance (about 30%). Take a look here, and there's a zip file
>>>>> with detailed measurements attached:
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=2222603
>>>>>
>>>>>
>>>>> Could you take a look please?
>>>>> You can also ask reporter questions there assuming you
>>>>> have or can create a (free) account.
>>>>>
>>>>
>>>> Sorry for the late reply. I just got home from vacation.
>>>>
>>>> The account creation link seems to be down. I keep getting an
>>>> "unable to establish SMTP connection to bz-exim-prod port 25" error.
>>>>
>>>> Can you give me Quan's email?
>>>>
>>>> I think I can replicate the problem. I just need some extra info from Quan:
>>>>
>>>> 1. Just double check that they are using RHEL 9 on the host running the VMs.
>>>> 2. The kernel config
>>>> 3. Any tuning that was done. Is tuned running in guest and/or host running the
>>>> VMs and what profile is being used in each.
>>>> 4. Number of vCPUs and virtqueues being used.
>>>> 5. Can they dump the contents of:
>>>>
>>>> /sys/kernel/debug/sched
>>>>
>>>> and
>>>>
>>>> sysctl -a
>>>>
>>>> on the host running the VMs.
>>>>
>>>> 6. With the 6.4 kernel, can they also run a quick test and tell me if they set
>>>> the scheduler to batch:
>>>>
>>>> ps -T -o comm,pid,tid $QEMU_THREAD
>>>>
>>>> then for each vhost thread do:
>>>>
>>>> chrt -b -p 0 $VHOST_THREAD
>>>>
>>>> Does that end up increasing perf? When I do this I see throughput go up by
>>>> around 50% vs 6.3 when sessions were 16 or more (16 was the number of vCPUs
>>>> and virtqueues per net device in the VM). Note that I'm not saying that is a fix.
>>>> It's just a difference I noticed when running some other tests.
>>>
>>>
>>> Mike I'm unsure what to do at this point. Regressions are not nice
>>> but if the kernel is released with the new userspace api we won't
>>> be able to revert. So what's the plan?
>>>
>>
>> I'm sort of stumped. I still can't replicate the problem out of the box. 6.3 and
>> 6.4 perform the same for me. I've tried your setup and settings and with different
>> combos of using things like tuned and irqbalance.
>>
>> I can sort of force the issue. In 6.4, the vhost thread inherits its settings
>> from the parent thread. In 6.3, the vhost thread inherits from kthreadd and we
>> would then reset the sched settings. So in 6.4 if I just tune the parent differently
>> I can cause different performance. If we want the 6.3 behavior we can do the patch
>> below.
>>
>> However, I don't think you guys are hitting this because you are just running
>> qemu from the normal shell and were not doing anything fancy with the sched
>> settings.
>>
>>
>> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
>> index da35e5b7f047..f2c2638d1106 100644
>> --- a/kernel/vhost_task.c
>> +++ b/kernel/vhost_task.c
>> @@ -2,6 +2,7 @@
>> /*
>> * Copyright (C) 2021 Oracle Corporation
>> */
>> +#include <uapi/linux/sched/types.h>
>> #include <linux/slab.h>
>> #include <linux/completion.h>
>> #include <linux/sched/task.h>
>> @@ -22,9 +23,16 @@ struct vhost_task {
>>
>> static int vhost_task_fn(void *data)
>> {
>> + static const struct sched_param param = { .sched_priority = 0 };
>> struct vhost_task *vtsk = data;
>> bool dead = false;
>>
>> + /*
>> + * Don't inherit the parent's sched info, so we maintain compat from
>> + * when we used kthreads and it reset this info.
>> + */
>> + sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
>> +
>> for (;;) {
>> bool did_work;
>>
>>
>>
>
> Yes, seems unlikely; still, attach this to bugzilla so it can be
> tested?
>
> And what will help you debug? Any traces to enable?

I added the patch and asked for a perf trace.

>
> Also, wasn't there another issue with a non-standard config?
> Maybe if we fix that it will by chance fix this one too?
>

It was when CONFIG_RT_GROUP_SCHED was enabled in the kernel config. With
that set I would see a large drop in IOPS/throughput.

In the current 6.5-rc6 I don't see the problem anymore. I haven't had a
chance to narrow down what fixed it.