LinuxLists.cc - [PATCH 1/1] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

[permalink] [raw]

Subject: Re: [PATCH 1/1] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

Mike Christie <[email protected]> writes:

> When switching from kthreads to vhost_tasks two bugs were added:
> 1. The vhost worker tasks's now show up as processes so scripts doing
> ps or ps a would not incorrectly detect the vhost task as another
> process. 2. kthreads disabled freeze by setting PF_NOFREEZE, but
> vhost tasks's didn't disable or add support for them.
>
> To fix both bugs, this switches the vhost task to be thread in the
> process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
> get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
> SIGKILL/STOP support is required because CLONE_THREAD requires
> CLONE_SIGHAND which requires those 2 signals to be supported.
>
> This is a modified version of the patch written by Mike Christie
> <[email protected]> which was a modified version of patch
> originally written by Linus.
>
> Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER.
> Including ignoring signals, setting up the register state, and having
> get_signal return instead of calling do_group_exit.
>
> Tidied up the vhost_task abstraction so that the definition of
> vhost_task only needs to be visible inside of vhost_task.c. Making
> it easier to review the code and tell what needs to be done where.
> As part of this the main loop has been moved from vhost_worker into
> vhost_task_fn. vhost_worker now returns true if work was done.

Please see below for a race in that tidying up.

> The main loop has been updated to call get_signal which handles
> SIGSTOP, freezing, and collects the message that tells the thread to
> exit as part of process exit. This collection clears
> __fatal_signal_pending. This collection is not guaranteed to
> clear signal_pending() so clear that explicitly so the schedule()
> sleeps.
>
> For now the vhost thread continues to exist and run work until the
> last file descriptor is closed and the release function is called as
> part of freeing struct file. To avoid hangs in the coredump
> rendezvous and when killing threads in a multi-threaded exec. The
> coredump code and de_thread have been modified to ignore vhost threads.
>
> Remvoing the special case for exec appears to require teaching
> vhost_dev_flush how to directly complete transactions in case
> the vhost thread is no longer running.
>
> Removing the special case for coredump rendezvous requires either the
> above fix needed for exec or moving the coredump rendezvous into
> get_signal.

In just fixing the hang after exec I am afraid I may have introduced
something worse.

Two different sighand_struct's (and their associated locks) pointing
at the same signal_struct. (Should be fixable?)

I am worried about what happens with that vhost task after an exec.
It retains it's existing cred (and technically the old mm) but shares
signal_struct so it might be possible to use permission checks against
the old vhost task cred to bypass permission checks on the new tasks
cred. In particular for exec's that gain privilege.

It doesn't look like that is an issue for signals and suid exec as
kill_ok_by_cred seems to deliberately allow the same thing.

We may be ok but the way that vhost task remains after exec it smells
like the setup for a local privilege escalation.

Oleg do you have any insights?

Does anyone see why using the vhost task to modify the process should
not result in privilege escalation?

> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index a92af08e7864..074273020849 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -256,7 +256,7 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
> * test_and_set_bit() implies a memory barrier.
> */
> llist_add(&work->node, &dev->worker->work_list);
> - wake_up_process(dev->worker->vtsk->task);
> + vhost_task_wake(dev->worker->vtsk);
> }
> }
> EXPORT_SYMBOL_GPL(vhost_work_queue);
> @@ -333,31 +333,19 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> __vhost_vq_meta_reset(vq);
> }
>
> -static int vhost_worker(void *data)
> +static bool vhost_worker(void *data)
> {
> struct vhost_worker *worker = data;
> struct vhost_work *work, *work_next;
> struct llist_node *node;
>
> - for (;;) {
> - /* mb paired w/ kthread_stop */
> - set_current_state(TASK_INTERRUPTIBLE);
> -
> - if (vhost_task_should_stop(worker->vtsk)) {
> - __set_current_state(TASK_RUNNING);
> - break;
> - }
> -
> - node = llist_del_all(&worker->work_list);
> - if (!node)
> - schedule();
> -
> + node = llist_del_all(&worker->work_list);
> + if (node) {
> node = llist_reverse_order(node);
> /* make sure flag is seen after deletion */
> smp_wmb();
> llist_for_each_entry_safe(work, work_next, node, node) {
> clear_bit(VHOST_WORK_QUEUED, &work->flags);
> - __set_current_state(TASK_RUNNING);
> kcov_remote_start_common(worker->kcov_handle);
> work->fn(work);
> kcov_remote_stop();
> @@ -365,7 +353,7 @@ static int vhost_worker(void *data)
> }
> }
>
> - return 0;
> + return !!node;
> }
>
> static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
> index b7cbd66f889e..f80d5c51ae67 100644
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -12,58 +12,88 @@ enum vhost_task_flags {
> VHOST_TASK_FLAGS_STOP,
> };
>
> +struct vhost_task {
> + bool (*fn)(void *data);
> + void *data;
> + struct completion exited;
> + unsigned long flags;
> + struct task_struct *task;
> +};
> +
> static int vhost_task_fn(void *data)
> {
> struct vhost_task *vtsk = data;
> - int ret;
> + bool dead = false;
> +
> + for (;;) {
> + bool did_work;
> +
> + /* mb paired w/ vhost_task_stop */
> + if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags))
> + break;
> +
> + if (!dead && signal_pending(current)) {
> + struct ksignal ksig;
> + /*
> + * Calling get_signal will block in SIGSTOP,
> + * or clear fatal_signal_pending, but remember
> + * what was set.
> + *
> + * This thread won't actually exit until all
> + * of the file descriptors are closed, and
> + * the release function is called.
> + */
> + dead = get_signal(&ksig);
> + if (dead)
> + clear_thread_flag(TIF_SIGPENDING);
> + }
> +
> + did_work = vtsk->fn(vtsk->data);
> + if (!did_work) {
> + set_current_state(TASK_INTERRUPTIBLE);

I am about to head off on vacation for a week or so, but I want to
add some comments before I go.

First moving set_current_state(TASK_INTERRUPTIBLE) here in the loop
won't work. It introduces a race between vhost_work_queue and this wake
up.

Better would be to move the "if (!dead && signal_pending(current)) ..."
check up above "test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)"

Eric

2023-06-02 14:42:21

by Nicolas Dichtel

[permalink] [raw]

Subject: Re: [PATCH 1/1] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

Le 01/06/2023 à 20:32, Mike Christie a écrit :
> When switching from kthreads to vhost_tasks two bugs were added:
> 1. The vhost worker tasks's now show up as processes so scripts doing
> ps or ps a would not incorrectly detect the vhost task as another
> process. 2. kthreads disabled freeze by setting PF_NOFREEZE, but
> vhost tasks's didn't disable or add support for them.
>
> To fix both bugs, this switches the vhost task to be thread in the
> process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
> get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
> SIGKILL/STOP support is required because CLONE_THREAD requires
> CLONE_SIGHAND which requires those 2 signals to be supported.
>
> This is a modified version of the patch written by Mike Christie
> <[email protected]> which was a modified version of patch
> originally written by Linus.
>
> Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER.
> Including ignoring signals, setting up the register state, and having
> get_signal return instead of calling do_group_exit.
>
> Tidied up the vhost_task abstraction so that the definition of
> vhost_task only needs to be visible inside of vhost_task.c. Making
> it easier to review the code and tell what needs to be done where.
> As part of this the main loop has been moved from vhost_worker into
> vhost_task_fn. vhost_worker now returns true if work was done.
>
> The main loop has been updated to call get_signal which handles
> SIGSTOP, freezing, and collects the message that tells the thread to
> exit as part of process exit. This collection clears
> __fatal_signal_pending. This collection is not guaranteed to
> clear signal_pending() so clear that explicitly so the schedule()
> sleeps.
>
> For now the vhost thread continues to exist and run work until the
> last file descriptor is closed and the release function is called as
> part of freeing struct file. To avoid hangs in the coredump
> rendezvous and when killing threads in a multi-threaded exec. The
> coredump code and de_thread have been modified to ignore vhost threads.
>
> Remvoing the special case for exec appears to require teaching
> vhost_dev_flush how to directly complete transactions in case
> the vhost thread is no longer running.
>
> Removing the special case for coredump rendezvous requires either the
> above fix needed for exec or moving the coredump rendezvous into
> get_signal.
>
> Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
> Signed-off-by: Eric W. Biederman <[email protected]>
> Co-developed-by: Mike Christie <[email protected]>
> Signed-off-by: Mike Christie <[email protected]>
Tested-by: Nicolas Dichtel <[email protected]>

2023-06-02 19:52:12

[permalink] [raw]

Subject: Re: [PATCH 1/1] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

Hi Mike,

sorry, but somehow I can't understand this patch...

I'll try to read it with a fresh head on Weekend, but for example,

On 06/01, Mike Christie wrote:
>
> static int vhost_task_fn(void *data)
> {
> struct vhost_task *vtsk = data;
> - int ret;
> + bool dead = false;
> +
> + for (;;) {
> + bool did_work;
> +
> + /* mb paired w/ vhost_task_stop */
> + if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags))
> + break;
> +
> + if (!dead && signal_pending(current)) {
> + struct ksignal ksig;
> + /*
> + * Calling get_signal will block in SIGSTOP,
> + * or clear fatal_signal_pending, but remember
> + * what was set.
> + *
> + * This thread won't actually exit until all
> + * of the file descriptors are closed, and
> + * the release function is called.
> + */
> + dead = get_signal(&ksig);
> + if (dead)
> + clear_thread_flag(TIF_SIGPENDING);

this can't be right or I am totally confused.

Another signal_wake_up() can come right after clear(SIGPENDING).

Again, I'll try to re-read this patch, but let me ask anyway...

Do we have a plan B? I mean... iirc you have mentioned that you can
change these code paths to do something like

if (killed)
tell_the_drivers_that_all_callbacks_will_fail();

so that vhost_worker() can exit after get_signal() returns SIGKILL.

Probably I misunderstood you, but it would be nice to avoid the changes
in coredump/etc code just to add a temporary (iiuc!) fix.

Oleg.

2023-06-03 03:58:49

[permalink] [raw]

Subject: Re: [PATCH 1/1] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

Oleg Nesterov <[email protected]> writes:

> Hi Mike,
>
> sorry, but somehow I can't understand this patch...
>
> I'll try to read it with a fresh head on Weekend, but for example,
>
> On 06/01, Mike Christie wrote:
>>
>> static int vhost_task_fn(void *data)
>> {
>> struct vhost_task *vtsk = data;
>> - int ret;
>> + bool dead = false;
>> +
>> + for (;;) {
>> + bool did_work;
>> +
>> + /* mb paired w/ vhost_task_stop */
>> + if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags))
>> + break;
>> +
>> + if (!dead && signal_pending(current)) {
>> + struct ksignal ksig;
>> + /*
>> + * Calling get_signal will block in SIGSTOP,
>> + * or clear fatal_signal_pending, but remember
>> + * what was set.
>> + *
>> + * This thread won't actually exit until all
>> + * of the file descriptors are closed, and
>> + * the release function is called.
>> + */
>> + dead = get_signal(&ksig);
>> + if (dead)
>> + clear_thread_flag(TIF_SIGPENDING);
>
> this can't be right or I am totally confused.
>
> Another signal_wake_up() can come right after clear(SIGPENDING).

Technically yes.

However please not that prepare_signal does:
if (signal->flags & SIGNAL_GROUP_EXIT)
return false;

In general it is wrong to receive or attempt to process a signal
after task death has been decided.

Strictly speaking that doesn't cover de_thread, and coredumping
but still receiving any kind of signal at that point is rare
and generally wrong behavior.

Beyond that clearing TIF_SIGPENDING is just an optimization so
the thread can sleep in schedule and not spin.

> Again, I'll try to re-read this patch, but let me ask anyway...
>
> Do we have a plan B? I mean... iirc you have mentioned that you can
> change these code paths to do something like
>
> if (killed)
> tell_the_drivers_that_all_callbacks_will_fail();
>
>
> so that vhost_worker() can exit after get_signal() returns SIGKILL.
>
> Probably I misunderstood you, but it would be nice to avoid the changes
> in coredump/etc code just to add a temporary (iiuc!) fix.

One saving grace with the the vhost code is that you need to open
device nodes that normally have root-only permissions.

If we are willing to allow races in process shutdown to cause leaks I
think we can do something better, and put the burden of work on vhost
layer.

I will follow up with a patch doing that.

Eric

2023-06-03 04:28:02

[permalink] [raw]

Subject: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

When switching from kthreads to vhost_tasks two bugs were added:
1. The vhost worker tasks's now show up as processes so scripts doing
ps or ps a would not incorrectly detect the vhost task as another
process. 2. kthreads disabled freeze by setting PF_NOFREEZE, but
vhost tasks's didn't disable or add support for them.

To fix both bugs, this switches the vhost task to be thread in the
process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
SIGKILL/STOP support is required because CLONE_THREAD requires
CLONE_SIGHAND which requires those 2 signals to be suppported.

This is a modified version of the patch written by Mike Christie
<[email protected]> which was a modified version of patch
originally written by Linus.

Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER.
Including ignoring signals, setting up the register state, and having
get_signal return instead of calling do_group_exit.

Tidied up the vhost_task abstraction so that the definition of
vhost_task only needs to be visible inside of vhost_task.c. Making
it easier to review the code and tell what needs to be done where.
As part of this the main loop has been moved from vhost_worker into
vhost_task_fn. vhost_worker now returns true if work was done.

The main loop has been updated to call get_signal which handles
SIGSTOP, freezing, and collects the message that tells the thread to
exit as part of process exit. This collection clears
__fatal_signal_pending.

The vhost tasks when it has been asked to exit runs until it has
no more work pending and then exits instead of sleeping.

Causing the other threads to stop feeding the vhost worker work and
having the vhost worker stop when it runs out of work should be enough
to avoid hangs in coredump rendezvous and when killing threads in a
multi-threaded exec.

The vhost thread is no longer guaranteed to be the last thread to
exit. Which means it is possible a work item to be submitted after
the vhost work thread has exited. If that happens the work item will
leak and vhost_worker_free will warn about the situtation.

Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
Co-developed-by: Mike Christie <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
---

This fixes the ordering issue in vhost_task_fn so that get_signal
should not work.

This patch is a gamble that during process exit or de_thread in exec
work will not be commonly queued from other threads.

If this gamble turns out to be false the existing WARN_ON in
vhost_worker_free will fire.

Can folks test this and let us know if the WARN_ON fires?

Thank you.

arch/x86/include/asm/fpu/sched.h | 2 +-
arch/x86/kernel/fpu/context.h | 2 +-
arch/x86/kernel/fpu/core.c | 2 +-
drivers/vhost/vhost.c | 24 +++-----
include/linux/sched/task.h | 1 -
include/linux/sched/vhost_task.h | 16 ++----
kernel/fork.c | 13 ++---
kernel/signal.c | 4 +-
kernel/vhost_task.c | 99 ++++++++++++++++++++++----------
9 files changed, 91 insertions(+), 72 deletions(-)

diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
index c2d6cd78ed0c..78fcde7b1f07 100644
--- a/arch/x86/include/asm/fpu/sched.h
+++ b/arch/x86/include/asm/fpu/sched.h
@@ -39,7 +39,7 @@ extern void fpu_flush_thread(void);
static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
{
if (cpu_feature_enabled(X86_FEATURE_FPU) &&
- !(current->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ !(current->flags & (PF_KTHREAD | PF_USER_WORKER))) {
save_fpregs_to_fpstate(old_fpu);
/*
* The save operation preserved register state, so the
diff --git a/arch/x86/kernel/fpu/context.h b/arch/x86/kernel/fpu/context.h
index 9fcfa5c4dad7..af5cbdd9bd29 100644
--- a/arch/x86/kernel/fpu/context.h
+++ b/arch/x86/kernel/fpu/context.h
@@ -57,7 +57,7 @@ static inline void fpregs_restore_userregs(void)
struct fpu *fpu = &current->thread.fpu;
int cpu = smp_processor_id();

- if (WARN_ON_ONCE(current->flags & (PF_KTHREAD | PF_IO_WORKER)))
+ if (WARN_ON_ONCE(current->flags & (PF_KTHREAD | PF_USER_WORKER)))
return;

if (!fpregs_state_valid(fpu, cpu)) {
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index caf33486dc5e..1015af1ae562 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -426,7 +426,7 @@ void kernel_fpu_begin_mask(unsigned int kfpu_mask)

this_cpu_write(in_kernel_fpu, true);

- if (!(current->flags & (PF_KTHREAD | PF_IO_WORKER)) &&
+ if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) &&
!test_thread_flag(TIF_NEED_FPU_LOAD)) {
set_thread_flag(TIF_NEED_FPU_LOAD);
save_fpregs_to_fpstate(&current->thread.fpu);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..85948f40ddfe 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -236,6 +236,9 @@ void vhost_dev_flush(struct vhost_dev *dev)
struct vhost_flush_struct flush;

if (dev->worker) {
+ if (vhost_task_flush(dev->worker->vtsk))
+ return;
+
init_completion(&flush.wait_event);
vhost_work_init(&flush.work, vhost_flush_work);

@@ -256,7 +259,7 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
* test_and_set_bit() implies a memory barrier.
*/
llist_add(&work->node, &dev->worker->work_list);
- wake_up_process(dev->worker->vtsk->task);
+ vhost_task_wake(dev->worker->vtsk);
}
}
EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -333,25 +336,14 @@ static void vhost_vq_reset(struct vhost_dev *dev,
__vhost_vq_meta_reset(vq);
}

-static int vhost_worker(void *data)
+static bool vhost_worker(void *data)
{
struct vhost_worker *worker = data;
struct vhost_work *work, *work_next;
struct llist_node *node;

- for (;;) {
- /* mb paired w/ kthread_stop */
- set_current_state(TASK_INTERRUPTIBLE);
-
- if (vhost_task_should_stop(worker->vtsk)) {
- __set_current_state(TASK_RUNNING);
- break;
- }
-
- node = llist_del_all(&worker->work_list);
- if (!node)
- schedule();
-
+ node = llist_del_all(&worker->work_list);
+ if (node) {
node = llist_reverse_order(node);
/* make sure flag is seen after deletion */
smp_wmb();
@@ -365,7 +357,7 @@ static int vhost_worker(void *data)
}
}

- return 0;
+ return !!node;
}

static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..e0f5ac90a228 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,6 @@ struct kernel_clone_args {
u32 io_thread:1;
u32 user_worker:1;
u32 no_files:1;
- u32 ignore_signals:1;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
index 6123c10b99cf..f8cd7ba3ad04 100644
--- a/include/linux/sched/vhost_task.h
+++ b/include/linux/sched/vhost_task.h
@@ -2,22 +2,14 @@
#ifndef _LINUX_VHOST_TASK_H
#define _LINUX_VHOST_TASK_H

-#include <linux/completion.h>

-struct task_struct;
+struct vhost_task;

-struct vhost_task {
- int (*fn)(void *data);
- void *data;
- struct completion exited;
- unsigned long flags;
- struct task_struct *task;
-};
-
-struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
+struct vhost_task *vhost_task_create(bool (*fn)(void *), void *arg,
const char *name);
void vhost_task_start(struct vhost_task *vtsk);
void vhost_task_stop(struct vhost_task *vtsk);
-bool vhost_task_should_stop(struct vhost_task *vtsk);
+void vhost_task_wake(struct vhost_task *vtsk);
+bool vhost_task_flush(struct vhost_task *vtsk);

#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..81cba91f30bb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2336,16 +2336,16 @@ __latent_entropy struct task_struct *copy_process(
p->flags &= ~PF_KTHREAD;
if (args->kthread)
p->flags |= PF_KTHREAD;
- if (args->user_worker)
- p->flags |= PF_USER_WORKER;
- if (args->io_thread) {
+ if (args->user_worker) {
/*
- * Mark us an IO worker, and block any signal that isn't
+ * Mark us a user worker, and block any signal that isn't
* fatal or STOP
*/
- p->flags |= PF_IO_WORKER;
+ p->flags |= PF_USER_WORKER;
siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
}
+ if (args->io_thread)
+ p->flags |= PF_IO_WORKER;

if (args->name)
strscpy_pad(p->comm, args->name, sizeof(p->comm));
@@ -2517,9 +2517,6 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;

- if (args->ignore_signals)
- ignore_signals(p);
-
stackleak_task_init(p);

if (pid != &init_struct_pid) {
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..0ac48c96ab04 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2861,11 +2861,11 @@ bool get_signal(struct ksignal *ksig)
}

/*
- * PF_IO_WORKER threads will catch and exit on fatal signals
+ * PF_USER_WORKER threads will catch and exit on fatal signals
* themselves. They have cleanup that must be performed, so
* we cannot call do_exit() on their behalf.
*/
- if (current->flags & PF_IO_WORKER)
+ if (current->flags & PF_USER_WORKER)
goto out;

/*
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..a82c45cb014d 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -12,58 +12,97 @@ enum vhost_task_flags {
VHOST_TASK_FLAGS_STOP,
};

+struct vhost_task {
+ bool (*fn)(void *data);
+ void *data;
+ struct completion exited;
+ unsigned long flags;
+ struct task_struct *task;
+};
+
static int vhost_task_fn(void *data)
{
struct vhost_task *vtsk = data;
- int ret;
+ bool dead = false;
+
+ for (;;) {
+ bool did_work;
+
+ if (!dead && signal_pending(current)) {
+ struct ksignal ksig;
+ /*
+ * Calling get_signal can block in SIGSTOP,
+ * and the freezer. Or it can clear
+ * fatal_signal_pending and return non-zero.
+ */
+ dead = get_signal(&ksig);
+ if (dead)
+ set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
+ }
+
+ /* mb paired w/ kthread_stop */
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ did_work = vtsk->fn(vtsk->data);
+ if (!did_work) {
+ if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) {
+ __set_current_state(TASK_RUNNING);
+ break;
+ }
+ schedule();
+ }
+ }

- ret = vtsk->fn(vtsk->data);
- complete(&vtsk->exited);
- do_exit(ret);
+ complete_all(&vtsk->exited);
+ do_exit(0);
}

+/**
+ * vhost_task_wake - wakeup the vhost_task
+ * @vtsk: vhost_task to wake
+ *
+ * wake up the vhost_task worker thread
+ */
+void vhost_task_wake(struct vhost_task *vtsk)
+{
+ wake_up_process(vtsk->task);
+}
+EXPORT_SYMBOL_GPL(vhost_task_wake);
+
/**
* vhost_task_stop - stop a vhost_task
* @vtsk: vhost_task to stop
*
- * Callers must call vhost_task_should_stop and return from their worker
- * function when it returns true;
+ * vhost_task_fn ensures the worker thread exits after
+ * VHOST_TASK_FLAGS_SOP becomes true.
*/
void vhost_task_stop(struct vhost_task *vtsk)
{
- pid_t pid = vtsk->task->pid;
-
set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
- wake_up_process(vtsk->task);
+ vhost_task_wake(vtsk);
/*
* Make sure vhost_task_fn is no longer accessing the vhost_task before
- * freeing it below. If userspace crashed or exited without closing,
- * then the vhost_task->task could already be marked dead so
- * kernel_wait will return early.
+ * freeing it below.
*/
wait_for_completion(&vtsk->exited);
- /*
- * If we are just closing/removing a device and the parent process is
- * not exiting then reap the task.
- */
- kernel_wait4(pid, NULL, __WCLONE, NULL);
+ put_task_struct(vtsk->task);
kfree(vtsk);
}
EXPORT_SYMBOL_GPL(vhost_task_stop);

-/**
- * vhost_task_should_stop - should the vhost task return from the work function
- * @vtsk: vhost_task to stop
- */
-bool vhost_task_should_stop(struct vhost_task *vtsk)
+bool vhost_task_flush(struct vhost_task *vtsk)
{
- return test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
+ if (!test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags))
+ return false;
+
+ wait_for_completion(&vtsk->exited);
+ return true;
}
-EXPORT_SYMBOL_GPL(vhost_task_should_stop);
+EXPORT_SYMBOL_GPL(vhost_task_flush);

/**
- * vhost_task_create - create a copy of a process to be used by the kernel
- * @fn: thread stack
+ * vhost_task_create - create a copy of a task to be used by the kernel
+ * @fn: vhost worker function
* @arg: data to be passed to fn
* @name: the thread's name
*
@@ -71,17 +110,17 @@ EXPORT_SYMBOL_GPL(vhost_task_should_stop);
* failure. The returned task is inactive, and the caller must fire it up
* through vhost_task_start().
*/
-struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
+struct vhost_task *vhost_task_create(bool (*fn)(void *), void *arg,
const char *name)
{
struct kernel_clone_args args = {
- .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+ .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM |
+ CLONE_THREAD | CLONE_SIGHAND,
.exit_signal = 0,
.fn = vhost_task_fn,
.name = name,
.user_worker = 1,
.no_files = 1,
- .ignore_signals = 1,
};
struct vhost_task *vtsk;
struct task_struct *tsk;
@@ -101,7 +140,7 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
return NULL;
}

- vtsk->task = tsk;
+ vtsk->task = get_task_struct(tsk);
return vtsk;
}
EXPORT_SYMBOL_GPL(vhost_task_create);
--
2.35.3

2023-06-04 03:48:18

[permalink] [raw]

Subject: Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 6/2/23 11:15 PM, Eric W. Biederman wrote:
>
> This fixes the ordering issue in vhost_task_fn so that get_signal
> should not work.
>
> This patch is a gamble that during process exit or de_thread in exec
> work will not be commonly queued from other threads.
>
> If this gamble turns out to be false the existing WARN_ON in
> vhost_worker_free will fire.
>
> Can folks test this and let us know if the WARN_ON fires?

I don't hit the WARN_ONs but probably not for the reason you are thinking
of. We are hung like:

Jun 03 22:25:23 ol4 kernel: Call Trace:
Jun 03 22:25:23 ol4 kernel: <TASK>
Jun 03 22:25:23 ol4 kernel: __schedule+0x334/0xac0
Jun 03 22:25:23 ol4 kernel: ? wait_for_completion+0x86/0x150
Jun 03 22:25:23 ol4 kernel: schedule+0x5a/0xd0
Jun 03 22:25:23 ol4 kernel: schedule_timeout+0x240/0x2a0
Jun 03 22:25:23 ol4 kernel: ? __wake_up_klogd.part.0+0x3c/0x60
Jun 03 22:25:23 ol4 kernel: ? vprintk_emit+0x104/0x270
Jun 03 22:25:23 ol4 kernel: ? wait_for_completion+0x86/0x150
Jun 03 22:25:23 ol4 kernel: wait_for_completion+0xb0/0x150
Jun 03 22:25:23 ol4 kernel: vhost_scsi_flush+0xc2/0xf0 [vhost_scsi]
Jun 03 22:25:23 ol4 kernel: vhost_scsi_clear_endpoint+0x16f/0x240 [vhost_scsi]
Jun 03 22:25:23 ol4 kernel: vhost_scsi_release+0x7d/0xf0 [vhost_scsi]
Jun 03 22:25:23 ol4 kernel: __fput+0xa2/0x270
Jun 03 22:25:23 ol4 kernel: task_work_run+0x56/0xa0
Jun 03 22:25:23 ol4 kernel: do_exit+0x337/0xb40
Jun 03 22:25:23 ol4 kernel: ? __remove_hrtimer+0x39/0x70
Jun 03 22:25:23 ol4 kernel: do_group_exit+0x30/0x90
Jun 03 22:25:23 ol4 kernel: get_signal+0x9cd/0x9f0
Jun 03 22:25:23 ol4 kernel: ? kvm_arch_vcpu_put+0x12b/0x170 [kvm]
Jun 03 22:25:23 ol4 kernel: ? vcpu_put+0x1e/0x50 [kvm]
Jun 03 22:25:23 ol4 kernel: ? kvm_arch_vcpu_ioctl_run+0x193/0x4e0 [kvm]
Jun 03 22:25:23 ol4 kernel: arch_do_signal_or_restart+0x2a/0x260
Jun 03 22:25:23 ol4 kernel: exit_to_user_mode_prepare+0xdd/0x120
Jun 03 22:25:23 ol4 kernel: syscall_exit_to_user_mode+0x1d/0x40
Jun 03 22:25:23 ol4 kernel: do_syscall_64+0x48/0x90
Jun 03 22:25:23 ol4 kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc
Jun 03 22:25:23 ol4 kernel: RIP: 0033:0x7f2d004df50b

The problem is that as part of the flush the drivers/vhost/scsi.c code
will wait for outstanding commands, because we can't free the device and
it's resources before the commands complete or we will hit the accessing
freed memory bug.

We got hung because the patch had us now do:

vhost_dev_flush() -> vhost_task_flush()

and that saw VHOST_TASK_FLAGS_STOP was set and the exited completion has
completed. However, the scsi code is still waiting on commands in vhost_scsi_flush.
The cmds wanted to use the vhost_task task to complete and couldn't since the task
has exited.

To handle those types of issues above, it's a lot more code. We would add
some rcu in vhost_work_queue to handle the worker being freed from under us.
Then add a callback similar to what I did on one of the past patchsets that
stops the drivers. Then modify scsi, so in the callback it also sets some
bits so the completion paths just do a fast failing that doesn't try to
queue the completion to the vhost_task.

If we want to go that route, I can get it done in more like a 6.6 time frame.

2023-06-05 13:08:06

[permalink] [raw]

Subject: Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 06/02, Eric W. Biederman wrote:
>
> static int vhost_task_fn(void *data)
> {
> struct vhost_task *vtsk = data;
> - int ret;
> + bool dead = false;
> +
> + for (;;) {
> + bool did_work;
> +
> + if (!dead && signal_pending(current)) {
> + struct ksignal ksig;
> + /*
> + * Calling get_signal can block in SIGSTOP,
> + * and the freezer. Or it can clear
> + * fatal_signal_pending and return non-zero.
> + */
> + dead = get_signal(&ksig);
> + if (dead)
> + set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
> + }
> +
> + /* mb paired w/ kthread_stop */
> + set_current_state(TASK_INTERRUPTIBLE);
> +
> + did_work = vtsk->fn(vtsk->data);

I don't understand why do you set TASK_INTERRUPTIBLE before vtsk->fn(),
it seems that you could do this before the test_bit(FLAGS_STOP) below.
But probably I missed something and this is minor anyway...

> + if (!did_work) {
> + if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) {
> + __set_current_state(TASK_RUNNING);
> + break;

What if VHOST_TASK_FLAGS_STOP was set by us after get_signal() above ?
We need to ensure that in this case vhost_work_queue() can't add a new work,
nobody will flush it.

In fact, unless I missed something this can even race with vhost_dev_flush().

vhost_dev_flush: vhost_task_fn:

checks FLAGS_STOP, not set,
vhost_task_flush() returns false
gets SIGKILL, sets FLAGS_STOP

vtsk->fn() returns false

vhost_task_fn() exits.

vhost_work_queue();
wait_for_completion(&flush.wait_event);

and the last wait_for_completion() will hang forever.

Oleg.

2023-06-05 13:45:57

[permalink] [raw]

Subject: Re: [PATCH 1/1] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 06/02, Eric W. Biederman wrote:
>
> Oleg Nesterov <[email protected]> writes:
>
> > Hi Mike,
> >
> > sorry, but somehow I can't understand this patch...
> >
> > I'll try to read it with a fresh head on Weekend, but for example,
> >
> > On 06/01, Mike Christie wrote:
> >>
> >> static int vhost_task_fn(void *data)
> >> {
> >> struct vhost_task *vtsk = data;
> >> - int ret;
> >> + bool dead = false;
> >> +
> >> + for (;;) {
> >> + bool did_work;
> >> +
> >> + /* mb paired w/ vhost_task_stop */
> >> + if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags))
> >> + break;
> >> +
> >> + if (!dead && signal_pending(current)) {
> >> + struct ksignal ksig;
> >> + /*
> >> + * Calling get_signal will block in SIGSTOP,
> >> + * or clear fatal_signal_pending, but remember
> >> + * what was set.
> >> + *
> >> + * This thread won't actually exit until all
> >> + * of the file descriptors are closed, and
> >> + * the release function is called.
> >> + */
> >> + dead = get_signal(&ksig);
> >> + if (dead)
> >> + clear_thread_flag(TIF_SIGPENDING);
> >
> > this can't be right or I am totally confused.
> >
> > Another signal_wake_up() can come right after clear(SIGPENDING).
>
> Technically yes.

...

> Beyond that clearing TIF_SIGPENDING is just an optimization so
> the thread can sleep in schedule and not spin.

Yes. So if another signal_wake_up() comes after clear(SIGPENDING)
this code will spin in busy-wait loop waiting VHOST_TASK_FLAGS_STOP.
Obviously not good and even deadlockable on UP && !PREEMPT.

Oleg.

2023-06-05 14:14:14

[permalink] [raw]

Subject: Re: [PATCH 1/1] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 06/01, Mike Christie wrote:
>
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1368,7 +1368,9 @@ int zap_other_threads(struct task_struct *p)
>
> while_each_thread(p, t) {
> task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
> - count++;
> + /* Don't require de_thread to wait for the vhost_worker */
> + if ((t->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER)
> + count++;

Well if you do this, then you should also change __exit_signal() to
not decrement sig->notify_count. Otherwise de_thread() can succeed
before the "normal" sub-threads exit.

But this can't be right anyway... If nothing else, suppose we have
a process with 3 threads:

M - the main thread, group leader
T - sub-thread
V - vhost worker

T does exec and calls de_thread().

M exits. T takes the leadership and does release_task()

V is still running but V->group_leader points to already freed M.

Or unshare_sighand() after that... If nothing else this means that
lock_task_sighand(T) and lock_task_sighand(V) will take different
locks.

Oleg.

2023-06-05 15:20:59

[permalink] [raw]

Subject: Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 06/03, [email protected] wrote:
>
> On 6/2/23 11:15 PM, Eric W. Biederman wrote:
> The problem is that as part of the flush the drivers/vhost/scsi.c code
> will wait for outstanding commands, because we can't free the device and
> it's resources before the commands complete or we will hit the accessing
> freed memory bug.

ignoring send-fd/clone issues, can we assume that the final fput/release
should always come from vhost_worker's sub-thread (which shares mm/etc) ?

Oleg.

2023-06-05 16:13:39

[permalink] [raw]

Subject: Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 6/5/23 10:10 AM, Oleg Nesterov wrote:
> On 06/03, [email protected] wrote:
>>
>> On 6/2/23 11:15 PM, Eric W. Biederman wrote:
>> The problem is that as part of the flush the drivers/vhost/scsi.c code
>> will wait for outstanding commands, because we can't free the device and
>> it's resources before the commands complete or we will hit the accessing
>> freed memory bug.
>
> ignoring send-fd/clone issues, can we assume that the final fput/release
> should always come from vhost_worker's sub-thread (which shares mm/etc) ?

I think I'm misunderstanding the sub-thread term.

- Is it the task_struct's context that we did the
kernel/vhost_taskc.c:vhost_task_create() from? Below it would be the
thread we did VHOST_SET_OWNER from.

If so, then yes.

- Is it the task_struct that gets created by
kernel/vhost_taskc.c:vhost_task_create()?

If so, then the answer is no. vhost_task_create has set the no_files
arg on kernel_clone_args, so copy_files() sets task_struct->files to NULL
and we don't clone or dup the files.

So it works like if we were using a kthread still:

1. Userapce thread0 opens /dev/vhost-$something.
2. thread0 does VHOST_SET_OWNER ioctl. This calls vhost_task_create() to
create the task_struct which runs the vhost_worker() function which handles
the work->fns.
3. If userspace now does a SIGKILL or just exits without doing a close() on
/dev/vhost-$something, then when thread0 does exit_files() that will do the
fput that does vhost-$something's file_operations->release.

2023-06-06 12:44:57

[permalink] [raw]

Subject: Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 06/05, Mike Christie wrote:
>
> On 6/5/23 10:10 AM, Oleg Nesterov wrote:
> > On 06/03, [email protected] wrote:
> >>
> >> On 6/2/23 11:15 PM, Eric W. Biederman wrote:
> >> The problem is that as part of the flush the drivers/vhost/scsi.c code
> >> will wait for outstanding commands, because we can't free the device and
> >> it's resources before the commands complete or we will hit the accessing
> >> freed memory bug.
> >
> > ignoring send-fd/clone issues, can we assume that the final fput/release
> > should always come from vhost_worker's sub-thread (which shares mm/etc) ?
>
> I think I'm misunderstanding the sub-thread term.
>
> - Is it the task_struct's context that we did the
> kernel/vhost_taskc.c:vhost_task_create() from? Below it would be the
> thread we did VHOST_SET_OWNER from.

Yes,

> So it works like if we were using a kthread still:
>
> 1. Userapce thread0 opens /dev/vhost-$something.
> 2. thread0 does VHOST_SET_OWNER ioctl. This calls vhost_task_create() to
> create the task_struct which runs the vhost_worker() function which handles
> the work->fns.
> 3. If userspace now does a SIGKILL or just exits without doing a close() on
> /dev/vhost-$something, then when thread0 does exit_files() that will do the
> fput that does vhost-$something's file_operations->release.

So, at least in this simple case vhost_worker() can just exit after SIGKILL,
and thread0 can flush the outstanding commands when it calls vhost_dev_flush()
rather than wait for vhost_worker().

Right?

not that I think this can help in the general case ...

Oleg.

2023-06-06 16:31:29

[permalink] [raw]

Subject: Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 6/6/23 7:16 AM, Oleg Nesterov wrote:
> On 06/05, Mike Christie wrote:
>>
>> On 6/5/23 10:10 AM, Oleg Nesterov wrote:
>>> On 06/03, [email protected] wrote:
>>>>
>>>> On 6/2/23 11:15 PM, Eric W. Biederman wrote:
>>>> The problem is that as part of the flush the drivers/vhost/scsi.c code
>>>> will wait for outstanding commands, because we can't free the device and
>>>> it's resources before the commands complete or we will hit the accessing
>>>> freed memory bug.
>>>
>>> ignoring send-fd/clone issues, can we assume that the final fput/release
>>> should always come from vhost_worker's sub-thread (which shares mm/etc) ?
>>
>> I think I'm misunderstanding the sub-thread term.
>>
>> - Is it the task_struct's context that we did the
>> kernel/vhost_taskc.c:vhost_task_create() from? Below it would be the
>> thread we did VHOST_SET_OWNER from.
>
> Yes,
>
>> So it works like if we were using a kthread still:
>>
>> 1. Userapce thread0 opens /dev/vhost-$something.
>> 2. thread0 does VHOST_SET_OWNER ioctl. This calls vhost_task_create() to
>> create the task_struct which runs the vhost_worker() function which handles
>> the work->fns.
>> 3. If userspace now does a SIGKILL or just exits without doing a close() on
>> /dev/vhost-$something, then when thread0 does exit_files() that will do the
>> fput that does vhost-$something's file_operations->release.
>
> So, at least in this simple case vhost_worker() can just exit after SIGKILL,
> and thread0 can flush the outstanding commands when it calls vhost_dev_flush()
> rather than wait for vhost_worker().
>
> Right?

With the current code, the answer is no. We would hang like I mentioned here:

https://lore.kernel.org/lkml/[email protected]/

We need to add code like I mentioned in that reply because we don't have a
way to call into the layers below us to flush those commands. We need more
like an abort and don't call back into us type of operation. Or, I'm just trying
to add a check where we detect what happened then instead of trying to use
the vhost_task we try to complete in the context the lower level completes us
in.

2023-06-06 19:43:58

[permalink] [raw]

Subject: Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 06/06, Mike Christie wrote:
>
> On 6/6/23 7:16 AM, Oleg Nesterov wrote:
> > On 06/05, Mike Christie wrote:
> >
> >> So it works like if we were using a kthread still:
> >>
> >> 1. Userapce thread0 opens /dev/vhost-$something.
> >> 2. thread0 does VHOST_SET_OWNER ioctl. This calls vhost_task_create() to
> >> create the task_struct which runs the vhost_worker() function which handles
> >> the work->fns.
> >> 3. If userspace now does a SIGKILL or just exits without doing a close() on
> >> /dev/vhost-$something, then when thread0 does exit_files() that will do the
> >> fput that does vhost-$something's file_operations->release.
> >
> > So, at least in this simple case vhost_worker() can just exit after SIGKILL,
> > and thread0 can flush the outstanding commands when it calls vhost_dev_flush()
> > rather than wait for vhost_worker().
> >
> > Right?
>
> With the current code, the answer is no. We would hang like I mentioned here:
>
> https://lore.kernel.org/lkml/[email protected]/

If only I could fully understand this email ;)

Could you spell to explain why this can't work (again, in this simple case) ?

My current (and I know, very poor) understanding is that .release() should
roughly do the following:

1. Ensure that vhost_work_queue() can't add the new callbacks

2. Call vhost_dev_flush() to ensure that worker->work_list is empty

3. Call vhost_task_stop()

so why this sequence can't work if we turn vhost_dev_flush() into something like

void vhost_dev_flush(struct vhost_dev *dev)
{
struct vhost_flush_struct flush;

if (dev->worker) {
// this assumes that vhost_task_create() uses CLONE_THREAD
if (same_thread_group(current, dev->worker->vtsk->task)) {
... run the pending callbacks ...
return;
}

// this is what we currently have

init_completion(&flush.wait_event);
vhost_work_init(&flush.work, vhost_flush_work);

vhost_work_queue(dev, &flush.work);
wait_for_completion(&flush.wait_event);
}
}

?

Mike, I am just trying to understand what exactly vhost_worker() should do.

> We need to add code like I mentioned in that reply because we don't have a
> way to call into the layers below us to flush those commands.

This tells me nothing, but this is my fault, not yours. Again, again, I know
nothing about drivers/vhost.

Oleg.

2023-06-06 20:55:20

[permalink] [raw]

Subject: Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 6/6/23 2:39 PM, Oleg Nesterov wrote:
> On 06/06, Mike Christie wrote:
>>
>> On 6/6/23 7:16 AM, Oleg Nesterov wrote:
>>> On 06/05, Mike Christie wrote:
>>>
>>>> So it works like if we were using a kthread still:
>>>>
>>>> 1. Userapce thread0 opens /dev/vhost-$something.
>>>> 2. thread0 does VHOST_SET_OWNER ioctl. This calls vhost_task_create() to
>>>> create the task_struct which runs the vhost_worker() function which handles
>>>> the work->fns.
>>>> 3. If userspace now does a SIGKILL or just exits without doing a close() on
>>>> /dev/vhost-$something, then when thread0 does exit_files() that will do the
>>>> fput that does vhost-$something's file_operations->release.
>>>
>>> So, at least in this simple case vhost_worker() can just exit after SIGKILL,
>>> and thread0 can flush the outstanding commands when it calls vhost_dev_flush()
>>> rather than wait for vhost_worker().
>>>
>>> Right?
>>
>> With the current code, the answer is no. We would hang like I mentioned here:
>>
>> https://lore.kernel.org/lkml/[email protected]/
>
> If only I could fully understand this email ;)
>
> Could you spell to explain why this can't work (again, in this simple case) ?
>
> My current (and I know, very poor) understanding is that .release() should
> roughly do the following:
>
> 1. Ensure that vhost_work_queue() can't add the new callbacks
>
> 2. Call vhost_dev_flush() to ensure that worker->work_list is empty
>

The problem is what do we do in the work->fn.

What you wrote is correct for cleaning up the work_list. However, the lower level
vhost drivers, like vhost-scsi, will do something like:

async_submit_request_to_storage/net_layer()

from their work->fn. The submission is async so when the request completes it
calls some callbacks that call into the vhost driver and vhost layer. For
vhost-scsi the call back will run vhost_queue_work so we can complete the request
from the vhost_task.

So if we've already run the work->fn then we need to add code to handle the
completion of the request we submitted. We need:

1. vhost_queue_work needs some code to detect when the vhost_task has exited
so we don't do vhost_task_wake on a freed task.

I was saying for this, we can sprinkle some RCU in there and in the code paths
we cleanup the vhost_task.

2. The next problem is that if the vhost_task is going to just loop over the
work_list and kill those works before it exits (or if we do it from the vhost_dev_flush
function), then we still have handle those async requests that got kicked off to
some other layer that are going to eventually complete and try to call
vhost_work_queue.

With #1, we can detect when the vhost_task is no longer usable, so we then need
to modify the drivers to detect that and instead of trying to execute like normal
where they queue the work, they just take their failure paths and free resources.

So the release cabllback was doing 2 things:
1. Flushing the work_list
2. Waiting on the those request completions

And so I was saying before I'm trying to finish up handling #2. I hit some
hiccups though because it turns out there is at least one case where we
don't have a vhost_task but we don't want to fail. It's just a matter of
coding it though.

2023-06-11 21:01:26

[permalink] [raw]

Subject: Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

Oleg Nesterov <[email protected]> writes:

> On 06/06, Mike Christie wrote:
>>
>> On 6/6/23 7:16 AM, Oleg Nesterov wrote:
>> > On 06/05, Mike Christie wrote:
>> >
>> >> So it works like if we were using a kthread still:
>> >>
>> >> 1. Userapce thread0 opens /dev/vhost-$something.
>> >> 2. thread0 does VHOST_SET_OWNER ioctl. This calls vhost_task_create() to
>> >> create the task_struct which runs the vhost_worker() function which handles
>> >> the work->fns.
>> >> 3. If userspace now does a SIGKILL or just exits without doing a close() on
>> >> /dev/vhost-$something, then when thread0 does exit_files() that will do the
>> >> fput that does vhost-$something's file_operations->release.
>> >
>> > So, at least in this simple case vhost_worker() can just exit after SIGKILL,
>> > and thread0 can flush the outstanding commands when it calls vhost_dev_flush()
>> > rather than wait for vhost_worker().
>> >
>> > Right?
>>
>> With the current code, the answer is no. We would hang like I mentioned here:
>>
>> https://lore.kernel.org/lkml/[email protected]/
>
> If only I could fully understand this email ;)
>
> Could you spell to explain why this can't work (again, in this simple case) ?
>
> My current (and I know, very poor) understanding is that .release() should
> roughly do the following:
>
> 1. Ensure that vhost_work_queue() can't add the new callbacks
>
> 2. Call vhost_dev_flush() to ensure that worker->work_list is empty
>
> 3. Call vhost_task_stop()

At least in the case of exec by the time the final fput happens
from close_on_exec the task has already changed it's mm. So the
conditions are wrong to run the work queue items.

For close(2) and SIGKILL perhaps, but definitely not in the case of
exec.

Eric

2023-06-14 06:31:21

[permalink] [raw]

Subject: Can vhost translate to io_uring?

I am sad my idea for simplifying things did not work out.

Let's try an even bigger idea to reduce maintenance and simplify things.

Could vhost depend on io_uring?

Could vhost just be a translation layer of existing vhost requests to
io_uring requests?

At a quick glance it looks like io_uring already supports the
functionality that vhost supports (which I think is networking and
scsi).

If vhost could become a translation layer that would allow removing
the vhost worker and PF_USER_WORKER could be removed completely,
leaving only PF_IO_WORKER.

I suggest this because a significant vhost change is needed because in
the long term the hacks in exec and coredump are not a good idea. Which
means something like my failed "[PATCH v3] fork, vhost: Use CLONE_THREAD
to fix freezer/ps regression".

If we have to go to all of the trouble of reworking things it why can't
we just make io_uring do all of the work?

Eric

2023-06-14 07:01:50

[permalink] [raw]

Subject: Re: Can vhost translate to io_uring?

On 6/14/23 1:02 AM, Eric W. Biederman wrote:
>
> I am sad my idea for simplifying things did not work out.
>
>
> Let's try an even bigger idea to reduce maintenance and simplify things.
>
> Could vhost depend on io_uring?
>
> Could vhost just be a translation layer of existing vhost requests to
> io_uring requests?
>
> At a quick glance it looks like io_uring already supports the
> functionality that vhost supports (which I think is networking and
> scsi).
>
> If vhost could become a translation layer that would allow removing
> the vhost worker and PF_USER_WORKER could be removed completely,
> leaving only PF_IO_WORKER.
>
>
> I suggest this because a significant vhost change is needed because in

It would be nice if the vhost layer could use the io-wq code as sort of
generic worker. I can look into what that would take if Jens is ok
with that type of thing.

For vhost, I just submitted a patch to the vhost layer that allow us to
switch out the vhost worker thread when IO is running:

https://lists.linuxfoundation.org/pipermail/virtualization/2023-June/067246.html

After that patch, I just need to modify vhost_worker/vhost_task_fn so
when get_signal returns true we set the worker to NULL and do a synchronize_rcu.
Then I just need to modify vhost-scsi so it detects when the worker is NULL
and modify the flush code so it handles work that didn't get to run.

> the long term the hacks in exec and coredump are not a good idea. Which
> means something like my failed "[PATCH v3] fork, vhost: Use CLONE_THREAD
> to fix freezer/ps regression".
>
> If we have to go to all of the trouble of reworking things it why can't
> we just make io_uring do all of the work?
>
> Eric

2023-06-14 14:38:33

by Michael S. Tsirkin

[permalink] [raw]

Subject: Re: Can vhost translate to io_uring?

On Wed, Jun 14, 2023 at 01:02:58AM -0500, Eric W. Biederman wrote:
>
> I am sad my idea for simplifying things did not work out.
>
>
> Let's try an even bigger idea to reduce maintenance and simplify things.
>
> Could vhost depend on io_uring?
>
> Could vhost just be a translation layer of existing vhost requests to
> io_uring requests?

I expect that's going to have a measureable performance impact.

> At a quick glance it looks like io_uring already supports the
> functionality that vhost supports (which I think is networking and
> scsi).
>
> If vhost could become a translation layer that would allow removing
> the vhost worker and PF_USER_WORKER could be removed completely,
> leaving only PF_IO_WORKER.
>
>
> I suggest this because a significant vhost change is needed because in
> the long term the hacks in exec and coredump are not a good idea. Which
> means something like my failed "[PATCH v3] fork, vhost: Use CLONE_THREAD
> to fix freezer/ps regression".
>
> If we have to go to all of the trouble of reworking things it why can't
> we just make io_uring do all of the work?
>
> Eric

2023-06-14 15:42:17

by Michael S. Tsirkin

[permalink] [raw]

Subject: Re: Can vhost translate to io_uring?

On Wed, Jun 14, 2023 at 01:02:58AM -0500, Eric W. Biederman wrote:
> At a quick glance it looks like io_uring already supports the
> functionality that vhost supports (which I think is networking and
> scsi).

There's vsock too.

--
MST

2023-06-14 15:44:52

by Jens Axboe

[permalink] [raw]

Subject: Re: Can vhost translate to io_uring?

On 6/14/23 12:25?AM, [email protected] wrote:
> On 6/14/23 1:02 AM, Eric W. Biederman wrote:
>>
>> I am sad my idea for simplifying things did not work out.
>>
>>
>> Let's try an even bigger idea to reduce maintenance and simplify things.
>>
>> Could vhost depend on io_uring?
>>
>> Could vhost just be a translation layer of existing vhost requests to
>> io_uring requests?
>>
>> At a quick glance it looks like io_uring already supports the
>> functionality that vhost supports (which I think is networking and
>> scsi).
>>
>> If vhost could become a translation layer that would allow removing
>> the vhost worker and PF_USER_WORKER could be removed completely,
>> leaving only PF_IO_WORKER.
>>
>>
>> I suggest this because a significant vhost change is needed because in
>
> It would be nice if the vhost layer could use the io-wq code as sort of
> generic worker. I can look into what that would take if Jens is ok
> with that type of thing.

Certainly. io-wq is mostly generic, eg it has no understanding of
io_uring internals or commands and structs, and it should be possible to
just setup a struct io_wq and use that.

Obviously might need a bit of refactoring work and exporting of symbols,
io_uring is y/n so we don't export anything. But I think it should all
be minor work, really.

--
Jens Axboe

2023-06-14 17:54:08

[permalink] [raw]

Subject: Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 06/11, Eric W. Biederman wrote:
>
> Oleg Nesterov <[email protected]> writes:
>
> > Could you spell to explain why this can't work (again, in this simple case) ?
> >
> > My current (and I know, very poor) understanding is that .release() should
> > roughly do the following:
> >
> > 1. Ensure that vhost_work_queue() can't add the new callbacks
> >
> > 2. Call vhost_dev_flush() to ensure that worker->work_list is empty
> >
> > 3. Call vhost_task_stop()
>
> At least in the case of exec by the time the final fput happens
> from close_on_exec the task has already changed it's mm.

Of course you are right.

But can't resist, please note that I only meant "this simple case" which
doesn't include exec/etc.

Nevermind. As Mike explains there are more problems even in this particular
"simple" case, and I am not surprised.

Sorry for noise,

Oleg.

2023-06-14 18:43:22