2023-05-22 02:58:18

by Mike Christie

[permalink] [raw]
Subject: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

When switching from kthreads to vhost_tasks two bugs were added:
1. The vhost worker tasks's now show up as processes so scripts doing ps
or ps a would not incorrectly detect the vhost task as another process.
2. kthreads disabled freeze by setting PF_NOFREEZE, but vhost tasks's
didn't disable or add support for them.

To fix both bugs, this switches the vhost task to be thread in the
process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
SIGKILL/STOP support is required because CLONE_THREAD requires
CLONE_SIGHAND which requires those 2 signals to be suppported.

This a modified version of patch originally written by Linus which
handles his review comment to himself to rename ignore_signals to
block_signals to better represent what it now does. And it includes a
change to vhost_worker() to support SIGSTOP/KILL and freeze, and it
drops the wait use per Oleg's review comment that it's no longer needed
when using CLONE_THREAD.

Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
Signed-off-by: Mike Christie <[email protected]>
---
drivers/vhost/vhost.c | 17 ++++++++++++++++-
include/linux/sched/task.h | 2 +-
kernel/fork.c | 12 +++---------
kernel/signal.c | 1 +
kernel/vhost_task.c | 16 ++++------------
5 files changed, 25 insertions(+), 23 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..bf83e9340e40 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -338,6 +338,7 @@ static int vhost_worker(void *data)
struct vhost_worker *worker = data;
struct vhost_work *work, *work_next;
struct llist_node *node;
+ bool dead = false;

for (;;) {
/* mb paired w/ kthread_stop */
@@ -349,8 +350,22 @@ static int vhost_worker(void *data)
}

node = llist_del_all(&worker->work_list);
- if (!node)
+ if (!node) {
schedule();
+ /*
+ * When we get a SIGKILL our release function will
+ * be called. That will stop new IOs from being queued
+ * and check for outstanding cmd responses. It will then
+ * call vhost_task_stop to tell us to return and exit.
+ */
+ if (!dead && signal_pending(current)) {
+ struct ksignal ksig;
+
+ dead = get_signal(&ksig);
+ if (dead)
+ clear_thread_flag(TIF_SIGPENDING);
+ }
+ }

node = llist_reverse_order(node);
/* make sure flag is seen after deletion */
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..249a5ece9def 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,7 @@ struct kernel_clone_args {
u32 io_thread:1;
u32 user_worker:1;
u32 no_files:1;
- u32 ignore_signals:1;
+ u32 block_signals:1;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..9e04ab5c3946 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process(
p->flags |= PF_KTHREAD;
if (args->user_worker)
p->flags |= PF_USER_WORKER;
- if (args->io_thread) {
- /*
- * Mark us an IO worker, and block any signal that isn't
- * fatal or STOP
- */
+ if (args->io_thread)
p->flags |= PF_IO_WORKER;
+ if (args->block_signals)
siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
- }

if (args->name)
strscpy_pad(p->comm, args->name, sizeof(p->comm));
@@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;

- if (args->ignore_signals)
- ignore_signals(p);
-
stackleak_task_init(p);

if (pid != &init_struct_pid) {
@@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
.fn_arg = arg,
.io_thread = 1,
.user_worker = 1,
+ .block_signals = 1,
};

return copy_process(NULL, 0, node, &args);
diff --git a/kernel/signal.c b/kernel/signal.c
index 8050fe23c732..a0f00a078cbb 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2891,6 +2891,7 @@ bool get_signal(struct ksignal *ksig)

return ksig->sig > 0;
}
+EXPORT_SYMBOL_GPL(get_signal);

/**
* signal_delivered - called after signal delivery to update blocked signals
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..7a2d7d9fe772 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -31,22 +31,13 @@ static int vhost_task_fn(void *data)
*/
void vhost_task_stop(struct vhost_task *vtsk)
{
- pid_t pid = vtsk->task->pid;
-
set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
wake_up_process(vtsk->task);
/*
* Make sure vhost_task_fn is no longer accessing the vhost_task before
- * freeing it below. If userspace crashed or exited without closing,
- * then the vhost_task->task could already be marked dead so
- * kernel_wait will return early.
+ * freeing it below.
*/
wait_for_completion(&vtsk->exited);
- /*
- * If we are just closing/removing a device and the parent process is
- * not exiting then reap the task.
- */
- kernel_wait4(pid, NULL, __WCLONE, NULL);
kfree(vtsk);
}
EXPORT_SYMBOL_GPL(vhost_task_stop);
@@ -75,13 +66,14 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
const char *name)
{
struct kernel_clone_args args = {
- .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+ .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM |
+ CLONE_THREAD | CLONE_SIGHAND,
.exit_signal = 0,
.fn = vhost_task_fn,
.name = name,
.user_worker = 1,
.no_files = 1,
- .ignore_signals = 1,
+ .block_signals = 1,
};
struct vhost_task *vtsk;
struct task_struct *tsk;
--
2.25.1



2023-05-22 13:21:23

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

Confused, please help...

On 05/21, Mike Christie wrote:
>
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -338,6 +338,7 @@ static int vhost_worker(void *data)
> struct vhost_worker *worker = data;
> struct vhost_work *work, *work_next;
> struct llist_node *node;
> + bool dead = false;
>
> for (;;) {
> /* mb paired w/ kthread_stop */
> @@ -349,8 +350,22 @@ static int vhost_worker(void *data)
> }
>
> node = llist_del_all(&worker->work_list);
> - if (!node)
> + if (!node) {
> schedule();
> + /*
> + * When we get a SIGKILL our release function will
> + * be called. That will stop new IOs from being queued
> + * and check for outstanding cmd responses. It will then
> + * call vhost_task_stop to tell us to return and exit.
> + */

But who will call the release function / vhost_task_stop() and when this
will happen after this thread gets SIGKILL ?

> + if (!dead && signal_pending(current)) {
> + struct ksignal ksig;
> +
> + dead = get_signal(&ksig);
> + if (dead)
> + clear_thread_flag(TIF_SIGPENDING);

If you do clear_thread_flag(TIF_SIGPENDING), then why do we need 1/3 ?


Also. Suppose that vhost_worker() dequeues SIGKILL and clears TIF_SIGPENDING.

SIGSTOP, PTRACE_INTERRUPT, freezer can come and set TIF_SIGPENDING again.
In this case the main for (;;) loop will spin without sleeping until
vhost_task_should_stop() becomes true?

Oleg.


2023-05-22 17:23:30

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 5/22/23 7:30 AM, Oleg Nesterov wrote:
> Confused, please help...
>
> On 05/21, Mike Christie wrote:
>>
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -338,6 +338,7 @@ static int vhost_worker(void *data)
>> struct vhost_worker *worker = data;
>> struct vhost_work *work, *work_next;
>> struct llist_node *node;
>> + bool dead = false;
>>
>> for (;;) {
>> /* mb paired w/ kthread_stop */
>> @@ -349,8 +350,22 @@ static int vhost_worker(void *data)
>> }
>>
>> node = llist_del_all(&worker->work_list);
>> - if (!node)
>> + if (!node) {
>> schedule();
>> + /*
>> + * When we get a SIGKILL our release function will
>> + * be called. That will stop new IOs from being queued
>> + * and check for outstanding cmd responses. It will then
>> + * call vhost_task_stop to tell us to return and exit.
>> + */
>
> But who will call the release function / vhost_task_stop() and when this
> will happen after this thread gets SIGKILL ?

When we get a SIGKILL, the thread that owns the device/vhost_task will
also exit since it's the same thread group and it does:

do_exit -> exit_files -> put_files_struct -> close_files -> fput

which eventually (the actual put is done via the delayed work path
in there) runs the file_operation->release.

So the release function is being called in parallel more or less as the
code above.

>
>> + if (!dead && signal_pending(current)) {
>> + struct ksignal ksig;
>> +
>> + dead = get_signal(&ksig);
>> + if (dead)
>> + clear_thread_flag(TIF_SIGPENDING);
>
> If you do clear_thread_flag(TIF_SIGPENDING), then why do we need 1/3 ?

You're right. I don't need it. I thought __fatal_signal_pending checked
the shared pending as well but it only checks pending.

> Also. Suppose that vhost_worker() dequeues SIGKILL and clears TIF_SIGPENDING.
>
> SIGSTOP, PTRACE_INTERRUPT, freezer can come and set TIF_SIGPENDING again.
> In this case the main for (;;) loop will spin without sleeping until
> vhost_task_should_stop() becomes true?

I see. So I either have to be able to call get_signal after SIGKILL or
at this time work like a kthread and ignore signals like a

if (dead && signal_pending())
flush_signals()
?


2023-05-22 17:52:13

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 05/22, Mike Christie wrote:
>
> On 5/22/23 7:30 AM, Oleg Nesterov wrote:
> >> + /*
> >> + * When we get a SIGKILL our release function will
> >> + * be called. That will stop new IOs from being queued
> >> + * and check for outstanding cmd responses. It will then
> >> + * call vhost_task_stop to tell us to return and exit.
> >> + */
> >
> > But who will call the release function / vhost_task_stop() and when this
> > will happen after this thread gets SIGKILL ?
>
> When we get a SIGKILL, the thread that owns the device/vhost_task will
> also exit since it's the same thread group and it does:
>
> do_exit -> exit_files -> put_files_struct -> close_files -> fput

Ah. thanks. I confused CLONE_FS in vhost_task_create() with CLONE_FILES.

> > Also. Suppose that vhost_worker() dequeues SIGKILL and clears TIF_SIGPENDING.
> >
> > SIGSTOP, PTRACE_INTERRUPT, freezer can come and set TIF_SIGPENDING again.
> > In this case the main for (;;) loop will spin without sleeping until
> > vhost_task_should_stop() becomes true?
>
> I see. So I either have to be able to call get_signal after SIGKILL or
> at this time work like a kthread and ignore signals like a
>
> if (dead && signal_pending())
> flush_signals()
> ?

Right now I think that "int dead" should die, and you should simply do
get_signal() + clear(SIGPENDING) if signal_pending() == T , but let me
think tomorrow.

Oleg.


2023-05-22 20:18:01

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On Sun, May 21, 2023 at 09:51:24PM -0500, Mike Christie wrote:
> When switching from kthreads to vhost_tasks two bugs were added:
> 1. The vhost worker tasks's now show up as processes so scripts doing ps
> or ps a would not incorrectly detect the vhost task as another process.
> 2. kthreads disabled freeze by setting PF_NOFREEZE, but vhost tasks's
> didn't disable or add support for them.
>
> To fix both bugs, this switches the vhost task to be thread in the
> process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
> get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
> SIGKILL/STOP support is required because CLONE_THREAD requires
> CLONE_SIGHAND which requires those 2 signals to be suppported.
>
> This a modified version of patch originally written by Linus which
> handles his review comment to himself to rename ignore_signals to
> block_signals to better represent what it now does. And it includes a
> change to vhost_worker() to support SIGSTOP/KILL and freeze, and it
> drops the wait use per Oleg's review comment that it's no longer needed
> when using CLONE_THREAD.
>
> Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
> Signed-off-by: Mike Christie <[email protected]>
> ---
> drivers/vhost/vhost.c | 17 ++++++++++++++++-
> include/linux/sched/task.h | 2 +-
> kernel/fork.c | 12 +++---------
> kernel/signal.c | 1 +
> kernel/vhost_task.c | 16 ++++------------
> 5 files changed, 25 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index a92af08e7864..bf83e9340e40 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -338,6 +338,7 @@ static int vhost_worker(void *data)
> struct vhost_worker *worker = data;
> struct vhost_work *work, *work_next;
> struct llist_node *node;
> + bool dead = false;
>
> for (;;) {
> /* mb paired w/ kthread_stop */
> @@ -349,8 +350,22 @@ static int vhost_worker(void *data)
> }
>
> node = llist_del_all(&worker->work_list);
> - if (!node)
> + if (!node) {
> schedule();
> + /*
> + * When we get a SIGKILL our release function will
> + * be called. That will stop new IOs from being queued
> + * and check for outstanding cmd responses. It will then
> + * call vhost_task_stop to tell us to return and exit.
> + */
> + if (!dead && signal_pending(current)) {
> + struct ksignal ksig;
> +
> + dead = get_signal(&ksig);
> + if (dead)
> + clear_thread_flag(TIF_SIGPENDING);


Does get_signal actually return true only on SIGKILL then?

> + }
> + }
>
> node = llist_reverse_order(node);
> /* make sure flag is seen after deletion */
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 537cbf9a2ade..249a5ece9def 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -29,7 +29,7 @@ struct kernel_clone_args {
> u32 io_thread:1;
> u32 user_worker:1;
> u32 no_files:1;
> - u32 ignore_signals:1;
> + u32 block_signals:1;
> unsigned long stack;
> unsigned long stack_size;
> unsigned long tls;
> diff --git a/kernel/fork.c b/kernel/fork.c
> index ed4e01daccaa..9e04ab5c3946 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process(
> p->flags |= PF_KTHREAD;
> if (args->user_worker)
> p->flags |= PF_USER_WORKER;
> - if (args->io_thread) {
> - /*
> - * Mark us an IO worker, and block any signal that isn't
> - * fatal or STOP
> - */
> + if (args->io_thread)
> p->flags |= PF_IO_WORKER;
> + if (args->block_signals)
> siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
> - }
>
> if (args->name)
> strscpy_pad(p->comm, args->name, sizeof(p->comm));
> @@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process(
> if (retval)
> goto bad_fork_cleanup_io;
>
> - if (args->ignore_signals)
> - ignore_signals(p);
> -
> stackleak_task_init(p);
>
> if (pid != &init_struct_pid) {
> @@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
> .fn_arg = arg,
> .io_thread = 1,
> .user_worker = 1,
> + .block_signals = 1,
> };
>
> return copy_process(NULL, 0, node, &args);
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 8050fe23c732..a0f00a078cbb 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2891,6 +2891,7 @@ bool get_signal(struct ksignal *ksig)
>
> return ksig->sig > 0;
> }
> +EXPORT_SYMBOL_GPL(get_signal);

If you are exporting this, could you add documentation please?


> /**
> * signal_delivered - called after signal delivery to update blocked signals
> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
> index b7cbd66f889e..7a2d7d9fe772 100644
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -31,22 +31,13 @@ static int vhost_task_fn(void *data)
> */
> void vhost_task_stop(struct vhost_task *vtsk)
> {
> - pid_t pid = vtsk->task->pid;
> -
> set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
> wake_up_process(vtsk->task);
> /*
> * Make sure vhost_task_fn is no longer accessing the vhost_task before
> - * freeing it below. If userspace crashed or exited without closing,
> - * then the vhost_task->task could already be marked dead so
> - * kernel_wait will return early.
> + * freeing it below.
> */
> wait_for_completion(&vtsk->exited);
> - /*
> - * If we are just closing/removing a device and the parent process is
> - * not exiting then reap the task.
> - */
> - kernel_wait4(pid, NULL, __WCLONE, NULL);
> kfree(vtsk);
> }
> EXPORT_SYMBOL_GPL(vhost_task_stop);
> @@ -75,13 +66,14 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
> const char *name)
> {
> struct kernel_clone_args args = {
> - .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM,
> + .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM |
> + CLONE_THREAD | CLONE_SIGHAND,
> .exit_signal = 0,
> .fn = vhost_task_fn,
> .name = name,
> .user_worker = 1,
> .no_files = 1,
> - .ignore_signals = 1,
> + .block_signals = 1,
> };
> struct vhost_task *vtsk;
> struct task_struct *tsk;
> --
> 2.25.1


2023-05-23 12:38:00

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 05/22, Oleg Nesterov wrote:
>
> Right now I think that "int dead" should die,

No, probably we shouldn't call get_signal() if we have already dequeued SIGKILL.

> but let me think tomorrow.

May be something like this... I don't like it but I can't suggest anything better
right now.

bool killed = false;

for (;;) {
...

node = llist_del_all(&worker->work_list);
if (!node) {
schedule();
/*
* When we get a SIGKILL our release function will
* be called. That will stop new IOs from being queued
* and check for outstanding cmd responses. It will then
* call vhost_task_stop to tell us to return and exit.
*/
if (signal_pending(current)) {
struct ksignal ksig;

if (!killed)
killed = get_signal(&ksig);

clear_thread_flag(TIF_SIGPENDING);
}

continue;
}

-------------------------------------------------------------------------------
But let me ask a couple of questions. Let's forget this patch, let's look at the
current code:

node = llist_del_all(&worker->work_list);
if (!node)
schedule();

node = llist_reverse_order(node);
... process works ...

To me this looks a bit confusing. Shouldn't we do

if (!node) {
schedule();
continue;
}

just to make the code a bit more clear? If node == NULL then
llist_reverse_order() and llist_for_each_entry_safe() will do nothing.
But this is minor.



/* make sure flag is seen after deletion */
smp_wmb();
llist_for_each_entry_safe(work, work_next, node, node) {
clear_bit(VHOST_WORK_QUEUED, &work->flags);

I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
vhost_work_queue() can add this work again and change work->node->next.

That is why we use _safe, but we need to ensure that llist_for_each_safe()
completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.

So it seems that smp_wmb() can't help and should be removed, instead we need

llist_for_each_entry_safe(...) {
smp_mb__before_atomic();
clear_bit(VHOST_WORK_QUEUED, &work->flags);

Also, if the work->fn pointer is not stable, we should read it before
smp_mb__before_atomic() as well.

No?


__set_current_state(TASK_RUNNING);

Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn()
can return with current->state != RUNNING ?


work->fn(work);

Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
before we call work->fn(). Is it "safe" to run this callback with
signal_pending() or fatal_signal_pending() ?


Finally. I never looked into drivers/vhost/ before so I don't understand
this code at all, but let me ask anyway... Can we change vhost_dev_flush()
to run the pending callbacks rather than wait for vhost_worker() ?
I guess we can't, ->mm won't be correct, but can you confirm?

Oleg.


2023-05-23 16:03:05

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

"Michael S. Tsirkin" <[email protected]> writes:

> On Sun, May 21, 2023 at 09:51:24PM -0500, Mike Christie wrote:
>> When switching from kthreads to vhost_tasks two bugs were added:
>> 1. The vhost worker tasks's now show up as processes so scripts doing ps
>> or ps a would not incorrectly detect the vhost task as another process.
>> 2. kthreads disabled freeze by setting PF_NOFREEZE, but vhost tasks's
>> didn't disable or add support for them.
>>
>> To fix both bugs, this switches the vhost task to be thread in the
>> process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
>> get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
>> SIGKILL/STOP support is required because CLONE_THREAD requires
>> CLONE_SIGHAND which requires those 2 signals to be suppported.
>>
>> This a modified version of patch originally written by Linus which
>> handles his review comment to himself to rename ignore_signals to
>> block_signals to better represent what it now does. And it includes a
>> change to vhost_worker() to support SIGSTOP/KILL and freeze, and it
>> drops the wait use per Oleg's review comment that it's no longer needed
>> when using CLONE_THREAD.
>>
>> Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
>> Signed-off-by: Mike Christie <[email protected]>
>> ---
>> drivers/vhost/vhost.c | 17 ++++++++++++++++-
>> include/linux/sched/task.h | 2 +-
>> kernel/fork.c | 12 +++---------
>> kernel/signal.c | 1 +
>> kernel/vhost_task.c | 16 ++++------------
>> 5 files changed, 25 insertions(+), 23 deletions(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index a92af08e7864..bf83e9340e40 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -338,6 +338,7 @@ static int vhost_worker(void *data)
>> struct vhost_worker *worker = data;
>> struct vhost_work *work, *work_next;
>> struct llist_node *node;
>> + bool dead = false;
>>
>> for (;;) {
>> /* mb paired w/ kthread_stop */
>> @@ -349,8 +350,22 @@ static int vhost_worker(void *data)
>> }
>>
>> node = llist_del_all(&worker->work_list);
>> - if (!node)
>> + if (!node) {
>> schedule();
>> + /*
>> + * When we get a SIGKILL our release function will
>> + * be called. That will stop new IOs from being queued
>> + * and check for outstanding cmd responses. It will then
>> + * call vhost_task_stop to tell us to return and exit.
>> + */
>> + if (!dead && signal_pending(current)) {
>> + struct ksignal ksig;
>> +
>> + dead = get_signal(&ksig);
>> + if (dead)
>> + clear_thread_flag(TIF_SIGPENDING);
>
>
> Does get_signal actually return true only on SIGKILL then?

get_signal returns the signal that was dequeued, or 0 if no signal was
dequeued.

For these threads that block all but SIGSTOP and SIGKILL get_signal
should properly never return any signal. As SIGSTOP and SIGKILL are
handled internally to get_signal.

However get_signal was modified to have a special case for io_uring
and now the vhost code so that extra cleanup can be performed, before
do_exit is called on the thread. That special case causes get_signal
to return when with the value of SIGKILL when the process exits.

The process can exit with do_group_exit aka exit(2) or any fatal signal
not just SIGKILL, and get_signal on these threads will return.

Eric

2023-05-23 16:20:22

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 5/22/23 2:40 PM, Michael S. Tsirkin wrote:
>> return copy_process(NULL, 0, node, &args);
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 8050fe23c732..a0f00a078cbb 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -2891,6 +2891,7 @@ bool get_signal(struct ksignal *ksig)
>>
>> return ksig->sig > 0;
>> }
>> +EXPORT_SYMBOL_GPL(get_signal);
>
> If you are exporting this, could you add documentation please?
>

Ok.



2023-05-23 16:22:24

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

Oleg Nesterov <[email protected]> writes:

> On 05/22, Oleg Nesterov wrote:
>>
>> Right now I think that "int dead" should die,
>
> No, probably we shouldn't call get_signal() if we have already
> dequeued SIGKILL.

Very much agreed. It is one thing to add a patch to move do_exit
out of get_signal. It is another to keep calling get_signal after
that. Nothing tests that case, and so we get some weird behaviors.


>> but let me think tomorrow.
>
> May be something like this... I don't like it but I can't suggest anything better
> right now.
>
> bool killed = false;
>
> for (;;) {
> ...
>
> node = llist_del_all(&worker->work_list);
> if (!node) {
> schedule();
> /*
> * When we get a SIGKILL our release function will
> * be called. That will stop new IOs from being queued
> * and check for outstanding cmd responses. It will then
> * call vhost_task_stop to tell us to return and exit.
> */
> if (signal_pending(current)) {
> struct ksignal ksig;
>
> if (!killed)
> killed = get_signal(&ksig);
>
> clear_thread_flag(TIF_SIGPENDING);
> }
>
> continue;
> }

I want to point out that we need to consider not just SIGKILL, but
SIGABRT that causes a coredump, as well as the process peforming
an ordinary exit(2). All of which will cause get_signal to return
SIGKILL in this context.

>
> -------------------------------------------------------------------------------
> But let me ask a couple of questions.

I share most of these questions.

> Let's forget this patch, let's look at the
> current code:
>
> node = llist_del_all(&worker->work_list);
> if (!node)
> schedule();
>
> node = llist_reverse_order(node);
> ... process works ...
>
> To me this looks a bit confusing. Shouldn't we do
>
> if (!node) {
> schedule();
> continue;
> }
>
> just to make the code a bit more clear? If node == NULL then
> llist_reverse_order() and llist_for_each_entry_safe() will do nothing.
> But this is minor.
>
>
>
> /* make sure flag is seen after deletion */
> smp_wmb();
> llist_for_each_entry_safe(work, work_next, node, node) {
> clear_bit(VHOST_WORK_QUEUED, &work->flags);
>
> I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
> vhost_work_queue() can add this work again and change work->node->next.
>
> That is why we use _safe, but we need to ensure that llist_for_each_safe()
> completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.
>
> So it seems that smp_wmb() can't help and should be removed, instead we need
>
> llist_for_each_entry_safe(...) {
> smp_mb__before_atomic();
> clear_bit(VHOST_WORK_QUEUED, &work->flags);
>
> Also, if the work->fn pointer is not stable, we should read it before
> smp_mb__before_atomic() as well.
>
> No?
>
>
> __set_current_state(TASK_RUNNING);
>
> Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn()
> can return with current->state != RUNNING ?
>
>
> work->fn(work);
>
> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
> before we call work->fn(). Is it "safe" to run this callback with
> signal_pending() or fatal_signal_pending() ?
>
>
> Finally. I never looked into drivers/vhost/ before so I don't understand
> this code at all, but let me ask anyway... Can we change vhost_dev_flush()
> to run the pending callbacks rather than wait for vhost_worker() ?
> I guess we can't, ->mm won't be correct, but can you confirm?

In a conversation long ago I remember hearing that vhost does not
support file descriptor passing. Which means all of the file
descriptors should be in the same process.

Looking at the vhost code what I am seeing happening is that the
vhost_worker persists until vhost_dev_cleanup is called from
one of the vhost_???_release() functions. The release functions
are only called after the last flush function completes. See __fput
if you want to trace the details.


On one hand this all seems reasonable. On the other hand I am not
seeing the code that prevents file descriptor passing.


It is probably not the worst thing in the world, but what this means
is now if you pass a copy of the vhost file descriptor to another
process the vhost_worker will persist, and thus the process will persist
until that copy of the file descriptor is closed.

Eric


2023-05-24 00:18:48

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

Hey Oleg,

For all these questions below let me get back to you by tomorrow.
I need to confirm if something would be considered a regression by
the core vhost developers.

On 5/23/23 7:15 AM, Oleg Nesterov wrote:
> On 05/22, Oleg Nesterov wrote:
>>
>> Right now I think that "int dead" should die,
>
> No, probably we shouldn't call get_signal() if we have already dequeued SIGKILL.
>
>> but let me think tomorrow.
>
> May be something like this... I don't like it but I can't suggest anything better
> right now.
>
> bool killed = false;
>
> for (;;) {
> ...
>
> node = llist_del_all(&worker->work_list);
> if (!node) {
> schedule();
> /*
> * When we get a SIGKILL our release function will
> * be called. That will stop new IOs from being queued
> * and check for outstanding cmd responses. It will then
> * call vhost_task_stop to tell us to return and exit.
> */
> if (signal_pending(current)) {
> struct ksignal ksig;
>
> if (!killed)
> killed = get_signal(&ksig);
>
> clear_thread_flag(TIF_SIGPENDING);
> }
>
> continue;
> }
>
> -------------------------------------------------------------------------------
> But let me ask a couple of questions. Let's forget this patch, let's look at the
> current code:
>
> node = llist_del_all(&worker->work_list);
> if (!node)
> schedule();
>
> node = llist_reverse_order(node);
> ... process works ...
>
> To me this looks a bit confusing. Shouldn't we do
>
> if (!node) {
> schedule();
> continue;
> }
>
> just to make the code a bit more clear? If node == NULL then
> llist_reverse_order() and llist_for_each_entry_safe() will do nothing.
> But this is minor.
>
>
>
> /* make sure flag is seen after deletion */
> smp_wmb();
> llist_for_each_entry_safe(work, work_next, node, node) {
> clear_bit(VHOST_WORK_QUEUED, &work->flags);
>
> I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
> vhost_work_queue() can add this work again and change work->node->next.
>
> That is why we use _safe, but we need to ensure that llist_for_each_safe()
> completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.
>
> So it seems that smp_wmb() can't help and should be removed, instead we need
>
> llist_for_each_entry_safe(...) {
> smp_mb__before_atomic();
> clear_bit(VHOST_WORK_QUEUED, &work->flags);
>
> Also, if the work->fn pointer is not stable, we should read it before
> smp_mb__before_atomic() as well.
>
> No?
>
>
> __set_current_state(TASK_RUNNING);
>
> Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn()
> can return with current->state != RUNNING ?
>
>
> work->fn(work);
>
> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
> before we call work->fn(). Is it "safe" to run this callback with
> signal_pending() or fatal_signal_pending() ?
>
>
> Finally. I never looked into drivers/vhost/ before so I don't understand
> this code at all, but let me ask anyway... Can we change vhost_dev_flush()
> to run the pending callbacks rather than wait for vhost_worker() ?
> I guess we can't, ->mm won't be correct, but can you confirm?
>
> Oleg.
>


2023-05-24 14:27:52

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 05/23, Eric W. Biederman wrote:
>
> I want to point out that we need to consider not just SIGKILL, but
> SIGABRT that causes a coredump, as well as the process peforming
> an ordinary exit(2). All of which will cause get_signal to return
> SIGKILL in this context.

Yes, but probably SIGABRT/exit doesn't really differ from SIGKILL wrt
vhost_worker().

> It is probably not the worst thing in the world, but what this means
> is now if you pass a copy of the vhost file descriptor to another
> process the vhost_worker will persist, and thus the process will persist
> until that copy of the file descriptor is closed.

Hadn't thought about it.

I am fighting with internal bugzillas today, will try to write another
email tomorrow.

But before that, I would like to have an answer to my "main" question in
my previois email. Otherwise I am still not sure I understand what exactly
we need to fix.

Oleg.


2023-05-24 15:04:19

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

Oleg Nesterov <[email protected]> writes:

> On 05/23, Eric W. Biederman wrote:
>>
>> I want to point out that we need to consider not just SIGKILL, but
>> SIGABRT that causes a coredump, as well as the process peforming
>> an ordinary exit(2). All of which will cause get_signal to return
>> SIGKILL in this context.
>
> Yes, but probably SIGABRT/exit doesn't really differ from SIGKILL wrt
> vhost_worker().

Actually I think it reveals that exiting with SIGABRT will cause
a deadlock.

coredump_wait will wait for all of the threads to reach
coredump_task_exit. Meanwhile vhost_worker is waiting for
all of the other threads to reach exit_files to close their
file descriptors.


So it looks like the final pieces of work will actually need to be moved
into to vhost_xxx_flush or vhost_xxx_release to avoid the exiting
threads from waiting on each other, instead of depending upon the
vhost_worker to do the work.

Which gets back to most of your other questions.

>> It is probably not the worst thing in the world, but what this means
>> is now if you pass a copy of the vhost file descriptor to another
>> process the vhost_worker will persist, and thus the process will persist
>> until that copy of the file descriptor is closed.
>
> Hadn't thought about it.
>
> I am fighting with internal bugzillas today, will try to write another
> email tomorrow.
>
> But before that, I would like to have an answer to my "main" question in
> my previois email. Otherwise I am still not sure I understand what exactly
> we need to fix.

Let me repeat your "main" question just for clarity here.

If a signal comes in after the signal_pending check but before the
"work->fn(work)" call is "work->fn(work)" expected to run correctly
with signal_pending() or fatal_signal_pending returning true?


Eric






2023-05-25 12:10:53

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 05/24, Eric W. Biederman wrote:
>
> Oleg Nesterov <[email protected]> writes:
>
> > Yes, but probably SIGABRT/exit doesn't really differ from SIGKILL wrt
> > vhost_worker().
>
> Actually I think it reveals that exiting with SIGABRT will cause
> a deadlock.
>
> coredump_wait will wait for all of the threads to reach
> coredump_task_exit. Meanwhile vhost_worker is waiting for
> all of the other threads to reach exit_files to close their
> file descriptors.

Indeed, I didn't think about this.


So why do we actually need CLONE_THREAD ? Can't vhost_worker() be a kernel thread?

kthread_create() won't be convenient, but how about kernel_thread() ? it inherits
mm/cgroups/rlimits/etc, kthread_stop() should work just fine.

Oleg.


2023-05-25 15:36:54

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

Oleg Nesterov <[email protected]> writes:

> On 05/24, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <[email protected]> writes:
>>
>> > Yes, but probably SIGABRT/exit doesn't really differ from SIGKILL wrt
>> > vhost_worker().
>>
>> Actually I think it reveals that exiting with SIGABRT will cause
>> a deadlock.
>>
>> coredump_wait will wait for all of the threads to reach
>> coredump_task_exit. Meanwhile vhost_worker is waiting for
>> all of the other threads to reach exit_files to close their
>> file descriptors.
>
> Indeed, I didn't think about this.
>
>
> So why do we actually need CLONE_THREAD ? Can't vhost_worker() be a kernel thread?
>
> kthread_create() won't be convenient, but how about kernel_thread() ? it inherits
> mm/cgroups/rlimits/etc, kthread_stop() should work just fine.

Basically with no patches that is where Linus's kernel is.

User complained about the new thread showing up in ps. So the
exploration of how we could use CLONE_THREAD instead of what is
currently merged began.

Eric

2023-05-25 16:27:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On Thu, May 25, 2023 at 8:30 AM Eric W. Biederman <[email protected]> wrote:
>
> Basically with no patches that is where Linus's kernel is.
>
> User complained about the new thread showing up in ps.

Well, not only that, but it actively broke existing workflows for
managing things. Showing up in 'ps' wasn't just some purely cosmetic
issue, but had semantic meaning.

And honestly, I think the core issue is that we should just make this
work. Kernel threads randomly switching to user memory threads was
wrong, so CLONE_VM is absolutely the right thing to do.

But while "CLONE_VM without real threading" is a very traditional
thing in Linux - it was the original model for clone(), after all - I
don't believe it is the *correct* model. There was a very real reason
clone() has grown CLONE_THREAD and friends.

So honestly, I really think we want to complete the vhost move to
CLONE_THREAD (and thus CLONE_SIGNAL).

Not because the old kthread model didn't _work_, but because it's
really really wrong to try to randomly take on user-space attributes
at run-time.

And once you do the "user threads in kernel space" model, at that
point you really do want to act like a proper thread. Both because of
that 'ps' issue (which is really just "show the world what your
relationship is), but simply because that is the modern threading
model that we use for everything else, and special cases are bad.

So I'd really like to finish this. Even if we end up with a hack or
two in signal handling that we can hopefully fix up later by having
vhost fix up some of its current assumptions.

It has worked wonderfully well for io_uring - but we *did* have quite
a bit of conversion patches over some time as people found issues.
Which is why I don't expect the vhost conevrsion to be entirely
pain-free either, and I don't think we necessarily have to get to a
"perfect and clean" state immediately, just a "working and
conceptually in the right direction" state.

Linus

2023-05-25 16:27:53

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 5/23/23 7:15 AM, Oleg Nesterov wrote:
>
> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
> before we call work->fn(). Is it "safe" to run this callback with
> signal_pending() or fatal_signal_pending() ?

The questions before this one I'll leave for the core vhost devs since
they know best.

For this question and the one below, when we get SIGKILL we think it's ok
to fail the operation as in it's ok to not execute it like normal (send
it down to the net/target/scsi/block layers for execution). However, we
need to do some processing of the work to release mem, refcounts, etc.

For example we use the vhost_task to handle completions from the layers
we interact with. So when we get a SIGKILL, we could have N commands in
the target/block/scsi/nvme layers below the vhost layer. When those
complete, the current code assumes we have the vhost_task to handle the
responses. So for pending works on that work_list, we can pretty easily
kill them. However, we don't have a way to kill those outstanding
requests to some other layer, so we have to wait and handle them
somewhere.

If you are saying that once we get SIGKILL then we just can't use the
task anymore and we have to drop down to workqueue/kthread or change up
the completion paths so they can execute in the context they are completed
from by lower levels, then I can code this. On the vhost side, it's just
really ugly vs the code we have now that used to use kthreads (or in
6.4-rc looks like a process) so we couldn't get signals and just had the
one code path that removes us.

From the vhost point of view signals are not useful to us and it's just
adding complexity for something that I don't think is useful to users.
However after discussing this with guys for a week and going over the
signal code, I can see your point of views where you guys are thinking its
ugly for the signal code to try and support what we want :)



>
>
> Finally. I never looked into drivers/vhost/ before so I don't understand
> this code at all, but let me ask anyway... Can we change vhost_dev_flush()
> to run the pending callbacks rather than wait for vhost_worker() ?
> I guess we can't, ->mm won't be correct, but can you confirm?
>
> Oleg.
>


2023-05-27 10:06:51

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

Linus Torvalds <[email protected]> writes:

> So I'd really like to finish this. Even if we end up with a hack or
> two in signal handling that we can hopefully fix up later by having
> vhost fix up some of its current assumptions.


The real sticky widget for me is how to handle one of these processes
coredumping. It really looks like it will result in a reliable hang.

Limiting ourselves to changes that will only affect vhost, all I can
see would be allowing the vhost_worker thread to exit as soon as
get_signal reports the process is exiting. Then vhost_dev_flush
would need to process the pending work.

Something like this:

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..fb5ebc50c553 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -234,14 +234,31 @@ EXPORT_SYMBOL_GPL(vhost_poll_stop);
void vhost_dev_flush(struct vhost_dev *dev)
{
struct vhost_flush_struct flush;
+ struct vhost_worker *worker = dev->worker;
+ struct llist_node *node, *head;
+
+ if (!worker)
+ return;
+
+ init_completion(&flush.wait_event);
+ vhost_work_init(&flush.work, vhost_flush_work);

- if (dev->worker) {
- init_completion(&flush.wait_event);
- vhost_work_init(&flush.work, vhost_flush_work);
+ vhost_work_queue(dev, &flush.work);

- vhost_work_queue(dev, &flush.work);
- wait_for_completion(&flush.wait_event);
+ /* Either vhost_worker runs the pending work or we do */
+ node = llist_del_all(&worker->work_list);
+ if (node) {
+ node = llist_reverse_order(node);
+ /* make sure flag is seen after deletion */
+ smp_wmb();
+ llist_for_each_entry_safe(work, work_next, node, node) {
+ clear_bit(VHOST_WORK_QUEUED, &work->flags);
+ work->fn(work);
+ cond_resched();
+ }
}
+
+ wait_for_completion(&flush.wait_event);
}
EXPORT_SYMBOL_GPL(vhost_dev_flush);

@@ -338,6 +355,7 @@ static int vhost_worker(void *data)
struct vhost_worker *worker = data;
struct vhost_work *work, *work_next;
struct llist_node *node;
+ struct ksignal ksig;

for (;;) {
/* mb paired w/ kthread_stop */
@@ -348,6 +366,9 @@ static int vhost_worker(void *data)
break;
}

+ if (get_signal(&ksig))
+ break;
+
node = llist_del_all(&worker->work_list);
if (!node)
schedule();
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..613d52f01c07 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -47,6 +47,7 @@ void vhost_task_stop(struct vhost_task *vtsk)
* not exiting then reap the task.
*/
kernel_wait4(pid, NULL, __WCLONE, NULL);
+ put_task_struct(vtsk->task);
kfree(vtsk);
}
EXPORT_SYMBOL_GPL(vhost_task_stop);
@@ -101,7 +102,7 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
return NULL;
}

- vtsk->task = tsk;
+ vtsk->task = get_task_struct(tsk);
return vtsk;
}
EXPORT_SYMBOL_GPL(vhost_task_create);

Eric


2023-05-27 16:56:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On Sat, May 27, 2023 at 2:49 AM Eric W. Biederman <[email protected]> wrote:
>
> The real sticky widget for me is how to handle one of these processes
> coredumping. It really looks like it will result in a reliable hang.

Well, if *that* is the main worry, I think that's trivial enough to deal with.

In particular, we could make the rule just be that user worker threads
simply do not participate in core-dumps.

THAT isn't hard.

All we need to do is

(a) not count those threads in zap_threads()

(b) make sure that they don't add themselves to the "dumper" list by
not calling "coredujmp_task_exit()"

(c) not initiate core-dumping themselves.

and I think that's pretty much it.

In fact, that really seems like a good model *regardless*, because
honestly, a PF_IO_WORKER doesn't have valid register state for the
core dump anyway, and anything that would have caused a IO thread to
get a SIGSEGV *should* have caused a kernel oops already.

So the only worry is that the core dump will now happen while an IO
worker is still busy and so it's not "atomic" wrt possible VM changes,
but while that used to be a big problem back in the dark ages when we
didn't get the VM locks for core dumping, that got fixed a few years
ago because it already caused lots of potential issues.

End result: I think the attached patch is probably missing something,
but the approach "FeelsRight(tm)" to me.

Comments?

Linus


Attachments:
patch.diff (2.08 kB)

2023-05-28 01:26:21

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

Linus Torvalds <[email protected]> writes:

> On Sat, May 27, 2023 at 2:49 AM Eric W. Biederman <[email protected]> wrote:
>>
>> The real sticky widget for me is how to handle one of these processes
>> coredumping. It really looks like it will result in a reliable hang.
>
> Well, if *that* is the main worry, I think that's trivial enough to deal with.
>
> In particular, we could make the rule just be that user worker threads
> simply do not participate in core-dumps.
>
> THAT isn't hard.
>
> All we need to do is
>
> (a) not count those threads in zap_threads()
>
> (b) make sure that they don't add themselves to the "dumper" list by
> not calling "coredujmp_task_exit()"
>
> (c) not initiate core-dumping themselves.
>
> and I think that's pretty much it.
>
> In fact, that really seems like a good model *regardless*, because
> honestly, a PF_IO_WORKER doesn't have valid register state for the
> core dump anyway, and anything that would have caused a IO thread to
> get a SIGSEGV *should* have caused a kernel oops already.
>
> So the only worry is that the core dump will now happen while an IO
> worker is still busy and so it's not "atomic" wrt possible VM changes,
> but while that used to be a big problem back in the dark ages when we
> didn't get the VM locks for core dumping, that got fixed a few years
> ago because it already caused lots of potential issues.


>
> End result: I think the attached patch is probably missing something,
> but the approach "FeelsRight(tm)" to me.
>
> Comments?

It seems like a good approach for including in the -rc series.
I think the change should look more like my change below.

nits:
- The threads all need to participate in the group exit even if they
aren't going to be in the coredump.

- For vhost_worker we need s/PF_IO_WORKER/PF_USER_WORKER/.

- Moving PF_IO_WORKER above the sig_kernel_coredump(signr) test is
unnecessary. The sig_kernel_coredump(signr) test can only become
true if a process exit has not been initiated yet. More importantly
moving the test obscures the fact that only do_group_exit is
moved out of get_signal for the PF_IO_WORKER special case.


Long term I think we want an approach that stops the worker threads
during the coredumps. It will just yield a better quality of
implementation if we minimize the amount of concurrency during the
coredump.

I have a pending patchset that moves the coredump rendezvous into
get_signal. At which point stopping all of the threads is just like
SIGSTOP something the worker threads can use and it won't introduce any
issues. Today the problem is some of the worker thread code must run
before the coredump stop.

Looking forward I don't see not asking the worker threads to stop
for the coredump right now causing any problems in the future.
So I think we can use this to resolve the coredump issue I spotted.


diff --git a/fs/coredump.c b/fs/coredump.c
index ece7badf701b..620f7f9dc894 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -371,7 +371,8 @@ static int zap_process(struct task_struct *start, int exit_code)
if (t != current && !(t->flags & PF_POSTCOREDUMP)) {
sigaddset(&t->pending.signal, SIGKILL);
signal_wake_up(t, 1);
- nr++;
+ if (!(t->flags & PF_IO_WORKER))
+ nr++;
}
}

diff --git a/kernel/exit.c b/kernel/exit.c
index 34b90e2e7cf7..6082dba9131a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -411,7 +411,9 @@ static void coredump_task_exit(struct task_struct *tsk)
tsk->flags |= PF_POSTCOREDUMP;
core_state = tsk->signal->core_state;
spin_unlock_irq(&tsk->sighand->siglock);
- if (core_state) {
+
+ /* I/O Workers don't participate in coredumps */
+ if (core_state && !(tsk->flags & PF_IO_WORKER) {
struct core_thread self;

self.task = current;


> current->flags |= PF_SIGNALED;
>
> + /*
> + * PF_IO_WORKER threads will catch and exit on fatal signals
> + * themselves and do not participate in core dumping.
> + *
> + * They have cleanup that must be performed, so we cannot
> + * call do_exit() on their behalf.
> + */
> + if (current->flags & PF_IO_WORKER)
> + goto out;
> +
> if (sig_kernel_coredump(signr)) {
> if (print_fatal_signals)
> print_fatal_signal(ksig->info.si_signo);
> @@ -2860,14 +2870,6 @@ bool get_signal(struct ksignal *ksig)
> do_coredump(&ksig->info);
> }
>
> - /*
> - * PF_IO_WORKER threads will catch and exit on fatal signals
> - * themselves. They have cleanup that must be performed, so
> - * we cannot call do_exit() on their behalf.
> - */
> - if (current->flags & PF_IO_WORKER)
> - goto out;
> -
> /*
> * Death signals, no core dump.
> */

Eric

2023-05-28 01:40:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On Sat, May 27, 2023 at 6:17 PM Eric W. Biederman <[email protected]> wrote:
>
> It seems like a good approach for including in the -rc series.
> I think the change should look more like my change below.

I have no objections. My patch was a fairly "hack and slash" thing to
just disassociate the IO workers entirely from the core dumping. Yours
seems to be slightly more surgical.

Linus

2023-05-28 01:45:17

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

Mike Christie <[email protected]> writes:

> On 5/23/23 7:15 AM, Oleg Nesterov wrote:
>>
>> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
>> before we call work->fn(). Is it "safe" to run this callback with
>> signal_pending() or fatal_signal_pending() ?
>
> The questions before this one I'll leave for the core vhost devs since
> they know best.

Let me ask a clarifying question:

Is it only the call to schedule() in vhost_worker that you are worried
about not sleeping if signal_pending() or fatal_signal_pending()?

Is there concern that the worker functions aka "work->fn()" will also
have killable or interruptible sleeps that also will misbehave.

We can handle schedule() in vhost_worker without problem.

If a worker function has interruptible or killable sleeps that will turn
into busy waits or worse not sleeping long enough that seems like a
problem. There is no way to guarantee that the outer loop of
vhost_worker will protect the worker functions from signal_pending()
or fatal_signal_pending() becoming true.


Eric

2023-05-28 20:53:45

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 5/27/23 8:41 PM, Eric W. Biederman wrote:
> Mike Christie <[email protected]> writes:
>
>> On 5/23/23 7:15 AM, Oleg Nesterov wrote:
>>>
>>> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
>>> before we call work->fn(). Is it "safe" to run this callback with
>>> signal_pending() or fatal_signal_pending() ?
>>
>> The questions before this one I'll leave for the core vhost devs since
>> they know best.
>
> Let me ask a clarifying question:
>
> Is it only the call to schedule() in vhost_worker that you are worried
> about not sleeping if signal_pending() or fatal_signal_pending()?

It will only be the vhost_worker call to schedule().

When we do the file_operation's release call, we normally set things up
so the work->fn just fails and cleans up. I can pretty easily move that
code into a helper and do:

if (get_signal(ksig))
new_function_to_tell_drivers_that_all_work_fns_should_fail()



2023-05-29 11:35:25

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 05/27, Eric W. Biederman wrote:
>
> Looking forward I don't see not asking the worker threads to stop
> for the coredump right now causing any problems in the future.
> So I think we can use this to resolve the coredump issue I spotted.

But we have almost the same problem with exec.

Execing thread will wait for vhost_worker() while vhost_worker will wait for
.release -> vhost_task_stop().

And even O_CLOEXEC won't help, do_close_on_exec() is called after de_thread().

Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES...

If we want CLONE_THREAD, I think vhost_worker() should exit after get_signal()
returns SIGKILL. Perhaps it should "disable" vhost_work_queue() somehow and
flush the pending works on ->work_list before exit, I dunno. But imo it should
not wait for the final fput().

Oleg.


2023-05-29 16:16:37

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 5/29/23 6:19 AM, Oleg Nesterov wrote:
> On 05/27, Eric W. Biederman wrote:
>>
>> Looking forward I don't see not asking the worker threads to stop
>> for the coredump right now causing any problems in the future.
>> So I think we can use this to resolve the coredump issue I spotted.
>
> But we have almost the same problem with exec.
>
> Execing thread will wait for vhost_worker() while vhost_worker will wait for
> .release -> vhost_task_stop().

For this type of case, what is the goal or correct behavior in the end?

When get_signal returns true we can code things like you mention below and
clean up the task_struct. However, we now have a non-functioning vhost device
open and just sitting around taking up memory and it can't do any IO.

For this type of case, do we expect just not to crash/hang, or was this new
exec'd thread suppose to be able to use the vhost device?

I would normally say it probably wants to use the vhost device still. However,
I don't think this comes up so just not hanging might be ok. Before 6.4-rc1,
we ignored signals so it would have worked if we are concerned about a possible
regression if this was a common thing.



>
> And even O_CLOEXEC won't help, do_close_on_exec() is called after de_thread().
>
> Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES...

You mean the vhost_task's task/thread doing a function that does a copy_process
right? That type of thing is not needed. I can add a check in vhost_task_create
for this so new code doesn't try to do it. I don't think it will come up that some
code vhost is using will call kernel_thread/copy_process directly since those
calls are so rare and the functions are not exported to modules.


>
> If we want CLONE_THREAD, I think vhost_worker() should exit after get_signal()
> returns SIGKILL. Perhaps it should "disable" vhost_work_queue() somehow and
> flush the pending works on ->work_list before exit, I dunno. But imo it should
> not wait for the final fput().
>
> Oleg.
>


2023-05-29 16:24:13

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 5/29/23 6:19 AM, Oleg Nesterov wrote:
> If we want CLONE_THREAD, I think vhost_worker() should exit after get_signal()
> returns SIGKILL. Perhaps it should "disable" vhost_work_queue() somehow and
> flush the pending works on ->work_list before exit, I dunno. But imo it should
> not wait for the final fput().

Just a FYI, I coded this. It's pre-RFC quality. It's invasive. If we want to go
this route then we have to do a temp hack for 6.4 or revert then do this route
for 6.5 or 6.6 (vhost devs will need time to review and we need time to full test).

2023-05-29 17:56:06

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

Mike, sorry, I don't understand your email.

Just in case, let me remind I know nothing about drivers/vhost/

On 05/29, [email protected] wrote:
>
> On 5/29/23 6:19 AM, Oleg Nesterov wrote:
> > On 05/27, Eric W. Biederman wrote:
> >>
> >> Looking forward I don't see not asking the worker threads to stop
> >> for the coredump right now causing any problems in the future.
> >> So I think we can use this to resolve the coredump issue I spotted.
> >
> > But we have almost the same problem with exec.
> >
> > Execing thread will wait for vhost_worker() while vhost_worker will wait for
> > .release -> vhost_task_stop().
>
> For this type of case, what is the goal or correct behavior in the end?
>
> When get_signal returns true we can code things like you mention below

and you have mentioned in the next email that you have already coded something
like this, so perhaps we can delay the further discussions until you send the
new code?

> and
> clean up the task_struct.

Hmm... If we you CLONE_THREAD the exiting vhost_worker() will auto-reap itself,

> However, we now have a non-functioning vhost device
> open and just sitting around taking up memory and it can't do any IO.

can't comment, see above.

> For this type of case, do we expect just not to crash/hang, or was this new
> exec'd thread suppose to be able to use the vhost device?

I just tried to point out that (unless I missed something) there are more corner
cases, not just coredump.

> > Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES...
>
> You mean the vhost_task's task/thread doing a function that does a copy_process
> right?

I meant that the vhost_task's sub-thread can do sys_clone(CLONE_FILES) from
userspace. Yes, this implies copy_process() but I still can't understand you.

> That type of thing is not needed.

Do you mean that userspace should never do this? But this doesn't matter, the
kernel should handle this case anyway.

Or what?

In short let me repeat that I don't understand you and - of course! - quite
possibly I missed something.

Oleg.


2023-05-29 18:15:26

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 05/29, Oleg Nesterov wrote:
>
> Mike, sorry, I don't understand your email.
>
> Just in case, let me remind I know nothing about drivers/vhost/

And... this is slightly off-topic but you didn't answer my previous
question and I am just curious. Do you agree that, even if we forget
about CLONE_THREAD/signals, vhost_worker() needs fixes anyway because
it can race with vhost_work_queue() ?

Oleg.


2023-05-29 19:18:06

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 5/29/23 12:54 PM, Oleg Nesterov wrote:
> On 05/29, Oleg Nesterov wrote:
>>
>> Mike, sorry, I don't understand your email.
>>
>> Just in case, let me remind I know nothing about drivers/vhost/
>
> And... this is slightly off-topic but you didn't answer my previous
> question and I am just curious. Do you agree that, even if we forget
> about CLONE_THREAD/signals, vhost_worker() needs fixes anyway because
> it can race with vhost_work_queue() ?

You saw the reply where I wrote I was waiting for the vhost devs to
reply because it's their code, right? Just wanted to make sure you know
I'm not ignoring your mails. The info is very valuable to me since I don't
know the signal code.

- I think you are right about using a continue after schedule.
- The work fn is stable. It's setup once and never changes.
- I have no idea why we do the __set_current_state(TASK_RUNNING). As
far as I can tell the work functions do not change the task state and
that might have been a left over from something else. However, the
vhost devs might know of some case.
- For the barrier use, I think you are right, but I couldn't trigger
an issue even if I hack up different timings. So I was hoping the
vhost devs know of something else in there.



2023-05-29 19:39:42

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 5/29/23 12:46 PM, Oleg Nesterov wrote:
> Mike, sorry, I don't understand your email.
>
> Just in case, let me remind I know nothing about drivers/vhost/
>

No problem. I get it. I don't know the signal/thread code so it's
one of those things where I'm also learning as I go.

> On 05/29, [email protected] wrote:
>>
>> On 5/29/23 6:19 AM, Oleg Nesterov wrote:
>>> On 05/27, Eric W. Biederman wrote:
>>>>
>>>> Looking forward I don't see not asking the worker threads to stop
>>>> for the coredump right now causing any problems in the future.
>>>> So I think we can use this to resolve the coredump issue I spotted.
>>>
>>> But we have almost the same problem with exec.
>>>
>>> Execing thread will wait for vhost_worker() while vhost_worker will wait for
>>> .release -> vhost_task_stop().
>>
>> For this type of case, what is the goal or correct behavior in the end?
>>
>> When get_signal returns true we can code things like you mention below
>
> and you have mentioned in the next email that you have already coded something
> like this, so perhaps we can delay the further discussions until you send the
> new code?
>

Ok. Let me post that. You guys and the vhost devs can argue about if it's
too ugly to merge :)


>> and
>> clean up the task_struct.
>
> Hmm... If we you CLONE_THREAD the exiting vhost_worker() will auto-reap itself,

Oh wait, are you saying that when we get auto-reaped then we would do the last
fput and call the file_operations->release function right? We actually set
task_struct->files = NULL for the vhost_task task_struct, so I think we call
release a little sooner than you think.

vhost_task_create() sets kernel_clone_args->no_files, so the vhost_task task_struc
that gets created works like kthreads where it doesn't do a CLONE_FILES and it
doesn't do a dup_fd.

So when we do de_thread() -> zap_other_threads(), that will kill all the threads
in the group right? So when they exit, it will call our release function since
we don't have refcount on ourself.


>
>> However, we now have a non-functioning vhost device
>> open and just sitting around taking up memory and it can't do any IO.
>
> can't comment, see above.
>
>> For this type of case, do we expect just not to crash/hang, or was this new
>> exec'd thread suppose to be able to use the vhost device?
>
> I just tried to point out that (unless I missed something) there are more corner
> cases, not just coredump.

Ok. I see.

>
>>> Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES...
>>
>> You mean the vhost_task's task/thread doing a function that does a copy_process
>> right?
>
> I meant that the vhost_task's sub-thread can do sys_clone(CLONE_FILES) from
> userspace.

I think we were talking about different things. I was saying that when the vhost
layer does vhost_task_create() -> copy_process(), the kernel_clone_args.fn is
vhost_task_fn() -> vhost_worker(). I thought you were concerned about vhost_worker()
or some function it calls, calling copy_process() with CLONE_FILES.


2023-05-29 19:55:22

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 5/29/23 2:35 PM, Mike Christie wrote:
>> Hmm... If we you CLONE_THREAD the exiting vhost_worker() will auto-reap itself,
> Oh wait, are you saying that when we get auto-reaped then we would do the last
> fput and call the file_operations->release function right? We actually set
> task_struct->files = NULL for the vhost_task task_struct, so I think we call
> release a little sooner than you think.
>
> vhost_task_create() sets kernel_clone_args->no_files, so the vhost_task task_struc
> that gets created works like kthreads where it doesn't do a CLONE_FILES and it
> doesn't do a dup_fd.
>
> So when we do de_thread() -> zap_other_threads(), that will kill all the threads
> in the group right? So when they exit, it will call our release function since
> we don't have refcount on ourself.
>

Just to make sure I'm on the same page now.

In the past thread when were discussing the patch below and you guys were saying
that it doesn't really ignore SIGKILL because we will hit the
SIGNAL_GROUP_EXIT/group_exec_task checks and the parent is going to exit, it was
on purpose.

Instead of a signal to tell me when do exit, I was using the parent exiting and doing
the last fput on the vhost device's file which calls our release function. That then
allowed the vhost code to use the vhost_task to handle the outstanding IOs and then
do vhost_task_should_stop to tell the vhost_task to exit when the oustanding IO
had completed.

On the vhost side of things it's really nice, because all the shutdown paths work
the same and we don't need rcu/locking in the main IO path to handle the vhost_task
exiting while we are using it.


diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..e0f5ac90a228 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,6 @@ struct kernel_clone_args {
u32 io_thread:1;
u32 user_worker:1;
u32 no_files:1;
- u32 ignore_signals:1;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..fd2970b598b2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2336,8 +2336,15 @@ __latent_entropy struct task_struct *copy_process(
p->flags &= ~PF_KTHREAD;
if (args->kthread)
p->flags |= PF_KTHREAD;
- if (args->user_worker)
+ if (args->user_worker) {
+ /*
+ * User worker are similar to io_threads but they do not
+ * support signals and cleanup is driven via another kernel
+ * interface so even SIGKILL is blocked.
+ */
p->flags |= PF_USER_WORKER;
+ siginitsetinv(&p->blocked, 0);
+ }
if (args->io_thread) {
/*
* Mark us an IO worker, and block any signal that isn't
@@ -2517,8 +2524,8 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;

- if (args->ignore_signals)
- ignore_signals(p);
+ if (args->user_worker)
+ p->flags |= PF_NOFREEZE;

stackleak_task_init(p);

@@ -2860,7 +2867,6 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
.fn = fn,
.fn_arg = arg,
.io_thread = 1,
- .user_worker = 1,
};

return copy_process(NULL, 0, node, &args);
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..bc7e26072437 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -995,6 +995,19 @@ static inline bool wants_signal(int sig, struct task_struct *p)
return task_curr(p) || !task_sigpending(p);
}

+static void try_set_pending_sigkill(struct task_struct *t)
+{
+ /*
+ * User workers don't support signals and their exit is driven through
+ * their kernel layer, so by default block even SIGKILL.
+ */
+ if (sigismember(&t->blocked, SIGKILL))
+ return;
+
+ sigaddset(&t->pending.signal, SIGKILL);
+ signal_wake_up(t, 1);
+}
+
static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
{
struct signal_struct *signal = p->signal;
@@ -1055,8 +1068,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
t = p;
do {
task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
- sigaddset(&t->pending.signal, SIGKILL);
- signal_wake_up(t, 1);
+ try_set_pending_sigkill(t);
} while_each_thread(p, t);
return;
}
@@ -1373,8 +1385,7 @@ int zap_other_threads(struct task_struct *p)
/* Don't bother with already dead threads */
if (t->exit_state)
continue;
- sigaddset(&t->pending.signal, SIGKILL);
- signal_wake_up(t, 1);
+ try_set_pending_sigkill(t);
}

return count;
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..2d8d3ebaec4d 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -75,13 +75,13 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
const char *name)
{
struct kernel_clone_args args = {
- .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+ .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM |
+ CLONE_THREAD | CLONE_SIGHAND,
.exit_signal = 0,
.fn = vhost_task_fn,
.name = name,
.user_worker = 1,
.no_files = 1,
- .ignore_signals = 1,
};
struct vhost_task *vtsk;
struct task_struct *tsk;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d257916f39e5..255a2147e5c1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1207,12 +1207,12 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
DEFINE_WAIT(wait);

/*
- * Do not throttle user workers, kthreads other than kswapd or
+ * Do not throttle IO/user workers, kthreads other than kswapd or
* workqueues. They may be required for reclaim to make
* forward progress (e.g. journalling workqueues or kthreads).
*/
if (!current_is_kswapd() &&
- current->flags & (PF_USER_WORKER|PF_KTHREAD)) {
+ current->flags & (PF_USER_WORKER|PF_IO_WORKER|PF_KTHREAD)) {
cond_resched();
return;
}

2023-05-30 02:53:55

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

[email protected] writes:

> On 5/29/23 6:19 AM, Oleg Nesterov wrote:
>> On 05/27, Eric W. Biederman wrote:
>>>
>>> Looking forward I don't see not asking the worker threads to stop
>>> for the coredump right now causing any problems in the future.
>>> So I think we can use this to resolve the coredump issue I spotted.
>>
>> But we have almost the same problem with exec.
>>
>> Execing thread will wait for vhost_worker() while vhost_worker will wait for
>> .release -> vhost_task_stop().
>
> For this type of case, what is the goal or correct behavior in the end?
>
> When get_signal returns true we can code things like you mention below and
> clean up the task_struct. However, we now have a non-functioning vhost device
> open and just sitting around taking up memory and it can't do any IO.
>
> For this type of case, do we expect just not to crash/hang, or was this new
> exec'd thread suppose to be able to use the vhost device?

Looking at the vhost code, from before commit 6e890c5d5021 ("vhost: use
vhost_tasks for worker threads") exec effectively brakes the connection
with the vhost device. Aka the vhost device permanently keeps a
connection to the mm that opened it.

Frankly I think the vhost code will misbehave in fascinating ways if you
actually try and use it after exec.

> I would normally say it probably wants to use the vhost device still. However,
> I don't think this comes up so just not hanging might be ok.

Yes we just need to not hang, and fail as gracefully after exec.

>> And even O_CLOEXEC won't help, do_close_on_exec() is called after de_thread().
>>
>> Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES...
>
> You mean the vhost_task's task/thread doing a function that does a copy_process
> right? That type of thing is not needed. I can add a check in vhost_task_create
> for this so new code doesn't try to do it. I don't think it will come up that some
> code vhost is using will call kernel_thread/copy_process directly since those
> calls are so rare and the functions are not exported to modules.

Oleg was referring to the fact that during exec de_thread comes before
do_close_on_exec.

>> If we want CLONE_THREAD, I think vhost_worker() should exit after get_signal()
>> returns SIGKILL. Perhaps it should "disable" vhost_work_queue() somehow and
>> flush the pending works on ->work_list before exit, I dunno. But imo it should
>> not wait for the final fput().

In the long term I agree. Exiting more or less immediately and leaving
the work to vhost_dev_flush looks like where the code needs to go.

For right now just to get us something before the final -rc I believe
the changes below with some testing are good enough. We can handle
exec just like the coredump. I have made those tests specific to the
vhost case for now so that there is no danger of trigger any other
funnies.

If someone does not set O_CLOEXEC or performs file descriptor passing we
might get a weird thread with an old mm.

I have also added in the needed changes to change a few additional
PF_IO_WORKER test to PF_USER_WORK.

Mike is there any chance you can test the change below?

From: "Eric W. Biederman" <[email protected]>
Date: Mon, 29 May 2023 19:13:59 -0500
Subject: [PATCH] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

When switching from kthreads to vhost_tasks two bugs were added:
1. The vhost worker tasks's now show up as processes so scripts doing
ps or ps a would not incorrectly detect the vhost task as another
process. 2. kthreads disabled freeze by setting PF_NOFREEZE, but
vhost tasks's didn't disable or add support for them.

To fix both bugs, this switches the vhost task to be thread in the
process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
SIGKILL/STOP support is required because CLONE_THREAD requires
CLONE_SIGHAND which requires those 2 signals to be suppported.

This is a modified version of the patch written by Mike Christie
<[email protected]> which was a modified version of patch
originally written by Linus.

Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER.
Including ignoring signals, setting up the register state, and having
get_signal return instead of calling do_group_exit.

Tidied up the vhost_task abstraction so that the definition of
vhost_task only needs to be visible inside of vhost_task.c. Making
it easier to review the code and tell what needs to be done where.
As part of this the main loop has been moved from vhost_worker into
vhost_task_fn. vhost_worker now returns true if work was done.

The main loop has been updated to call get_signal which handles
SIGSTOP, freezing, and collects the message that tells the thread to
exit as part of process exit. This collection clears
__fatal_signal_pending. This collection is not guaranteed to
clear signal_pending() so clear that explicitly so the schedule()
sleeps.

For now the vhost thread continues to exist and run work until the
last file descriptor is closed and the release function is called as
part of freeing struct file. To avoid hangs in the coredump
rendezvous and when killing threads in a multi-threaded exec. The
coredump code and de_thread have been modified to ignore vhost threads.

Remvoing the special case for exec appears to require teaching
vhost_dev_flush how to directly complete transactions in case
the the vhost thread is no longer running.

Removing the special case for coredump rendezvous requires either the
above fix needed for exec or moving the coredump rendezvous into
get_signal.

Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86/include/asm/fpu/sched.h | 2 +-
arch/x86/kernel/fpu/context.h | 2 +-
arch/x86/kernel/fpu/core.c | 2 +-
drivers/vhost/vhost.c | 21 ++-----
fs/coredump.c | 4 +-
include/linux/sched/task.h | 1 -
include/linux/sched/vhost_task.h | 15 +----
kernel/exit.c | 5 +-
kernel/fork.c | 13 ++---
kernel/signal.c | 8 ++-
kernel/vhost_task.c | 94 +++++++++++++++++++++-----------
11 files changed, 91 insertions(+), 76 deletions(-)

diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
index c2d6cd78ed0c..78fcde7b1f07 100644
--- a/arch/x86/include/asm/fpu/sched.h
+++ b/arch/x86/include/asm/fpu/sched.h
@@ -39,7 +39,7 @@ extern void fpu_flush_thread(void);
static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
{
if (cpu_feature_enabled(X86_FEATURE_FPU) &&
- !(current->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+ !(current->flags & (PF_KTHREAD | PF_USER_WORKER))) {
save_fpregs_to_fpstate(old_fpu);
/*
* The save operation preserved register state, so the
diff --git a/arch/x86/kernel/fpu/context.h b/arch/x86/kernel/fpu/context.h
index 9fcfa5c4dad7..af5cbdd9bd29 100644
--- a/arch/x86/kernel/fpu/context.h
+++ b/arch/x86/kernel/fpu/context.h
@@ -57,7 +57,7 @@ static inline void fpregs_restore_userregs(void)
struct fpu *fpu = &current->thread.fpu;
int cpu = smp_processor_id();

- if (WARN_ON_ONCE(current->flags & (PF_KTHREAD | PF_IO_WORKER)))
+ if (WARN_ON_ONCE(current->flags & (PF_KTHREAD | PF_USER_WORKER)))
return;

if (!fpregs_state_valid(fpu, cpu)) {
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index caf33486dc5e..1015af1ae562 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -426,7 +426,7 @@ void kernel_fpu_begin_mask(unsigned int kfpu_mask)

this_cpu_write(in_kernel_fpu, true);

- if (!(current->flags & (PF_KTHREAD | PF_IO_WORKER)) &&
+ if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) &&
!test_thread_flag(TIF_NEED_FPU_LOAD)) {
set_thread_flag(TIF_NEED_FPU_LOAD);
save_fpregs_to_fpstate(&current->thread.fpu);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..221d1b6c1be5 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -256,7 +256,7 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
* test_and_set_bit() implies a memory barrier.
*/
llist_add(&work->node, &dev->worker->work_list);
- wake_up_process(dev->worker->vtsk->task);
+ vhost_task_wake(dev->worker->vtsk);
}
}
EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -333,25 +333,14 @@ static void vhost_vq_reset(struct vhost_dev *dev,
__vhost_vq_meta_reset(vq);
}

-static int vhost_worker(void *data)
+static bool vhost_worker(void *data)
{
struct vhost_worker *worker = data;
struct vhost_work *work, *work_next;
struct llist_node *node;

- for (;;) {
- /* mb paired w/ kthread_stop */
- set_current_state(TASK_INTERRUPTIBLE);
-
- if (vhost_task_should_stop(worker->vtsk)) {
- __set_current_state(TASK_RUNNING);
- break;
- }
-
- node = llist_del_all(&worker->work_list);
- if (!node)
- schedule();
-
+ node = llist_del_all(&worker->work_list);
+ if (node) {
node = llist_reverse_order(node);
/* make sure flag is seen after deletion */
smp_wmb();
@@ -365,7 +354,7 @@ static int vhost_worker(void *data)
}
}

- return 0;
+ return !!node;
}

static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
diff --git a/fs/coredump.c b/fs/coredump.c
index ece7badf701b..88740c51b942 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -371,7 +371,9 @@ static int zap_process(struct task_struct *start, int exit_code)
if (t != current && !(t->flags & PF_POSTCOREDUMP)) {
sigaddset(&t->pending.signal, SIGKILL);
signal_wake_up(t, 1);
- nr++;
+ /* The vhost_worker does not particpate in coredumps */
+ if ((t->flags & (PF_USER_WORKER | PF_IO_WORKER)) != PF_USER_WORKER)
+ nr++;
}
}

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..e0f5ac90a228 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,6 @@ struct kernel_clone_args {
u32 io_thread:1;
u32 user_worker:1;
u32 no_files:1;
- u32 ignore_signals:1;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
index 6123c10b99cf..837a23624a66 100644
--- a/include/linux/sched/vhost_task.h
+++ b/include/linux/sched/vhost_task.h
@@ -2,22 +2,13 @@
#ifndef _LINUX_VHOST_TASK_H
#define _LINUX_VHOST_TASK_H

-#include <linux/completion.h>

-struct task_struct;
+struct vhost_task;

-struct vhost_task {
- int (*fn)(void *data);
- void *data;
- struct completion exited;
- unsigned long flags;
- struct task_struct *task;
-};
-
-struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
+struct vhost_task *vhost_task_create(bool (*fn)(void *), void *arg,
const char *name);
void vhost_task_start(struct vhost_task *vtsk);
void vhost_task_stop(struct vhost_task *vtsk);
-bool vhost_task_should_stop(struct vhost_task *vtsk);
+void vhost_task_wake(struct vhost_task *vtsk);

#endif
diff --git a/kernel/exit.c b/kernel/exit.c
index 34b90e2e7cf7..edb50b4c9972 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -411,7 +411,10 @@ static void coredump_task_exit(struct task_struct *tsk)
tsk->flags |= PF_POSTCOREDUMP;
core_state = tsk->signal->core_state;
spin_unlock_irq(&tsk->sighand->siglock);
- if (core_state) {
+
+ /* The vhost_worker does not particpate in coredumps */
+ if (core_state &&
+ ((tsk->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER)) {
struct core_thread self;

self.task = current;
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..81cba91f30bb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2336,16 +2336,16 @@ __latent_entropy struct task_struct *copy_process(
p->flags &= ~PF_KTHREAD;
if (args->kthread)
p->flags |= PF_KTHREAD;
- if (args->user_worker)
- p->flags |= PF_USER_WORKER;
- if (args->io_thread) {
+ if (args->user_worker) {
/*
- * Mark us an IO worker, and block any signal that isn't
+ * Mark us a user worker, and block any signal that isn't
* fatal or STOP
*/
- p->flags |= PF_IO_WORKER;
+ p->flags |= PF_USER_WORKER;
siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
}
+ if (args->io_thread)
+ p->flags |= PF_IO_WORKER;

if (args->name)
strscpy_pad(p->comm, args->name, sizeof(p->comm));
@@ -2517,9 +2517,6 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;

- if (args->ignore_signals)
- ignore_signals(p);
-
stackleak_task_init(p);

if (pid != &init_struct_pid) {
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..2547fa73bde5 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1368,7 +1368,9 @@ int zap_other_threads(struct task_struct *p)

while_each_thread(p, t) {
task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
- count++;
+ /* Don't require de_thread to wait for the vhost_worker */
+ if ((t->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER)
+ count++;

/* Don't bother with already dead threads */
if (t->exit_state)
@@ -2861,11 +2863,11 @@ bool get_signal(struct ksignal *ksig)
}

/*
- * PF_IO_WORKER threads will catch and exit on fatal signals
+ * PF_USER_WORKER threads will catch and exit on fatal signals
* themselves. They have cleanup that must be performed, so
* we cannot call do_exit() on their behalf.
*/
- if (current->flags & PF_IO_WORKER)
+ if (current->flags & PF_USER_WORKER)
goto out;

/*
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..f3ce0fa6edd7 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -12,58 +12,90 @@ enum vhost_task_flags {
VHOST_TASK_FLAGS_STOP,
};

+struct vhost_task {
+ bool (*fn)(void *data);
+ void *data;
+ struct completion exited;
+ unsigned long flags;
+ struct task_struct *task;
+};
+
static int vhost_task_fn(void *data)
{
struct vhost_task *vtsk = data;
- int ret;
+ bool dead = false;
+
+ for (;;) {
+ bool did_work;
+
+ /* mb paired w/ kthread_stop */
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) {
+ __set_current_state(TASK_RUNNING);
+ break;
+ }
+
+ if (!dead && signal_pending(current)) {
+ struct ksignal ksig;
+ /*
+ * Calling get_signal will block in SIGSTOP,
+ * or clear fatal_signal_pending, but remember
+ * what was set.
+ *
+ * This thread won't actually exit until all
+ * of the file descriptors are closed, and
+ * the release function is called.
+ */
+ dead = get_signal(&ksig);
+ if (dead)
+ clear_thread_flag(TIF_SIGPENDING);
+ }
+
+ did_work = vtsk->fn(vtsk->data);
+ if (!did_work)
+ schedule();
+ }

- ret = vtsk->fn(vtsk->data);
complete(&vtsk->exited);
- do_exit(ret);
+ do_exit(0);
}

+/**
+ * vhost_task_wake - wakeup the vhost_task
+ * @vtsk: vhost_task to wake
+ *
+ * wake up the vhost_task worker thread
+ */
+void vhost_task_wake(struct vhost_task *vtsk)
+{
+ wake_up_process(vtsk->task);
+}
+EXPORT_SYMBOL_GPL(vhost_task_wake);
+
/**
* vhost_task_stop - stop a vhost_task
* @vtsk: vhost_task to stop
*
- * Callers must call vhost_task_should_stop and return from their worker
- * function when it returns true;
+ * vhost_task_fn ensures the worker thread exits after
+ * VHOST_TASK_FLAGS_SOP becomes true.
*/
void vhost_task_stop(struct vhost_task *vtsk)
{
- pid_t pid = vtsk->task->pid;
-
set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
- wake_up_process(vtsk->task);
+ vhost_task_wake(vtsk);
/*
* Make sure vhost_task_fn is no longer accessing the vhost_task before
- * freeing it below. If userspace crashed or exited without closing,
- * then the vhost_task->task could already be marked dead so
- * kernel_wait will return early.
+ * freeing it below.
*/
wait_for_completion(&vtsk->exited);
- /*
- * If we are just closing/removing a device and the parent process is
- * not exiting then reap the task.
- */
- kernel_wait4(pid, NULL, __WCLONE, NULL);
kfree(vtsk);
}
EXPORT_SYMBOL_GPL(vhost_task_stop);

/**
- * vhost_task_should_stop - should the vhost task return from the work function
- * @vtsk: vhost_task to stop
- */
-bool vhost_task_should_stop(struct vhost_task *vtsk)
-{
- return test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
-}
-EXPORT_SYMBOL_GPL(vhost_task_should_stop);
-
-/**
- * vhost_task_create - create a copy of a process to be used by the kernel
- * @fn: thread stack
+ * vhost_task_create - create a copy of a task to be used by the kernel
+ * @fn: vhost worker function
* @arg: data to be passed to fn
* @name: the thread's name
*
@@ -71,17 +103,17 @@ EXPORT_SYMBOL_GPL(vhost_task_should_stop);
* failure. The returned task is inactive, and the caller must fire it up
* through vhost_task_start().
*/
-struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
+struct vhost_task *vhost_task_create(bool (*fn)(void *), void *arg,
const char *name)
{
struct kernel_clone_args args = {
- .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+ .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM |
+ CLONE_THREAD | CLONE_SIGHAND,
.exit_signal = 0,
.fn = vhost_task_fn,
.name = name,
.user_worker = 1,
.no_files = 1,
- .ignore_signals = 1,
};
struct vhost_task *vtsk;
struct task_struct *tsk;
--
2.35.3

2023-05-30 03:26:13

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

[email protected] writes:

> On 5/29/23 2:35 PM, Mike Christie wrote:
>>> Hmm... If we you CLONE_THREAD the exiting vhost_worker() will auto-reap itself,
>> Oh wait, are you saying that when we get auto-reaped then we would do the last
>> fput and call the file_operations->release function right? We actually set
>> task_struct->files = NULL for the vhost_task task_struct, so I think we call
>> release a little sooner than you think.
>>
>> vhost_task_create() sets kernel_clone_args->no_files, so the vhost_task task_struc
>> that gets created works like kthreads where it doesn't do a CLONE_FILES and it
>> doesn't do a dup_fd.
>>
>> So when we do de_thread() -> zap_other_threads(), that will kill all the threads
>> in the group right? So when they exit, it will call our release function since
>> we don't have refcount on ourself.
>>
>
> Just to make sure I'm on the same page now.
>
> In the past thread when were discussing the patch below and you guys were saying
> that it doesn't really ignore SIGKILL because we will hit the
> SIGNAL_GROUP_EXIT/group_exec_task checks and the parent is going to exit, it was
> on purpose.
>
> Instead of a signal to tell me when do exit, I was using the parent exiting and doing
> the last fput on the vhost device's file which calls our release function. That then
> allowed the vhost code to use the vhost_task to handle the outstanding IOs and then
> do vhost_task_should_stop to tell the vhost_task to exit when the oustanding IO
> had completed.
>
> On the vhost side of things it's really nice, because all the shutdown paths work
> the same and we don't need rcu/locking in the main IO path to handle the vhost_task
> exiting while we are using it.

The code below does nothing for exec.

You really need to call get_signal to handle SIGSTOP/freeze/process exit.

A variant on my coredump proposal looks like it can handle exec as well.
It isn't pretty but it looks good enough for right now.

If you could test the patch I posted up thread I think that is something
imperfect that is good enough for now.

Eric

2023-05-30 14:42:44

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On Mon, May 29, 2023 at 01:19:39PM +0200, Oleg Nesterov wrote:
> On 05/27, Eric W. Biederman wrote:
> >
> > Looking forward I don't see not asking the worker threads to stop
> > for the coredump right now causing any problems in the future.
> > So I think we can use this to resolve the coredump issue I spotted.
>
> But we have almost the same problem with exec.
>
> Execing thread will wait for vhost_worker() while vhost_worker will wait for
> .release -> vhost_task_stop().
>
> And even O_CLOEXEC won't help, do_close_on_exec() is called after de_thread().
>
> Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES...
>
> If we want CLONE_THREAD, I think vhost_worker() should exit after get_signal()
> returns SIGKILL. Perhaps it should "disable" vhost_work_queue() somehow and

Yes, and that's what I proposed at the beginning of this tread. We want
to have similar behavior as io_uring and cancel any oustanding work
instead of trying to finish it. But Mike was concerned because this
might be a change in behavior. Which I think is fine though. And it
complicates the code if we want to finish any outstanding work.

2023-05-30 15:08:39

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

"Eric W. Biederman" <[email protected]> writes:

> Linus Torvalds <[email protected]> writes:
>
>> So I'd really like to finish this. Even if we end up with a hack or
>> two in signal handling that we can hopefully fix up later by having
>> vhost fix up some of its current assumptions.
>
>
> The real sticky widget for me is how to handle one of these processes
> coredumping. It really looks like it will result in a reliable hang.
>
> Limiting ourselves to changes that will only affect vhost, all I can
> see would be allowing the vhost_worker thread to exit as soon as
> get_signal reports the process is exiting. Then vhost_dev_flush
> would need to process the pending work.
>

Oleg recently pointed out that the trickiest case currently appears
to be what happens if someone calls exec, in a process using vhost.

do_close_on_exec is called after de_thread, and after the mm has
changed. Which means that my idea of moving the work from vhost_worker
into vhost_dev_flush can't work. At the point that flush is called
it has the wrong mm. Which means the flush or cancel of the pending
work needs to happen in the vhost thread, we can't assume there
is any other thread available to do the work.

What makes this all nice is that the vhost code has
vhost_dev_check_owner which ensures only one mm can initiate
I/O. Which means file descriptor passing is essentially an
academic concern.

In the case of both process exit, and exec except for a racing
on which piece of code shuts down first there should be no
more I/O going into the work queues.

But it is going to take someone who understands and cares about
vhost to figure out how to stop new I/O from going into the
work queues and to ensure that on-going work is dealt with.

Eric

2023-05-30 15:50:24

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 5/29/23 9:38 PM, Eric W. Biederman wrote:
> Mike is there any chance you can test the change below?

Yeah, I'm firing up tests now.

2023-05-30 18:12:54

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 05/30, Christian Brauner wrote:
>
> On Mon, May 29, 2023 at 01:19:39PM +0200, Oleg Nesterov wrote:
> >
> > If we want CLONE_THREAD, I think vhost_worker() should exit after get_signal()
> > returns SIGKILL. Perhaps it should "disable" vhost_work_queue() somehow and
>
> Yes, and that's what I proposed at the beginning of this tread.

Yes. And you know, I misunderstood you even if I had the same feeling from the
very beginning too (except I wasn't and still not sure CLONE_THREAD is a good
idea). Because... OK, I think it doesn't matter now ;)

Mike, Eric, et al.

I'll try to (at least) read your emails tomorrow. Another day spent on redhat
bugzillas.

Oleg.


2023-05-31 04:05:22

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 5/29/23 9:38 PM, Eric W. Biederman wrote:
> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
> index b7cbd66f889e..f3ce0fa6edd7 100644
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c

...

> static int vhost_task_fn(void *data)
> {
> struct vhost_task *vtsk = data;
> - int ret;
> + bool dead = false;
> +
> + for (;;) {
> + bool did_work;
> +
> + /* mb paired w/ kthread_stop */
> + set_current_state(TASK_INTERRUPTIBLE);
> +
> + if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) {
> + __set_current_state(TASK_RUNNING);
> + break;
> + }
> +
> + if (!dead && signal_pending(current)) {
> + struct ksignal ksig;
> + /*
> + * Calling get_signal will block in SIGSTOP,
> + * or clear fatal_signal_pending, but remember
> + * what was set.
> + *
> + * This thread won't actually exit until all
> + * of the file descriptors are closed, and
> + * the release function is called.
> + */
> + dead = get_signal(&ksig);

Hey Eric, the patch works well. Thanks! There was just one small issue.

get_signal() does try_to_freeze() -> ... __might_sleep() which wants the
state to be TASK_RUNNING.

We just need the patch below on top of yours which I think also cleans up
some of the state setting weirdness with the code like where vhost.c calls
__set_current_state(TASK_RUNNING) for each work. It looks like that was
not needed for any reason like a work->fn changing the state and the old
code could have done:

node = llist_del_all(&worker->work_list);
if (!node) {
schedule();
continue;
} else {
__set_current_state(TASK_RUNNING);
}

So I think we can do the following on top of your patch:

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 221d1b6c1be5..074273020849 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -346,7 +346,6 @@ static bool vhost_worker(void *data)
smp_wmb();
llist_for_each_entry_safe(work, work_next, node, node) {
clear_bit(VHOST_WORK_QUEUED, &work->flags);
- __set_current_state(TASK_RUNNING);
kcov_remote_start_common(worker->kcov_handle);
work->fn(work);
kcov_remote_stop();
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index f3ce0fa6edd7..fead2ed29561 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -29,12 +29,8 @@ static int vhost_task_fn(void *data)
bool did_work;

/* mb paired w/ kthread_stop */
- set_current_state(TASK_INTERRUPTIBLE);
-
- if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) {
- __set_current_state(TASK_RUNNING);
+ if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags))
break;
- }

if (!dead && signal_pending(current)) {
struct ksignal ksig;
@@ -53,8 +49,10 @@ static int vhost_task_fn(void *data)
}

did_work = vtsk->fn(vtsk->data);
- if (!did_work)
+ if (!did_work) {
+ set_current_state(TASK_INTERRUPTIBLE);
schedule();
+ }
}

complete(&vtsk->exited);




2023-05-31 05:27:49

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression


在 2023/5/23 20:15, Oleg Nesterov 写道:
> On 05/22, Oleg Nesterov wrote:
>> Right now I think that "int dead" should die,
> No, probably we shouldn't call get_signal() if we have already dequeued SIGKILL.
>
>> but let me think tomorrow.
> May be something like this... I don't like it but I can't suggest anything better
> right now.
>
> bool killed = false;
>
> for (;;) {
> ...
>
> node = llist_del_all(&worker->work_list);
> if (!node) {
> schedule();
> /*
> * When we get a SIGKILL our release function will
> * be called. That will stop new IOs from being queued
> * and check for outstanding cmd responses. It will then
> * call vhost_task_stop to tell us to return and exit.
> */
> if (signal_pending(current)) {
> struct ksignal ksig;
>
> if (!killed)
> killed = get_signal(&ksig);
>
> clear_thread_flag(TIF_SIGPENDING);
> }
>
> continue;
> }
>
> -------------------------------------------------------------------------------
> But let me ask a couple of questions. Let's forget this patch, let's look at the
> current code:
>
> node = llist_del_all(&worker->work_list);
> if (!node)
> schedule();
>
> node = llist_reverse_order(node);
> ... process works ...
>
> To me this looks a bit confusing. Shouldn't we do
>
> if (!node) {
> schedule();
> continue;
> }
>
> just to make the code a bit more clear? If node == NULL then
> llist_reverse_order() and llist_for_each_entry_safe() will do nothing.
> But this is minor.


Yes.


>
>
>
> /* make sure flag is seen after deletion */
> smp_wmb();
> llist_for_each_entry_safe(work, work_next, node, node) {
> clear_bit(VHOST_WORK_QUEUED, &work->flags);
>
> I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
> vhost_work_queue() can add this work again and change work->node->next.
>
> That is why we use _safe, but we need to ensure that llist_for_each_safe()
> completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.


This should be fine since store is not speculated, so work->node->next
needs to be loaded before VHOST_WORK_QUEUED is cleared to meet the loop
condition.


> So it seems that smp_wmb() can't help and should be removed, instead we need
>
> llist_for_each_entry_safe(...) {
> smp_mb__before_atomic();
> clear_bit(VHOST_WORK_QUEUED, &work->flags);
>
> Also, if the work->fn pointer is not stable, we should read it before
> smp_mb__before_atomic() as well.


The fn won't be changed after it is initialized.


>
> No?
>
>
> __set_current_state(TASK_RUNNING);
>
> Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn()
> can return with current->state != RUNNING ?


It is because the state were set to TASK_INTERRUPTIBLE in the beginning
of the loop otherwise it might be side effect while executing work->fn().


>
>
> work->fn(work);
>
> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
> before we call work->fn(). Is it "safe" to run this callback with
> signal_pending() or fatal_signal_pending() ?


It looks safe since:

1) vhost hold refcnt of the mm
2) release will sync with the worker


>
>
> Finally. I never looked into drivers/vhost/ before so I don't understand
> this code at all, but let me ask anyway... Can we change vhost_dev_flush()
> to run the pending callbacks rather than wait for vhost_worker() ?
> I guess we can't, ->mm won't be correct, but can you confirm?


Yes.

Thanks


>
> Oleg.
>


2023-05-31 05:31:17

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression


在 2023/5/23 23:57, Eric W. Biederman 写道:
> Oleg Nesterov <[email protected]> writes:
>
>> On 05/22, Oleg Nesterov wrote:
>>> Right now I think that "int dead" should die,
>> No, probably we shouldn't call get_signal() if we have already
>> dequeued SIGKILL.
> Very much agreed. It is one thing to add a patch to move do_exit
> out of get_signal. It is another to keep calling get_signal after
> that. Nothing tests that case, and so we get some weird behaviors.
>
>
>>> but let me think tomorrow.
>> May be something like this... I don't like it but I can't suggest anything better
>> right now.
>>
>> bool killed = false;
>>
>> for (;;) {
>> ...
>>
>> node = llist_del_all(&worker->work_list);
>> if (!node) {
>> schedule();
>> /*
>> * When we get a SIGKILL our release function will
>> * be called. That will stop new IOs from being queued
>> * and check for outstanding cmd responses. It will then
>> * call vhost_task_stop to tell us to return and exit.
>> */
>> if (signal_pending(current)) {
>> struct ksignal ksig;
>>
>> if (!killed)
>> killed = get_signal(&ksig);
>>
>> clear_thread_flag(TIF_SIGPENDING);
>> }
>>
>> continue;
>> }
> I want to point out that we need to consider not just SIGKILL, but
> SIGABRT that causes a coredump, as well as the process peforming
> an ordinary exit(2). All of which will cause get_signal to return
> SIGKILL in this context.
>
>> -------------------------------------------------------------------------------
>> But let me ask a couple of questions.
> I share most of these questions.
>
>> Let's forget this patch, let's look at the
>> current code:
>>
>> node = llist_del_all(&worker->work_list);
>> if (!node)
>> schedule();
>>
>> node = llist_reverse_order(node);
>> ... process works ...
>>
>> To me this looks a bit confusing. Shouldn't we do
>>
>> if (!node) {
>> schedule();
>> continue;
>> }
>>
>> just to make the code a bit more clear? If node == NULL then
>> llist_reverse_order() and llist_for_each_entry_safe() will do nothing.
>> But this is minor.
>>
>>
>>
>> /* make sure flag is seen after deletion */
>> smp_wmb();
>> llist_for_each_entry_safe(work, work_next, node, node) {
>> clear_bit(VHOST_WORK_QUEUED, &work->flags);
>>
>> I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
>> vhost_work_queue() can add this work again and change work->node->next.
>>
>> That is why we use _safe, but we need to ensure that llist_for_each_safe()
>> completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.
>>
>> So it seems that smp_wmb() can't help and should be removed, instead we need
>>
>> llist_for_each_entry_safe(...) {
>> smp_mb__before_atomic();
>> clear_bit(VHOST_WORK_QUEUED, &work->flags);
>>
>> Also, if the work->fn pointer is not stable, we should read it before
>> smp_mb__before_atomic() as well.
>>
>> No?
>>
>>
>> __set_current_state(TASK_RUNNING);
>>
>> Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn()
>> can return with current->state != RUNNING ?
>>
>>
>> work->fn(work);
>>
>> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
>> before we call work->fn(). Is it "safe" to run this callback with
>> signal_pending() or fatal_signal_pending() ?
>>
>>
>> Finally. I never looked into drivers/vhost/ before so I don't understand
>> this code at all, but let me ask anyway... Can we change vhost_dev_flush()
>> to run the pending callbacks rather than wait for vhost_worker() ?
>> I guess we can't, ->mm won't be correct, but can you confirm?
> In a conversation long ago I remember hearing that vhost does not
> support file descriptor passing. Which means all of the file
> descriptors should be in the same process.


It's not. Actually passing vhost fd is pretty common since Qemu is
usually running without privilege. So it's the charge of the management
layer to open vhost fd and pass it to Qemu.


>
> Looking at the vhost code what I am seeing happening is that the
> vhost_worker persists until vhost_dev_cleanup is called from
> one of the vhost_???_release() functions. The release functions
> are only called after the last flush function completes. See __fput
> if you want to trace the details.
>
>
> On one hand this all seems reasonable. On the other hand I am not
> seeing the code that prevents file descriptor passing.


Yes.


>
>
> It is probably not the worst thing in the world, but what this means
> is now if you pass a copy of the vhost file descriptor to another
> process the vhost_worker will persist, and thus the process will persist
> until that copy of the file descriptor is closed.


Right.

Thanks


>
> Eric
>


2023-05-31 07:33:34

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 05/31, Jason Wang wrote:
>
> 在 2023/5/23 20:15, Oleg Nesterov 写道:
> >
> > /* make sure flag is seen after deletion */
> > smp_wmb();
> > llist_for_each_entry_safe(work, work_next, node, node) {
> > clear_bit(VHOST_WORK_QUEUED, &work->flags);
> >
> >I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
> >vhost_work_queue() can add this work again and change work->node->next.
> >
> >That is why we use _safe, but we need to ensure that llist_for_each_safe()
> >completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.
>
> This should be fine since store is not speculated, so work->node->next needs
> to be loaded before VHOST_WORK_QUEUED is cleared to meet the loop condition.

I don't understand you. OK, to simplify, suppose we have 2 global vars

void *PTR = something_non_null;
unsigned long FLAGS = -1ul;

Now I think this code

CPU_0 CPU_1

void *ptr = PTR; if (!test_and_set_bit(0, FLAGS))
clear_bit(0, FLAGS); PTR = NULL;
BUG_ON(!ptr);

is racy and can hit the BUG_ON(!ptr).

I guess it is fine on x86, but in general you need smp_mb__before_atomic()
before clear_bit(), or clear_bit_unlock().

> > __set_current_state(TASK_RUNNING);
> >
> >Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn()
> >can return with current->state != RUNNING ?
>
> It is because the state were set to TASK_INTERRUPTIBLE in the beginning of
> the loop otherwise it might be side effect while executing work->fn().

Again, I don't understand you. So let me repeat: can work->fn() return with
current->_state != TASK_RUNNING ? If not (and I'd say it should not), you can
do __set_current_state(TASK_RUNNING) once, before llist_for_each_entry_safe().

> >Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
> >before we call work->fn(). Is it "safe" to run this callback with
> >signal_pending() or fatal_signal_pending() ?
>
> It looks safe since:
>
> 1) vhost hold refcnt of the mm
> 2) release will sync with the worker

Well, that's not what I asked... nevermind, please forget.

Thanks.

Oleg.


2023-05-31 08:27:29

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On Wed, May 31, 2023 at 3:25 PM Oleg Nesterov <[email protected]> wrote:
>
> On 05/31, Jason Wang wrote:
> >
> > 在 2023/5/23 20:15, Oleg Nesterov 写道:
> > >
> > > /* make sure flag is seen after deletion */
> > > smp_wmb();
> > > llist_for_each_entry_safe(work, work_next, node, node) {
> > > clear_bit(VHOST_WORK_QUEUED, &work->flags);
> > >
> > >I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
> > >vhost_work_queue() can add this work again and change work->node->next.
> > >
> > >That is why we use _safe, but we need to ensure that llist_for_each_safe()
> > >completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.
> >
> > This should be fine since store is not speculated, so work->node->next needs
> > to be loaded before VHOST_WORK_QUEUED is cleared to meet the loop condition.
>
> I don't understand you. OK, to simplify, suppose we have 2 global vars
>
> void *PTR = something_non_null;
> unsigned long FLAGS = -1ul;
>
> Now I think this code
>
> CPU_0 CPU_1
>
> void *ptr = PTR; if (!test_and_set_bit(0, FLAGS))
> clear_bit(0, FLAGS); PTR = NULL;
> BUG_ON(!ptr);
>
> is racy and can hit the BUG_ON(!ptr).

This seems different to the above case? And you can hit BUG_ON with
the following execution sequence:

[cpu 0] clear_bit(0, FLAGS);
[cpu 1] if (!test_and_set_bit(0, FLAGS))
[cpu 1] PTR = NULL;
[cpu 0] BUG_ON(!ptr)

In vhost code, there's a condition before the clear_bit() which sits
inside llist_for_each_entry_safe():

#define llist_for_each_entry_safe(pos, n, node, member) \
for (pos = llist_entry((node), typeof(*pos), member); \
member_address_is_nonnull(pos, member) && \
(n = llist_entry(pos->member.next, typeof(*n), member), true); \
pos = n)

The clear_bit() is a store which is not speculated, so there's a
control dependency, the store can't be executed until the condition
expression is evaluated which requires pos->member.next
(work->node.next) to be loaded.

>
> I guess it is fine on x86, but in general you need smp_mb__before_atomic()
> before clear_bit(), or clear_bit_unlock().
>
> > > __set_current_state(TASK_RUNNING);
> > >
> > >Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn()
> > >can return with current->state != RUNNING ?
> >
> > It is because the state were set to TASK_INTERRUPTIBLE in the beginning of
> > the loop otherwise it might be side effect while executing work->fn().
>
> Again, I don't understand you. So let me repeat: can work->fn() return with
> current->_state != TASK_RUNNING ? If not (and I'd say it should not), you can
> do __set_current_state(TASK_RUNNING) once, before llist_for_each_entry_safe().
>

Ok, that should be fine.

Thanks


> > >Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
> > >before we call work->fn(). Is it "safe" to run this callback with
> > >signal_pending() or fatal_signal_pending() ?
> >
> > It looks safe since:
> >
> > 1) vhost hold refcnt of the mm
> > 2) release will sync with the worker
>
> Well, that's not what I asked... nevermind, please forget.
>
> Thanks.
>
> Oleg.
>


2023-05-31 09:40:27

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 05/31, Jason Wang wrote:
>
> On Wed, May 31, 2023 at 3:25 PM Oleg Nesterov <[email protected]> wrote:
> >
> > On 05/31, Jason Wang wrote:
> > >
> > > 在 2023/5/23 20:15, Oleg Nesterov 写道:
> > > >
> > > > /* make sure flag is seen after deletion */
> > > > smp_wmb();
> > > > llist_for_each_entry_safe(work, work_next, node, node) {
> > > > clear_bit(VHOST_WORK_QUEUED, &work->flags);
> > > >
> > > >I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
> > > >vhost_work_queue() can add this work again and change work->node->next.
> > > >
> > > >That is why we use _safe, but we need to ensure that llist_for_each_safe()
> > > >completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.
> > >
> > > This should be fine since store is not speculated, so work->node->next needs
> > > to be loaded before VHOST_WORK_QUEUED is cleared to meet the loop condition.
> >
> > I don't understand you. OK, to simplify, suppose we have 2 global vars
> >
> > void *PTR = something_non_null;
> > unsigned long FLAGS = -1ul;
> >
> > Now I think this code
> >
> > CPU_0 CPU_1
> >
> > void *ptr = PTR; if (!test_and_set_bit(0, FLAGS))
> > clear_bit(0, FLAGS); PTR = NULL;
> > BUG_ON(!ptr);
> >
> > is racy and can hit the BUG_ON(!ptr).
>
> This seems different to the above case?

not sure,

> And you can hit BUG_ON with
> the following execution sequence:
>
> [cpu 0] clear_bit(0, FLAGS);
> [cpu 1] if (!test_and_set_bit(0, FLAGS))
> [cpu 1] PTR = NULL;
> [cpu 0] BUG_ON(!ptr)

I don't understand this part... yes, we can hit this BUG_ON() without mb in
between, this is what I tried to say.

> In vhost code, there's a condition before the clear_bit() which sits
> inside llist_for_each_entry_safe():
>
> #define llist_for_each_entry_safe(pos, n, node, member) \
> for (pos = llist_entry((node), typeof(*pos), member); \
> member_address_is_nonnull(pos, member) && \
> (n = llist_entry(pos->member.next, typeof(*n), member), true); \
> pos = n)
>
> The clear_bit() is a store which is not speculated, so there's a
> control dependency, the store can't be executed until the condition
> expression is evaluated which requires pos->member.next
> (work->node.next) to be loaded.

But llist_for_each_entry_safe() doesn't check "n", I mean, it is not that we have
something like

n = llist_entry(...);
if (n)
clear_bit(...);

so I do not see how can we rely on the load-store control dependency.

Oleg.


2023-06-01 03:04:28

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On Wed, May 31, 2023 at 5:14 PM Oleg Nesterov <[email protected]> wrote:
>
> On 05/31, Jason Wang wrote:
> >
> > On Wed, May 31, 2023 at 3:25 PM Oleg Nesterov <[email protected]> wrote:
> > >
> > > On 05/31, Jason Wang wrote:
> > > >
> > > > 在 2023/5/23 20:15, Oleg Nesterov 写道:
> > > > >
> > > > > /* make sure flag is seen after deletion */
> > > > > smp_wmb();
> > > > > llist_for_each_entry_safe(work, work_next, node, node) {
> > > > > clear_bit(VHOST_WORK_QUEUED, &work->flags);
> > > > >
> > > > >I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
> > > > >vhost_work_queue() can add this work again and change work->node->next.
> > > > >
> > > > >That is why we use _safe, but we need to ensure that llist_for_each_safe()
> > > > >completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.
> > > >
> > > > This should be fine since store is not speculated, so work->node->next needs
> > > > to be loaded before VHOST_WORK_QUEUED is cleared to meet the loop condition.
> > >
> > > I don't understand you. OK, to simplify, suppose we have 2 global vars
> > >
> > > void *PTR = something_non_null;
> > > unsigned long FLAGS = -1ul;
> > >
> > > Now I think this code
> > >
> > > CPU_0 CPU_1
> > >
> > > void *ptr = PTR; if (!test_and_set_bit(0, FLAGS))
> > > clear_bit(0, FLAGS); PTR = NULL;
> > > BUG_ON(!ptr);
> > >
> > > is racy and can hit the BUG_ON(!ptr).
> >
> > This seems different to the above case?
>
> not sure,
>
> > And you can hit BUG_ON with
> > the following execution sequence:
> >
> > [cpu 0] clear_bit(0, FLAGS);
> > [cpu 1] if (!test_and_set_bit(0, FLAGS))
> > [cpu 1] PTR = NULL;
> > [cpu 0] BUG_ON(!ptr)
>
> I don't understand this part... yes, we can hit this BUG_ON() without mb in
> between, this is what I tried to say.

I may miss something, but the above is the sequence that is executed
by the processor (for each CPU, it's just the program order). So where
do you expect to place an mb can help?

>
> > In vhost code, there's a condition before the clear_bit() which sits
> > inside llist_for_each_entry_safe():
> >
> > #define llist_for_each_entry_safe(pos, n, node, member) \
> > for (pos = llist_entry((node), typeof(*pos), member); \
> > member_address_is_nonnull(pos, member) && \
> > (n = llist_entry(pos->member.next, typeof(*n), member), true); \
> > pos = n)
> >
> > The clear_bit() is a store which is not speculated, so there's a
> > control dependency, the store can't be executed until the condition
> > expression is evaluated which requires pos->member.next
> > (work->node.next) to be loaded.
>
> But llist_for_each_entry_safe() doesn't check "n", I mean, it is not that we have
> something like
>
> n = llist_entry(...);
> if (n)
> clear_bit(...);
>
> so I do not see how can we rely on the load-store control dependency.

Just to make sure we are on the same page, the condition expression is

member_address_is_nonnull(pos, member) && (n =
llist_entry(pos->member.next, typeof(*n), member), true)

So it's something like:

if (work->node && (work_next = work->node->next, true))
clear_bit(&work->flags);

So two loads from both work->node and work->node->next, and there's a
store which is clear_bit, then it's a load-store control dependencies?

Thanks

>
> Oleg.
>


2023-06-01 08:12:15

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 06/01, Jason Wang wrote:
>
> On Wed, May 31, 2023 at 5:14 PM Oleg Nesterov <[email protected]> wrote:
> >
> > > > I don't understand you. OK, to simplify, suppose we have 2 global vars
> > > >
> > > > void *PTR = something_non_null;
> > > > unsigned long FLAGS = -1ul;
> > > >
> > > > Now I think this code
> > > >
> > > > CPU_0 CPU_1
> > > >
> > > > void *ptr = PTR; if (!test_and_set_bit(0, FLAGS))
> > > > clear_bit(0, FLAGS); PTR = NULL;
> > > > BUG_ON(!ptr);
> > > >
> > > > is racy and can hit the BUG_ON(!ptr).
> > >
> > > This seems different to the above case?
> >
> > not sure,
> >
> > > And you can hit BUG_ON with
> > > the following execution sequence:
> > >
> > > [cpu 0] clear_bit(0, FLAGS);
> > > [cpu 1] if (!test_and_set_bit(0, FLAGS))
> > > [cpu 1] PTR = NULL;
> > > [cpu 0] BUG_ON(!ptr)
> >
> > I don't understand this part... yes, we can hit this BUG_ON() without mb in
> > between, this is what I tried to say.
>
> I may miss something,

Or me... note that CPU_0 loads the global "PTR" into the local "ptr" before clear_bit.
Since you have mentioned the program order: yes this lacks READ_ONCE() or barrier(),
but the same is true for the code in vhost_worker(). So I still don't understand.

> but the above is the sequence that is executed
> by the processor (for each CPU, it's just the program order). So where
> do you expect to place an mb can help?

before clear_bit:

CPU_0

void *ptr = PTR;
mb(); // implies compiler barrier as well
clear_bit(0, FLAGS);
BUG_ON(!ptr);

just in case... mb() in the code above is only for illustration, we can use
smp_mb__before_atomic() + clear_bit(). Or just clear_bit_unlock(), iiuc the
one-way barrier is fine in this case.


> > > In vhost code, there's a condition before the clear_bit() which sits
> > > inside llist_for_each_entry_safe():
> > >
> > > #define llist_for_each_entry_safe(pos, n, node, member) \
> > > for (pos = llist_entry((node), typeof(*pos), member); \
> > > member_address_is_nonnull(pos, member) && \
> > > (n = llist_entry(pos->member.next, typeof(*n), member), true); \
> > > pos = n)
> > >
> > > The clear_bit() is a store which is not speculated, so there's a
> > > control dependency, the store can't be executed until the condition
> > > expression is evaluated which requires pos->member.next
> > > (work->node.next) to be loaded.
> >
> > But llist_for_each_entry_safe() doesn't check "n", I mean, it is not that we have
> > something like
> >
> > n = llist_entry(...);
> > if (n)
> > clear_bit(...);
> >
> > so I do not see how can we rely on the load-store control dependency.
>
> Just to make sure we are on the same page, the condition expression is
>
> member_address_is_nonnull(pos, member) && (n =
> llist_entry(pos->member.next, typeof(*n), member), true)
>
> So it's something like:
>
> if (work->node && (work_next = work->node->next, true))
> clear_bit(&work->flags);
>
> So two loads from both work->node and work->node->next, and there's a
> store which is clear_bit, then it's a load-store control dependencies?

I guess you missed the comma expression... Let me rewrite your pseudo-code
above, it is equivalent to

if (work->node) {
if ((work_next = work->node->next, true))
clear_bit(&work->flags);
}

another rewrite:

if (work->node) {
work_next = work->node->next;
if ((work, true))
clear_bit(&work->flags);
}

and the final rewrite:

if (work->node) {
work_next = work->node->next;
if (true)
clear_bit(&work->flags);
}

so again, I do not see the load-store control dependency. Not to mention this
code lacks READ_ONCE().


If we had something like

if (work->node) {
work_next = READ_ONCE(work->node->next);
if (work_next)
clear_bit(&work->flags);
}

instead, then yes, we could rely on the LOAD-STORE dependency.

Oleg.


2023-06-02 05:28:54

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On Thu, Jun 1, 2023 at 3:43 PM Oleg Nesterov <[email protected]> wrote:
>
> On 06/01, Jason Wang wrote:
> >
> > On Wed, May 31, 2023 at 5:14 PM Oleg Nesterov <[email protected]> wrote:
> > >
> > > > > I don't understand you. OK, to simplify, suppose we have 2 global vars
> > > > >
> > > > > void *PTR = something_non_null;
> > > > > unsigned long FLAGS = -1ul;
> > > > >
> > > > > Now I think this code
> > > > >
> > > > > CPU_0 CPU_1
> > > > >
> > > > > void *ptr = PTR; if (!test_and_set_bit(0, FLAGS))
> > > > > clear_bit(0, FLAGS); PTR = NULL;
> > > > > BUG_ON(!ptr);
> > > > >
> > > > > is racy and can hit the BUG_ON(!ptr).
> > > >
> > > > This seems different to the above case?
> > >
> > > not sure,
> > >
> > > > And you can hit BUG_ON with
> > > > the following execution sequence:
> > > >
> > > > [cpu 0] clear_bit(0, FLAGS);
> > > > [cpu 1] if (!test_and_set_bit(0, FLAGS))
> > > > [cpu 1] PTR = NULL;
> > > > [cpu 0] BUG_ON(!ptr)
> > >
> > > I don't understand this part... yes, we can hit this BUG_ON() without mb in
> > > between, this is what I tried to say.
> >
> > I may miss something,
>
> Or me... note that CPU_0 loads the global "PTR" into the local "ptr" before clear_bit.
> Since you have mentioned the program order: yes this lacks READ_ONCE() or barrier(),
> but the same is true for the code in vhost_worker(). So I still don't understand.
>
> > but the above is the sequence that is executed
> > by the processor (for each CPU, it's just the program order). So where
> > do you expect to place an mb can help?
>
> before clear_bit:
>
> CPU_0
>
> void *ptr = PTR;
> mb(); // implies compiler barrier as well
> clear_bit(0, FLAGS);
> BUG_ON(!ptr);
>
> just in case... mb() in the code above is only for illustration, we can use
> smp_mb__before_atomic() + clear_bit(). Or just clear_bit_unlock(), iiuc the
> one-way barrier is fine in this case.

Ok, but it seems different, in the case of vhost we had a condition
above the clear_bit().

>
>
> > > > In vhost code, there's a condition before the clear_bit() which sits
> > > > inside llist_for_each_entry_safe():
> > > >
> > > > #define llist_for_each_entry_safe(pos, n, node, member) \
> > > > for (pos = llist_entry((node), typeof(*pos), member); \
> > > > member_address_is_nonnull(pos, member) && \
> > > > (n = llist_entry(pos->member.next, typeof(*n), member), true); \
> > > > pos = n)
> > > >
> > > > The clear_bit() is a store which is not speculated, so there's a
> > > > control dependency, the store can't be executed until the condition
> > > > expression is evaluated which requires pos->member.next
> > > > (work->node.next) to be loaded.
> > >
> > > But llist_for_each_entry_safe() doesn't check "n", I mean, it is not that we have
> > > something like
> > >
> > > n = llist_entry(...);
> > > if (n)
> > > clear_bit(...);
> > >
> > > so I do not see how can we rely on the load-store control dependency.
> >
> > Just to make sure we are on the same page, the condition expression is
> >
> > member_address_is_nonnull(pos, member) && (n =
> > llist_entry(pos->member.next, typeof(*n), member), true)
> >
> > So it's something like:
> >
> > if (work->node && (work_next = work->node->next, true))
> > clear_bit(&work->flags);
> >
> > So two loads from both work->node and work->node->next, and there's a
> > store which is clear_bit, then it's a load-store control dependencies?
>
> I guess you missed the comma expression...

Probably not, see below:

> Let me rewrite your pseudo-code
> above, it is equivalent to
>
> if (work->node) {
> if ((work_next = work->node->next, true))
> clear_bit(&work->flags);
> }
>
> another rewrite:
>
> if (work->node) {
> work_next = work->node->next;
> if ((work, true))
> clear_bit(&work->flags);
> }
>
> and the final rewrite:
>
> if (work->node) {
> work_next = work->node->next;
> if (true)
> clear_bit(&work->flags);
> }
>
> so again, I do not see the load-store control dependency.

This kind of optimization is suspicious. Especially considering it's
the control expression of the loop but not a condition.

Looking at the assembly (x86):

0xffffffff81d46c5b <+75>: callq 0xffffffff81689ac0 <llist_reverse_order>
0xffffffff81d46c60 <+80>: mov %rax,%r15
0xffffffff81d46c63 <+83>: test %rax,%rax
0xffffffff81d46c66 <+86>: je 0xffffffff81d46c3a <vhost_worker+42>
0xffffffff81d46c68 <+88>: mov %r15,%rdi
0xffffffff81d46c6b <+91>: mov (%r15),%r15
0xffffffff81d46c6e <+94>: lock andb $0xfd,0x10(%rdi)
0xffffffff81d46c73 <+99>: movl $0x0,0x18(%rbx)
0xffffffff81d46c7a <+106>: mov 0x8(%rdi),%rax
0xffffffff81d46c7e <+110>: callq 0xffffffff821b39a0
<__x86_indirect_thunk_array>
0xffffffff81d46c83 <+115>: callq 0xffffffff821b4d10 <__SCT__cond_resched>
...

I can see:

1) The code read node->next (+91) before clear_bit (+94)
2) And the it uses a lock prefix to guarantee the execution order

> Not to mention this
> code lacks READ_ONCE().

Work_next is loaded only once for temporary storage, so I don't see
why we need that.

Thanks



>
>
> If we had something like
>
> if (work->node) {
> work_next = READ_ONCE(work->node->next);
> if (work_next)
> clear_bit(&work->flags);
> }
>
> instead, then yes, we could rely on the LOAD-STORE dependency.
>
> Oleg.
>


2023-06-02 18:27:54

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 06/02, Jason Wang wrote:
>
> On Thu, Jun 1, 2023 at 3:43 PM Oleg Nesterov <[email protected]> wrote:
> >
> > and the final rewrite:
> >
> > if (work->node) {
> > work_next = work->node->next;
> > if (true)
> > clear_bit(&work->flags);
> > }
> >
> > so again, I do not see the load-store control dependency.
>
> This kind of optimization is suspicious. Especially considering it's
> the control expression of the loop but not a condition.

It is not about optimization,

> Looking at the assembly (x86):
>
> 0xffffffff81d46c5b <+75>: callq 0xffffffff81689ac0 <llist_reverse_order>
> 0xffffffff81d46c60 <+80>: mov %rax,%r15
> 0xffffffff81d46c63 <+83>: test %rax,%rax
> 0xffffffff81d46c66 <+86>: je 0xffffffff81d46c3a <vhost_worker+42>
> 0xffffffff81d46c68 <+88>: mov %r15,%rdi
> 0xffffffff81d46c6b <+91>: mov (%r15),%r15
> 0xffffffff81d46c6e <+94>: lock andb $0xfd,0x10(%rdi)
> 0xffffffff81d46c73 <+99>: movl $0x0,0x18(%rbx)
> 0xffffffff81d46c7a <+106>: mov 0x8(%rdi),%rax
> 0xffffffff81d46c7e <+110>: callq 0xffffffff821b39a0
> <__x86_indirect_thunk_array>
> 0xffffffff81d46c83 <+115>: callq 0xffffffff821b4d10 <__SCT__cond_resched>
> ...
>
> I can see:
>
> 1) The code read node->next (+91) before clear_bit (+94)

The code does. but what about CPU ?

> 2) And the it uses a lock prefix to guarantee the execution order

As I said from the very beginning, this code is fine on x86 because
atomic ops are fully serialised on x86.

OK. we can't convince each other. I'll try to write another email when
I have time,

If this code is correct, then my understanding of memory barriers is even
worse than I think. I wouldn't be surprised, but I'd like to understand
what I have missed.

Oleg.


2023-06-02 20:31:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On Fri, Jun 2, 2023 at 1:59 PM Oleg Nesterov <[email protected]> wrote:
>
> As I said from the very beginning, this code is fine on x86 because
> atomic ops are fully serialised on x86.

Yes. Other architectures require __smp_mb__{before,after}_atomic for
the bit setting ops to actually be memory barriers.

We *should* probably have acquire/release versions of the bit test/set
helpers, but we don't, so they end up being full memory barriers with
those things. Which isn't optimal, but I doubt it matters on most
architectures.

So maybe we'll some day have a "test_bit_acquire()" and a
"set_bit_release()" etc.

Linus

2023-06-05 14:49:15

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

On 06/02, Linus Torvalds wrote:
>
> On Fri, Jun 2, 2023 at 1:59 PM Oleg Nesterov <[email protected]> wrote:
> >
> > As I said from the very beginning, this code is fine on x86 because
> > atomic ops are fully serialised on x86.
>
> Yes. Other architectures require __smp_mb__{before,after}_atomic for
> the bit setting ops to actually be memory barriers.
>
> We *should* probably have acquire/release versions of the bit test/set
> helpers, but we don't, so they end up being full memory barriers with
> those things. Which isn't optimal, but I doubt it matters on most
> architectures.
>
> So maybe we'll some day have a "test_bit_acquire()" and a
> "set_bit_release()" etc.

In this particular case we need clear_bit_release() and iiuc it is
already here, just it is named clear_bit_unlock().

So do you agree that vhost_worker() needs smp_mb__before_atomic()
before clear_bit() or just clear_bit_unlock() to avoid the race with
vhost_work_queue() ?

Let me provide a simplified example:

struct item {
struct llist_node llist;
unsigned long flags;
};

struct llist_head HEAD = {}; // global

void queue(struct item *item)
{
// ensure this item was already flushed
if (!test_and_set_bit(0, &item->flags))
llist_add(item->llist, &HEAD);

}

void flush(void)
{
struct llist_node *head = llist_del_all(&HEAD);
struct item *item, *next;

llist_for_each_entry_safe(item, next, head, llist)
clear_bit(0, &item->flags);
}

I think this code is buggy in that flush() can race with queue(), the same
way as vhost_worker() and vhost_work_queue().

Once flush() clears bit 0, queue() can come on another CPU and re-queue
this item and change item->llist.next. We need a barrier before clear_bit()
to ensure that next = llist_entry(item->next) in llist_for_each_entry_safe()
completes before the result of clear_bit() is visible to queue().

And, I do not think we can rely on control dependency because... because
I fail to see the load-store control dependency in this code,
llist_for_each_entry_safe() loads item->llist.next but doesn't check the
result until the next iteration.

No?

Oleg.