This patch allows the vhost and vhost_task code to use CLONE_THREAD,
CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
normal testing, haven't converted vsock and vdpa, and I know you guys
will not like the first patch. However, I think it better shows what
we need from the signal code and how we can support signals in the
vhost_task layer.
Note that I took the super simple route and kicked off some work to
the system workqueue. We can do more invasive approaches:
1. Modify the vhost drivers so they can check for IO completions using
a non-blocking interface. We then don't need to run from the system
workqueue and can run from the vhost_task.
2. We could drop patch 1 and just say we are doing a polling type
of approach. We then modify the vhost layer similar to #1 where we
can check for completions using a non-blocking interface and use
the vhost_task task.
The vhost_task can now support the worker being freed out from under the
device when we get a SIGKILL or the process exits without closing its
devices. We no longer need no_files, so this patch removes it.
Signed-off-by: Mike Christie <[email protected]>
---
include/linux/sched/task.h | 1 -
kernel/fork.c | 10 ++--------
kernel/vhost_task.c | 3 +--
3 files changed, 3 insertions(+), 11 deletions(-)
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 249a5ece9def..342fe297ffd4 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -28,7 +28,6 @@ struct kernel_clone_args {
u32 kthread:1;
u32 io_thread:1;
u32 user_worker:1;
- u32 no_files:1;
u32 block_signals:1;
unsigned long stack;
unsigned long stack_size;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9e04ab5c3946..f2c081c15efb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1769,8 +1769,7 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
return 0;
}
-static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
- int no_files)
+static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
{
struct files_struct *oldf, *newf;
int error = 0;
@@ -1782,11 +1781,6 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
if (!oldf)
goto out;
- if (no_files) {
- tsk->files = NULL;
- goto out;
- }
-
if (clone_flags & CLONE_FILES) {
atomic_inc(&oldf->count);
goto out;
@@ -2488,7 +2482,7 @@ __latent_entropy struct task_struct *copy_process(
retval = copy_semundo(clone_flags, p);
if (retval)
goto bad_fork_cleanup_security;
- retval = copy_files(clone_flags, p, args->no_files);
+ retval = copy_files(clone_flags, p);
if (retval)
goto bad_fork_cleanup_semundo;
retval = copy_fs(clone_flags, p);
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index a11f036290cc..642047765190 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -96,12 +96,11 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
{
struct kernel_clone_args args = {
.flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM |
- CLONE_THREAD | CLONE_SIGHAND,
+ CLONE_THREAD | CLONE_FILES, CLONE_SIGHAND,
.exit_signal = 0,
.fn = vhost_task_fn,
.name = name,
.user_worker = 1,
- .no_files = 1,
.block_signals = 1,
};
struct vhost_task *vtsk;
--
2.25.1
This is a modified version of Linus's patch which has vhost_task
use CLONE_THREAD and CLONE_SIGHAND and allow SIGKILL and SIGSTOP.
I renamed ignore_signals to block_signals based on Linus's comment,
since that better matches what we now do: set p->blocked with
siginitsetinv() instead of calling ignore_signals().
Signed-off-by: Mike Christie <[email protected]>
---
include/linux/sched/task.h | 2 +-
kernel/fork.c | 12 +++---------
kernel/vhost_task.c | 5 +++--
3 files changed, 7 insertions(+), 12 deletions(-)
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..249a5ece9def 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,7 @@ struct kernel_clone_args {
u32 io_thread:1;
u32 user_worker:1;
u32 no_files:1;
- u32 ignore_signals:1;
+ u32 block_signals:1;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..9e04ab5c3946 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process(
p->flags |= PF_KTHREAD;
if (args->user_worker)
p->flags |= PF_USER_WORKER;
- if (args->io_thread) {
- /*
- * Mark us an IO worker, and block any signal that isn't
- * fatal or STOP
- */
+ if (args->io_thread)
p->flags |= PF_IO_WORKER;
+ if (args->block_signals)
siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
- }
if (args->name)
strscpy_pad(p->comm, args->name, sizeof(p->comm));
@@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;
- if (args->ignore_signals)
- ignore_signals(p);
-
stackleak_task_init(p);
if (pid != &init_struct_pid) {
@@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
.fn_arg = arg,
.io_thread = 1,
.user_worker = 1,
+ .block_signals = 1,
};
return copy_process(NULL, 0, node, &args);
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index a661cfa32ba3..a11f036290cc 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -95,13 +95,14 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
const char *name)
{
struct kernel_clone_args args = {
- .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+ .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM |
+ CLONE_THREAD | CLONE_SIGHAND,
.exit_signal = 0,
.fn = vhost_task_fn,
.name = name,
.user_worker = 1,
.no_files = 1,
- .ignore_signals = 1,
+ .block_signals = 1,
};
struct vhost_task *vtsk;
struct task_struct *tsk;
--
2.25.1
This patch has vhost use get_signal to handle freezing and sort of
handle signals. By the latter I mean that when we get SIGKILL, our
parent will exit and call our file_operations release function. That will
then stop new work from being queued and wait for the vhost_task to
handle completions for running IO. We then exit when those are done.
The next patches will then have us work more like io_uring, where
we handle the get_signal() return value and key off of it to clean up.
Signed-off-by: Mike Christie <[email protected]>
---
drivers/vhost/vhost.c | 10 +++++++++-
include/linux/sched/vhost_task.h | 1 +
kernel/vhost_task.c | 20 ++++++++++++++++++++
3 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..1ba9e068b2ab 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -349,8 +349,16 @@ static int vhost_worker(void *data)
}
node = llist_del_all(&worker->work_list);
- if (!node)
+ if (!node) {
schedule();
+ /*
+ * When we get a SIGKILL our release function will
+ * be called. That will stop new IOs from being queued
+ * and check for outstanding cmd responses. It will then
+ * call vhost_task_stop to exit us.
+ */
+ vhost_task_get_signal();
+ }
node = llist_reverse_order(node);
/* make sure flag is seen after deletion */
diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
index 6123c10b99cf..54b68115eb3b 100644
--- a/include/linux/sched/vhost_task.h
+++ b/include/linux/sched/vhost_task.h
@@ -19,5 +19,6 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
void vhost_task_start(struct vhost_task *vtsk);
void vhost_task_stop(struct vhost_task *vtsk);
bool vhost_task_should_stop(struct vhost_task *vtsk);
+bool vhost_task_get_signal(void);
#endif
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..a661cfa32ba3 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -61,6 +61,26 @@ bool vhost_task_should_stop(struct vhost_task *vtsk)
}
EXPORT_SYMBOL_GPL(vhost_task_should_stop);
+/**
+ * vhost_task_get_signal - Check if there are pending signals
+ *
+ * Return true if we got SIGKILL.
+ */
+bool vhost_task_get_signal(void)
+{
+ struct ksignal ksig;
+ bool rc;
+
+ if (!signal_pending(current))
+ return false;
+
+ __set_current_state(TASK_RUNNING);
+ rc = get_signal(&ksig);
+ set_current_state(TASK_INTERRUPTIBLE);
+ return rc;
+}
+EXPORT_SYMBOL_GPL(vhost_task_get_signal);
+
/**
* vhost_task_create - create a copy of a process to be used by the kernel
* @fn: thread stack
--
2.25.1
On Wed, May 17, 2023 at 5:09 PM Mike Christie
<[email protected]> wrote:
>
> + __set_current_state(TASK_RUNNING);
> + rc = get_signal(&ksig);
> + set_current_state(TASK_INTERRUPTIBLE);
> + return rc;
The games with current_state seem nonsensical.
What are they all about? get_signal() shouldn't care, and no other
caller does this thing. This just seems completely random.
Linus
This moves the scsi code we use to stop new work from being queued
and to wait on running work into a helper, which is used by the vhost layer
when the vhost_task is being killed by a SIGKILL.
Signed-off-by: Mike Christie <[email protected]>
---
drivers/vhost/scsi.c | 23 +++++++++++++++--------
1 file changed, 15 insertions(+), 8 deletions(-)
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 40f9135e1a62..a0f2588270f2 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1768,6 +1768,19 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, u64 features)
return 0;
}
+static void vhost_scsi_stop_dev_work(struct vhost_dev *dev)
+{
+ struct vhost_scsi *vs = container_of(dev, struct vhost_scsi, dev);
+ struct vhost_scsi_target t;
+
+ mutex_lock(&vs->dev.mutex);
+ memcpy(t.vhost_wwpn, vs->vs_vhost_wwpn, sizeof(t.vhost_wwpn));
+ mutex_unlock(&vs->dev.mutex);
+ vhost_scsi_clear_endpoint(vs, &t);
+ vhost_dev_stop(&vs->dev);
+ vhost_dev_cleanup(&vs->dev);
+}
+
static int vhost_scsi_open(struct inode *inode, struct file *f)
{
struct vhost_scsi *vs;
@@ -1821,7 +1834,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
}
vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV, VHOST_SCSI_WEIGHT, 0,
- true, NULL, NULL);
+ true, NULL, vhost_scsi_stop_dev_work);
vhost_scsi_init_inflight(vs, NULL);
@@ -1843,14 +1856,8 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
static int vhost_scsi_release(struct inode *inode, struct file *f)
{
struct vhost_scsi *vs = f->private_data;
- struct vhost_scsi_target t;
- mutex_lock(&vs->dev.mutex);
- memcpy(t.vhost_wwpn, vs->vs_vhost_wwpn, sizeof(t.vhost_wwpn));
- mutex_unlock(&vs->dev.mutex);
- vhost_scsi_clear_endpoint(vs, &t);
- vhost_dev_stop(&vs->dev);
- vhost_dev_cleanup(&vs->dev);
+ vhost_dev_stop_work(&vs->dev);
kfree(vs->dev.vqs);
kfree(vs->vqs);
kfree(vs->old_inflight);
--
2.25.1
On 5/17/23 7:09 PM, Mike Christie wrote:
> + CLONE_THREAD | CLONE_FILES, CLONE_SIGHAND,
Sorry. I tried to throw this one in at the last second so we could see
that we can now use CLONE_FILES like io_uring.
It will of course not compile.
On 5/17/23 7:16 PM, Linus Torvalds wrote:
> On Wed, May 17, 2023 at 5:09 PM Mike Christie
> <[email protected]> wrote:
>>
>> + __set_current_state(TASK_RUNNING);
>> + rc = get_signal(&ksig);
>> + set_current_state(TASK_INTERRUPTIBLE);
>> + return rc;
>
> The games with current_state seem nonsensical.
>
> What are they all about? get_signal() shouldn't care, and no other
> caller does this thing. This just seems completely random.
Sorry. It's a leftover.
I was originally calling this from vhost_task_should_stop where before
calling that function we do a:
set_current_state(TASK_INTERRUPTIBLE);
So, I was hitting get_signal->try_to_freeze->might_sleep->__might_sleep
and was getting the "do not call blocking ops when !TASK_RUNNING"
warnings.
On Wed, May 17, 2023 at 07:09:15PM -0500, Mike Christie wrote:
> This is a modified version of Linus's patch which has vhost_task
> use CLONE_THREAD and CLONE_SIGHAND and allow SIGKILL and SIGSTOP.
>
> I renamed ignore_signals to block_signals based on Linus's comment,
> since that better matches what we now do: set p->blocked with
> siginitsetinv() instead of calling ignore_signals().
>
> Signed-off-by: Mike Christie <[email protected]>
> ---
Yes, much nicer than what this was before,
Acked-by: Christian Brauner <[email protected]>
On Wed, May 17, 2023 at 08:01:45PM -0500, Mike Christie wrote:
> On 5/17/23 7:16 PM, Linus Torvalds wrote:
> > On Wed, May 17, 2023 at 5:09 PM Mike Christie
> > <[email protected]> wrote:
> >>
> >> + __set_current_state(TASK_RUNNING);
> >> + rc = get_signal(&ksig);
> >> + set_current_state(TASK_INTERRUPTIBLE);
> >> + return rc;
> >
> > The games with current_state seem nonsensical.
> >
> > What are they all about? get_signal() shouldn't care, and no other
> > caller does this thing. This just seems completely random.
>
> Sorry. It's a leftover.
>
> I was originally calling this from vhost_task_should_stop where before
> calling that function we do a:
>
> set_current_state(TASK_INTERRUPTIBLE);
>
> So, I was hitting get_signal->try_to_freeze->might_sleep->__might_sleep
> and was getting the "do not call blocking ops when !TASK_RUNNING"
> warnings.
Also, it seems you might want to check the return value of your new helper...
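Something like this in vhost_worker(), i.e. actually acting on the return
value (sketch only; "killed" is a hypothetical local flag):

	if (!node) {
		schedule();
		/* If we got SIGKILL, note it instead of dropping the result. */
		if (vhost_task_get_signal())
			killed = true;
	}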
On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
> This patch allows the vhost and vhost_task code to use CLONE_THREAD,
> CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
> normal testing, haven't converted vsock and vdpa, and I know you guys
> will not like the first patch. However, I think it better shows what
Just to summarize the core idea behind my proposal is that no signal
handling changes are needed unless there's a bug in the current way
io_uring workers already work. All that should be needed is
s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
If you follow my proposal then vhost and io_uring workers should almost
collapse into the same concept. Specifically, io_uring workers and vhost
workers should behave the same when it comes to handling signals.
See
https://lore.kernel.org/lkml/20230518-kontakt-geduckt-25bab595f503@brauner
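Concretely, the check in get_signal() would go from roughly

	if (current->flags & PF_IO_WORKER)
		goto fatal;

to

	if (current->flags & PF_USER_WORKER)
		goto fatal;

(sketch of the substitution only; the exact surrounding context in
kernel/signal.c may differ).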
> we need from the signal code and how we can support signals in the
> vhost_task layer.
>
> Note that I took the super simple route and kicked off some work to
> the system workqueue. We can do more invasive approaches:
> 1. Modify the vhost drivers so they can check for IO completions using
> a non-blocking interface. We then don't need to run from the system
> workqueue and can run from the vhost_task.
>
> 2. We could drop patch 1 and just say we are doing a polling type
> of approach. We then modify the vhost layer similar to #1 where we
> can check for completions using a non-blocking interface and use
> the vhost_task task.
My preference would be to do whatever is the minimal thing now and has
the least bug potential and is the easiest to review for us non-vhost
experts. Then you can take all the time to rework and improve the vhost
infra based on the possibilities that using user workers offers. Plus,
that can easily happen in the next kernel cycle.
Remember, that we're trying to fix a regression here. A regression on an
unreleased kernel but still.
On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
> > This patch allows the vhost and vhost_task code to use CLONE_THREAD,
> > CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
> > normal testing, haven't converted vsock and vdpa, and I know you guys
> > will not like the first patch. However, I think it better shows what
>
> Just to summarize the core idea behind my proposal is that no signal
> handling changes are needed unless there's a bug in the current way
> io_uring workers already work. All that should be needed is
> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
>
> If you follow my proposal then vhost and io_uring workers should almost
> collapse into the same concept. Specifically, io_uring workers and vhost
> workers should behave the same when it comes to handling signals.
>
> See
> https://lore.kernel.org/lkml/20230518-kontakt-geduckt-25bab595f503@brauner
>
>
> > we need from the signal code and how we can support signals in the
> > vhost_task layer.
> >
> > Note that I took the super simple route and kicked off some work to
> > the system workqueue. We can do more invasive approaches:
> > 1. Modify the vhost drivers so they can check for IO completions using
> > a non-blocking interface. We then don't need to run from the system
> > workqueue and can run from the vhost_task.
> >
> > 2. We could drop patch 1 and just say we are doing a polling type
> > of approach. We then modify the vhost layer similar to #1 where we
> > can check for completions using a non-blocking interface and use
> > the vhost_task task.
>
> My preference would be to do whatever is the minimal thing now and has
> the least bug potential and is the easiest to review for us non-vhost
> experts. Then you can take all the time to rework and improve the vhost
> infra based on the possibilities that using user workers offers. Plus,
> that can easily happen in the next kernel cycle.
>
> Remember, that we're trying to fix a regression here. A regression on an
> unreleased kernel but still.
It's a public holiday here today so I'll try to find time to review this
tomorrow.
On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
> > This patch allows the vhost and vhost_task code to use CLONE_THREAD,
> > CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
> > normal testing, haven't converted vsock and vdpa, and I know you guys
> > will not like the first patch. However, I think it better shows what
>
> Just to summarize the core idea behind my proposal is that no signal
> handling changes are needed unless there's a bug in the current way
> io_uring workers already work. All that should be needed is
> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
>
> If you follow my proposal then vhost and io_uring workers should almost
> collapse into the same concept. Specifically, io_uring workers and vhost
> workers should behave the same when it comes to handling signals.
>
> See
> https://lore.kernel.org/lkml/20230518-kontakt-geduckt-25bab595f503@brauner
>
>
> > we need from the signal code and how we can support signals in the
> > vhost_task layer.
> >
> > Note that I took the super simple route and kicked off some work to
> > the system workqueue. We can do more invasive approaches:
> > 1. Modify the vhost drivers so they can check for IO completions using
> > a non-blocking interface. We then don't need to run from the system
> > workqueue and can run from the vhost_task.
> >
> > 2. We could drop patch 1 and just say we are doing a polling type
> > of approach. We then modify the vhost layer similar to #1 where we
> > can check for completions using a non-blocking interface and use
> > the vhost_task task.
>
> My preference would be to do whatever is the minimal thing now and has
> the least bug potential and is the easiest to review for us non-vhost
> experts. Then you can take all the time to rework and improve the vhost
> infra based on the possibilities that using user workers offers. Plus,
> that can easily happen in the next kernel cycle.
>
> Remember, that we're trying to fix a regression here. A regression on an
> unreleased kernel but still.
Just two more thoughts:
The following places currently check for PF_IO_WORKER:
arch/x86/include/asm/fpu/sched.h: !(current->flags & (PF_KTHREAD | PF_IO_WORKER))) {
arch/x86/kernel/fpu/context.h: if (WARN_ON_ONCE(current->flags & (PF_KTHREAD | PF_IO_WORKER)))
arch/x86/kernel/fpu/core.c: if (!(current->flags & (PF_KTHREAD | PF_IO_WORKER)) &&
Both PF_KTHREAD and PF_IO_WORKER don't need TIF_NEED_FPU_LOAD because
they never return to userspace. But that's not specific to
PF_IO_WORKERs. Please generalize this to just check for PF_USER_WORKER
via a simple s/PF_IO_WORKER/PF_USER_WORKER/g in these places.
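After that substitution the three hits above would read (sketch):
arch/x86/include/asm/fpu/sched.h: !(current->flags & (PF_KTHREAD | PF_USER_WORKER))) {
arch/x86/kernel/fpu/context.h: if (WARN_ON_ONCE(current->flags & (PF_KTHREAD | PF_USER_WORKER)))
arch/x86/kernel/fpu/core.c: if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) &&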
Another thing, in the sched code we have hooks into sched_submit_work()
and sched_update_worker() specific to PF_IO_WORKERs. But again, I don't
think this needs to be special to PF_IO_WORKERS. This might be
generally useful for PF_USER_WORKER. So we should probably generalize
this and have a generic user_worker_sleeping() and user_worker_running()
helper that figures out internally what specific helper to call. That's
not something that needs to be done right now though since I don't think
vhost needs this functionality.
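A rough sketch of the kind of helpers meant here (names and placement are
illustrative only):

	static inline void user_worker_sleeping(struct task_struct *tsk)
	{
		if (tsk->flags & PF_IO_WORKER)
			io_wq_worker_sleeping(tsk);
		/* other PF_USER_WORKER users could hook in here later */
	}

	static inline void user_worker_running(struct task_struct *tsk)
	{
		if (tsk->flags & PF_IO_WORKER)
			io_wq_worker_running(tsk);
	}

sched_submit_work() and sched_update_worker() would then call these instead
of the io_wq_* helpers directly.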
But we should generalize this for the next development cycle so we have
this all nice and clean when someone actually needs this. Overall this
will mean that there would only be a single place left where
PF_IO_WORKER would need to be checked and that's in io_uring code
itself. And if we do things just right we might not even need that
PF_IO_WORKER flag anymore at all. But again, that's just notes for next
cycle.
Thoughts? Rotten apples?
On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
> > This patch allows the vhost and vhost_task code to use CLONE_THREAD,
> > CLONE_SIGHAND and CLONE_FILES. It's a RFC because I didn't do all the
> > normal testing, haven't coverted vsock and vdpa, and I know you guys
> > will not like the first patch. However, I think it better shows what
>
> Just to summarize the core idea behind my proposal is that no signal
> handling changes are needed unless there's a bug in the current way
> io_uring workers already work. All that should be needed is
> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
>
> If you follow my proposal then vhost and io_uring workers should almost
> collapse into the same concept. Specifically, io_uring workers and vhost
> workers should behave the same when it comes to handling signals.
>
> See
> https://lore.kernel.org/lkml/20230518-kontakt-geduckt-25bab595f503@brauner
>
>
> > we need from the signal code and how we can support signals in the
> > vhost_task layer.
> >
> > Note that I took the super simple route and kicked off some work to
> > the system workqueue. We can do more invasive approaches:
> > 1. Modify the vhost drivers so they can check for IO completions using
> > a non-blocking interface. We then don't need to run from the system
> > workqueue and can run from the vhost_task.
> >
> > 2. We could drop patch 1 and just say we are doing a polling type
> > of approach. We then modify the vhost layer similar to #1 where we
> > can check for completions using a non-blocking interface and use
> > the vhost_task task.
>
> My preference would be to do whatever is the minimal thing now and has
> the least bug potential and is the easiest to review for us non-vhost
> experts. Then you can take all the time to rework and improve the vhost
> infra based on the possibilities that using user workers offers. Plus,
> that can easily happen in the next kernel cycle.
>
> Remember, that we're trying to fix a regression here. A regression on an
> unreleased kernel but still.
On Tue, May 16, 2023 at 10:40:01AM +0200, Christian Brauner wrote:
> On Mon, May 15, 2023 at 05:23:12PM -0500, Mike Christie wrote:
> > On 5/15/23 10:44 AM, Linus Torvalds wrote:
> > > On Mon, May 15, 2023 at 7:23 AM Christian Brauner <[email protected]> wrote:
> > >>
> > >> So I think we will be able to address (1) and (2) by making vhost tasks
> > >> proper threads and blocking every signal except for SIGKILL and SIGSTOP
> > >> and then having vhost handle get_signal() - as you mentioned - the same
> > >> way io_uring already does. We should also remove the ignore_signals
> > >> thing completely imho. I don't think we ever want to do this with user
> > >> workers.
> > >
> > > Right. That's what IO_URING does:
> > >
> > > if (args->io_thread) {
> > > /*
> > > * Mark us an IO worker, and block any signal that isn't
> > > * fatal or STOP
> > > */
> > > p->flags |= PF_IO_WORKER;
> > > siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
> > > }
> > >
> > > and I really think that vhost should basically do exactly what io_uring does.
> > >
> > > Not because io_uring fundamentally got this right - but simply because
> > > io_uring had almost all the same bugs (and then some), and what the
> > > io_uring worker threads ended up doing was to basically zoom in on
> > > "this works".
> > >
> > > And it zoomed in on it largely by just going for "make it look as much
> > > as possible as a real user thread", because every time the kernel
> > > thread did something different, it just caused problems.
> > >
> > > So I think the patch should just look something like the attached.
> > > Mike, can you test this on whatever vhost test-suite?
> >
> > I tried that approach already and it doesn't work because io_uring and vhost
> > differ in that vhost drivers implement a device where each device has a vhost_task
> > and the drivers have a file_operations for the device. When the vhost_task's
> > parent gets signal like SIGKILL, then it will exit and call into the vhost
> > driver's file_operations->release function. At this time, we need to do cleanup
>
> But that's no reason why the vhost worker couldn't just be allowed to
> exit on SIGKILL cleanly similar to io_uring. That's just describing the
> current architecture which isn't a necessity afaict. And the helper
> thread could e.g., crash.
>
> > like flush the device which uses the vhost_task. There is also the case where if
> > the vhost_task gets a SIGKILL, we can just exit from under the vhost layer.
>
> In a way I really don't like the patch below. Because this should be
> solvable by adapting vhost workers. Right now, vhost is coming from a
> kthread model and we ported it to a user worker model and the whole
> point of this exercise has been that the workers behave more like
> regular userspace processes. So my tendency is to not massage kernel
> signal handling to now also include a special case for user workers in
> addition to kthreads. That's just the wrong way around and then vhost
> could've just stuck with kthreads in the first place.
>
> So I'm fine with skipping over the freezing case for now but SIGKILL
> should be handled imho. Only init and kthreads should get the luxury of
> ignoring SIGKILL.
>
> So, I'm afraid I'm asking some work here of you but how feasible would a
> model be where vhost_worker() similar to io_wq_worker() gracefully
> handles SIGKILL. Yes, I see there's
>
> net.c: .release = vhost_net_release
> scsi.c: .release = vhost_scsi_release
> test.c: .release = vhost_test_release
> vdpa.c: .release = vhost_vdpa_release
> vsock.c: .release = virtio_transport_release
> vsock.c: .release = vhost_vsock_dev_release
>
> but that means you have all the basic logic in place and all of those
> drivers also support the VHOST_RESET_OWNER ioctl which also stops the
> vhost worker. I'm confident that a lot of this can be leveraged to just
> cleanup on SIGKILL.
>
> So it feels like this should be achievable by adding a callback to
> struct vhost_worker that gets called when vhost_worker() gets SIGKILL
> and that all the users of vhost workers are forced to implement.
>
> Yes, it is more work but I think that's the right thing to do and not to
> complicate our signal handling.
>
> Worst case if this can't be done fast enough we'll have to revert the
> vhost parts. I think the user worker parts are mostly sane and are
As mentioned, if we can't settle this cleanly before -rc4 we should
revert the vhost parts unless Linus wants to have it earlier.
On 19.05.23 14:15, Christian Brauner wrote:
> On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
>> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
>>> This patch allows the vhost and vhost_task code to use CLONE_THREAD,
>>> CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
>>> normal testing, haven't converted vsock and vdpa, and I know you guys
>>> will not like the first patch. However, I think it better shows what
>>
>> Just to summarize the core idea behind my proposal is that no signal
>> handling changes are needed unless there's a bug in the current way
>> io_uring workers already work. All that should be needed is
>> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
[...]
>> So it feels like this should be achievable by adding a callback to
>> struct vhost_worker that gets called when vhost_worker() gets SIGKILL
>> and that all the users of vhost workers are forced to implement.
>>
>> Yes, it is more work but I think that's the right thing to do and not to
>> complicate our signal handling.
>>
>> Worst case if this can't be done fast enough we'll have to revert the
>> vhost parts. I think the user worker parts are mostly sane and are
>
> As mentioned, if we can't settle this cleanly before -rc4 we should
> revert the vhost parts unless Linus wants to have it earlier.
Meanwhile -rc5 is just a few days away and there are still a lot of
discussions in the patch-set proposed to address the issues[1]. Which is
kinda great (albeit also why I haven't given it a spin yet), but on the
other hand makes me wonder:
Is it maybe time to revert the vhost parts for 6.4 and try again next cycle?
[1]
https://lore.kernel.org/all/[email protected]/
Ciao, Thorsten "not sure if I'm asking because I'm affected, or because
it's my duty as regression tracker" Leemhuis
On Thu, Jun 01, 2023 at 09:58:38AM +0200, Thorsten Leemhuis wrote:
> On 19.05.23 14:15, Christian Brauner wrote:
> > On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
> >> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
> >>> This patch allows the vhost and vhost_task code to use CLONE_THREAD,
> >>> CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
> >>> normal testing, haven't converted vsock and vdpa, and I know you guys
> >>> will not like the first patch. However, I think it better shows what
> >>
> >> Just to summarize the core idea behind my proposal is that no signal
> >> handling changes are needed unless there's a bug in the current way
> >> io_uring workers already work. All that should be needed is
> >> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
> [...]
> >> So it feels like this should be achievable by adding a callback to
> >> struct vhost_worker that gets called when vhost_worker() gets SIGKILL
> >> and that all the users of vhost workers are forced to implement.
> >>
> >> Yes, it is more work but I think that's the right thing to do and not to
> >> complicate our signal handling.
> >>
> >> Worst case if this can't be done fast enough we'll have to revert the
> >> vhost parts. I think the user worker parts are mostly sane and are
> >
> > As mentioned, if we can't settle this cleanly before -rc4 we should
> > revert the vhost parts unless Linus wants to have it earlier.
>
> Meanwhile -rc5 is just a few days away and there are still a lot of
> discussions in the patch-set proposed to address the issues[1]. Which is
> kinda great (albeit also why I haven't given it a spin yet), but on the
> other hand makes me wonder:
You might've missed it in the thread but it seems everyone is currently
operating under the assumption that the preferred way is to fix this
rather than revert. See the mail in [1]:
"So I'd really like to finish this. Even if we end up with a hack or
two in signal handling that we can hopefully fix up later by having
vhost fix up some of its current assumptions."
which is why no revert was send for -rc4. And there's a temporary fix we
seem to have converged on.
@Mike, do you want to prepare an updated version of the temporary fix.
If @Linus prefers to just apply it directly he can just grab it from the
list rather than delaying it. Make sure to grab a Co-developed-by line
on this, @Mike.
Just in case we misunderstood the intention, I also prepared a revert
at the end of this mail that Linus can use.
@Thorsten, you can test it if you want. The revert only reverts the
vhost bits as the general agreement seems to be that user workers are
otherwise the path forward.
[1]: https://lore.kernel.org/lkml/CAHk-=wj4DS=2F5mW+K2P7cVqrsuGd3rKE_2k2BqnnPeeYhUCvg@mail.gmail.com
---
/* Summary */
Switching vhost workers to user workers broke existing workflows because
vhost workers started showing up in ps output breaking various scripts.
The reason is that vhost user workers are currently spawned as separate
processes and not as threads. Revert the patches converting vhost from
kthreads to vhost workers until vhost is ready to support user workers
created as actual threads.
The following changes since commit 7877cb91f1081754a1487c144d85dc0d2e2e7fc4:
Linux 6.4-rc4 (2023-05-28 07:49:00 -0400)
are available in the Git repository at:
[email protected]:pub/scm/linux/kernel/git/brauner/linux tags/kernel/v6.4-rc4/vhost
for you to fetch changes up to b20084b6bc90012a8ccce72ef1c0050d5fd42aa8:
Revert "vhost_task: Allow vhost layer to use copy_process" (2023-06-01 12:33:19 +0200)
----------------------------------------------------------------
kernel/v6.4-rc4/vhost
----------------------------------------------------------------
Christian Brauner (3):
Revert "vhost: use vhost_tasks for worker threads"
Revert "vhost: move worker thread fields to new struct"
Revert "vhost_task: Allow vhost layer to use copy_process"
MAINTAINERS | 1 -
drivers/vhost/Kconfig | 5 --
drivers/vhost/vhost.c | 124 ++++++++++++++++++++-------------------
drivers/vhost/vhost.h | 11 +---
include/linux/sched/vhost_task.h | 23 --------
kernel/Makefile | 1 -
kernel/vhost_task.c | 117 ------------------------------------
7 files changed, 67 insertions(+), 215 deletions(-)
delete mode 100644 include/linux/sched/vhost_task.h
delete mode 100644 kernel/vhost_task.c
Le 01/06/2023 à 09:58, Thorsten Leemhuis a écrit :
[snip]
>
> Meanwhile -rc5 is just a few days away and there are still a lot of
> discussions in the patch-set proposed to address the issues[1]. Which is
> kinda great (albeit also why I haven't given it a spin yet), but on the
> other hand makes me wonder:
>
> Is it maybe time to revert the vhost parts for 6.4 and try again next cycle?
At least it's time to find a way to fix this issue :)
Thank you,
Nicolas
On 01.06.23 12:47, Christian Brauner wrote:
> On Thu, Jun 01, 2023 at 09:58:38AM +0200, Thorsten Leemhuis wrote:
>> On 19.05.23 14:15, Christian Brauner wrote:
>>> On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
>>>> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
>>>>> This patch allows the vhost and vhost_task code to use CLONE_THREAD,
>>>>> CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
>>>>> normal testing, haven't converted vsock and vdpa, and I know you guys
>>>>> will not like the first patch. However, I think it better shows what
>>>>
>>>> Just to summarize the core idea behind my proposal is that no signal
>>>> handling changes are needed unless there's a bug in the current way
>>>> io_uring workers already work. All that should be needed is
>>>> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
>> [...]
>>>> So it feels like this should be achievable by adding a callback to
>>>> struct vhost_worker that gets called when vhost_worker() gets SIGKILL
>>>> and that all the users of vhost workers are forced to implement.
>>>>
>>>> Yes, it is more work but I think that's the right thing to do and not to
>>>> complicate our signal handling.
>>>>
>>>> Worst case if this can't be done fast enough we'll have to revert the
>>>> vhost parts. I think the user worker parts are mostly sane and are
>>>
>>> As mentioned, if we can't settle this cleanly before -rc4 we should
>>> revert the vhost parts unless Linus wants to have it earlier.
>>
>> Meanwhile -rc5 is just a few days away and there are still a lot of
>> discussions in the patch-set proposed to address the issues[1]. Which is
>> kinda great (albeit also why I haven't given it a spin yet), but on the
>> other hand makes me wonder:
>
> You might've missed it in the thread but it seems everyone is currently
> operating under the assumption that the preferred way is to fix this
> rather than revert.
I saw that, but that was also a week ago already, so I slowly started to
wonder if plans might have/should be changed. Anyway: if that's still
the plan forward it's totally fine for me if it's fine for Linus. :-D
BTW: I for now didn't sit down to test Mike's patches, as due to all the
discussions I assumed new ones would be coming sooner or later anyway.
If it's worth giving them a shot, please let me know.
> [...]
Thx for the update!
Ciao, Thorsten
On Thu, Jun 1, 2023 at 6:47 AM Christian Brauner <[email protected]> wrote:
>
> @Mike, do you want to prepare an updated version of the temporary fix.
> If @Linus prefers to just apply it directly he can just grab it from the
> list rather than delaying it. Make sure to grab a Co-developed-by line
> on this, @Mike.
Yeah, let's apply the known "fix the immediate regression" patch wrt
vhost ps output and the freezer. That gets rid of the regression.
I think that we can - and should - then treat the questions about core
dumping and execve as separate issues.
vhost wouldn't have done execve since it's nonsensical and has never
worked anyway since it always left the old mm ref behind, and
similarly core dumping has never been an issue.
So on those things we don't have any "semantic" issues, we just need
to make sure we don't do crazy things like hang uninterruptibly.
Linus
On 6/1/23 5:47 AM, Christian Brauner wrote:
> On Thu, Jun 01, 2023 at 09:58:38AM +0200, Thorsten Leemhuis wrote:
>> On 19.05.23 14:15, Christian Brauner wrote:
>>> On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
>>>> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
>>>>> This patch allows the vhost and vhost_task code to use CLONE_THREAD,
>>>>> CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
>>>>> normal testing, haven't converted vsock and vdpa, and I know you guys
>>>>> will not like the first patch. However, I think it better shows what
>>>> Just to summarize the core idea behind my proposal is that no signal
>>>> handling changes are needed unless there's a bug in the current way
>>>> io_uring workers already work. All that should be needed is
>>>> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
>> [...]
>>>> So it feels like this should be achievable by adding a callback to
>>>> struct vhost_worker that gets called when vhost_worker() gets SIGKILL
>>>> and that all the users of vhost workers are forced to implement.
>>>>
>>>> Yes, it is more work but I think that's the right thing to do and not to
>>>> complicate our signal handling.
>>>>
>>>> Worst case if this can't be done fast enough we'll have to revert the
>>>> vhost parts. I think the user worker parts are mostly sane and are
>>> As mentioned, if we can't settle this cleanly before -rc4 we should
>>> revert the vhost parts unless Linus wants to have it earlier.
>> Meanwhile -rc5 is just a few days away and there are still a lot of
>> discussions in the patch-set proposed to address the issues[1]. Which is
>> kinda great (albeit also why I haven't given it a spin yet), but on the
>> other hand makes me wonder:
> You might've missed it in the thread but it seems everyone is currently
> operating under the assumption that the preferred way is to fix this
> rather than revert. See the mail in [1]:
>
> "So I'd really like to finish this. Even if we end up with a hack or
> two in signal handling that we can hopefully fix up later by having
> vhost fix up some of its current assumptions."
>
> which is why no revert was send for -rc4. And there's a temporary fix we
> seem to have converged on.
>
> @Mike, do you want to prepare an updated version of the temporary fix.
> If @Linus prefers to just apply it directly he can just grab it from the
> list rather than delaying it. Make sure to grab a Co-developed-by line
> on this, @Mike.
Yes, I'll send it within a couple hours.