2010-06-24 08:17:20

by Michael S. Tsirkin

Subject: Re: [PATCH 3/3] vhost: apply cpumask and cgroup to vhost pollers

On Sun, May 30, 2010 at 10:25:01PM +0200, Tejun Heo wrote:
> Apply the cpumask and cgroup of the initializing task to the created
> vhost poller.
>
> Based on Sridhar Samudrala's patch.
>
> Cc: Michael S. Tsirkin <[email protected]>
> Cc: Sridhar Samudrala <[email protected]>


I wanted to apply this, but modpost fails:
ERROR: "sched_setaffinity" [drivers/vhost/vhost_net.ko] undefined!
ERROR: "sched_getaffinity" [drivers/vhost/vhost_net.ko] undefined!

Did you try building as a module?
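
For illustration only (an assumption about the build setup, not stated in the
thread): the errors above appear when vhost-net is configured as a module,
roughly

	CONFIG_VHOST_NET=m	# vhost built as vhost_net.ko
	$ make modules		# modpost resolves module symbols against
				# EXPORT_SYMBOL()s and fails as quoted above

because sched_setaffinity()/sched_getaffinity() carry no export at this
point; a built-in vhost (=y) links against them directly and never hits
modpost.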

> ---
> drivers/vhost/vhost.c | 36 +++++++++++++++++++++++++++++++-----
> 1 file changed, 31 insertions(+), 5 deletions(-)
>
> Index: work/drivers/vhost/vhost.c
> ===================================================================
> --- work.orig/drivers/vhost/vhost.c
> +++ work/drivers/vhost/vhost.c
> @@ -23,6 +23,7 @@
> #include <linux/highmem.h>
> #include <linux/slab.h>
> #include <linux/kthread.h>
> +#include <linux/cgroup.h>
>
> #include <linux/net.h>
> #include <linux/if_packet.h>
> @@ -176,12 +177,30 @@ repeat:
> long vhost_dev_init(struct vhost_dev *dev,
> struct vhost_virtqueue *vqs, int nvqs)
> {
> - struct task_struct *poller;
> - int i;
> + struct task_struct *poller = NULL;
> + cpumask_var_t mask;
> + int i, ret = -ENOMEM;
> +
> + if (!alloc_cpumask_var(&mask, GFP_KERNEL))
> + goto out;
>
> poller = kthread_create(vhost_poller, dev, "vhost-%d", current->pid);
> - if (IS_ERR(poller))
> - return PTR_ERR(poller);
> + if (IS_ERR(poller)) {
> + ret = PTR_ERR(poller);
> + goto out;
> + }
> +
> + ret = sched_getaffinity(current->pid, mask);
> + if (ret)
> + goto out;
> +
> + ret = sched_setaffinity(poller->pid, mask);
> + if (ret)
> + goto out;
> +
> + ret = cgroup_attach_task_current_cg(poller);
> + if (ret)
> + goto out;
>
> dev->vqs = vqs;
> dev->nvqs = nvqs;
> @@ -202,7 +221,14 @@ long vhost_dev_init(struct vhost_dev *de
> vhost_poll_init(&dev->vqs[i].poll,
> dev->vqs[i].handle_kick, POLLIN, dev);
> }
> - return 0;
> +
> + wake_up_process(poller); /* avoid contributing to loadavg */
> + ret = 0;
> +out:
> + if (ret)
> + kthread_stop(poller);
> + free_cpumask_var(mask);
> + return ret;
> }
>
> /* Caller should have device mutex */


2010-06-24 22:46:08

by Sridhar Samudrala

Subject: Re: [PATCH 3/3] vhost: apply cpumask and cgroup to vhost pollers

On Thu, 2010-06-24 at 11:11 +0300, Michael S. Tsirkin wrote:
> On Sun, May 30, 2010 at 10:25:01PM +0200, Tejun Heo wrote:
> > Apply the cpumask and cgroup of the initializing task to the created
> > vhost poller.
> >
> > Based on Sridhar Samudrala's patch.
> >
> > Cc: Michael S. Tsirkin <[email protected]>
> > Cc: Sridhar Samudrala <[email protected]>
>
>
> I wanted to apply this, but modpost fails:
> ERROR: "sched_setaffinity" [drivers/vhost/vhost_net.ko] undefined!
> ERROR: "sched_getaffinity" [drivers/vhost/vhost_net.ko] undefined!
>
> Did you try building as a module?

In my original implementation, I had these calls in workqueue.c.
Now that these are moved to vhost.c which can be built as a module,
these symbols need to be exported.
The following patch fixes the build issue with vhost as a module.

Signed-off-by: Sridhar Samudrala <[email protected]>

diff --git a/kernel/sched.c b/kernel/sched.c
index 3c2a54f..15a0c6f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4837,6 +4837,7 @@ out_put_task:
put_online_cpus();
return retval;
}
+EXPORT_SYMBOL_GPL(sched_setaffinity);

static int get_user_cpu_mask(unsigned long __user *user_mask_ptr,
unsigned len,
struct cpumask *new_mask)
@@ -4900,6 +4901,7 @@ out_unlock:

return retval;
}
+EXPORT_SYMBOL_GPL(sched_getaffinity);

/**
* sys_sched_getaffinity - get the cpu affinity of a process


> > ---
> > drivers/vhost/vhost.c | 36 +++++++++++++++++++++++++++++++-----
> > 1 file changed, 31 insertions(+), 5 deletions(-)
> >
> > Index: work/drivers/vhost/vhost.c
> > ===================================================================
> > --- work.orig/drivers/vhost/vhost.c
> > +++ work/drivers/vhost/vhost.c
> > @@ -23,6 +23,7 @@
> > #include <linux/highmem.h>
> > #include <linux/slab.h>
> > #include <linux/kthread.h>
> > +#include <linux/cgroup.h>
> >
> > #include <linux/net.h>
> > #include <linux/if_packet.h>
> > @@ -176,12 +177,30 @@ repeat:
> > long vhost_dev_init(struct vhost_dev *dev,
> > struct vhost_virtqueue *vqs, int nvqs)
> > {
> > - struct task_struct *poller;
> > - int i;
> > + struct task_struct *poller = NULL;
> > + cpumask_var_t mask;
> > + int i, ret = -ENOMEM;
> > +
> > + if (!alloc_cpumask_var(&mask, GFP_KERNEL))
> > + goto out;
> >
> > poller = kthread_create(vhost_poller, dev, "vhost-%d", current->pid);
> > - if (IS_ERR(poller))
> > - return PTR_ERR(poller);
> > + if (IS_ERR(poller)) {
> > + ret = PTR_ERR(poller);
> > + goto out;
> > + }
> > +
> > + ret = sched_getaffinity(current->pid, mask);
> > + if (ret)
> > + goto out;
> > +
> > + ret = sched_setaffinity(poller->pid, mask);
> > + if (ret)
> > + goto out;
> > +
> > + ret = cgroup_attach_task_current_cg(poller);
> > + if (ret)
> > + goto out;
> >
> > dev->vqs = vqs;
> > dev->nvqs = nvqs;
> > @@ -202,7 +221,14 @@ long vhost_dev_init(struct vhost_dev *de
> > vhost_poll_init(&dev->vqs[i].poll,
> > dev->vqs[i].handle_kick, POLLIN, dev);
> > }
> > - return 0;
> > +
> > + wake_up_process(poller); /* avoid contributing to loadavg */
> > + ret = 0;
> > +out:
> > + if (ret)
> > + kthread_stop(poller);
> > + free_cpumask_var(mask);
> > + return ret;
> > }
> >
> > /* Caller should have device mutex */

2010-06-25 10:15:54

by Michael S. Tsirkin

Subject: [PATCH] sched: export sched_set/getaffinity (was Re: [PATCH 3/3] vhost: apply cpumask and cgroup to vhost pollers)

On Thu, Jun 24, 2010 at 03:45:51PM -0700, Sridhar Samudrala wrote:
> On Thu, 2010-06-24 at 11:11 +0300, Michael S. Tsirkin wrote:
> > On Sun, May 30, 2010 at 10:25:01PM +0200, Tejun Heo wrote:
> > > Apply the cpumask and cgroup of the initializing task to the created
> > > vhost poller.
> > >
> > > Based on Sridhar Samudrala's patch.
> > >
> > > Cc: Michael S. Tsirkin <[email protected]>
> > > Cc: Sridhar Samudrala <[email protected]>
> >
> >
> > I wanted to apply this, but modpost fails:
> > ERROR: "sched_setaffinity" [drivers/vhost/vhost_net.ko] undefined!
> > ERROR: "sched_getaffinity" [drivers/vhost/vhost_net.ko] undefined!
> >
> > Did you try building as a module?
>
> In my original implementation, I had these calls in workqueue.c.
> Now that these are moved to vhost.c which can be built as a module,
> these symbols need to be exported.
> The following patch fixes the build issue with vhost as a module.
>
> Signed-off-by: Sridhar Samudrala <[email protected]>

Signed-off-by: Michael S. Tsirkin <[email protected]>

Works for me. To simplify dependencies, I'd like to queue this
together with the vhost patches through net-next.
Ack to this?

> diff --git a/kernel/sched.c b/kernel/sched.c
> index 3c2a54f..15a0c6f 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4837,6 +4837,7 @@ out_put_task:
> put_online_cpus();
> return retval;
> }
> +EXPORT_SYMBOL_GPL(sched_setaffinity);
>
> static int get_user_cpu_mask(unsigned long __user *user_mask_ptr,
> unsigned len,
> struct cpumask *new_mask)
> @@ -4900,6 +4901,7 @@ out_unlock:
>
> return retval;
> }
> +EXPORT_SYMBOL_GPL(sched_getaffinity);
>
> /**
> * sys_sched_getaffinity - get the cpu affinity of a process
>
>
> > > ---
> > > drivers/vhost/vhost.c | 36 +++++++++++++++++++++++++++++++-----
> > > 1 file changed, 31 insertions(+), 5 deletions(-)
> > >
> > > Index: work/drivers/vhost/vhost.c
> > > ===================================================================
> > > --- work.orig/drivers/vhost/vhost.c
> > > +++ work/drivers/vhost/vhost.c
> > > @@ -23,6 +23,7 @@
> > > #include <linux/highmem.h>
> > > #include <linux/slab.h>
> > > #include <linux/kthread.h>
> > > +#include <linux/cgroup.h>
> > >
> > > #include <linux/net.h>
> > > #include <linux/if_packet.h>
> > > @@ -176,12 +177,30 @@ repeat:
> > > long vhost_dev_init(struct vhost_dev *dev,
> > > struct vhost_virtqueue *vqs, int nvqs)
> > > {
> > > - struct task_struct *poller;
> > > - int i;
> > > + struct task_struct *poller = NULL;
> > > + cpumask_var_t mask;
> > > + int i, ret = -ENOMEM;
> > > +
> > > + if (!alloc_cpumask_var(&mask, GFP_KERNEL))
> > > + goto out;
> > >
> > > poller = kthread_create(vhost_poller, dev, "vhost-%d", current->pid);
> > > - if (IS_ERR(poller))
> > > - return PTR_ERR(poller);
> > > + if (IS_ERR(poller)) {
> > > + ret = PTR_ERR(poller);
> > > + goto out;
> > > + }
> > > +
> > > + ret = sched_getaffinity(current->pid, mask);
> > > + if (ret)
> > > + goto out;
> > > +
> > > + ret = sched_setaffinity(poller->pid, mask);
> > > + if (ret)
> > > + goto out;
> > > +
> > > + ret = cgroup_attach_task_current_cg(poller);
> > > + if (ret)
> > > + goto out;
> > >
> > > dev->vqs = vqs;
> > > dev->nvqs = nvqs;
> > > @@ -202,7 +221,14 @@ long vhost_dev_init(struct vhost_dev *de
> > > vhost_poll_init(&dev->vqs[i].poll,
> > > dev->vqs[i].handle_kick, POLLIN, dev);
> > > }
> > > - return 0;
> > > +
> > > + wake_up_process(poller); /* avoid contributing to loadavg */
> > > + ret = 0;
> > > +out:
> > > + if (ret)
> > > + kthread_stop(poller);
> > > + free_cpumask_var(mask);
> > > + return ret;
> > > }
> > >
> > > /* Caller should have device mutex */

2010-07-01 11:12:56

by Michael S. Tsirkin

Subject: [PATCH repost] sched: export sched_set/getaffinity to modules

Author: Sridhar Samudrala <[email protected]>

sched: export sched_set/getaffinity to modules

vhost-net driver wants to copy the affinity from the
owner thread to the thread it creates. Export
sched_set/getaffinity to modules to make this possible
when vhost is built as a module.

Signed-off-by: Sridhar Samudrala <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

---

I'm not sure the previous posting made it clear what exactly the
proposed change is, so I am reposting. Ingo, Peter, could you ack merging
the following through the net-next tree, please?

diff --git a/kernel/sched.c b/kernel/sched.c
index d484081..3759391 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4744,6 +4744,7 @@ out_put_task:
put_online_cpus();
return retval;
}
+EXPORT_SYMBOL_GPL(sched_setaffinity);

static int get_user_cpu_mask(unsigned long __user *user_mask_ptr, unsigned len,
struct cpumask *new_mask)
@@ -4807,6 +4808,7 @@ out_unlock:

return retval;
}
+EXPORT_SYMBOL_GPL(sched_getaffinity);

/**
* sys_sched_getaffinity - get the cpu affinity of a process

2010-07-01 11:20:13

by Peter Zijlstra

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Thu, 2010-07-01 at 14:07 +0300, Michael S. Tsirkin wrote:
> Author: Sridhar Samudrala <[email protected]>
>
> sched: export sched_set/getaffinity to modules
>
> vhost-net driver wants to copy the affinity from the
> owner thread to thread it creates. Export
> sched_set/get affinity to modules to make this possible
> when vhost is built as a module.
>
> Signed-off-by: Sridhar Samudrala <[email protected]>
> Signed-off-by: Michael S. Tsirkin <[email protected]>
>
> ---
>
> I'm not sure the previous posting made it clear what exactly the
> proposed change is, so I am reposting. Ingo, Peter, could you ack merging
> the following through the net-next tree, please?
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index d484081..3759391 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4744,6 +4744,7 @@ out_put_task:
> put_online_cpus();
> return retval;
> }
> +EXPORT_SYMBOL_GPL(sched_setaffinity);
>
> static int get_user_cpu_mask(unsigned long __user *user_mask_ptr, unsigned len,
> struct cpumask *new_mask)
> @@ -4807,6 +4808,7 @@ out_unlock:
>
> return retval;
> }
> +EXPORT_SYMBOL_GPL(sched_getaffinity);
>
> /**
> * sys_sched_getaffinity - get the cpu affinity of a process

Urgh,.. so why again is that a good idea?

2010-07-01 11:43:58

by Peter Zijlstra

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Thu, 2010-07-01 at 13:19 +0200, Peter Zijlstra wrote:
> On Thu, 2010-07-01 at 14:07 +0300, Michael S. Tsirkin wrote:
> > Author: Sridhar Samudrala <[email protected]>
> >
> > sched: export sched_set/getaffinity to modules
> >
> > vhost-net driver wants to copy the affinity from the
> > owner thread to thread it creates. Export
> > sched_set/get affinity to modules to make this possible
> > when vhost is built as a module.

> Urgh,.. so why again is that a good idea?

In particular:
- who sets the affinity of the task?
- why can't it set the kernel thread's affinity too?
- what happens if someone changes the tasks' affinity?

So no, I don't think this is a sensible thing to do at all.

2010-07-01 12:01:04

by Michael S. Tsirkin

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Thu, Jul 01, 2010 at 01:43:23PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-07-01 at 13:19 +0200, Peter Zijlstra wrote:
> > On Thu, 2010-07-01 at 14:07 +0300, Michael S. Tsirkin wrote:
> > > Author: Sridhar Samudrala <[email protected]>
> > >
> > > sched: export sched_set/getaffinity to modules
> > >
> > > vhost-net driver wants to copy the affinity from the
> > > owner thread to thread it creates. Export
> > > sched_set/get affinity to modules to make this possible
> > > when vhost is built as a module.
>
> > Urgh,.. so why again is that a good idea?
>
> In particular:
> - who sets the affinity of the task?

management tools do this when they start qemu.

> - why can't it set the kernel thread's affinity too?

It can. However: the threads are started internally by the driver
when qemu does an ioctl. What we want to do is give it a sensible
default affinity. management tool can later tweak it if it wants to.

> - what happens if someone changes the tasks' affinity?

We would normally create a cgroup including all internal
tasks, making it easy to find and change affinity for
them all if necessary.

> So no, I don't think this is a sensible thing to do at all.

2010-07-01 12:30:12

by Michael S. Tsirkin

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Thu, Jul 01, 2010 at 02:55:07PM +0300, Michael S. Tsirkin wrote:
> On Thu, Jul 01, 2010 at 01:43:23PM +0200, Peter Zijlstra wrote:
> > On Thu, 2010-07-01 at 13:19 +0200, Peter Zijlstra wrote:
> > > On Thu, 2010-07-01 at 14:07 +0300, Michael S. Tsirkin wrote:
> > > > Author: Sridhar Samudrala <[email protected]>
> > > >
> > > > sched: export sched_set/getaffinity to modules
> > > >
> > > > vhost-net driver wants to copy the affinity from the
> > > > owner thread to thread it creates. Export
> > > > sched_set/get affinity to modules to make this possible
> > > > when vhost is built as a module.
> >
> > > Urgh,.. so why again is that a good idea?
> >
> > In particular:
> > - who sets the affinity of the task?
>
> management tools do this when they start qemu.
>
> > - why can't it set the kernel thread's affinity too?
>
> It can. However: the threads are started internally by the driver
> when qemu does an ioctl. What we want to do is give it a sensible
> default affinity. management tool can later tweak it if it wants to.
>
> > - what happens if someone changes the tasks' affinity?
>
> We would normally create a cgroup including all internal
> tasks, making it easy to find and change affinity for
> them all if necessary.
>
> > So no, I don't think this is a sensible thing to do at all.

The patch using this is here:
http://www.mail-archive.com/[email protected]/msg35411.html

It simply copies the affinity from the parent when thread is created.

--
MST

2010-07-01 12:33:35

by Peter Zijlstra

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Thu, 2010-07-01 at 14:55 +0300, Michael S. Tsirkin wrote:

> > - why can't it set the kernel thread's affinity too?
>
> It can. However: the threads are started internally by the driver
> when qemu does an ioctl. What we want to do is give it a sensible
> default affinity. management tool can later tweak it if it wants to.

So have that ioctl return the tid of that new fancy thread and then set
its affinity, stuff it in cgroup, whatever you fancy.

> > - what happens if someone changes the tasks' affinity?
>
> We would normally create a cgroup including all internal
> tasks, making it easy to find and change affinity for
> them all if necessary.

And to stuff them in a cgroup you also need the tid, at which point it
might as well set the affinity from userspace, right?
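
For reference, a minimal userspace sketch of what Peter is proposing
(assumptions not in the thread: the ioctl is extended to return the poller's
tid, and some cgroup hierarchy is mounted with a per-guest directory whose
tasks file we can write to):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <sys/types.h>

	/* Management-tool side: copy our own affinity onto the vhost poller
	 * thread and attach it to the guest's cgroup, given its tid. */
	static int adopt_poller(pid_t tid, const char *cgroup_tasks_path)
	{
		cpu_set_t set;
		FILE *f;

		if (sched_getaffinity(0, sizeof(set), &set) ||
		    sched_setaffinity(tid, sizeof(set), &set))
			return -1;

		f = fopen(cgroup_tasks_path, "w");	/* e.g. ".../guest1/tasks" */
		if (!f)
			return -1;
		fprintf(f, "%d\n", (int)tid);
		return fclose(f);
	}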

2010-07-01 12:34:41

by Peter Zijlstra

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Thu, 2010-07-01 at 15:23 +0300, Michael S. Tsirkin wrote:
>
> The patch using this is here:
> http://www.mail-archive.com/[email protected]/msg35411.html
>
> It simply copies the affinity from the parent when thread is created.

Sounds like policy, not something the kernel should do..

2010-07-01 12:47:23

by Peter Zijlstra

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Thu, 2010-07-01 at 14:34 +0200, Peter Zijlstra wrote:
> On Thu, 2010-07-01 at 15:23 +0300, Michael S. Tsirkin wrote:
> >
> > The patch using this is here:
> > http://www.mail-archive.com/[email protected]/msg35411.html
> >
> > It simply copies the affinity from the parent when thread is created.
>
> Sounds like policy, not something the kernel should do..

The alternative would be using clone() instead of thread_create() and
inherit everything from the creating task. Inheriting from kthreadd and
then undoing some aspects just sounds like daft policy that really ought
to be in userspace.

2010-07-01 12:56:55

by Michael S. Tsirkin

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Thu, Jul 01, 2010 at 02:32:43PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-07-01 at 14:55 +0300, Michael S. Tsirkin wrote:
>
> > > - why can't it set the kernel thread's affinity too?
> >
> > It can. However: the threads are started internally by the driver
> > when qemu does an ioctl. What we want to do is give it a sensible
> > default affinity. management tool can later tweak it if it wants to.
>
> So have that ioctl return the tid of that new fancy thread and then set
> its affinity, stuff it in cgroup, whatever you fancy.
>
> > > - what happens if someone changes the tasks' affinity?
> >
> > We would normally create a cgroup including all internal
> > tasks, making it easy to find and change affinity for
> > them all if necessary.
>
> And to stuff them in a cgroup you also need the tid, at which point it
> might as well set the affinity from userspace, right?

We also put it in a cgroup transparently. I think that it's actually
important to do it on thread creation: if we don't, malicious userspace
can create a large amount of work exceeding the cgroup limits.

And the same applies to the affinity, right? If the qemu process
is limited to a set of CPUs, isn't it important to make
the kernel thread that does work on our behalf limited to the same
set of CPUs?

--
MST

2010-07-01 13:08:03

by Peter Zijlstra

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Thu, 2010-07-01 at 15:50 +0300, Michael S. Tsirkin wrote:
> On Thu, Jul 01, 2010 at 02:32:43PM +0200, Peter Zijlstra wrote:
> > On Thu, 2010-07-01 at 14:55 +0300, Michael S. Tsirkin wrote:
> >
> > > > - why can't it set the kernel thread's affinity too?
> > >
> > > It can. However: the threads are started internally by the driver
> > > when qemu does an ioctl. What we want to do is give it a sensible
> > > default affinity. management tool can later tweak it if it wants to.
> >
> > So have that ioctl return the tid of that new fancy thread and then set
> > its affinity, stuff it in cgroup, whatever you fancy.
> >
> > > > - what happens if someone changes the tasks' affinity?
> > >
> > > We would normally create a cgroup including all internal
> > > tasks, making it easy to find and change affinity for
> > > them all if necessary.
> >
> > And to stuff them in a cgroup you also need the tid, at which point it
> > might as well set the affinity from userspace, right?
>
> We also put it in a cgroup transparently. I think that it's actually
> important to do it on thread creation: if we don't, malicious userspace
> can create a large amount of work exceeding the cgroup limits.
>
> And the same applies to the affinity, right? If the qemu process
> is limited to a set of CPUs, isn't it important to make
> the kernel thread that does work on our behalf limited to the same
> set of CPUs?

I'm not sure we have anything like this, but I wouldn't think so; if a
driver creates a kthread and manages to inject tons of work, it's not
typically limited to whatever limits apply to the task that supplied the
work.

Take the encryption threads for example: those don't run in the context
of whoever provides the data to be encrypted (file, net, whatever), and
thus the task responsible could consume heaps more resources than it
would if it had to do the encryption itself.

That's how kthreads work.

2010-07-01 13:13:59

by Michael S. Tsirkin

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Thu, Jul 01, 2010 at 02:46:35PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-07-01 at 14:34 +0200, Peter Zijlstra wrote:
> > On Thu, 2010-07-01 at 15:23 +0300, Michael S. Tsirkin wrote:
> > >
> > > The patch using this is here:
> > > http://www.mail-archive.com/[email protected]/msg35411.html
> > >
> > > It simply copies the affinity from the parent when thread is created.
> >
> > Sounds like policy, not something the kernel should do..
>
> The alternative would be using clone() instead of thread_create() and
> inherit everything from the creating task.
> Inheriting from kthreadd and then undoing some aspects just sounds
> like daft policy that really ought to be in userspace.

Yes, that's basically what this patchset is trying to do:
create a workqueue inheriting everything from the creating task.
Sridhar started with an API to do exactly this:
http://linux.derkeiler.com/Mailing-Lists/Kernel/2010-05/msg07478.html

Then we switched to raw kthread to avoid stepping on cwq toes.
Maybe it makes sense to add kthread_clone (in addition to
kthread_create) that would do what you suggest?
If yes, any hints on an implementation?


--
MST

2010-07-01 13:28:41

by Michael S. Tsirkin

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Thu, Jul 01, 2010 at 03:07:26PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-07-01 at 15:50 +0300, Michael S. Tsirkin wrote:
> > On Thu, Jul 01, 2010 at 02:32:43PM +0200, Peter Zijlstra wrote:
> > > On Thu, 2010-07-01 at 14:55 +0300, Michael S. Tsirkin wrote:
> > >
> > > > > - why can't it set the kernel thread's affinity too?
> > > >
> > > > It can. However: the threads are started internally by the driver
> > > > when qemu does an ioctl. What we want to do is give it a sensible
> > > > default affinity. management tool can later tweak it if it wants to.
> > >
> > > So have that ioctl return the tid of that new fancy thread and then set
> > > its affinity, stuff it in cgroup, whatever you fancy.
> > >
> > > > > - what happens if someone changes the tasks' affinity?
> > > >
> > > > We would normally create a cgroup including all internal
> > > > tasks, making it easy to find and change affinity for
> > > > them all if necessary.
> > >
> > > And to stuff them in a cgroup you also need the tid, at which point it
> > > might as well set the affinity from userspace, right?
> >
> > We also put it in a cgroup transparently. I think that it's actually
> > important to do it on thread creation: if we don't, malicious userspace
> > can create a large amount of work exceeding the cgroup limits.
> >
> > And the same applies to the affinity, right? If the qemu process
> > is limited to a set of CPUs, isn't it important to make
> > the kernel thread that does work on our behalf limited to the same
> > set of CPUs?
>
> I'm not sure we have anything like this, but I wouldn't think so; if a
> driver creates a kthread and manages to inject tons of work, it's not
> typically limited to whatever limits apply to the task that supplied the
> work.
>
> Take the encryption threads for example: those don't run in the context
> of whoever provides the data to be encrypted (file, net, whatever), and
> thus the task responsible could consume heaps more resources than it
> would if it had to do the encryption itself.
>
> That's how kthreads work.

Right. And IMHO ideally all such work would run on the appropriate
CPU and be accounted for. It's just that with virt, people seem to
run untrusted applications and expect the damage to be contained.
So we came up with a simple approach that seems to do the
job for us.

--
MST

2010-07-01 13:30:54

by Peter Zijlstra

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Thu, 2010-07-01 at 16:08 +0300, Michael S. Tsirkin wrote:
> On Thu, Jul 01, 2010 at 02:46:35PM +0200, Peter Zijlstra wrote:
> > On Thu, 2010-07-01 at 14:34 +0200, Peter Zijlstra wrote:
> > > On Thu, 2010-07-01 at 15:23 +0300, Michael S. Tsirkin wrote:
> > > >
> > > > The patch using this is here:
> > > > http://www.mail-archive.com/[email protected]/msg35411.html
> > > >
> > > > It simply copies the affinity from the parent when thread is created.
> > >
> > > Sounds like policy, not something the kernel should do..
> >
> > The alternative would be using clone() instead of thread_create() and
> > inherit everything from the creating task.
> > Inheriting from kthreadd and then undoing some aspects just sounds
> > like daft policy that really ought to be in userspace.
>
> Yes, that's basically what this patchset is trying to do:
> create a workqueue inheriting everything from the creating task.
> Sridhar started with an API to do exactly this:
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2010-05/msg07478.html
>
> Then we switched to raw kthread to avoid stepping on cwq toes.
> Maybe it makes sense to add kthread_clone (in addition to
> kthread_create) that would do what you suggest?
> If yes, any hints on an implementation?

I think that's called kernel_thread() see
kernel/kthread.c:create_kthread().

Doing the whole kthreadd dance and then copying bits and pieces back
sounds very fragile, so yeah, something like that should work.

The other issue to consider is the thread group status of these things,
I think it would be best if these threads were still considered part of
the process that spawned them so that they would die nicely when the
process gets whacked.

At which point one could wonder if the kthread interface makes any
sense, why not let userspace fork tasks and let them call into the
kernel to perform work...

2010-07-01 13:45:41

by Michael S. Tsirkin

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Thu, Jul 01, 2010 at 03:30:24PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-07-01 at 16:08 +0300, Michael S. Tsirkin wrote:
> > On Thu, Jul 01, 2010 at 02:46:35PM +0200, Peter Zijlstra wrote:
> > > On Thu, 2010-07-01 at 14:34 +0200, Peter Zijlstra wrote:
> > > > On Thu, 2010-07-01 at 15:23 +0300, Michael S. Tsirkin wrote:
> > > > >
> > > > > The patch using this is here:
> > > > > http://www.mail-archive.com/[email protected]/msg35411.html
> > > > >
> > > > > It simply copies the affinity from the parent when thread is created.
> > > >
> > > > Sounds like policy, not something the kernel should do..
> > >
> > > The alternative would be using clone() instead of thread_create() and
> > > inherit everything from the creating task.
> > > Inheriting from kthreadd and then undoing some aspects just sounds
> > > like daft policy that really ought to be in userspace.
> >
> > Yes, that's basically what this patchset is trying to do:
> > create a workqueue inheriting everything from the creating task.
> > Sridhar started with an API to do exactly this:
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2010-05/msg07478.html
> >
> > Then we switched to raw kthread to avoid stepping on cwq toes.
> > Maybe it makes sense to add kthread_clone (in addition to
> > kthread_create) that would do what you suggest?
> > If yes, any hints on an implementation?
>
> I think that's called kernel_thread() see
> kernel/kthread.c:create_kthread().
>
> Doing the whole kthreadd dance and then copying bits and pieces back
> sounds very fragile, so yeah, something like that should work.


Thanks!
Sridhar, Tejun, do you have time to look into this approach?

> The other issue to consider is the thread group status of these things,
> I think it would be best if these threads were still considered part of
> the process that spawned them so that they would die nicely when the
> process gets whacked.

The proposed patch kills the thread when the fd is closed,
so I think this already works without making it part of the process.

> At which point one could wonder if the kthread interface makes any
> sense, why not let userspace fork tasks and let them call into the
> kernel to perform work...

One thing I wanted to avoid is letting userspace know
just how many threads there are. We are using a single one
now, but we used to have threads per-cpu, and we might
switch to a thread per virtqueue in the future.
IMO all this should ideally be transparent to userspace.

--
MST

2010-07-01 13:59:00

by Peter Zijlstra

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Thu, 2010-07-01 at 16:39 +0300, Michael S. Tsirkin wrote:
>
> The proposed patch kills the thread when the fd is closed,
> so I think this already works without making it part of the process.
>
OK, fd bound resources are fine.

2010-07-01 14:29:10

by Tejun Heo

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

Hello,

On 07/01/2010 03:39 PM, Michael S. Tsirkin wrote:
>> I think that's called kernel_thread() see
>> kernel/kthread.c:create_kthread().
>>
>> Doing the whole kthreadd dance and then copying bits and pieces back
>> sounds very fragile, so yeah, something like that should work.
>
> Thanks!
> Sridhar, Tejun, do you have time to look into this approach?

All that's necessary is shortcutting the indirection through kthreadd,
i.e. an exported function which looks like the following:

struct task_struct *kthread_clone_or_whatever(int (*threadfn).....)
{
	struct kthread_create_info create;
	int pid;

	INIT create;

	pid = kernel_thread(kthread, &create, CLONE_FS...);
	if (pid < 0)
		return ERROR;
	wait_for_completion(&create.done);

	if (!IS_ERR(create.result))
		SET NAME;
	return create.result;
}

It might be a good idea to make the function take extra clone flags,
but anyway, once created, the cloned task can be treated the same way as
other kthreads, so nothing else needs to be changed.

Thanks.

--
tejun

2010-07-01 14:36:36

by Oleg Nesterov

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On 07/01, Peter Zijlstra wrote:
>
> On Thu, 2010-07-01 at 16:08 +0300, Michael S. Tsirkin wrote:
> > Maybe it makes sense to add kthread_clone (in addition to
> > kthread_create) that would do what you suggest?
> > If yes, any hints on an implementation?
>
> I think that's called kernel_thread() see
> kernel/kthread.c:create_kthread().

Well, strictly speaking kernel_thread() doesn't create the kernel thread.
Unless the caller is the kernel thread. And daemonize() is deprecated.
kernel_thread() just forks the CLONE_VM + flags child.

Oleg.

2010-07-01 14:51:23

by Oleg Nesterov

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On 07/01, Tejun Heo wrote:
>
> All that's necessary is shortcutting indirection through kthreadd.
> ie. An exported function which looks like the following,
>
> struct kthread_clone_or_whatever(int (*threadfn).....)
> {
> struct kthread_create_info create;
> int pid;
>
> INIT create;
>
> pid = kernel_thread(kthread, &create, CLONE_FS...);
> if (pid < 0)
> return ERROR;
> wait_for_completion(&create.done);
>
> if (!IS_ERR(create.result))
> SET NAME;
> return create.result;
> }
>
> It might be a good idea to make the function take extra clone flags
> but anyways once created cloned task can be treated the same way as
> other kthreads, so nothing else needs to be changed.

This makes kthread_stop() work. Otherwise the new thread is just
the CLONE_VM child of the caller, and the caller is the user-mode
task doing ioctl() ?

Oleg.

2010-07-01 14:55:20

by Tejun Heo

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

Hello,

On 07/01/2010 04:46 PM, Oleg Nesterov wrote:
>> It might be a good idea to make the function take extra clone flags
>> but anyways once created cloned task can be treated the same way as
>> other kthreads, so nothing else needs to be changed.
>
> This makes kthread_stop() work. Otherwise the new thread is just
> the CLONE_VM child of the caller, and the caller is the user-mode
> task doing ioctl() ?

Hmmm, indeed. It makes the attribute inheritance work but circumvents
the whole reason there is kthreadd.

Thanks.

--
tejun

2010-07-01 14:55:58

by Peter Zijlstra

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Thu, 2010-07-01 at 16:53 +0200, Tejun Heo wrote:
> Hello,
>
> On 07/01/2010 04:46 PM, Oleg Nesterov wrote:
> >> It might be a good idea to make the function take extra clone flags
> >> but anyways once created cloned task can be treated the same way as
> >> other kthreads, so nothing else needs to be changed.
> >
> > This makes kthread_stop() work. Otherwise the new thread is just
> > the CLONE_VM child of the caller, and the caller is the user-mode
> > task doing ioctl() ?
>
> Hmmm, indeed. It makes the attribute inheritance work but circumvents
> the whole reason there is kthreadd.

I thought the whole reason there was kthreadd was to avoid the
inheritance? So avoiding the avoiding of inheritance seems like the goal
here, no?

2010-07-02 18:02:20

by Sridhar Samudrala

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On 7/1/2010 7:55 AM, Peter Zijlstra wrote:
> On Thu, 2010-07-01 at 16:53 +0200, Tejun Heo wrote:
>
>> Hello,
>>
>> On 07/01/2010 04:46 PM, Oleg Nesterov wrote:
>>
>>>> It might be a good idea to make the function take extra clone flags
>>>> but anyways once created cloned task can be treated the same way as
>>>> other kthreads, so nothing else needs to be changed.
>>>>
>>> This makes kthread_stop() work. Otherwise the new thread is just
>>> the CLONE_VM child of the caller, and the caller is the user-mode
>>> task doing ioctl() ?
>>>
>> Hmmm, indeed. It makes the attribute inheritance work but circumvents
>> the whole reason there is kthreadd.
>>
> I thought the whole reason there was kthreadd was to avoid the
> inheritance? So avoiding the avoiding of inheritance seems like the goal
> here, no?
>
I think so. Does it (Tejun's kthread_clone() patch) also inherit the
cgroup of the caller? Or do we still need the explicit
call to attach the thread to the current task's cgroup?

I am on vacation next week and cannot look into this until Jul 12. Hope
this will be resolved by then.
If not, I will look into it after I am back.

Thanks
Sridhar

2010-07-02 18:11:45

by Peter Zijlstra

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Fri, 2010-07-02 at 11:01 -0700, Sridhar Samudrala wrote:
>
> Does it (Tejun's kthread_clone() patch) also inherit the
> cgroup of the caller?

Of course, it's a simple do_fork() which inherits everything just as you
would expect from a similar sys_clone()/sys_fork() call.

2010-07-02 21:09:07

by Oleg Nesterov

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On 07/02, Peter Zijlstra wrote:
>
> On Fri, 2010-07-02 at 11:01 -0700, Sridhar Samudrala wrote:
> >
> > Does it (Tejun's kthread_clone() patch) also inherit the
> > cgroup of the caller?
>
> Of course, its a simple do_fork() which inherits everything just as you
> would expect from a similar sys_clone()/sys_fork() call.

Yes. And I'm afraid it can inherit more than we want. IIUC, this is called
from ioctl(), right?

Then the new thread becomes the natural child of the caller, and it shares
->mm with the parent. And files, dup_fd() without CLONE_FS.

Signals. Say, if you send SIGKILL to this new thread, it can't sleep in
TASK_INTERRUPTIBLE or KILLABLE after that. And this SIGKILL can be sent
just because the parent gets SIGQUIT or another coredumpable signal.
Or the new thread can receive SIGSTOP via ^Z.

Perhaps this is OK, I do not know. Just to remind that kernel_thread()
is merely clone(CLONE_VM).

Oleg.

2010-07-04 09:08:10

by Michael S. Tsirkin

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Fri, Jul 02, 2010 at 11:06:37PM +0200, Oleg Nesterov wrote:
> On 07/02, Peter Zijlstra wrote:
> >
> > On Fri, 2010-07-02 at 11:01 -0700, Sridhar Samudrala wrote:
> > >
> > > Does it (Tejun's kthread_clone() patch) also inherit the
> > > cgroup of the caller?
> >
> > Of course, its a simple do_fork() which inherits everything just as you
> > would expect from a similar sys_clone()/sys_fork() call.
>
> Yes. And I'm afraid it can inherit more than we want. IIUC, this is called
> from ioctl(), right?
>
> Then the new thread becomes the natural child of the caller, and it shares
> ->mm with the parent. And files, dup_fd() without CLONE_FS.
>
> Signals. Say, if you send SIGKILL to this new thread, it can't sleep in
> TASK_INTERRUPTIBLE or KILLABLE after that. And this SIGKILL can be sent
> just because the parent gets SIGQUIT or another coredumpable signal.
> Or the new thread can receive SIGSTOP via ^Z.
>
> Perhaps this is OK, I do not know. Just to remind that kernel_thread()
> is merely clone(CLONE_VM).
>
> Oleg.


Right. Doing this might break things like flush. The signal and exit
behaviour needs to be examined carefully. I am also unsure whether
using such threads might be more expensive than inheriting kthreadd.

--
MST

2010-07-13 06:59:23

by Sridhar Samudrala

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On 7/4/2010 2:00 AM, Michael S. Tsirkin wrote:
> On Fri, Jul 02, 2010 at 11:06:37PM +0200, Oleg Nesterov wrote:
>
>> On 07/02, Peter Zijlstra wrote:
>>
>>> On Fri, 2010-07-02 at 11:01 -0700, Sridhar Samudrala wrote:
>>>
>>>> Does it (Tejun's kthread_clone() patch) also inherit the
>>>> cgroup of the caller?
>>>>
>>> Of course, its a simple do_fork() which inherits everything just as you
>>> would expect from a similar sys_clone()/sys_fork() call.
>>>
>> Yes. And I'm afraid it can inherit more than we want. IIUC, this is called
>> from ioctl(), right?
>>
>> Then the new thread becomes the natural child of the caller, and it shares
>> ->mm with the parent. And files, dup_fd() without CLONE_FS.
>>
>> Signals. Say, if you send SIGKILL to this new thread, it can't sleep in
>> TASK_INTERRUPTIBLE or KILLABLE after that. And this SIGKILL can be sent
>> just because the parent gets SIGQUIT or another coredumpable signal.
>> Or the new thread can receive SIGSTOP via ^Z.
>>
>> Perhaps this is OK, I do not know. Just to remind that kernel_thread()
>> is merely clone(CLONE_VM).
>>
>> Oleg.
>>
>
> Right. Doing this might break things like flush. The signal and exit
> behaviour needs to be examined carefully. I am also unsure whether
> using such threads might be more expensive than inheriting kthreadd.
>
>
Should we just leave it to userspace to set the cgroup/cpumask after
qemu starts the guest and the vhost threads?

Thanks
Sridhar

2010-07-13 11:15:55

by Michael S. Tsirkin

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Mon, Jul 12, 2010 at 11:59:08PM -0700, Sridhar Samudrala wrote:
> On 7/4/2010 2:00 AM, Michael S. Tsirkin wrote:
> >On Fri, Jul 02, 2010 at 11:06:37PM +0200, Oleg Nesterov wrote:
> >>On 07/02, Peter Zijlstra wrote:
> >>>On Fri, 2010-07-02 at 11:01 -0700, Sridhar Samudrala wrote:
> >>>> Does it (Tejun's kthread_clone() patch) also inherit the
> >>>>cgroup of the caller?
> >>>Of course, its a simple do_fork() which inherits everything just as you
> >>>would expect from a similar sys_clone()/sys_fork() call.
> >>Yes. And I'm afraid it can inherit more than we want. IIUC, this is called
> >>from ioctl(), right?
> >>
> >>Then the new thread becomes the natural child of the caller, and it shares
> >>->mm with the parent. And files, dup_fd() without CLONE_FS.
> >>
> >>Signals. Say, if you send SIGKILL to this new thread, it can't sleep in
> >>TASK_INTERRUPTIBLE or KILLABLE after that. And this SIGKILL can be sent
> >>just because the parent gets SIGQUIT or another coredumpable signal.
> >>Or the new thread can receive SIGSTOP via ^Z.
> >>
> >>Perhaps this is OK, I do not know. Just to remind that kernel_thread()
> >>is merely clone(CLONE_VM).
> >>
> >>Oleg.
> >
> >Right. Doing this might break things like flush. The signal and exit
> >behaviour needs to be examined carefully. I am also unsure whether
> >using such threads might be more expensive than inheriting kthreadd.
> >
> Should we just leave it to the userspace to set the cgroup/cpumask
> after qemu starts the guest and
> the vhost threads?
>
> Thanks
> Sridhar

Yes, but we can't trust userspace to do this. It's important
to do it on thread creation: if we don't, malicious userspace
can create a large amount of work exceeding the cgroup limits.

And the same applies to the affinity: if the qemu process
is limited to a set of CPUs, it's important to make
the kernel thread that does work on our behalf limited to the same
set of CPUs.

This is not unique to vhost, it's just that virt scenarios are affected
by this more: people seem to run untrusted applications and expect the
damage to be contained.

--
MST

2010-07-14 23:26:45

by Sridhar Samudrala

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Tue, 2010-07-13 at 14:09 +0300, Michael S. Tsirkin wrote:
> On Mon, Jul 12, 2010 at 11:59:08PM -0700, Sridhar Samudrala wrote:
> > On 7/4/2010 2:00 AM, Michael S. Tsirkin wrote:
> > >On Fri, Jul 02, 2010 at 11:06:37PM +0200, Oleg Nesterov wrote:
> > >>On 07/02, Peter Zijlstra wrote:
> > >>>On Fri, 2010-07-02 at 11:01 -0700, Sridhar Samudrala wrote:
> > >>>> Does it (Tejun's kthread_clone() patch) also inherit the
> > >>>>cgroup of the caller?
> > >>>Of course, its a simple do_fork() which inherits everything just as you
> > >>>would expect from a similar sys_clone()/sys_fork() call.
> > >>Yes. And I'm afraid it can inherit more than we want. IIUC, this is called
> > >>from ioctl(), right?
> > >>
> > >>Then the new thread becomes the natural child of the caller, and it shares
> > >>->mm with the parent. And files, dup_fd() without CLONE_FS.
> > >>
> > >>Signals. Say, if you send SIGKILL to this new thread, it can't sleep in
> > >>TASK_INTERRUPTIBLE or KILLABLE after that. And this SIGKILL can be sent
> > >>just because the parent gets SIGQUIT or another coredumpable signal.
> > >>Or the new thread can receive SIGSTOP via ^Z.
> > >>
> > >>Perhaps this is OK, I do not know. Just to remind that kernel_thread()
> > >>is merely clone(CLONE_VM).
> > >>
> > >>Oleg.
> > >
> > >Right. Doing this might break things like flush. The signal and exit
> > >behaviour needs to be examined carefully. I am also unsure whether
> > >using such threads might be more expensive than inheriting kthreadd.
> > >
> > Should we just leave it to the userspace to set the cgroup/cpumask
> > after qemu starts the guest and
> > the vhost threads?
> >
> > Thanks
> > Sridhar
>
> Yes, but we can't trust userspace to do this. It's important
> to do it on thread creation: if we don't, malicious userspace
> can create a large amount of work exceeding the cgroup limits.
>
> And the same applies to the affinity: if the qemu process
> is limited to a set of CPUs, it's important to make
> the kernel thread that does work on our behalf limited to the same
> set of CPUs.
>
> This is not unique to vhost, it's just that virt scenarios are affected
> by this more: people seem to run untrusted applications and expect the
> damage to be contained.

OK. So we want to create a thread that is a child of kthreadd, but inherits the cgroup/cpumask
from the caller. How about an exported kthread function kthread_create_in_current_cg()
that does this?

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index aabc8a1..e0616f0 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -9,6 +9,9 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
const char namefmt[], ...)
__attribute__((format(printf, 3, 4)));

+struct task_struct *kthread_create_in_current_cg(int (*threadfn)(void *data),
+ void *data, char *name);
+
/**
* kthread_run - create and wake a thread.
* @threadfn: the function to run until signal_pending(current).
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 83911c7..ea4e737 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -15,6 +15,7 @@
#include <linux/module.h>
#include <linux/mutex.h>
#include <trace/events/sched.h>
+#include <linux/cgroup.h>

static DEFINE_SPINLOCK(kthread_create_lock);
static LIST_HEAD(kthread_create_list);
@@ -149,6 +150,42 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
}
EXPORT_SYMBOL(kthread_create);

+struct task_struct *kthread_create_in_current_cg(int (*threadfn)(void *data),
+						 void *data, char *name)
+{
+	struct task_struct *worker;
+	cpumask_var_t mask;
+	int ret = -ENOMEM;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		goto out_free_mask;
+
+	worker = kthread_create(threadfn, data, "%s-%d", name, current->pid);
+	if (IS_ERR(worker))
+		goto out_free_mask;
+
+	ret = sched_getaffinity(current->pid, mask);
+	if (ret)
+		goto out_stop_worker;
+
+	ret = sched_setaffinity(worker->pid, mask);
+	if (ret)
+		goto out_stop_worker;
+
+	ret = cgroup_attach_task_current_cg(worker);
+	if (ret)
+		goto out_stop_worker;
+
+	return worker;
+
+out_stop_worker:
+	kthread_stop(worker);
+out_free_mask:
+	free_cpumask_var(mask);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(kthread_create_in_current_cg);
+
/**
* kthread_bind - bind a just-created kthread to a cpu.
* @p: thread created by kthread_create().


Thanks
Sridhar
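
A sketch (not part of the thread) of how the vhost patch quoted at the top
could use this helper in vhost_dev_init() instead of open-coding the
affinity/cgroup copy; vhost_poller and dev are the names from that patch:

	poller = kthread_create_in_current_cg(vhost_poller, dev, "vhost");
	if (IS_ERR(poller))
		return PTR_ERR(poller);
	/* ... virtqueue setup as in the original patch ... */
	wake_up_process(poller);	/* as before, woken only after setup succeeds */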

2010-07-15 00:08:27

by Oleg Nesterov

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On 07/14, Sridhar Samudrala wrote:
>
> OK. So we want to create a thread that is a child of kthreadd, but inherits the cgroup/cpumask
> from the caller. How about an exported kthread function kthread_create_in_current_cg()
> that does this?

Well. I must admit, this looks a bit strange to me ;)

Instead of exporting sched_xxxaffinity() we export the new function
which calls them. And I don't think this new helper is very useful
in general. Maybe I am wrong...

Oleg.

2010-07-15 05:29:42

by Sridhar Samudrala

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On 7/14/2010 5:05 PM, Oleg Nesterov wrote:
> On 07/14, Sridhar Samudrala wrote:
>
>> OK. So we want to create a thread that is a child of kthreadd, but inherits the cgroup/cpumask
>> from the caller. How about an exported kthread function kthread_create_in_current_cg()
>> that does this?
>>
> Well. I must admit, this looks a bit strange to me ;)
>
> Instead of exporting sched_xxxaffinity() we export the new function
> which calls them. And I don't think this new helper is very useful
> in general. Maybe I am wrong...
>
If we agree on exporting the sched_xxxaffinity() functions, we don't need
this new kthread function and we can do the same in vhost as the original
patch did.

Thanks
Sridhar

2010-07-26 17:18:25

by Michael S. Tsirkin

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Fri, Jul 02, 2010 at 11:06:37PM +0200, Oleg Nesterov wrote:
> On 07/02, Peter Zijlstra wrote:
> >
> > On Fri, 2010-07-02 at 11:01 -0700, Sridhar Samudrala wrote:
> > >
> > > Does it (Tejun's kthread_clone() patch) also inherit the
> > > cgroup of the caller?
> >
> > Of course, its a simple do_fork() which inherits everything just as you
> > would expect from a similar sys_clone()/sys_fork() call.
>
> Yes. And I'm afraid it can inherit more than we want. IIUC, this is called
> from ioctl(), right?
>
> Then the new thread becomes the natural child of the caller, and it shares
> ->mm with the parent. And files, dup_fd() without CLONE_FS.
>
> Signals. Say, if you send SIGKILL to this new thread, it can't sleep in
> TASK_INTERRUPTIBLE or KILLABLE after that. And this SIGKILL can be sent
> just because the parent gets SIGQUIT or abother coredumpable signal.
> Or the new thread can recieve SIGSTOP via ^Z.
>
> Perhaps this is OK, I do not know. Just to remind that kernel_thread()
> is merely clone(CLONE_VM).
>
> Oleg.

With some machinery to stop it later, yes.
Oleg, how does the below look to you?

Here I explicitly drop the fds so we don't share them.
CLONE_VM takes care of sharing the mm I think.
About signals - for the vhost-net use case this is OK, as we use
uninterruptible sleep anyway (like the new kthread_worker does).

This code seems to work fine for me so far - any comments?

---

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index aabc8a1..72c7b17 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -9,6 +9,11 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
const char namefmt[], ...)
__attribute__((format(printf, 3, 4)));

+struct task_struct *kthread_create_inherit(int (*threadfn)(void *data),
+ void *data,
+ const char namefmt[], ...)
+ __attribute__((format(printf, 3, 4)));
+
/**
* kthread_run - create and wake a thread.
* @threadfn: the function to run until signal_pending(current).
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 83911c7..b81588c 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -149,6 +149,38 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
}
EXPORT_SYMBOL(kthread_create);

+/* Same as kthread_create, but inherit attributes (cgroups, priority, CPU mask)
+ * from current. */
+struct task_struct *kthread_create_inherit(int (*threadfn)(void *data),
+					   void *data,
+					   const char namefmt[],
+					   ...)
+{
+	struct kthread_create_info create;
+
+	create.threadfn = threadfn;
+	create.data = data;
+	init_completion(&create.done);
+
+	create_kthread(&create);
+	wait_for_completion(&create.done);
+
+	if (!IS_ERR(create.result)) {
+		va_list args;
+
+		/* Don't share files with parent as drivers use release for
+		 * close on exit, etc. */
+		exit_files(create.result);
+
+		va_start(args, namefmt);
+		vsnprintf(create.result->comm, sizeof(create.result->comm),
+			  namefmt, args);
+		va_end(args);
+	}
+	return create.result;
+}
+EXPORT_SYMBOL(kthread_create_inherit);
+
/**
* kthread_bind - bind a just-created kthread to a cpu.
* @p: thread created by kthread_create().
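
A usage sketch, not part of the thread, tying this back to the "machinery to
stop it later" mentioned above; it assumes the vhost_poller/dev names from the
vhost patch earlier in the thread and a poller loop that returns once
kthread_should_stop() is true:

	poller = kthread_create_inherit(vhost_poller, dev, "vhost-%d", current->pid);
	if (IS_ERR(poller))
		return PTR_ERR(poller);
	wake_up_process(poller);

	/* later, when the device fd is released: */
	kthread_stop(poller);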

2010-07-26 17:51:49

by Sridhar Samudrala

Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Mon, 2010-07-26 at 20:12 +0300, Michael S. Tsirkin wrote:
> On Fri, Jul 02, 2010 at 11:06:37PM +0200, Oleg Nesterov wrote:
> > On 07/02, Peter Zijlstra wrote:
> > >
> > > On Fri, 2010-07-02 at 11:01 -0700, Sridhar Samudrala wrote:
> > > >
> > > > Does it (Tejun's kthread_clone() patch) also inherit the
> > > > cgroup of the caller?
> > >
> > > Of course, its a simple do_fork() which inherits everything just as you
> > > would expect from a similar sys_clone()/sys_fork() call.
> >
> > Yes. And I'm afraid it can inherit more than we want. IIUC, this is called
> > from ioctl(), right?
> >
> > Then the new thread becomes the natural child of the caller, and it shares
> > ->mm with the parent. And files, dup_fd() without CLONE_FS.
> >
> > Signals. Say, if you send SIGKILL to this new thread, it can't sleep in
> > TASK_INTERRUPTIBLE or KILLABLE after that. And this SIGKILL can be sent
> > just because the parent gets SIGQUIT or abother coredumpable signal.
> > Or the new thread can recieve SIGSTOP via ^Z.
> >
> > Perhaps this is OK, I do not know. Just to remind that kernel_thread()
> > is merely clone(CLONE_VM).
> >
> > Oleg.
>
> With some machinery to stop it later, yes.
> Oleg, how does the below look to you?
>
> Here I explicitly drop the fds so we don't share them.
> CLONE_VM takes care of sharing the mm I think.
> About signals - for the vhost-net use this is OK as we use
> uninterruptible sleep anyway (like the new kthread_worker does).
>
> This code seems to work fine for me so far - any comments?
>
> ---
>
> diff --git a/include/linux/kthread.h b/include/linux/kthread.h
> index aabc8a1..72c7b17 100644
> --- a/include/linux/kthread.h
> +++ b/include/linux/kthread.h
> @@ -9,6 +9,11 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
> const char namefmt[], ...)
> __attribute__((format(printf, 3, 4)));
>
> +struct task_struct *kthread_create_inherit(int (*threadfn)(void *data),
> + void *data,
> + const char namefmt[], ...)
> + __attribute__((format(printf, 3, 4)));
> +
> /**
> * kthread_run - create and wake a thread.
> * @threadfn: the function to run until signal_pending(current).
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 83911c7..b81588c 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -149,6 +149,38 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
> }
> EXPORT_SYMBOL(kthread_create);
>
> +/* Same as kthread_create, but inherit attributes (cgroups, priority, CPU mask)
> + * from current. */
> +struct task_struct *kthread_create_inherit(int (*threadfn)(void *data),
> + void *data,
> + const char namefmt[],
> + ...)
> +{
> + struct kthread_create_info create;
> +
> + create.threadfn = threadfn;
> + create.data = data;
> + init_completion(&create.done);
> +
> + create_kthread(&create);
> + wait_for_completion(&create.done);
> +
> + if (!IS_ERR(create.result)) {
> + va_list args;
> +
> + /* Don't share files with parent as drivers use release for
> + * close on exit, etc. */
> + exit_files(create.result);
> +
> + va_start(args, namefmt);
> + vsnprintf(create.result->comm, sizeof(create.result->comm),
> + namefmt, args);
> + va_end(args);
> + }
> + return create.result;
> +}
> +EXPORT_SYMBOL(kthread_create_inherit);
> +
> /**
> * kthread_bind - bind a just-created kthread to a cpu.
> * @p: thread created by kthread_create().


I have been testing out a similar patch that uses kernel_thread() without the
CLONE_FILES flag, rather than create_kthread() and then closing the files.
Either version should be fine.

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index aabc8a1..634eaf7 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -9,6 +9,11 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
const char namefmt[], ...)
__attribute__((format(printf, 3, 4)));

+struct task_struct *kthread_clone(int (*threadfn)(void *data),
+ void *data,
+ const char namefmt[], ...)
+ __attribute__((format(printf, 3, 4)));
+
/**
* kthread_run - create and wake a thread.
* @threadfn: the function to run until signal_pending(current).
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 83911c7..806dae5 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -149,6 +149,38 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
}
EXPORT_SYMBOL(kthread_create);

+struct task_struct *kthread_clone(int (*threadfn)(void *data),
+				  void *data,
+				  const char namefmt[],
+				  ...)
+{
+	struct kthread_create_info create;
+	int pid;
+
+	create.threadfn = threadfn;
+	create.data = data;
+	init_completion(&create.done);
+	INIT_LIST_HEAD(&create.list);
+
+	pid = kernel_thread(kthread, &create, CLONE_FS);
+	if (pid < 0) {
+		create.result = ERR_PTR(pid);
+		complete(&create.done);
+	}
+	wait_for_completion(&create.done);
+
+	if (!IS_ERR(create.result)) {
+		va_list args;
+		va_start(args, namefmt);
+		vsnprintf(create.result->comm, sizeof(create.result->comm),
+			  namefmt, args);
+		va_end(args);
+	}
+
+	return create.result;
+}
+EXPORT_SYMBOL(kthread_clone);
+
/**
* kthread_bind - bind a just-created kthread to a cpu.
* @p: thread created by kthread_create().




2010-07-26 18:11:50

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On 07/26, Sridhar Samudrala wrote:
>
> I have been testing out a similar patch that uses kernel_thread() without the CLONE_FILES
> flag, rather than create_kthread() followed by closing the files.

!CLONE_FILES can't help. copy_files() does dup_fd() in this case.
The child still inherits the files.

> Either version should be fine.

I think neither version is fine ;)

exit_files() is not enough either. How about the signals, reparenting?


I already forgot all details, probably I missed something. But it
seems to me that it is better to just export get/set affinity and
forget about all complications.

Oleg.
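
For reference, the copy_files() path in kernel/fork.c at the time looked roughly as below (an abridged, approximate sketch, not a verbatim quote). It shows why dropping CLONE_FILES only gives the child a private copy of the fd table rather than no files at all:

	/* kernel/fork.c, copy_files() - abridged, approximate sketch of the
	 * 2.6.3x-era code; error paths and locking simplified */
	static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
	{
		struct files_struct *oldf, *newf;
		int error = 0;

		oldf = current->files;
		if (!oldf)
			goto out;

		if (clone_flags & CLONE_FILES) {
			/* CLONE_FILES: parent and child share one table */
			atomic_inc(&oldf->count);
			goto out;
		}

		/* No CLONE_FILES: the child gets its own *copy* of the table,
		 * so every file the parent has open is still inherited. */
		newf = dup_fd(oldf, &error);
		if (!newf)
			goto out;

		tsk->files = newf;
		error = 0;
	out:
		return error;
	}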

2010-07-26 20:02:05

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Mon, Jul 26, 2010 at 08:08:34PM +0200, Oleg Nesterov wrote:
> On 07/26, Sridhar Samudrala wrote:
> >
> > I have been testing out a similar patch that uses kernel_thread() without the CLONE_FILES
> > flag, rather than create_kthread() followed by closing the files.
>
> !CLONE_FILES can't help. copy_files() does dup_fd() in this case.
> The child still inherits the files.
>
> > Either version should be fine.
>
> I think neither version is fine ;)
>
> exit_files() is not enough either. How about the signals,

As I said, signals are unimportant, since we are using this
thread as the base for a worker - it sleeps uninterruptibly.

> reparenting?

That's actually a feature: it lets us find out which process
owns the device that uses the thread, by looking at the parent.

>
> I already forgot all details, probably I missed something. But it
> seems to me that it is better to just export get/set affinity and
> forget about all complications.
>
> Oleg.

2010-07-26 20:33:48

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Mon, Jul 26, 2010 at 08:08:34PM +0200, Oleg Nesterov wrote:
> On 07/26, Sridhar Samudrala wrote:
> >
> > I have been testing out a similar patch that uses kernel_thread() without the CLONE_FILES
> > flag, rather than create_kthread() followed by closing the files.
>
> !CLONE_FILES can't help. copy_files() does dup_fd() in this case.
> The child still inherits the files.
>
> > Either version should be fine.
>
> I think neither version is fine ;)
>
> exit_files() is not enough either. How about the signals, reparenting?
>
>
> I already forgot all details, probably I missed something. But it
> seems to me that it is better to just export get/set affinity and
> forget about all complications.
>
> Oleg.

Almost forgot, we need the same for priority.

--
MST
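
For what it's worth, the "same for priority" part would presumably look something like the sketch below on the vhost side, next to the affinity calls. This is a hypothetical sketch only: it assumes the poller task_struct is named poller, that an int ret and an out label handle errors as in the rest of the init path, and that sched_setscheduler()/set_user_nice() are acceptable for a module to call.

	/* Hypothetical sketch: also copy the creator's scheduling policy,
	 * RT priority and nice value to the poller thread. */
	struct sched_param param = { .sched_priority = current->rt_priority };

	ret = sched_setscheduler(poller, current->policy, &param);
	if (ret)
		goto out;
	set_user_nice(poller, task_nice(current));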

2010-07-27 05:02:12

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Mon, Jul 26, 2010 at 08:08:34PM +0200, Oleg Nesterov wrote:
> On 07/26, Sridhar Samudrala wrote:
> >
> > I have been testing out a similar patch that uses kernel_thread() without the CLONE_FILES
> > flag, rather than create_kthread() followed by closing the files.
>
> !CLONE_FILES can't help. copy_files() does dup_fd() in this case.
> The child still inherits the files.
>
> > Either version should be fine.
>
> I think neither version is fine ;)
>
> exit_files() is not enough either. How about the signals, reparenting?
>
>
> I already forgot all details, probably I missed something. But it
> seems to me that it is better to just export get/set affinity and
> forget about all complications.
>
> Oleg.


Peter, could you please indicate whether you think this is the way to
go, too?

--
MST

2010-07-27 15:47:57

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Mon, Jul 26, 2010 at 08:08:34PM +0200, Oleg Nesterov wrote:
> On 07/26, Sridhar Samudrala wrote:
> >
> > I have been testing out a similar patch that uses kernel_thread() without the CLONE_FILES
> > flag, rather than create_kthread() followed by closing the files.
>
> !CLONE_FILES can't help. copy_files() does dup_fd() in this case.
> The child still inherits the files.
>
> > Either version should be fine.
>
> I think neither version is fine ;)
>
> exit_files() is not enough either. How about the signals, reparenting?
>
>
> I already forgot all details, probably I missed something. But it
> seems to me that it is better to just export get/set affinity and
> forget about all complications.
>
> Oleg.

Oleg, so can I attach your Ack to the patch in question, and merge
it all through net-next?

--
MST

2010-07-30 14:22:40

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

Sorry for the delay; I can't be responsive these days...

On 07/27, Michael S. Tsirkin wrote:
>
> On Mon, Jul 26, 2010 at 08:08:34PM +0200, Oleg Nesterov wrote:
> > On 07/26, Sridhar Samudrala wrote:
> > >
> > > I have been testing out a similar patch that uses kernel_thread() without the CLONE_FILES
> > > flag, rather than create_kthread() followed by closing the files.
> >
> > !CLONE_FILES can't help. copy_files() does dup_fd() in this case.
> > The child still inherits the files.
> >
> > > Either version should be fine.
> >
> > I think neither version is fine ;)
> >
> > exit_files() is not enough either. How about the signals, reparenting?
> >
> >
> > I already forgot all details, probably I missed something. But it
> > seems to me that it is better to just export get/set affinity and
> > forget about all complications.
> >
> > Oleg.
>
> Oleg, so can I attach your Ack to the patch in question, and merge
> it all through net-next?

Well, I do not think you need my ack ;)


But I must admit, I personally dislike this idea. A kernel thread which
is the child of a user-space process is not, in fact, a "real"
kernel thread. I think this goes against the common case. If you do not
care about the signals/reparenting, why can't you fork the user-space process
which does all the work via ioctls? OK, I do not understand the problem
domain; probably this can't work.

Anyway, the patch looks buggy to me. Starting from

create_kthread(&create);
wait_for_completion(&create.done);

At least you should check that create_kthread() succeeds, otherwise
wait_for_completion() will hang forever. OTOH, if it succeeds then
wait_for_completion() is not needed. But this is minor.

create_kthread()->kernel_thread() uses CLONE_VM, which means that the
child will share ->mm. And this means that if the parent receives
the coredumping signal it will hang forever in kernel space, waiting
until this child exits.

This is just the immediate surprise I can see with this approach;
I am afraid there is something else.

And once again: we are doing these hacks only because we lack a
couple of exports (IIUC). This is, well, a bit strange ;)

Oleg.
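
The hang described above comes from the coredump path waiting for every task that shares the dumping process's ->mm. Roughly (a heavily abridged, approximate sketch of fs/exec.c from that era; locking, vfork and signal details elided):

	/* fs/exec.c, coredump_wait() - abridged, approximate sketch */
	static int coredump_wait(int exit_code, struct core_state *core_state)
	{
		struct task_struct *tsk = current;
		struct mm_struct *mm = tsk->mm;
		int core_waiters;

		init_completion(&core_state->startup);

		/* zap_threads() sends SIGKILL to every task using this mm and
		 * returns how many of them still have to pass through exit_mm()
		 * before the dump may proceed. */
		core_waiters = zap_threads(tsk, mm, core_state, exit_code);
		if (core_waiters < 0)
			return core_waiters;

		/* A CLONE_VM kernel thread that ignores signals and sleeps in
		 * TASK_UNINTERRUPTIBLE never reaches exit_mm(), so the startup
		 * completion is never signalled and the dumping task waits
		 * here forever. */
		if (core_waiters)
			wait_for_completion(&core_state->startup);

		return core_waiters;
	}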

2010-07-30 14:31:53

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

Hello,

On 07/30/2010 04:19 PM, Oleg Nesterov wrote:
> But I must admit, I personally dislike this idea. A kernel thread which
> is the child of a user-space process is not, in fact, a "real"
> kernel thread. I think this goes against the common case. If you do not
> care about the signals/reparenting, why can't you fork the user-space process
> which does all the work via ioctls? OK, I do not understand the problem
> domain; probably this can't work.

Having kernel threads which are children of a user process is plain
scary, considering the many things the parent/child relationship implies
and the various bugs and security vulnerabilities in that area. I can't
pinpoint any problem, but I think we really shouldn't be adding
something like this for this specific use case. If we can get away
with exporting a few symbols, let's go that way.

Thanks.

--
tejun

2010-08-01 08:56:33

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Fri, Jul 30, 2010 at 04:19:01PM +0200, Oleg Nesterov wrote:
> Sorry for the delay, I can't be responsive these days...
>
> On 07/27, Michael S. Tsirkin wrote:
> >
> > On Mon, Jul 26, 2010 at 08:08:34PM +0200, Oleg Nesterov wrote:
> > > On 07/26, Sridhar Samudrala wrote:
> > > >
> > > > I have been testing out a similar patch that uses kernel_thread() without the CLONE_FILES
> > > > flag, rather than create_kthread() followed by closing the files.
> > >
> > > !CLONE_FILES can't help. copy_files() does dup_fd() in this case.
> > > The child still inherits the files.
> > >
> > > > Either version should be fine.
> > >
> > > I think neither version is fine ;)
> > >
> > > exit_files() is not enough either. How about the signals, reparenting?
> > >
> > >
> > > I already forgot all details, probably I missed something. But it
> > > seems to me that it is better to just export get/set affinity and
> > > forget about all complications.
> > >
> > > Oleg.
> >
> > Oleg, so can I attach your Ack to the patch in question, and merge
> > it all through net-next?
>
> Well, I do not think you need my ack ;)
>
>
> But I must admit, I personally dislike this idea. A kernel thread which
> is the child of a user-space process is not, in fact, a "real"
> kernel thread. I think this goes against the common case. If you do not
> care about the signals/reparenting, why can't you fork the user-space process
> which does all the work via ioctls? OK, I do not understand the problem
> domain; probably this can't work.
>
> Anyway, the patch looks buggy to me. Starting from
>
> create_kthread(&create);
> wait_for_completion(&create.done);
>
> At least you should check that create_kthread() succeeds, otherwise
> wait_for_completion() will hang forever. OTOH, if it succeeds then
> wait_for_completion() is not needed. But this is minor.
>
> create_kthread()->kernel_thread() uses CLONE_VM, which means that the
> child will share ->mm. And this means that if the parent receives
> the coredumping signal it will hang forever in kernel space, waiting
> until this child exits.
>
> This is just the immediate surprise I can see with this approach;
> I am afraid there is something else.
>
> And once again: we are doing these hacks only because we lack a
> couple of exports (IIUC). This is, well, a bit strange ;)
>
> Oleg.


Oleg, I mean Ack the exporting of get/set affinity.

Thanks!

--
MST

2010-08-02 15:05:22

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On 08/01, Michael S. Tsirkin wrote:
>
> Oleg, I mean Ack the exporting of get/set affinity.

Ah, I misunderstood.

Yes, I believe the exporting is the lesser evil. Please feel free
to add my ack.

Oleg.

2010-08-04 10:46:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH repost] sched: export sched_set/getaffinity to modules

On Tue, 2010-07-27 at 07:55 +0300, Michael S. Tsirkin wrote:

> Peter, could you please indicate whether you think this is the way to
> go, too?

I really dislike it; as you indicated, you now want priority too...

It seems the problem is that we normally don't consider work done by
kernel threads on behalf of a user process to be part of that process.

I'm not sure what work you're doing, but I'm pretty sure there are similar
things already in the kernel -- think about the work done by encryption
threads for encrypted sockets and stuff.

If you want proper containment of work caused by a process, I'd suggest
you start by looking at curing the general problem, instead of
special-casing this one case.