2008-06-05 19:57:38

by David Rientjes

[permalink] [raw]
Subject: [patch] sched: prevent bound kthreads from changing cpus_allowed

Kthreads that have called kthread_bind() are bound to specific cpus, so
other tasks should not be able to change their cpus_allowed from under
them. Otherwise, it is possible to move kthreads, such as the migration
or software watchdog threads, so they are not allowed access to the cpu
they work on.

Cc: Peter Zijlstra <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Paul Jackson <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
 include/linux/sched.h |    1 +
 kernel/cpuset.c       |   14 +++++++++++++-
 kernel/kthread.c      |    1 +
 kernel/sched.c        |    6 ++++++
 4 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1504,6 +1504,7 @@ static inline void put_task_struct(struct task_struct *t)
 #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
 #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
 #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
+#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
 #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezeable */
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1190,6 +1190,15 @@ static int cpuset_can_attach(struct cgroup_subsys *ss,

 	if (cpus_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
 		return -ENOSPC;
+	if (tsk->flags & PF_THREAD_BOUND) {
+		cpumask_t mask;
+
+		mutex_lock(&callback_mutex);
+		mask = cs->cpus_allowed;
+		mutex_unlock(&callback_mutex);
+		if (!cpus_equal(tsk->cpus_allowed, mask))
+			return -EINVAL;
+	}

 	return security_task_setscheduler(tsk, 0, NULL);
 }
@@ -1203,11 +1212,14 @@ static void cpuset_attach(struct cgroup_subsys *ss,
 	struct mm_struct *mm;
 	struct cpuset *cs = cgroup_cs(cont);
 	struct cpuset *oldcs = cgroup_cs(oldcont);
+	int err;

 	mutex_lock(&callback_mutex);
 	guarantee_online_cpus(cs, &cpus);
-	set_cpus_allowed_ptr(tsk, &cpus);
+	err = set_cpus_allowed_ptr(tsk, &cpus);
 	mutex_unlock(&callback_mutex);
+	if (err)
+		return;

 	from = oldcs->mems_allowed;
 	to = cs->mems_allowed;
diff --git a/kernel/kthread.c b/kernel/kthread.c
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -180,6 +180,7 @@ void kthread_bind(struct task_struct *k, unsigned int cpu)
 	set_task_cpu(k, cpu);
 	k->cpus_allowed = cpumask_of_cpu(cpu);
 	k->rt.nr_cpus_allowed = 1;
+	k->flags |= PF_THREAD_BOUND;
 }
EXPORT_SYMBOL(kthread_bind);

diff --git a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5565,6 +5565,12 @@ int set_cpus_allowed_ptr(struct task_struct *p, const cpumask_t *new_mask)
 		goto out;
 	}

+	if (unlikely((p->flags & PF_THREAD_BOUND) && p != current &&
+		     !cpus_equal(p->cpus_allowed, *new_mask))) {
+		ret = -EINVAL;
+		goto out;
+	}
+
 	if (p->sched_class->set_cpus_allowed)
 		p->sched_class->set_cpus_allowed(p, new_mask);
 	else {


2008-06-05 20:30:10

by Paul Jackson

[permalink] [raw]
Subject: Re: [patch] sched: prevent bound kthreads from changing cpus_allowed

A couple of questions on this:

1) Sometimes threads are bound to a set of CPUs, such as the CPUs
on a particular node:

drivers/pci/pci-driver.c: set_cpus_allowed_ptr(current, nodecpumask);
net/sunrpc/svc.c: set_cpus_allowed_ptr(current, nodecpumask);

Such cases can't invoke kthread_bind(), as that only binds to a single CPU.
I only see one place in your patch that sets PF_THREAD_BOUND; would it make
sense for such multi-CPU binds as above to be PF_THREAD_BOUND as well?

2) Sometimes calls to kthread_bind are binding to any online cpu, such as in:

drivers/infiniband/hw/ehca/ehca_irq.c: kthread_bind(cct->task, any_online_cpu(cpu_online_map));

In such cases, the PF_THREAD_BOUND seems inappropriate. The caller of
kthread_bind() really doesn't seem to care where that thread is bound;
they just want it on a CPU that is still online.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-06-05 20:52:25

by Daniel Walker

[permalink] [raw]
Subject: Re: [patch] sched: prevent bound kthreads from changing cpus_allowed


On Thu, 2008-06-05 at 12:57 -0700, David Rientjes wrote:
> Kthreads that have called kthread_bind() are bound to specific cpus, so
> other tasks should not be able to change their cpus_allowed from under
> them. Otherwise, it is possible to move kthreads, such as the migration
> or software watchdog threads, so they are not allowed access to the cpu
> they work on.

I'm all for it .. I've crashed test systems just by migrating tasks that
are supposed to stay bound while using cpusets ..

Daniel

2008-06-05 21:12:59

by David Rientjes

[permalink] [raw]
Subject: Re: [patch] sched: prevent bound kthreads from changing cpus_allowed

On Thu, 5 Jun 2008, Paul Jackson wrote:

> A couple of questions on this:
>
> 1) Sometimes threads are bound to a set of CPUs, such as the CPUs
> on a particular node:
>
> drivers/pci/pci-driver.c: set_cpus_allowed_ptr(current, nodecpumask);
> net/sunrpc/svc.c: set_cpus_allowed_ptr(current, nodecpumask);
>
> Such cases can't invoke kthread_bind(), as that only binds to a single CPU.
> I only see one place in your patch that sets PF_THREAD_BOUND; would it make
> sense for such multi-CPU binds as above to be PF_THREAD_BOUND as well?
>

Not in the drivers/pci/pci-driver.c case because the setting of
cpus_allowed to nodecpumask is only temporary (as is the disabling of the
mempolicy). It's just done for the probe callback and then reset to the
old cpumask. So any migration here would be lost.
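
Roughly, the pattern there, condensed from drivers/pci/pci-driver.c with the
error handling and the mempolicy save/restore trimmed, looks like this:

	cpumask_t oldmask = current->cpus_allowed;

	if (node >= 0)
		/* run the probe on the device's home node */
		set_cpus_allowed_ptr(current, nodecpumask);
	error = drv->probe(dev, id);
	/* restore the old mask; any change made in between is overwritten */
	set_cpus_allowed_ptr(current, &oldmask);

So a PF_THREAD_BOUND-style restriction would not buy anything there anyway.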

I can't speculate about the net/sunrpc/svc.c case because I don't know if
user migration is harmful or discouraged. The behavior with my patch is
the same for any calls to set_cpus_allowed_ptr() for tasks that haven't
called kthread_bind(), so I'll leave that decision up to those who know
best for this networking code.

> 2) Sometimes calls to kthread_bind are binding to any online cpu, such as in:
>
> drivers/infiniband/hw/ehca/ehca_irq.c: kthread_bind(cct->task, any_online_cpu(cpu_online_map));
>
> In such cases, the PF_THREAD_BOUND seems inappropriate. The caller of
> kthread_bind() really doesn't seem to care where that thread is bound;
> they just want it on a CPU that is still online.
>

This particular case is simply moving the thread to any online cpu so that
it survives long enough for the subsequent kthread_stop() in
destroy_comp_task(). So I don't see a problem with this instance.

A caller to kthread_bind() can always remove PF_THREAD_BOUND itself upon
return, but I haven't found any cases in the tree where that is currently
necessary. And doing that would defeat the semantics of kthread_bind()
where these threads are supposed to be bound to a specific cpu and not
allowed to run on others.

David

2008-06-05 21:48:18

by Paul Jackson

[permalink] [raw]
Subject: Re: [patch] sched: prevent bound kthreads from changing cpus_allowed

Ok - looks good to me. (I too have shot my foot off
moving tasks that shouldn't be moved - thanks.)

Reviewed-by: Paul Jackson <[email protected]>

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-06-09 20:59:37

by Max Krasnyansky

[permalink] [raw]
Subject: Re: [patch] sched: prevent bound kthreads from changing cpus_allowed

David Rientjes wrote:
>> 2) Sometimes calls to kthread_bind are binding to any online cpu, such as in:
>>
>> drivers/infiniband/hw/ehca/ehca_irq.c: kthread_bind(cct->task, any_online_cpu(cpu_online_map));
>>
>> In such cases, the PF_THREAD_BOUND seems inappropriate. The caller of
>> kthread_bind() really doesn't seem to care where that thread is bound;
>> they just want it on a CPU that is still online.
>>
>
> This particular case is simply moving the thread to any online cpu so that
> it survives long enough for the subsequent kthread_stop() in
> destroy_comp_task(). So I don't see a problem with this instance.
>
> A caller to kthread_bind() can always remove PF_THREAD_BOUND itself upon
> return, but I haven't found any cases in the tree where that is currently
> necessary. And doing that would defeat the semantics of kthread_bind()
> where these threads are supposed to be bound to a specific cpu and not
> allowed to run on others.

Actually I have another use case here. Above example in particular may be ok
but it does demonstrate the issue nicely. Which is that in some cases kthreads
are bound to a CPU but do not have a strict "must run here" requirement and
could be moved if needed.
For example I need an ability to move workqueue threads. Workqueue threads do
kthread_bind().
So how about we add something like kthread_bind_strict() which would set
PF_THREAD_BOUND ?
We could also simply add flags argument to the kthread_bind() which would be
better imo but requires more changes. ie It'd look like
kthread_bind(..., cpu, KTHREAD_BIND_STRICT);

Things like migration threads, stop machine, etc would use the strict version
and everything else would use non-strict bind.
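
For concreteness, a rough sketch of the flags variant, reusing the body of the
current kthread_bind() (the flag and the extra argument are hypothetical, none
of this exists yet):

	#define KTHREAD_BIND_STRICT	0x1	/* assumed new flag */

	void kthread_bind(struct task_struct *k, unsigned int cpu,
			  unsigned int flags)
	{
		/* same as today's kthread_bind()... */
		set_task_cpu(k, cpu);
		k->cpus_allowed = cpumask_of_cpu(cpu);
		k->rt.nr_cpus_allowed = 1;
		/* ...but only strict binds become immovable */
		if (flags & KTHREAD_BIND_STRICT)
			k->flags |= PF_THREAD_BOUND;
	}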

---
On the related note (this seems like the right crowd :). What do people think
about kthreads and cpusets in general. We currently have a bit of a disconnect
in the logic.
1. kthreads can be put into a cpuset at which point their cpus_allowed mask is
updated properly
2. kthread's cpus_allowed mask is updated properly when cpuset setup changes
(cpus added, removed, etc).
3. kthreads inherit cpuset from a parent (kthreadd for example) _but_ they
either do kthread_bind() or set_cpus_allowed() and both of those simply ignore
inherited cpusets.

Notice how scenario #3 does not fit into the overall picture. The behaviour is
inconsistent.
How about this:
- Split sched_setaffinity into

sched_setaffinity()
{
	struct task_struct *p = get_task_by_pid(pid);

	return task_setaffinity(p, cpumask, flags);
}

task_setaffinity(task, cpumask, flags)
{
	if (flags & FORCE) {
		// Used for kthreads that require strict binding.
		// Detach the task from the current cpuset
		// and put it into the root cpuset.
		// Set PF_THREAD_BOUND.
	}

	// Rest of the original sched_setaffinity logic
}

- Have kthreads call task_setaffinity() instead of set_cpus_allowed() directly.
That way the behaviour will be consistent across the board.

Comments ?

Max

2008-06-09 22:08:52

by David Rientjes

[permalink] [raw]
Subject: Re: [patch] sched: prevent bound kthreads from changing cpus_allowed

On Mon, 9 Jun 2008, Max Krasnyanskiy wrote:

> Actually I have another use case here. Above example in particular may be ok
> but it does demonstrate the issue nicely. Which is that in some cases kthreads
> are bound to a CPU but do not have a strict "must run here" requirement and
> could be moved if needed.

That isn't a completely accurate description of the patch; the kthread
itself is always allowed to change its cpu affinity. This exception, for
example, allows stopmachine to still work correctly since it uses
set_cpus_allowed_ptr() for each thread.

This patch simply prohibits other tasks from changing the cpu affinity of
a kthread that has called kthread_bind().

> For example I need an ability to move workqueue threads.

Let's elaborate a little on that: you're moving workqueue threads from
their originating cpu to another cpu (hopefully on the same NUMA node)
using cpusets or sched_setaffinity()?

> Workqueue threads do
> kthread_bind().
> So how about we add something like kthread_bind_strict() which would set
> PF_THREAD_BOUND ?
> We could also simply add flags argument to the kthread_bind() which would be
> better imo but requires more changes. ie It'd look like
> kthread_bind(..., cpu, KTHREAD_BIND_STRICT);
>
> Things like migration threads, stop machine, etc would use the strict version
> and everything else would use non-strict bind.
>

It depends on the number of exceptions that we want to allow. If there are
only one or two, it's sufficient to just use

p->flags &= ~PF_THREAD_BOUND;

upon return from kthread_bind(), or simply convert the existing code to
use set_cpus_allowed_ptr() instead:

kthread_bind(p, cpu);

becomes

cpumask_t mask = cpumask_of_cpu(cpu);
set_cpus_allowed_ptr(p, &mask);

Or, if we have more exceptions to the rule than actual strict binding for
kthreads using kthread_bind(), we can just add

p->flags |= PF_THREAD_BOUND;

upon return in the watchdog and migration threads.

But, to me, the semantics of kthread_bind() are pretty clear. I think
it's dangerous to move kthreads such as watchdog or migration threads out
from under them and that is easily shown if you try to do it. There are
perhaps exceptions to the rule where existing kthread_bind() calls could
be converted to set_cpus_allowed_ptr(), but I think we should enumerate
those cases and discuss the hazards that could be associated with changing
their cpu affinity.

I'd also like to hear why you choose to move your workqueue threads off
their originating cpu.

> On the related note (this seems like the right crowd :). What do people think
> about kthreads and cpusets in general. We currently have a bit of a disconnect
> in the logic.
> 1. kthreads can be put into a cpuset at which point their cpus_allowed mask is
> updated properly
> 2. kthread's cpus_allowed mask is updated properly when cpuset setup changes
> (cpus added, removed, etc).
> 3. kthreads inherit cpuset from a parent (kthreadd for example) _but_ they
> either do kthread_bind() or set_cpus_allowed() and both of those simply ignore
> inherited cpusets.
>

With my patch, this is slightly different. Kthreads that have called
kthread_bind() can have a different cpu affinity than the cpus_allowed of
their cpuset. This happens when such a kthread is attached to a cpuset
and its 'cpus' file subsequently changes. The cpuset code correctly uses
set_cpus_allowed_ptr(), so this change in affinity is now rejected
for PF_THREAD_BOUND tasks, yet the kthread is still a member of the
cpuset.

It's debatable whether the kthread should still remain as a member of the
cpuset, but there are significant advantages: cpusets offers more than
just a simple way to do sched_setaffinity() over an aggregate of tasks.

David

2008-06-10 04:23:05

by Max Krasnyansky

[permalink] [raw]
Subject: Re: [patch] sched: prevent bound kthreads from changing cpus_allowed

David Rientjes wrote:
>> For example I need an ability to move workqueue threads.
>
> Let's elaborate a little on that: you're moving workqueue threads from
> their originating cpu to another cpu (hopefully on the same NUMA node)
> using cpusets or sched_setaffinity()?
Yes.

> But, to me, the semantics of kthread_bind() are pretty clear. I think
> it's dangerous to move kthreads such as watchdog or migration threads out
> from under them and that is easily shown if you try to do it. There are
> perhaps exceptions to the rule where existing kthread_bind() calls could
> be converted to set_cpus_allowed_ptr(), but I think we should enumerate
> those cases and discuss the hazards that could be associated with changing
> their cpu affinity.
I think what can and cannot be moved is a separate question. As far as cpu
affinity API for kthreads goes it makes sense to support both scenarios.
See below.

> I'd also like to hear why you choose to move your workqueue threads off
> their originating cpu.
CPU isolation. ie To avoid kernel activity on certain CPUs.

>> On the related note (this seems like the right crowd :). What do people think
>> about kthreads and cpusets in general. We currently have a bit of a disconnect
>> in the logic.
>> 1. kthreads can be put into a cpuset at which point their cpus_allowed mask is
>> updated properly
>> 2. kthread's cpus_allowed mask is updated properly when cpuset setup changes
>> (cpus added, removed, etc).
>> 3. kthreads inherit cpuset from a parent (kthreadd for example) _but_ they
>> either do kthread_bind() or set_cpus_allowed() and both of those simply ignore
>> inherited cpusets.
>>
> With my patch, this is slightly different. Kthreads that have called
> kthread_bind() can have a different cpu affinity than the cpus_allowed of
> their cpuset. This happens when such a kthread is attached to a cpuset
> and its 'cpus' file subsequently changes. The cpuset code correctly uses
> set_cpus_allowed_ptr(), so this change in affinity is now rejected
> for PF_THREAD_BOUND tasks, yet the kthread is still a member of the
> cpuset.
That would be inconsistent and confusing, I think. If a task is in the cpuset
X all constraints of the cpuset X must apply. If some constraints cannot be
satisfied then that task should not be in the cpuset X.

> It's debatable whether the kthread should still remain as a member of the
> cpuset, but there are significant advantages: cpusets offers more than
> just a simple way to do sched_setaffinity() over an aggregate of tasks.
Yes cpusets are not only about cpu affinity. But again the behaviour should be
consistent across the board. cpuset.cpus must apply to all the tasks in the
set, not just some of the tasks.

Also I think you missed my other point/suggestion. Yes your patch fixes one
problem, which is that kthreads that must not move will not move. Although like
I said, the behaviour with regard to the cpuset is inconsistent and confusing.
The other thing that I pointed out is that kthreads that _can_ move do not
honor cpuset constraints.
Let me give another example. Forget the workqueues for a second. Kthreads like
scsi_eh, kswapd, kacpid, etc, etc are all started with cpus_allowed=ALL_CPUS
even though they inherit a cpuset from kthreadd. Yes I can move them manually
but the behaviour is not consistent, and for no good reason. kthreads can be
stopped/started at any time (module load for example) which means that I'd
have to keep moving them.

To sum it up here is what I'm suggesting:
kthread_bind(task, cpu)
{
	// Set PF_THREAD_BOUND
	// Move into root cpuset
	// Bind to the cpu
}

kthread_setaffinity(task, cpumask)
{
	// Enforce cpuset.cpus_allowed
	// Update the affinity mask and migrate the kthread (if needed)
}

Instead of calling set_cpus_allowed_ptr() kthreads that do not need strict cpu
binding will be calling kthread_setaffinity().
That way the behaviour is consistent across the board.

Makes sense ?

Max

2008-06-10 06:46:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] sched: prevent bound kthreads from changing cpus_allowed

On Mon, 2008-06-09 at 13:59 -0700, Max Krasnyanskiy wrote:
> David Rientjes wrote:
> >> 2) Sometimes calls to kthread_bind are binding to any online cpu, such as in:
> >>
> >> drivers/infiniband/hw/ehca/ehca_irq.c: kthread_bind(cct->task, any_online_cpu(cpu_online_map));
> >>
> >> In such cases, the PF_THREAD_BOUND seems inappropriate. The caller of
> >> kthread_bind() really doesn't seem to care where that thread is bound;
> >> they just want it on a CPU that is still online.
> >>
> >
> > This particular case is simply moving the thread to any online cpu so that
> > it survives long enough for the subsequent kthread_stop() in
> > destroy_comp_task(). So I don't see a problem with this instance.
> >
> > A caller to kthread_bind() can always remove PF_THREAD_BOUND itself upon
> > return, but I haven't found any cases in the tree where that is currently
> > necessary. And doing that would defeat the semantics of kthread_bind()
> > where these threads are supposed to be bound to a specific cpu and not
> > allowed to run on others.
>
> Actually I have another use case here. Above example in particular may be ok
> but it does demonstrate the issue nicely. Which is that in some cases kthreads
> are bound to a CPU but do not have a strict "must run here" requirement and
> could be moved if needed.
> For example I need an ability to move workqueue threads. Workqueue threads do
> kthread_bind().

Per cpu workqueues should stay on their cpu.

What you're really looking for is a more fine grained alternative to
flush_workqueue().

2008-06-10 10:29:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] sched: prevent bound kthreads from changing cpus_allowed


* David Rientjes <[email protected]> wrote:

> Kthreads that have called kthread_bind() are bound to specific cpus,
> so other tasks should not be able to change their cpus_allowed from
> under them. Otherwise, it is possible to move kthreads, such as the
> migration or software watchdog threads, so they are not allowed access
> to the cpu they work on.

applied to tip/sched-devel, thanks David.

Ingo

2008-06-10 15:37:51

by Max Krasnyansky

[permalink] [raw]
Subject: Re: [patch] sched: prevent bound kthreads from changing cpus_allowed



Peter Zijlstra wrote:
> On Mon, 2008-06-09 at 13:59 -0700, Max Krasnyanskiy wrote:
>> David Rientjes wrote:
>>>> 2) Sometimes calls to kthread_bind are binding to any online cpu, such as in:
>>>>
>>>> drivers/infiniband/hw/ehca/ehca_irq.c: kthread_bind(cct->task, any_online_cpu(cpu_online_map));
>>>>
>>>> In such cases, the PF_THREAD_BOUND seems inappropriate. The caller of
>>>> kthread_bind() really doesn't seem to care where that thread is bound;
>>>> they just want it on a CPU that is still online.
>>>>
>>> This particular case is simply moving the thread to any online cpu so that
>>> it survives long enough for the subsequent kthread_stop() in
>>> destroy_comp_task(). So I don't see a problem with this instance.
>>>
>>> A caller to kthread_bind() can always remove PF_THREAD_BOUND itself upon
>>> return, but I haven't found any cases in the tree where that is currently
>>> necessary. And doing that would defeat the semantics of kthread_bind()
>>> where these threads are supposed to be bound to a specific cpu and not
>>> allowed to run on others.
>> Actually I have another use case here. Above example in particular may be ok
>> but it does demonstrate the issue nicely. Which is that in some cases kthreads
>> are bound to a CPU but do not have a strict "must run here" requirement and
>> could be moved if needed.
>> For example I need an ability to move workqueue threads. Workqueue threads do
>> kthread_bind().
>
> Per cpu workqueues should stay on their cpu.
>
> What you're really looking for is a more fine grained alternative to
> flush_workqueue().
Actually I had a discussion on that with Oleg Nesterov. If you remember my
original solution (ie centralized cpu_isolate_map) was to completely redirect
work onto other cpus. Then you pointed out that it's the flush_() that really
makes the box stuck. So I started thinking about redoing the flush. While
looking at the code I realized that if I only change the flush_() then queued
work can get stale so to speak. ie Machine does not get stuck but some work
submitted on the isolated cpus will sit there for a long time. Oleg pointed
out exact same thing. So the simplest solution that does not require any
surgery to the workqueue is to just move the threads to other cpus. I did not
want to get into too much detail on the workqueue stuff here. I'll start a
separate thread on this.
As I pointed out, there are a bunch of other kthreads like: kswapd, kacpid,
pdflush, khubd, etc, etc, that clearly do not need any pinning but still
violate cpuset constraints they inherit from kthreadd.

Max


2008-06-10 16:30:22

by Max Krasnyansky

[permalink] [raw]
Subject: cpusets and kthreads, inconsistent behaviour

I pointed this out in the email thread about PF_THREAD_BOUND patch and wanted
to restart the thread to make sure that people pay attention :).
I was going to cook up a patch for this and wanted to get some early feedback
to avoid wasting time.

Basically the issue is that current behaviour of the cpusets is inconsistent
with regards to kthreads. Kthreads inherit cpuset from a parent properly but
they simply ignore cpuset.cpus when their cpu affinity is set/updated.
I think the behaviour must be consistent across the board. cpuset.cpus must
apply to _all_ the tasks in the set, not just some of the tasks. If kthread
must run on the cpus other than current_cpuset.cpus then it should detach from
the cpuset.

To give you an example kthreads like scsi_eh, kswapd, kacpid, pdflush,
kseriod, etc are all started with cpus_allowed=ALL_CPUS even though they
inherit a cpuset from kthreadd. Yes they can be moved manually (with
sched_setaffinity) but the behaviour is not consistent, and for no good
reason. kthreads can be stopped/started at any time (module load for example)
which means that the user will have to keep moving them.

To sum it up here is what I'm suggesting:
kthread_bind(task, cpu)
{
	// Set PF_THREAD_BOUND
	// Move into root cpuset
	// Bind to the cpu
}

kthread_setaffinity(task, cpumask)
{
	// Enforce cpuset.cpus_allowed
	// Update the affinity mask and migrate the kthread (if needed)
}

Kthreads that do not require strict cpu binding will be calling
kthread_setaffinity() instead of set_cpus_allowed_ptr() and such.

Kthreads that require strict cpu binding will be calling kthread_bind() and
detach from the cpuset they inherit from their parent.

That way the behaviour is consistent across the board.
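
To make kthread_setaffinity() a bit more concrete, here is a rough sketch
(hypothetical; cpuset_cpus_allowed() is the existing helper, used the same way
sched_setaffinity() uses it):

	int kthread_setaffinity(struct task_struct *k, cpumask_t new_mask)
	{
		cpumask_t cs_allowed;

		/* clip the request against the cpuset the kthread inherited */
		cpuset_cpus_allowed(k, &cs_allowed);
		cpus_and(new_mask, new_mask, cs_allowed);
		if (cpus_empty(new_mask))
			return -EINVAL;

		/* update the mask and migrate the kthread if needed */
		return set_cpus_allowed_ptr(k, &new_mask);
	}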

Comments ?

Max

2008-06-10 16:58:20

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [patch] sched: prevent bound kthreads from changing cpus_allowed

On 06/10, Max Krasnyansky wrote:
>
> Peter Zijlstra wrote:
> >
> > Per cpu workqueues should stay on their cpu.
> >
> > What you're really looking for is a more fine grained alternative to
> > flush_workqueue().
> Actually I had a discussion on that with Oleg Nesterov. If you remember my
> original solution (ie centralized cpu_isolate_map) was to completely redirect
> work onto other cpus. Then you pointed out that it's the flush_() that really
> makes the box stuck. So I started thinking about redoing the flush. While
> looking at the code I realized that if I only change the flush_() then queued
> work can get stale so to speak. ie Machine does not get stuck but some work
> submitted on the isolated cpus will sit there for a long time. Oleg pointed
> out exact same thing. So the simplest solution that does not require any
> surgery to the workqueue is to just move the threads to other cpus.

Cough... I'd like to mention that I _personally agree with Peter, cwq->thread's
should stay on their cpu.

I just meant that from the workqueue.c pov it is (afaics) OK to move cwq->thread
to other CPUs, in a sense that this shouldn't add races or hotplug problems, etc.
But still this doesn't look right to me.

Oleg.

2008-06-10 17:05:37

by David Rientjes

[permalink] [raw]
Subject: Re: [patch] sched: prevent bound kthreads from changing cpus_allowed

On Mon, 9 Jun 2008, Max Krasnyansky wrote:

> > I'd also like to hear why you choose to move your workqueue threads off
> > their originating cpu.
> CPU isolation. ie To avoid kernel activity on certain CPUs.
>

This probably isn't something that you should be doing, at least with the
workqueue threads. The slab cache reaper, for example, depends on being
able to drain caches for each cpu and will be neglected if they are moved.

I'm curious why you haven't encountered problems with this while isolating
per-cpu workqueue threads in cpusets that don't have access to their own
cpu.

Regardless, at this point we'd need a patch to the slab layer, ack'd by the
appropriate people, to allow the exception.

> Yes cpusets are not only about cpu affinity. But again the behaviour should be
> consistent across the board. cpuset.cpus must apply to all the tasks in the
> set, not just some of the tasks.
>

It has always been possible to assign a task to a cpuset and then further
constrict its set of allowable cpus with sched_setaffinity(). So, while
the cpus_allowed in this case are always a subset of the cpuset's cpus,
you could still describe this as inconsistent.
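
For example, a task that is already confined to a cpuset can still shrink its
own mask with a plain sched_setaffinity(2) call (ordinary userspace code; cpu 2
is picked arbitrarily):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>

	int main(void)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(2, &set);	/* must be a subset of the cpuset's cpus */
		if (sched_setaffinity(0, sizeof(set), &set))
			perror("sched_setaffinity");
		return 0;
	}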

> To sum it up here is what I'm suggesting:
> kthread_bind(task, cpu)
> {
> // Set PF_THREAD_BOUND
> // Move into root cpuset
> // Bind to the cpu
> }
>

kthread_bind() usually happens immediately following kthread_create(), so
it should already be in the root cpuset. If it has been forked in a
different cpuset, however, implicitly moving it may be more harmful than
any inconsistency that exists in cpus_allowed.

David

2008-06-10 17:20:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] sched: prevent bound kthreads from changing cpus_allowed

On Tue, 2008-06-10 at 21:00 +0400, Oleg Nesterov wrote:
> On 06/10, Max Krasnyansky wrote:
> >
> > Peter Zijlstra wrote:
> > >
> > > Per cpu workqueues should stay on their cpu.
> > >
> > > What you're really looking for is a more fine grained alternative to
> > > flush_workqueue().
> > Actually I had a discussion on that with Oleg Nesterov. If you remember my
> > original solution (ie centralized cpu_isolate_map) was to completely redirect
> > work onto other cpus. Then you pointed out that it's the flush_() that really
> > makes the box stuck. So I started thinking about redoing the flush. While
> > looking at the code I realized that if I only change the flush_() then queued
> > work can get stale so to speak. ie Machine does not get stuck but some work
> > submitted on the isolated cpus will sit there for a long time. Oleg pointed
> > out exact same thing. So the simplest solution that does not require any
> > surgery to the workqueue is to just move the threads to other cpus.
>
> Cough... I'd like to mention that I _personally agree with Peter, cwq->thread's
> should stay on their cpu.
>
> I just meant that from the workqueue.c pov it is (afaics) OK to move cwq->thread
> to other CPUs, in a sense that this shouldn't add races or hotplug problems, etc.
> But still this doesn't look right to me.

The advantage of creating a more flexible or fine-grained flush is that
large machines also profit from it.

A simple scheme would be creating a workqueue context that is passed
along on enqueue, and then passed to flush.

This context could:

- either track the individual worklets and employ a completion scheme
to wait for them;

- or track on which cpus the worklets are enqueued and flush only those
few cpus.
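
As a sketch of the second option (every name below is made up):

	struct wq_flush_ctx {
		cpumask_t queued_cpus;		/* cpus that received work */
	};

	static void queue_work_ctx(struct workqueue_struct *wq,
				   struct wq_flush_ctx *ctx,
				   struct work_struct *work)
	{
		int cpu = get_cpu();		/* queue_work() enqueues on this cpu */

		cpu_set(cpu, ctx->queued_cpus);	/* remember where it went */
		queue_work(wq, work);
		put_cpu();
	}

	static void flush_ctx(struct workqueue_struct *wq,
			      struct wq_flush_ctx *ctx)
	{
		int cpu;

		might_sleep();
		/* flush only the cpus that actually got work, reusing the
		 * per-cpu helper flush_workqueue() already iterates with */
		for_each_cpu_mask(cpu, ctx->queued_cpus)
			flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
	}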

Doing this would solve your case since nobody (except those having
business) will enqueue something on the isolated cpus.

And it will improve the large machine case for the same reasons - it
won't have to iterate all cpus.

Of course, things that use schedule_on_each_cpu() will still end up
doing things on your isolated cpus, but getting around those would
probably get you into some correctness trouble.


2008-06-10 17:46:03

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [patch] sched: prevent bound kthreads from changing cpus_allowed

On 06/05, David Rientjes wrote:
>
> Kthreads that have called kthread_bind() are bound to specific cpus, so
> other tasks should not be able to change their cpus_allowed from under
> them. Otherwise, it is possible to move kthreads, such as the migration
> or software watchdog threads, so they are not allowed access to the cpu
> they work on.

Imho, this is very good change, but...

> @@ -180,6 +180,7 @@ void kthread_bind(struct task_struct *k, unsigned int cpu)
> set_task_cpu(k, cpu);
> k->cpus_allowed = cpumask_of_cpu(cpu);
> k->rt.nr_cpus_allowed = 1;
> + k->flags |= PF_THREAD_BOUND;

What if user-space moves this task right before "|= PF_THREAD_BOUND" ?
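
That is, with an interleaving like this (the task is not yet PF_THREAD_BOUND
while its mask is being set up):

	CPU0: kthread_bind(k, cpu)		CPU1: sched_setaffinity(k, ...)

	set_task_cpu(k, cpu);
	k->cpus_allowed = cpumask_of_cpu(cpu);
	k->rt.nr_cpus_allowed = 1;
						set_cpus_allowed_ptr(k, new_mask);
						/* PF_THREAD_BOUND not set yet,
						   so this is still allowed */
	k->flags |= PF_THREAD_BOUND;
	/* now "bound", but with the wrong mask */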

Oleg.

2008-06-10 17:59:37

by Max Krasnyansky

[permalink] [raw]
Subject: Re: [patch] sched: prevent bound kthreads from changing cpus_allowed



Oleg Nesterov wrote:
> On 06/10, Max Krasnyansky wrote:
>> Peter Zijlstra wrote:
>>> Per cpu workqueues should stay on their cpu.
>>>
>>> What you're really looking for is a more fine grained alternative to
>>> flush_workqueue().
>> Actually I had a discussion on that with Oleg Nesterov. If you remember my
>> original solution (ie centralized cpu_isolate_map) was to completely redirect
>> work onto other cpus. Then you pointed out that it's the flush_() that really
>> makes the box stuck. So I started thinking about redoing the flush. While
>> looking at the code I realized that if I only change the flush_() then queued
>> work can get stale so to speak. ie Machine does not get stuck but some work
>> submitted on the isolated cpus will sit there for a long time. Oleg pointed
>> out exact same thing. So the simplest solution that does not require any
>> surgery to the workqueue is to just move the threads to other cpus.
>
> Cough... I'd like to mention that I _personally agree with Peter, cwq->thread's
> should stay on their cpu.
I never argued against the _should stay_ ;-). What I'm arguing against is the
_must stay_ which is a big difference. I'll start a separate thread on this.

> I just meant that from the workqueue.c pov it is (afaics) OK to move cwq->thread
> to other CPUs, in a sense that this shouldn't add races or hotplug problems, etc.
Yep. That's what I was referring to in the explanation that I sent to Peter.

> But still this doesn't look right to me.
Yeah, it's all about perceptions. We'll fix that ;-).

Max

2008-06-10 18:47:54

by David Rientjes

[permalink] [raw]
Subject: Re: cpusets and kthreads, inconsistent behaviour

On Tue, 10 Jun 2008, Max Krasnyansky wrote:

> Basically the issue is that current behaviour of the cpusets is inconsistent
> with regards to kthreads. Kthreads inherit cpuset from a parent properly but
> they simply ignore cpuset.cpus when their cpu affinity is set/updated.
> I think the behaviour must be consistent across the board. cpuset.cpus must
> apply to _all_ the tasks in the set, not just some of the tasks. If kthread
> must run on the cpus other than current_cpuset.cpus then it should detach from
> the cpuset.
>

I disagree that a cpuset's set of allowable cpus should apply to all tasks
attached to that cpuset. It's certainly beneficial to be able to further
constrict the set of allowed cpus for a task using sched_setaffinity().

It makes more sense to argue that for each task p, p->cpus_allowed is a
subset of task_cs(p)->cpus_allowed.

> To give you an example kthreads like scsi_eh, kswapd, kacpid, pdflush,
> kseriod, etc are all started with cpus_allowed=ALL_CPUS even though they
> inherit a cpuset from kthreadd. Yes they can be moved manually (with
> sched_setaffinity) but the behaviour is not consistent, and for no good
> reason. kthreads can be stopped/started at any time (module load for example)
> which means that the user will have to keep moving them.
>

This doesn't seem to be purely a kthread issue. Tasks can be moved to a
disjoint set of cpus by any caller to set_cpus_allowed_ptr() in the
kernel.

David

2008-06-10 20:24:18

by Max Krasnyansky

[permalink] [raw]
Subject: workqueue cpu affinity

Ok looks like we got deeper into workqueue discussion in the wrong mail thread
:). So let me restart it.

Here is some background on this. Full cpu isolation requires some tweaks to the
workqueue handling. Either the workqueue threads need to be moved (which is my
current approach), or work needs to be redirected when it's submitted.
flush_*_work() needs to be improved too. See Peter's reply below.

First reaction that a lot of people get is "oh no no, this is bad, this will
not work". Which is understandable but _wrong_ ;-). See below for more details
and analysis.

One thing that helps in accepting this isolation idea is to think about the
use cases. There are two uses cases for it:
1. Normal threaded RT apps with threads that use system calls, block on
events, etc.
2. Specialized RT apps with thread(s) that require close to 100% of the CPU
resources. Their threads avoid using system calls and avoid blocking. This is
done to achieve very low latency and low overhead.

Scenario #1 is straightforward. You'd want to isolate the processor the RT app
is running on to avoid typical sources of latency. Workqueues running on the
same processor is not an issue (because RT threads block), but you do not get
the same latency guarantees.

Workqueues are an issue for the scenario #2. Workqueue kthreads do not get a
chance to run because user's RT threads are higher priority. However those RT
threads should not use regular kernel services because that by definition
means that they are not getting ~100% of the CPU they want. In other words
they cannot have it both ways :).

Therefore it's expected that the kernel won't be used heavily on those cpus,
and nothing really schedules workqueues and stuff. It's also expected that
certain kernel services may not be available on those CPUs. Again we cannot
have it both ways. ie Have all the kernel services and yet the kernel is not
supposed to use much CPU time :).

If at this point people still get this "Oh no, that's wrong" feeling, please
read this excellent statement by Paul J

"A key reason that Linux has succeeded is that it actively seeks to work for a
variety of people, purposes and products. One operating system is now a strong
player in the embedded market, the real time market, and the High Performance
Computing market, as well as being an important player in a variety of other
markets. That's a rather stunning success."
— Paul Jackson, in a June 4th, 2008 message on the Linux Kernel mailing list.

btw Paul, it got picked up by the kerneltrap.org
http://kerneltrap.org/Quote/A_Rather_Stunning_Success

Sorry for the lengthy intro. Back to the technical discussions.

Peter Zijlstra wrote:
> The advantage of creating a more flexible or fine-grained flush is that
> large machine also profit from it.
I agree, our current workqueue flush scheme is expensive because it has to
schedule on each online cpu. So yes improving flush makes sense in general.

> A simple scheme would be creating a workqueue context that is passed
> along on enqueue, and then passed to flush.
>
> This context could:
>
> - either track the individual worklets and employ a completion scheme
> to wait for them;
>
> - or track on which cpus the worklets are enqueued and flush only those
> few cpus.
>
> Doing this would solve your case since nobody (except those having
> business) will enqueue something on the isolated cpus.
>
> And it will improve the large machine case for the same reasons - it
> won't have to iterate all cpus.
This will require a bit of surgery across the entire tree. There is a lot of
code that calls flush_scheduled_work(). All that would have to be changed,
which is ok, but I think as the first step we could simply allow moving
workqueue threads out of cpus where that load is undesirable and make people
aware of what happens in that case.
When I get a chance I'll look into the flush scheme you proposed above.

> Of course, things that use schedule_on_each_cpu() will still end up
> doing things on your isolated cpus, but getting around those would
> probably get you into some correctness trouble.
There is literally a _single_ user of that API.
Actually let's look at all the current users of the schedule_on(cpu) kind of API.

git grep
'queue_delayed_work_on\|schedule_delayed_work_on\|schedule_on_each_cpu' |\
grep -v 'workqueue\.[ch]\|\.txt'

> drivers/cpufreq/cpufreq_ondemand.c: queue_delayed_work_on(cpu, kondemand_wq, &dbs_info->work, delay);
> drivers/cpufreq/cpufreq_ondemand.c: queue_delayed_work_on(dbs_info->cpu, kondemand_wq, &dbs_info->work,
No big deal. Worst case the cpufreq state for that cpu will be stale.
RT apps would not want cpufreq governor messing with the cpu frequencies
anyway. So if you look back at the scenarios #1 and #2 I described above this
is non-issue.

> drivers/macintosh/rack-meter.c: schedule_delayed_work_on(cpu, &rcpu->sniffer,
> drivers/macintosh/rack-meter.c: schedule_delayed_work_on(cpu, &rm->cpu[cpu].sniffer,
Not a big deal either. In the worst case stats for the isolated cpus will not
be updated. Can probably be converted to timers.

> drivers/oprofile/cpu_buffer.c: schedule_delayed_work_on(i, &b->work, DEFAULT_TIMER_EXPIRE + i);
> drivers/oprofile/cpu_buffer.c: * By using schedule_delayed_work_on and then schedule_delayed_work
Yep I mentioned before that messing with the workqueues breaks oprofile. So
yes this one is an issue. However again it's not a catastrophic failure of the
system. Oprofile will not be able to collect samples from the CPU the RT app
is running on and it actually warns the user about it (it prints an error that
the work is running on the wrong cpu). I'm working on a patch that collects
samples via IPI or per cpu timer. It will be configurable of course. So this
one is not a big deal either.

> mm/slab.c: schedule_delayed_work_on(cpu, reap_work,
Garbage collection. Again see the scenarios I described above. If the kernel is
not being heavily used on the isolated cpu there is not a whole lot of SLAB
activity, so not running the garbage collector is not a big deal.
Also SLUB does not have a per cpu garbage collector; people running RT apps
should simply switch to SLUB. So this one is a non-issue.

> mm/swap.c: return schedule_on_each_cpu(lru_add_drain_per_cpu);
This one is swap LRU handling. This is the only user of
schedule_on_each_cpu() btw. This case is similar to the above cases. Most
people doing RT either have no swap at all, or avoid any kind of swapping
activity on the CPUs used for RT. If they aren't already they should :).

> mm/vmstat.c: schedule_delayed_work_on(cpu, vmstat_work, HZ + cpu);
Not sure if it's an issue or not. It has not been for me.
And again if it is an issue it's not a catastrophic failure kind of thing.
There is not a whole lot of VM activity on the cpus running RT apps, otherwise
they won't run for very long ;-).

So as you can see, for all the current users that require strict workqueue cpu
affinity the impact is at most an inconvenience (like not being able to profile
cpuX, or stale stats). Nothing fundamental fails. We've been running all kinds of
machines with both scenarios #1 and #2 for weeks (rebooted for upgrades only).
They do not show any more problems than the machine with the regular setup.

There may be some other users that implicitly rely on the workqueue affinity
but I could not easily find them by looking at the code, nor did they show up
during testing.
If you know of any please let me know, we should convert them from
schedule_work()
to
schedule_work_on(cpuX)
to make the requirements clear.

Thanks
Max

2008-06-10 20:44:23

by Max Krasnyansky

[permalink] [raw]
Subject: Re: cpusets and kthreads, inconsistent behaviour



David Rientjes wrote:
> On Tue, 10 Jun 2008, Max Krasnyansky wrote:
>
>> Basically the issue is that current behaviour of the cpusets is inconsistent
>> with regards to kthreads. Kthreads inherit cpuset from a parent properly but
>> they simply ignore cpuset.cpus when their cpu affinity is set/updated.
>> I think the behaviour must be consistent across the board. cpuset.cpus must
>> apply to _all_ the tasks in the set, not just some of the tasks. If kthread
>> must run on the cpus other than current_cpuset.cpus then it should detach from
>> the cpuset.
>>
>
> I disagree that a cpuset's set of allowable cpus should apply to all tasks
> attached to that cpuset. It's certainly beneficial to be able to further
> constrict the set of allowed cpus for a task using sched_setaffinity().
>
> It makes more sense to argue that for each task p, p->cpus_allowed is a
> subset of task_cs(p)->cpus_allowed.
Yes that's exactly what I meant :). Sorry for not being clear. I did not mean
that each task in a cpuset must have the same affinity mask. So we're on the
same page here.

>> To give you an example kthreads like scsi_eh, kswapd, kacpid, pdflush,
>> kseriod, etc are all started with cpus_allowed=ALL_CPUS even though they
>> inherit a cpuset from kthreadd. Yes they can be moved manually (with
>> sched_setaffinity) but the behaviour is not consistent, and for no good
>> reason. kthreads can be stopped/started at any time (module load for example)
>> which means that the user will have to keep moving them.
>>
>
> This doesn't seem to be purely a kthread issue. Tasks can be moved to a
> disjoint set of cpus by any caller to set_cpus_allowed_ptr() in the
> kernel.
Hmm, technically you are correct of course. But we do not have any other
kernel tasks besides kthreads. All the kernel tasks I have on my machines have
kthreadd as their parent.
And I'm not aware of any kernel code that changes affinity mask of a
user-space task without paying attention to the cpuset the task belongs to. If
you know of any we should fix it because it'd clearly be a bug.

Thanx
Max

2008-06-10 20:55:52

by David Rientjes

[permalink] [raw]
Subject: Re: cpusets and kthreads, inconsistent behaviour

On Tue, 10 Jun 2008, Max Krasnyansky wrote:

> Hmm, technically you are correct of course. But we do not have any other
> kernel tasks besides kthreads. All the kernel tasks I have on my machines have
> kthreadd as their parent.
> And I'm not aware of any kernel code that changes affinity mask of a
> user-space task without paying attention to the cpuset the task belongs to. If
> you know of any we should fix it because it'd clearly be a bug.
>

This is why it shouldn't belong in the sched or kthread code; the
discrepancy that you point out between p->cpus_allowed and
task_cs(p)->cpus_allowed is a cpuset created one.

So to avoid having tasks with a cpus_allowed mask that is not a subset of
its cpuset's set of allowable cpus, the solution would probably be to add
a flavor of cpuset_update_task_memory_state() for a cpus generation value.

2008-06-10 21:15:38

by Max Krasnyansky

[permalink] [raw]
Subject: Re: cpusets and kthreads, inconsistent behaviour

David Rientjes wrote:
> On Tue, 10 Jun 2008, Max Krasnyansky wrote:
>
>> Hmm, technically you are correct of course. But we do not have any other
>> kernel tasks besides kthreads. All the kernel tasks I have on my machines have
>> kthreadd as their parent.
>> And I'm not aware of any kernel code that changes affinity mask of a
>> user-space task without paying attention to the cpuset the task belongs to. If
>> you know of any we should fix it because it'd clearly be a bug.
>>
>
> This is why it shouldn't belong in the sched or kthread code; the
> discrepancy that you point out between p->cpus_allowed and
> task_cs(p)->cpus_allowed is a cpuset created one.
I guess I do not see how the kernel task case is different from the
sched_setaffinity(). ie sched_setaffinity() is in the scheduler but it deals
with cpusets.

Also in this case the discrepancy is created _not_ by the cpuset, it's created
due to the lack of the appropriate API. In other words if we had something
like kthread_setaffinity() from day one this would have been a non-issue :).

> So to avoid having tasks with a cpus_allowed mask that is not a subset of
> its cpuset's set of allowable cpus, the solution would probably be to add
> a flavor of cpuset_update_task_memory_state() for a cpus generation value.
Policing after the fact would not work well in some scenarios due to lag time.
ie By the time cpusets realize that some kthread is running on the wrong cpus
it may be too late.
We should just enforce cpuset constraint when kthreads change their affinity
mask, just like sched_setaffinity() already does.

Max

2008-06-11 06:50:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: workqueue cpu affinity

On Tue, 2008-06-10 at 13:24 -0700, Max Krasnyansky wrote:
> Ok looks like we got deeper into workqueue discussion in the wrong mail thread
> :). So let me restart it.
>
> Here is some background on this. Full cpu isolation requires some tweaks to the
> workqueue handling. Either the workqueue threads need to be moved (which is my
> current approach), or work needs to be redirected when it's submitted.
> flush_*_work() needs to be improved too. See Peter's reply below.
>
> First reaction that a lot of people get is "oh no no, this is bad, this will
> not work". Which is understandable but _wrong_ ;-). See below for more details
> and analysis.
>
> One thing that helps in accepting this isolation idea is to think about the
> use cases. There are two uses cases for it:
> 1. Normal threaded RT apps with threads that use system calls, block on
> events, etc.
> 2. Specialized RT apps with thread(s) that require close to 100% of the CPU
> resources. Their threads avoid using system calls and avoid blocking. This is
> done to achieve very low latency and low overhead.
>
> Scenario #1 is straightforward. You'd want to isolate the processor the RT app
> is running on to avoid typical sources of latency. Workqueues running on the
> same processor is not an issue (because RT threads block), but you do not get
> the same latency guarantees.
>
> Workqueues are an issue for the scenario #2. Workqueue kthreads do not get a
> chance to run because user's RT threads are higher priority. However those RT
> threads should not use regular kernel services because that by definition
> means that they are not getting ~100% of the CPU they want. In other words
> they cannot have it both ways :).
>
> Therefore it's expected that the kernel won't be used heavily on those cpus,
> and nothing really schedules workqueues and stuff. It's also expected that
> certain kernel services may not be available on those CPUs. Again we cannot
> have it both ways. ie Have all the kernel services and yet the kernel is not
> supposed to use much CPU time :).
>
> If at this point people still get this "Oh no, that's wrong" feeling, please
> read this excellent statement by Paul J
>
> "A key reason that Linux has succeeded is that it actively seeks to work for a
> variety of people, purposes and products. One operating system is now a strong
> player in the embedded market, the real time market, and the High Performance
> Computing market, as well as being an important player in a variety of other
> markets. That's a rather stunning success."
> — Paul Jackson, in a June 4th, 2008 message on the Linux Kernel mailing list.

While true, that doesn't mean we'll just merge anything :-)

> btw Paul, it got picked up by the kerneltrap.org
> http://kerneltrap.org/Quote/A_Rather_Stunning_Success
>
> Sorry for the lengthy intro. Back to the technical discussions.
>
> Peter Zijlstra wrote:
> > The advantage of creating a more flexible or fine-grained flush is that
> > large machines also profit from it.
> I agree, our current workqueue flush scheme is expensive because it has to
> schedule on each online cpu. So yes improving flush makes sense in general.
>
> > A simple scheme would be creating a workqueue context that is passed
> > along on enqueue, and then passed to flush.
> >
> > This context could:
> >
> > - either track the individual worklets and employ a completion scheme
> > to wait for them;
> >
> > - or track on which cpus the worklets are enqueued and flush only those
> > few cpus.
> >
> > Doing this would solve your case since nobody (except those having
> > business) will enqueue something on the isolated cpus.
> >
> > And it will improve the large machine case for the same reasons - it
> > won't have to iterate all cpus.
> This will require a bit of surgery across the entire tree. There is a lot of
> code that calls flush_scheduled_work(). All that would have to be changed,
> which is ok, but I think as the first step we could simply allow moving
> workqueue threads out of cpus where that load is undesirable and make people
> aware of what happens in that case.
> When I get a chance I'll look into the flush scheme you proposed above.
>
> > Of course, things that use schedule_on_each_cpu() will still end up
> > doing things on your isolated cpus, but getting around those would
> > probably get you into some correctness trouble.
> There is literally a _single_ user of that API.

There are quite a few more on -rt, where a lot of on_each_cpu() callers
that now use IPIs and run in hardirq context are converted to schedule.

> Actually let's look at all the current users of the schedule_on(cpu) kind of API.
>
> git grep
> 'queue_delayed_work_on\|schedule_delayed_work_on\|schedule_on_each_cpu' |\
> grep -v 'workqueue\.[ch]\|\.txt'
>
> > drivers/cpufreq/cpufreq_ondemand.c: queue_delayed_work_on(cpu, kondemand_wq, &dbs_info->work, delay);
> > drivers/cpufreq/cpufreq_ondemand.c: queue_delayed_work_on(dbs_info->cpu, kondemand_wq, &dbs_info->work,
> No big deal. Worst case the cpufreq state for that cpu will be stale.
> RT apps would not want cpufreq governor messing with the cpu frequencies
> anyway. So if you look back at the scenarios #1 and #2 I described above this
> is non-issue.

Sure, ondemand cpu_freq doesn't make sense while running (hard) rt
apps.

> > drivers/macintosh/rack-meter.c: schedule_delayed_work_on(cpu, &rcpu->sniffer,
> > drivers/macintosh/rack-meter.c: schedule_delayed_work_on(cpu, &rm->cpu[cpu].sniffer,
> Not a big deal either. In the worst case stats for the isolated cpus will not
> be updated. Can probably be converted to timers.

sure... see below [1]

> > drivers/oprofile/cpu_buffer.c: schedule_delayed_work_on(i, &b->work, DEFAULT_TIMER_EXPIRE + i);
> > drivers/oprofile/cpu_buffer.c: * By using schedule_delayed_work_on and then schedule_delayed_work
> Yep I mentioned before that messing with the workqueues breaks oprofile. So
> yes this one is an issue. However again it's not a catastrophic failure of the
> system. Oprofile will not be able to collect samples from the CPU the RT app
> is running on and it actually warns the user about it (it prints an error that
> the work is running on the wrong cpu). I'm working on a patch that collects
> samples via IPI or per cpu timer. It will be configurable of course. So this
> one is not a big deal either.

NMI/timers sound like a good way to run oprofile - I thought it could
already use them.. ? Anyway, see below.. [2]

> > mm/slab.c: schedule_delayed_work_on(cpu, reap_work,
> Garbage collection. Again see the scenarios I described above. If the kernel is
> not being heavily used on the isolated cpu there is not a whole lot of SLAB
> activity, so not running the garbage collector is not a big deal.
> Also SLUB does not have a per cpu garbage collector; people running RT apps
> should simply switch to SLUB. So this one is a non-issue.

Dude, SLUB uses on_each_cpu(), that's even worse for your #2 case. Hmm
so does SLAB.. and a lot of other code.

> > mm/swap.c: return schedule_on_each_cpu(lru_add_drain_per_cpu);
> This one is swap LRU handling. This is the only user of
> schedule_on_each_cpu() btw. This case is similar to the above cases. Most
> people doing RT either have no swap at all, or avoid any kind of swapping
> activity on the CPUs used for RT. If they aren't already they should :).

It isn't actually swap only - it's all paging, including pagecache etc..
Still, you're probably right in that the per cpu lrus are empty, but why
not improve the current scheme by keeping a cpumask of cpus with
non-empty pagevecs, that way everybody wins.
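
Sketch of that idea (lru_pvec_dirty and the hook are made up; the wait/flush
part of the real lru_add_drain_all() is left out):

	static cpumask_t lru_pvec_dirty;	/* cpus with pending pagevec pages */

	/* call this wherever a page is put on this cpu's pagevec */
	static inline void mark_lru_pvec_dirty(void)
	{
		cpu_set(smp_processor_id(), lru_pvec_dirty);
	}

	/* lru_add_drain_all() would then schedule the per-cpu drain work only
	 * on the cpus set in lru_pvec_dirty, clearing each bit as it goes,
	 * instead of on every online cpu. */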

> > mm/vmstat.c: schedule_delayed_work_on(cpu, vmstat_work, HZ + cpu);
> Not sure if it's an issue or not. It has not been for me.
> And again if it is an issue it's not a catastrophic failure kind of thing.
> There is not a whole lot of VM activity on the cpus running RT apps, otherwise
> they won't run for very long ;-).

Looking at this code I'm not seeing the harm in letting it run - even
for your #2 case, it certainly is not worse than some of the
on_each_cpu() code, and starving it doesn't seem like a big issue.

---

I'm worried by your approach to RT - both your solutions [1,2] and
oversight of the on_each_cpu() stuff seem to indicate you don't care
about some jitter on your isolated cpu.

Timers and on_each_cpu() code run with hardirqs disabled and can do all
kinds of funny stuff like spin on shared locks. This will certainly
affect your #2 case.


Again, the problem with most of your ideas is that they are very narrow
- they fail to consider the bigger picture/other use-cases.

To quote Paul again:
"A key reason that Linux has succeeded is that it actively seeks
to work for a variety of people, purposes and products"

You often seem to forget 'variety' and target only your one use-case.

I'm not saying it doesn't work for you - I'm just saying that by putting
in a little more effort (ok, -rt is a lot more effort) we can make it
work for a lot more people by taking out a lot of the restrictions
you've put upon yourself.

Please don't take this too personal - I'm glad you're working on this.
I'm just trying to see what we can generalize.

2008-06-11 16:06:43

by Oleg Nesterov

[permalink] [raw]
Subject: Re: workqueue cpu affinity

On 06/10, Max Krasnyansky wrote:
>
> Here is some background on this. Full cpu isolation requires some tweaks to the
> workqueue handling. Either the workqueue threads need to be moved (which is my
> current approach), or work needs to be redirected when it's submitted.

_IF_ we have to do this, I think it is much better to move cwq->thread.

> Peter Zijlstra wrote:
> > The advantage of creating a more flexible or fine-grained flush is that
> > large machine also profit from it.
> I agree, our current workqueue flush scheme is expensive because it has to
> schedule on each online cpu. So yes improving flush makes sense in general.

Yes, it is easy to implement flush_work(struct work_struct *work) which
only waits for that work, so it can't hang unless it was enqueued on the
isolated cpu.

But in most cases it is enough to just do

if (cancel_work_sync(work))
work->func(work);

Or we can add flush_workqueue_cpus(struct workqueue_struct *wq, cpumask_t *cpu_map).

But I don't think we should change the behaviour of flush_workqueue().

> This will require a bit of surgery across the entire tree. There is a lot of
> code that calls flush_scheduled_work()

Almost all of them should be changed to use cancel_work_sync().
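
For a typical driver the conversion would look roughly like this (mydrv
is a made-up example to show the pattern, not real code):

	static void mydrv_teardown(struct mydrv *drv)
	{
		/* wait only for our own work item, instead of
		 * flush_scheduled_work()'s wait for everything on
		 * keventd; if it was still pending, run it by hand
		 * when its side effects are still wanted */
		if (cancel_work_sync(&drv->work))
			drv->work.func(&drv->work);
	}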

Oleg.

2008-06-11 19:02:22

by Max Krasnyansky

[permalink] [raw]
Subject: Re: workqueue cpu affinity

Peter Zijlstra wrote:
> On Tue, 2008-06-10 at 13:24 -0700, Max Krasnyansky wrote:
>> If at this point people still get this "Oh no, that's wrong" feeling, please
>> read this excellent statement by Paul J
>>
>> "A key reason that Linux has succeeded is that it actively seeks to work for a
>> variety of people, purposes and products. One operating system is now a strong
>> player in the embedded market, the real time market, and the High Performance
>> Computing market, as well as being an important player in a variety of other
>> markets. That's a rather stunning success."
>> — Paul Jackson, in a June 4th, 2008 message on the Linux Kernel mailing list.
>
> While true, that doesn't mean we'll just merge anything :-)
No, of course not.

>>> Of course, things that use schedule_on_each_cpu() will still end up
>>> doing things on your isolated cpus, but getting around those would
>>> probably get you into some correctness trouble.
>> There is literally a _single_ user of that API.
>
> There are quite a few more in -rt, where a lot of on_each_cpu() callers
> that now use IPIs and run in hardirq context are converted to schedule.
Yes I noticed that. I did not look deep enough though. I played with -rt
kernels but did not look over all the changes.

>>> drivers/oprofile/cpu_buffer.c: schedule_delayed_work_on(i, &b->work, DEFAULT_TIMER_EXPIRE + i);
>>> drivers/oprofile/cpu_buffer.c: * By using schedule_delayed_work_on and then schedule_delayed_work
>> Yep I mentioned before that messing with the workqueues breaks oprofile. So
>> yes this one is an issue. However again it's not a catastrophic failure of the
>> system. Oprofile will not be able to collect samples from the CPU the RT app is
>> running on and it actually warns the user about it (it prints an error that
>> the work is running on the wrong cpu). I'm working on a patch that collects
>> samples via IPI or per cpu timer. It will be configurable of course. So this
>> one is not a big deal either.
>
> NMI/timers sound like a good way to run oprofile - I thought it could
> already use them.. ? Anyway, see below.. [2]
Yeah, I'm not sure why oprofile uses workqueues. Seems like a much heavier way
to collect samples than a lightweight per cpu timer.

>>> mm/slab.c: schedule_delayed_work_on(cpu, reap_work,
>> Garbage collection. Again see scenarios I described above. If the kernel is
>> not being heavily used on the isolated cpu there is not a whole lot of SLAB
>> activity, so not running the garbage collector is not a big deal.
>> Also, SLUB does not have a per cpu garbage collector; people running RT apps
>> should simply switch to SLUB. So this one is a non-issue.
>
> Dude, SLUB uses on_each_cpu(), that's even worse for your #2 case. Hmm
> so does SLAB.. and a lot of other code.
Sure. I was talking about _workqueues_ specifically.
There is also add_timer_on().

Also on_each_cpu() is an IPI. It's not necessarily worse for the #2 case.
Depending on what the IPI does, of course. In the SLUB case it's essentially a
noop on a CPU that is not a heavy SLUB user. I've been meaning to profile
IPIs and timers but they have not generated enough noise.

btw Converting everything to threads like -rt does is not necessarily the best
solution for all cases. Look at the "empty SLUB" case. The IPI will get in and
out with minimal intrusion. The threaded approach will have to run the
scheduler and do a context switch. Granted in-kernel context switches are
super fast but still not as fast as the IPI. Just to clarify I'm not saying
-rt is doing the wrong thing, I'm just saying it's not black and white that
threads are better in _all_ cases.
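
To make the "empty SLUB" point concrete, roughly what such an IPI handler
looks like (percpu_cache, cpu_cache and drain_cache are made-up names for
illustration, not the actual SLUB code):

	/* illustrative per-cpu state, not the real SLUB structures */
	struct percpu_cache {
		int nr_objects;
	};
	static DEFINE_PER_CPU(struct percpu_cache, cpu_cache);

	/* runs in hardirq context on every online cpu */
	static void flush_cpu_caches(void *unused)
	{
		struct percpu_cache *c = &__get_cpu_var(cpu_cache);

		/* a cpu with nothing cached pays only the interrupt
		 * entry/exit cost - a near-noop */
		if (!c->nr_objects)
			return;
		drain_cache(c);	/* made-up stand-in for the real flush */
	}

	/* 2.6.25-era signature: on_each_cpu(func, info, retry, wait) */
	on_each_cpu(flush_cpu_caches, NULL, 0, 1);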

>>> mm/swap.c: return schedule_on_each_cpu(lru_add_drain_per_cpu);
>> This one is swap LRU handling. This is the only user of
>> schedule_on_each_cpu() btw. This case is similar to the above cases. Most
>> people doing RT either have no swap at all, or avoid any kind of swapping
>> activity on the CPUs used for RT. If they aren't already they should :).
>
> It isn't actually swap only - it's all paging, including pagecache etc..
> Still, you're probably right in that the per cpu lrus are empty,
Oh, I missed the pagecache part. btw It's also used only for the CONFIG_NUMA
case.

> but why not improve the current scheme by keeping a cpumask of cpus with
> non-empty pagevecs, that way everybody wins.
I did not know enough about it to come up with such an idea :). So yes, now that
you mention it, it sounds like a good thing to do.

>>> mm/vmstat.c: schedule_delayed_work_on(cpu, vmstat_work, HZ + cpu);
>> Not sure if it's an issue or not. It has not been for me.
>> And again if it is an issue it's not a catastrophic failure kind of thing.
>> There is not a whole lot of VM activity on the cpus running RT apps, otherwise
>> they won't run for very long ;-).
>
> Looking at this code I'm not seeing the harm in letting it run - even
> for your #2 case, it certainly is not worse than some of the
> on_each_cpu() code, and starving it doesn't seem like a big issue.
Yes, that's what I meant, i.e. that nothing needs to be done here. We just let it run
and if it starves, it starves - no big deal.

> ---
> I'm worried by your approach to RT - both your solutions [1,2] and
> oversight of the on_each_cpu() stuff seem to indicate you don't care
> about some jitter on your isolated cpu.
Hang on, why do you say an oversight of on_each_cpu()?
I started this discussion about workqueues specifically. on_each_cpu() is an
IPI. Actually, a while ago I looked at most uses of the xxx_on_cpu() API,
including for_each_xxx_cpu(). Sure I may have missed something, but there is
no oversight in general. Which is why I've solicited feedback for a while now.
As I mentioned a while back, the reason I'm doing this stuff is precisely
because I care about the jitter. And what I'm getting is 1-2usec on the
isolated CPUs when the non-isolated one is loaded to the extreme. This is on the
NUMA dual Opteron box. None of the existing -rt solutions I tried (RTAI,
Xenomai and -rt) could do this.

btw I'm not sure what your concern is for the #1 use case (ie regular -rt
threads that block). It's not exactly my solution ;-). I just mentioned it as
a use case that benefits from CPU isolation. In that case again due to the
fact that the heaviest latency sources have been moved to the other CPUs, rt apps
enjoy fairly low latencies.

> Timers and on_each_cpu() code run with hardirqs disabled and can do all
> kinds of funny stuff like spin on shared locks. This will certainly
> affect your #2 case.
It does. See above. Once most things are moved, the overall impact of those IPIs is
fairly small.

> Again, the problem with most of your ideas is that they are very narrow
> - they fail to consider the bigger picture/other use-cases.
Uh, that's a very broad statement ;-). I'm actually surprised that you draw
such a conclusion. I'm not even sure what ideas specifically you're talking
about. If you look at my latest cpuisol tree it contains exactly 4 patches.
Three that you've already ACKed and are either bug fixes or simple extensions.
So it's essentially a vanilla kernel running on off-the-shelf hardware and
providing awesome latencies. I'm not sure how that is a bad thing or how it
fails to consider the big picture.

> To quote Paul again:
> "A key reason that Linux has succeeded is that it actively seeks
> to work for a variety of people, purposes and products"
>
> You often seem to forget 'variety' and target only your one use-case.
Please be more specific. Which part targets only my use case? I'm assuming
that by _my_ use case you mean #2 - "rt thread that never blocks", because the
other use case is totally generic.
I don't see how things that I'm doing address only my use case. There is
clearly some usage of the scheduler isolation capabilities already. Look for
example at this wiki page:
http://rt.wiki.kernel.org/index.php/CPU_shielding_using_/proc_and_/dev/cpuset
My changes clearly improve that use case.

> I'm not saying it doesn't work for you - I'm just saying that by putting
> in a little more effort (ok, -rt is a lot more effort) we can make it
> work for a lot more people by taking out a lot of the restrictions
> you've put upon yourself.
Actually it may seem like a lot of restrictions but in reality the only
restriction is on how RT threads are designed (and there will always be some).
Otherwise you get an off-the-shelf box with an essentially vanilla kernel that
can be used in any way you like and can run very tight RT apps at the same time.

> Please don't take this too personal - I'm glad you're working on this.
> I'm just trying to see what we can generalize.
Oh, no worries, I'm not taking this personally, except maybe the "most of your
ideas suck" part which got me a little bit ;-). I'm definitely all for making
it suitable for more general usage.
This is actually the first constructive criticism (except for the "most of your
ideas suck" part :). Oleg had some specific suggestions about workqueues. But
otherwise it's been mostly "oh, it feels wrong" kind of thing with no specific
suggestions and/or pointers.

Thanks for your suggestions and pointers. I definitely appreciate them.
I'll send a separate email with the summary of the discussion. This one is too
long :).

Once again thanks a lot for the detailed reply
Max

2008-06-11 19:21:32

by Max Krasnyansky

[permalink] [raw]
Subject: Re: workqueue cpu affinity

Oleg Nesterov wrote:
> On 06/10, Max Krasnyansky wrote:
>> Here is some background on this. Full cpu isolation requires some tweaks to the
>> workqueue handling. Either the workqueue threads need to be moved (which is my
>> current approach), or work needs to be redirected when it's submitted.
>
> _IF_ we have to do this, I think it is much better to move cwq->thread.
Ok. btw That's what I'm doing now from user-space.

>> Peter Zijlstra wrote:
>>> The advantage of creating a more flexible or fine-grained flush is that
>>> large machine also profit from it.
>> I agree, our current workqueue flush scheme is expensive because it has to
>> schedule on each online cpu. So yes improving flush makes sense in general.
>
> Yes, it is easy to implement flush_work(struct work_struct *work) which
> only waits for that work, so it can't hang unless it was enqueued on the
> isolated cpu.
>
> But in most cases it is enough to just do
>
> if (cancel_work_sync(work))
> work->func(work);
Cool. That would work.
btw Somehow I thought that you already implemented flush_work(). I do not see
it in 2.6.25 but I could've sworn that I saw a patch flying by. Must have been
something else. Do you mind adding that?

> Or we can add flush_workqueue_cpus(struct workqueue_struct *wq, cpumask_t *cpu_map).
That'd be special casing. I mean something will have to know what cpus cannot
be flushed. I liked your proposal above much better.

> But I don't think we should change the behaviour of flush_workqueue().
>
>> This will require a bit of surgery across the entire tree. There is a lot of
>> code that calls flush_scheduled_work()
>
> Almost all of them should be changed to use cancel_work_sync().

That'd be a lot of changes.

git grep flush_scheduled_work | wc
154 376 8674

Hmm, I guess maybe not that bad. I might actually do that :-).

Max

2008-06-11 20:44:19

by Max Krasnyansky

[permalink] [raw]
Subject: Re: workqueue cpu affinity

Previous emails were very long :). So here is an executive summary of the
discussions so far:

----
Workqueue kthread starvation by non-blocking user RT threads.

Starving workqueue threads on the isolated cpus does not seem like a big
deal. All current mainline users of the schedule_on_cpu() kind of API can live
with it. Starvation of the workqueue threads is an issue for the -rt kernels.
See http://marc.info/?l=linux-kernel&m=121316707117552&w=2 for more info.

If absolutely necessary, moving workqueue threads from the isolated cpus is
also not a big deal, even for cpu hotplug. It's certainly _not_ encouraged in
general but at the same time is not strictly prohibited either, because
nothing fundamental breaks (that's what my current isolation solution does).

----
Optimize workqueue flush.

The current flush_workqueue() implementation is an issue for the starvation case
mentioned above, and in general it's not very efficient because it has to
schedule on each online cpu.

Peter suggested rewriting flush logic to avoid scheduling on each online cpu.

Oleg suggested converting existing users of flush_scheduled_work() to
cancel_work_sync(work), which provides fine-grained flushing and does not
schedule on each cpu.

Both suggestions would improve overall performance and address the case
where the machine gets stuck due to workqueue thread starvation.
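
In outline, the fine-grained flush might look like this (a simplified
sketch: wq_barrier, get_wq_data and insert_wq_barrier are internal
workqueue helpers with signatures approximated here, and the cwq locking
a real implementation needs is omitted):

	/* wait for one specific work item instead of the whole queue */
	int flush_work(struct work_struct *work)
	{
		struct cpu_workqueue_struct *cwq = get_wq_data(work);
		struct wq_barrier barr;

		if (!cwq)
			return 0;
		/* queue a barrier behind the target work on the same
		 * per-cpu queue, then wait for the barrier to run */
		insert_wq_barrier(cwq, &barr, work);
		wait_for_completion(&barr.done);
		return 1;
	}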

----
Timer or IPI based Oprofile.

Currently oprofile collects samples by using schedule_delayed_work_on(), which
means that if workqueue threads are starved on, or moved from, cpuX, oprofile
fails to collect samples on that cpu.

It seems that it can be easily converted to use a per-CPU timer or IPI.
This might be useful in general (i.e. less expensive) and will take care of the
issue described above.
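
A rough sketch of the per-CPU timer variant (sample_timer, flush_cpu_samples
and SAMPLE_INTERVAL are illustrative names, not the actual oprofile code):

	static DEFINE_PER_CPU(struct timer_list, sample_timer);

	static void sample_timer_fn(unsigned long cpu)
	{
		/* flush this cpu's sample buffer locally, with no
		 * dependency on a workqueue kthread that may be
		 * starved or moved off the cpu */
		flush_cpu_samples(cpu);
		mod_timer(&per_cpu(sample_timer, cpu),
			  jiffies + SAMPLE_INTERVAL);
	}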

----
Optimize pagevec drain.

The current pagevec drain logic on NUMA boxes schedules work on each
online cpu. It's not an issue for CPU isolation per se but can be improved
in general.
Peter suggested keeping a cpumask of cpus with non-empty pagevecs, which will
not require scheduling work on each cpu.
I wonder if there is something on that front in Nick's latest patches.
CC'ing Nick.

----

Did I miss anything?


Max

2008-06-12 16:34:08

by Oleg Nesterov

[permalink] [raw]
Subject: Re: workqueue cpu affinity

On 06/11, Max Krasnyansky wrote:
>
> Oleg Nesterov wrote:
> >
> > Yes, it is easy to implement flush_work(struct work_struct *work) which
> > only waits for that work, so it can't hang unless it was enqueued on the
> > isolated cpu.
> >
> > But in most cases it is enough to just do
> >
> > if (cancel_work_sync(work))
> > work->func(work);
> Cool. That would work.
> btw Somehow I thought that you already implemented flush_work(). I do not see
> it in 2.6.25 but I could've sworn that I saw a patch flying by. Must have been
> something else. Do you mind adding that?

Well... I don't think Andrew will take this patch right now... OK, I'll send
the preparation patch with comments. Do you see an immediate user for this
helper?

> > Or we can add flush_workqueue_cpus(struct workqueue_struct *wq, cpumask_t *cpu_map).
> That'd be special casing. I mean something will have to know what cpus cannot
> be flushed.

OK, we can make it flush_workqueue_except_isolated_cpus(struct workqueue_struct *wq).

> I liked your proposal above much better.

it was Peter who suggested this ;)

> > But I don't think we should change the behaviour of flush_workqueue().
> >
> >> This will require a bit of surgery across the entire tree. There is a lot of
> >> code that calls flush_scheduled_work()
> >
> > Almost all of them should be changed to use cancel_work_sync().
>
> That'd be a lot of changes.
>
> git grep flush_scheduled_work | wc
> 154 376 8674
>
> Hmm, I guess maybe not that bad. I might actually do that :-).

Cool! I _bet_ you will find a lot of bugs ;)

Oleg.

2008-06-12 18:44:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: workqueue cpu affinity

Sorry for being late,.. and I'm afraid most will have to wait a bit
longer :-(

On Wed, 2008-06-11 at 12:02 -0700, Max Krasnyansky wrote:
> Peter Zijlstra wrote:

> > Please don't take this too personal - I'm glad you're working on this.
> > I'm just trying to see what we can generalize.

> Oh, no worries, I'm not taking this personally, except maybe the "most your
> ideas suck" part which got me a little bit ;-). I'm definitely all for making
> it suitable for more general usage.
> This is actually first constructive criticism (except for the "most of your
> ideas suck" part :).

No, no, you understood me wrong (or I expressed myself wrong). Your
ideas are good, it's just the implementation / execution I have issues
with.

Like with extending the isolation map, which didn't leave any room for
hard-rt smp schedulers or multiple rt domains. Whereas the cpuset stuff
does.


2008-06-12 19:10:29

by Max Krasnyansky

[permalink] [raw]
Subject: Re: workqueue cpu affinity

Peter Zijlstra wrote:
> Sorry for being late,.. and I'm afraid most will have to wait a bit
> longer :-(
>
> On Wed, 2008-06-11 at 12:02 -0700, Max Krasnyansky wrote:
>> Peter Zijlstra wrote:
>
>>> Please don't take this too personal - I'm glad you're working on this.
>>> I'm just trying to see what we can generalize.
>
>> Oh, no worries, I'm not taking this personally, except maybe the "most your
>> ideas suck" part which got me a little bit ;-). I'm definitely all for making
>> it suitable for more general usage.
>> This is actually first constructive criticism (except for the "most of your
>> ideas suck" part :).
>
> No, no, you understood me wrong (or I expressed myself wrong). Your
> ideas are good, it's just the implementation / execution I have issues
> with.
>
> Like with extending the isolation map, which didn't leave any room for
> hard-rt smp schedulers or multiple rt domains. Whereas the cpuset stuff
> does.
Yep. And I redid it completely and switched gears to fix/improve cpusets,
genirq, etc.
Anyway, I think we're on the same page. Please look over the summary that I
sent out and see if I missed anything.

Max