The bug: a task may run on a cpu/node that is not in its cpuset.cpus/
cpuset.mems.
It can be reproduced by the following commands:
-----------------------------------
# mkdir /dev/cpuset
# mount -t cpuset xxx /dev/cpuset
# mkdir /dev/cpuset/0
# echo 0-1 > /dev/cpuset/0/cpus
# echo 0 > /dev/cpuset/0/mems
# echo $$ > /dev/cpuset/0/tasks
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 1 > /sys/devices/system/cpu/cpu1/online
-----------------------------------
There is only CPU0 in cpuset.cpus, but the task in this cpuset runs on
both CPU0 and CPU1.
This happens because the task's cpus_allowed is not updated after the
CPU offline/online manipulation. The same applies to mems_allowed.
This patch fixes the bug except for the root cpuset, because there is
an open question about the root cpuset: should all of its tasks be
updated after a cpu/node goes offline/online?
If we update them, kernel threads that are bound to a specific cpu will
be unbound.
If we do not update them, the root cpuset still has a bug, also caused
by offline/online manipulation. For example, take a dual-cpu machine:
create a sub-cpuset under the root cpuset, assign cpu 1 to its cpus,
and then attach some tasks to this sub-cpuset. Now offline CPU1. The
tasks in the sub-cpuset are moved into the root cpuset automatically
because the sub-cpuset has no cpus left. Then online CPU1 again: all
the tasks that did not originally belong to the root cpuset keep
running only on CPU0.
Maybe we need to add a flag in the task_struct to mark which tasks
can't be unbound?
Signed-off-by: Miao Xie <[email protected]>
Acked-by: Paul Jackson <[email protected]>
---
kernel/cpuset.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index aa955ee..ea7fdb9 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1876,6 +1876,7 @@ static void scan_for_empty_cpusets(const struct cpuset *root)
struct cpuset *child; /* scans child cpusets of cp */
struct list_head queue;
struct cgroup *cont;
+ nodemask_t oldmems;
INIT_LIST_HEAD(&queue);
@@ -1895,6 +1896,8 @@ static void scan_for_empty_cpusets(const struct cpuset *root)
nodes_subset(cp->mems_allowed, node_states[N_HIGH_MEMORY]))
continue;
+ oldmems = cp->mems_allowed;
+
/* Remove offline cpus and mems from this cpuset. */
mutex_lock(&callback_mutex);
cpus_and(cp->cpus_allowed, cp->cpus_allowed, cpu_online_map);
@@ -1906,6 +1909,10 @@ static void scan_for_empty_cpusets(const struct cpuset *root)
if (cpus_empty(cp->cpus_allowed) ||
nodes_empty(cp->mems_allowed))
remove_tasks_in_empty_cpuset(cp);
+ else {
+ update_tasks_cpumask(cp);
+ update_tasks_nodemask(cp, &oldmems);
+ }
}
}
--
1.5.4.rc3
On Tue, Jun 3, 2008 at 7:05 PM, Miao Xie <[email protected]> wrote:
>
> Maybe we need to add a flag in the task_struct to mark which tasks
> can't be unbound?
Yes, something like a PF_CPU_BOUND flag that gets set in
kthread_bind() and checked for in set_cpus_allowed() would be the
right way to handle that. We have an internal change like that against
an older kernel version - I'll see if I can find someone to forward
port it and send it in.
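Roughly, the shape of that idea (the flag value and exact call sites
here are illustrative, not the internal change mentioned above):
-----------------------------------
/* include/linux/sched.h: claim a free PF_* bit (value hypothetical) */
#define PF_CPU_BOUND	0x04000000	/* kthread permanently bound to one cpu */

/* kernel/kthread.c: mark the thread once it is permanently bound */
void kthread_bind(struct task_struct *k, unsigned int cpu)
{
	/* ... existing code binding @k to @cpu ... */
	k->flags |= PF_CPU_BOUND;	/* refuse any later rebinding */
}

/* kernel/sched.c: reject affinity changes for such threads */
int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
{
	if (p->flags & PF_CPU_BOUND)
		return -EINVAL;
	/* ... existing affinity update ... */
}
-----------------------------------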
Paul
On Wed, Jun 4, 2008 at 2:30 AM, Paul Menage <[email protected]> wrote:
> On Tue, Jun 3, 2008 at 7:05 PM, Miao Xie <[email protected]> wrote:
>>
>> Maybe we need to add a flag in the task_struct to mark which tasks
>> can't be unbound?
>
> Yes, something like a PF_CPU_BOUND flag that gets set in
> kthread_bind() and checked for in set_cpus_allowed() would be the
> right way to handle that. We have an internal change like that against
> an older kernel version - I'll see if I can find someone to forward
> port it and send it in.
And in fact they already did, and sent it to lkml a few months ago:
http://marc.info/?l=linux-kernel&m=120419372032623
Paul
Paul M, quoting Miao Xie <[email protected]>:
>>
>> Maybe we need to add a flag in the task_struct to mark which tasks
>> can't be unbound?
Do we need a new PF_* flag for this? Perhaps one can test for this
by examining the currently available properties of tasks. Would it
be sufficient to look for kernel threads (NULL mm_struct) whose
cpus_allowed is a strict subset of the online CPUs?
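Something like this rough test, say (the helper name is hypothetical,
and whether "strict subset" is the right predicate is exactly the
question):
-----------------------------------
/* kernel thread whose affinity looks deliberately narrowed */
static inline int looks_cpu_bound(const struct task_struct *p)
{
	return p->mm == NULL &&				/* kernel thread */
	       cpus_subset(p->cpus_allowed, cpu_online_map) &&
	       !cpus_equal(p->cpus_allowed, cpu_online_map); /* strict */
}
-----------------------------------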
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
On Wed, 4 Jun 2008, Paul Jackson wrote:
> Do we need a new PF_* flag for this? Perhaps one can test for this
> by examining the currently available properties of tasks. Would it
> be sufficient to look for kernel threads (NULL mm_struct) whose
> cpus_allowed is a strict subset of the online CPUs?
>
That would only identify kthreads that have been created with a subsequent
call to set_cpus_allowed() or kthread_bind().
The PF_CPU_BOUND change targets only the latter since there are kthreads,
such as kstopmachine, that can continue to manipulate their cpus_allowed
during their lifetime.
Other kthreads such as the scheduler migration thread and soft lockup
watchdog thread, however, always stay bound to a single cpu by use of
kthread_bind(). These are the tasks that get the PF_CPU_BOUND flag and
cannot be rebound via set_cpus_allowed(), because rebinding them would
have negative effects.
David
David wrote:
> That would only identify kthreads that have been created with a subsequent
> call to set_cpus_allowed() or kthread_bind().
>
> The PF_CPU_BOUND change targets only the latter since there are kthreads,
> such as kstopmachine, that can continue to manipulate their cpus_allowed
> during their lifetime.
Would your first sentence be more clearly written as:
> That would identify both kthreads that have been created with a subsequent
> call to set_cpus_allowed() or kthread_bind().
Or do I misunderstand?
If I am reading you correctly, then would it work to have a check in the
cpuset code (rather than in the lower set_cpus_allowed() routine),
where that check refused to move tasks out of the root cpuset if they
were (1) kernel threads (mm NULL) and (2) had cpus_allowed that were a
strict subset of the root cpuset's 'cpus' (the online cpus).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
On Wed, 4 Jun 2008, Paul Jackson wrote:
> > That would identify both kthreads that have been created with a subsequent
> > call to set_cpus_allowed() or kthread_bind().
>
> Or do I misunderstand?
>
> If I am reading you correctly, then would it work to have a check in the
> cpuset code (rather than in the lower set_cpus_allowed() routine),
> where that check refused to move tasks out of the root cpuset if they
> were (1) kernel threads (mm NULL) and (2) had cpus_allowed that were a
> strict subset of the root cpuset's 'cpus' (the online cpus).
>
No, because sched_setaffinity() can still move the threads.
David wrote:
> No, because sched_setaffinity() can still move the threads.
Good point.
How about also adding this check (NULL mm and strict-subset cpus)
to sched_setaffinity? (Both checks are sketched below.)
Granted, this pair of checks, in cpusets and sched_setaffinity,
is getting to be a little more work than the PF_* flag.
I guess one question would be how hard we want to work to avoid
consuming another PF_* flag. I thought they were scarce, but I
might be wrong.
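In code, the pair of checks might look like this (guard locations are
illustrative, reusing the hypothetical looks_cpu_bound() helper
sketched in my earlier mail):
-----------------------------------
/* in the cpuset attach path: refuse to pull a pinned kthread
 * out of the root cpuset */
	if (looks_cpu_bound(tsk))
		return -EINVAL;

/* and the same guard near the top of sched_setaffinity() */
	if (looks_cpu_bound(p))
		return -EINVAL;
-----------------------------------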
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
On Wed, 4 Jun 2008, Paul Jackson wrote:
> How about also adding this check (NULL mm and strict-subset cpus)
> to sched_setaffinity?
>
> Granted, this pair of checks, in cpusets and sched_setaffinity,
> is getting to be a little more work than the PF_* flag.
>
Yes, an alternative to the PF_CPU_BOUND flag is to prevent calls to
set_cpus_allowed() for kthreads that should not change affinity.
> I guess one question would be how hard we want to work to avoid
> consuming another PF_* flag. I thought they were scarce, but I
> might be wrong.
>
I think we both agree that in terms of the code itself, adding the flag
and doing the check in set_cpus_allowed() is the optimal solution. It can
simply return -EINVAL for PF_CPU_BOUND threads and the threads don't get
moved.
It's debatable whether allocating an additional PF_* flag for this purpose
is worthwhile.
I usually don't advocate against adding bits that have a clear and
well-defined purpose merely on the grounds that we might later need
bits for cases that can't be worked around as easily as this one. It's
helpful to be conservative in allocating new flags, but not at the
expense of the code itself, especially when loopholes can easily be
introduced that would otherwise have been prevented.
So I'll repost the patch and see where it goes.
David
David wrote:
> So I'll repost the patch and see where it goes.
Ok. Perhaps you could add a brief comment, noting that
the additional PF_* flag could perhaps be avoided with
checks in cpusets and sched_setaffinity, as explained in
http://lkml.org/lkml/2008/6/4/388.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214