OK, here is hopefully the last iteration of the dynamic sched domain
patches before I send them off to Andrew.
I would have posted them earlier, except I got sidetracked with all
the hotplug+preempt issues. cpusets+hotplug+preempt requires more
testing, and I'll post those patches sometime later in the week.
patch1 - sched.c+sched.h changes
patch2 - cpuset.c+Documentation changes
patch3 - ia64 changes
All patches are against 2.6.12-rc4-mm1 and have been tested
(except for the ia64 changes)
linux-2.6.12-rc4-mm1-1/include/linux/sched.h | 2
linux-2.6.12-rc4-mm1-1/kernel/sched.c | 131 +++++++++++++++--------
linux-2.6.12-rc4-mm1-2/Documentation/cpusets.txt | 16 ++
linux-2.6.12-rc4-mm1-2/kernel/cpuset.c | 89 ++++++++++++---
linux-2.6.12-rc4-mm1-3/arch/ia64/kernel/domain.c | 77 ++++++++-----
5 files changed, 225 insertions(+), 90 deletions(-)
o Patch1 adds the new API partition_sched_domains.
I have incorporated all of the feedback from before.
o I didn't think it necessary to add another semaphore to protect the sched domain
changes as suggested by Nick. As was clear last week, hotplug+cpusets
causes enough issues with nested sems, so I didn't want to add yet another to
the mix.
o I have removed the __devinit qualifier from some functions, which will
now get called any time changes are made to exclusive cpusets instead of
only during boot.
o arch_init_sched_domains/arch_destroy_sched_domains now take a
const cpumask_t * argument; a rough sketch of the resulting prototypes
is included below.
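For reference, here is a rough sketch of what the resulting prototypes might
look like (my reconstruction from the descriptions above and the call sites
quoted later in the thread, not the literal hunk from patch1):

/* kernel/sched.c -- sketch of the new/changed interfaces */

/*
 * Rebuild the scheduler domains so that the CPUs in *partition1 and
 * the CPUs in *partition2 each get an independent sched domain tree.
 * Callers are expected to serialise calls themselves.
 */
void partition_sched_domains(cpumask_t *partition1, cpumask_t *partition2);

/*
 * These used to be __devinit and implicitly operate on cpu_online_map;
 * they now take the set of CPUs to (de)initialise so they can be called
 * whenever exclusive cpusets change, not just at boot.
 */
static void arch_init_sched_domains(const cpumask_t *cpu_map);
static void arch_destroy_sched_domains(const cpumask_t *cpu_map);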
-Dinakar
o Patch2 has updated cpusets documentation and the core update_cpu_domains
function
o I have also moved the dentry d_lock as discussed previously
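To make that concrete, here is a simplified sketch of the shape of
update_cpu_domains() as I understand it from the description above (helper
names such as is_cpu_exclusive() follow kernel/cpuset.c; the real function
also handles the cpuset being removed or turned non-exclusive):

/* Called with cpuset_sem held (see the locking discussion further down). */
static void update_cpu_domains(struct cpuset *cs)
{
	struct cpuset *c, *parent = cs->parent;
	cpumask_t pspan, cspan;

	if (parent == NULL || cpus_empty(cs->cpus_allowed))
		return;

	/* Parent's sched domain: its cpus minus all exclusive children. */
	pspan = parent->cpus_allowed;
	list_for_each_entry(c, &parent->children, sibling) {
		if (is_cpu_exclusive(c))
			cpus_andnot(pspan, pspan, c->cpus_allowed);
	}

	/* This cpuset's sched domain: its cpus minus its own exclusive children. */
	cspan = cs->cpus_allowed;
	list_for_each_entry(c, &cs->children, sibling) {
		if (is_cpu_exclusive(c))
			cpus_andnot(cspan, cspan, c->cpus_allowed);
	}

	lock_cpu_hotplug();
	partition_sched_domains(&pspan, &cspan);
	unlock_cpu_hotplug();
}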
o Patch3 has the ia64 changes similar to kernel/sched.c
o This patch compiles ok, but has not been tested
Dinakar Guniguntala wrote:
> o Patch2 has updated cpusets documentation and the core update_cpu_domains
> function
> o I have also moved the dentry d_lock as discussed previously
>
Hi Dinakar,
patch1 looks good. Just one tiny little minor thing:
> +
> + lock_cpu_hotplug();
> + partition_sched_domains(&pspan, &cspan);
> + unlock_cpu_hotplug();
> +}
> +
I don't think the cpu hotplug lock is supposed to provide
synchronisation between readers (for example, it may be turned
into an rwsem), but only between the thread and the cpu hotplug
callbacks.
In that case, can you move this locking into kernel/sched.c, and
add the comment in partition_sched_domains that the callers must
take care of synchronisation (which without reading the code, I
assume you're doing with the cpuset sem?).
If you agree with that change, you can add an
Acked-by: Nick Piggin <[email protected]>
to patch 1 and send it to Andrew whenever you're ready (better
CC Ingo as well). If not, please discuss! :)
Thanks,
Nick
On Tue, May 17, 2005 at 04:25:37PM +1000, Nick Piggin wrote:
> >+
> >+ lock_cpu_hotplug();
> >+ partition_sched_domains(&pspan, &cspan);
> >+ unlock_cpu_hotplug();
> >+}
> >+
>
> I don't think the cpu hotplug lock is supposed to provide
> synchronisation between readers (for example, it may be turned
> into an rwsem), but only between the thread and the cpu hotplug
> callbacks.
That should be ok
>
> In that case, can you move this locking into kernel/sched.c, and
> add the comment in partition_sched_domains that the callers must
> take care of synchronisation (which without reading the code, I
> assume you're doing with the cpuset sem?).
I didn't want to do this, as my next patch, which introduces
hotplug support for dynamic sched domains, also calls
partition_sched_domains. That code is called with the hotplug lock
already held. (I am still testing that code; it should be out by
this weekend.)
However, I will add a comment about the synchronization, and yes,
currently it is taken care of by the cpuset sem.
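For what it's worth, the comment could read something like this (my wording,
just to illustrate the intent, not the actual hunk):

/*
 * partition_sched_domains -- rebuild two independent sched domain trees,
 * one spanning the CPUs in *partition1 and one spanning *partition2.
 *
 * Callers are responsible for serialising calls to this function.  The
 * current caller, update_cpu_domains() in kernel/cpuset.c, does so by
 * holding cpuset_sem, and takes lock_cpu_hotplug()/unlock_cpu_hotplug()
 * around the call.
 */
void partition_sched_domains(cpumask_t *partition1, cpumask_t *partition2);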
-Dinakar
Looking good. Some minor comments on these three patches ...
* The name 'nodemask' for the cpumask_t of CPUs that are siblings to CPU i
is a bit confusing (yes, that name was already there). How about
something like 'siblings' ?
* I suspect that the following two lines:
cpus_complement(cpu_default_map, cpu_isolated_map);
cpus_and(cpu_default_map, cpu_default_map, *cpu_map);
can be replaced with the one line:
cpus_andnot(cpu_default_map, *cpu_map, cpu_isolated_map);
* You have 'cpu-exclusive' in some places in the Documentation.
I would mildly prefer to always spell this 'cpu_exclusive' (with
underscore, not hyphen).
* I like how this design came out, as described in:
A cpuset that is cpu exclusive has a sched domain associated with it.
The sched domain consists of all cpus in the current cpuset that are not
part of any exclusive child cpusets.
Good work.
* Question - any idea how much of a performance hiccup a system will feel
whenever someone changes the cpu_exclusive cpusets? Could this lead
to a denial-of-service attack, if, say, some untrusted user were allowed
modify privileges on some small cpuset that was cpu_exclusive, and they
abused that privilege by turning on and off the cpu_exclusive property
on their little cpuset (or creating/destroying an exclusive child):
cd /dev/cpuset/$(cat /proc/self/cpuset)
while true
do
for i in 0 1
do
echo $i > cpu_exclusive
done
done
If so, perhaps we should recommend that shared systems with untrusted
users avoid making a cpu_exclusive cpuset, or its cpu_exclusive flag,
modifiable by those untrusted users.
* The cpuset 'oldcs' in update_flag() seems to only be used for its
cpu_exclusive flag. We could save some stack space on my favorite
big honkin NUMA iron by just having a local variable for this
'old_cpu_exclusive' value, instead of the entire cpuset.
* Similarly, though with a bit less savings, one could replace 'oldcs'
in update_cpumask() with just the old_cpus_allowed mask.
Or, skip even that, and compute a boolean flag:
cpus_changed = !cpus_equal(cs->cpus_allowed, trialcs.cpus_allowed);
before copying over the trialcs, so we only need one word of stack
for the boolean, not possibly many words for a cpumask. (See the
sketch after this list of comments.)
* Non-traditional code style:
}
else {
should be instead:
} else {
* Is it the case that update_cpu_domains() is called with cpuset_sem held?
Would it be a good idea to note in the comment for that routine:
* Call with cpuset_sem held. May nest a call to the
* lock_cpu_hotplug()/unlock_cpu_hotplug() pair.
I didn't call out the cpuset_sem lock precondition on many routines,
but since this one can nest the cpu_hotplug lock, it might be worth
calling it out, for the benefit of engineers who are passing through,
needing to know how the hotplug lock nests with other semaphores.
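In case it helps, here is roughly what I had in mind for the 'oldcs' items
above (paraphrased fragments, not code from your patch; note the negation
on cpus_equal(), since we want "changed"):

	/* update_flag(): keep one int on the stack instead of a whole cpuset. */
	int old_cpu_exclusive = is_cpu_exclusive(cs);

	cs->flags = trialcs.flags;
	if (old_cpu_exclusive != is_cpu_exclusive(cs))
		update_cpu_domains(cs);

	/* update_cpumask(): a single boolean instead of a saved cpumask. */
	int cpus_changed = !cpus_equal(cs->cpus_allowed, trialcs.cpus_allowed);

	cs->cpus_allowed = trialcs.cpus_allowed;
	if (cpus_changed && is_cpu_exclusive(cs))
		update_cpu_domains(cs);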
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
On Tue, May 17, 2005 at 10:53:54PM -0700, Paul Jackson wrote:
> Looking good. Some minor comments on these three patches ...
>
> * The name 'nodemask' for the cpumask_t of CPUs that are siblings to CPU i
> is a bit confusing (yes, that name was already there). How about
> something like 'siblings' ?
Not sure which code you are referring to here? I don't see any nodemask
referring to SMT siblings.
> can be replaced with the one line:
>
> cpus_andnot(cpu_default_map, *cpu_map, cpu_isolated_map);
yeah, ok
> I would mildly prefer to always spell this 'cpu_exclusive' (with
> underscore, not hyphen).
fine
> Good work.
Thanks !
>
> * Question - any idea how much of a performance hiccup a system will feel
> whenever someone changes the cpu_exclusive cpusets? Could this lead
> to a denial-of-service attack, if, say, some untrusted user were allowed
> modify privileges on some small cpuset that was cpu_exclusive, and they
> abused that privilege by turning on and off the cpu_exclusive property
> on their little cpuset (or creating/destroying an exclusive child):
>
I tried your script and see that it makes absolutely no impact in top.
The CPU on which it is running is mostly 100% idle. However, I'll run
more tests to confirm that it has no impact.
>
> * The cpuset 'oldcs' in update_flag() seems to only be used for its
> cpu_exclusive flag. We could save some stack space on my favorite
> big honkin NUMA iron by just having a local variable for this
> 'old_cpu_exclusive' value, instead of the entire cpuset.
>
> * Similarly, though with a bit less savings, one could replace 'oldcs'
> in update_cpumask() with just the old_cpus_allowed mask.
> Or, skip even that, and compute a boolean flag:
> cpus_changed = !cpus_equal(cs->cpus_allowed, trialcs.cpus_allowed);
> before copying over the trialcs, so we only need one word of stack
> for the boolean, not possibly many words for a cpumask.
ok for both
>
> * Non-traditional code style:
> }
> else {
> should be instead:
> } else {
I don't know how that snuck back in; I'll change that.
>
> * Is it the case that update_cpu_domains() is called with cpuset_sem held?
> Would it be a good idea to note in the comment for that routine:
> * Call with cpuset_sem held. May nest a call to the
> * lock_cpu_hotplug()/unlock_cpu_hotplug() pair.
> I didn't call out the cpuset_sem lock precondition on many routines,
> but since this one can nest the cpu_hotplug lock, it might be worth
> calling it out, for the benefit of engineers who are passing through,
> needing to know how the hotplug lock nests with other semaphores.
ok
I do feel that with the above updates the patches can go into -mm.
I appreciate all the review comments from everyone. Thanks!
-Dinakar
Dinakar wrote:
> > * The name 'nodemask' for the cpumask_t of CPUs that are siblings to CPU i
> > is a bit confusing (yes, that name was already there). How about
> > something like 'siblings' ?
>
> Not sure which code you are referring to here? I don't see any nodemask
> referring to SMT siblings.
This comment was referring to lines such as the following, which appear
a few places in your patch (though not lines you wrote, just nearby
lines, in all but one case):
cpumask_t nodemask = node_to_cpumask(cpu_to_node(i));
I was thinking to change such a line to:
cpumask_t sibling = node_to_cpumask(cpu_to_node(i));
However, it is no biggie, and since it is not in your actual new
code, it probably should not be part of your patch anyway.
There is one place, arch_destroy_sched_domains(), where you added such a
line, but there you should probably use the same 'nodemask' name as the
other couple of places, unless and until these places change together.
So bottom line - never mind this comment.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
Dinakar wrote:
> I tried your script and see that it makes absolutely no impact in top.
> The CPU on which it is running is mostly 100% idle. However, I'll run
> more tests to confirm that it has no impact.
I have no particular intuition, one way or the other, on how much a
dynamic reallocation of sched domains will impact the system. So
once you are comfortable that this is not normally a problem (which
you might already be), don't worry about it further on my account.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
Dinakar wrote:
> I do feel that with the above updates the patches can go into -mm.
Acked-by: Paul Jackson <[email protected]>
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401