I think there is a bug in build_sched_domains() when kmalloc fails to
allocate memory for 'sched_group_allnodes'.
Consider the case where build_sched_domains() is being invoked because a
CPUset is being made exclusive. Further, let's say the user changed the
cpu_exclusive property of some CPUset as follows:
echo 1 > cpu_exclusive # (Step a)
echo 0 > cpu_exclusive # (Step b)
echo 1 > cpu_exclusive # (Step c)
Let's say that all memory allocations succeeded during the first attempt to
make the CPUset exclusive (Step a). Step a would result in the sched domain
hierarchy being initialized for all CPUs in the CPUset under
consideration. Step b would cause those memory allocations to be freed
up.
Now during Step c, let's say that kmalloc failed to allocate memory for
'sched_group_allnodes'. When that happens, the code just breaks out of
the 'for' loop. I think that is wrong, as it would cause the previous
assignments (made during Step a) to be retained. This could cause, among
other things, sd->groups to be invalid (for allnodes_domains)?
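To see the hazard outside the kernel, here is a stand-alone user-space toy
(all names invented, not the real sched.c code) that walks through the same
three steps and ends up holding stale pointers into freed memory:

#include <stdio.h>
#include <stdlib.h>

/* Long-lived state, standing in for the static per-cpu sched domains. */
struct group { int id; };

struct domain {
        struct group *groups;   /* can be left pointing at freed memory */
};

static struct domain domains[2];

/* Stands in for build_sched_domains(): (re)build the group pointers. */
static int rebuild(int fail_alloc)
{
        int i;

        for (i = 0; i < 2; i++) {
                struct group *g = fail_alloc ? NULL : malloc(sizeof(*g));

                if (!g)
                        return -1;      /* like the 'break': old pointers survive */
                g->id = i;
                domains[i].groups = g;
        }
        return 0;
}

/* Stands in for the teardown done on 'echo 0 > cpu_exclusive'. */
static void destroy(void)
{
        int i;

        for (i = 0; i < 2; i++)
                free(domains[i].groups);        /* note: pointers not cleared */
}

int main(void)
{
        rebuild(0);     /* Step a: all allocations succeed */
        destroy();      /* Step b: the group memory is freed */
        if (rebuild(1)) /* Step c: allocation fails, we bail out early */
                fprintf(stderr, "rebuild failed; domains[].groups still hold "
                                "the pointers freed in Step b\n");
        return 0;
}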
Patch below (against 2.6.16-mm1) fixes it. I have done some minimal testing
of it.
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
kernel/sched.c | 10 +++++++---
1 files changed, 7 insertions(+), 3 deletions(-)
diff -puN kernel/sched.c~sd_handle_error kernel/sched.c
--- linux-2.6.16-mm1/kernel/sched.c~sd_handle_error 2006-03-25 11:39:07.000000000 +0530
+++ linux-2.6.16-mm1-root/kernel/sched.c 2006-03-25 11:39:59.000000000 +0530
@@ -5965,6 +5965,7 @@ void build_sched_domains(const cpumask_t
{
int i;
#ifdef CONFIG_NUMA
+ int alloc_failed = 0;
struct sched_group **sched_group_nodes = NULL;
struct sched_group *sched_group_allnodes = NULL;
@@ -5993,7 +5994,7 @@ void build_sched_domains(const cpumask_t
#ifdef CONFIG_NUMA
if (cpus_weight(*cpu_map)
> SD_NODES_PER_DOMAIN*cpus_weight(nodemask)) {
- if (!sched_group_allnodes) {
+ if (!sched_group_allnodes && !alloc_failed) {
sched_group_allnodes
= kmalloc(sizeof(struct sched_group)
* MAX_NUMNODES,
@@ -6001,7 +6002,7 @@ void build_sched_domains(const cpumask_t
if (!sched_group_allnodes) {
printk(KERN_WARNING
"Can not alloc allnodes sched group\n");
- break;
+ alloc_failed = 1;
}
sched_group_allnodes_bycpu[i]
= sched_group_allnodes;
@@ -6010,7 +6011,10 @@ void build_sched_domains(const cpumask_t
*sd = SD_ALLNODES_INIT;
sd->span = *cpu_map;
group = cpu_to_allnodes_group(i);
- sd->groups = &sched_group_allnodes[group];
+ sd->groups = sched_group_allnodes ?
+ &sched_group_allnodes[group] : NULL;
+ if (!sd->groups)
+ sd->flags = 0; /* No load balancing */
p = sd;
} else
p = NULL;
_
--
Regards,
vatsa
Srivatsa Vaddagiri <[email protected]> wrote:
>
> I think there is a bug in build_sched_domains() when kmalloc fails to
> allocate memory for 'sched_group_allnodes'.
>
> Consider the case where build_sched_domains() is being invoked because a
> CPUset is being made exclusive. Further, let's say the user changed the
> cpu_exclusive property of some CPUset as follows:
>
> echo 1 > cpu_exclusive # (Step a)
> echo 0 > cpu_exclusive # (Step b)
> echo 1 > cpu_exclusive # (Step c)
>
>
> Let's say that all memory allocations succeeded during the first attempt to
> make the CPUset exclusive (Step a). Step a would result in the sched domain
> hierarchy being initialized for all CPUs in the CPUset under
> consideration. Step b would cause those memory allocations to be freed
> up.
>
> Now during Step c, let's say that kmalloc failed to allocate memory for
> 'sched_group_allnodes'. When that happens, the code just breaks out of
> the 'for' loop. I think that is wrong, as it would cause the previous
> assignments (made during Step a) to be retained. This could cause, among
> other things, sd->groups to be invalid (for allnodes_domains)?
>
> Patch below (against 2.6.16-mm1) fixes it. I have done some minimal testing
> of it.
>
This is rather ugly, sorry.
>
>
> kernel/sched.c | 10 +++++++---
> 1 files changed, 7 insertions(+), 3 deletions(-)
>
> diff -puN kernel/sched.c~sd_handle_error kernel/sched.c
> --- linux-2.6.16-mm1/kernel/sched.c~sd_handle_error 2006-03-25 11:39:07.000000000 +0530
> +++ linux-2.6.16-mm1-root/kernel/sched.c 2006-03-25 11:39:59.000000000 +0530
> @@ -5965,6 +5965,7 @@ void build_sched_domains(const cpumask_t
> {
> int i;
> #ifdef CONFIG_NUMA
> + int alloc_failed = 0;
> struct sched_group **sched_group_nodes = NULL;
> struct sched_group *sched_group_allnodes = NULL;
>
> @@ -5993,7 +5994,7 @@ void build_sched_domains(const cpumask_t
> #ifdef CONFIG_NUMA
> if (cpus_weight(*cpu_map)
> > SD_NODES_PER_DOMAIN*cpus_weight(nodemask)) {
> - if (!sched_group_allnodes) {
> + if (!sched_group_allnodes && !alloc_failed) {
> sched_group_allnodes
> = kmalloc(sizeof(struct sched_group)
> * MAX_NUMNODES,
> @@ -6001,7 +6002,7 @@ void build_sched_domains(const cpumask_t
> if (!sched_group_allnodes) {
> printk(KERN_WARNING
> "Can not alloc allnodes sched group\n");
> - break;
> + alloc_failed = 1;
> }
> sched_group_allnodes_bycpu[i]
> = sched_group_allnodes;
> @@ -6010,7 +6011,10 @@ void build_sched_domains(const cpumask_t
> *sd = SD_ALLNODES_INIT;
> sd->span = *cpu_map;
> group = cpu_to_allnodes_group(i);
> - sd->groups = &sched_group_allnodes[group];
> + sd->groups = sched_group_allnodes ?
> + &sched_group_allnodes[group] : NULL;
> + if (!sd->groups)
> + sd->flags = 0; /* No load balancing */
> p = sd;
> } else
> p = NULL;
So if the kmalloc failed we'll try to limp along without load balancing?
I think it would be better to free any thus-far allocated memory and to
fail the whole thing.
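Something like the usual all-or-nothing pattern, sketched below as a
stand-alone user-space toy (invented names, not actual kernel code):

#include <errno.h>
#include <stdlib.h>

struct group { int id; };

/*
 * Build every group array or none of them: on an allocation failure,
 * free whatever was allocated so far and fail the whole call.
 */
static int build_all(struct group **groups, int nr)
{
        int i;

        for (i = 0; i < nr; i++) {
                groups[i] = malloc(sizeof(struct group));
                if (!groups[i])
                        goto error;
                groups[i]->id = i;
        }
        return 0;

error:
        while (--i >= 0) {
                free(groups[i]);
                groups[i] = NULL;       /* leave no stale pointers behind */
        }
        return -ENOMEM;
}

int main(void)
{
        struct group *g[8] = { NULL };

        return build_all(g, 8) ? 1 : 0;
}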
Returning void from build_sched_domains was wrong.
build_sched_domains() should be static and __cpuinit, btw.
On Sat, Mar 25, 2006 at 06:06:05PM -0800, Andrew Morton wrote:
> This is rather ugly, sorry.
That was about my reaction too when I was going through
build_sched_domains()!
> So if the kmalloc failed we'll try to limp along without load balancing?
Not exactly. We will still load balance at lower domains (between
threads of a CPU & between CPUs of a node) that don't require any memory
allocation.
> I think it would be better to free any thus-far allocated memory and to
> fail the whole thing.
This would result in absolutely no load balancing (even for domain
levels which didn't need any memory allocation - like at the threads-of-a-cpu
level). Is that acceptable?
> Returning void from build_sched_domains was wrong.
If we decide to return an error, then it has to be propagated all the
way up the call chain (for example, update_cpu_domains would now have to
return an error too if partition_sched_domains returns an error)?
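For illustration, the change would look something like this (a rough sketch
only -- arguments and flow are simplified and this is not against any real
tree):

static int build_sched_domains(const cpumask_t *cpu_map);      /* was void */

int partition_sched_domains(cpumask_t *partition1, cpumask_t *partition2)
{
        int err;

        err = build_sched_domains(partition1);
        if (!err)
                err = build_sched_domains(partition2);
        return err;                                     /* was void */
}

/* and the cpuset-side caller would have to pass it on as well */
static int update_cpu_domains(struct cpuset *cs)
{
        cpumask_t pspan, cspan;

        /* computation of pspan/cspan elided in this sketch */
        return partition_sched_domains(&pspan, &cspan);
}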
> build_sched_domains() should be static and __cpuinit, btw.
Ok ..Will take care of that in the next version of the patch.
And thanks for the response to the patch!
--
Regards,
vatsa
Srivatsa Vaddagiri <[email protected]> wrote:
>
> On Sat, Mar 25, 2006 at 06:06:05PM -0800, Andrew Morton wrote:
> > This is rather ugly, sorry.
>
> That was about my reaction too when I was going through
> build_sched_domains()!
>
> > So if the kmalloc failed we'll try to limp along without load balancing?
>
> Not exactly. We will still load balance at lower domains (between
> threads of a CPU & between CPUs of a node) that don't require any memory
> allocation.
>
> > I think it would be better to free any thus-far allocated memory and to
> > fail the whole thing.
>
> This would result in absolutely no load balancing (even for domain
> levels which didn't need any memory allocation - like at the threads-of-a-cpu
> level). Is that acceptable?
>
> > Returning void from build_sched_domains was wrong.
>
> If we decide to return an error, then it has to be propagated all the
> way up the call chain (for example, update_cpu_domains would now have to
> return an error too if partition_sched_domains returns an error)?
Well, when is this code called? It would be at boot time, in which case
the allocations will succeed (if not, the boot fails) or at cpu/node
hot-add, in which case the appropriate response is to fail to bring up the
new cpu/node.
It's better to send the administrator back to work out why we ran out of
memory than to appear to have brought the new cpu/node online, only to have
it run funny.
I think?
> > build_sched_domains() should be static and __cpuinit, btw.
>
> Ok ..Will take care of that in the next version of the patch.
>
umm, it's probably best to not bother. I think Ashok is looking into all
the memory we're presently wasting on non-cpu_hotplug builds. There's
quite a lot in there.
Andrew Morton wrote:
> Srivatsa Vaddagiri <[email protected]> wrote:
>
>>If we decide to return an error, then it has to be propagated all the
>>way up the call chain (for example, update_cpu_domains would now have to
>>return an error too if partition_sched_domains returns an error)?
>
>
> Well, when is this code called? It would be at boot time, in which case
> the allocations will succeed (if not, the boot fails) or at cpu/node
> hot-add, in which case the appropriate response is to fail to bring up the
> new cpu/node.
>
> It's better to send the administrator back to work out why we ran out of
> memory than to appear to have brought the new cpu/node online, only to have
> it run funny.
>
> I think?
>
I agree, for bootup/hot-add.
Not so sure about exclusive cpusets.
--
SUSE Labs, Novell Inc.
On Sat, Mar 25, 2006 at 06:44:41PM -0800, Andrew Morton wrote:
> Well, when is this code called? It would be at boot time, in which case
> the allocations will succeed (if not, the boot fails) or at cpu/node
> hot-add, in which case the appropriate response is to fail to bring up the
> new cpu/node.
Hmm .. don't follow you here. Are you saying the above is the current
behavior (say in 2.6.16-mm1)? AFAICS that is not true, because
arch_init_sched_domains (which handles both bootup & cpu hot-add cases)
doesn't return any error back to its caller.
Also note that build_sched_domains can be called from CPUset code too
(partition_sched_domains) when we modify the exclusive property of a CPUset.
> It's better to send the administrator back to work out why we ran out of
> memory than to appear to have brought the new cpu/node online, only to have
> it run funny.
>
> I think?
I don't know. Is load balancing such an absolutely critical feature that
system bootup and cpu hot-add should fail without it? Also note
that we may be able to do partial load balancing (between threads of a
CPU) even when there is a memory allocation failure.
Regarding giving the administrator hints about the allocation failure, the
printks in build_sched_domains do give those hints.
>
> > > build_sched_domains() should be static and __cpuinit, btw.
> >
> > Ok ..Will take care of that in the next version of the patch.
> >
>
> umm, it's probably best to not bother. I think Ashok is looking into all
> the memory we're presently wasting on non-cpu_hotplug builds. There's
> quite a lot in there.
Ok .. will let Ashok take care of adding the __cpuinit prefix.
--
Regards,
vatsa
On Sun, Mar 26, 2006 at 01:03:03PM +1000, Nick Piggin wrote:
> I agree, for bootup/hot-add.
>
> Not so sure about exclusive cpusets.
I think even if memory allocation fails while making a cpuset exclusive,
we would still want to have the cpuset marked exclusive. Dynamic scheduler
domains are just one implication of making a cpuset exclusive. There
may be other implications (like only the administrator controlling which
tasks run in an exclusive cpuset?) which we want to retain even when there
is a memory allocation failure while building dynamic sched domains.
Paul can probably clarify this much better than me!
--
Regards,
vatsa
Vatsa wrote:
> I think even if memory allocation fails while making a cpuset exclusive,
> we would still want to have the cpuset marked exclusive.
Perhaps ...
The cpuset code itself in the kernel doesn't care at all whether the
dynamic sched domain setup succeeds or not.
If the user or app that made the system call (write '1' to
cpu_exclusive) was setting cpu_exclusive to get a sched domain,
they would want to know if it failed. If they were looking for one
of the other consequences of an exclusive cpuset, then such an error
is possibly confusing noise.
I keep wishing I had made a separate per-cpuset flag to mark a
cpuset as defining a dynamic sched domain. Had I done that, I would
definitely want to pass back an error in this case.
But even if the user/app isn't interested in this failure case,
probably the 'good kernel citizen' choice is to report it. The kernel
should report errors when it fails to accomplish requested tasks, and not
second-guess whether the user actually cares about that error.
And if we do report an error back to userland, then the request to
set cpu_exclusive needs to fail entirely, not setting the flag at all.
So ... bottom line ... when and if it becomes convenient to do so,
have the kernel/sched.c:partition_sched_domains() routine return
an error instead of void. Then sometime later on, I can change the
cpuset code to pick up that error, unravel the cpu_exclusive setting,
and pass back the error.
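Roughly like the following condensed sketch -- not the real kernel/cpuset.c
code; the wrapper name and the flag handling are made up for illustration,
and it assumes update_cpu_domains() has been changed to return int:

/* Invented wrapper for illustration; the real flag handling in
 * kernel/cpuset.c is structured differently. */
static int update_cpu_exclusive(struct cpuset *cs, int turning_on)
{
        int err;

        if (turning_on)
                set_bit(CS_CPU_EXCLUSIVE, &cs->flags);
        else
                clear_bit(CS_CPU_EXCLUSIVE, &cs->flags);

        err = update_cpu_domains(cs);   /* assumes it now returns int */
        if (err) {
                /* unravel the cpu_exclusive change and report the failure */
                if (turning_on)
                        clear_bit(CS_CPU_EXCLUSIVE, &cs->flags);
                else
                        set_bit(CS_CPU_EXCLUSIVE, &cs->flags);
        }
        return err;     /* passed back to the write on cpu_exclusive */
}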
But take your sweet time doing this -- no hurry here.
And if you decide it isn't worth the code or effort to do this,
I will gladly forget the whole thing.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401