2005-10-05 07:16:28

by Brian Gerst

Subject: Bogus load average and cpu times on x86_64 SMP kernels

I've been seeing bogus values from /proc/loadavg on an x86-64 SMP kernel
(but not UP).

$ cat /proc/loadavg
-1012098.26 922203.26 -982431.60 1/112 2688

This is in the current git tree. I'm also seeing strange values in
/proc/stat:

cpu 2489 40 920 60530 9398 171 288 1844674407350
cpu0 2509 60 940 60550 9418 191 308 0

The first line is the sum of all cpus (I only have one), so it's picking
up bad data from the non-present cpus. The last value, stolen time,
is completely bogus since that value is only ever used on s390.

It looks to me like there is some problem with how the per-cpu
structures are being initialized, or are getting corrupted. I have not
been able to test i386 SMP yet to see if the problem is x86_64 specific.

--
Brian Gerst


2005-10-05 18:00:09

by Brian Gerst

Subject: Re: Bogus load average and cpu times on x86_64 SMP kernels

Brian Gerst wrote:
> I've been seeing bogus values from /proc/loadavg on an x86-64 SMP kernel
> (but not UP).
>
> $ cat /proc/loadavg
> -1012098.26 922203.26 -982431.60 1/112 2688
>
> This is in the current git tree. I'm also seeing strange values in
> /proc/stat:
>
> cpu 2489 40 920 60530 9398 171 288 1844674407350
> cpu0 2509 60 940 60550 9418 191 308 0
>
> The first line is the sum of all cpus (I only have one), so it's picking
> up up bad data from the non-present cpus. The last value, stolen time,
> is completely bogus since that value is only ever used on s390.
>
> It looks to me like there is some problem with how the per-cpu
> structures are being initialized, or are getting corrupted. I have not
> been able to test i386 SMP yet to see if the problem is x86_64 specific.

I found the culprit: CPU hotplug. The problem is that
prefill_possible_map() is called after setup_per_cpu_areas(). This
leaves the per-cpu data sections for the future cpus uninitialized
(still pointing to the original per-cpu data, which is initmem). Since
the cpus exist in cpu_possible_map, for_each_cpu will iterate over them
even though the per-cpu data is invalid.

--
Brian Gerst

2005-10-07 04:14:34

by Brian Gerst

Subject: [PATCH] Fix hotplug cpu on x86_64

Initialize cpu_possible_map properly in the hotplug cpu case. Before
this, the map was filled in after the per-cpu areas were set up. This left
those cpus pointing to initmem and causing invalid data in /proc/stat and
elsewhere.

Signed-off-by: Brian Gerst <[email protected]>

---

arch/x86_64/kernel/smpboot.c | 41 ++++++++++++++++-------------------------
1 files changed, 16 insertions(+), 25 deletions(-)

12027fcffd85447f0fbf28264c5ee072715c345b
diff --git a/arch/x86_64/kernel/smpboot.c b/arch/x86_64/kernel/smpboot.c
--- a/arch/x86_64/kernel/smpboot.c
+++ b/arch/x86_64/kernel/smpboot.c
@@ -80,7 +80,23 @@ EXPORT_SYMBOL(cpu_online_map);
cpumask_t cpu_callin_map;
cpumask_t cpu_callout_map;

+#ifdef CONFIG_HOTPLUG_CPU
+/*
+ * cpu_possible_map should be static, it cannot change as cpu's
+ * are onlined, or offlined. The reason is per-cpu data-structures
+ * are allocated by some modules at init time, and dont expect to
+ * do this dynamically on cpu arrival/departure.
+ * cpu_present_map on the other hand can change dynamically.
+ * In case when cpu_hotplug is not compiled, then we resort to current
+ * behaviour, which is cpu_possible == cpu_present.
+ * If cpu-hotplug is supported, then we need to preallocate for all
+ * those NR_CPUS, hence cpu_possible_map represents entire NR_CPUS range.
+ * - Ashok Raj
+ */
+cpumask_t cpu_possible_map = CPU_MASK_ALL;
+#else
cpumask_t cpu_possible_map;
+#endif
EXPORT_SYMBOL(cpu_possible_map);

/* Per CPU bogomips and other parameters */
@@ -879,27 +895,6 @@ static __init void disable_smp(void)
cpu_set(0, cpu_core_map[0]);
}

-#ifdef CONFIG_HOTPLUG_CPU
-/*
- * cpu_possible_map should be static, it cannot change as cpu's
- * are onlined, or offlined. The reason is per-cpu data-structures
- * are allocated by some modules at init time, and dont expect to
- * do this dynamically on cpu arrival/departure.
- * cpu_present_map on the other hand can change dynamically.
- * In case when cpu_hotplug is not compiled, then we resort to current
- * behaviour, which is cpu_possible == cpu_present.
- * If cpu-hotplug is supported, then we need to preallocate for all
- * those NR_CPUS, hence cpu_possible_map represents entire NR_CPUS range.
- * - Ashok Raj
- */
-static void prefill_possible_map(void)
-{
- int i;
- for (i = 0; i < NR_CPUS; i++)
- cpu_set(i, cpu_possible_map);
-}
-#endif
-
/*
* Various sanity checks.
*/
@@ -967,10 +962,6 @@ void __init smp_prepare_cpus(unsigned in
current_cpu_data = boot_cpu_data;
current_thread_info()->cpu = 0; /* needed? */

-#ifdef CONFIG_HOTPLUG_CPU
- prefill_possible_map();
-#endif
-
if (smp_sanity_check(max_cpus) < 0) {
printk(KERN_INFO "SMP disabled\n");
disable_smp();



2005-10-07 09:50:44

by Andi Kleen

Subject: Re: [PATCH] Fix hotplug cpu on x86_64

A similar patch is already queued with Linus.

I also have a followon patch to avoid the extreme memory wastage
currently caused by hotplug CPUs (e.g. with NR_CPUS==128 you currently
lose 4MB of memory just for preallocated per CPU data). But that is
something for post 2.6.14.

See ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt-current/patches/*
for the current queue.

-Andi

2005-10-07 15:07:19

by Protasevich, Natalie

Subject: RE: [PATCH] Fix hotplug cpu on x86_64

> Brian Gerst wrote:
> > Brian Gerst wrote:
> >> I've been seeing bogus values from /proc/loadavg on an x86-64 SMP
> >> kernel (but not UP).
> >>
> >> $ cat /proc/loadavg
> >> -1012098.26 922203.26 -982431.60 1/112 2688
> >>
> >> This is in the current git tree. I'm also seeing strange values in
> >> /proc/stat:
> >>
> >> cpu 2489 40 920 60530 9398 171 288 1844674407350
> >> cpu0 2509 60 940 60550 9418 191 308 0
> >>
> >> The first line is the sum of all cpus (I only have one), so it's
> >> picking up up bad data from the non-present cpus. The last value,
> >> stolen time, is completely bogus since that value is only
> ever used
> >> on s390.
> >>
> >> It looks to me like there is some problem with how the per-cpu
> >> structures are being initialized, or are getting
> corrupted. I have
> >> not been able to test i386 SMP yet to see if the problem is x86_64
> >> specific.
> >
> > I found the culprit: CPU hotplug. The problem is that
> > prefill_possible_map() is called after setup_per_cpu_areas(). This
> > leaves the per-cpu data sections for the future cpus uninitialized
> > (still pointing to the original per-cpu data, which is initmem).
> > Since the cpus exists in cpu_possible_map, for_each_cpu
> will iterate
> > over them even though the per-cpu data is invalid.
>

I had to do the same on i386, but initially I was trying to avoid the
whole situation - allocating per_cpu data for all possible processors.
It seemed wasteful that on a system with NR_CPUS=256 or 512 brought up
as a 4x, everything per_cpu is (pre)allocated for all of them, although
it's sure convenient. I thought at the time it would be great if the
alloc_percpu() mechanism were able to dynamically re-create all the
per_cpu's for new processors; that way cpu_possible_map probably
wouldn't even be needed. Or is it too much trouble for too little gain...

Thanks,
--Natalie


> Initialize cpu_possible_map properly in the hotplug cpu case.
> Before this, the map was filled in after the per-cpu areas
> were set up. This left those cpus pointing to initmem and
> causing invalid data in /proc/stat and elsewhere.
>
> Signed-off-by: Brian Gerst <[email protected]>
>

2005-10-07 15:26:21

by Brian Gerst

Subject: Re: [PATCH] Fix hotplug cpu on x86_64

Protasevich, Natalie wrote:
>>Brian Gerst wrote:
>>
>>>Brian Gerst wrote:
>>>
>>>>I've been seeing bogus values from /proc/loadavg on an x86-64 SMP
>>>>kernel (but not UP).
>>>>
>>>>$ cat /proc/loadavg
>>>>-1012098.26 922203.26 -982431.60 1/112 2688
>>>>
>>>>This is in the current git tree. I'm also seeing strange values in
>>>>/proc/stat:
>>>>
>>>>cpu 2489 40 920 60530 9398 171 288 1844674407350
>>>>cpu0 2509 60 940 60550 9418 191 308 0
>>>>
>>>>The first line is the sum of all cpus (I only have one), so it's
>>>>picking up up bad data from the non-present cpus. The last value,
>>>>stolen time, is completely bogus since that value is only
>>
>>ever used
>>
>>>>on s390.
>>>>
>>>>It looks to me like there is some problem with how the per-cpu
>>>>structures are being initialized, or are getting
>>
>>corrupted. I have
>>
>>>>not been able to test i386 SMP yet to see if the problem is x86_64
>>>>specific.
>>>
>>>I found the culprit: CPU hotplug. The problem is that
>>>prefill_possible_map() is called after setup_per_cpu_areas(). This
>>>leaves the per-cpu data sections for the future cpus uninitialized
>>>(still pointing to the original per-cpu data, which is initmem).
>>>Since the cpus exists in cpu_possible_map, for_each_cpu
>>
>>will iterate
>>
>>>over them even though the per-cpu data is invalid.
>>
>
> I had to do the same in i386, but initially I was trying to avoid the
> whole situation - allocating per_cpu data for all possible processors.
> It seemed wasteful that on the system with NR_CPU=256 or 512 and brought
> up as 4x everything per_cpu is (pre)allocated for all, although it's
> sure convenient. I though at the time it would be great if
> alloc_percpu() mechanism was able to dynamically re-create all the
> per_cpu's for new processors, that way cpu_possible_map woun't probably
> even be needed. Or is it too much trouble for too little gain...
>
> Thanks,
> --Natalie
>

It certainly is possible. In the hotplug cpu case, don't put the
.data.percpu section in initmem. It will then be preserved for any
cpus that come online after boot.
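In linker-script terms, the suggestion amounts to placing the per-cpu template section outside the init region, roughly like this (a sketch of the idea only; the real x86_64 vmlinux.lds.S of that era differs in detail):

```
/* Sketch: keep .data.percpu before __init_begin so it is NOT freed. */
. = ALIGN(4096);
__per_cpu_start = .;
.data.percpu : { *(.data.percpu) }
__per_cpu_end = .;

. = ALIGN(4096);
__init_begin = .;    /* everything from here on is freed after boot */
```

The template then stays resident, so a CPU that comes online after boot can still copy a valid initial image.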

--
Brian Gerst

2005-10-07 16:00:35

by Protasevich, Natalie

Subject: RE: [PATCH] Fix hotplug cpu on x86_64

> > I had to do the same in i386, but initially I was trying to
> avoid the
> > whole situation - allocating per_cpu data for all possible
> processors.
> > It seemed wasteful that on the system with NR_CPU=256 or 512 and
> > brought up as 4x everything per_cpu is (pre)allocated for all,
> > although it's sure convenient. I though at the time it
> would be great
> > if
> > alloc_percpu() mechanism was able to dynamically re-create all the
> > per_cpu's for new processors, that way cpu_possible_map woun't
> > probably even be needed. Or is it too much trouble for too
> little gain...
> >
> > Thanks,
> > --Natalie
> >
>
> It certainly is possible. In the hotplug cpu case, don't put
> the .data.percpu section in __initmem. It will then be
> preserved for any cpus that come online after boot.
>

I don't want to confuse Andrew, the patch is definitely correct and
needed...

It is more about the "cold plug" case, when processors are not present
initially but are physically added to the system at runtime. The goal
was not to preserve the data, but to avoid allocating per_cpu data for
all NR_CPUS in the first place - allocate only for physically present
processors, then create per_cpu data for newly added processors as they
arrive. But as I said, it is probably too much to ask after all... :)
Thanks,
--Natalie

2005-10-07 16:24:47

by Andi Kleen

Subject: Re: [PATCH] Fix hotplug cpu on x86_64

On Friday 07 October 2005 17:07, Protasevich, Natalie wrote:
> > Brian Gerst wrote:
> > > Brian Gerst wrote:
> > >> I've been seeing bogus values from /proc/loadavg on an x86-64 SMP
> > >> kernel (but not UP).
> > >>
> > >> $ cat /proc/loadavg
> > >> -1012098.26 922203.26 -982431.60 1/112 2688
> > >>
> > >> This is in the current git tree. I'm also seeing strange values in
> > >> /proc/stat:
> > >>
> > >> cpu 2489 40 920 60530 9398 171 288 1844674407350
> > >> cpu0 2509 60 940 60550 9418 191 308 0
> > >>
> > >> The first line is the sum of all cpus (I only have one), so it's
> > >> picking up up bad data from the non-present cpus. The last value,
> > >> stolen time, is completely bogus since that value is only
> >
> > ever used
> >
> > >> on s390.
> > >>
> > >> It looks to me like there is some problem with how the per-cpu
> > >> structures are being initialized, or are getting
> >
> > corrupted. I have
> >
> > >> not been able to test i386 SMP yet to see if the problem is x86_64
> > >> specific.
> > >
> > > I found the culprit: CPU hotplug. The problem is that
> > > prefill_possible_map() is called after setup_per_cpu_areas(). This
> > > leaves the per-cpu data sections for the future cpus uninitialized
> > > (still pointing to the original per-cpu data, which is initmem).
> > > Since the cpus exists in cpu_possible_map, for_each_cpu
> >
> > will iterate
> >
> > > over them even though the per-cpu data is invalid.
>
> I had to do the same in i386, but initially I was trying to avoid the
> whole situation - allocating per_cpu data for all possible processors.
> It seemed wasteful that on the system with NR_CPU=256 or 512 and brought
> up as 4x everything per_cpu is (pre)allocated for all, although it's

It is, quite. With NR_CPUS==128 the current x86-64 code wastes around
4MB on per-CPU data :/

> sure convenient. I though at the time it would be great if
> alloc_percpu() mechanism was able to dynamically re-create all the
> per_cpu's for new processors, that way cpu_possible_map woun't probably
> even be needed. Or is it too much trouble for too little gain...

The problem is telling all subsystems to reinitialize their data for
new CPUs (essentially the concept of "possible" CPUs would need to go).
It would need a re-audit of a lot of code. Probably quite some work.

I think it is better to try to figure out how many hotplug CPUs are
supported, otherwise use a small default. My current code uses the
disabled-CPU count as a heuristic, otherwise half of the available
CPUs, and it can be overridden by the user.

-Andi

2005-10-07 16:42:21

by Protasevich, Natalie

Subject: RE: [PATCH] Fix hotplug cpu on x86_64

> > sure convenient. I though at the time it would be great if
> > alloc_percpu() mechanism was able to dynamically re-create all the
> > per_cpu's for new processors, that way cpu_possible_map woun't
> > probably even be needed. Or is it too much trouble for too
> little gain...
>
> The problem is to tell all subsystems to reinitialize the
> data for possible CPUs (essentially the concept of "possible"
> CPUs would need to go) It would need an reaudit of a lot of
> code. Probably quite some work.

You know Andi, I was imagining something like a bitmap or linked list
of all per_cpu vars (dynamically updated) and just going through this
list... Or something like that (maybe some registration mechanism).
There are not too many of them - about two dozen, mostly all sorts of
accounting.

> I think it is better to try to figure out how many hotplug
> CPUs are supported, otherwise use a small default.

Exactly - such as on ES7000, which can support 32, 64, 128 etc.
processors depending on what configuration the customer actually
ordered :)... There should be something for that; then NR_CPUS could
either be defined as a boot parameter or belong to the subarchs.
Thanks,
--Natalie

> My
> current code uses disabled CPUs as a heuristic, otherwise
> half of available CPUs and it can be overwritten by the user.
>
> -Andi
>

2005-10-07 17:23:40

by Andi Kleen

Subject: Re: [PATCH] Fix hotplug cpu on x86_64

On Friday 07 October 2005 18:42, Protasevich, Natalie wrote:

> You know Andi, I was imagining something like bitmap or linked list of
> all per_cpu vars (dynamically updated) and just going through this
> list... Or something like that (maybe some registration mechanism).
> There are not too many of them - about two dozens, mostly all sorts of
> accounting.

Finding them is no problem. We have NR_CPUS arrays for this (or other
per CPU mechanisms).

The problem is initializing them correctly. There are currently two ways to do
this:

- (easier one, used by most subsystems) at startup set up state for
all possible CPUs.
- (complex one) register a CPU notifier and watch for CPU_UP/DOWN

Because of the first way, the per-cpu data is currently preallocated for
all hotplug CPUs. You cannot copy the state later because it might be
undefined by then.

Making dynamic changes to the possible map work would require converting
all users to the second, more complex way. Probably a lot of work.

>
> > I think it is better to try to figure out how many hotplug
> > CPUs are supported, otherwise use a small default.
>
> Exactly - such as on ES7000, it can support 32, 64, 128 etc. processors
> depending on what configuration the customer actually ordered :)... it
> should be something for that, then NR_CPUS could be either defined as
> boot parameter or belong to subarchs.

ACPI/mptables has the concept of "disabled" CPUs. I just bent that a bit
and use it as the number of possible CPUs.

-Andi

2005-10-08 00:53:20

by Anton Blanchard

Subject: Re: [PATCH] Fix hotplug cpu on x86_64


Hi,

> I also have a followon patch to avoid the extreme memory wastage
> currently caused by hotplug CPUs (e.g. with NR_CPUS==128 you currently
> lose 4MB of memory just for preallocated per CPU data). But that is
> something for post 2.6.14.

I'm interested in doing that on ppc64 too. Are you currently only
creating per-cpu data areas for possible cpus? The generic code does
NR_CPUS worth; we should change that in 2.6.15.

Anton

2005-10-08 10:31:20

by Andi Kleen

Subject: Re: [PATCH] Fix hotplug cpu on x86_64

On Saturday 08 October 2005 02:50, Anton Blanchard wrote:
> Hi,
>
> > I also have a followon patch to avoid the extreme memory wastage
> > currently caused by hotplug CPUs (e.g. with NR_CPUS==128 you currently
> > lose 4MB of memory just for preallocated per CPU data). But that is
> > something for post 2.6.14.
>
> Im interested in doing that on ppc64 too. Are you currently only
> creating per cpu data areas for possible cpus?

Yes. In fact that caused some of the problems that led to the
investigation of this.

> The generic code does
> NR_CPUS worth, we should change that in 2.6.15.

You then need to audit all the architecture code that fills in the
possible map (at least for the architectures that support CPU hotplug).

With that change x86-64 could move back to the generic code, BTW -
the NUMA allocation in the architecture-specific code doesn't work
anymore because we usually only find out about the cpu<->node mapping
afterwards :/

-Andi