2006-02-08 22:32:13

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

On Sun, 5 Feb 2006, Linux Kernel Mailing List wrote:

> [PATCH] percpu data: only iterate over possible CPUs

This sched.c bit breaks Xen, and probably also other architectures
that have CPU hotplug. I suspect the reason is that during early
bootup only the boot CPU is online, so nothing initialises the
runqueues for CPUs that are brought up afterwards.

I suspect we can get rid of this problem quite easily by moving
runqueue initialisation to init_idle()...

> diff --git a/kernel/sched.c b/kernel/sched.c
> index f77f23f..839466f 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -6109,7 +6109,7 @@ void __init sched_init(void)
> runqueue_t *rq;
> int i, j, k;
>
> - for (i = 0; i < NR_CPUS; i++) {
> + for_each_cpu(i) {
> prio_array_t *array;
>
> rq = cpu_rq(i);

--
All Rights Reversed


2006-02-09 01:21:20

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

On Wed, 8 Feb 2006, Rik van Riel wrote:

> On Sun, 5 Feb 2006, Linux Kernel Mailing List wrote:
>
> > [PATCH] percpu data: only iterate over possible CPUs
>
> The sched.c bit breaks Xen, and probably also other architectures
> that have CPU hotplug. I suspect the reason is that during early
> bootup only the boot CPU is online, so nothing initialises the
> runqueues for CPUs that are brought up afterwards.
>
> I suspect we can get rid of this problem quite easily by moving
> runqueue initialisation to init_idle()...

Well, it works. This (fairly trivial) patch makes hotplug cpu
work again, by ensuring that the runqueues of a newly brought
up CPU are initialized just before they are needed.

Without this patch the "spin_lock_irqsave(&rq->lock, flags);"
in init_idle() would oops if CONFIG_DEBUG_SPINLOCK was set.

With this patch, things just work.

Signed-off-by: Rik van Riel <[email protected]>

--- linux-2.6.15.i686/kernel/sched.c.idle_init 2006-02-08 17:56:50.000000000 -0500
+++ linux-2.6.15.i686/kernel/sched.c 2006-02-08 17:58:57.000000000 -0500
@@ -4437,6 +4437,35 @@ void __devinit init_idle(task_t *idle, i
{
runqueue_t *rq = cpu_rq(cpu);
unsigned long flags;
+ prio_array_t *array;
+ int j, k;
+
+ spin_lock_init(&rq->lock);
+ rq->nr_running = 0;
+ rq->active = rq->arrays;
+ rq->expired = rq->arrays + 1;
+ rq->best_expired_prio = MAX_PRIO;
+
+#ifdef CONFIG_SMP
+ rq->sd = NULL;
+ for (j = 1; j < 3; j++)
+ rq->cpu_load[j] = 0;
+ rq->active_balance = 0;
+ rq->push_cpu = 0;
+ rq->migration_thread = NULL;
+ INIT_LIST_HEAD(&rq->migration_queue);
+#endif
+ atomic_set(&rq->nr_iowait, 0);
+
+ for (j = 0; j < 2; j++) {
+ array = rq->arrays + j;
+ for (k = 0; k < MAX_PRIO; k++) {
+ INIT_LIST_HEAD(array->queue + k);
+ __clear_bit(k, array->bitmap);
+ }
+ // delimiter for bitsearch
+ __set_bit(MAX_PRIO, array->bitmap);
+ }

idle->sleep_avg = 0;
idle->array = NULL;
@@ -6110,41 +6139,6 @@ int in_sched_functions(unsigned long add

void __init sched_init(void)
{
- runqueue_t *rq;
- int i, j, k;
-
- for_each_cpu(i) {
- prio_array_t *array;
-
- rq = cpu_rq(i);
- spin_lock_init(&rq->lock);
- rq->nr_running = 0;
- rq->active = rq->arrays;
- rq->expired = rq->arrays + 1;
- rq->best_expired_prio = MAX_PRIO;
-
-#ifdef CONFIG_SMP
- rq->sd = NULL;
- for (j = 1; j < 3; j++)
- rq->cpu_load[j] = 0;
- rq->active_balance = 0;
- rq->push_cpu = 0;
- rq->migration_thread = NULL;
- INIT_LIST_HEAD(&rq->migration_queue);
-#endif
- atomic_set(&rq->nr_iowait, 0);
-
- for (j = 0; j < 2; j++) {
- array = rq->arrays + j;
- for (k = 0; k < MAX_PRIO; k++) {
- INIT_LIST_HEAD(array->queue + k);
- __clear_bit(k, array->bitmap);
- }
- // delimiter for bitsearch
- __set_bit(MAX_PRIO, array->bitmap);
- }
- }
-
/*
* The boot idle thread does lazy MMU switching as well:
*/

2006-02-09 03:06:07

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Rik van Riel <[email protected]> wrote:
>
> On Sun, 5 Feb 2006, Linux Kernel Mailing List wrote:
>
> > [PATCH] percpu data: only iterate over possible CPUs
>
> This sched.c bit breaks Xen, and probably also other architectures
> that have CPU hotplug. I suspect the reason is that during early
> bootup only the boot CPU is online, so nothing initialises the
> runqueues for CPUs that are brought up afterwards.
>
> I suspect we can get rid of this problem quite easily by moving
> runqueue initialisation to init_idle()...

We've hit this snag with a few architectures. They're setting up
cpu_possible_map too late. It's never been clearly defined.

sched_init() is called here:

asmlinkage void __init start_kernel(void)
{
...
printk(linux_banner);
setup_arch(&command_line);
setup_per_cpu_areas();
smp_prepare_boot_cpu();
sched_init();

Users of __GENERIC_PER_CPU definitely need cpu_possible_map to be initialised
by the time setup_per_cpu_areas() is called, so I think it makes sense to
say "thou shalt initialise cpu_possible_map in setup_arch()".

I guess Xen isn't doing that. Can it be made to?

2006-02-09 03:09:46

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Andrew Morton <[email protected]> wrote:
>
> Users of __GENERIC_PER_CPU definitely need cpu_possible_map to be initialised
> by the time setup_per_cpu_areas() is called,

err, they'll need it once Eric's
dont-waste-percpu-memory-on-not-possible-CPUs patch is merged..

> so I think it makes sense to
> say "thou shalt initialise cpu_possible_map in setup_arch()".
>
> I guess Xen isn't doing that. Can it be made to?

Lame fix: cpu_possible_map = (1<<NR_CPUS)-1 in setup_arch().

2006-02-09 04:36:23

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Andrew Morton a ?crit :
> Andrew Morton <[email protected]> wrote:
>> Users of __GENERIC_PER_CPU definitely need cpu_possible_map to be initialised
>> by the time setup_per_cpu_areas() is called,
>
> err, they'll need it once Eric's
> dont-waste-percpu-memory-on-not-possible-CPUs patch is merged..
>
>> so I think it makes sense to
>> say "thou shalt initialise cpu_possible_map in setup_arch()".
>>
>> I guess Xen isn't doing that. Can it be made to?
>
> Lame fix: cpu_possible_map = (1<<NR_CPUS)-1 in setup_arch().

I dont understand why this HOTPLUG stuff is problematic for Xen (or other
arch) : If CONFIG_HOTPLUG_CPU is configured, then the map should be preset to
CPU_MASK_ALL. Its even documented in line 332 of include/linux/cpumask.h

* #ifdef CONFIG_HOTPLUG_CPU
* cpu_possible_map - all NR_CPUS bits set

arch/i386/kernel/smpboot.c is doing the only sane stuff about it :

#ifdef CONFIG_HOTPLUG_CPU
cpumask_t cpu_possible_map = CPU_MASK_ALL;
#else
cpumask_t cpu_possible_map;
#endif

Some remarks :

1) These cpu_possible_map could have __read_mostly attribute.
2) cpu_possible(cpu) macro could be defined to 1 if CONFIG_HOTPLUG_CPU, or a
test against NR_CPUS

#ifdef CONFIG_HOTPLUG_CPU
#define cpu_possible(cpu) cpu_isset((cpu), cpu_possible_map)
#else
#define cpu_possible(cpu) ((unsigned)(cpu) < NR_CPUS)
#endif

Eric

2006-02-09 04:39:45

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Rik van Riel a ?crit :
> On Sun, 5 Feb 2006, Linux Kernel Mailing List wrote:
>
>> [PATCH] percpu data: only iterate over possible CPUs
>
> This sched.c bit breaks Xen, and probably also other architectures
> that have CPU hotplug. I suspect the reason is that during early
> bootup only the boot CPU is online, so nothing initialises the
> runqueues for CPUs that are brought up afterwards.
>
> I suspect we can get rid of this problem quite easily by moving
> runqueue initialisation to init_idle()...

Please fix Xen to match include/linux/cpumask.h documentation that says :

/*
* The following particular system cpumasks and operations manage
* possible, present and online cpus. Each of them is a fixed size
* bitmap of size NR_CPUS.
*
* #ifdef CONFIG_HOTPLUG_CPU
* cpu_possible_map - all NR_CPUS bits set
* cpu_present_map - has bit 'cpu' set iff cpu is populated
* cpu_online_map - has bit 'cpu' set iff cpu available to scheduler
* #else
* cpu_possible_map - has bit 'cpu' set iff cpu is populated
* cpu_present_map - copy of cpu_possible_map
* cpu_online_map - has bit 'cpu' set iff cpu available to scheduler
* #endif
*/



Eric

2006-02-09 04:46:04

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Eric Dumazet <[email protected]> wrote:
>
> Andrew Morton a ?crit :
> > Andrew Morton <[email protected]> wrote:
> >> Users of __GENERIC_PER_CPU definitely need cpu_possible_map to be initialised
> >> by the time setup_per_cpu_areas() is called,
> >
> > err, they'll need it once Eric's
> > dont-waste-percpu-memory-on-not-possible-CPUs patch is merged..
> >
> >> so I think it makes sense to
> >> say "thou shalt initialise cpu_possible_map in setup_arch()".
> >>
> >> I guess Xen isn't doing that. Can it be made to?
> >
> > Lame fix: cpu_possible_map = (1<<NR_CPUS)-1 in setup_arch().
>
> I dont understand why this HOTPLUG stuff is problematic for Xen (or other
> arch) : If CONFIG_HOTPLUG_CPU is configured, then the map should be preset to
> CPU_MASK_ALL.

Presumably not all architectures are doing that. And some of the problems
we've had are with CONFIG_HOTPLUG_CPU=n.

> Its even documented in line 332 of include/linux/cpumask.h
>
> * #ifdef CONFIG_HOTPLUG_CPU
> * cpu_possible_map - all NR_CPUS bits set

That seems a quite bad idea. If we know which CPUs are possible we should
populate cpu_possible_map correctly, whether or not CONFIG_HOTPLUG_CPU is
set. Setting all the bits like that wastes memory and wastes CPU cycles.

<greps>

That comment came from the tender pinkies of [email protected], although I suspect
it was just a transliteration of then-current practice.

2006-02-09 04:46:33

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Eric Dumazet wrote:
> Rik van Riel a ?crit :
>
>> On Sun, 5 Feb 2006, Linux Kernel Mailing List wrote:
>>
>>> [PATCH] percpu data: only iterate over possible CPUs
>>
>>
>> This sched.c bit breaks Xen, and probably also other architectures
>> that have CPU hotplug. I suspect the reason is that during early
>> bootup only the boot CPU is online, so nothing initialises the
>> runqueues for CPUs that are brought up afterwards.
>>
>> I suspect we can get rid of this problem quite easily by moving
>> runqueue initialisation to init_idle()...
>
>
> Please fix Xen to match include/linux/cpumask.h documentation that says :
>
> /*
> * The following particular system cpumasks and operations manage
> * possible, present and online cpus. Each of them is a fixed size
> * bitmap of size NR_CPUS.
> *
> * #ifdef CONFIG_HOTPLUG_CPU
> * cpu_possible_map - all NR_CPUS bits set

Note that this shouldn't have to be all NR_CPUs if the platform
can determine all possible hotpluggable CPUs.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-02-09 04:57:25

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Andrew wrote:
> That comment came from the tender pinkies of [email protected], although I suspect
> it was just a transliteration of then-current practice.

You give my pinkie more credit than is its due.

That comment looks bogus.

Hmmm ... what should it be ...

* #ifdef CONFIG_HOTPLUG_CPU
* cpu_possible_map - has bit 'cpu' set iff cpu could be populated

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-02-09 16:08:22

by Nathan Lynch

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Andrew Morton wrote:
> Eric Dumazet <[email protected]> wrote:
> >
> > Andrew Morton a ?crit :
> > > Andrew Morton <[email protected]> wrote:
> > >> Users of __GENERIC_PER_CPU definitely need cpu_possible_map to be initialised
> > >> by the time setup_per_cpu_areas() is called,
> > >
> > > err, they'll need it once Eric's
> > > dont-waste-percpu-memory-on-not-possible-CPUs patch is merged..
> > >
> > >> so I think it makes sense to
> > >> say "thou shalt initialise cpu_possible_map in setup_arch()".
> > >>
> > >> I guess Xen isn't doing that. Can it be made to?
> > >
> > > Lame fix: cpu_possible_map = (1<<NR_CPUS)-1 in setup_arch().
> >
> > I dont understand why this HOTPLUG stuff is problematic for Xen (or other
> > arch) : If CONFIG_HOTPLUG_CPU is configured, then the map should be preset to
> > CPU_MASK_ALL.
>
> Presumably not all architectures are doing that.

powerpc/ppc64, for instance, determines the number of possible cpus
from information exported by firmware (and I'm mystified as to why
other platforms don't do this). So it's typical to have a kernel an a
pSeries partition with NR_CPUS=128, but cpu_possible_map = 0xff.


> > Its even documented in line 332 of include/linux/cpumask.h
> >
> > * #ifdef CONFIG_HOTPLUG_CPU
> > * cpu_possible_map - all NR_CPUS bits set
>
> That seems a quite bad idea. If we know which CPUs are possible we should
> populate cpu_possible_map correctly, whether or not CONFIG_HOTPLUG_CPU is
> set. Setting all the bits like that wastes memory and wastes CPU cycles.

Yes, that comment is wrong or outdated.

2006-02-09 16:13:35

by Heiko Carstens

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

> > Presumably not all architectures are doing that.
> powerpc/ppc64, for instance, determines the number of possible cpus
> from information exported by firmware (and I'm mystified as to why
> other platforms don't do this). So it's typical to have a kernel an a
> pSeries partition with NR_CPUS=128, but cpu_possible_map = 0xff.

Simply because there is no such interface on s390. The only thing we know
for sure is that if we are running under z/VM the user is free to
configure up to 63 additional virtual cpus on the fly...

Heiko

2006-02-09 16:40:02

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

On Thu, 9 Feb 2006, Heiko Carstens wrote:

> Simply because there is no such interface on s390. The only thing we know
> for sure is that if we are running under z/VM the user is free to
> configure up to 63 additional virtual cpus on the fly...

Xen is in a similar situation.

--
All Rights Reversed

2006-02-09 16:59:34

by Nathan Lynch

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Heiko Carstens wrote:
> > > Presumably not all architectures are doing that.
> > powerpc/ppc64, for instance, determines the number of possible cpus
> > from information exported by firmware (and I'm mystified as to why
> > other platforms don't do this). So it's typical to have a kernel an a
> > pSeries partition with NR_CPUS=128, but cpu_possible_map = 0xff.
>
> Simply because there is no such interface on s390. The only thing we know
> for sure is that if we are running under z/VM the user is free to
> configure up to 63 additional virtual cpus on the fly...

My "mystified" parenthetical was not meant as a criticism of
arch/!powerpc, but of the platform implementations themselves, for not
making such an interface available.

2006-02-09 17:05:50

by Ashok Raj

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

On Thu, Feb 09, 2006 at 08:08:09AM -0800, Nathan Lynch wrote:
>
> powerpc/ppc64, for instance, determines the number of possible cpus
> from information exported by firmware (and I'm mystified as to why
> other platforms don't do this). So it's typical to have a kernel an a
> pSeries partition with NR_CPUS=128, but cpu_possible_map = 0xff.
>

Iam assuming in the above ase, cpu_possible_map == cpu_present_map and both
dont change after the OS is booted. v.s in platforms capable of hot-plug
cpus present could grow up to cpu_possible_map (max) when a new cpu is
newly added, that wasnt even present at boot time.

The problem was with ACPI just simply looking at the namespace doesnt
exactly give us an idea of how many processors are possible in this platform.

The reason is we could get more added via SSDT dynamically.

on x86_64, we used the number of disabled cpus reported in MADT at boot time
as an indicator to set cpu_possible_map. So if you had 4 sockets, but just
2 populated, the BIOS would set the missing CPUS with disabled flag. We
simply count the number of disabled CPUs and add to possible map.

Andi added documentation to cover that as well in
Documentation/x86_64/cpu-hotplug-spec as a guideline for BIOS folks.


--
Cheers,
Ashok Raj
- Open Source Technology Center

2006-02-09 17:23:57

by Nathan Lynch

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Ashok Raj wrote:
> On Thu, Feb 09, 2006 at 08:08:09AM -0800, Nathan Lynch wrote:
> >
> > powerpc/ppc64, for instance, determines the number of possible cpus
> > from information exported by firmware (and I'm mystified as to why
> > other platforms don't do this). So it's typical to have a kernel an a
> > pSeries partition with NR_CPUS=128, but cpu_possible_map = 0xff.
> >
>
> Iam assuming in the above ase, cpu_possible_map == cpu_present_map and both
> dont change after the OS is booted. v.s in platforms capable of hot-plug
> cpus present could grow up to cpu_possible_map (max) when a new cpu is
> newly added, that wasnt even present at boot time.

No, cpu_present_map need not equal cpu_possible_map. Of course
cpu_possible_map doesn't change after boot, but cpu_present_map can.
Apart from that, I don't really understand what you're trying to say
here.

2006-02-09 17:37:36

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

On Thu, Feb 09, 2006 at 05:13:31PM +0100, Heiko Carstens wrote:
> > > Presumably not all architectures are doing that.
> > powerpc/ppc64, for instance, determines the number of possible cpus
> > from information exported by firmware (and I'm mystified as to why
> > other platforms don't do this). So it's typical to have a kernel an a
> > pSeries partition with NR_CPUS=128, but cpu_possible_map = 0xff.
>
> Simply because there is no such interface on s390. The only thing we know
> for sure is that if we are running under z/VM the user is free to
> configure up to 63 additional virtual cpus on the fly...

x86-64 had the same problem, but we now require that you
boot with additional_cpus=... for how many you want. Default is 0
(used to be half available CPUs but that lead to confusion)

Also the firmware has a way now to give this information automatically.

You can of course make it default to 64, but it will cost you
several MB of memory in usually unused copies of per cpu data.

-Andi

2006-02-09 18:07:35

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Ashok Raj <[email protected]> wrote:
>
> The problem was with ACPI just simply looking at the namespace doesnt
> exactly give us an idea of how many processors are possible in this platform.

We need to fix this asap - the performance penalty for HOTPLUG_CPU=y,
NR_CPUS=lots will be appreciable.

Do any x86 platforms actually support CPU hotplug?

Does the ACPI problem which you describe occur with present-CPUs,
or only with possible-but-not-present ones?

2006-02-09 19:00:33

by Ashok Raj

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

On Thu, Feb 09, 2006 at 10:04:29AM -0800, Andrew Morton wrote:

>
> We need to fix this asap - the performance penalty for HOTPLUG_CPU=y,
> NR_CPUS=lots will be appreciable.
>
> Do any x86 platforms actually support CPU hotplug?

Hi Andrew,

logical cpu hotplug (i.e onlining and offlining) can be done on any
system. No hw or BIOS support is required. (this is what smp suspend/resume
folks use)

I remember Natalie from Unisys mentioned they have one system which is
ACPI based and supports physical cpu hotplug, but the BIOS is old and
didnt support the hotplug notify via ACPI. They probably have some other
sysmgmt way to interact and initiate the hotplug.

Iam aware of couple more that use ia64 NUMA type hw as well.
(I dont think i can announce for them:-) ).

>
> Does the ACPI problem which you describe occur with present-CPUs,
> or only with possible-but-not-present ones?

Describing present cpus is not problem.

Only knowing possible-but-not-present upfront is an issue.

logical-cpu-hotplug only: cpu_present_map == cpu_possible_map always
physical-cpu-hotplug: At boot, cpu_present_map is a subset of possible_map.

Think its best to NOT set cpu_present_map to MASK_ALL as its being proposed,
but let the arch/platform code figure out early enough to set possible_map
accurately for that platform. If a platform has no way to determine it,
then it could use cmdline like what x86_64 introduced (additional_cpus=)
to overcome that.

--
Cheers,
Ashok Raj
- Open Source Technology Center

2006-02-09 20:40:38

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Ashok Raj <[email protected]> wrote:
>
> >
> > Does the ACPI problem which you describe occur with present-CPUs,
> > or only with possible-but-not-present ones?
>
> Describing present cpus is not problem.
>
> Only knowing possible-but-not-present upfront is an issue.
>
> logical-cpu-hotplug only: cpu_present_map == cpu_possible_map always
> physical-cpu-hotplug: At boot, cpu_present_map is a subset of possible_map.
>
> Think its best to NOT set cpu_present_map to MASK_ALL as its being proposed,
> but let the arch/platform code figure out early enough to set possible_map
> accurately for that platform. If a platform has no way to determine it,
> then it could use cmdline like what x86_64 introduced (additional_cpus=)
> to overcome that.

There is no proposal to change cpu_present_map.

The problem is cpu_possible_map. That is presently being initialised to
CPU_MASK_ALL, which adversely affects perfoermance. An NR_CPUS=16 kernel
on a 2-way presently has cpu_possible_map=0xffff, which will hurt.

The proposal is this:


From: Andrew Morton <[email protected]>

Initialising cpu_possible_map to all-ones with CONFIG_HOTPLUG_CPU means that

a) All for_each_cpu() loops will iterate across all NR_CPUS CPUs, rather
than over possible ones. That can be quite expensive.

b) Soon we'll be allocating per-cpu areas only for possible CPUs. So with
CPU_MASK_ALL, we'll be wasting memory.

I also switched voyager over to not use CPU_MASK_ALL in the non-CPU-hotplug
case. Will that break it?

I note that parisc is also using CPU_MASK_ALL. Suggest that it stop doing
that.

Cc: James Bottomley <[email protected]>
Cc: Kyle McMartin <[email protected]>
Cc: Paul Jackson <[email protected]>
Cc: Ashok Raj <[email protected]>
Cc: Zwane Mwaikambo <[email protected]>
Cc: Paul Jackson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

arch/i386/kernel/smpboot.c | 4 ----
arch/i386/mach-voyager/voyager_smp.c | 2 +-
2 files changed, 1 insertion(+), 5 deletions(-)

diff -puN arch/i386/kernel/smpboot.c~x86-dont-initialise-cpu_possible_map-to-all-ones arch/i386/kernel/smpboot.c
--- devel/arch/i386/kernel/smpboot.c~x86-dont-initialise-cpu_possible_map-to-all-ones 2006-02-09 01:11:55.000000000 -0800
+++ devel-akpm/arch/i386/kernel/smpboot.c 2006-02-09 01:12:24.000000000 -0800
@@ -87,11 +87,7 @@ EXPORT_SYMBOL(cpu_online_map);
cpumask_t cpu_callin_map;
cpumask_t cpu_callout_map;
EXPORT_SYMBOL(cpu_callout_map);
-#ifdef CONFIG_HOTPLUG_CPU
-cpumask_t cpu_possible_map = CPU_MASK_ALL;
-#else
cpumask_t cpu_possible_map;
-#endif
EXPORT_SYMBOL(cpu_possible_map);
static cpumask_t smp_commenced_mask;

diff -puN arch/i386/mach-voyager/voyager_smp.c~x86-dont-initialise-cpu_possible_map-to-all-ones arch/i386/mach-voyager/voyager_smp.c
--- devel/arch/i386/mach-voyager/voyager_smp.c~x86-dont-initialise-cpu_possible_map-to-all-ones 2006-02-09 01:11:55.000000000 -0800
+++ devel-akpm/arch/i386/mach-voyager/voyager_smp.c 2006-02-09 01:12:43.000000000 -0800
@@ -240,7 +240,7 @@ static cpumask_t smp_commenced_mask = CP
cpumask_t cpu_callin_map = CPU_MASK_NONE;
cpumask_t cpu_callout_map = CPU_MASK_NONE;
EXPORT_SYMBOL(cpu_callout_map);
-cpumask_t cpu_possible_map = CPU_MASK_ALL;
+cpumask_t cpu_possible_map = CPU_MASK_NONE;
EXPORT_SYMBOL(cpu_possible_map);

/* The per processor IRQ masks (these are usually kept in sync) */
_

2006-02-09 21:06:47

by Ashok Raj

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

On Thu, Feb 09, 2006 at 12:37:29PM -0800, Andrew Morton wrote:
>
> There is no proposal to change cpu_present_map.
>
> The problem is cpu_possible_map. That is presently being initialised to
> CPU_MASK_ALL, which adversely affects perfoermance. An NR_CPUS=16 kernel
> on a 2-way presently has cpu_possible_map=0xffff, which will hurt.

Think i miss typed earlier. What you proposed is the correct solution.

I will followup with similar change on ia64 as well, (currently we do this
in smp_build_cpu_map()). And add something similar to what we did for
x86_64.

2006-02-10 10:03:56

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

On Thursday 09 February 2006 19:04, Andrew Morton wrote:
> Ashok Raj <[email protected]> wrote:
> >
> > The problem was with ACPI just simply looking at the namespace doesnt
> > exactly give us an idea of how many processors are possible in this platform.
>
> We need to fix this asap - the performance penalty for HOTPLUG_CPU=y,
> NR_CPUS=lots will be appreciable.

What is this performance penalty exactly?

It wastes quite some memory (each possible CPU needs 32K of memory which
adds quickly up), but it shouldn't impact other CPU use.

>
> Do any x86 platforms actually support CPU hotplug?

Xen does. And it's needed for suspend/resume on normal x86 now.

-Andi


2006-02-10 10:05:32

by Heiko Carstens

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

> > > powerpc/ppc64, for instance, determines the number of possible cpus
> > > from information exported by firmware (and I'm mystified as to why
> > > other platforms don't do this). So it's typical to have a kernel an a
> > > pSeries partition with NR_CPUS=128, but cpu_possible_map = 0xff.
> >
> > Simply because there is no such interface on s390. The only thing we know
> > for sure is that if we are running under z/VM the user is free to
> > configure up to 63 additional virtual cpus on the fly...
>
> x86-64 had the same problem, but we now require that you
> boot with additional_cpus=... for how many you want. Default is 0
> (used to be half available CPUs but that lead to confusion)

So introducing the additional_cpus kernel parameter seems to be the way
to go (for XEN probably too). Even though it seems to be a bit odd if the
user specifies both maxcpus=... and additional_cpus=...

Heiko

2006-02-10 10:13:47

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

[putting some Xen people into cc]

On Friday 10 February 2006 11:05, Heiko Carstens wrote:
> > > > powerpc/ppc64, for instance, determines the number of possible cpus
> > > > from information exported by firmware (and I'm mystified as to why
> > > > other platforms don't do this). So it's typical to have a kernel an a
> > > > pSeries partition with NR_CPUS=128, but cpu_possible_map = 0xff.
> > >
> > > Simply because there is no such interface on s390. The only thing we know
> > > for sure is that if we are running under z/VM the user is free to
> > > configure up to 63 additional virtual cpus on the fly...
> >
> > x86-64 had the same problem, but we now require that you
> > boot with additional_cpus=... for how many you want. Default is 0
> > (used to be half available CPUs but that lead to confusion)
>
> So introducing the additional_cpus kernel parameter seems to be the way
> to go (for XEN probably too). Even though it seems to be a bit odd if the
> user specifies both maxcpus=... and additional_cpus=...

With additional_cpus you don't need maxcpus. They are added together.

Also for Xen it's probably not needed by default, but the hypervisor can somehow
pass it (it doesn't make sense to have more vcpus than real cpus
and that should be relatively small number, so the memory waste
shouldn't be that bad).

On x86-64 there is also a new firmware
way to specify it, but current machines don't support that yet.

-Andi

2006-02-10 10:46:38

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Andi Kleen <[email protected]> wrote:
>
> On Thursday 09 February 2006 19:04, Andrew Morton wrote:
> > Ashok Raj <[email protected]> wrote:
> > >
> > > The problem was with ACPI just simply looking at the namespace doesnt
> > > exactly give us an idea of how many processors are possible in this platform.
> >
> > We need to fix this asap - the performance penalty for HOTPLUG_CPU=y,
> > NR_CPUS=lots will be appreciable.
>
> What is this performance penalty exactly?

All those for_each_cpu() loops will hit NR_CPUS cachelines instead of
hweight(cpu_possible_map) cachelines.

> It wastes quite some memory (each possible CPU needs 32K of memory which
> adds quickly up), but it shouldn't impact other CPU use.
>
> >
> > Do any x86 platforms actually support CPU hotplug?
>
> Xen does.

yup.

> And it's needed for suspend/resume on normal x86 now.

True.

2006-02-10 11:11:03

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Andrew Morton a ?crit :
> Andi Kleen <[email protected]> wrote:
>> On Thursday 09 February 2006 19:04, Andrew Morton wrote:
>>> Ashok Raj <[email protected]> wrote:
>>>> The problem was with ACPI just simply looking at the namespace doesnt
>>>> exactly give us an idea of how many processors are possible in this platform.
>>> We need to fix this asap - the performance penalty for HOTPLUG_CPU=y,
>>> NR_CPUS=lots will be appreciable.
>> What is this performance penalty exactly?
>
> All those for_each_cpu() loops will hit NR_CPUS cachelines instead of
> hweight(cpu_possible_map) cachelines.

You mean NR_CPUS bits, mostly all included in a single cacheline, and even in
a single long word :) for most cases (NR_CPUS <= 32 or 64)



2006-02-10 11:28:34

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Eric Dumazet <[email protected]> wrote:
>
> Andrew Morton a ?crit :
> > Andi Kleen <[email protected]> wrote:
> >> On Thursday 09 February 2006 19:04, Andrew Morton wrote:
> >>> Ashok Raj <[email protected]> wrote:
> >>>> The problem was with ACPI just simply looking at the namespace doesnt
> >>>> exactly give us an idea of how many processors are possible in this platform.
> >>> We need to fix this asap - the performance penalty for HOTPLUG_CPU=y,
> >>> NR_CPUS=lots will be appreciable.
> >> What is this performance penalty exactly?
> >
> > All those for_each_cpu() loops will hit NR_CPUS cachelines instead of
> > hweight(cpu_possible_map) cachelines.
>
> You mean NR_CPUS bits, mostly all included in a single cacheline, and even in
> a single long word :) for most cases (NR_CPUS <= 32 or 64)
>

No, I mean cachelines:

static void recalc_bh_state(void)
{
int i;
int tot = 0;

if (__get_cpu_var(bh_accounting).ratelimit++ < 4096)
return;
__get_cpu_var(bh_accounting).ratelimit = 0;
for_each_cpu(i)
tot += per_cpu(bh_accounting, i).nr;

That's going to hit NR_CPUS cachelines even on a 2-way.

Or am I missing something really obvious here?

(Probably the most expensive ones will be get_page_state() and friends.
And argh, they're still hardwired to CPU_MASK_ALL).

2006-02-10 11:27:48

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Andrew Morton <[email protected]> wrote:
>
> (Probably the most expensive ones will be get_page_state() and friends.
> And argh, they're still hardwired to CPU_MASK_ALL).
>

No, we mask it with cpu_online_map. Phew.

2006-02-10 12:13:23

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

On Friday 10 February 2006 11:42, Andrew Morton wrote:
> Andi Kleen <[email protected]> wrote:
> >
> > On Thursday 09 February 2006 19:04, Andrew Morton wrote:
> > > Ashok Raj <[email protected]> wrote:
> > > >
> > > > The problem was with ACPI just simply looking at the namespace doesnt
> > > > exactly give us an idea of how many processors are possible in this platform.
> > >
> > > We need to fix this asap - the performance penalty for HOTPLUG_CPU=y,
> > > NR_CPUS=lots will be appreciable.
> >
> > What is this performance penalty exactly?
>
> All those for_each_cpu() loops will hit NR_CPUS cachelines instead of
> hweight(cpu_possible_map) cachelines.

But are there any in real fast paths? iirc they are mostly in initialization,
where it doesn't matter too much.

-Andi

2006-02-10 14:18:23

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

--- a/fs/buffer.c 2006-02-10 15:08:21.000000000 +0100
+++ b/fs/buffer.c 2006-02-10 15:47:55.000000000 +0100
@@ -3138,7 +3138,7 @@
if (__get_cpu_var(bh_accounting).ratelimit++ < 4096)
return;
__get_cpu_var(bh_accounting).ratelimit = 0;
- for_each_cpu(i)
+ for_each_online_cpu(i)
tot += per_cpu(bh_accounting, i).nr;
buffer_heads_over_limit = (tot > max_buffer_heads);
}
@@ -3187,6 +3187,9 @@
brelse(b->bhs[i]);
b->bhs[i] = NULL;
}
+ get_cpu_var(bh_accounting).nr += per_cpu(bh_accounting, cpu).nr ;
+ per_cpu(bh_accounting, cpu).nr = 0;
+ put_cpu_var(bh_accounting);
}

static int buffer_cpu_notify(struct notifier_block *self,


Attachments:
buffer.patch (651.00 B)

2006-02-10 19:10:12

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Andi Kleen <[email protected]> wrote:
>
> On Friday 10 February 2006 11:42, Andrew Morton wrote:
> > Andi Kleen <[email protected]> wrote:
> > >
> > > On Thursday 09 February 2006 19:04, Andrew Morton wrote:
> > > > Ashok Raj <[email protected]> wrote:
> > > > >
> > > > > The problem was with ACPI just simply looking at the namespace doesnt
> > > > > exactly give us an idea of how many processors are possible in this platform.
> > > >
> > > > We need to fix this asap - the performance penalty for HOTPLUG_CPU=y,
> > > > NR_CPUS=lots will be appreciable.
> > >
> > > What is this performance penalty exactly?
> >
> > All those for_each_cpu() loops will hit NR_CPUS cachelines instead of
> > hweight(cpu_possible_map) cachelines.
>
> But are there any in real fast paths? iirc they are mostly in initialization,
> where it doesn't matter too much.
>

Could be so.

I just added one to percpu_counters though. And it'd be a pain to
introduce a cpu notifier for each and every percpu_counter.



From: Andrew Morton <[email protected]>

Implement percpu_counter_sum(). This is a more accurate but slower version of
percpu_counter_read_positive().

We need this for Alex's speedup-ext3_statfs patch. Otherwise it would be too
inaccurate on large CPU counts.

Cc: Ravikiran G Thirumalai <[email protected]>
Cc: Alex Tomas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

include/linux/percpu_counter.h | 6 ++++++
mm/swap.c | 25 +++++++++++++++++++++++--
2 files changed, 29 insertions(+), 2 deletions(-)

diff -puN include/linux/percpu_counter.h~percpu_counter_sum include/linux/percpu_counter.h
--- devel/include/linux/percpu_counter.h~percpu_counter_sum 2006-02-08 13:34:08.000000000 -0800
+++ devel-akpm/include/linux/percpu_counter.h 2006-02-08 13:34:08.000000000 -0800
@@ -39,6 +39,7 @@ static inline void percpu_counter_destro
}

void percpu_counter_mod(struct percpu_counter *fbc, long amount);
+long percpu_counter_sum(struct percpu_counter *fbc);

static inline long percpu_counter_read(struct percpu_counter *fbc)
{
@@ -92,6 +93,11 @@ static inline long percpu_counter_read_p
return fbc->count;
}

+static inline long percpu_counter_sum(struct percpu_counter *fbc)
+{
+ return percpu_counter_read_positive(fbc);
+}
+
#endif /* CONFIG_SMP */

static inline void percpu_counter_inc(struct percpu_counter *fbc)
diff -puN mm/swap.c~percpu_counter_sum mm/swap.c
--- devel/mm/swap.c~percpu_counter_sum 2006-02-08 13:34:08.000000000 -0800
+++ devel-akpm/mm/swap.c 2006-02-08 13:34:08.000000000 -0800
@@ -491,13 +491,34 @@ void percpu_counter_mod(struct percpu_co
if (count >= FBC_BATCH || count <= -FBC_BATCH) {
spin_lock(&fbc->lock);
fbc->count += count;
+ *pcount = 0;
spin_unlock(&fbc->lock);
- count = 0;
+ } else {
+ *pcount = count;
}
- *pcount = count;
put_cpu();
}
EXPORT_SYMBOL(percpu_counter_mod);
+
+/*
+ * Add up all the per-cpu counts, return the result. This is a more accurate
+ * but much slower version of percpu_counter_read_positive()
+ */
+long percpu_counter_sum(struct percpu_counter *fbc)
+{
+ long ret;
+ int cpu;
+
+ spin_lock(&fbc->lock);
+ ret = fbc->count;
+ for_each_cpu(cpu) {
+ long *pcount = per_cpu_ptr(fbc->counters, cpu);
+ ret += *pcount;
+ }
+ spin_unlock(&fbc->lock);
+ return ret < 0 ? 0 : ret;
+}
+EXPORT_SYMBOL(percpu_counter_sum);
#endif

/*
_

2006-02-11 00:11:00

by Nathan Lynch

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Eric Dumazet wrote:
>
> [PATCH] HOTPLUG_CPU : avoid hitting too many cachelines in recalc_bh_state()
>
> Instead of using for_each_cpu(i), we can use for_each_online_cpu(i) : The
> difference matters if HOTPUG_CPU=y
>
> When a CPU goes offline (ie removed from online map), it might have a non
> null bh_accounting.nr, so this patch adds a transfert of this counter to an
> online CPU counter.
>
> We already have a hotcpu_notifier, (function buffer_cpu_notify()), where we
> can do this bh_accounting.nr transfert.
>
> Signed-off-by: Eric Dumazet <[email protected]>

> --- a/fs/buffer.c 2006-02-10 15:08:21.000000000 +0100
> +++ b/fs/buffer.c 2006-02-10 15:47:55.000000000 +0100
> @@ -3138,7 +3138,7 @@
> if (__get_cpu_var(bh_accounting).ratelimit++ < 4096)
> return;
> __get_cpu_var(bh_accounting).ratelimit = 0;
> - for_each_cpu(i)
> + for_each_online_cpu(i)
> tot += per_cpu(bh_accounting, i).nr;
> buffer_heads_over_limit = (tot > max_buffer_heads);
> }
> @@ -3187,6 +3187,9 @@
> brelse(b->bhs[i]);
> b->bhs[i] = NULL;
> }
> + get_cpu_var(bh_accounting).nr += per_cpu(bh_accounting, cpu).nr ;
> + per_cpu(bh_accounting, cpu).nr = 0;
> + put_cpu_var(bh_accounting);
> }

But now there is a window between the time the cpu is marked offline
and the time its bh_accounting.nr is moved to another cpu. So
recalc_bh_state could fail to set buffer_heads_over_limit when it
should.

Maybe it's not worth worrying about? I don't really know this code,
just thought I would point it out.

2006-02-11 00:28:12

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Nathan Lynch <[email protected]> wrote:
>
> Eric Dumazet wrote:
> >
> > [PATCH] HOTPLUG_CPU : avoid hitting too many cachelines in recalc_bh_state()
> >
> > Instead of using for_each_cpu(i), we can use for_each_online_cpu(i) : The
> > difference matters if HOTPUG_CPU=y
> >
> > When a CPU goes offline (ie removed from online map), it might have a non
> > null bh_accounting.nr, so this patch adds a transfert of this counter to an
> > online CPU counter.
> >
> > We already have a hotcpu_notifier, (function buffer_cpu_notify()), where we
> > can do this bh_accounting.nr transfert.
> >
> > Signed-off-by: Eric Dumazet <[email protected]>
>
> > --- a/fs/buffer.c 2006-02-10 15:08:21.000000000 +0100
> > +++ b/fs/buffer.c 2006-02-10 15:47:55.000000000 +0100
> > @@ -3138,7 +3138,7 @@
> > if (__get_cpu_var(bh_accounting).ratelimit++ < 4096)
> > return;
> > __get_cpu_var(bh_accounting).ratelimit = 0;
> > - for_each_cpu(i)
> > + for_each_online_cpu(i)
> > tot += per_cpu(bh_accounting, i).nr;
> > buffer_heads_over_limit = (tot > max_buffer_heads);
> > }
> > @@ -3187,6 +3187,9 @@
> > brelse(b->bhs[i]);
> > b->bhs[i] = NULL;
> > }
> > + get_cpu_var(bh_accounting).nr += per_cpu(bh_accounting, cpu).nr ;
> > + per_cpu(bh_accounting, cpu).nr = 0;
> > + put_cpu_var(bh_accounting);
> > }
>
> But now there is a window between the time the cpu is marked offline
> and the time its bh_accounting.nr is moved to another cpu. So
> recalc_bh_state could fail to set buffer_heads_over_limit when it
> should.

Yes, there is a wee race there, but the consequences of a glitch or small
inaccuracy in buffer_heads_over_limit will be transient and negligibe.

2006-02-11 14:49:37

by Heiko Carstens

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

> > > x86-64 had the same problem, but we now require that you
> > > boot with additional_cpus=... for how many you want. Default is 0
> > > (used to be half available CPUs but that lead to confusion)
> >
> > So introducing the additional_cpus kernel parameter seems to be the way
> > to go (for XEN probably too). Even though it seems to be a bit odd if the
> > user specifies both maxcpus=... and additional_cpus=...
>
> With additional_cpus you don't need maxcpus. They are added together.

How does x86_64 manage to get 'additional_cpus' parsed early enough? As far
as I can see this is done when parse_args() in start_kernel() gets called,
but that's after you need the parameter in prefill_possible_map().
IMHO that should be an early_param and you would need to call
parse_early_param() from setup_arch(). But then again, maybe I got it all
wrong.

But the more interesting question is: what do you do if the command line
contains both additional_cpus and maxcpus. I was just trying to make some
sense of this, but the result is questionable.
I ended up with a cpu_possible_map that has 'present cpus' +
'additional_cpus' bits set. And in smp_prepare_cpus I make sure that
cpu_present_map has not more than max_cpus bits set.

At least that doesn't break the current semantics of the maxcpus parameter.
But we're still wasting memory, since it would make sense that the
cpu_possible_map shouldn't have more than max_cpus bits set.

Heiko

2006-02-11 18:05:07

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

On Sat, Feb 11, 2006 at 03:49:29PM +0100, Heiko Carstens wrote:
> > > > x86-64 had the same problem, but we now require that you
> > > > boot with additional_cpus=... for how many you want. Default is 0
> > > > (used to be half available CPUs but that lead to confusion)
> > >
> > > So introducing the additional_cpus kernel parameter seems to be the way
> > > to go (for XEN probably too). Even though it seems to be a bit odd if the
> > > user specifies both maxcpus=... and additional_cpus=...
> >
> > With additional_cpus you don't need maxcpus. They are added together.
>
> How does x86_64 manage to get 'additional_cpus' parsed early enough? As far
> as I can see this is done when parse_args() in start_kernel() gets called,
> but that's after you need the parameter in prefill_possible_map().
> IMHO that should be an early_param and you would need to call
> parse_early_param() from setup_arch(). But then again, maybe I got it all
> wrong.

Yes, you're right - it's added too late to the map right now.
I will fix that. There are no earlyparams unfortunately, except for a
big hack in setup.c

> But the more interesting question is: what do you do if the command line
> contains both additional_cpus and maxcpus. I was just trying to make some
> sense of this, but the result is questionable.
> I ended up with a cpu_possible_map that has 'present cpus' +
> 'additional_cpus' bits set. And in smp_prepare_cpus I make sure that
> cpu_present_map has not more than max_cpus bits set.
>
> At least that doesn't break the current semantics of the maxcpus parameter.
> But we're still wasting memory, since it would make sense that the
> cpu_possible_map shouldn't have more than max_cpus bits set.

Yes, maybe it should be a early parameter too. But frankly I see
maxcpus more as a debugging hack or workarouno. I don't think it matters
much if it's not as efficient as it could be.

-Andi