2006-02-07 15:15:45

by Heiko Carstens

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

> tree 8c30052a0d7fadec37c785a42a71b28d0a9c5fcf
> parent cef5076987dd545ac74f4efcf1c962be8eac34b0
> author Eric Dumazet <[email protected]> Sun, 05 Feb 2006 15:27:36 -0800
> committer Linus Torvalds <[email protected]> Mon, 06 Feb 2006 03:06:51 -0800
>
> [PATCH] percpu data: only iterate over possible CPUs
>
> percpu_data blindly allocates bootmem memory to store NR_CPUS instances of
> cpudata, instead of allocating memory only for possible cpus.
>
> As a preparation for changing that, we need to convert various 0 -> NR_CPUS
> loops to use for_each_cpu().
>
> (The above only applies to users of asm-generic/percpu.h. powerpc has gone it
> alone and is presently only allocating memory for present CPUs, so it's
> currently corrupting memory).

This patch is broken since it replaces several loops that iterate NR_CPUS
times with for_each_cpu before cpu_possible_map is setup:

> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -379,7 +379,6 @@ static void __devinit fdtable_defer_list
> void __init files_defer_init(void)
> {
> int i;
> - /* Really early - can't use for_each_cpu */
> - for (i = 0; i < NR_CPUS; i++)
> + for_each_cpu(i)
> fdtable_defer_list_init(i);
> }

The old comment indicates it: called before smp_prepare_cpus gets called
which sets up cpu_possible_map.

> diff --git a/kernel/sched.c b/kernel/sched.c
> index f77f23f..839466f 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -6109,7 +6109,7 @@ void __init sched_init(void)
> runqueue_t *rq;
> int i, j, k;
>
> - for (i = 0; i < NR_CPUS; i++) {
> + for_each_cpu(i) {
> prio_array_t *array;

Same here.

I didn't check the rest, but it looks like we end up with a bit of
uninitialized stuff.

Heiko


2006-02-07 15:29:44

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

On Tue, Feb 07 2006, Heiko Carstens wrote:
> > tree 8c30052a0d7fadec37c785a42a71b28d0a9c5fcf
> > parent cef5076987dd545ac74f4efcf1c962be8eac34b0
> > author Eric Dumazet <[email protected]> Sun, 05 Feb 2006 15:27:36 -0800
> > committer Linus Torvalds <[email protected]> Mon, 06 Feb 2006 03:06:51 -0800
> >
> > [PATCH] percpu data: only iterate over possible CPUs
> >
> > percpu_data blindly allocates bootmem memory to store NR_CPUS instances of
> > cpudata, instead of allocating memory only for possible cpus.
> >
> > As a preparation for changing that, we need to convert various 0 -> NR_CPUS
> > loops to use for_each_cpu().
> >
> > (The above only applies to users of asm-generic/percpu.h. powerpc has gone it
> > alone and is presently only allocating memory for present CPUs, so it's
> > currently corrupting memory).
>
> This patch is broken since it replaces several loops that iterate NR_CPUS
> times with for_each_cpu before cpu_possible_map is setup:

Hrmpf, chicking and egg - we must not initialize data for unknown CPUs,
but we can't check since it's not setup.

To me it sounds really broken that core structures like that are not
setup before we init eg fs stuff.

--
Jens Axboe

2006-02-07 16:26:55

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Heiko Carstens a ?crit :
>> tree 8c30052a0d7fadec37c785a42a71b28d0a9c5fcf
>> parent cef5076987dd545ac74f4efcf1c962be8eac34b0
>> author Eric Dumazet <[email protected]> Sun, 05 Feb 2006 15:27:36 -0800
>> committer Linus Torvalds <[email protected]> Mon, 06 Feb 2006 03:06:51 -0800
>>
>> [PATCH] percpu data: only iterate over possible CPUs
>>
>> percpu_data blindly allocates bootmem memory to store NR_CPUS instances of
>> cpudata, instead of allocating memory only for possible cpus.
>>
>> As a preparation for changing that, we need to convert various 0 -> NR_CPUS
>> loops to use for_each_cpu().
>>
>> (The above only applies to users of asm-generic/percpu.h. powerpc has gone it
>> alone and is presently only allocating memory for present CPUs, so it's
>> currently corrupting memory).
>
> This patch is broken since it replaces several loops that iterate NR_CPUS
> times with for_each_cpu before cpu_possible_map is setup:

This patch assumes that cpu_possible_map is setup before setup_per_cpu_areas().

That sounds a reasonable assumption, but maybe not on your architecture ?

I dont think cpu_possible_map has to be filled at smp_prepare_cpus() time, but
long before.

On i386/x86_64/ia64, this is done from setup_arch() called from start_kernel()
just before setup_per_cpu_areas()

On powerpc it's done from setup_system(), called before start_kernel()

Eric


2006-02-07 16:43:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs



On Tue, 7 Feb 2006, Eric Dumazet wrote:
>
> This patch assumes that cpu_possible_map is setup before
> setup_per_cpu_areas().
>
> That sounds a reasonable assumption, but maybe not on your architecture ?

I have to say, it sounds not just like a reasonable assumption, but like
the only sane assumption that there _could_ be.

> I dont think cpu_possible_map has to be filled at smp_prepare_cpus() time, but
> long before.
>
> On i386/x86_64/ia64, this is done from setup_arch() called from start_kernel()
> just before setup_per_cpu_areas()
>
> On powerpc it's done from setup_system(), called before start_kernel()

It absolutely _has_ to be done from setup_arch() or earlier, as shown by
the fact that "setup_per_cpu_areas()" is the very next thing that
init/main.c calls (and clearly, that needs to know what CPU's are
possible).

ppc64 certainly calls it early enough, as does x86/x86-64/ia64. I don't
see anybody else doing it too late either.

Heiko, can you point to the "old comment" you mentioned in the email, or
the architecture that does this wrong?

Linus

2006-02-07 17:36:20

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

Linus Torvalds <[email protected]> wrote:
>
>
>
> On Tue, 7 Feb 2006, Eric Dumazet wrote:
> >
> > This patch assumes that cpu_possible_map is setup before
> > setup_per_cpu_areas().
> >
> > That sounds a reasonable assumption, but maybe not on your architecture ?
>
> I have to say, it sounds not just like a reasonable assumption, but like
> the only sane assumption that there _could_ be.
>
> > I dont think cpu_possible_map has to be filled at smp_prepare_cpus() time, but
> > long before.
> >
> > On i386/x86_64/ia64, this is done from setup_arch() called from start_kernel()
> > just before setup_per_cpu_areas()
> >
> > On powerpc it's done from setup_system(), called before start_kernel()
>
> It absolutely _has_ to be done from setup_arch() or earlier, as shown by
> the fact that "setup_per_cpu_areas()" is the very next thing that
> init/main.c calls (and clearly, that needs to know what CPU's are
> possible).
>
> ppc64 certainly calls it early enough, as does x86/x86-64/ia64. I don't
> see anybody else doing it too late either.
>
> Heiko, can you point to the "old comment" you mentioned in the email, or
> the architecture that does this wrong?
>

This one:

--- devel/fs/file.c~reduce-size-of-percpudata-and-make-sure-per_cpuobject 2006-02-04 23:27:17.000000000 -0800
+++ devel-akpm/fs/file.c 2006-02-04 23:27:17.000000000 -0800
@@ -379,7 +379,6 @@ static void __devinit fdtable_defer_list
void __init files_defer_init(void)
{
int i;
- /* Really early - can't use for_each_cpu */
- for (i = 0; i < NR_CPUS; i++)
+ for_each_cpu(i)
fdtable_defer_list_init(i);
}

And yes, me too - when I saw that comment disappear I checked and decided
that the comment was both wrong and undesirable.

2006-02-07 17:49:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs



On Tue, 7 Feb 2006, Andrew Morton wrote:
>
> This one:
>
> --- devel/fs/file.c~reduce-size-of-percpudata-and-make-sure-per_cpuobject 2006-02-04 23:27:17.000000000 -0800
> +++ devel-akpm/fs/file.c 2006-02-04 23:27:17.000000000 -0800
> @@ -379,7 +379,6 @@ static void __devinit fdtable_defer_list
> void __init files_defer_init(void)
> {
> int i;
> - /* Really early - can't use for_each_cpu */
> - for (i = 0; i < NR_CPUS; i++)
> + for_each_cpu(i)
> fdtable_defer_list_init(i);
> }
>
> And yes, me too - when I saw that comment disappear I checked and decided
> that the comment was both wrong and undesirable.

Ahh, yes. The comment is totally incorrect, we must have done the CPU
setup much too later a long long time ago ;)

Linus

2006-02-07 18:31:04

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

On Tue, Feb 07, 2006 at 09:48:41AM -0800, Linus Torvalds wrote:
>
>
> On Tue, 7 Feb 2006, Andrew Morton wrote:
> >
> > This one:
> >
> > --- devel/fs/file.c~reduce-size-of-percpudata-and-make-sure-per_cpuobject 2006-02-04 23:27:17.000000000 -0800
> > +++ devel-akpm/fs/file.c 2006-02-04 23:27:17.000000000 -0800
> > @@ -379,7 +379,6 @@ static void __devinit fdtable_defer_list
> > void __init files_defer_init(void)
> > {
> > int i;
> > - /* Really early - can't use for_each_cpu */
> > - for (i = 0; i < NR_CPUS; i++)
> > + for_each_cpu(i)
> > fdtable_defer_list_init(i);
> > }
> >
> > And yes, me too - when I saw that comment disappear I checked and decided
> > that the comment was both wrong and undesirable.
>
> Ahh, yes. The comment is totally incorrect, we must have done the CPU
> setup much too later a long long time ago ;)

One would think so, but I recall not all archs did that. Alpha for
example sets up cpu_possible_map in smp_prepare_cpus(). It however
makes more sense to fix the arch then use NR_CPUS, IMO.

Thanks
Dipankar

2006-02-07 18:44:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs



On Wed, 8 Feb 2006, Dipankar Sarma wrote:
>
> One would think so, but I recall not all archs did that. Alpha for
> example sets up cpu_possible_map in smp_prepare_cpus(). It however
> makes more sense to fix the arch then use NR_CPUS, IMO.

Ehh? alpha does it in setup_smp(), which in turn is called very early from
setup_arch().

Were you perhaps thinking of something else? Or am I just going blind and
confused?

Linus

2006-02-07 18:54:45

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

On Tue, Feb 07, 2006 at 10:43:55AM -0800, Linus Torvalds wrote:
>
>
> On Wed, 8 Feb 2006, Dipankar Sarma wrote:
> >
> > One would think so, but I recall not all archs did that. Alpha for
> > example sets up cpu_possible_map in smp_prepare_cpus(). It however
> > makes more sense to fix the arch then use NR_CPUS, IMO.
>
> Ehh? alpha does it in setup_smp(), which in turn is called very early from
> setup_arch().
>
> Were you perhaps thinking of something else? Or am I just going blind and
> confused?

I am looking at 2.6.16-rc1 and I don't see cpu_possible_map
being set in setup_smp(). That said, it seems alpha setup_smp()
probes for cpus there, so there is no reason why it cannot
be set there. I think it is wrong not to set cpu_possible_map
very early.

Or perhaps it got fixed later on, in which case, oh well, I need
to download more often. <sigh>.

Thanks
Dipankar

2006-02-07 19:13:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs



On Wed, 8 Feb 2006, Dipankar Sarma wrote:
>
> I am looking at 2.6.16-rc1 and I don't see cpu_possible_map
> being set in setup_smp()

You're right, my bad. I looked at setup_smp() and how it walked through
every CPU in the firmware, but it doesn't actually ever set the possible
map, it fills in just hwrpb_cpu_present_mask (which is then then only used
_later_ to set cpu_possible_map for some silly reason).

As far as I can tell, "hwrpb_cpu_present_mask" is just wrong, and the code
_should_ be using cpu_possible_map.

rth? Ivan?

Linus

2006-02-08 04:40:58

by Heiko Carstens

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

> > I am looking at 2.6.16-rc1 and I don't see cpu_possible_map
> > being set in setup_smp()
>
> You're right, my bad. I looked at setup_smp() and how it walked through
> every CPU in the firmware, but it doesn't actually ever set the possible
> map, it fills in just hwrpb_cpu_present_mask (which is then then only used
> _later_ to set cpu_possible_map for some silly reason).
>
> As far as I can tell, "hwrpb_cpu_present_mask" is just wrong, and the code
> _should_ be using cpu_possible_map.

We still have this one in init/main.c:

/* Sets up cpus_possible() */
smp_prepare_cpus(max_cpus);

That is actually why s390 is doing this in smp_prepare_cpus. We also use the
passed value of max_cpus to set the number of bits in cpu_possible_map
accordingly. This isn't possible anymore if this should be done in setup_arch.

So it looks like we have to switch to setup_arch and set NR_CPUS bits in
cpu_possible_map on s390.

Heiko

2006-02-08 08:55:10

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [PATCH] percpu data: only iterate over possible CPUs

On Tue, Feb 07, 2006 at 11:11:56AM -0800, Linus Torvalds wrote:
> You're right, my bad. I looked at setup_smp() and how it walked through
> every CPU in the firmware, but it doesn't actually ever set the possible
> map, it fills in just hwrpb_cpu_present_mask (which is then then only used
> _later_ to set cpu_possible_map for some silly reason).
>
> As far as I can tell, "hwrpb_cpu_present_mask" is just wrong, and the code
> _should_ be using cpu_possible_map.

Yep, it seems that we can get rid of hwrpb_cpu_present_mask.
The appended patch is only minimally tested (works on UP with SMP kernel).

Ivan.

--- linux/arch/alpha/kernel/smp.c.orig Wed Jan 18 02:09:27 2006
+++ linux/arch/alpha/kernel/smp.c Wed Feb 8 02:38:46 2006
@@ -73,9 +73,6 @@ cpumask_t cpu_online_map;

EXPORT_SYMBOL(cpu_online_map);

-/* cpus reported in the hwrpb */
-static unsigned long hwrpb_cpu_present_mask __initdata = 0;
-
int smp_num_probed; /* Internal processor count */
int smp_num_cpus = 1; /* Number that came online. */

@@ -442,7 +439,7 @@ setup_smp(void)
if ((cpu->flags & 0x1cc) == 0x1cc) {
smp_num_probed++;
/* Assume here that "whami" == index */
- hwrpb_cpu_present_mask |= (1UL << i);
+ cpu_set(i, cpu_possible_map);
cpu->pal_revision = boot_cpu_palrev;
}

@@ -453,12 +450,12 @@ setup_smp(void)
}
} else {
smp_num_probed = 1;
- hwrpb_cpu_present_mask = (1UL << boot_cpuid);
+ cpu_set(boot_cpuid, cpu_possible_map);
}
cpu_present_mask = cpumask_of_cpu(boot_cpuid);

printk(KERN_INFO "SMP: %d CPUs probed -- cpu_present_mask = %lx\n",
- smp_num_probed, hwrpb_cpu_present_mask);
+ smp_num_probed, cpu_possible_map.bits[0]);
}

/*
@@ -467,8 +464,6 @@ setup_smp(void)
void __init
smp_prepare_cpus(unsigned int max_cpus)
{
- int cpu_count, i;
-
/* Take care of some initial bookkeeping. */
memset(ipi_data, 0, sizeof(ipi_data));

@@ -486,19 +481,7 @@ smp_prepare_cpus(unsigned int max_cpus)

printk(KERN_INFO "SMP starting up secondaries.\n");

- cpu_count = 1;
- for (i = 0; (i < NR_CPUS) && (cpu_count < max_cpus); i++) {
- if (i == boot_cpuid)
- continue;
-
- if (((hwrpb_cpu_present_mask >> i) & 1) == 0)
- continue;
-
- cpu_set(i, cpu_possible_map);
- cpu_count++;
- }
-
- smp_num_cpus = cpu_count;
+ smp_num_cpus = smp_num_probed;
}

void __devinit