2006-05-09 05:36:06

by Mike Kravetz

[permalink] [raw]
Subject: Add SYSTEM_BOOTING_KMALLOC_AVAIL system_state

There are a few places that check the system_state variable to
determine if they should use the bootmem or kmalloc allocator.
However, this is not accurate as system_state transitions from
SYSTEM_BOOTING to SYSTEM_RUNNING well after the bootmem allocator
is no longer usable. Introduce the SYSTEM_BOOTING_KMALLOC_AVAIL
state which indicates the kmalloc allocator is available for use.

Signed-off-by: Mike Kravetz <[email protected]>

diff -Naupr linux-2.6.17-rc3-mm1/arch/arm/kernel/setup.c linux-2.6.17-rc3-mm1.work/arch/arm/kernel/setup.c
--- linux-2.6.17-rc3-mm1/arch/arm/kernel/setup.c 2006-05-09 03:17:57.000000000 +0000
+++ linux-2.6.17-rc3-mm1.work/arch/arm/kernel/setup.c 2006-05-09 04:22:12.000000000 +0000
@@ -378,7 +378,7 @@ void cpu_init(void)
BUG();
}

- if (system_state == SYSTEM_BOOTING)
+ if (system_state < SYSTEM_RUNNING)
dump_cpu_info(cpu);

/*
diff -Naupr linux-2.6.17-rc3-mm1/arch/powerpc/kernel/smp.c linux-2.6.17-rc3-mm1.work/arch/powerpc/kernel/smp.c
--- linux-2.6.17-rc3-mm1/arch/powerpc/kernel/smp.c 2006-04-27 02:19:25.000000000 +0000
+++ linux-2.6.17-rc3-mm1.work/arch/powerpc/kernel/smp.c 2006-05-09 04:25:28.000000000 +0000
@@ -540,7 +540,7 @@ int __devinit start_secondary(void *unus
if (smp_ops->take_timebase)
smp_ops->take_timebase();

- if (system_state > SYSTEM_BOOTING)
+ if (system_state >= SYSTEM_RUNNING)
snapshot_timebase();

spin_lock(&call_lock);
diff -Naupr linux-2.6.17-rc3-mm1/include/linux/kernel.h linux-2.6.17-rc3-mm1.work/include/linux/kernel.h
--- linux-2.6.17-rc3-mm1/include/linux/kernel.h 2006-05-09 03:18:05.000000000 +0000
+++ linux-2.6.17-rc3-mm1.work/include/linux/kernel.h 2006-05-09 04:33:12.000000000 +0000
@@ -189,6 +189,7 @@ extern void add_taint(unsigned);
/* Values used for system_state */
extern enum system_states {
SYSTEM_BOOTING,
+ SYSTEM_BOOTING_KMALLOC_AVAIL,
SYSTEM_RUNNING,
SYSTEM_HALT,
SYSTEM_POWER_OFF,
diff -Naupr linux-2.6.17-rc3-mm1/init/main.c linux-2.6.17-rc3-mm1.work/init/main.c
--- linux-2.6.17-rc3-mm1/init/main.c 2006-05-09 03:18:05.000000000 +0000
+++ linux-2.6.17-rc3-mm1.work/init/main.c 2006-05-09 04:59:03.000000000 +0000
@@ -513,6 +513,7 @@ asmlinkage void __init start_kernel(void
cpuset_init_early();
mem_init();
kmem_cache_init();
+ system_state = SYSTEM_BOOTING_KMALLOC_AVAIL;
setup_per_cpu_pageset();
numa_policy_init();
if (late_time_init)
diff -Naupr linux-2.6.17-rc3-mm1/kernel/sched.c linux-2.6.17-rc3-mm1.work/kernel/sched.c
--- linux-2.6.17-rc3-mm1/kernel/sched.c 2006-05-09 03:18:05.000000000 +0000
+++ linux-2.6.17-rc3-mm1.work/kernel/sched.c 2006-05-09 04:28:15.000000000 +0000
@@ -5917,7 +5917,7 @@ static void calibrate_migration_costs(co
-1
#endif
);
- if (system_state == SYSTEM_BOOTING) {
+ if (system_state < SYSTEM_RUNNING) {
printk("migration_cost=");
for (distance = 0; distance <= max_distance; distance++) {
if (distance)
diff -Naupr linux-2.6.17-rc3-mm1/kernel/sys.c linux-2.6.17-rc3-mm1.work/kernel/sys.c
--- linux-2.6.17-rc3-mm1/kernel/sys.c 2006-05-09 03:18:05.000000000 +0000
+++ linux-2.6.17-rc3-mm1.work/kernel/sys.c 2006-05-09 04:27:15.000000000 +0000
@@ -259,7 +259,7 @@ int blocking_notifier_chain_register(str
* not yet working and interrupts must remain disabled. At
* such times we must not call down_write().
*/
- if (unlikely(system_state == SYSTEM_BOOTING))
+ if (unlikely(system_state < SYSTEM_RUNNING))
return notifier_chain_register(&nh->head, n);

down_write(&nh->rwsem);
@@ -290,7 +290,7 @@ int blocking_notifier_chain_unregister(s
* not yet working and interrupts must remain disabled. At
* such times we must not call down_write().
*/
- if (unlikely(system_state == SYSTEM_BOOTING))
+ if (unlikely(system_state < SYSTEM_RUNNING))
return notifier_chain_unregister(&nh->head, n);

down_write(&nh->rwsem);
diff -Naupr linux-2.6.17-rc3-mm1/mm/page_alloc.c linux-2.6.17-rc3-mm1.work/mm/page_alloc.c
--- linux-2.6.17-rc3-mm1/mm/page_alloc.c 2006-05-09 03:18:05.000000000 +0000
+++ linux-2.6.17-rc3-mm1.work/mm/page_alloc.c 2006-05-09 05:31:33.000000000 +0000
@@ -1782,7 +1782,7 @@ static int __meminit __build_all_zonelis

void __meminit build_all_zonelists(void)
{
- if (system_state == SYSTEM_BOOTING) {
+ if (system_state < SYSTEM_RUNNING) {
__build_all_zonelists(0);
cpuset_init_current_mems_allowed();
} else {
@@ -2136,7 +2136,7 @@ int zone_wait_table_init(struct zone *zo
alloc_size = zone->wait_table_hash_nr_entries
* sizeof(wait_queue_head_t);

- if (system_state == SYSTEM_BOOTING) {
+ if (system_state < SYSTEM_RUNNING) {
zone->wait_table = (wait_queue_head_t *)
alloc_bootmem_node(pgdat, alloc_size);
} else {
diff -Naupr linux-2.6.17-rc3-mm1/mm/sparse.c linux-2.6.17-rc3-mm1.work/mm/sparse.c
--- linux-2.6.17-rc3-mm1/mm/sparse.c 2006-05-09 03:18:05.000000000 +0000
+++ linux-2.6.17-rc3-mm1.work/mm/sparse.c 2006-05-09 04:29:56.000000000 +0000
@@ -32,7 +32,7 @@ static struct mem_section *sparse_index_
unsigned long array_size = SECTIONS_PER_ROOT *
sizeof(struct mem_section);

- if (system_state == SYSTEM_RUNNING)
+ if (system_state >= SYSTEM_BOOTING_KMALLOC_AVAIL)
section = kmalloc_node(array_size, GFP_KERNEL, nid);
else
section = alloc_bootmem_node(NODE_DATA(nid), array_size);
diff -Naupr linux-2.6.17-rc3-mm1/mm/vmscan.c linux-2.6.17-rc3-mm1.work/mm/vmscan.c
--- linux-2.6.17-rc3-mm1/mm/vmscan.c 2006-05-09 03:18:05.000000000 +0000
+++ linux-2.6.17-rc3-mm1.work/mm/vmscan.c 2006-05-09 04:29:00.000000000 +0000
@@ -1477,7 +1477,7 @@ int kswapd_run(int nid)
pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
if (IS_ERR(pgdat->kswapd)) {
/* failure at boot is fatal */
- BUG_ON(system_state == SYSTEM_BOOTING);
+ BUG_ON(system_state < SYSTEM_RUNNING);
printk("Failed to start kswapd on node %d\n",nid);
ret = -1;
}


2006-05-09 05:52:31

by Andrew Morton

[permalink] [raw]
Subject: Re: Add SYSTEM_BOOTING_KMALLOC_AVAIL system_state

Mike Kravetz <[email protected]> wrote:
>
> There are a few places that check the system_state variable to
> determine if they should use the bootmem or kmalloc allocator.
> However, this is not accurate as system_state transitions from
> SYSTEM_BOOTING to SYSTEM_RUNNING well after the bootmem allocator
> is no longer usable. Introduce the SYSTEM_BOOTING_KMALLOC_AVAIL
> state which indicates the kmalloc allocator is available for use.

Let's not do this - system_state is getting out of control.

How about some private boolean in slab.c, and some special allocation
function like

void __init *alloc_memory_early(size_t size, gfp_t gfp_flags)
{
if (slab_is_available)
return kmalloc(size, gfp_flags);
return alloc_bootmem(size);
}

?

2006-05-09 06:01:10

by Christoph Lameter

[permalink] [raw]
Subject: Re: Add SYSTEM_BOOTING_KMALLOC_AVAIL system_state

On Mon, 8 May 2006, Andrew Morton wrote:

> > SYSTEM_BOOTING to SYSTEM_RUNNING well after the bootmem allocator
> > is no longer usable. Introduce the SYSTEM_BOOTING_KMALLOC_AVAIL
> > state which indicates the kmalloc allocator is available for use.
>
> Let's not do this - system_state is getting out of control.
>
> How about some private boolean in slab.c, and some special allocation
> function like
>
> void __init *alloc_memory_early(size_t size, gfp_t gfp_flags)
> {
> if (slab_is_available)
> return kmalloc(size, gfp_flags);
> return alloc_bootmem(size);
> }

You may use g_cpucache_up to check the state of the slab allocator.

2006-05-09 17:21:16

by Dave Hansen

[permalink] [raw]
Subject: Re: Add SYSTEM_BOOTING_KMALLOC_AVAIL system_state

On Mon, 2006-05-08 at 22:49 -0700, Andrew Morton wrote:
> Mike Kravetz <[email protected]> wrote:
> >
> > There are a few places that check the system_state variable to
> > determine if they should use the bootmem or kmalloc allocator.
> > However, this is not accurate as system_state transitions from
> > SYSTEM_BOOTING to SYSTEM_RUNNING well after the bootmem allocator
> > is no longer usable. Introduce the SYSTEM_BOOTING_KMALLOC_AVAIL
> > state which indicates the kmalloc allocator is available for use.
>
> Let's not do this - system_state is getting out of control.
>
> How about some private boolean in slab.c, and some special allocation
> function like
>
> void __init *alloc_memory_early(size_t size, gfp_t gfp_flags)
> {
> if (slab_is_available)
> return kmalloc(size, gfp_flags);
> return alloc_bootmem(size);
> }

One issue with that approach is that you can't use it for larger
allocations (which we have a lot of at boot-time). Would it be OK to
fall back to the raw page allocator for things where kmalloc() fails?
Oh, and do we want to make it explicitly NUMA aware?

-- Dave

2006-05-09 17:48:21

by Mike Kravetz

[permalink] [raw]
Subject: Re: Add SYSTEM_BOOTING_KMALLOC_AVAIL system_state

On Tue, May 09, 2006 at 10:20:04AM -0700, Dave Hansen wrote:
> On Mon, 2006-05-08 at 22:49 -0700, Andrew Morton wrote:
> >
> > How about some private boolean in slab.c, and some special allocation
> > function like
> >
> > void __init *alloc_memory_early(size_t size, gfp_t gfp_flags)
> > {
> > if (slab_is_available)
> > return kmalloc(size, gfp_flags);
> > return alloc_bootmem(size);
> > }
>
> One issue with that approach is that you can't use it for larger
> allocations (which we have a lot of at boot-time). Would it be OK to
> fall back to the raw page allocator for things where kmalloc() fails?
> Oh, and do we want to make it explicitly NUMA aware?

Well, I am making it NUMA aware as the first identified user of such
a routine does want to make node specific allocations. I haven't thought
about the 'large' allocations.

Any thoughts about also including free_memory_early() routines? It
seems like a good idea. However, I'm somewhat afraid of these gaining
widespread use and the potential mis-use. Using free in the same routine
as the alloc (such as in error paths) is generally ok. But, trying to
do a free at a later time requires the attention of the coder. Just don't
want people to think these wrappers will be a substitute for that analysis.

--
Mike

2006-05-09 21:07:10

by Mike Kravetz

[permalink] [raw]
Subject: [PATCH] alloc_memory_early() routines

Add alloc_memory_early() routines so that code needing to allocate
space during boot need not be aware of which allocator is in use.
Includes first use of such routine by the SPARSEMEM code.

I did not include support for 'large' allocations as suggested by
Dave, or corresponding free_memory_early() routines. The only
immediate need is for NUMA/node aware allocation. Others can be
added as the needs arise.

Signed-off-by: Mike Kravetz <[email protected]>

diff -Naupr linux-2.6.17-rc3-mm1/include/linux/slab.h linux-2.6.17-rc3-mm1.work3/include/linux/slab.h
--- linux-2.6.17-rc3-mm1/include/linux/slab.h 2006-05-03 22:19:15.000000000 +0000
+++ linux-2.6.17-rc3-mm1.work3/include/linux/slab.h 2006-05-09 21:09:37.000000000 +0000
@@ -150,10 +150,12 @@ static inline void *kcalloc(size_t n, si

extern void kfree(const void *);
extern unsigned int ksize(const void *);
+extern void *alloc_memory_early(size_t size, gfp_t flags);

#ifdef CONFIG_NUMA
extern void *kmem_cache_alloc_node(kmem_cache_t *, gfp_t flags, int node);
extern void *kmalloc_node(size_t size, gfp_t flags, int node);
+extern void *alloc_memory_early_node(size_t size, gfp_t flags, int node);
#else
static inline void *kmem_cache_alloc_node(kmem_cache_t *cachep, gfp_t flags, int node)
{
@@ -163,6 +165,10 @@ static inline void *kmalloc_node(size_t
{
return kmalloc(size, flags);
}
+static inline void *alloc_memory_early_node(size_t size, gfp_t flags, int node)
+{
+ return alloc_memory_early(size, flags);
+}
#endif

extern int FASTCALL(kmem_cache_reap(int));
diff -Naupr linux-2.6.17-rc3-mm1/mm/slab.c linux-2.6.17-rc3-mm1.work3/mm/slab.c
--- linux-2.6.17-rc3-mm1/mm/slab.c 2006-05-03 22:19:16.000000000 +0000
+++ linux-2.6.17-rc3-mm1.work3/mm/slab.c 2006-05-09 21:38:23.000000000 +0000
@@ -108,6 +108,7 @@
#include <linux/mempolicy.h>
#include <linux/mutex.h>
#include <linux/rtmutex.h>
+#include <linux/bootmem.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -3266,8 +3267,24 @@ void *kmalloc_node(size_t size, gfp_t fl
return kmem_cache_alloc_node(cachep, flags, node);
}
EXPORT_SYMBOL(kmalloc_node);
+
+void * __init alloc_memory_early_node(size_t size, gfp_t flags, int node)
+{
+ if (g_cpucache_up == FULL)
+ return kmalloc_node(size, flags, node);
+ else
+ return alloc_bootmem_node(NODE_DATA(node), size);
+}
#endif

+void * __init alloc_memory_early(size_t size, gfp_t flags)
+{
+ if (g_cpucache_up == FULL)
+ return kmalloc(size, flags);
+ else
+ return alloc_bootmem(size);
+}
+
/**
* kmalloc - allocate memory
* @size: how many bytes of memory are required.
diff -Naupr linux-2.6.17-rc3-mm1/mm/sparse.c linux-2.6.17-rc3-mm1.work3/mm/sparse.c
--- linux-2.6.17-rc3-mm1/mm/sparse.c 2006-05-03 22:19:16.000000000 +0000
+++ linux-2.6.17-rc3-mm1.work3/mm/sparse.c 2006-05-09 20:37:51.000000000 +0000
@@ -32,11 +32,7 @@ static struct mem_section *sparse_index_
unsigned long array_size = SECTIONS_PER_ROOT *
sizeof(struct mem_section);

- if (system_state == SYSTEM_RUNNING)
- section = kmalloc_node(array_size, GFP_KERNEL, nid);
- else
- section = alloc_bootmem_node(NODE_DATA(nid), array_size);
-
+ section = alloc_memory_early_node(array_size, GFP_KERNEL, nid);
if (section)
memset(section, 0, array_size);

2006-05-10 07:09:48

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH] alloc_memory_early() routines

Hi Mike,

On 5/10/06, Mike Kravetz <[email protected]> wrote:
> diff -Naupr linux-2.6.17-rc3-mm1/mm/slab.c linux-2.6.17-rc3-mm1.work3/mm/slab.c
> --- linux-2.6.17-rc3-mm1/mm/slab.c 2006-05-03 22:19:16.000000000 +0000
> +++ linux-2.6.17-rc3-mm1.work3/mm/slab.c 2006-05-09 21:38:23.000000000 +0000

[snip]

> +void * __init alloc_memory_early_node(size_t size, gfp_t flags, int node)
> +{
> + if (g_cpucache_up == FULL)
> + return kmalloc_node(size, flags, node);
> + else
> + return alloc_bootmem_node(NODE_DATA(node), size);
> +}

I'd prefer you put this in mm/bootmem.c and added a

int slab_is_available(void)
{
return g_cpucache_up == FULL;
}

to mm/slab.c instead.

Pekka

2006-05-10 07:12:14

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] alloc_memory_early() routines

On Wed, 10 May 2006, Pekka Enberg wrote:

> > +void * __init alloc_memory_early_node(size_t size, gfp_t flags, int node)
> > +{
> > + if (g_cpucache_up == FULL)
> > + return kmalloc_node(size, flags, node);
> > + else
> > + return alloc_bootmem_node(NODE_DATA(node), size);
> > +}
>
> I'd prefer you put this in mm/bootmem.c and added a
>
> int slab_is_available(void)
> {
> return g_cpucache_up == FULL;
> }
>
> to mm/slab.c instead.

Does slab not available mean that bootmem can be used?

2006-05-10 07:16:54

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH] alloc_memory_early() routines

On Wed, 10 May 2006, Pekka Enberg wrote:
> > > +void * __init alloc_memory_early_node(size_t size, gfp_t flags, int node)
> > > +{
> > > + if (g_cpucache_up == FULL)
> > > + return kmalloc_node(size, flags, node);
> > > + else
> > > + return alloc_bootmem_node(NODE_DATA(node), size);
> > > +}
> >
> > I'd prefer you put this in mm/bootmem.c and added a
> >
> > int slab_is_available(void)
> > {
> > return g_cpucache_up == FULL;
> > }
> >
> > to mm/slab.c instead.

On Wed, 10 May 2006, Christoph Lameter wrote:
> Does slab not available mean that bootmem can be used?

Yes.

Pekka

2006-05-10 07:19:59

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH] alloc_memory_early() routines

On 5/10/06, Mike Kravetz <[email protected]> wrote:
> I did not include support for 'large' allocations as suggested by
> Dave, or corresponding free_memory_early() routines. The only
> immediate need is for NUMA/node aware allocation. Others can be
> added as the needs arise.

Sorry if this was already discussed, but you're not supposed to free
the memory allocated by alloc_memory_early() ever? If so, please add a
kerneldoc stating that.

Pekka

2006-05-10 08:29:23

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] alloc_memory_early() routines

Pekka J Enberg <[email protected]> writes:

> On Wed, 10 May 2006, Pekka Enberg wrote:
> > > > +void * __init alloc_memory_early_node(size_t size, gfp_t flags, int node)
> > > > +{
> > > > + if (g_cpucache_up == FULL)
> > > > + return kmalloc_node(size, flags, node);
> > > > + else
> > > > + return alloc_bootmem_node(NODE_DATA(node), size);
> > > > +}
> > >
> > > I'd prefer you put this in mm/bootmem.c and added a
> > >
> > > int slab_is_available(void)
> > > {
> > > return g_cpucache_up == FULL;
> > > }
> > >
> > > to mm/slab.c instead.
>
> On Wed, 10 May 2006, Christoph Lameter wrote:
> > Does slab not available mean that bootmem can be used?
>
> Yes.


Actually it doesn't - in early boot up there is a phase where even bootmem
doesn't work yet.

-Andi

2006-05-10 16:15:58

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH] alloc_memory_early() routines

On Wed, May 10, 2006 at 12:11:59AM -0700, Christoph Lameter wrote:
> > I'd prefer you put this in mm/bootmem.c and added a
> >
> > int slab_is_available(void)
> > {
> > return g_cpucache_up == FULL;
> > }
> >
> > to mm/slab.c instead.
>
> Does slab not available mean that bootmem can be used?

I like the 'slab_is_available()' check. How about if we simply add
this routine and let the people doing the allocation determine what
allocator to use?

As has already been stated, slab not available does NOT imply that
bootmem can be used. Heck, on POWER there is even an allocator used
before bootmem. I doubt we could provide an 'intelligent' routine
to works in all cases. So, for right now we could/should just provide
the slab not available() check. There is only one piece of code in
SPARSEMEM that cares about this.

Sound reasonable?
--
Mike

2006-05-10 17:42:29

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH] alloc_memory_early() routines

On Wed, 10 May 2006, Mike Kravetz wrote:
> I like the 'slab_is_available()' check. How about if we simply add
> this routine and let the people doing the allocation determine what
> allocator to use?

I'm fine with that.

Pekka