On Thu, 2009-06-11 at 19:59 +0000, Linux Kernel Mailing List wrote:
> Gitweb: http://git.kernel.org/linus/83b519e8b9572c319c8e0c615ee5dd7272856090
> Commit: 83b519e8b9572c319c8e0c615ee5dd7272856090
> Parent: c91c4773b334d4d3a6d44626dc2a558ad97b86f3
> Author: Pekka Enberg <[email protected]>
> AuthorDate: Wed Jun 10 19:40:04 2009 +0300
> Committer: Pekka Enberg <[email protected]>
> CommitDate: Thu Jun 11 19:15:56 2009 +0300
>
> slab: setup allocators earlier in the boot sequence
>
> This patch makes kmalloc() available earlier in the boot sequence so we can get
> rid of some bootmem allocations. The bulk of the changes are due to
> kmem_cache_init() being called with interrupts disabled, which requires
> some changes to allocator bootstrap code.
>
> Note: 32-bit x86 does a WP protect test in mem_init(), so we must set up
> traps before we call mem_init() during boot, as reported by Ingo Molnar:
This seems to explode in various places on powerpc :-(
It would have been nice if we had enough advance warning to actually fix
our archs too ... I had no idea it was going to be merged that soon. I
did plan to dig into this at some stage, but that was too soon. Maybe it
should have been a CONFIG option for a couple of -rc's?
But yeah, of course, only x86 matters, right? We can break everybody
else and fuck up bisection just for fun...
Cheers,
Ben.
> This seems to explode in various places on powerpc :-(
Main breakage is that slab now gets used a lot earlier than init_IRQ and
time_init(), but kmalloc() internally hard-enables interrupts when
__GFP_WAIT is passed (it should not, but that's another side effect of
the change; see below). So we have the decrementer (CPU timer) popping,
and it crashes on uninitialized data structures in the timer code.
The reason __GFP_WAIT is passed comes from various bits of init code we
have using a function we call alloc_maybe_bootmem(), which does kmalloc()
once the slab is initialized and alloc_bootmem() before that...
The problem is that we routinely call it with GFP_KERNEL, since we used
to know it would default to bootmem before IRQs can be enabled safely,
which is no longer the case.
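To make the failure mode concrete, here is a small userspace model of that alloc_maybe_bootmem() pattern (all names and flag values are made up for illustration; this is not the kernel code): once the slab comes up earlier in boot, a caller's GFP_KERNEL, wait bit included, flows straight through to the allocator.

```c
/* Userspace model (not kernel code) of the alloc_maybe_bootmem() pattern:
 * before the slab is up, fall back to "bootmem"; afterwards, forward the
 * caller's flags to "kmalloc" unchanged -- which is exactly how a
 * GFP_KERNEL from early init code reaches the slab with the wait bit set.
 */
#include <assert.h>
#include <stdlib.h>

#define FAKE_GFP_WAIT   0x10u
#define FAKE_GFP_KERNEL (FAKE_GFP_WAIT)  /* GFP_KERNEL implies __GFP_WAIT */

static int slab_up;              /* set once "kmem_cache_init()" has run */
static unsigned int last_flags;  /* records what "kmalloc" was called with */

static void *fake_kmalloc(size_t size, unsigned int flags)
{
	last_flags = flags;      /* in the real kernel, the wait bit here
				  * leads to local_irq_enable() */
	return malloc(size);
}

static void *alloc_maybe_bootmem(size_t size, unsigned int flags)
{
	if (slab_up)
		return fake_kmalloc(size, flags);
	return calloc(1, size);  /* stand-in for alloc_bootmem() */
}
```

Running the model with the slab "already up" shows the dangerous bit passing through untouched.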
At this stage, I'm tempted to rip it all out and just use kmalloc with
GFP_NOWAIT for now.
But it would be useful to have kmalloc -itself- remove __GFP_WAIT when
called too early, so that code doesn't have to do something different
depending on when it's called. In fact, we similarly need to remove
__GFP_IO/FS when in the suspend/resume sequence...
For example, we have code for allocating IRQ remapping related data
structures that can be called very early at init_IRQ() time, or later if
some driver or bus wants to register a cascaded IRQ controller.
Cheers,
Ben.
On Fri, 2009-06-12 at 13:56 +1000, Benjamin Herrenschmidt wrote:
> > This seems to explode in various places on powerpc :-(
>
> Main breakage is that slab now gets used a lot earlier than init_IRQ and
> time_init(), but kmalloc() internally hard-enables interrupts when
> __GFP_WAIT is passed (it should not, but that's another side effect of
> the change; see below). So we have the decrementer (CPU timer) popping,
> and it crashes on uninitialized data structures in the timer code.
>
> The reason GFP_WAIT is passed comes from various bits of init code we
> have using a function we call alloc_maybe_bootmem() which does kmalloc
> if initialized and alloc_bootmem() before...
.../...
And that's not the only problem. The next one is a bit more sneaky, and
I suppose it doesn't hit only us...
I'm not sure what the right fix is, other than going all the way to
having kmalloc() automagically mask out bits when called at the wrong
time.
Typically, what happens here is that ioremap() calls __get_vm_area(),
which blows up because it does a GFP_KERNEL allocation.
Now, that used to work because we have a trick for early ioremaps
(which the arch needs, since we do need to ioremap very early, for
things like accessing memory-mapped interrupt controllers or various
other low-level platform things).
What we do is define VMALLOC_END to be a variable (which we misnamed
ioremap_bot for historical reasons). If ioremap is called before
mem_init(), then we allocate virtual space by moving that variable down.
Otherwise, we use get_vm_area().
Now this is broken by the change, because after mem_init() we now use
get_vm_area(), which does a kmalloc(..., GFP_KERNEL), and that explodes
because it turns on interrupts way too early.
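A userspace sketch of that two-phase scheme (illustrative names and addresses; not the actual powerpc code):

```c
/* Model of the powerpc early-ioremap trick described above: before
 * mem_init(), virtual space is carved out by moving a VMALLOC_END-like
 * variable (ioremap_bot) downwards; afterwards get_vm_area() is used,
 * which in the kernel allocates with GFP_KERNEL -- the call that blows
 * up when reached with IRQs still disabled.
 */
#include <assert.h>
#include <stddef.h>

static unsigned long ioremap_bot = 0xfe000000ul;  /* acts as VMALLOC_END */
static int mem_init_done;

static unsigned long get_vm_area_stub(size_t size)
{
	/* In the kernel this kmallocs a vm_struct with GFP_KERNEL. */
	static unsigned long next = 0xf0000000ul;
	unsigned long v = next;
	next += size;
	return v;
}

static unsigned long early_ioremap_alloc(size_t size)
{
	if (!mem_init_done) {
		ioremap_bot -= size;  /* carve space by moving down */
		return ioremap_bot;
	}
	return get_vm_area_stub(size);
}
```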
I'll cook up a patch that defines a global bitmask of "forbidden" GFP
bits and see how things go.
Cheers,
Ben.
On Fri, 2009-06-12 at 14:25 +1000, Benjamin Herrenschmidt wrote:
> I'll cook up a patch that defines a global bitmask of "forbidden" GFP
> bits and see how things go.
From ad87215e01b257ccc1af64aa9d5776ace580dea3 Mon Sep 17 00:00:00 2001
From: Benjamin Herrenschmidt <[email protected]>
Date: Fri, 12 Jun 2009 15:03:47 +1000
Subject: [PATCH] Sanitize "gfp" flags during boot
With the recent shuffle of initialization order to move memory related
inits earlier, various subtle breakage was introduced in archs like
powerpc due to code somewhat assuming that GFP_KERNEL can be used as
soon as the allocators are up. This is not true because any __GFP_WAIT
allocation will cause interrupts to be enabled, which can be fatal if
it happens too early.
This isn't trivial to fix at every call site. For example, powerpc's
ioremap implementation needs to be callable early. For that, it uses two
different mechanisms to carve out virtual space: before memory init,
by moving down VMALLOC_END, and afterwards, by calling get_vm_area().
Unfortunately, the latter does GFP_KERNEL allocations, but we can't do
anything else because once vmalloc has been initialized, we can no
longer safely move VMALLOC_END to carve out space.
There are other examples where code can be called either very early
or later on, when devices are hot-plugged. It would be a major pain for
such code to have to "know" whether it's in a context where it should
use GFP_KERNEL or GFP_NOWAIT.
Finally, by having the ability to silently remove __GFP_WAIT from
allocations, we pave the way for suspend-to-RAM to use that feature
to also remove __GFP_IO from allocations done after the suspending of
devices has started. This is important because such allocations may hang
if devices on the swap-out path have been suspended while not-yet
suspended drivers don't know about it, and they may deadlock by getting
stuck in a kmalloc somewhere while holding a mutex, for example.
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
---
include/linux/gfp.h | 8 ++++++++
init/main.c | 5 +++++
mm/page_alloc.c | 5 +++++
mm/slab.c | 9 +++++++++
mm/slub.c | 3 +++
5 files changed, 30 insertions(+), 0 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0bbc15f..b0f7a22 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -99,6 +99,14 @@ struct vm_area_struct;
/* 4GB DMA on some platforms */
#define GFP_DMA32 __GFP_DMA32
+/* Illegal bits */
+extern gfp_t gfp_smellybits;
+
+static inline gfp_t gfp_sanitize(gfp_t gfp_flags)
+{
+ return gfp_flags & ~gfp_smellybits;
+}
+
/* Convert GFP flags to their corresponding migrate type */
static inline int allocflags_to_migratetype(gfp_t gfp_flags)
{
diff --git a/init/main.c b/init/main.c
index 5616661..bb812c1 100644
--- a/init/main.c
+++ b/init/main.c
@@ -539,6 +539,9 @@ void __init __weak thread_info_cache_init(void)
*/
static void __init mm_init(void)
{
+ /* Degrade everything into GFP_NOWAIT for now */
+ gfp_smellybits = __GFP_WAIT | __GFP_FS | __GFP_IO;
+
mem_init();
kmem_cache_init();
vmalloc_init();
@@ -634,6 +637,8 @@ asmlinkage void __init start_kernel(void)
printk(KERN_CRIT "start_kernel(): bug: interrupts were "
"enabled early\n");
early_boot_irqs_on();
+ /* GFP_KERNEL allocations are good to go now */
+ gfp_smellybits = 0;
local_irq_enable();
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 17d5f53..efde0d5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -77,6 +77,8 @@ int percpu_pagelist_fraction;
int pageblock_order __read_mostly;
#endif
+gfp_t gfp_smellybits;
+
static void __free_pages_ok(struct page *page, unsigned int order);
/*
@@ -1473,6 +1475,9 @@ __alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
unsigned long did_some_progress;
unsigned long pages_reclaimed = 0;
+ /* Sanitize flags so we don't enable irqs too early during boot */
+ gfp_mask = gfp_sanitize(gfp_mask);
+
lockdep_trace_alloc(gfp_mask);
might_sleep_if(wait);
diff --git a/mm/slab.c b/mm/slab.c
index f46b65d..87b166e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2791,6 +2791,9 @@ static int cache_grow(struct kmem_cache *cachep,
gfp_t local_flags;
struct kmem_list3 *l3;
+ /* Sanitize flags so we don't enable irqs too early during boot */
+ flags = gfp_sanitize(flags);
+
/*
* Be lazy and only check for valid flags here, keeping it out of the
* critical path in kmem_cache_alloc().
@@ -3212,6 +3215,9 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
void *obj = NULL;
int nid;
+ /* Sanitize flags so we don't enable irqs too early during boot */
+ flags = gfp_sanitize(flags);
+
if (flags & __GFP_THISNODE)
return NULL;
@@ -3434,6 +3440,9 @@ __cache_alloc(struct kmem_cache *cachep, gfp_t flags, void *caller)
unsigned long save_flags;
void *objp;
+ /* Sanitize flags so we don't enable irqs too early during boot */
+ flags = gfp_sanitize(flags);
+
lockdep_trace_alloc(flags);
if (slab_should_failslab(cachep, flags))
diff --git a/mm/slub.c b/mm/slub.c
index 3964d3c..5c646f7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1512,6 +1512,9 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
/* We handle __GFP_ZERO in the caller */
gfpflags &= ~__GFP_ZERO;
+ /* Sanitize flags so we don't enable irqs too early during boot */
+ gfpflags = gfp_sanitize(gfpflags);
+
if (!c->page)
goto new_slab;
--
1.6.1.2.14.gf26b5
Hi Benjamin,
[ First of all, sorry for the breakage and thank you for looking into
this! ]
On Fri, 12 Jun 2009, Benjamin Herrenschmidt wrote:
> > I'll cook up a patch that defines a global bitmask of "forbidden" GFP
> > bits and see how things go.
>
> From ad87215e01b257ccc1af64aa9d5776ace580dea3 Mon Sep 17 00:00:00 2001
> From: Benjamin Herrenschmidt <[email protected]>
> Date: Fri, 12 Jun 2009 15:03:47 +1000
> Subject: [PATCH] Sanitize "gfp" flags during boot
OK, I am not sure we actually need that. The thing is, no one is allowed
to use kmalloc() unless slab_is_available() returns true so we can just
grep for the latter and do something like the following patch. Does that
make powerpc boot nicely again? Ingo, I think this fixes the early irq
screams you were having too.
There's some more in s390 architecture code and some drivers (!), but I
left them out of this patch for now.
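For illustration, the call-site pattern being converted looks roughly like this userspace model (names and flag values are invented, not the kernel's):

```c
/* Model of the call-site pattern being grepped for: code that switches
 * from bootmem to slab via a slab_is_available()-style check must pass
 * a no-wait flag set while interrupts may still be disabled, instead of
 * GFP_KERNEL.
 */
#include <assert.h>
#include <stdlib.h>

#define FAKE_GFP_WAIT   0x10u
#define FAKE_GFP_NOWAIT 0x00u  /* no wait bit */

static int slab_available;
static unsigned int seen_flags;

static void *fake_kmalloc(size_t size, unsigned int flags)
{
	seen_flags = flags;
	return malloc(size);
}

static void *early_table_alloc(size_t size)
{
	if (slab_available)  /* was: fake_kmalloc(size, GFP_KERNEL) */
		return fake_kmalloc(size, FAKE_GFP_NOWAIT);
	return calloc(1, size);  /* stand-in for alloc_bootmem() */
}
```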
Pekka
From fdade1bf17b6717c0de2b3f7c6a7d7bd82fc46db Mon Sep 17 00:00:00 2001
From: Pekka Enberg <[email protected]>
Date: Fri, 12 Jun 2009 09:11:11 +0300
Subject: [PATCH] init: Use GFP_NOWAIT for early slab allocations
We set up slab allocators very early now, while interrupts can still be
disabled. Therefore, make sure call-sites that use slab_is_available() to
switch to slab during boot use GFP_NOWAIT.
Signed-off-by: Pekka Enberg <[email protected]>
---
include/linux/vmalloc.h | 1 +
kernel/params.c | 2 +-
kernel/profile.c | 6 +++---
mm/page_alloc.c | 2 +-
mm/page_cgroup.c | 4 ++--
mm/sparse-vmemmap.c | 2 +-
mm/sparse.c | 2 +-
mm/vmalloc.c | 18 ++++++++++++++++++
8 files changed, 28 insertions(+), 9 deletions(-)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index a43ebec..7bcb9d7 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -53,6 +53,7 @@ static inline void vmalloc_init(void)
extern void *vmalloc(unsigned long size);
extern void *vmalloc_user(unsigned long size);
extern void *vmalloc_node(unsigned long size, int node);
+extern void *vmalloc_node_boot(unsigned long size, int node);
extern void *vmalloc_exec(unsigned long size);
extern void *vmalloc_32(unsigned long size);
extern void *vmalloc_32_user(unsigned long size);
diff --git a/kernel/params.c b/kernel/params.c
index de273ec..5c239c3 100644
--- a/kernel/params.c
+++ b/kernel/params.c
@@ -227,7 +227,7 @@ int param_set_charp(const char *val, struct kernel_param *kp)
* don't need to; this mangled commandline is preserved. */
if (slab_is_available()) {
kp->perm |= KPARAM_KMALLOCED;
- *(char **)kp->arg = kstrdup(val, GFP_KERNEL);
+ *(char **)kp->arg = kstrdup(val, GFP_NOWAIT);
if (!kp->arg)
return -ENOMEM;
} else
diff --git a/kernel/profile.c b/kernel/profile.c
index 28cf26a..86ada09 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -112,16 +112,16 @@ int __ref profile_init(void)
prof_len = (_etext - _stext) >> prof_shift;
buffer_bytes = prof_len*sizeof(atomic_t);
- if (!alloc_cpumask_var(&prof_cpu_mask, GFP_KERNEL))
+ if (!alloc_cpumask_var(&prof_cpu_mask, GFP_NOWAIT))
return -ENOMEM;
cpumask_copy(prof_cpu_mask, cpu_possible_mask);
- prof_buffer = kzalloc(buffer_bytes, GFP_KERNEL);
+ prof_buffer = kzalloc(buffer_bytes, GFP_NOWAIT);
if (prof_buffer)
return 0;
- prof_buffer = alloc_pages_exact(buffer_bytes, GFP_KERNEL|__GFP_ZERO);
+ prof_buffer = alloc_pages_exact(buffer_bytes, GFP_NOWAIT|__GFP_ZERO);
if (prof_buffer)
return 0;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 17d5f53..7760ef9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2903,7 +2903,7 @@ int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
* To use this new node's memory, further consideration will be
* necessary.
*/
- zone->wait_table = vmalloc(alloc_size);
+ zone->wait_table = __vmalloc(alloc_size, GFP_NOWAIT, PAGE_KERNEL);
}
if (!zone->wait_table)
return -ENOMEM;
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 3dd4a90..c954e04 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -119,9 +119,9 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
if (slab_is_available()) {
base = kmalloc_node(table_size,
- GFP_KERNEL | __GFP_NOWARN, nid);
+ GFP_NOWAIT | __GFP_NOWARN, nid);
if (!base)
- base = vmalloc_node(table_size, nid);
+ base = vmalloc_node_boot(table_size, nid);
} else {
base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
table_size,
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index a13ea64..9df6d99 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -49,7 +49,7 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
/* If the main allocator is up use that, fallback to bootmem. */
if (slab_is_available()) {
struct page *page = alloc_pages_node(node,
- GFP_KERNEL | __GFP_ZERO, get_order(size));
+ GFP_NOWAIT | __GFP_ZERO, get_order(size));
if (page)
return page_address(page);
return NULL;
diff --git a/mm/sparse.c b/mm/sparse.c
index da432d9..dd558d2 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -63,7 +63,7 @@ static struct mem_section noinline __init_refok *sparse_index_alloc(int nid)
sizeof(struct mem_section);
if (slab_is_available())
- section = kmalloc_node(array_size, GFP_KERNEL, nid);
+ section = kmalloc_node(array_size, GFP_NOWAIT, nid);
else
section = alloc_bootmem_node(NODE_DATA(nid), array_size);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f8189a4..3bec46d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1559,6 +1559,24 @@ void *vmalloc_node(unsigned long size, int node)
}
EXPORT_SYMBOL(vmalloc_node);
+/**
+ * vmalloc_node_boot - allocate memory on a specific node during boot
+ * @size: allocation size
+ * @node: numa node
+ *
+ * Allocate enough pages to cover @size from the page level
+ * allocator and map them into contiguous kernel virtual space.
+ *
+ * For tight control over page level allocator and protection flags
+ * use __vmalloc() instead.
+ */
+void *vmalloc_node_boot(unsigned long size, int node)
+{
+ return __vmalloc_node(size, GFP_NOWAIT | __GFP_HIGHMEM, PAGE_KERNEL,
+ node, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(vmalloc_node_boot);
+
#ifndef PAGE_KERNEL_EXEC
# define PAGE_KERNEL_EXEC PAGE_KERNEL
#endif
--
1.6.0.4
On Fri, 2009-06-12 at 09:16 +0300, Pekka J Enberg wrote:
> OK, I am not sure we actually need that. The thing is, no one is allowed
> to use kmalloc() unless slab_is_available() returns true so we can just
> grep for the latter and do something like the following patch. Does that
> make powerpc boot nicely again? Ingo, I think this fixes the early irq
> screams you were having too.
>
> There's some more in s390 architecture code and some drivers (!) but I
> left them out from this patch for now.
I don't like that approach at all. Fixing all the call sites... we are
changing things all over the place, we'll certainly miss some, and
honestly, it's none of the business of things like vmalloc to know about
things like what kmalloc flags are valid and when...
Besides, by turning everything permanently to GFP_NOWAIT, you also
significantly increase the risk of failure of those allocations since
they can no longer ... wait :-) (And push things out to swap etc...)
I really believe this should be a slab-internal thing, which is what my
patch does to a certain extent. I.e., all callers need to care about is
KERNEL vs. ATOMIC and, in some cases, NOIO or similar for filesystems
etc., but I don't think all sorts of kernel subsystems, just because
they can be called early during boot, need to suddenly use GFP_NOWAIT
all the time.
That's why I much prefer my approach :-) (In addition to the fact that
it provides the basis for also fixing suspend/resume).
Cheers,
Ben.
On Fri, 2009-06-12 at 17:34 +1000, Benjamin Herrenschmidt wrote:
> I don't like that approach at all. Fixing all the call sites... we are
> changing things all over the place, we'll certainly miss some, and
> honestly, it's none of the business of things like vmalloc to know about
> things like what kmalloc flags are valid and when...
Oh, and btw, your patch alone doesn't fix powerpc, because it's missing
a whole bunch of GFP_KERNELs in the arch code... You would have to
grep the entire kernel for things that check slab_is_available(), and
even then you'll be missing some.
For example, slab_is_available() didn't always exist, and so in the
early days on powerpc we used a mem_init_done global that is set from
mem_init() (not perfect, but works in practice). And we still have code
using that to do the test.
Anyway, I think changing all the call sites is the wrong approach,
especially for things that can routinely be called after boot when
GFP_KERNEL is the right thing to do.
Cheers,
Ben.
Hi Ben,
On Fri, 2009-06-12 at 17:34 +1000, Benjamin Herrenschmidt wrote:
> I don't like that approach at all. Fixing all the call sites... we are
> changing things all over the place, we'll certainly miss some, and
> honestly, it's none of the business of things like vmalloc to know about
> things like what kmalloc flags are valid and when...
The call-sites I fixed up are all boot code AFAICT. And like I said,
we can't really _miss_ any of those places; they must be checking for
slab_is_available() _anyway_, otherwise they have no business using
kmalloc(). And note: all call-sites that _unconditionally_ use
kmalloc(GFP_KERNEL) are safe because they worked before.
On Fri, 2009-06-12 at 17:34 +1000, Benjamin Herrenschmidt wrote:
> Besides, by turning everything permanently to GFP_NOWAIT, you also
> significantly increase the risk of failure of those allocations since
> they can no longer ... wait :-) (And push things out to swap etc...)
Again, I audited the call-sites and they all should be boot-time code.
The only borderline case I could see is in s390 arch code, which is why
I dropped that hunk for now.
On Fri, 2009-06-12 at 17:34 +1000, Benjamin Herrenschmidt wrote:
> I really believe this should be a slab internal thing, which is what my
> patch does to a certain extent. IE. All callers need to care about is
> KERNEL vs. ATOMIC and in some cases, NOIO or similar for filesystems
> etc... but I don't think all sorts of kernel subsystems, because they
> can be called early during boot, need to suddenly use GFP_NOWAIT all the
> time.
>
> That's why I much prefer my approach :-) (In addition to the fact that
> it provides the basis for also fixing suspend/resume).
Sure, I think we can do what you want with the patch below.
But I still think we need my patch regardless. The call sites I
converted are all init code and should be using GFP_NOWAIT. Does it fix
your boot on powerpc?
Pekka
diff --git a/mm/slab.c b/mm/slab.c
index 9a90b00..722beb5 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2791,6 +2791,13 @@ static int cache_grow(struct kmem_cache *cachep,
offset *= cachep->colour_off;
+ /*
+ * Let's not wait if we're booting up or suspending, even if the user
+ * asks for it.
+ */
+ if (system_state != SYSTEM_RUNNING)
+ local_flags &= ~__GFP_WAIT;
+
if (local_flags & __GFP_WAIT)
local_irq_enable();
diff --git a/mm/slub.c b/mm/slub.c
index 65ffda5..f9a6bc8 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1547,6 +1547,13 @@ new_slab:
goto load_freelist;
}
+ /*
+ * Let's not wait if we're booting up or suspending, even if the user
+ * asks for it.
+ */
+ if (system_state != SYSTEM_RUNNING)
+ gfpflags &= ~__GFP_WAIT;
+
if (gfpflags & __GFP_WAIT)
local_irq_enable();
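The policy the two hunks above implement can be modeled in a few lines of userspace C (made-up flag values; `slab_alloc_path()` stands in for the real allocator paths):

```c
/* Minimal model of the slab-internal policy in the hunks above: while
 * system_state != SYSTEM_RUNNING, the wait bit is stripped before the
 * allocator decides whether to local_irq_enable().
 */
#include <assert.h>

#define FAKE_GFP_WAIT 0x10u

enum fake_system_state { SYSTEM_BOOTING, SYSTEM_RUNNING };
static enum fake_system_state system_state = SYSTEM_BOOTING;
static int irqs_enabled;

static void slab_alloc_path(unsigned int gfpflags)
{
	/* Refuse to wait during boot/suspend, even if the caller asks. */
	if (system_state != SYSTEM_RUNNING)
		gfpflags &= ~FAKE_GFP_WAIT;

	if (gfpflags & FAKE_GFP_WAIT)
		irqs_enabled = 1;  /* models local_irq_enable() */
}
```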
Hi Benjamin,
On Fri, 2009-06-12 at 17:34 +1000, Benjamin Herrenschmidt wrote:
> > I don't like that approach at all. Fixing all the call sites... we are
> > changing things all over the place, we'll certainly miss some, and
> > honestly, it's none of the business of things like vmalloc to know about
> > things like what kmalloc flags are valid and when...
>
> Oh and btw, your patch alone doesn't fix powerpc, because it's missing
> a whole bunch of GFP_KERNEL's in the arch code... You would have to
> grep the entire kernel for things that check slab_is_available() and
> even then you'll be missing some.
Ah, the patch is not against current git so, yeah, I missed some.
On Fri, 2009-06-12 at 17:39 +1000, Benjamin Herrenschmidt wrote:
> For example, slab_is_available() didn't always exist, and so in the
> early days on powerpc, we used a mem_init_done global that is set form
> mem_init() (not perfect but works in practice). And we still have code
> using that to do the test.
IMHO, that would be a bug :-). But anyway, see the other thread for my
suggestion how to do what you want in a slightly cleaner way.
Pekka
On Fri, Jun 12, 2009 at 10:45:45AM +0300, Pekka Enberg wrote:
> On Fri, 2009-06-12 at 17:34 +1000, Benjamin Herrenschmidt wrote:
> > I really believe this should be a slab internal thing, which is what my
> > patch does to a certain extent. IE. All callers need to care about is
> > KERNEL vs. ATOMIC and in some cases, NOIO or similar for filesystems
> > etc... but I don't think all sorts of kernel subsystems, because they
> > can be called early during boot, need to suddenly use GFP_NOWAIT all the
> > time.
> >
> > That's why I much prefer my approach :-) (In addition to the fact that
> > it provides the basis for also fixing suspend/resume).
>
> Sure, I think we can do what you want with the patch below.
I don't really like adding branches to the slab allocator like this.
Init code all needs to know what services are available, and this
includes the scheduler if it wants to do anything sleeping (including
sleeping slab allocations).
Core mm code is the last place to put in workarounds for broken
callers...
Hi Nick,
On Fri, Jun 12, 2009 at 10:45:45AM +0300, Pekka Enberg wrote:
> > On Fri, 2009-06-12 at 17:34 +1000, Benjamin Herrenschmidt wrote:
> > > I really believe this should be a slab internal thing, which is what my
> > > patch does to a certain extent. IE. All callers need to care about is
> > > KERNEL vs. ATOMIC and in some cases, NOIO or similar for filesystems
> > > etc... but I don't think all sorts of kernel subsystems, because they
> > > can be called early during boot, need to suddenly use GFP_NOWAIT all the
> > > time.
> > >
> > > That's why I much prefer my approach :-) (In addition to the fact that
> > > it provides the basis for also fixing suspend/resume).
> >
> > Sure, I think we can do what you want with the patch below.
On Fri, 2009-06-12 at 09:54 +0200, Nick Piggin wrote:
> I don't really like adding branches to slab allocator like this.
> init code all needs to know what services are available, and
> this includes the scheduler if it wants to do anything sleeping
> (including sleeping slab allocations).
>
> Core mm code is the last place to put in workarounds for broken
> callers...
Yes, the initialization code can be fixed to use GFP_NOWAIT. But it's
really the suspend case that makes me think the patch might be a good
idea. So the patch does not attempt to be a workaround for buggy
callers, but rather a change in policy: we simply refuse to wait during
bootup and suspend.
Pekka
On Fri, Jun 12, 2009 at 10:59:52AM +0300, Pekka Enberg wrote:
> Hi Nick,
>
> On Fri, 2009-06-12 at 09:54 +0200, Nick Piggin wrote:
> > I don't really like adding branches to slab allocator like this.
> > init code all needs to know what services are available, and
> > this includes the scheduler if it wants to do anything sleeping
> > (including sleeping slab allocations).
> >
> > Core mm code is the last place to put in workarounds for broken
> > callers...
>
> Yes, the initialization code can be fixed to use GFP_NOWAIT. But it's
> really the suspend case that makes me think the patch might be a good
> idea. So the patch does not attempt to be a workaround for buggy callers
> but rather a change in policy that we simply refuse to wait during
> bootup and suspend.
Fair enough, but this can be done right down in the synchronous
reclaim path in the page allocator. This will catch more cases
of code using the page allocator directly, and should be not
as hot as the slab allocator.
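For illustration only, Nick's suggestion of stripping the wait bit once in the page allocator's slow path could look roughly like the sketch below. The flag value, the enum, and the function name are stand-ins, not the kernel's actual definitions:

```c
#include <assert.h>

/* Stand-in values, not the kernel's real definitions. */
#define __GFP_WAIT 0x10u

enum system_states { SYSTEM_BOOTING, SYSTEM_RUNNING, SYSTEM_SUSPENDING };

/* Hypothetical hook on the synchronous reclaim path: strip __GFP_WAIT
 * whenever the system is not fully running, so callers passing a
 * sleeping mask during boot or suspend silently become non-sleeping
 * allocations, without touching the slab allocator hot paths. */
static unsigned int gfp_for_reclaim(enum system_states state,
                                    unsigned int gfp_mask)
{
    if (state != SYSTEM_RUNNING)
        gfp_mask &= ~__GFP_WAIT;
    return gfp_mask;
}
```

This only runs when the fast path has already failed, which is why it should be cheaper overall than a branch in every slab allocation.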
Hi Nick,
On Fri, 2009-06-12 at 10:02 +0200, Nick Piggin wrote:
> Fair enough, but this can be done right down in the synchronous
> reclaim path in the page allocator. This will catch more cases
> of code using the page allocator directly, and should be not
> as hot as the slab allocator.
So you want to push the local_irq_enable() to the page allocator too? We
can certainly do that but I think we ought to wait for Andrew to merge
Mel's patches to mainline first, OK?
Pekka
Hi Ben,
On Fri, 2009-06-12 at 17:39 +1000, Benjamin Herrenschmidt wrote:
> For example, slab_is_available() didn't always exist, and so in the
> early days on powerpc, we used a mem_init_done global that is set from
> mem_init() (not perfect but works in practice). And we still have code
> using that to do the test.
Looking at powerpc arch code, can we get rid of the *_maybe_bootmem()
functions now? Or does slab initialization still happen too late? FWIW, I think
one simple fix on PPC is to just clear __GFP_WAIT in those functions
(all of them seem to be using GFP_KERNEL, which is wrong during boot).
Pekka
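A minimal sketch of the fix suggested above, with illustrative flag values and a hypothetical helper name (this is not the actual powerpc code):

```c
#include <assert.h>

/* Illustrative flag values only, not the kernel's definitions. */
#define __GFP_WAIT  0x10u
#define __GFP_IO    0x40u
#define GFP_KERNEL  (0x80u | __GFP_WAIT | __GFP_IO)

/* Hypothetical boot-safe wrapper for a *_maybe_bootmem()-style helper:
 * whatever flags the caller passes, drop __GFP_WAIT so a GFP_KERNEL
 * allocation made before interrupts are set up cannot sleep (or cause
 * interrupts to be re-enabled). */
static unsigned int boot_safe_gfp(unsigned int gfp)
{
    return gfp & ~__GFP_WAIT;
}
```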
On Fri, Jun 12, 2009 at 11:04:39AM +0300, Pekka Enberg wrote:
> Hi Nick,
>
> On Fri, 2009-06-12 at 10:02 +0200, Nick Piggin wrote:
> > Fair enough, but this can be done right down in the synchronous
> > reclaim path in the page allocator. This will catch more cases
> > of code using the page allocator directly, and should be not
> > as hot as the slab allocator.
>
> So you want to push the local_irq_enable() to the page allocator too? We
Well it would be nice to expose some page allocator functionality
at a bit lower level, yes. Like another thing is to avoid atomic
refcounting when there is no need for it (eg. in allocations for slab).
> can certainly do that but I think we ought to wait for Andrew to merge
> Mel's patches to mainline first, OK?
Sure.
On Fri, 2009-06-12 at 10:45 +0300, Pekka Enberg wrote:
> Hi Ben,
> The call-sites I fixed up are all boot code AFAICT. And like I said,
> we can't really _miss_ any of those places, they must be checking for
> slab_is_available() _anyway_; otherwise they have no business using
> kmalloc(). And note: all call-sites that _unconditionally_ use
> kmalloc(GFP_KERNEL) are safe because they worked before.
No. The check for slab_is_available() can be levels higher, for example
the vmalloc case. I'm sure I can find a whole bunch more :-) Besides
I find the approach fragile, and it will suck for things that can rightfully
be called later on as well.
> Again, I audited the call-sites and they all should be boot-time code.
> The only borderline case I could see is in s390 arch code which is why I
> dropped that hunk for now.
And the vmalloc case, and some page table handling code in arch/powerpc,
and I'm sure we can find bazillion of them more if we look closely.
> Sure, I think we can do what you want with the patch below.
>
> But I still think we need my patch regardless. The call sites I
> converted are all init code and should be using GFP_NOWAIT. Does it fix
> your boot on powerpc?
Not all init code needs to use GFP_NOWAIT. But again, my main worry
isn't necessarily init code call sites, it's things that can themselves be
called from both init and later.
But to take a step back, I do prefer not having to bother in every call
site. It seems a lot more natural to me in this case to have the
allocator itself degrade, avoiding the burden on the callers, the risk
of error, the damage when we change and move things around etc...
Cheers,
Ben.
> Pekka
>
> diff --git a/mm/slab.c b/mm/slab.c
> index 9a90b00..722beb5 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -2791,6 +2791,13 @@ static int cache_grow(struct kmem_cache *cachep,
>
> offset *= cachep->colour_off;
>
> + /*
> + * Lets not wait if we're booting up or suspending even if the user
> + * asks for it.
> + */
> + if (system_state != SYSTEM_RUNNING)
> + local_flags &= ~__GFP_WAIT;
> +
> if (local_flags & __GFP_WAIT)
> local_irq_enable();
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 65ffda5..f9a6bc8 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1547,6 +1547,13 @@ new_slab:
> goto load_freelist;
> }
>
> + /*
> + * Lets not wait if we're booting up or suspending even if the user
> + * asks for it.
> + */
> + if (system_state != SYSTEM_RUNNING)
> + gfpflags &= ~__GFP_WAIT;
> +
> if (gfpflags & __GFP_WAIT)
> local_irq_enable();
>
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
> > Sure, I think we can do what you want with the patch below.
>
> I don't really like adding branches to slab allocator like this.
> init code all needs to know what services are available, and
> this includes the scheduler if it wants to do anything sleeping
> (including sleeping slab allocations).
>
> Core mm code is the last place to put in workarounds for broken
> callers...
Right, and that's also a reason why I decided on that
"smellybits" approach: applying a mask is going to consume a lot fewer
cycles than a conditional branch (especially on small embedded
CPUs; on modern stuff the conditional branch should be reasonably
harmless).
Nick, have you seen my patch ? What do you think ?
Cheers,
Ben.
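The "smellybits" idea reduces to a single AND with a mask of currently forbidden bits instead of a branch on the system state. A sketch with made-up flag values; the real patch keeps the mask in a global, but it is a parameter here for clarity:

```c
#include <assert.h>

/* Illustrative flag values only. */
#define __GFP_WAIT 0x10u
#define __GFP_IO   0x40u

/* One unconditional AND-with-complement; no conditional branch.
 * In the actual patch the forbidden mask would live in a single
 * global, updated at boot/suspend transitions. */
static unsigned int gfp_apply_smellybits(unsigned int gfp,
                                         unsigned int forbidden)
{
    return gfp & ~forbidden;
}
```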
Hi Ben,
On Fri, 2009-06-12 at 18:40 +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2009-06-12 at 10:45 +0300, Pekka Enberg wrote:
> > Hi Ben,
>
> > The call-sites I fixed up are all boot code AFAICT. And like I said,
> > we can't really _miss_ any of those places, they must be checking for
> > slab_is_available() _anyway_; otherwise they have no business using
> > kmalloc(). And note: all call-sites that _unconditionally_ use
> > kmalloc(GFP_KERNEL) are safe because they worked before.
>
> No. The check for slab_is_available() can be levels higher, for example
> the vmalloc case. I'm sure I can find a whole bunch more :-) Besides
> I find the approach fragile, and it will suck for things that can be
> rightfully called also later on.
Yes, you're obviously right. I overlooked the fact that arch code has its
own special slab_is_available() heuristics (yikes!).
But are you happy with the two patches I posted so I can push them to
Linus?
Pekka
On Fri, 2009-06-12 at 10:02 +0200, Nick Piggin wrote:
> Fair enough, but this can be done right down in the synchronous
> reclaim path in the page allocator. This will catch more cases
> of code using the page allocator directly, and should be not
> as hot as the slab allocator.
>
Yes, except that slab does an explicit local_irq_enable() when __GFP_WAIT
is set, so we also need to deal with that for the boot case.
But again, this is a lot less of an issue if you use my proposed patch
instead which just applies a mask of "forbidden" bits rather than a
conditional branch based on the system state. It will also allow for
more fine-grained masking if we decide, for example, that at some
stage we want to mask out __GFP_IO etc...
Cheers,
Ben.
On Fri, 2009-06-12 at 11:04 +0300, Pekka Enberg wrote:
> Hi Nick,
>
> On Fri, 2009-06-12 at 10:02 +0200, Nick Piggin wrote:
> > Fair enough, but this can be done right down in the synchronous
> > reclaim path in the page allocator. This will catch more cases
> > of code using the page allocator directly, and should be not
> > as hot as the slab allocator.
>
> So you want to push the local_irq_enable() to the page allocator too? We
> can certainly do that but I think we ought to wait for Andrew to merge
> Mel's patches to mainline first, OK?
Doesn't my patch take care of all the cases in a much more simple way ?
Cheers,
Ben.
On Fri, 2009-06-12 at 11:17 +0300, Pekka Enberg wrote:
> Hi Ben,
>
> On Fri, 2009-06-12 at 17:39 +1000, Benjamin Herrenschmidt wrote:
> > For example, slab_is_available() didn't always exist, and so in the
> > early days on powerpc, we used a mem_init_done global that is set from
> > mem_init() (not perfect but works in practice). And we still have code
> > using that to do the test.
>
> Looking at powerpc arch code, can we get rid of the *_maybe_bootmem()
> functions now? Or does slab initialization still happen too late? FWIW, I think
> one simple fix on PPC is to just clear __GFP_WAIT in those functions
> (all of them seem to be using GFP_KERNEL, which is wrong during boot).
I -think- we still use those in setup_arch() so we can't get rid of them
completely yet.
Cheers,
Ben.
> Pekka
>
On Fri, Jun 12, 2009 at 11:44 AM, Benjamin
Herrenschmidt<[email protected]> wrote:
> On Fri, 2009-06-12 at 11:04 +0300, Pekka Enberg wrote:
>> Hi Nick,
>>
>> On Fri, 2009-06-12 at 10:02 +0200, Nick Piggin wrote:
>> > Fair enough, but this can be done right down in the synchronous
>> > reclaim path in the page allocator. This will catch more cases
>> > of code using the page allocator directly, and should be not
>> > as hot as the slab allocator.
>>
>> So you want to push the local_irq_enable() to the page allocator too? We
>> can certainly do that but I think we ought to wait for Andrew to merge
>> Mel's patches to mainline first, OK?
>
> Doesn't my patch take care of all the cases in a much more simple way ?
Nick, the patch Ben is talking about is here:
http://patchwork.kernel.org/patch/29700/
The biggest problem with the patch is that the gfp_smellybits is wide
open for abuse. Hmm.
Pekka
> Yes, you're obviously right. I overlooked the fact that arch code has its
> own special slab_is_available() heuristics (yikes!).
>
> But are you happy with the two patches I posted so I can push them to
> Linus?
I won't be able to test them until tomorrow. However, I think the first
one becomes unnecessary with the second one applied (provided you didn't
miss a case), no ?
I still prefer my approach of having more fine-grained control over which
bits to remove. First, because applying a mask is less expensive than a
conditional branch (I used a negative mask because it would be too easy
to miss bits otherwise) and second, because it easily allows masking
other bits, for example __GFP_IO for the suspend path etc...
Now, if you find it a bit too ugly, feel free to rename smellybits to
something else and create an accessor function for setting what bits are
masked out, but I still believe that the basic idea behind my patch is
saner than yours :-)
Cheers,
Ben.
On Fri, 2009-06-12 at 18:53 +1000, Benjamin Herrenschmidt wrote:
> Now, if you find it a bit too ugly, feel free to rename smellybits to
> something else and create an accessor function for setting what bits are
> masked out, but I still believe that the basic idea behind my patch is
> saner than yours :-)
It's not the naming I object to but the mechanism, because I think it is
open to abuse (think of a smelly driver playing tricks with it). So I do
think my patch is the sanest solution here. ;-)
Nick? Christoph?
Pekka
Hi Ben,
On Fri, 2009-06-12 at 18:53 +1000, Benjamin Herrenschmidt wrote:
> > Yes, you're obviously right. I overlooked the fact that arch code have
> > their own special slab_is_available() heuristics (yikes!).
> >
> > But are you happy with the two patches I posted so I can push them to
> > Linus?
>
> I won't be able to test them until tomorrow. However, I think the first
> one becomes unnecessary with the second one applied (provided you didn't
> miss a case), no ?
OK, I am dropping the slub/slab patch from the queue for now. Here's
what I am going to push to Linus:
http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=shortlog;h=topic/slab/earlyboot
So I am sending the GFP_NOWAIT conversion for boot code even though you
didn't seem to like it (but didn't explicitly NAK) as it fixes problems
on x86.
Pekka
On Fri, Jun 12, 2009 at 11:49:31AM +0300, Pekka Enberg wrote:
> On Fri, Jun 12, 2009 at 11:44 AM, Benjamin
> Herrenschmidt<[email protected]> wrote:
> > On Fri, 2009-06-12 at 11:04 +0300, Pekka Enberg wrote:
> >> Hi Nick,
> >>
> >> On Fri, 2009-06-12 at 10:02 +0200, Nick Piggin wrote:
> >> > Fair enough, but this can be done right down in the synchronous
> >> > reclaim path in the page allocator. This will catch more cases
> >> > of code using the page allocator directly, and should be not
> >> > as hot as the slab allocator.
> >>
> >> So you want to push the local_irq_enable() to the page allocator too? We
> >> can certainly do that but I think we ought to wait for Andrew to merge
> >> Mel's patches to mainline first, OK?
> >
> > Doesn't my patch take care of all the cases in a much more simple way ?
>
> Nick, the patch Ben is talking about is here:
>
> http://patchwork.kernel.org/patch/29700/
It's OK. I'd make it gfp_notsmellybits, and avoid the ~.
And read_mostly.
> The biggest problem with the patch is that the gfp_smellybits is wide
> open for abuse. Hmm.
Probably would be better to hide it in mm/ and then just
allow it to be modified with a couple of calls. OTOH if
it is only modified in a couple of places then maybe that's
overkill.
The whole problem comes about because we don't just restore
our previously saved flags here... I guess it probably adds
even more overhead to do that and make everything just work :(
On Fri, Jun 12, 2009 at 12:05:33PM +0300, Pekka Enberg wrote:
> On Fri, 2009-06-12 at 18:53 +1000, Benjamin Herrenschmidt wrote:
> > Now, if you find it a bit too ugly, feel free to rename smellybits to
> > something else and create an accessor function for setting what bits are
> > masked out, but I still believe that the basic idea behind my patch is
> > saner than yours :-)
>
> It's not the naming I object to but the mechanism, because I think it is
> open to abuse (think of a smelly driver playing tricks with it). So I do
> think my patch is the sanest solution here. ;-)
>
> Nick? Christoph?
I like the lower overhead of Ben's approach, and I like the slab
allocator being told about this rather than having to deduce
it from that horrible system_state thing.
OTOH, I don't know how useful it is: is it just to work around
the problem of slab allocators unconditionally doing
local_irq_enable(), or is it going to be more widely
useful?
> It's OK. I'd make it gfp_notsmellybits, and avoid the ~.
> And read_mostly.
read_mostly is fine. gfp_notsmellybits isn't a nice name :-) Make it
gfp_allowedbits then. I did it backward on purpose though as the risk of
"missing" bits here (as we may add new ones) is higher and it seemed to
me generally simpler to just explicitly spell out the ones to forbid
(also, on powerpc, &~ is one instruction :-)
> Probably would be better to hide it in mm/ and then just
> allow it to be modified with a couple of calls. OTOH if
> it is only modified in a couple of places then maybe that's
> overkill.
It might indeed be nicer to hide it behind an accessor.
> The whole problem comes about because we don't just restore
> our previously saved flags here... I guess it probably adds
> even more overhead to do that and make everything just work :(
Well... that's part of the equation. My solution has the advantage of
also providing grounds to forbid GFP_IO during suspend/resume etc...
Cheers,
Ben.
>
On Fri, Jun 12, 2009 at 07:24:20PM +1000, Benjamin Herrenschmidt wrote:
>
> > It's OK. I'd make it gfp_notsmellybits, and avoid the ~.
> > And read_mostly.
>
> read_mostly is fine. gfp_notsmellybits isn't a nice name :-) Make it
> gfp_allowedbits then. I did it backward on purpose though as the risk of
> "missing" bits here (as we may add new ones) is higher and it seemed to
> me generally simpler to just explicitly spell out the ones to forbid
> (also, on powerpc, &~ is one instruction :-)
But just do the ~ in the assignment. No missing bits :)
> > Probably would be better to hide it in mm/ and then just
> > allow it to be modified with a couple of calls. OTOH if
> > it is only modified in a couple of places then maybe that's
> > overkill.
>
> It might indeed be nicer to hide it behind an accessor.
>
> > The whole problem comes about because we don't just restore
> > our previously saved flags here... I guess it probably adds
> > even more overhead to do that and make everything just work :(
>
> Well... that's part of the equation. My solution has the advantage of
> also providing grounds to forbid GFP_IO during suspend/resume etc...
Yeah but it doesn't do it in the page allocator so it isn't
really useful as a general allocator flags tweak. ATM it only
helps this case of slab allocator hackery.
In my slab allocator I'm going to actually look at what it
costs to keep track of flags properly. That would be far cleaner...
OTOH, SLUB is apparently much more sensitive about page allocator
performance so maybe the hack is worthwhile there.
On Fri, 2009-06-12 at 11:30 +0200, Nick Piggin wrote:
> On Fri, Jun 12, 2009 at 07:24:20PM +1000, Benjamin Herrenschmidt wrote:
> >
> > > It's OK. I'd make it gfp_notsmellybits, and avoid the ~.
> > > And read_mostly.
> >
> > read_mostly is fine. gfp_notsmellybits isn't a nice name :-) Make it
> > gfp_allowedbits then. I did it backward on purpose though as the risk of
> > "missing" bits here (as we may add new ones) is higher and it seemed to
> > me generally simpler to just explicitly spell out the ones to forbid
> > (also, on powerpc, &~ is one instruction :-)
>
> But just do the ~ in the assignment. No missing bits :)
Heh, ok.
> Yeah but it doesn't do it in the page allocator so it isn't
> really useful as a general allocator flags tweak. ATM it only
> helps this case of slab allocator hackery.
I thought I did it in page_alloc.c too but I'm happy to be told what I
missed :-) The intent is certainly to have a general allocator flags
tweak.
Cheers,
Ben.
On Fri, Jun 12, 2009 at 07:44:25PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2009-06-12 at 11:30 +0200, Nick Piggin wrote:
> > On Fri, Jun 12, 2009 at 07:24:20PM +1000, Benjamin Herrenschmidt wrote:
> > Yeah but it doesn't do it in the page allocator so it isn't
> > really useful as a general allocator flags tweak. ATM it only
> > helps this case of slab allocator hackery.
>
> I thought I did it in page_alloc.c too but I'm happy to be told what I
> missed :-) The intent is certainly to have a general allocator flags
> tweak.
Oh no, I missed that, sorry, you did. I'd be a bit worried about
wanting it as a general allocator tweak. Even suspending IO
for suspend/resume... it would be better to try solving that
ordering by design, and if not, then perhaps add something
to mm/vmscan.c rather than modifying gfp flags all the way
down.
On Fri, 12 Jun 2009, Pekka Enberg wrote:
> Yes, you're obviously right. I overlooked the fact that arch code has its
> own special slab_is_available() heuristics (yikes!).
That's another area where we would need some cleanup in the future.
slab_is_available() has become a check for the end of the use of bootmem.
What would be clearer is to have something like
allocator_bootstrap_complete()
to tell us that all memory allocators are available (slab, page, percpu,
vmalloc).
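A trivial sketch of such a predicate; the name comes from the suggestion above, but the structure and per-allocator flags are hypothetical:

```c
#include <assert.h>

/* Hypothetical predicate combining per-allocator readiness flags; the
 * kernel would set each one as the corresponding allocator finishes
 * its bootstrap. Passed as parameters here for illustration. */
static int allocator_bootstrap_complete(int slab_up, int page_up,
                                        int percpu_up, int vmalloc_up)
{
    return slab_up && page_up && percpu_up && vmalloc_up;
}
```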
On Fri, 12 Jun 2009, Pekka Enberg wrote:
> So I am sending the GFP_NOWAIT conversion for boot code even though you
> didn't seem to like it (but didn't explicitly NAK) as it fixes problems
> on x86.
The use of GFP_NOWAIT means that the call sites are still special-cased for
an early boot situation. After bootstrap is complete the sites may use
GFP_KERNEL instead. Bad.
Best thing to do is to recognize the fact that we are still in early boot
in the allocators. Derived allocators (such as slab and vmalloc) mask bits
using GFP_RECLAIM_MASK when doing allocations through the page
allocator. You could make GFP_RECLAIM_MASK a variable. During boot
__GFP_WAIT would not be set in GFP_RECLAIM_MASK.
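A sketch of what a variable reclaim mask could look like. The flag values are illustrative, and the mask is computed from a boot-state argument here rather than being the kernel's actual GFP_RECLAIM_MASK constant:

```c
#include <assert.h>

/* Illustrative flag values only. */
#define __GFP_WAIT 0x10u
#define __GFP_IO   0x40u
#define __GFP_FS   0x80u

/* During boot the returned mask simply lacks __GFP_WAIT, so derived
 * allocators that AND their flags with it never pass the wait bit
 * down to the page allocator, whatever the caller asked for. */
static unsigned int gfp_reclaim_mask(int still_booting)
{
    unsigned int mask = __GFP_IO | __GFP_FS;

    if (!still_booting)
        mask |= __GFP_WAIT;
    return mask;
}
```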
Hi Christoph,
On Fri, 2009-06-12 at 09:49 -0400, Christoph Lameter wrote:
> Best thing to do is to recognize the fact that we are still in early boot
> in the allocators. Derived allocators (such as slab and vmalloc) mask bits
> using GFP_RECLAIM_MASK and when doing allocations through the page
> allocator. You could make GFP_RECLAIM_MASK a variable. During boot
> __GFP_WAIT would not be set in GFP_RECLAIM_MASK.
Ben's patch does something like that and I have patches that do that
floating around too.
The problem here is that it's not enough that we make GFP_RECLAIM_MASK a
variable. There are various _debugging checks_ that happen much earlier
than that. We need to mask out those too which adds overhead to
kmalloc() fastpath, for example.
Pekka
On Fri, 2009-06-12 at 09:49 -0400, Christoph Lameter wrote:
> On Fri, 12 Jun 2009, Pekka Enberg wrote:
>
> > So I am sending the GFP_NOWAIT conversion for boot code even though you
> > didn't seem to like it (but didn't explicitly NAK) as it fixes problems
> > on x86.
>
> The use of GFP_NOWAIT means that the call sites are still special-cased for
> an early boot situation. After bootstrap is complete the sites may use
> GFP_KERNEL instead. Bad.
Amen :-)
Ben.
On Fri, 2009-06-12 at 16:54 +0300, Pekka Enberg wrote:
> Hi Christoph,
>
> On Fri, 2009-06-12 at 09:49 -0400, Christoph Lameter wrote:
> > Best thing to do is to recognize the fact that we are still in early boot
> > in the allocators. Derived allocators (such as slab and vmalloc) mask bits
> > using GFP_RECLAIM_MASK and when doing allocations through the page
> > allocator. You could make GFP_RECLAIM_MASK a variable. During boot
> > __GFP_WAIT would not be set in GFP_RECLAIM_MASK.
>
> Ben's patch does something like that and I have patches that do that
> floating around too.
>
> The problem here is that it's not enough that we make GFP_RECLAIM_MASK a
> variable. There are various _debugging checks_ that happen much earlier
> than that. We need to mask out those too which adds overhead to
> kmalloc() fastpath, for example.
Hrm... I thought I stuck my masking before the lockdep tests but maybe I
missed some...
Again, I'm not saying my patch is the best way to solve it. My point is
more that the callers shouldn't have to bother. (And thus the WARN_ON
isn't right :-)
Cheers,
Ben.
Hi Ben,
On Sat, 2009-06-13 at 00:02 +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2009-06-12 at 16:54 +0300, Pekka Enberg wrote:
> > Hi Christoph,
> >
> > On Fri, 2009-06-12 at 09:49 -0400, Christoph Lameter wrote:
> > > Best thing to do is to recognize the fact that we are still in early boot
> > > in the allocators. Derived allocators (such as slab and vmalloc) mask bits
> > > using GFP_RECLAIM_MASK and when doing allocations through the page
> > > allocator. You could make GFP_RECLAIM_MASK a variable. During boot
> > > __GFP_WAIT would not be set in GFP_RECLAIM_MASK.
> >
> > Ben's patch does something like that and I have patches that do that
> > floating around too.
> >
> > The problem here is that it's not enough that we make GFP_RECLAIM_MASK a
> > variable. There are various _debugging checks_ that happen much earlier
> > than that. We need to mask out those too which adds overhead to
> > kmalloc() fastpath, for example.
>
> Hrm... I thought I stuck my masking before the lockdep tests but maybe I
> missed some...
Your patch is fine but what Christoph suggested is not (at least the way
I understood it).
Pekka
On Fri, 12 Jun 2009, Pekka Enberg wrote:
> The problem here is that it's not enough that we make GFP_RECLAIM_MASK a
> variable. There are various _debugging checks_ that happen much earlier
> than that. We need to mask out those too which adds overhead to
> kmalloc() fastpath, for example.
True. The GFP_RECLAIM_MASK only addresses the passing of gfp flags from a
derived allocator to the page allocator. It will deal only with the issue
of __GFP_WAIT handling.