2009-06-12 08:14:01

by Pekka Enberg

[permalink] [raw]
Subject: [PATCH 2/2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

From: Pekka Enberg <[email protected]>

As explained by Benjamin Herrenschmidt:

Oh and btw, your patch alone doesn't fix powerpc, because it's missing
a whole bunch of GFP_KERNEL's in the arch code... You would have to
grep the entire kernel for things that check slab_is_available() and
even then you'll be missing some.

For example, slab_is_available() didn't always exist, and so in the
early days on powerpc, we used a mem_init_done global that is set form
mem_init() (not perfect but works in practice). And we still have code
using that to do the test.

Therefore, ignore __GFP_WAIT in the slab allocators if we're booting or
suspending.

Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Nick Piggin <[email protected]>
Signed-off-by: Pekka Enberg <[email protected]>
---
mm/slab.c | 7 +++++++
mm/slub.c | 7 +++++++
2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index f46b65d..4b932e0 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2812,6 +2812,13 @@ static int cache_grow(struct kmem_cache *cachep,

offset *= cachep->colour_off;

+ /*
+ * Lets not wait if we're booting up or suspending even if the user
+ * asks for it.
+ */
+ if (system_state != SYSTEM_RUNNING)
+ local_flags &= ~__GFP_WAIT;
+
if (local_flags & __GFP_WAIT)
local_irq_enable();

diff --git a/mm/slub.c b/mm/slub.c
index 3964d3c..053ea3e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1548,6 +1548,13 @@ new_slab:
goto load_freelist;
}

+ /*
+ * Lets not wait if we're booting up or suspending even if the user
+ * asks for it.
+ */
+ if (system_state != SYSTEM_RUNNING)
+ gfpflags &= ~__GFP_WAIT;
+
if (gfpflags & __GFP_WAIT)
local_irq_enable();

--
1.6.0.4


2009-06-12 09:03:30

by Pekka Enberg

[permalink] [raw]
Subject: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

From: Pekka Enberg <[email protected]>

As explained by Benjamin Herrenschmidt:

Oh and btw, your patch alone doesn't fix powerpc, because it's missing
a whole bunch of GFP_KERNEL's in the arch code... You would have to
grep the entire kernel for things that check slab_is_available() and
even then you'll be missing some.

For example, slab_is_available() didn't always exist, and so in the
early days on powerpc, we used a mem_init_done global that is set form
mem_init() (not perfect but works in practice). And we still have code
using that to do the test.

Therefore, ignore __GFP_WAIT in the slab allocators if we're booting or
suspending.

Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Nick Piggin <[email protected]>
Signed-off-by: Pekka Enberg <[email protected]>
---
v1 -> v2: fix up some missing cases pointed out by BenH

mm/slab.c | 19 ++++++++++++++++++-
mm/slub.c | 24 ++++++++++++++++++++++--
2 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index f46b65d..5119c22 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2812,6 +2812,15 @@ static int cache_grow(struct kmem_cache *cachep,

offset *= cachep->colour_off;

+ /*
+ * Lets not wait if we're booting up or suspending even if the user
+ * asks for it.
+ */
+ if (system_state != SYSTEM_RUNNING)
+ local_flags &= ~__GFP_WAIT;
+
+ might_sleep_if(local_flags & __GFP_WAIT);
+
if (local_flags & __GFP_WAIT)
local_irq_enable();

@@ -3073,7 +3082,6 @@ alloc_done:
static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep,
gfp_t flags)
{
- might_sleep_if(flags & __GFP_WAIT);
#if DEBUG
kmem_flagcheck(cachep, flags);
#endif
@@ -3238,6 +3246,15 @@ retry:

if (!obj) {
/*
+ * Lets not wait if we're booting up or suspending even if the user
+ * asks for it.
+ */
+ if (system_state != SYSTEM_RUNNING)
+ local_flags &= ~__GFP_WAIT;
+
+ might_sleep_if(local_flags & __GFP_WAIT);
+
+ /*
* This allocation will be performed within the constraints
* of the current cpuset / memory policy requirements.
* We may trigger various forms of reclaim on the allowed
diff --git a/mm/slub.c b/mm/slub.c
index 3964d3c..6387c19 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1548,6 +1548,20 @@ new_slab:
goto load_freelist;
}

+ /*
+ * Lets not wait if we're booting up or suspending even if the user
+ * asks for it.
+ */
+ if (system_state != SYSTEM_RUNNING)
+ gfpflags &= ~__GFP_WAIT;
+
+ /*
+ * Now that we really know whether or not we're going to sleep or not,
+ * lets do our debugging checks.
+ */
+ lockdep_trace_alloc(gfpflags);
+ might_sleep_if(gfpflags & __GFP_WAIT);
+
if (gfpflags & __GFP_WAIT)
local_irq_enable();

@@ -1595,8 +1609,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
unsigned long flags;
unsigned int objsize;

- lockdep_trace_alloc(gfpflags);
- might_sleep_if(gfpflags & __GFP_WAIT);
+ lockdep_trace_alloc(gfpflags & ~__GFP_WAIT);

if (should_failslab(s->objsize, gfpflags))
return NULL;
@@ -2607,6 +2620,13 @@ static noinline struct kmem_cache *dma_kmalloc_cache(int index, gfp_t flags)
if (s)
return s;

+ /*
+ * Lets not wait if we're booting up or suspending even if the user
+ * asks for it.
+ */
+ if (system_state != SYSTEM_RUNNING)
+ flags &= ~__GFP_WAIT;
+
/* Dynamically create dma cache */
if (flags & __GFP_WAIT)
down_write(&slub_lock);
--
1.6.0.4

2009-06-12 09:10:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending


* Pekka J Enberg <[email protected]> wrote:

> index 3964d3c..6387c19 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1548,6 +1548,20 @@ new_slab:
> goto load_freelist;
> }
>
> + /*
> + * Lets not wait if we're booting up or suspending even if the user
> + * asks for it.
> + */
> + if (system_state != SYSTEM_RUNNING)
> + gfpflags &= ~__GFP_WAIT;

Hiding that bug like that is not particularly clean IMO. We should
not let system_state hacks spread like that.

We emit a debug warning but dont crash, so all should be fine and
the culprits can then be fixed, right?

Ingo

2009-06-12 09:22:24

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending


> We emit a debug warning but dont crash, so all should be fine and
> the culprits can then be fixed, right?

... rewind ... :-)

Ok so, no, the culprit cannot be all fixed in a satifactory way.

The main reason is that I believe it's not "right" to have every caller
of slab around know whether GFP_KERNEL is good to go or it should get
into GFP_NOWAIT. This depends on many factors (among others us moving
things around more), and is not actually a good solution for thing that
can be called both at boot and later, such as get_vm_area().

I really think we are looking for trouble (and a lot of hidden bugs) by
trying to "fix" all callers, in addition to making some code like
vmalloc() more failure prone because it's unconditionally changed from
GFP_KERNEL to GFP_NOWAIT.

It seems a lot more reasonably to me to have sl*b naturally degrade to
NOWAIT when it's too early to enable interrupts.

In addition, my proposal of having bits to mask off gfp will also be
useful in fixing similar issues with suspend/resume vs. GFP_NOIO which
should really become implicit when devices start becoming suspended.

Cheers,
Ben.

2009-06-12 09:25:06

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

Hi Ben,

On Fri, Jun 12, 2009 at 12:21 PM, Benjamin
Herrenschmidt<[email protected]> wrote:
> I really think we are looking for trouble (and a lot of hidden bugs) by
> trying to "fix" all callers, in addition to making some code like
> vmalloc() more failure prone because it's unconditionally changed from
> GFP_KERNEL to GFP_NOWAIT.

It's a new API function vmalloc_node_boot() that uses GFP_NOWAIT so I
don't share your concern that it's error prone.

Pekka

2009-06-12 09:36:55

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Fri, 2009-06-12 at 12:24 +0300, Pekka Enberg wrote:
> Hi Ben,
>
> On Fri, Jun 12, 2009 at 12:21 PM, Benjamin
> Herrenschmidt<[email protected]> wrote:
> > I really think we are looking for trouble (and a lot of hidden bugs) by
> > trying to "fix" all callers, in addition to making some code like
> > vmalloc() more failure prone because it's unconditionally changed from
> > GFP_KERNEL to GFP_NOWAIT.
>
> It's a new API function vmalloc_node_boot() that uses GFP_NOWAIT so I
> don't share your concern that it's error prone.

But you didn't fix __get_vm_area_caller() which means my ioremap is
still broken...

Take a break, take a step back, and look at the big picture. Do you
really want to find all the needles in the haystack or just make sure
you wear gloves when handling the hay ? :-)

Cheers,
Ben.

2009-06-12 09:45:30

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Fri, 12 Jun 2009, Benjamin Herrenschmidt wrote:
> Take a break, take a step back, and look at the big picture. Do you
> really want to find all the needles in the haystack or just make sure
> you wear gloves when handling the hay ? :-)

Well, I would like to find the needles but I think we should do it with
gloves on.

If everyone is happy with this version of Ben's patch, I'm going to just
apply it and push it to Linus.

Pekka

>From 7760451b006b165bce052622af7316b54bd5a122 Mon Sep 17 00:00:00 2001
From: Benjamin Herrenschmidt <[email protected]>
Date: Fri, 12 Jun 2009 12:39:58 +0300
Subject: [PATCH] Sanitize "gfp" flags during boot

With the recent shuffle of initialization order to move memory related
inits earlier, various subtle breakage was introduced in archs like
powerpc due to code somewhat assuming that GFP_KERNEL can be used as
soon as the allocators are up. This is not true because any __GFP_WAIT
allocation will cause interrupts to be enabled, which can be fatal if
it happens too early.

This isn't trivial to fix on every call site. For example, powerpc's
ioremap implementation needs to be called early. For that, it uses two
different mechanisms to carve out virtual space. Before memory init,
by moving down VMALLOC_END, and then, by calling get_vm_area().
Unfortunately, the later does GFK_KERNEL allocations. But we can't do
anything else because once vmalloc's been initialized, we can no longer
safely move VMALLOC_END to carve out space.

There are other examples, wehere can can be called either very early
or later on when devices are hot-plugged. It would be a major pain for
such code to have to "know" whether it's in a context where it should
use GFP_KERNEL or GFP_NOWAIT.

Finally, by having the ability to silently removed __GFP_WAIT from
allocations, we pave the way for suspend-to-RAM to use that feature
to also remove __GFP_IO from allocations done after suspending devices
has started. This is important because such allocations may hang if
devices on the swap-out path have been suspended, but not-yet suspended
drivers don't know about it, and may deadlock themselves by being hung
into a kmalloc somewhere while holding a mutex for example.

Signed-off-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Pekka Enberg <[email protected]>
---
include/linux/gfp.h | 9 +++++++++
init/main.c | 1 +
mm/page_alloc.c | 18 ++++++++++++++++++
mm/slab.c | 9 +++++++++
mm/slub.c | 3 +++
5 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0bbc15f..8d2ea79 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -99,6 +99,13 @@ struct vm_area_struct;
/* 4GB DMA on some platforms */
#define GFP_DMA32 __GFP_DMA32

+extern gfp_t gfp_allowed_bits;
+
+static inline gfp_t gfp_sanitize(gfp_t gfp_flags)
+{
+ return gfp_flags & gfp_allowed_bits;
+}
+
/* Convert GFP flags to their corresponding migrate type */
static inline int allocflags_to_migratetype(gfp_t gfp_flags)
{
@@ -245,4 +252,6 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
void drain_all_pages(void);
void drain_local_pages(void *dummy);

+void mm_late_init(void);
+
#endif /* __LINUX_GFP_H */
diff --git a/init/main.c b/init/main.c
index b3e8f14..34c6e7e 100644
--- a/init/main.c
+++ b/init/main.c
@@ -640,6 +640,7 @@ asmlinkage void __init start_kernel(void)
"enabled early\n");
early_boot_irqs_on();
local_irq_enable();
+ mm_late_init();

/*
* HACK ALERT! This is early. We're enabling the console before
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7760ef9..a42e4c7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -77,6 +77,13 @@ int percpu_pagelist_fraction;
int pageblock_order __read_mostly;
#endif

+/*
+ * We set up the page allocator and the slab allocator early on with interrupts
+ * disabled. Therefore, make sure that we sanitize GFP flags accordingly before
+ * everything is up and running.
+ */
+gfp_t gfp_allowed_bits = ~(__GFP_WAIT|__GFP_FS | __GFP_IO);
+
static void __free_pages_ok(struct page *page, unsigned int order);

/*
@@ -1473,6 +1480,9 @@ __alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
unsigned long did_some_progress;
unsigned long pages_reclaimed = 0;

+ /* Sanitize flags so we don't enable irqs too early during boot */
+ gfp_mask = gfp_sanitize(gfp_mask);
+
lockdep_trace_alloc(gfp_mask);

might_sleep_if(wait);
@@ -4728,3 +4738,11 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
spin_unlock_irqrestore(&zone->lock, flags);
}
#endif
+
+void mm_late_init(void)
+{
+ /*
+ * Interrupts are enabled now so all GFP allocations are safe.
+ */
+ gfp_allowed_bits = __GFP_BITS_MASK;
+}
diff --git a/mm/slab.c b/mm/slab.c
index f46b65d..87b166e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2791,6 +2791,9 @@ static int cache_grow(struct kmem_cache *cachep,
gfp_t local_flags;
struct kmem_list3 *l3;

+ /* Sanitize flags so we don't enable irqs too early during boot */
+ gfp_mask = gfp_sanitize(gfp_mask);
+
/*
* Be lazy and only check for valid flags here, keeping it out of the
* critical path in kmem_cache_alloc().
@@ -3212,6 +3215,9 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
void *obj = NULL;
int nid;

+ /* Sanitize flags so we don't enable irqs too early during boot */
+ gfp_mask = gfp_sanitize(gfp_mask);
+
if (flags & __GFP_THISNODE)
return NULL;

@@ -3434,6 +3440,9 @@ __cache_alloc(struct kmem_cache *cachep, gfp_t flags, void *caller)
unsigned long save_flags;
void *objp;

+ /* Sanitize flags so we don't enable irqs too early during boot */
+ gfp_mask = gfp_sanitize(gfp_mask);
+
lockdep_trace_alloc(flags);

if (slab_should_failslab(cachep, flags))
diff --git a/mm/slub.c b/mm/slub.c
index 3964d3c..5c646f7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1512,6 +1512,9 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
/* We handle __GFP_ZERO in the caller */
gfpflags &= ~__GFP_ZERO;

+ /* Sanitize flags so we don't enable irqs too early during boot */
+ gfpflags = gfp_sanitize(gfpflags);
+
if (!c->page)
goto new_slab;

--
1.6.0.4

2009-06-12 09:49:25

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Fri, Jun 12, 2009 at 12:10 PM, Ingo Molnar<[email protected]> wrote:
>> @@ -1548,6 +1548,20 @@ new_slab:
>> ? ? ? ? ? ? ? goto load_freelist;
>> ? ? ? }
>>
>> + ? ? /*
>> + ? ? ?* Lets not wait if we're booting up or suspending even if the user
>> + ? ? ?* asks for it.
>> + ? ? ?*/
>> + ? ? if (system_state != SYSTEM_RUNNING)
>> + ? ? ? ? ? ? gfpflags &= ~__GFP_WAIT;
>
> Hiding that bug like that is not particularly clean IMO. We should
> not let system_state hacks spread like that.
>
> We emit a debug warning but dont crash, so all should be fine and
> the culprits can then be fixed, right?

OK, lets not use system_state then and go with Ben's approach then.
Again, neither of the patches are about "hiding buggy callers" but
changing allocation policy wrt. gfp flags during boot (and later on
during suspend).

Pekka

2009-06-12 09:52:20

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Fri, Jun 12, 2009 at 12:49:17PM +0300, Pekka Enberg wrote:
> On Fri, Jun 12, 2009 at 12:10 PM, Ingo Molnar<[email protected]> wrote:
> >> @@ -1548,6 +1548,20 @@ new_slab:
> >> ? ? ? ? ? ? ? goto load_freelist;
> >> ? ? ? }
> >>
> >> + ? ? /*
> >> + ? ? ?* Lets not wait if we're booting up or suspending even if the user
> >> + ? ? ?* asks for it.
> >> + ? ? ?*/
> >> + ? ? if (system_state != SYSTEM_RUNNING)
> >> + ? ? ? ? ? ? gfpflags &= ~__GFP_WAIT;
> >
> > Hiding that bug like that is not particularly clean IMO. We should
> > not let system_state hacks spread like that.
> >
> > We emit a debug warning but dont crash, so all should be fine and
> > the culprits can then be fixed, right?
>
> OK, lets not use system_state then and go with Ben's approach then.
> Again, neither of the patches are about "hiding buggy callers" but
> changing allocation policy wrt. gfp flags during boot (and later on
> during suspend).

Maybe if we just not make it a general "tweak gfpflag" bit (at
least not until a bit more discussion), but a specific workaround
for the local_irq_enable in early boot problem.

Seems like it would not be hard to track things down if we add
a warning if we have GFP_WAIT and interrupts are not enabled...

2009-06-12 09:54:51

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Fri, 2009-06-12 at 11:52 +0200, Nick Piggin wrote:
> Maybe if we just not make it a general "tweak gfpflag" bit (at
> least not until a bit more discussion), but a specific workaround
> for the local_irq_enable in early boot problem.
>
> Seems like it would not be hard to track things down if we add
> a warning if we have GFP_WAIT and interrupts are not enabled...

AFAICT, the point is that Ben thinks that we shouldn't go and try to fix
up all the callers. But yes, we could certainly do that too.

Pekka

2009-06-12 09:59:18

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Fri, 2009-06-12 at 12:45 +0300, Pekka J Enberg wrote:
> On Fri, 12 Jun 2009, Benjamin Herrenschmidt wrote:
> > Take a break, take a step back, and look at the big picture. Do you
> > really want to find all the needles in the haystack or just make sure
> > you wear gloves when handling the hay ? :-)
>
> Well, I would like to find the needles but I think we should do it with
> gloves on.
>
> If everyone is happy with this version of Ben's patch, I'm going to just
> apply it and push it to Linus.

Thanks :-) Looks right at first glance. I'll test tomorrow.

Cheers,
Ben.

> Pekka
>
> >From 7760451b006b165bce052622af7316b54bd5a122 Mon Sep 17 00:00:00 2001
> From: Benjamin Herrenschmidt <[email protected]>
> Date: Fri, 12 Jun 2009 12:39:58 +0300
> Subject: [PATCH] Sanitize "gfp" flags during boot
>
> With the recent shuffle of initialization order to move memory related
> inits earlier, various subtle breakage was introduced in archs like
> powerpc due to code somewhat assuming that GFP_KERNEL can be used as
> soon as the allocators are up. This is not true because any __GFP_WAIT
> allocation will cause interrupts to be enabled, which can be fatal if
> it happens too early.
>
> This isn't trivial to fix on every call site. For example, powerpc's
> ioremap implementation needs to be called early. For that, it uses two
> different mechanisms to carve out virtual space. Before memory init,
> by moving down VMALLOC_END, and then, by calling get_vm_area().
> Unfortunately, the later does GFK_KERNEL allocations. But we can't do
> anything else because once vmalloc's been initialized, we can no longer
> safely move VMALLOC_END to carve out space.
>
> There are other examples, wehere can can be called either very early
> or later on when devices are hot-plugged. It would be a major pain for
> such code to have to "know" whether it's in a context where it should
> use GFP_KERNEL or GFP_NOWAIT.
>
> Finally, by having the ability to silently removed __GFP_WAIT from
> allocations, we pave the way for suspend-to-RAM to use that feature
> to also remove __GFP_IO from allocations done after suspending devices
> has started. This is important because such allocations may hang if
> devices on the swap-out path have been suspended, but not-yet suspended
> drivers don't know about it, and may deadlock themselves by being hung
> into a kmalloc somewhere while holding a mutex for example.
>
> Signed-off-by: Benjamin Herrenschmidt <[email protected]>
> Signed-off-by: Pekka Enberg <[email protected]>
> ---
> include/linux/gfp.h | 9 +++++++++
> init/main.c | 1 +
> mm/page_alloc.c | 18 ++++++++++++++++++
> mm/slab.c | 9 +++++++++
> mm/slub.c | 3 +++
> 5 files changed, 40 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 0bbc15f..8d2ea79 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -99,6 +99,13 @@ struct vm_area_struct;
> /* 4GB DMA on some platforms */
> #define GFP_DMA32 __GFP_DMA32
>
> +extern gfp_t gfp_allowed_bits;
> +
> +static inline gfp_t gfp_sanitize(gfp_t gfp_flags)
> +{
> + return gfp_flags & gfp_allowed_bits;
> +}
> +
> /* Convert GFP flags to their corresponding migrate type */
> static inline int allocflags_to_migratetype(gfp_t gfp_flags)
> {
> @@ -245,4 +252,6 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
> void drain_all_pages(void);
> void drain_local_pages(void *dummy);
>
> +void mm_late_init(void);
> +
> #endif /* __LINUX_GFP_H */
> diff --git a/init/main.c b/init/main.c
> index b3e8f14..34c6e7e 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -640,6 +640,7 @@ asmlinkage void __init start_kernel(void)
> "enabled early\n");
> early_boot_irqs_on();
> local_irq_enable();
> + mm_late_init();
>
> /*
> * HACK ALERT! This is early. We're enabling the console before
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7760ef9..a42e4c7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -77,6 +77,13 @@ int percpu_pagelist_fraction;
> int pageblock_order __read_mostly;
> #endif
>
> +/*
> + * We set up the page allocator and the slab allocator early on with interrupts
> + * disabled. Therefore, make sure that we sanitize GFP flags accordingly before
> + * everything is up and running.
> + */
> +gfp_t gfp_allowed_bits = ~(__GFP_WAIT|__GFP_FS | __GFP_IO);
> +
> static void __free_pages_ok(struct page *page, unsigned int order);
>
> /*
> @@ -1473,6 +1480,9 @@ __alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
> unsigned long did_some_progress;
> unsigned long pages_reclaimed = 0;
>
> + /* Sanitize flags so we don't enable irqs too early during boot */
> + gfp_mask = gfp_sanitize(gfp_mask);
> +
> lockdep_trace_alloc(gfp_mask);
>
> might_sleep_if(wait);
> @@ -4728,3 +4738,11 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
> spin_unlock_irqrestore(&zone->lock, flags);
> }
> #endif
> +
> +void mm_late_init(void)
> +{
> + /*
> + * Interrupts are enabled now so all GFP allocations are safe.
> + */
> + gfp_allowed_bits = __GFP_BITS_MASK;
> +}
> diff --git a/mm/slab.c b/mm/slab.c
> index f46b65d..87b166e 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -2791,6 +2791,9 @@ static int cache_grow(struct kmem_cache *cachep,
> gfp_t local_flags;
> struct kmem_list3 *l3;
>
> + /* Sanitize flags so we don't enable irqs too early during boot */
> + gfp_mask = gfp_sanitize(gfp_mask);
> +
> /*
> * Be lazy and only check for valid flags here, keeping it out of the
> * critical path in kmem_cache_alloc().
> @@ -3212,6 +3215,9 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
> void *obj = NULL;
> int nid;
>
> + /* Sanitize flags so we don't enable irqs too early during boot */
> + gfp_mask = gfp_sanitize(gfp_mask);
> +
> if (flags & __GFP_THISNODE)
> return NULL;
>
> @@ -3434,6 +3440,9 @@ __cache_alloc(struct kmem_cache *cachep, gfp_t flags, void *caller)
> unsigned long save_flags;
> void *objp;
>
> + /* Sanitize flags so we don't enable irqs too early during boot */
> + gfp_mask = gfp_sanitize(gfp_mask);
> +
> lockdep_trace_alloc(flags);
>
> if (slab_should_failslab(cachep, flags))
> diff --git a/mm/slub.c b/mm/slub.c
> index 3964d3c..5c646f7 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1512,6 +1512,9 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> /* We handle __GFP_ZERO in the caller */
> gfpflags &= ~__GFP_ZERO;
>
> + /* Sanitize flags so we don't enable irqs too early during boot */
> + gfpflags = gfp_sanitize(gfpflags);
> +
> if (!c->page)
> goto new_slab;
>

2009-06-12 10:00:30

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending


> Maybe if we just not make it a general "tweak gfpflag" bit (at
> least not until a bit more discussion), but a specific workaround
> for the local_irq_enable in early boot problem.
>
> Seems like it would not be hard to track things down if we add
> a warning if we have GFP_WAIT and interrupts are not enabled...

But tweaking local_irq_enable() will have a lot more performance & bloat
impact overall on the normal case.

Cheers,
Ben.

2009-06-12 10:00:43

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Fri, 12 Jun 2009, Benjamin Herrenschmidt wrote:
> > > Take a break, take a step back, and look at the big picture. Do you
> > > really want to find all the needles in the haystack or just make sure
> > > you wear gloves when handling the hay ? :-)

On Fri, 2009-06-12 at 12:45 +0300, Pekka J Enberg wrote:
> > Well, I would like to find the needles but I think we should do it with
> > gloves on.
> >
> > If everyone is happy with this version of Ben's patch, I'm going to just
> > apply it and push it to Linus.

On Fri, 2009-06-12 at 19:58 +1000, Benjamin Herrenschmidt wrote:
> Thanks :-) Looks right at first glance. I'll test tomorrow.

Nick? I do think this is the best short-term solution. We can get rid of
it later on if we decide to fix up the callers instead.

Pekka

2009-06-12 10:08:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending


* Pekka Enberg <[email protected]> wrote:

> On Fri, Jun 12, 2009 at 12:10 PM, Ingo Molnar<[email protected]> wrote:
> >> @@ -1548,6 +1548,20 @@ new_slab:
> >> ? ? ? ? ? ? ? goto load_freelist;
> >> ? ? ? }
> >>
> >> + ? ? /*
> >> + ? ? ?* Lets not wait if we're booting up or suspending even if the user
> >> + ? ? ?* asks for it.
> >> + ? ? ?*/
> >> + ? ? if (system_state != SYSTEM_RUNNING)
> >> + ? ? ? ? ? ? gfpflags &= ~__GFP_WAIT;
> >
> > Hiding that bug like that is not particularly clean IMO. We should
> > not let system_state hacks spread like that.
> >
> > We emit a debug warning but dont crash, so all should be fine and
> > the culprits can then be fixed, right?
>
> OK, lets not use system_state then and go with Ben's approach
> then. Again, neither of the patches are about "hiding buggy
> callers" but changing allocation policy wrt. gfp flags during boot
> (and later on during suspend).

IMHO such invisible side-channels modifying the semantics of GFP
flags is a bit dubious.

We could do GFP_INIT or GFP_BOOT. These can imply other useful
modifiers as well: panic-on-failure for example. (this would clean
up a fair amount of init code that currently checks for an panics on
allocation failure.)

Ingo

2009-06-12 10:11:58

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

Hi Ingo,

On Fri, Jun 12, 2009 at 1:07 PM, Ingo Molnar<[email protected]> wrote:
> IMHO such invisible side-channels modifying the semantics of GFP
> flags is a bit dubious.
>
> We could do GFP_INIT or GFP_BOOT. These can imply other useful
> modifiers as well: panic-on-failure for example. (this would clean
> up a fair amount of init code that currently checks for an panics on
> allocation failure.)

OK, but that means we need to fix up every single caller. I'm fine
with that but Ben is not. As I am unable to test powerpc here, I am
inclined to just merge Ben's patch as "obviously correct".

That does not mean we can't introduce GFP_BOOT later on if we want to. Hmm?

Pekka

2009-06-12 10:15:23

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Fri, Jun 12, 2009 at 01:11:52PM +0300, Pekka Enberg wrote:
> Hi Ingo,
>
> On Fri, Jun 12, 2009 at 1:07 PM, Ingo Molnar<[email protected]> wrote:
> > IMHO such invisible side-channels modifying the semantics of GFP
> > flags is a bit dubious.
> >
> > We could do GFP_INIT or GFP_BOOT. These can imply other useful
> > modifiers as well: panic-on-failure for example. (this would clean
> > up a fair amount of init code that currently checks for an panics on
> > allocation failure.)
>
> OK, but that means we need to fix up every single caller. I'm fine
> with that but Ben is not. As I am unable to test powerpc here, I am
> inclined to just merge Ben's patch as "obviously correct".

I agree with Ingo though that exposing it as a gfp modifier is
not so good. I just like the implementation to mask off GFP_WAIT
better, and also prefer not to test system state, but have someone
just call into slab to tell it not to unconditionally enable
interrupts.

> That does not mean we can't introduce GFP_BOOT later on if we want to. Hmm?

Yes, with sufficient warnings in place, I don't think it should be
too error prone to clean up remaining code over the course of
a few releases.

2009-06-12 10:30:23

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Fri, 12 Jun 2009, Nick Piggin wrote:
> > Hi Ingo,
> >
> > On Fri, Jun 12, 2009 at 1:07 PM, Ingo Molnar<[email protected]> wrote:
> > > IMHO such invisible side-channels modifying the semantics of GFP
> > > flags is a bit dubious.
> > >
> > > We could do GFP_INIT or GFP_BOOT. These can imply other useful
> > > modifiers as well: panic-on-failure for example. (this would clean
> > > up a fair amount of init code that currently checks for an panics on
> > > allocation failure.)
> >
> > OK, but that means we need to fix up every single caller. I'm fine
> > with that but Ben is not. As I am unable to test powerpc here, I am
> > inclined to just merge Ben's patch as "obviously correct".
>
> I agree with Ingo though that exposing it as a gfp modifier is
> not so good. I just like the implementation to mask off GFP_WAIT
> better, and also prefer not to test system state, but have someone
> just call into slab to tell it not to unconditionally enable
> interrupts.
>
> > That does not mean we can't introduce GFP_BOOT later on if we want to. Hmm?
>
> Yes, with sufficient warnings in place, I don't think it should be
> too error prone to clean up remaining code over the course of
> a few releases.

Hmm. This is turning into one epic patch discussion for sure! But here's a
patch to do what you suggested. With the amount of patches I am
generating, I'm bound to hit the right one sooner or later, no?-)

Pekka

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 4880306..219b8fb 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -319,4 +319,6 @@ static inline void *kzalloc_node(size_t size, gfp_t flags, int node)
return kmalloc_node(size, flags | __GFP_ZERO, node);
}

+void __init kmem_cache_init_late(void);
+
#endif /* _LINUX_SLAB_H */
diff --git a/include/linux/slob_def.h b/include/linux/slob_def.h
index 0ec00b3..bb5368d 100644
--- a/include/linux/slob_def.h
+++ b/include/linux/slob_def.h
@@ -34,4 +34,9 @@ static __always_inline void *__kmalloc(size_t size, gfp_t flags)
return kmalloc(size, flags);
}

+static inline void kmem_cache_init_late(void)
+{
+ /* Nothing to do */
+}
+
#endif /* __LINUX_SLOB_DEF_H */
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index be5d40c..4dcbc2c 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -302,4 +302,6 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
}
#endif

+void __init kmem_cache_init_late(void);
+
#endif /* _LINUX_SLUB_DEF_H */
diff --git a/init/main.c b/init/main.c
index b3e8f14..f6204f7 100644
--- a/init/main.c
+++ b/init/main.c
@@ -640,6 +640,7 @@ asmlinkage void __init start_kernel(void)
"enabled early\n");
early_boot_irqs_on();
local_irq_enable();
+ kmem_cache_init_late();

/*
* HACK ALERT! This is early. We're enabling the console before
diff --git a/mm/slab.c b/mm/slab.c
index f46b65d..1fac378 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -304,6 +304,12 @@ struct kmem_list3 {
};

/*
+ * The slab allocator is initialized with interrupts disabled. Therefore, make
+ * sure early boot allocations don't accidentally enable interrupts.
+ */
+static gfp_t slab_gfp_mask __read_mostly = __GFP_BITS_MASK & ~__GFP_WAIT;
+
+/*
* Need this for bootstrapping a per node allocator.
*/
#define NUM_INIT_LISTS (3 * MAX_NUMNODES)
@@ -1654,6 +1660,14 @@ void __init kmem_cache_init(void)
*/
}

+void __init kmem_cache_init_late(void)
+{
+ /*
+ * Interrupts are enabled now so all GFP allocations are safe.
+ */
+ slab_gfp_mask = __GFP_BITS_MASK;
+}
+
static int __init cpucache_init(void)
{
int cpu;
@@ -3237,6 +3251,8 @@ retry:
}

if (!obj) {
+ local_flags &= slab_gfp_mask;
+
/*
* This allocation will be performed within the constraints
* of the current cpuset / memory policy requirements.
@@ -3354,12 +3370,14 @@ __cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
unsigned long save_flags;
void *ptr;

+ flags &= slab_gfp_mask;
+
lockdep_trace_alloc(flags);

if (slab_should_failslab(cachep, flags))
return NULL;

- cache_alloc_debugcheck_before(cachep, flags);
+ cache_alloc_debugcheck_before(cachep, flags & slab_gfp_flags);
local_irq_save(save_flags);

if (unlikely(nodeid == -1))
@@ -3434,6 +3452,8 @@ __cache_alloc(struct kmem_cache *cachep, gfp_t flags, void *caller)
unsigned long save_flags;
void *objp;

+ flags &= slab_gfp_flags;
+
lockdep_trace_alloc(flags);

if (slab_should_failslab(cachep, flags))
diff --git a/mm/slub.c b/mm/slub.c
index 3964d3c..c09cb98 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -178,6 +178,12 @@ static enum {
SYSFS /* Sysfs up */
} slab_state = DOWN;

+/*
+ * The slab allocator is initialized with interrupts disabled. Therefore, make
+ * sure early boot allocations don't accidentally enable interrupts.
+ */
+static gfp_t slab_gfp_mask __read_mostly = __GFP_BITS_MASK & ~__GFP_WAIT;
+
/* A list of all slab caches on the system */
static DECLARE_RWSEM(slub_lock);
static LIST_HEAD(slab_caches);
@@ -1595,6 +1601,8 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
unsigned long flags;
unsigned int objsize;

+ gfpflags &= slab_gfp_mask;
+
lockdep_trace_alloc(gfpflags);
might_sleep_if(gfpflags & __GFP_WAIT);

@@ -3104,6 +3112,14 @@ void __init kmem_cache_init(void)
nr_cpu_ids, nr_node_ids);
}

+void __init kmem_cache_init_late(void)
+{
+ /*
+ * Interrupts are enabled now so all GFP allocations are safe.
+ */
+ slab_gfp_mask = __GFP_BITS_MASK;
+}
+
/*
* Find a mergeable slab cache
*/

2009-06-12 10:32:17

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Fri, Jun 12, 2009 at 1:30 PM, Pekka J Enberg<[email protected]> wrote:
> Hmm. This is turning into one epic patch discussion for sure! But here's a
> patch to do what you suggested. With the amount of patches I am
> generating, I'm bound to hit the right one sooner or later, no?-)

[ And yes, I do see SLAB parts are not even compiling. But you get the idea. ]

2009-06-12 11:10:20

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Fri, 2009-06-12 at 12:07 +0200, Ingo Molnar wrote:
>
> IMHO such invisible side-channels modifying the semantics of GFP
> flags is a bit dubious.
>
> We could do GFP_INIT or GFP_BOOT. These can imply other useful
> modifiers as well: panic-on-failure for example. (this would clean
> up a fair amount of init code that currently checks for an panics on
> allocation failure.)

I disagree.

I believe most code shouldn't have to care whether it's in boot, suspend
or similar to get the right flags to kmalloc().

This is especially true for when the allocator is called indirectly by
something that can itself be called from either boot or non-boot.

I believe the best example here is __get_vm_area() will use GFP_KERNEL.
I don't think it should be "fixed" to do anything else. The normal case
of GFP_KERNEL is correct and it shouldn't be changed to do GFP_NOWAIT
just because it happens that we use it earlier during init time.

This is also true of a lot of code used on "hotplug" path that is
commonly used at init time but can be used later on.

To some extent, the subtle distinction of whether interrupts are enabled
or not is something that shouldn't be something those callers have to
bother with. Yes, it is obvious for some strictly init code, but it's
far from being always that simple, and it's not unlikely that we'll
decide to move around in the init sequence the point at which we decide
to enable interrupts. We shouldn't have to fix half of the init code
when we do that.

In fact, we could push the logic further (but please read it all before
reacting :-) The fact that we -do- specific GFP_ATOMIC for atomic
context is -almost- a side effect of history. To some extent we could
get rid of it since we can almost always know when we are in such a
context. In that case, though, I believe we should keep it that way, at
least because it does discourage people from allocating in those
contexts which is a good thing.

Back to the general idea, I think we shouldn't burden arch, driver,
subsystem etc... code with the need to understand the system state, in
our present case, init vs. non init, but the same issue applies with
suspend/resume vs. GFP_NOIO as I explained in a separate email.

This typically a case where I believe the best way to ensure we do the
right thing is to put the check in the few common code path where
everybody funnels through, which is the allocator itself.

Cheers,
Ben.

2009-06-12 11:11:32

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending


> OK, but that means we need to fix up every single caller. I'm fine
> with that but Ben is not. As I am unable to test powerpc here, I am
> inclined to just merge Ben's patch as "obviously correct".
>
> That does not mean we can't introduce GFP_BOOT later on if we want to. Hmm?

Again, you are missing part of the picture. Yes we -can- fix all the
-direct- callers that are obviously only be run at boot time. But what
about all the indirect ones (or even direct ones) that can be called
either at boot time or later. vmalloc() is the perfect example (or more
precisely __get_vm_area() which brings in ioremap etc...) but there are
many more.

Cheers,
Ben.

2009-06-12 11:14:20

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending


> I agree with Ingo though that exposing it as a gfp modifier is
> not so good. I just like the implementation to mask off GFP_WAIT
> better, and also prefer not to test system state, but have someone
> just call into slab to tell it not to unconditionally enable
> interrupts.

But interrupts is just one example. GFP_NOIO is another one vs. suspend
and resume.

What we have here is the allocator needs to be clamped down based on the
system state. I think it will not work to try to butcher every caller,
especially since they don't always know themselves in what state they
are called.

Moving the "fix" into the couple of nexuses where all the code path go
through really seem like a better, simpler, more maintainable and more
fool proof solution to me.

> Yes, with sufficient warnings in place, I don't think it should be
> too error prone to clean up remaining code over the course of
> a few releases.

But that will no fix all the cases. That will not fix __get_vm_area()
being called from both boot and non-boot (and ioremap, etc..) and every
similar thing we can have all over the place (I have some in the
interrupt handling on powerpc, I'm sure we can find much more).

I don't see what the problem is in providing simple allocator semantics
and have the allocator itself adapt to the system state, especially when
the code is as small as having a bit mask applied in 2 or 3 places.

Cheers,
Ben.

> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2009-06-12 11:24:30

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Fri, 2009-06-12 at 21:13 +1000, Benjamin Herrenschmidt wrote:
> > I agree with Ingo though that exposing it as a gfp modifier is
> > not so good. I just like the implementation to mask off GFP_WAIT
> > better, and also prefer not to test system state, but have someone
> > just call into slab to tell it not to unconditionally enable
> > interrupts.
>
> But interrupts is just one example. GFP_NOIO is another one vs. suspend
> and resume.
>
> What we have here is the allocator needs to be clamped down based on the
> system state. I think it will not work to try to butcher every caller,
> especially since they don't always know themselves in what state they
> are called.

Let me put it another way....

If you have to teach every call site whether to use one flag or the
other, there is -no- difference with teaching them to call one routine
(alloc_bootmem) vs another (kmalloc).

The way I see thing is that the -whole- point of the exercise is to
remove the need for the callers to have to know in what environment they
are calling kmalloc().

Yes, we do still want that for atomic calls, just because it's a good
way to get people to think twice before allocating things in atomic
context, but that logic pretty much ends there.

If we're going to require any boot time caller of kmalloc() to pass a
different set of flags than any non-boot time caller, then the whole
idea of moving the initialization earlier so a single allocator can be
used is moot.

Cheers,
Ben.

2009-06-12 11:34:15

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

Hi Ben,

On Fri, 2009-06-12 at 21:11 +1000, Benjamin Herrenschmidt wrote:
> > OK, but that means we need to fix up every single caller. I'm fine
> > with that but Ben is not. As I am unable to test powerpc here, I am
> > inclined to just merge Ben's patch as "obviously correct".
> >
> > That does not mean we can't introduce GFP_BOOT later on if we want to. Hmm?
>
> Again, you are missing part of the picture. Yes we -can- fix all the
> -direct- callers that are obviously only be run at boot time. But what
> about all the indirect ones (or even direct ones) that can be called
> either at boot time or later. vmalloc() is the perfect example (or more
> precisely __get_vm_area() which brings in ioremap etc...) but there are
> many more.

No, I don't think I am. We can fix up the indirect callers too by making
sure we pass the proper GFP flag and propagate that all the way down.
Yes, this is potentially quite a bit of code churn which is why I do see
your patch being the easy way out.

That said, Nick and Ingo seem to think special-casing is questionable
and I haven't had green light for any of the patches yet. The gfp
sanitization patch adds some overhead to kmalloc() and page allocator
paths which is obviously a concern.

So while we continue to discuss this, I'd really like to proceed with
the patch below. At least it should allow people to boot their kernels
(although it will produce warnings). I really don't want to keep other
people waiting for us to reach a resolution on this. Are you OK with
that?

Pekka

>From f6b726dae91cc74fb3a00f192932ec4fe0949875 Mon Sep 17 00:00:00 2001
From: Pekka Enberg <[email protected]>
Date: Fri, 12 Jun 2009 14:03:06 +0300
Subject: [PATCH] slab: don't enable interrupts during early boot

As explained by Benjamin Herrenschmidt:

Oh and btw, your patch alone doesn't fix powerpc, because it's missing
a whole bunch of GFP_KERNEL's in the arch code... You would have to
grep the entire kernel for things that check slab_is_available() and
even then you'll be missing some.

For example, slab_is_available() didn't always exist, and so in the
early days on powerpc, we used a mem_init_done global that is set form
mem_init() (not perfect but works in practice). And we still have code
using that to do the test.

Therefore, mask out __GFP_WAIT in the slab allocators in early boot code to
avoid enabling interrupts.

Signed-off-by: Pekka Enberg <[email protected]>
---
include/linux/slab.h | 2 ++
include/linux/slob_def.h | 5 +++++
include/linux/slub_def.h | 2 ++
init/main.c | 1 +
mm/slab.c | 22 ++++++++++++++++++++++
mm/slub.c | 18 ++++++++++++++++++
6 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 4880306..219b8fb 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -319,4 +319,6 @@ static inline void *kzalloc_node(size_t size, gfp_t flags, int node)
return kmalloc_node(size, flags | __GFP_ZERO, node);
}

+void __init kmem_cache_init_late(void);
+
#endif /* _LINUX_SLAB_H */
diff --git a/include/linux/slob_def.h b/include/linux/slob_def.h
index 0ec00b3..bb5368d 100644
--- a/include/linux/slob_def.h
+++ b/include/linux/slob_def.h
@@ -34,4 +34,9 @@ static __always_inline void *__kmalloc(size_t size, gfp_t flags)
return kmalloc(size, flags);
}

+static inline void kmem_cache_init_late(void)
+{
+ /* Nothing to do */
+}
+
#endif /* __LINUX_SLOB_DEF_H */
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index be5d40c..4dcbc2c 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -302,4 +302,6 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
}
#endif

+void __init kmem_cache_init_late(void);
+
#endif /* _LINUX_SLUB_DEF_H */
diff --git a/init/main.c b/init/main.c
index b3e8f14..f6204f7 100644
--- a/init/main.c
+++ b/init/main.c
@@ -640,6 +640,7 @@ asmlinkage void __init start_kernel(void)
"enabled early\n");
early_boot_irqs_on();
local_irq_enable();
+ kmem_cache_init_late();

/*
* HACK ALERT! This is early. We're enabling the console before
diff --git a/mm/slab.c b/mm/slab.c
index f46b65d..a785808 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -304,6 +304,12 @@ struct kmem_list3 {
};

/*
+ * The slab allocator is initialized with interrupts disabled. Therefore, make
+ * sure early boot allocations don't accidentally enable interrupts.
+ */
+static gfp_t slab_gfp_mask __read_mostly = __GFP_BITS_MASK & ~__GFP_WAIT;
+
+/*
* Need this for bootstrapping a per node allocator.
*/
#define NUM_INIT_LISTS (3 * MAX_NUMNODES)
@@ -1654,6 +1660,14 @@ void __init kmem_cache_init(void)
*/
}

+void __init kmem_cache_init_late(void)
+{
+ /*
+ * Interrupts are enabled now so all GFP allocations are safe.
+ */
+ slab_gfp_mask = __GFP_BITS_MASK;
+}
+
static int __init cpucache_init(void)
{
int cpu;
@@ -2812,6 +2826,10 @@ static int cache_grow(struct kmem_cache *cachep,

offset *= cachep->colour_off;

+ /* Lets avoid crashing in early boot code. */
+ if (WARN_ON_ONCE((local_flags & ~slab_gfp_mask) != 0))
+ local_flags &= slab_gfp_mask;
+
if (local_flags & __GFP_WAIT)
local_irq_enable();

@@ -3237,6 +3255,10 @@ retry:
}

if (!obj) {
+ /* Lets avoid crashing in early boot code. */
+ if (WARN_ON_ONCE((local_flags & ~slab_gfp_mask) != 0))
+ local_flags &= slab_gfp_mask;
+
/*
* This allocation will be performed within the constraints
* of the current cpuset / memory policy requirements.
diff --git a/mm/slub.c b/mm/slub.c
index 3964d3c..651bb34 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -178,6 +178,12 @@ static enum {
SYSFS /* Sysfs up */
} slab_state = DOWN;

+/*
+ * The slab allocator is initialized with interrupts disabled. Therefore, make
+ * sure early boot allocations don't accidentally enable interrupts.
+ */
+static gfp_t slab_gfp_mask __read_mostly = __GFP_BITS_MASK & ~__GFP_WAIT;
+
/* A list of all slab caches on the system */
static DECLARE_RWSEM(slub_lock);
static LIST_HEAD(slab_caches);
@@ -1548,6 +1554,10 @@ new_slab:
goto load_freelist;
}

+ /* Lets avoid crashing in early boot code. */
+ if (WARN_ON_ONCE((gfpflags & ~slab_gfp_mask) != 0))
+ gfpflags &= slab_gfp_mask;
+
if (gfpflags & __GFP_WAIT)
local_irq_enable();

@@ -3104,6 +3114,14 @@ void __init kmem_cache_init(void)
nr_cpu_ids, nr_node_ids);
}

+void __init kmem_cache_init_late(void)
+{
+ /*
+ * Interrupts are enabled now so all GFP allocations are safe.
+ */
+ slab_gfp_mask = __GFP_BITS_MASK;
+}
+
/*
* Find a mergeable slab cache
*/
--
1.6.0.4


2009-06-12 11:41:39

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending


> No, I don't think I am. We can fix up the indirect callers too by making
> sure we pass the proper GFP flag and propagate that all the way down.
> Yes, this is potentially quite a bit of code churn which is why I do see
> your patch being the easy way out.

s/churn/bloat ... and I really don't see the point. We can duplicate all
of the __get_vm_area() interface variants with some _boot() versions,
that's going to grow the kernel text for no good reason.

Again, why would every call site have to know whether it's called during
the boot process or not... more than that, whether interrupts have been
turned on yet or not, especially since we may decide to move the point
where we turn them on in the future.

I really don't follow your logic here. It's fragile, adds complexity and
bloat, in an area where we are trying to remove some.

> That said, Nick and Ingo seem to think special-casing is questionable
> and I haven't had green light for any of the patches yet. The gfp
> sanitization patch adds some overhead to kmalloc() and page allocator
> paths which is obviously a concern.

Let's wait and see what Linus thinks...

> So while we continue to discuss this, I'd really like to proceed with
> the patch below. At least it should allow people to boot their kernels
> (although it will produce warnings). I really don't want to keep other
> people waiting for us to reach a resolution on this. Are you OK with
> that?

I don't care -how- we achieve the result I want as long as we achieve
it, which is to remove the need for callers to care. My approach was one
way to do it, I'm sure there's a better one. That's not the point. I'm
too tried now to properly review your patch and I'll need to test it
tomorrow morning, but it looks ok except for the WARN_ON maybe.

Again, I don't think we need to -fix- things. I think most code should
be able to call kmalloc(GFP_KERNEL) without having to bother at which
precise stage of the boot sequence it is running.

Cheers,
Ben.

> Pekka
>
> >From f6b726dae91cc74fb3a00f192932ec4fe0949875 Mon Sep 17 00:00:00 2001
> From: Pekka Enberg <[email protected]>
> Date: Fri, 12 Jun 2009 14:03:06 +0300
> Subject: [PATCH] slab: don't enable interrupts during early boot
>
> As explained by Benjamin Herrenschmidt:
>
> Oh and btw, your patch alone doesn't fix powerpc, because it's missing
> a whole bunch of GFP_KERNEL's in the arch code... You would have to
> grep the entire kernel for things that check slab_is_available() and
> even then you'll be missing some.
>
> For example, slab_is_available() didn't always exist, and so in the
> early days on powerpc, we used a mem_init_done global that is set form
> mem_init() (not perfect but works in practice). And we still have code
> using that to do the test.
>
> Therefore, mask out __GFP_WAIT in the slab allocators in early boot code to
> avoid enabling interrupts.
>
> Signed-off-by: Pekka Enberg <[email protected]>
> ---
> include/linux/slab.h | 2 ++
> include/linux/slob_def.h | 5 +++++
> include/linux/slub_def.h | 2 ++
> init/main.c | 1 +
> mm/slab.c | 22 ++++++++++++++++++++++
> mm/slub.c | 18 ++++++++++++++++++
> 6 files changed, 50 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 4880306..219b8fb 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -319,4 +319,6 @@ static inline void *kzalloc_node(size_t size, gfp_t flags, int node)
> return kmalloc_node(size, flags | __GFP_ZERO, node);
> }
>
> +void __init kmem_cache_init_late(void);
> +
> #endif /* _LINUX_SLAB_H */
> diff --git a/include/linux/slob_def.h b/include/linux/slob_def.h
> index 0ec00b3..bb5368d 100644
> --- a/include/linux/slob_def.h
> +++ b/include/linux/slob_def.h
> @@ -34,4 +34,9 @@ static __always_inline void *__kmalloc(size_t size, gfp_t flags)
> return kmalloc(size, flags);
> }
>
> +static inline void kmem_cache_init_late(void)
> +{
> + /* Nothing to do */
> +}
> +
> #endif /* __LINUX_SLOB_DEF_H */
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index be5d40c..4dcbc2c 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -302,4 +302,6 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
> }
> #endif
>
> +void __init kmem_cache_init_late(void);
> +
> #endif /* _LINUX_SLUB_DEF_H */
> diff --git a/init/main.c b/init/main.c
> index b3e8f14..f6204f7 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -640,6 +640,7 @@ asmlinkage void __init start_kernel(void)
> "enabled early\n");
> early_boot_irqs_on();
> local_irq_enable();
> + kmem_cache_init_late();
>
> /*
> * HACK ALERT! This is early. We're enabling the console before
> diff --git a/mm/slab.c b/mm/slab.c
> index f46b65d..a785808 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -304,6 +304,12 @@ struct kmem_list3 {
> };
>
> /*
> + * The slab allocator is initialized with interrupts disabled. Therefore, make
> + * sure early boot allocations don't accidentally enable interrupts.
> + */
> +static gfp_t slab_gfp_mask __read_mostly = __GFP_BITS_MASK & ~__GFP_WAIT;
> +
> +/*
> * Need this for bootstrapping a per node allocator.
> */
> #define NUM_INIT_LISTS (3 * MAX_NUMNODES)
> @@ -1654,6 +1660,14 @@ void __init kmem_cache_init(void)
> */
> }
>
> +void __init kmem_cache_init_late(void)
> +{
> + /*
> + * Interrupts are enabled now so all GFP allocations are safe.
> + */
> + slab_gfp_mask = __GFP_BITS_MASK;
> +}
> +
> static int __init cpucache_init(void)
> {
> int cpu;
> @@ -2812,6 +2826,10 @@ static int cache_grow(struct kmem_cache *cachep,
>
> offset *= cachep->colour_off;
>
> + /* Lets avoid crashing in early boot code. */
> + if (WARN_ON_ONCE((local_flags & ~slab_gfp_mask) != 0))
> + local_flags &= slab_gfp_mask;
> +
> if (local_flags & __GFP_WAIT)
> local_irq_enable();
>
> @@ -3237,6 +3255,10 @@ retry:
> }
>
> if (!obj) {
> + /* Lets avoid crashing in early boot code. */
> + if (WARN_ON_ONCE((local_flags & ~slab_gfp_mask) != 0))
> + local_flags &= slab_gfp_mask;
> +
> /*
> * This allocation will be performed within the constraints
> * of the current cpuset / memory policy requirements.
> diff --git a/mm/slub.c b/mm/slub.c
> index 3964d3c..651bb34 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -178,6 +178,12 @@ static enum {
> SYSFS /* Sysfs up */
> } slab_state = DOWN;
>
> +/*
> + * The slab allocator is initialized with interrupts disabled. Therefore, make
> + * sure early boot allocations don't accidentally enable interrupts.
> + */
> +static gfp_t slab_gfp_mask __read_mostly = __GFP_BITS_MASK & ~__GFP_WAIT;
> +
> /* A list of all slab caches on the system */
> static DECLARE_RWSEM(slub_lock);
> static LIST_HEAD(slab_caches);
> @@ -1548,6 +1554,10 @@ new_slab:
> goto load_freelist;
> }
>
> + /* Lets avoid crashing in early boot code. */
> + if (WARN_ON_ONCE((gfpflags & ~slab_gfp_mask) != 0))
> + gfpflags &= slab_gfp_mask;
> +
> if (gfpflags & __GFP_WAIT)
> local_irq_enable();
>
> @@ -3104,6 +3114,14 @@ void __init kmem_cache_init(void)
> nr_cpu_ids, nr_node_ids);
> }
>
> +void __init kmem_cache_init_late(void)
> +{
> + /*
> + * Interrupts are enabled now so all GFP allocations are safe.
> + */
> + slab_gfp_mask = __GFP_BITS_MASK;
> +}
> +
> /*
> * Find a mergeable slab cache
> */

2009-06-12 11:43:57

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

Hi Ben,

On Fri, Jun 12, 2009 at 2:41 PM, Benjamin
Herrenschmidt<[email protected]> wrote:
>> That said, Nick and Ingo seem to think special-casing is questionable
>> and I haven't had green light for any of the patches yet. The gfp
>> sanitization patch adds some overhead to kmalloc() and page allocator
>> paths which is obviously a concern.
>
> Let's wait and see what Linus thinks...

Yup, lets do that.

On Fri, Jun 12, 2009 at 2:41 PM, Benjamin
Herrenschmidt<[email protected]> wrote:
>> So while we continue to discuss this, I'd really like to proceed with
>> the patch below. At least it should allow people to boot their kernels
>> (although it will produce warnings). I really don't want to keep other
>> people waiting for us to reach a resolution on this. Are you OK with
>> that?
>
> I don't care -how- we achieve the result I want as long as we achieve
> it, which is to remove the need for callers to care. My approach was one
> way to do it, I'm sure there's a better one. That's not the point. I'm
> too tried now to properly review your patch and I'll need to test it
> tomorrow morning, but it looks ok except for the WARN_ON maybe.

OK, the WARN_ON is there because you will get warnings for
might_sleep() et al as well.

Pekka

2009-06-12 15:05:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending



On Fri, 12 Jun 2009, Pekka J Enberg wrote:
>
> + if (system_state != SYSTEM_RUNNING)
> + local_flags &= ~__GFP_WAIT;
> +
> + might_sleep_if(local_flags & __GFP_WAIT);

This is pointless.

You're doing the "might_sleep_if()" way too late. At that point, you've
already lost 99% of all coverage, since now none of the cases of just
finding a free slab entry on the list will ever trigger that
"might_sleep()" case.

So you need to do this _early_, at the entry-point, not late, at cache
re-fill time.

So rather than removing the might_sleep_if() at the early point, and then
moving it to this late stage (because you only do the local_flags fixups
late), you need to move the local-flags fixup early instead, and do the
might_sleep_it() there.

The whole point of "might_sleep()" is that it triggers every time if
something is called in the wrong context - not just for the cases where it
actually _does_ sleep.

Linus

2009-06-12 15:10:26

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

Hi Linus,

Linus Torvalds wrote:
>
> On Fri, 12 Jun 2009, Pekka J Enberg wrote:
>>
>> + if (system_state != SYSTEM_RUNNING)
>> + local_flags &= ~__GFP_WAIT;
>> +
>> + might_sleep_if(local_flags & __GFP_WAIT);
>
> This is pointless.
>
> You're doing the "might_sleep_if()" way too late. At that point, you've
> already lost 99% of all coverage, since now none of the cases of just
> finding a free slab entry on the list will ever trigger that
> "might_sleep()" case.
>
> So you need to do this _early_, at the entry-point, not late, at cache
> re-fill time.
>
> So rather than removing the might_sleep_if() at the early point, and then
> moving it to this late stage (because you only do the local_flags fixups
> late), you need to move the local-flags fixup early instead, and do the
> might_sleep_it() there.
>
> The whole point of "might_sleep()" is that it triggers every time if
> something is called in the wrong context - not just for the cases where it
> actually _does_ sleep.

OK, makes sense. So what do you think of this patch then:

http://patchwork.kernel.org/patch/29733/

It's what Ben has been proposing all along in a slightly edited form.

Pekka

2009-06-12 15:17:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending



On Fri, 12 Jun 2009, Pekka J Enberg wrote:
>
> Hmm. This is turning into one epic patch discussion for sure! But here's a
> patch to do what you suggested. With the amount of patches I am
> generating, I'm bound to hit the right one sooner or later, no?-)

Ok, this one looks pretty good. I like the statics, and I like how it lets
each allocator decide what to do.

Small nit: your mm/slab.c patch does an obviously unnecessary mask in:

cache_alloc_debugcheck_before(cachep, flags & slab_gfp_flags);

but that's stupid, because the bits were already masked earlier.

Linus

2009-06-12 15:21:25

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

Linus Torvalds wrote:
>
> On Fri, 12 Jun 2009, Pekka J Enberg wrote:
>> Hmm. This is turning into one epic patch discussion for sure! But here's a
>> patch to do what you suggested. With the amount of patches I am
>> generating, I'm bound to hit the right one sooner or later, no?-)
>
> Ok, this one looks pretty good. I like the statics, and I like how it lets
> each allocator decide what to do.
>
> Small nit: your mm/slab.c patch does an obviously unnecessary mask in:
>
> cache_alloc_debugcheck_before(cachep, flags & slab_gfp_flags);
>
> but that's stupid, because the bits were already masked earlier.

Yeah, the SLAB parts were completely untested. I have this in my tree
now (that I sent a pull request for):

http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=commitdiff;h=f6b726dae91cc74fb3a00f192932ec4fe0949875

Do you want me to drop it? I can also do an incremental patch to do the
unmasking as in this patch.

Pekka

2009-06-12 15:23:38

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Fri, 12 Jun 2009 12:45:21 +0300 (EEST) Pekka J Enberg <[email protected]> wrote:

> From: Benjamin Herrenschmidt <[email protected]>
> Date: Fri, 12 Jun 2009 12:39:58 +0300
> Subject: [PATCH] Sanitize "gfp" flags during boot
>
> With the recent shuffle of initialization order to move memory related
> inits earlier, various subtle breakage was introduced in archs like
> powerpc due to code somewhat assuming that GFP_KERNEL can be used as
> soon as the allocators are up. This is not true because any __GFP_WAIT
> allocation will cause interrupts to be enabled, which can be fatal if
> it happens too early.
>
> This isn't trivial to fix on every call site. For example, powerpc's
> ioremap implementation needs to be called early. For that, it uses two
> different mechanisms to carve out virtual space. Before memory init,
> by moving down VMALLOC_END, and then, by calling get_vm_area().
> Unfortunately, the later does GFK_KERNEL allocations. But we can't do
> anything else because once vmalloc's been initialized, we can no longer
> safely move VMALLOC_END to carve out space.
>
> There are other examples, wehere can can be called either very early
> or later on when devices are hot-plugged. It would be a major pain for
> such code to have to "know" whether it's in a context where it should
> use GFP_KERNEL or GFP_NOWAIT.
>
> Finally, by having the ability to silently removed __GFP_WAIT from
> allocations, we pave the way for suspend-to-RAM to use that feature
> to also remove __GFP_IO from allocations done after suspending devices
> has started. This is important because such allocations may hang if
> devices on the swap-out path have been suspended, but not-yet suspended
> drivers don't know about it, and may deadlock themselves by being hung
> into a kmalloc somewhere while holding a mutex for example.
>
> ...
>
> +/*
> + * We set up the page allocator and the slab allocator early on with interrupts
> + * disabled. Therefore, make sure that we sanitize GFP flags accordingly before
> + * everything is up and running.
> + */
> +gfp_t gfp_allowed_bits = ~(__GFP_WAIT|__GFP_FS | __GFP_IO);

__read_mostly

> +void mm_late_init(void)
> +{
> + /*
> + * Interrupts are enabled now so all GFP allocations are safe.
> + */
> + gfp_allowed_bits = __GFP_BITS_MASK;
> +}

Using plain old -1 here would be a more obviously-correct change.

2009-06-12 15:30:35

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Fri, 12 Jun 2009 14:34:00 +0300 Pekka Enberg <[email protected]> wrote:

> +static gfp_t slab_gfp_mask __read_mostly = __GFP_BITS_MASK & ~__GFP_WAIT;

It'd be safer and saner to disable __GFP_FS and __GFP_IO as well.
Having either of those flags set without __GFP_WAIT is a somewhat
self-contradictory thing and there might be code under reclaim which
assumes that __GFP_FS|__GFP_IO implies __GFP_WAIT.

<wonders why mempool_alloc() didn't clear __GFP_FS>

2009-06-12 21:43:18

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Fri, 2009-06-12 at 08:30 -0700, Andrew Morton wrote:
> On Fri, 12 Jun 2009 14:34:00 +0300 Pekka Enberg <[email protected]> wrote:
>
> > +static gfp_t slab_gfp_mask __read_mostly = __GFP_BITS_MASK & ~__GFP_WAIT;
>
> It'd be safer and saner to disable __GFP_FS and __GFP_IO as well.

Right. That's what my original patch does in fact. I also re-enabled
them all together but in that case, it might be better to re-enable FS
and IO later, I'll let experts decide.

> Having either of those flags set without __GFP_WAIT is a somewhat
> self-contradictory thing and there might be code under reclaim which
> assumes that __GFP_FS|__GFP_IO implies __GFP_WAIT.
>
> <wonders why mempool_alloc() didn't clear __GFP_FS>

Cheers,
Ben.

> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2009-06-19 14:59:26

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

Hi!

>
> As explained by Benjamin Herrenschmidt:
>
> Oh and btw, your patch alone doesn't fix powerpc, because it's missing
> a whole bunch of GFP_KERNEL's in the arch code... You would have to
> grep the entire kernel for things that check slab_is_available() and
> even then you'll be missing some.
>
> For example, slab_is_available() didn't always exist, and so in the
> early days on powerpc, we used a mem_init_done global that is set form
> mem_init() (not perfect but works in practice). And we still have code
> using that to do the test.
>
> Therefore, ignore __GFP_WAIT in the slab allocators if we're booting or
> suspending.

Ok... GFP_KERNEL allocations normally don't fail; now they
will. Should we at least force access to atomic reserves in such case?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-06-19 22:28:40

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Fri, 2009-06-19 at 16:59 +0200, Pavel Machek wrote:
>
> Ok... GFP_KERNEL allocations normally don't fail; now they
> will. Should we at least force access to atomic reserves in such case?

No. First, code that assumes GFP_KERNEL don't fail is stupid. Any
allocation should always be assumed to potentially fail.

Then, if you start failing allocations at boot time, then you aren't
going anywhere are you ?

Cheers,
Ben.

2009-06-19 23:23:49

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Sat 2009-06-20 08:27:29, Benjamin Herrenschmidt wrote:
> On Fri, 2009-06-19 at 16:59 +0200, Pavel Machek wrote:
> >
> > Ok... GFP_KERNEL allocations normally don't fail; now they
> > will. Should we at least force access to atomic reserves in such case?
>
> No. First, code that assumes GFP_KERNEL don't fail is stupid. Any
> allocation should always be assumed to potentially fail.

Stupid, yes. Uncommon? Not sure.

> Then, if you start failing allocations at boot time, then you aren't
> going anywhere are you ?

Exactly. So boot code should have access to all the memory, right?
Setting some aside for GFP_ATOMIC does not make sense in that context.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-06-19 23:50:49

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Sat, 2009-06-20 at 01:23 +0200, Pavel Machek wrote:
> > No. First, code that assumes GFP_KERNEL don't fail is stupid. Any
> > allocation should always be assumed to potentially fail.
>
> Stupid, yes. Uncommon? Not sure.

A lot less than it used to be, we've been fixing those by the truckload
over the past few years. But again, if allocations start failing that
early at boot, you are likely to be doomed anyway. Still, better to do
proper error handling, and I think we -mostly- do (ok, not -always-).

> > Then, if you start failing allocations at boot time, then you aren't
> > going anywhere are you ?
>
> Exactly. So boot code should have access to all the memory, right?
> Setting some aside for GFP_ATOMIC does not make sense in that context.

I'm not certain what you mean here. If you're going to hit the atomic
reserve that early, you aren't going anywhere neither :-)

Is there any real problem you are trying to solve here or is it all
just academic ?

Cheers,
Ben.

2009-06-20 00:28:30

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Sat 2009-06-20 09:50:09, Benjamin Herrenschmidt wrote:
> On Sat, 2009-06-20 at 01:23 +0200, Pavel Machek wrote:
> > > No. First, code that assumes GFP_KERNEL don't fail is stupid. Any
> > > allocation should always be assumed to potentially fail.
> >
> > Stupid, yes. Uncommon? Not sure.
>
> A lot less than it used to be, we've been fixing those by the truckload
> over the past few years. But again, if allocations start failing that
> early at boot, you are likely to be doomed anyway. Still, better to do
> proper error handling, and I think we -mostly- do (ok, not -always-).
>
> > > Then, if you start failing allocations at boot time, then you aren't
> > > going anywhere are you ?
> >
> > Exactly. So boot code should have access to all the memory, right?
> > Setting some aside for GFP_ATOMIC does not make sense in that context.
>
> I'm not certain what you mean here. If you're going to hit the atomic
> reserve that early, you aren't going anywhere neither :-)
>
> Is there any real problem you are trying to solve here or is it all
> just academic ?

Academic for boot, probably real for suspend/resume. There the atomic
reserves could matter because the memory can be pretty full when you
start suspend.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-06-20 02:10:59

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Sat, 2009-06-20 at 02:28 +0200, Pavel Machek wrote:
>
> Academic for boot, probably real for suspend/resume. There the atomic
> reserves could matter because the memory can be pretty full when you
> start suspend.

Right, that might be something to look into, though we haven't yet
applied the technique for suspend & resume. My main issue with it at the
moment is how do I synchronize with allocations that are already
sleeping when changing the gfp flag mask without bloating the normal
path. I haven't had time to look into it, it's mostly a problem local to
the page allocator and reclaim, not much to do with SL*Bs though, which
is fortunate.

I also suspect that we might want to try to make -some- amount of free
space before starting suspend, though of course not nearly as
aggressively as with std.

Cheers,
Ben.

2009-06-21 06:24:01

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

Hi!

> > Academic for boot, probably real for suspend/resume. There the atomic
> > reserves could matter because the memory can be pretty full when you
> > start suspend.
>
> Right, that might be something to look into, though we haven't yet
> applied the technique for suspend & resume. My main issue with it at the
> moment is how do I synchronize with allocations that are already
> sleeping when changing the gfp flag mask without bloating the normal

Well, but the problem already exists, no? If someone is already
sleeping due to __GFP_WAIT, he'll probably sleep till the resume.

...well, if he's sleeping in the disk driver, I suspect driver will
finish outstanding requests as part of .suspend().

> I also suspect that we might want to try to make -some- amount of free
> space before starting suspend, though of course not nearly as
> aggressively as with std.

We free 4MB in 2.6.30, but Rafael is removing that for 2.6.31 :-(.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-06-21 09:32:17

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending


> > Right, that might be something to look into, though we haven't yet
> > applied the technique for suspend & resume. My main issue with it at the
> > moment is how do I synchronize with allocations that are already
> > sleeping when changing the gfp flag mask without bloating the normal
>
> Well, but the problem already exists, no? If someone is already
> sleeping due to __GFP_WAIT, he'll probably sleep till the resume.

Yes. In fact, without the masking, a driver that hasn't been suspended
yet could well start sleeping in GFP_KERNEL after the disk driver has
suspended. It may do so while holding a mutex or similar, which might
deadlock its own suspend() callback. It's not something that drivers can
trivially address by having a pre-suspend hook, and avoid allocations,
since allocations may be done by subsystems on behalf of the driver or
such. It's a can of worms, which is why I believe the only sane approach
is to stop allocators from doing IOs once we start suspend.

So yes, just applying the mask would help, but wouldn't completely fix
it unless we also find a way to synchronize.

> ...well, if he's sleeping in the disk driver, I suspect driver will
> finish outstanding requests as part of .suspend().
>
> > I also suspect that we might want to try to make -some- amount of free
> > space before starting suspend, though of course not nearly as
> > aggressively as with std.
>
> We free 4MB in 2.6.30, but Rafael is removing that for 2.6.31 :-(.

Well... we are taking a chance of making the above scenario more likely
to hit then.

Cheers,
Ben.

2009-06-25 04:34:39

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Sun, Jun 21, 2009 at 07:31:05PM +1000, Benjamin Herrenschmidt wrote:
>
> > > Right, that might be something to look into, though we haven't yet
> > > applied the technique for suspend & resume. My main issue with it at the
> > > moment is how do I synchronize with allocations that are already
> > > sleeping when changing the gfp flag mask without bloating the normal
> >
> > Well, but the problem already exists, no? If someone is already
> > sleeping due to __GFP_WAIT, he'll probably sleep till the resume.
>
> Yes. In fact, without the masking, a driver that hasn't been suspended
> yet could well start sleeping in GFP_KERNEL after the disk driver has
> suspended. It may do so while holding a mutex or similar, which might
> deadlock its own suspend() callback. It's not something that drivers can
> trivially address by having a pre-suspend hook, and avoid allocations,
> since allocations may be done by subsystems on behalf of the driver or
> such. It's a can of worms, which is why I believe the only sane approach
> is to stop allocators from doing IOs once we start suspend.

Maybe so. Masking off __GFP_WAIT up in slab and page allocator
isn't really needed though (or necessarily a good idea to throw
out that information far from where it is used).

Checking for suspend active and avoiding writeout from reclaim
for example might be a better idea.


> So yes, just applying the mask would help, but wouldn't completely fix
> it unless we also find a way to synchronize.

You could potentially use srcu or something like that in page
reclaim in order to have a way to be able to kick everyone
out. page reclaim entry/exit from the page allocator isn't such
a fastpath though, so even a simple mutex or something may be
possible.

2009-06-25 04:38:26

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Fri, Jun 12, 2009 at 07:59:34PM +1000, Benjamin Herrenschmidt wrote:
>
> > Maybe if we just not make it a general "tweak gfpflag" bit (at
> > least not until a bit more discussion), but a specific workaround
> > for the local_irq_enable in early boot problem.
> >
> > Seems like it would not be hard to track things down if we add
> > a warning if we have GFP_WAIT and interrupts are not enabled...
>
> But tweaking local_irq_enable() will have a lot more performance & bloat
> impact overall on the normal case.

(sorry for the late replies. I've been sick and missed a few
things over the past week or two... not that this is a really
urgent issue ;))

I was not proposing to put a branch in local_irq_enable ;)
but to use local_irq_save/restore in the slab allocators rather
than unconditional.

2009-06-25 04:42:00

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

On Fri, Jun 12, 2009 at 08:30:05AM -0700, Andrew Morton wrote:
> On Fri, 12 Jun 2009 14:34:00 +0300 Pekka Enberg <[email protected]> wrote:
>
> > +static gfp_t slab_gfp_mask __read_mostly = __GFP_BITS_MASK & ~__GFP_WAIT;
>
> It'd be safer and saner to disable __GFP_FS and __GFP_IO as well.
> Having either of those flags set without __GFP_WAIT is a somewhat
> self-contradictory thing and there might be code under reclaim which
> assumes that __GFP_FS|__GFP_IO implies __GFP_WAIT.
>
> <wonders why mempool_alloc() didn't clear __GFP_FS>

Maybe we never get there if __GFP_WAIT is clear? It would be neater
if it did clear __GFP_FS, though...

2009-06-25 09:57:40

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v2] slab,slub: ignore __GFP_WAIT if we're booting or suspending

> Maybe so. Masking off __GFP_WAIT up in slab and page allocator
> isn't really needed though (or necessarily a good idea to throw
> out that information far from where it is used).
>
> Checking for suspend active and avoiding writeout from reclaim
> for example might be a better idea.

Ah ok. Yes, I agree. I'm not familiar with those code path and
so masking gfp here sounded like the easier solution but you may well be
right here :-)

> > So yes, just applying the mask would help, but wouldn't completely fix
> > it unless we also find a way to synchronize.
>
> You could potentially use srcu or something like that in page
> reclaim in order to have a way to be able to kick everyone
> out. page reclaim entry/exit from the page allocator isn't such
> a fastpath though, so even a simple mutex or something may be
> possible.

Ok. Well, I'll leave that to the suspend/resume folks for now, as I'm
way too busy at the moment to give that a serious look, but thanks for
the pointer.

Cheers,
Ben.

> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>