2008-12-26 06:39:15

by Miao Xie

[permalink] [raw]
Subject: [PATCH] cpuset,mm: fix allocating page cache/slab object on the unallowed node when memory spread is set

The task still allocated the page caches on old node after modifying its
cpuset's mems when 'memory_spread_page' was set, it is caused by the old
mem_allowed_list of the task. Slab has the same problem.

This patch fixes this bug.

Signed-off-by: Miao Xie <[email protected]>
---
mm/filemap.c | 3 +++
mm/slab.c | 3 +++
2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index f3e5f89..d978983 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -517,6 +517,9 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
#ifdef CONFIG_NUMA
struct page *__page_cache_alloc(gfp_t gfp)
{
+ if ((gfp & __GFP_WAIT) && !in_interrupt())
+ cpuset_update_task_memory_state();
+
if (cpuset_do_page_mem_spread()) {
int n = cpuset_mem_spread_node();
return alloc_pages_node(n, gfp, 0);
diff --git a/mm/slab.c b/mm/slab.c
index 0918751..3b6e3d7 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3460,6 +3460,9 @@ __cache_alloc(struct kmem_cache *cachep, gfp_t flags, void *caller)
if (should_failslab(cachep, flags))
return NULL;

+ if ((flags & __GFP_WAIT) && !in_interrupt())
+ cpuset_update_task_memory_state();
+
cache_alloc_debugcheck_before(cachep, flags);
local_irq_save(save_flags);
objp = __do_cache_alloc(cachep, flags);
--
1.5.4.rc3


2008-12-30 22:29:22

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] cpuset,mm: fix allocating page cache/slab object on the unallowed node when memory spread is set

On Fri, 26 Dec 2008 14:37:07 +0800
Miao Xie <[email protected]> wrote:

> The task still allocated the page caches on old node after modifying its
> cpuset's mems when 'memory_spread_page' was set, it is caused by the old
> mem_allowed_list of the task. Slab has the same problem.

ok...

> diff --git a/mm/filemap.c b/mm/filemap.c
> index f3e5f89..d978983 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -517,6 +517,9 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
> #ifdef CONFIG_NUMA
> struct page *__page_cache_alloc(gfp_t gfp)
> {
> + if ((gfp & __GFP_WAIT) && !in_interrupt())
> + cpuset_update_task_memory_state();
> +
> if (cpuset_do_page_mem_spread()) {
> int n = cpuset_mem_spread_node();
> return alloc_pages_node(n, gfp, 0);
> diff --git a/mm/slab.c b/mm/slab.c
> index 0918751..3b6e3d7 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3460,6 +3460,9 @@ __cache_alloc(struct kmem_cache *cachep, gfp_t flags, void *caller)
> if (should_failslab(cachep, flags))
> return NULL;
>
> + if ((flags & __GFP_WAIT) && !in_interrupt())
> + cpuset_update_task_memory_state();
> +
> cache_alloc_debugcheck_before(cachep, flags);
> local_irq_save(save_flags);
> objp = __do_cache_alloc(cachep, flags);

Problems.

a) There's no need to test in_interrupt(). Any caller who passed us
__GFP_WAIT from interrupt context is horridly buggy and needs to be
fixed.

b) Even if the caller _did_ set __GFP_WAIT, there's no guarantee
that we're deadlock safe here. Does anyone ever do a __GFP_WAIT
allocation while holding callback_mutex? If so, it'll deadlock.

c) These are two of the kernel's hottest code paths. We really
really really really don't want to be polling for some dopey
userspace admin change on each call to __cache_alloc()!

d) How does slub handle this problem?

2008-12-31 03:14:19

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] cpuset,mm: fix allocating page cache/slab object on the unallowed node when memory spread is set

On Wednesday 31 December 2008 09:28:05 Andrew Morton wrote:
> On Fri, 26 Dec 2008 14:37:07 +0800
>
> Miao Xie <[email protected]> wrote:
> > The task still allocated the page caches on old node after modifying its
> > cpuset's mems when 'memory_spread_page' was set, it is caused by the old
> > mem_allowed_list of the task. Slab has the same problem.
>
> ok...
>
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index f3e5f89..d978983 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -517,6 +517,9 @@ int add_to_page_cache_lru(struct page *page, struct
> > address_space *mapping, #ifdef CONFIG_NUMA
> > struct page *__page_cache_alloc(gfp_t gfp)
> > {
> > + if ((gfp & __GFP_WAIT) && !in_interrupt())
> > + cpuset_update_task_memory_state();
> > +
> > if (cpuset_do_page_mem_spread()) {
> > int n = cpuset_mem_spread_node();
> > return alloc_pages_node(n, gfp, 0);
> > diff --git a/mm/slab.c b/mm/slab.c
> > index 0918751..3b6e3d7 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -3460,6 +3460,9 @@ __cache_alloc(struct kmem_cache *cachep, gfp_t
> > flags, void *caller) if (should_failslab(cachep, flags))
> > return NULL;
> >
> > + if ((flags & __GFP_WAIT) && !in_interrupt())
> > + cpuset_update_task_memory_state();
> > +

These paths are pretty performance critical. Why don't cpusets code do this
work in the slowpath where the cpuset's mems_allowed gets changed rather
than putting these calls all over the place with apparently no real rhyme or
reason :( (this is not against your patch, but just this part of the cpusets
design)

> > cache_alloc_debugcheck_before(cachep, flags);
> > local_irq_save(save_flags);
> > objp = __do_cache_alloc(cachep, flags);
>
> Problems.
>
> a) There's no need to test in_interrupt(). Any caller who passed us
> __GFP_WAIT from interrupt context is horridly buggy and needs to be
> fixed.

Right. There are existing sites that do the same check, which is probably
where it is copied from.


> b) Even if the caller _did_ set __GFP_WAIT, there's no guarantee
> that we're deadlock safe here. Does anyone ever do a __GFP_WAIT
> allocation while holding callback_mutex? If so, it'll deadlock.

It's static to cpuset.c, so I'd hope not.


> c) These are two of the kernel's hottest code paths. We really
> really really really don't want to be polling for some dopey
> userspace admin change on each call to __cache_alloc()!

Yeah, right. Let's try to fix cpuset.c instead...


> d) How does slub handle this problem?

SLUB seems to do a "sloppy" kind of memory policy allocation, where it just
relies on the page allocator to hand us the correct page and AFAIKS does not
exactly obey this stuff all the time.

2008-12-31 04:01:29

by Miao Xie

[permalink] [raw]
Subject: Re: [PATCH] cpuset,mm: fix allocating page cache/slab object on the unallowed node when memory spread is set

on 2008-12-31 11:13 Nick Piggin wrote:
> On Wednesday 31 December 2008 09:28:05 Andrew Morton wrote:
>> On Fri, 26 Dec 2008 14:37:07 +0800
>>
>> Miao Xie <[email protected]> wrote:
>>> The task still allocated the page caches on old node after modifying its
>>> cpuset's mems when 'memory_spread_page' was set, it is caused by the old
>>> mem_allowed_list of the task. Slab has the same problem.
>> ok...
>>
>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index f3e5f89..d978983 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -517,6 +517,9 @@ int add_to_page_cache_lru(struct page *page, struct
>>> address_space *mapping, #ifdef CONFIG_NUMA
>>> struct page *__page_cache_alloc(gfp_t gfp)
>>> {
>>> + if ((gfp & __GFP_WAIT) && !in_interrupt())
>>> + cpuset_update_task_memory_state();
>>> +
>>> if (cpuset_do_page_mem_spread()) {
>>> int n = cpuset_mem_spread_node();
>>> return alloc_pages_node(n, gfp, 0);
>>> diff --git a/mm/slab.c b/mm/slab.c
>>> index 0918751..3b6e3d7 100644
>>> --- a/mm/slab.c
>>> +++ b/mm/slab.c
>>> @@ -3460,6 +3460,9 @@ __cache_alloc(struct kmem_cache *cachep, gfp_t
>>> flags, void *caller) if (should_failslab(cachep, flags))
>>> return NULL;
>>>
>>> + if ((flags & __GFP_WAIT) && !in_interrupt())
>>> + cpuset_update_task_memory_state();
>>> +
>
> These paths are pretty performance critical. Why don't cpusets code do this
> work in the slowpath where the cpuset's mems_allowed gets changed rather
> than putting these calls all over the place with apparently no real rhyme or
> reason :( (this is not against your patch, but just this part of the cpusets
> design)

I see. I will do it.

>>> cache_alloc_debugcheck_before(cachep, flags);
>>> local_irq_save(save_flags);
>>> objp = __do_cache_alloc(cachep, flags);
>> Problems.
>>
>> a) There's no need to test in_interrupt(). Any caller who passed us
>> __GFP_WAIT from interrupt context is horridly buggy and needs to be
>> fixed.
>
> Right. There are existing sites that do the same check, which is probably
> where it is copied from.

I will do cleanup in the next patch.
Thanks!

>
>> b) Even if the caller _did_ set __GFP_WAIT, there's no guarantee
>> that we're deadlock safe here. Does anyone ever do a __GFP_WAIT
>> allocation while holding callback_mutex? If so, it'll deadlock.
>
> It's static to cpuset.c, so I'd hope not.
>
>
>> c) These are two of the kernel's hottest code paths. We really
>> really really really don't want to be polling for some dopey
>> userspace admin change on each call to __cache_alloc()!
>
> Yeah, right. Let's try to fix cpuset.c instead...
>
>> d) How does slub handle this problem?
>
> SLUB seems to do a "sloppy" kind of memory policy allocation, where it just
> relies on the page allocator to hand us the correct page and AFAIKS does not
> exactly obey this stuff all the time.

2008-12-31 22:54:39

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] cpuset,mm: fix allocating page cache/slab object on the unallowed node when memory spread is set

On Tue, 30 Dec 2008, Andrew Morton wrote:

> d) How does slub handle this problem?

SLUB understands memory policies to apply to the pages from which
objects are acquired. MPOL_INTERLEAVE applies to the pages acquired
by the allocator not to individual objects acquired by the user of the
allocator from these pages.

With that point of view most of the memory policy handling can be pushed
into the page allocator.

2008-12-31 22:39:13

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] cpuset,mm: fix allocating page cache/slab object on the unallowed node when memory spread is set

On Wed, 31 Dec 2008, Nick Piggin wrote:

> These paths are pretty performance critical. Why don't cpusets code do this
> work in the slowpath where the cpuset's mems_allowed gets changed rather
> than putting these calls all over the place with apparently no real rhyme or
> reason :( (this is not against your patch, but just this part of the cpusets
> design)

Right.

> > d) How does slub handle this problem?
>
> SLUB seems to do a "sloppy" kind of memory policy allocation, where it just
> relies on the page allocator to hand us the correct page and AFAIKS does not
> exactly obey this stuff all the time.

Slub avoids hanlding memory policy decisions and lets the page allocator
deal with it. That means that memory policies are not enforced on an
object basis but on a page basis. If you allocate a series of objects
under MPOL_INTERLEAVE then SLAB will give you one object from each node.
SLUB will give you objects from one page until the objects in a page are
exhausted. The next page will be acquired according to the current
memory policy. Meaning the page will come from the next node if
MPOL_INTERLEAVE is set. The following set of objects will be allocated
from that node. This allows a faster allocation for NUMA since the cachelines
for allocation can be kept hot. The page are still allocated from all
nodes.