2011-04-15 13:51:47

by Michal Hocko

[permalink] [raw]
Subject: [PATCH] mm: make expand_downwards symmetrical to expand_upwards

Hi,
the following patch is just a cleanup for better readability without any
functional changes. What do you think about it?
---
>From 71de71aaa725ee87459b3a256e8bb0af7de4abeb Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Fri, 15 Apr 2011 14:56:26 +0200
Subject: [PATCH] mm: make expand_downwards symmetrical to expand_upwards

Currently we have expand_upwards exported while expand_downwards is
accessible only via expand_stack.

check_stack_guard_page is a nice example of the asymmetry. It uses
expand_stack for VM_GROWSDOWN while expand_upwards is called for
VM_GROWSUP case. Let's make this consistent and export expand_downwards
same way we do with expand_upwards.

Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/mm.h | 2 ++
mm/memory.c | 2 +-
mm/mmap.c | 2 +-
3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 692dbae..765cf4e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1498,8 +1498,10 @@ unsigned long ra_submit(struct file_ra_state *ra,
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
#if VM_GROWSUP
extern int expand_upwards(struct vm_area_struct *vma, unsigned long address);
+ #define expand_downwards(vma, address) do { } while (0)
#else
#define expand_upwards(vma, address) do { } while (0)
+extern int expand_downwards(struct vm_area_struct *vma, unsigned long address);
#endif
extern int expand_stack_downwards(struct vm_area_struct *vma,
unsigned long address);
diff --git a/mm/memory.c b/mm/memory.c
index ce22a25..f404fb6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2969,7 +2969,7 @@ static inline int check_stack_guard_page(struct vm_area_struct *vma, unsigned lo
if (prev && prev->vm_end == address)
return prev->vm_flags & VM_GROWSDOWN ? 0 : -ENOMEM;

- expand_stack(vma, address - PAGE_SIZE);
+ expand_downwards(vma, address - PAGE_SIZE);
}
if ((vma->vm_flags & VM_GROWSUP) && address + PAGE_SIZE == vma->vm_end) {
struct vm_area_struct *next = vma->vm_next;
diff --git a/mm/mmap.c b/mm/mmap.c
index e27e0cf..6b2a817 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1782,7 +1782,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
/*
* vma is the first one with address < vma->vm_start. Have to extend vma.
*/
-static int expand_downwards(struct vm_area_struct *vma,
+int expand_downwards(struct vm_area_struct *vma,
unsigned long address)
{
int error;
--
1.7.4.1

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic


2011-04-18 03:00:24

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] mm: make expand_downwards symmetrical to expand_upwards

On Fri, 14 Apr 2011, Michal Hocko wrote:

> Hi,
> the following patch is just a cleanup for better readability without any
> functional changes. What do you think about it?
> ---
> From 71de71aaa725ee87459b3a256e8bb0af7de4abeb Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Fri, 15 Apr 2011 14:56:26 +0200
> Subject: [PATCH] mm: make expand_downwards symmetrical to expand_upwards
>
> Currently we have expand_upwards exported while expand_downwards is
> accessible only via expand_stack.
>
> check_stack_guard_page is a nice example of the asymmetry. It uses
> expand_stack for VM_GROWSDOWN while expand_upwards is called for
> VM_GROWSUP case. Let's make this consistent and export expand_downwards
> same way we do with expand_upwards.
>
> Signed-off-by: Michal Hocko <[email protected]>

Yes, I've just been looking around here, and I like your symmetry.
But two points:

> ---
> include/linux/mm.h | 2 ++
> mm/memory.c | 2 +-
> mm/mmap.c | 2 +-
> 3 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 692dbae..765cf4e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1498,8 +1498,10 @@ unsigned long ra_submit(struct file_ra_state *ra,
> extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
> #if VM_GROWSUP
> extern int expand_upwards(struct vm_area_struct *vma, unsigned long address);
> + #define expand_downwards(vma, address) do { } while (0)

I think this is wrong: doesn't the VM_GROWSUP case actually want
a real expand_downwards() in addition to expand_upwards()?

> #else
> #define expand_upwards(vma, address) do { } while (0)
> +extern int expand_downwards(struct vm_area_struct *vma, unsigned long address);
> #endif
> extern int expand_stack_downwards(struct vm_area_struct *vma,
> unsigned long address);

And if you're going for symmetry, wouldn't it be nice to add fs/exec.c
to the patch and remove this silly expand_stack_downwards() wrapper?

Hugh


> diff --git a/mm/memory.c b/mm/memory.c
> index ce22a25..f404fb6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2969,7 +2969,7 @@ static inline int check_stack_guard_page(struct vm_area_struct *vma, unsigned lo
> if (prev && prev->vm_end == address)
> return prev->vm_flags & VM_GROWSDOWN ? 0 : -ENOMEM;
>
> - expand_stack(vma, address - PAGE_SIZE);
> + expand_downwards(vma, address - PAGE_SIZE);
> }
> if ((vma->vm_flags & VM_GROWSUP) && address + PAGE_SIZE == vma->vm_end) {
> struct vm_area_struct *next = vma->vm_next;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index e27e0cf..6b2a817 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1782,7 +1782,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> /*
> * vma is the first one with address < vma->vm_start. Have to extend vma.
> */
> -static int expand_downwards(struct vm_area_struct *vma,
> +int expand_downwards(struct vm_area_struct *vma,
> unsigned long address)
> {
> int error;
> --
> 1.7.4.1

2011-04-18 10:01:38

by Michal Hocko

[permalink] [raw]
Subject: [PATCH v2] mm: make expand_downwards symmetrical to expand_upwards

On Sun 17-04-11 20:00:17, Hugh Dickins wrote:
> On Fri, 14 Apr 2011, Michal Hocko wrote:
[...]
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 692dbae..765cf4e 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1498,8 +1498,10 @@ unsigned long ra_submit(struct file_ra_state *ra,
> > extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
> > #if VM_GROWSUP
> > extern int expand_upwards(struct vm_area_struct *vma, unsigned long address);
> > + #define expand_downwards(vma, address) do { } while (0)
>
> I think this is wrong: doesn't the VM_GROWSUP case actually want
> a real expand_downwards() in addition to expand_upwards()?

Ahh, right you are. I haven't noticed that we have
expand_stack_downwards as well and this one is called from get_arg_page
if CONFIG_STACK_GROWSUP.

I am wondering why we do this just for CONFIG_STACK_GROWSUP. Do we
expand stack for GROWSDOWN automatically?

>
> > #else
> > #define expand_upwards(vma, address) do { } while (0)
> > +extern int expand_downwards(struct vm_area_struct *vma, unsigned long address);
> > #endif
> > extern int expand_stack_downwards(struct vm_area_struct *vma,
> > unsigned long address);
>
> And if you're going for symmetry, wouldn't it be nice to add fs/exec.c
> to the patch and remove this silly expand_stack_downwards() wrapper?

Sounds reasonable. Thanks for the review, Hugh. I am also thinking
whether expand_stack_{downwards,upwards} is more suitable name for those
functions as we are more explicit that this is stack related.

What about the updated patch bellow?

Changes since v1
- fixed expand_downwards case for CONFIG_STACK_GROWSUP in get_arg_page.
- rename expand_{downwards,upwards} -> expand_stack_{downwards,upwards}
---

>From f3adcf40518eaf7c4ee1cb10abd16b59785a66ba Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Fri, 15 Apr 2011 14:56:26 +0200
Subject: [PATCH] mm: make expand_downwards symmetrical to expand_upwards

Currently we have expand_upwards exported while expand_downwards is
accessible only via expand_stack or expand_stack_downwards.

check_stack_guard_page is a nice example of the asymmetry. It uses
expand_stack for VM_GROWSDOWN while expand_upwards is called for
VM_GROWSUP case.

Let's clean this up by exporting both functions and make those name
consistent. Let's use expand_stack_{upwards,downwards} so that we are
explicit about stack manipulation in the name. expand_stack_downwards
has to be defined for both CONFIG_STACK_GROWS{UP,DOWN} because
get_arg_page calls the downwards version in the early process
initialization phase for growsup configuration.

Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/mm.h | 13 ++++++++-----
mm/memory.c | 4 ++--
mm/mmap.c | 13 ++++---------
3 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 692dbae..17f9b86 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1494,15 +1494,18 @@ unsigned long ra_submit(struct file_ra_state *ra,
struct address_space *mapping,
struct file *filp);

-/* Do stack extension */
+/* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
+
+/* CONFIG_STACK_GROWSUP still needs to to grow downwards at some places */
+extern int expand_stack_downwards(struct vm_area_struct *vma,
+ unsigned long address);
#if VM_GROWSUP
-extern int expand_upwards(struct vm_area_struct *vma, unsigned long address);
+extern int expand_stack_upwards(struct vm_area_struct *vma,
+ unsigned long address);
#else
- #define expand_upwards(vma, address) do { } while (0)
+ #define expand_stack_upwards(vma, address) do { } while (0)
#endif
-extern int expand_stack_downwards(struct vm_area_struct *vma,
- unsigned long address);

/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr);
diff --git a/mm/memory.c b/mm/memory.c
index ce22a25..ba5b4d8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2969,7 +2969,7 @@ static inline int check_stack_guard_page(struct vm_area_struct *vma, unsigned lo
if (prev && prev->vm_end == address)
return prev->vm_flags & VM_GROWSDOWN ? 0 : -ENOMEM;

- expand_stack(vma, address - PAGE_SIZE);
+ expand_stack_downwards(vma, address - PAGE_SIZE);
}
if ((vma->vm_flags & VM_GROWSUP) && address + PAGE_SIZE == vma->vm_end) {
struct vm_area_struct *next = vma->vm_next;
@@ -2978,7 +2978,7 @@ static inline int check_stack_guard_page(struct vm_area_struct *vma, unsigned lo
if (next && next->vm_start == address + PAGE_SIZE)
return next->vm_flags & VM_GROWSUP ? 0 : -ENOMEM;

- expand_upwards(vma, address + PAGE_SIZE);
+ expand_stack_upwards(vma, address + PAGE_SIZE);
}
return 0;
}
diff --git a/mm/mmap.c b/mm/mmap.c
index e27e0cf..29c68b0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1731,7 +1731,7 @@ static int acct_stack_growth(struct vm_area_struct *vma, unsigned long size, uns
* PA-RISC uses this for its stack; IA64 for its Register Backing Store.
* vma is the last one with address > vma->vm_end. Have to extend vma.
*/
-int expand_upwards(struct vm_area_struct *vma, unsigned long address)
+int expand_stack_upwards(struct vm_area_struct *vma, unsigned long address)
{
int error;

@@ -1782,7 +1782,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
/*
* vma is the first one with address < vma->vm_start. Have to extend vma.
*/
-static int expand_downwards(struct vm_area_struct *vma,
+int expand_stack_downwards(struct vm_area_struct *vma,
unsigned long address)
{
int error;
@@ -1829,15 +1829,10 @@ static int expand_downwards(struct vm_area_struct *vma,
return error;
}

-int expand_stack_downwards(struct vm_area_struct *vma, unsigned long address)
-{
- return expand_downwards(vma, address);
-}
-
#ifdef CONFIG_STACK_GROWSUP
int expand_stack(struct vm_area_struct *vma, unsigned long address)
{
- return expand_upwards(vma, address);
+ return expand_stack_upwards(vma, address);
}

struct vm_area_struct *
@@ -1859,7 +1854,7 @@ find_extend_vma(struct mm_struct *mm, unsigned long addr)
#else
int expand_stack(struct vm_area_struct *vma, unsigned long address)
{
- return expand_downwards(vma, address);
+ return expand_stack_downwards(vma, address);
}

struct vm_area_struct *
--
1.7.4.1

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-04-18 20:57:19

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2] mm: make expand_downwards symmetrical to expand_upwards

On Mon, 18 Apr 2011 12:01:31 +0200
Michal Hocko <[email protected]> wrote:

> Currently we have expand_upwards exported while expand_downwards is
> accessible only via expand_stack or expand_stack_downwards.
>
> check_stack_guard_page is a nice example of the asymmetry. It uses
> expand_stack for VM_GROWSDOWN while expand_upwards is called for
> VM_GROWSUP case.
>
> Let's clean this up by exporting both functions and make those name
> consistent. Let's use expand_stack_{upwards,downwards} so that we are
> explicit about stack manipulation in the name. expand_stack_downwards
> has to be defined for both CONFIG_STACK_GROWS{UP,DOWN} because
> get_arg_page calls the downwards version in the early process
> initialization phase for growsup configuration.

Has this patch been tested on any stack-grows-upwards architecture?

2011-04-19 11:10:09

by Michal Hocko

[permalink] [raw]
Subject: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Mon 18-04-11 13:56:37, Andrew Morton wrote:
> On Mon, 18 Apr 2011 12:01:31 +0200
> Michal Hocko <[email protected]> wrote:
>
> > Currently we have expand_upwards exported while expand_downwards is
> > accessible only via expand_stack or expand_stack_downwards.
> >
> > check_stack_guard_page is a nice example of the asymmetry. It uses
> > expand_stack for VM_GROWSDOWN while expand_upwards is called for
> > VM_GROWSUP case.
> >
> > Let's clean this up by exporting both functions and make those name
> > consistent. Let's use expand_stack_{upwards,downwards} so that we are
> > explicit about stack manipulation in the name. expand_stack_downwards
> > has to be defined for both CONFIG_STACK_GROWS{UP,DOWN} because
> > get_arg_page calls the downwards version in the early process
> > initialization phase for growsup configuration.
>
> Has this patch been tested on any stack-grows-upwards architecture?

The only one I can find in the tree is parisc and I do not have access
to any such machine. Maybe someone on the list (CCed) can help with
testing the patch bellow? Nevertheless, the patch doesn't change growsup
case. It just renames functions and exports growsdown.

IA64 which grows upwards only for registers still needs a fix because of
the rename, though. I'm sorry, I must have missed it in the grep output
before. No other arch specific code uses expand_{down,up}wards directly.

Changes since v2
- fix compilation error on ia64
Changes since v1
- fixed expand_downwards case for CONFIG_STACK_GROWSUP in get_arg_page.
- rename expand_{downwards,upwards} -> expand_stack_{downwards,upwards}
---
>From 091cf27fe9fad875bc7f6cdbb8206c617b06fc7b Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Fri, 15 Apr 2011 14:56:26 +0200
Subject: [PATCH] mm: make expand_downwards symmetrical to expand_upwards

Currently we have expand_upwards exported while expand_downwards is
accessible only via expand_stack or expand_stack_downwards.

check_stack_guard_page is a nice example of the asymmetry. It uses
expand_stack for VM_GROWSDOWN while expand_upwards is called for
VM_GROWSUP case.

Let's clean this up by exporting both functions and make those name
consistent. Let's use expand_stack_{upwards,downwards} so that we are
explicit about stack manipulation in the name. expand_stack_downwards
has to be defined for both CONFIG_STACK_GROWS{UP,DOWN} because
get_arg_page calls the downwards version in the early process
initialization phase for growsup configuration.

Signed-off-by: Michal Hocko <[email protected]>
---
arch/ia64/mm/fault.c | 2 +-
include/linux/mm.h | 13 ++++++++-----
mm/memory.c | 4 ++--
mm/mmap.c | 13 ++++---------
4 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 0799fea..aebff8a 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -197,7 +197,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
*/
if (address > vma->vm_end + PAGE_SIZE - sizeof(long))
goto bad_area;
- if (expand_upwards(vma, address))
+ if (expand_stack_upwards(vma, address))
goto bad_area;
}
goto good_area;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 692dbae..17f9b86 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1494,15 +1494,18 @@ unsigned long ra_submit(struct file_ra_state *ra,
struct address_space *mapping,
struct file *filp);

-/* Do stack extension */
+/* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
+
+/* CONFIG_STACK_GROWSUP still needs to to grow downwards at some places */
+extern int expand_stack_downwards(struct vm_area_struct *vma,
+ unsigned long address);
#if VM_GROWSUP
-extern int expand_upwards(struct vm_area_struct *vma, unsigned long address);
+extern int expand_stack_upwards(struct vm_area_struct *vma,
+ unsigned long address);
#else
- #define expand_upwards(vma, address) do { } while (0)
+ #define expand_stack_upwards(vma, address) do { } while (0)
#endif
-extern int expand_stack_downwards(struct vm_area_struct *vma,
- unsigned long address);

/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr);
diff --git a/mm/memory.c b/mm/memory.c
index ce22a25..ba5b4d8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2969,7 +2969,7 @@ static inline int check_stack_guard_page(struct vm_area_struct *vma, unsigned lo
if (prev && prev->vm_end == address)
return prev->vm_flags & VM_GROWSDOWN ? 0 : -ENOMEM;

- expand_stack(vma, address - PAGE_SIZE);
+ expand_stack_downwards(vma, address - PAGE_SIZE);
}
if ((vma->vm_flags & VM_GROWSUP) && address + PAGE_SIZE == vma->vm_end) {
struct vm_area_struct *next = vma->vm_next;
@@ -2978,7 +2978,7 @@ static inline int check_stack_guard_page(struct vm_area_struct *vma, unsigned lo
if (next && next->vm_start == address + PAGE_SIZE)
return next->vm_flags & VM_GROWSUP ? 0 : -ENOMEM;

- expand_upwards(vma, address + PAGE_SIZE);
+ expand_stack_upwards(vma, address + PAGE_SIZE);
}
return 0;
}
diff --git a/mm/mmap.c b/mm/mmap.c
index e27e0cf..29c68b0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1731,7 +1731,7 @@ static int acct_stack_growth(struct vm_area_struct *vma, unsigned long size, uns
* PA-RISC uses this for its stack; IA64 for its Register Backing Store.
* vma is the last one with address > vma->vm_end. Have to extend vma.
*/
-int expand_upwards(struct vm_area_struct *vma, unsigned long address)
+int expand_stack_upwards(struct vm_area_struct *vma, unsigned long address)
{
int error;

@@ -1782,7 +1782,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
/*
* vma is the first one with address < vma->vm_start. Have to extend vma.
*/
-static int expand_downwards(struct vm_area_struct *vma,
+int expand_stack_downwards(struct vm_area_struct *vma,
unsigned long address)
{
int error;
@@ -1829,15 +1829,10 @@ static int expand_downwards(struct vm_area_struct *vma,
return error;
}

-int expand_stack_downwards(struct vm_area_struct *vma, unsigned long address)
-{
- return expand_downwards(vma, address);
-}
-
#ifdef CONFIG_STACK_GROWSUP
int expand_stack(struct vm_area_struct *vma, unsigned long address)
{
- return expand_upwards(vma, address);
+ return expand_stack_upwards(vma, address);
}

struct vm_area_struct *
@@ -1859,7 +1854,7 @@ find_extend_vma(struct mm_struct *mm, unsigned long addr)
#else
int expand_stack(struct vm_area_struct *vma, unsigned long address)
{
- return expand_downwards(vma, address);
+ return expand_stack_downwards(vma, address);
}

struct vm_area_struct *
--
1.7.4.1

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2011-04-19 15:46:54

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Tue, 2011-04-19 at 13:10 +0200, Michal Hocko wrote:
> On Mon 18-04-11 13:56:37, Andrew Morton wrote:
> > On Mon, 18 Apr 2011 12:01:31 +0200
> > Michal Hocko <[email protected]> wrote:
> >
> > > Currently we have expand_upwards exported while expand_downwards is
> > > accessible only via expand_stack or expand_stack_downwards.
> > >
> > > check_stack_guard_page is a nice example of the asymmetry. It uses
> > > expand_stack for VM_GROWSDOWN while expand_upwards is called for
> > > VM_GROWSUP case.
> > >
> > > Let's clean this up by exporting both functions and make those name
> > > consistent. Let's use expand_stack_{upwards,downwards} so that we are
> > > explicit about stack manipulation in the name. expand_stack_downwards
> > > has to be defined for both CONFIG_STACK_GROWS{UP,DOWN} because
> > > get_arg_page calls the downwards version in the early process
> > > initialization phase for growsup configuration.
> >
> > Has this patch been tested on any stack-grows-upwards architecture?
>
> The only one I can find in the tree is parisc and I do not have access
> to any such machine. Maybe someone on the list (CCed) can help with
> testing the patch bellow? Nevertheless, the patch doesn't change growsup
> case. It just renames functions and exports growsdown.

It compiles OK, but crashes on boot in fsck. The crash is definitely mm
but looks to be a slab problem (it's a null deref on a spinlock in
add_partial(), which seems unrelated to this patch).

[ 15.628000] sd 1:0:2:0: [sdc] Attached SCSI disk
done.
[ 16.632000] EXT3-fs: barriers not enabled
[ 16.640000] kjournald starting. Commit interval 5 seconds
[ 16.640000] EXT3-fs (sda3): mounted filesystem with ordered data mode
Begin: Running /scripts/local-bottom ... done.
done.
Begin: Running /scripts/init-bottom ... done.
INIT: version 2.88 booting
Setting hostname to 'ion'...done.
Starting the hotplug events dispatcher: udevd[ 22.008000] udev[211]: starting version 164
.
Synthesizing the initial hotplug events...done.
Waiting for /dev to be fully populated...done.
Activating swap:swapon on /dev/sda2
swapon: /dev/sda2: found swap signature: version 1, page-size 4, same byte order
swapon: /dev/sda2: pagesize=4096, swapsize=1028157440, devsize=1028160000
[ 28.780000] Adding 1004056k swap on /dev/sda2. Priority:-1 extents:1 across:1004056k
.
Will now check root file system:fsck from util-linux-ng 2.17.2
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a -C0 /dev/sda3
/dev/sda3 has been mounted 37 times without being checked, check forced.
[ 257.192000] Backtrace:=========== \ 42.8%
[ 257.192000] [<0000000040214f78>] add_partial+0x28/0x98
[ 257.192000] [<0000000040217ff8>] __slab_free+0x1d0/0x1d8
[ 257.192000] [<000000004021825c>] kmem_cache_free+0xc4/0x128
[ 257.192000] [<00000000401fd1a4>] remove_vma+0x8c/0xc0
[ 257.192000] [<00000000401fd3a8>] exit_mmap+0x1d0/0x220
[ 257.192000] [<0000000040156514>] mmput+0xd4/0x200
[ 257.192000] [<000000004015c7b8>] exit_mm+0x100/0x2c0
[ 257.192000] [<000000004015ef40>] do_exit+0x778/0x9d8
[ 257.192000] [<000000004015f1ec>] do_group_exit+0x4c/0xe0
[ 257.192000] [<000000004015f298>] sys_exit_group+0x18/0x28
[ 257.192000] [<0000000040106034>] syscall_exit+0x0/0x14
[ 257.192000]
[ 257.192000]
[ 257.192000] Kernel Fault: Code=26 regs=00000040bf1807d0 (Addr=0000000000000000)
[ 257.192000]
[ 257.192000] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
[ 257.192000] PSW: 00001000000001001111000000001110 Not tainted
[ 257.192000] r00-03 000000ff0804f00e 0000000040769e40 0000000040214f78 0000000000000000
[ 257.192000] r04-07 0000000040746e40 0000000000000001 0000004080ded370 0000000000000001
[ 257.192000] r08-11 0000000040654150 0000000000000000 0000000000000001 0000000000000001
[ 257.192000] r12-15 0000000000000000 00000000ffffffff 0000000000000024 0000000000000000
[ 257.192000] r16-19 00000000fb4ead9c 00000000fb4eac54 0000000000000000 0000000000000000
[ 257.192000] r20-23 000000000800000e 0000000000000001 000000007bbb7180 00000000401fd1a4
[ 257.192000] r24-27 0000000000000001 0000004080ded370 0000000000000000 0000000040746e40
[ 257.192000] r28-31 000000007ec0a908 00000040bf1807a0 00000040bf1807d0 0000000000000016
[ 257.192000] sr00-03 00000000002d9000 0000000000000000 0000000000000000 00000000002d9000
[ 257.192000] sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 257.192000]
[ 257.192000] IASQ: 0000000000000000 0000000000000000 IAOQ: 000000004011bbc0 000000004011bbc4
[ 257.192000] IIR: 0f4015dc ISR: 0000000000000000 IOR: 0000000000000000
[ 257.192000] CPU: 0 CR30: 00000040bf180000 CR31: fffffff0f0e098e0
[ 257.192000] ORIG_R28: 0000000040769e40
[ 257.192000] IAOQ[0]: _raw_spin_lock+0x0/0x20
[ 257.192000] IAOQ[1]: _raw_spin_lock+0x4/0x20
[ 257.192000] RP(r2): add_partial+0x28/0x98
[ 257.192000] Backtrace:
[ 257.192000] [<0000000040214f78>] add_partial+0x28/0x98
[ 257.192000] [<0000000040217ff8>] __slab_free+0x1d0/0x1d8
[ 257.192000] [<000000004021825c>] kmem_cache_free+0xc4/0x128
[ 257.192000] [<00000000401fd1a4>] remove_vma+0x8c/0xc0
[ 257.192000] [<00000000401fd3a8>] exit_mmap+0x1d0/0x220
[ 257.192000] [<0000000040156514>] mmput+0xd4/0x200
[ 257.192000] [<000000004015c7b8>] exit_mm+0x100/0x2c0
[ 257.192000] [<000000004015ef40>] do_exit+0x778/0x9d8
[ 257.192000] [<000000004015f1ec>] do_group_exit+0x4c/0xe0
[ 257.192000] [<000000004015f298>] sys_exit_group+0x18/0x28
[ 257.192000] [<0000000040106034>] syscall_exit+0x0/0x14
[ 257.192000]
[ 257.192000] Kernel panic - not syncing: Kernel Fault
[ 257.192000] Backtrace:
[ 257.192000] [<000000004011f984>] show_stack+0x14/0x20
[ 257.192000] [<000000004011f9a8>] dump_stack+0x18/0x28
[ 257.192000] [<000000004015946c>] panic+0xd4/0x368
[ 257.192000] [<0000000040120024>] parisc_terminate+0x14c/0x170
[ 257.192000] [<000000004012059c>] handle_interruption+0x2ac/0x8f8
[ 257.192000] [<000000004011bbc0>] _raw_spin_lock+0x0/0x20
[ 257.192000]
[ 257.192000] Rebooting in 5 seconds..

It seems to be a random intermittent mm crash because the next reboot
crashed with the same trace but after the fsck had completed and the
third came up to the login prompt.

James

2011-04-19 16:07:38

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Tue, 2011-04-19 at 10:46 -0500, James Bottomley wrote:
> On Tue, 2011-04-19 at 13:10 +0200, Michal Hocko wrote:
> > On Mon 18-04-11 13:56:37, Andrew Morton wrote:
> > > On Mon, 18 Apr 2011 12:01:31 +0200
> > > Michal Hocko <[email protected]> wrote:
> > >
> > > > Currently we have expand_upwards exported while expand_downwards is
> > > > accessible only via expand_stack or expand_stack_downwards.
> > > >
> > > > check_stack_guard_page is a nice example of the asymmetry. It uses
> > > > expand_stack for VM_GROWSDOWN while expand_upwards is called for
> > > > VM_GROWSUP case.
> > > >
> > > > Let's clean this up by exporting both functions and make those name
> > > > consistent. Let's use expand_stack_{upwards,downwards} so that we are
> > > > explicit about stack manipulation in the name. expand_stack_downwards
> > > > has to be defined for both CONFIG_STACK_GROWS{UP,DOWN} because
> > > > get_arg_page calls the downwards version in the early process
> > > > initialization phase for growsup configuration.
> > >
> > > Has this patch been tested on any stack-grows-upwards architecture?
> >
> > The only one I can find in the tree is parisc and I do not have access
> > to any such machine. Maybe someone on the list (CCed) can help with
> > testing the patch bellow? Nevertheless, the patch doesn't change growsup
> > case. It just renames functions and exports growsdown.
>
> It compiles OK, but crashes on boot in fsck. The crash is definitely mm
> but looks to be a slab problem (it's a null deref on a spinlock in
> add_partial(), which seems unrelated to this patch).
>
> [ 15.628000] sd 1:0:2:0: [sdc] Attached SCSI disk
> done.
> [ 16.632000] EXT3-fs: barriers not enabled
> [ 16.640000] kjournald starting. Commit interval 5 seconds
> [ 16.640000] EXT3-fs (sda3): mounted filesystem with ordered data mode
> Begin: Running /scripts/local-bottom ... done.
> done.
> Begin: Running /scripts/init-bottom ... done.
> INIT: version 2.88 booting
> Setting hostname to 'ion'...done.
> Starting the hotplug events dispatcher: udevd[ 22.008000] udev[211]: starting version 164
> .
> Synthesizing the initial hotplug events...done.
> Waiting for /dev to be fully populated...done.
> Activating swap:swapon on /dev/sda2
> swapon: /dev/sda2: found swap signature: version 1, page-size 4, same byte order
> swapon: /dev/sda2: pagesize=4096, swapsize=1028157440, devsize=1028160000
> [ 28.780000] Adding 1004056k swap on /dev/sda2. Priority:-1 extents:1 across:1004056k
> .
> Will now check root file system:fsck from util-linux-ng 2.17.2
> [/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a -C0 /dev/sda3
> /dev/sda3 has been mounted 37 times without being checked, check forced.
> [ 257.192000] Backtrace:=========== \ 42.8%
> [ 257.192000] [<0000000040214f78>] add_partial+0x28/0x98
> [ 257.192000] [<0000000040217ff8>] __slab_free+0x1d0/0x1d8
> [ 257.192000] [<000000004021825c>] kmem_cache_free+0xc4/0x128
> [ 257.192000] [<00000000401fd1a4>] remove_vma+0x8c/0xc0
> [ 257.192000] [<00000000401fd3a8>] exit_mmap+0x1d0/0x220
> [ 257.192000] [<0000000040156514>] mmput+0xd4/0x200
> [ 257.192000] [<000000004015c7b8>] exit_mm+0x100/0x2c0
> [ 257.192000] [<000000004015ef40>] do_exit+0x778/0x9d8
> [ 257.192000] [<000000004015f1ec>] do_group_exit+0x4c/0xe0
> [ 257.192000] [<000000004015f298>] sys_exit_group+0x18/0x28
> [ 257.192000] [<0000000040106034>] syscall_exit+0x0/0x14
> [ 257.192000]
> [ 257.192000]
> [ 257.192000] Kernel Fault: Code=26 regs=00000040bf1807d0 (Addr=0000000000000000)
> [ 257.192000]
> [ 257.192000] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
> [ 257.192000] PSW: 00001000000001001111000000001110 Not tainted
> [ 257.192000] r00-03 000000ff0804f00e 0000000040769e40 0000000040214f78 0000000000000000
> [ 257.192000] r04-07 0000000040746e40 0000000000000001 0000004080ded370 0000000000000001
> [ 257.192000] r08-11 0000000040654150 0000000000000000 0000000000000001 0000000000000001
> [ 257.192000] r12-15 0000000000000000 00000000ffffffff 0000000000000024 0000000000000000
> [ 257.192000] r16-19 00000000fb4ead9c 00000000fb4eac54 0000000000000000 0000000000000000
> [ 257.192000] r20-23 000000000800000e 0000000000000001 000000007bbb7180 00000000401fd1a4
> [ 257.192000] r24-27 0000000000000001 0000004080ded370 0000000000000000 0000000040746e40
> [ 257.192000] r28-31 000000007ec0a908 00000040bf1807a0 00000040bf1807d0 0000000000000016
> [ 257.192000] sr00-03 00000000002d9000 0000000000000000 0000000000000000 00000000002d9000
> [ 257.192000] sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [ 257.192000]
> [ 257.192000] IASQ: 0000000000000000 0000000000000000 IAOQ: 000000004011bbc0 000000004011bbc4
> [ 257.192000] IIR: 0f4015dc ISR: 0000000000000000 IOR: 0000000000000000
> [ 257.192000] CPU: 0 CR30: 00000040bf180000 CR31: fffffff0f0e098e0
> [ 257.192000] ORIG_R28: 0000000040769e40
> [ 257.192000] IAOQ[0]: _raw_spin_lock+0x0/0x20
> [ 257.192000] IAOQ[1]: _raw_spin_lock+0x4/0x20
> [ 257.192000] RP(r2): add_partial+0x28/0x98
> [ 257.192000] Backtrace:
> [ 257.192000] [<0000000040214f78>] add_partial+0x28/0x98
> [ 257.192000] [<0000000040217ff8>] __slab_free+0x1d0/0x1d8
> [ 257.192000] [<000000004021825c>] kmem_cache_free+0xc4/0x128
> [ 257.192000] [<00000000401fd1a4>] remove_vma+0x8c/0xc0
> [ 257.192000] [<00000000401fd3a8>] exit_mmap+0x1d0/0x220
> [ 257.192000] [<0000000040156514>] mmput+0xd4/0x200
> [ 257.192000] [<000000004015c7b8>] exit_mm+0x100/0x2c0
> [ 257.192000] [<000000004015ef40>] do_exit+0x778/0x9d8
> [ 257.192000] [<000000004015f1ec>] do_group_exit+0x4c/0xe0
> [ 257.192000] [<000000004015f298>] sys_exit_group+0x18/0x28
> [ 257.192000] [<0000000040106034>] syscall_exit+0x0/0x14
> [ 257.192000]
> [ 257.192000] Kernel panic - not syncing: Kernel Fault
> [ 257.192000] Backtrace:
> [ 257.192000] [<000000004011f984>] show_stack+0x14/0x20
> [ 257.192000] [<000000004011f9a8>] dump_stack+0x18/0x28
> [ 257.192000] [<000000004015946c>] panic+0xd4/0x368
> [ 257.192000] [<0000000040120024>] parisc_terminate+0x14c/0x170
> [ 257.192000] [<000000004012059c>] handle_interruption+0x2ac/0x8f8
> [ 257.192000] [<000000004011bbc0>] _raw_spin_lock+0x0/0x20
> [ 257.192000]
> [ 257.192000] Rebooting in 5 seconds..
>
> It seems to be a random intermittent mm crash because the next reboot
> crashed with the same trace but after the fsck had completed and the
> third came up to the login prompt.

I should add that this crash is with CONFIG_SLUB ... do you want me to
retry with CONFIG_SLAB?

James

2011-04-19 16:14:24

by John David Anglin

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to

> It compiles OK, but crashes on boot in fsck. The crash is definitely mm
> but looks to be a slab problem (it's a null deref on a spinlock in
> add_partial(), which seems unrelated to this patch).

I had a somewhat similar crash Sunday with the "debian" config building GCC.
This is with 2.6.39-rc3+ without the mm patch:

mx3210 login: [12244.664000] Backtrace:
[12244.664000] [<000000004020c9a0>] __slab_free+0x100/0x200
[12244.664000] [<000000004020d23c>] kmem_cache_free+0xf4/0x108
[12244.668000] [<000000001c7e9efc>] __journal_remove_journal_head+0x214/0x248 [
jbd]
[12244.668000] [<000000001c7edd48>] journal_put_journal_head+0xc8/0x168 [jbd]
[12244.668000] [<000000001c7e158c>] journal_invalidatepage+0x45c/0x710 [jbd]
[12244.672000] [<000000001c8652d8>] ext3_invalidatepage+0x88/0xe8 [ext3]
[12244.672000] [<00000000401d76f4>] do_invalidatepage+0x34/0x40
[12244.672000] [<00000000401d7770>] truncate_inode_page+0x70/0x178
[12244.676000] [<00000000401d798c>] truncate_inode_pages_range+0x114/0x518
[12244.676000] [<00000000401d7da4>] truncate_inode_pages+0x14/0x20
[12244.680000] [<000000001c86ac1c>] ext3_evict_inode+0x64/0x2a8 [ext3]
[12244.680000] [<0000000040234424>] evict+0xac/0x1c8

[12244.692000] PSW: 00001000000001101111001100001110 Not tainted
[12244.692000] r00-03 000000ff0806f30e 0000000040745e50 000000004020c9a0 000000
01453d6000
[12244.692000] r04-07 0000000040722e50 0000000000000000 00000002bf400000 000000
001c7dc000
[12244.696000] r08-11 0000000000000002 0000000040630298 0000000000000001 000000
007bba4940
[12244.696000] r12-15 00000002bf400000 0000000000000001 0000000000000000 000000
0000000001
[12244.700000] r16-19 000000007e71d888 000000007e71d888 0000000000001000 200000
0000000081
[12244.700000] r20-23 0000000000000000 0000000040630290 0000000000000001 000000
001c7e9efc
[12244.704000] r24-27 000000000800000e 00000001453d6000 0000000000000000 000000
0040722e50
[12244.704000] r28-31 0000000000000001 000000007bba4b70 000000007bba4ba0 000000
0000000023
[12244.708000] sr00-03 000000000556c000 000000000556c000 0000000000000000 000000000556c000
[12244.708000] sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000

[12244.712000]
[12244.712000] IASQ: 0000000000000000 0000000000000000 IAOQ: 000000004011b240 000000004011b244
[12244.712000] IIR: 0f4015dc ISR: 0000000000000000 IOR: 0000000000000000
[12244.716000] CPU: 3 CR30: 000000007bba4000 CR31: ffffffffffffffff
[12244.716000] ORIG_R28: 0000000000000001
[12244.720000] IAOQ[0]: _raw_spin_lock+0x10/0x20
[12244.720000] IAOQ[1]: _raw_spin_lock+0x14/0x20
[12244.720000] RP(r2): __slab_free+0x100/0x200

Dave
--
J. David Anglin [email protected]
National Research Council of Canada (613) 990-0752 (FAX: 952-6602)

2011-04-19 16:59:13

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to

On Tue, 2011-04-19 at 12:06 -0400, John David Anglin wrote:
> > It compiles OK, but crashes on boot in fsck. The crash is definitely mm
> > but looks to be a slab problem (it's a null deref on a spinlock in
> > add_partial(), which seems unrelated to this patch).
>
> I had a somewhat similar crash Sunday with the "debian" config building GCC.
> This is with 2.6.39-rc3+ without the mm patch:
>
> mx3210 login: [12244.664000] Backtrace:
> [12244.664000] [<000000004020c9a0>] __slab_free+0x100/0x200
> [12244.664000] [<000000004020d23c>] kmem_cache_free+0xf4/0x108
> [12244.668000] [<000000001c7e9efc>] __journal_remove_journal_head+0x214/0x248 [
> jbd]
> [12244.668000] [<000000001c7edd48>] journal_put_journal_head+0xc8/0x168 [jbd]
> [12244.668000] [<000000001c7e158c>] journal_invalidatepage+0x45c/0x710 [jbd]
> [12244.672000] [<000000001c8652d8>] ext3_invalidatepage+0x88/0xe8 [ext3]
> [12244.672000] [<00000000401d76f4>] do_invalidatepage+0x34/0x40
> [12244.672000] [<00000000401d7770>] truncate_inode_page+0x70/0x178
> [12244.676000] [<00000000401d798c>] truncate_inode_pages_range+0x114/0x518
> [12244.676000] [<00000000401d7da4>] truncate_inode_pages+0x14/0x20
> [12244.680000] [<000000001c86ac1c>] ext3_evict_inode+0x64/0x2a8 [ext3]
> [12244.680000] [<0000000040234424>] evict+0xac/0x1c8
>
> [12244.692000] PSW: 00001000000001101111001100001110 Not tainted
> [12244.692000] r00-03 000000ff0806f30e 0000000040745e50 000000004020c9a0 000000
> 01453d6000
> [12244.692000] r04-07 0000000040722e50 0000000000000000 00000002bf400000 000000
> 001c7dc000
> [12244.696000] r08-11 0000000000000002 0000000040630298 0000000000000001 000000
> 007bba4940
> [12244.696000] r12-15 00000002bf400000 0000000000000001 0000000000000000 000000
> 0000000001
> [12244.700000] r16-19 000000007e71d888 000000007e71d888 0000000000001000 200000
> 0000000081
> [12244.700000] r20-23 0000000000000000 0000000040630290 0000000000000001 000000
> 001c7e9efc
> [12244.704000] r24-27 000000000800000e 00000001453d6000 0000000000000000 000000
> 0040722e50
> [12244.704000] r28-31 0000000000000001 000000007bba4b70 000000007bba4ba0 000000
> 0000000023
> [12244.708000] sr00-03 000000000556c000 000000000556c000 0000000000000000 000000000556c000
> [12244.708000] sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>
> [12244.712000]
> [12244.712000] IASQ: 0000000000000000 0000000000000000 IAOQ: 000000004011b240 000000004011b244
> [12244.712000] IIR: 0f4015dc ISR: 0000000000000000 IOR: 0000000000000000
> [12244.716000] CPU: 3 CR30: 000000007bba4000 CR31: ffffffffffffffff
> [12244.716000] ORIG_R28: 0000000000000001
> [12244.720000] IAOQ[0]: _raw_spin_lock+0x10/0x20
> [12244.720000] IAOQ[1]: _raw_spin_lock+0x14/0x20
> [12244.720000] RP(r2): __slab_free+0x100/0x200

Yes, it's the same crash. Apparently get_node() is returning NULL for
some reason.

James

2011-04-19 17:05:30

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Tue, Apr 19, 2011 at 6:46 PM, James Bottomley
<[email protected]> wrote:
> It compiles OK, but crashes on boot in fsck. ?The crash is definitely mm
> but looks to be a slab problem (it's a null deref on a spinlock in
> add_partial(), which seems unrelated to this patch).
>
> [ ? 15.628000] sd 1:0:2:0: [sdc] Attached SCSI disk
> done.
> [ ? 16.632000] EXT3-fs: barriers not enabled
> [ ? 16.640000] kjournald starting. ?Commit interval 5 seconds
> [ ? 16.640000] EXT3-fs (sda3): mounted filesystem with ordered data mode
> Begin: Running /scripts/local-bottom ... done.
> done.
> Begin: Running /scripts/init-bottom ... done.
> INIT: version 2.88 booting
> Setting hostname to 'ion'...done.
> Starting the hotplug events dispatcher: udevd[ ? 22.008000] udev[211]: starting version 164
> .
> Synthesizing the initial hotplug events...done.
> Waiting for /dev to be fully populated...done.
> Activating swap:swapon on /dev/sda2
> swapon: /dev/sda2: found swap signature: version 1, page-size 4, same byte order
> swapon: /dev/sda2: pagesize=4096, swapsize=1028157440, devsize=1028160000
> [ ? 28.780000] Adding 1004056k swap on /dev/sda2. ?Priority:-1 extents:1 across:1004056k
> .
> Will now check root file system:fsck from util-linux-ng 2.17.2
> [/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a -C0 /dev/sda3
> /dev/sda3 has been mounted 37 times without being checked, check forced.
> [ ?257.192000] Backtrace:=========== ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\ 42.8%
> [ ?257.192000] ?[<0000000040214f78>] add_partial+0x28/0x98
> [ ?257.192000] ?[<0000000040217ff8>] __slab_free+0x1d0/0x1d8
> [ ?257.192000] ?[<000000004021825c>] kmem_cache_free+0xc4/0x128
> [ ?257.192000] ?[<00000000401fd1a4>] remove_vma+0x8c/0xc0
> [ ?257.192000] ?[<00000000401fd3a8>] exit_mmap+0x1d0/0x220
> [ ?257.192000] ?[<0000000040156514>] mmput+0xd4/0x200
> [ ?257.192000] ?[<000000004015c7b8>] exit_mm+0x100/0x2c0
> [ ?257.192000] ?[<000000004015ef40>] do_exit+0x778/0x9d8
> [ ?257.192000] ?[<000000004015f1ec>] do_group_exit+0x4c/0xe0
> [ ?257.192000] ?[<000000004015f298>] sys_exit_group+0x18/0x28
> [ ?257.192000] ?[<0000000040106034>] syscall_exit+0x0/0x14
> [ ?257.192000]
> [ ?257.192000]
> [ ?257.192000] Kernel Fault: Code=26 regs=00000040bf1807d0 (Addr=0000000000000000)
> [ ?257.192000]
> [ ?257.192000] ? ? ?YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
> [ ?257.192000] PSW: 00001000000001001111000000001110 Not tainted
> [ ?257.192000] r00-03 ?000000ff0804f00e 0000000040769e40 0000000040214f78 0000000000000000
> [ ?257.192000] r04-07 ?0000000040746e40 0000000000000001 0000004080ded370 0000000000000001
> [ ?257.192000] r08-11 ?0000000040654150 0000000000000000 0000000000000001 0000000000000001
> [ ?257.192000] r12-15 ?0000000000000000 00000000ffffffff 0000000000000024 0000000000000000
> [ ?257.192000] r16-19 ?00000000fb4ead9c 00000000fb4eac54 0000000000000000 0000000000000000
> [ ?257.192000] r20-23 ?000000000800000e 0000000000000001 000000007bbb7180 00000000401fd1a4
> [ ?257.192000] r24-27 ?0000000000000001 0000004080ded370 0000000000000000 0000000040746e40
> [ ?257.192000] r28-31 ?000000007ec0a908 00000040bf1807a0 00000040bf1807d0 0000000000000016
> [ ?257.192000] sr00-03 ?00000000002d9000 0000000000000000 0000000000000000 00000000002d9000
> [ ?257.192000] sr04-07 ?0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [ ?257.192000]
> [ ?257.192000] IASQ: 0000000000000000 0000000000000000 IAOQ: 000000004011bbc0 000000004011bbc4
> [ ?257.192000] ?IIR: 0f4015dc ? ?ISR: 0000000000000000 ?IOR: 0000000000000000
> [ ?257.192000] ?CPU: ? ? ? ?0 ? CR30: 00000040bf180000 CR31: fffffff0f0e098e0
> [ ?257.192000] ?ORIG_R28: 0000000040769e40
> [ ?257.192000] ?IAOQ[0]: _raw_spin_lock+0x0/0x20
> [ ?257.192000] ?IAOQ[1]: _raw_spin_lock+0x4/0x20
> [ ?257.192000] ?RP(r2): add_partial+0x28/0x98
> [ ?257.192000] Backtrace:
> [ ?257.192000] ?[<0000000040214f78>] add_partial+0x28/0x98
> [ ?257.192000] ?[<0000000040217ff8>] __slab_free+0x1d0/0x1d8
> [ ?257.192000] ?[<000000004021825c>] kmem_cache_free+0xc4/0x128
> [ ?257.192000] ?[<00000000401fd1a4>] remove_vma+0x8c/0xc0
> [ ?257.192000] ?[<00000000401fd3a8>] exit_mmap+0x1d0/0x220
> [ ?257.192000] ?[<0000000040156514>] mmput+0xd4/0x200
> [ ?257.192000] ?[<000000004015c7b8>] exit_mm+0x100/0x2c0
> [ ?257.192000] ?[<000000004015ef40>] do_exit+0x778/0x9d8
> [ ?257.192000] ?[<000000004015f1ec>] do_group_exit+0x4c/0xe0
> [ ?257.192000] ?[<000000004015f298>] sys_exit_group+0x18/0x28
> [ ?257.192000] ?[<0000000040106034>] syscall_exit+0x0/0x14
> [ ?257.192000]
> [ ?257.192000] Kernel panic - not syncing: Kernel Fault
> [ ?257.192000] Backtrace:
> [ ?257.192000] ?[<000000004011f984>] show_stack+0x14/0x20
> [ ?257.192000] ?[<000000004011f9a8>] dump_stack+0x18/0x28
> [ ?257.192000] ?[<000000004015946c>] panic+0xd4/0x368
> [ ?257.192000] ?[<0000000040120024>] parisc_terminate+0x14c/0x170
> [ ?257.192000] ?[<000000004012059c>] handle_interruption+0x2ac/0x8f8
> [ ?257.192000] ?[<000000004011bbc0>] _raw_spin_lock+0x0/0x20
> [ ?257.192000]
> [ ?257.192000] Rebooting in 5 seconds..
>
> It seems to be a random intermittent mm crash because the next reboot
> crashed with the same trace but after the fsck had completed and the
> third came up to the login prompt.

Looks like a genuine SLUB problem on parisc. Christoph?

2011-04-19 17:11:32

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Tue, 2011-04-19 at 20:05 +0300, Pekka Enberg wrote:
> > It seems to be a random intermittent mm crash because the next reboot
> > crashed with the same trace but after the fsck had completed and the
> > third came up to the login prompt.
>
> Looks like a genuine SLUB problem on parisc. Christoph?

Looking through the slub code, it seems to be making invalid
assumptions. All of the node stuff is dependent on CONFIG_NUMA.
However, we're CONFIG_DISCONTIGMEM (with CONFIG_NUMA not set): on the
machines I and Dave Anglin have, our physical memory ranges are 0-1GB
and 64-65GB, so I think slub crashes when we get a page from the high
memory range ... because it's not expecting a non-zero node number.

James

Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Tue, 19 Apr 2011, Pekka Enberg wrote:

> On Tue, Apr 19, 2011 at 6:46 PM, James Bottomley
> <[email protected]> wrote:
> > It compiles OK, but crashes on boot in fsck. ?The crash is definitely mm
> > but looks to be a slab problem (it's a null deref on a spinlock in
> > add_partial(), which seems unrelated to this patch).

That means that the per node structures have not been setup yet. Node
hotplug not working?

> > It seems to be a random intermittent mm crash because the next reboot
> > crashed with the same trace but after the fsck had completed and the
> > third came up to the login prompt.
>
> Looks like a genuine SLUB problem on parisc. Christoph?

Race between node hotplug and use of the slab on that node?

Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Tue, 19 Apr 2011, James Bottomley wrote:

> On Tue, 2011-04-19 at 20:05 +0300, Pekka Enberg wrote:
> > > It seems to be a random intermittent mm crash because the next reboot
> > > crashed with the same trace but after the fsck had completed and the
> > > third came up to the login prompt.
> >
> > Looks like a genuine SLUB problem on parisc. Christoph?
>
> Looking through the slub code, it seems to be making invalid
> assumptions. All of the node stuff is dependent on CONFIG_NUMA.
> However, we're CONFIG_DISCONTIGMEM (with CONFIG_NUMA not set): on the
> machines I and Dave Anglin have, our physical memory ranges are 0-1GB
> and 64-65GB, so I think slub crashes when we get a page from the high
> memory range ... because it's not expecting a non-zero node number.

Right !NUMA systems only have node 0.

2011-04-19 17:48:30

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Tue, 2011-04-19 at 12:15 -0500, Christoph Lameter wrote:
> On Tue, 19 Apr 2011, James Bottomley wrote:
>
> > On Tue, 2011-04-19 at 20:05 +0300, Pekka Enberg wrote:
> > > > It seems to be a random intermittent mm crash because the next reboot
> > > > crashed with the same trace but after the fsck had completed and the
> > > > third came up to the login prompt.
> > >
> > > Looks like a genuine SLUB problem on parisc. Christoph?
> >
> > Looking through the slub code, it seems to be making invalid
> > assumptions. All of the node stuff is dependent on CONFIG_NUMA.
> > However, we're CONFIG_DISCONTIGMEM (with CONFIG_NUMA not set): on the
> > machines I and Dave Anglin have, our physical memory ranges are 0-1GB
> > and 64-65GB, so I think slub crashes when we get a page from the high
> > memory range ... because it's not expecting a non-zero node number.
>
> Right !NUMA systems only have node 0.

That's rubbish. Discontigmem uses the nodes field to identify the
discontiguous region. page_to_nid() returns this value. Your code
wrongly assumes this is zero for non NUMA.

This simple program triggers the problem instantly because it forces
allocation up into the upper discontigmem range:

#include <stdlib.h>

void main(void)
{
const long size = 1024*1024*1024;
char *a = malloc(size);
int i;

for (i=0; i < size; i += 4096)
a[i] = '\0';
}

I can fix the panic by hard coding get_nodes() to return the zero node
for the non-numa case ... however, presumably it's more than just this
that's broken in slub?

James

---

diff --git a/mm/slub.c b/mm/slub.c
index 94d2a33..243bd9c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -235,7 +235,11 @@ int slab_is_available(void)

static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
{
+#ifdef CONFIG_NUMA
return s->node[node];
+#else
+ return s->node[0];
+#endif
}

/* Verify that a pointer has an address that is valid within a slab page */

Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Tue, 19 Apr 2011, James Bottomley wrote:

> > Right !NUMA systems only have node 0.
>
> That's rubbish. Discontigmem uses the nodes field to identify the
> discontiguous region. page_to_nid() returns this value. Your code
> wrongly assumes this is zero for non NUMA.

Sorry the kernel has no node awareness if you do not set CONFIG_NUMA

F.e. zone node lookups work the following way

static inline int
zone_to_nid(struct zone *zone)
{
#ifdef CONFIG_NUMA
return zone->node;
#else
return 0;
#endif
}

How in the world did you get a zone setup in node 1 with a !NUMA config?


The problem seems to be that the kernel seems to allow a
definition of a page_to_nid() function that returns non zero in the !NUMA
case. And slub relies on page_to_nid returning zero in the !NUMA case.
Because NODES_WIDTH should be 0 in the !NUMA case and therefore
page_to_nid must return 0.

> I can fix the panic by hard coding get_nodes() to return the zero node
> for the non-numa case ... however, presumably it's more than just this
> that's broken in slub?

If you think that is broken then we have brokenness all over the kernel
whenever we determine the node from a page and use that to do a lookup.

2011-04-19 18:20:23

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Tue, 2011-04-19 at 13:10 -0500, Christoph Lameter wrote:
> On Tue, 19 Apr 2011, James Bottomley wrote:
>
> > > Right !NUMA systems only have node 0.
> >
> > That's rubbish. Discontigmem uses the nodes field to identify the
> > discontiguous region. page_to_nid() returns this value. Your code
> > wrongly assumes this is zero for non NUMA.
>
> Sorry the kernel has no node awareness if you do not set CONFIG_NUMA
>
> F.e. zone node lookups work the following way
>
> static inline int
> zone_to_nid(struct zone *zone)
> {
> #ifdef CONFIG_NUMA
> return zone->node;
> #else
> return 0;
> #endif
> }
>
> How in the world did you get a zone setup in node 1 with a !NUMA config?

I told you ... I forced an allocation into the first discontiguous
region. That will return 1 for page_to_nid().

> The problem seems to be that the kernel seems to allow a
> definition of a page_to_nid() function that returns non zero in the !NUMA
> case.

This is called reality, yes.

> And slub relies on page_to_nid returning zero in the !NUMA case.
> Because NODES_WIDTH should be 0 in the !NUMA case and therefore
> page_to_nid must return 0.

right, that's what I told you: slub is broken because it's making a
wrong assumption. Look in asm-generic/memory_model.h it shows how the
page_to_nid() is used in finding the pfn array. DISCONTIGMEM uses some
of the numa properties (including assigning zones to the discontiguous
regions).

> > I can fix the panic by hard coding get_nodes() to return the zero node
> > for the non-numa case ... however, presumably it's more than just this
> > that's broken in slub?
>
> If you think that is broken then we have brokenness all over the kernel
> whenever we determine the node from a page and use that to do a lookup.

Not really. The rest of the kernel uses the proper macros. in
DISCONTIGMEM but !NUMA configs, the numa macros expand correctly.
You've cut across that with all the CONFIG_NUMA checks in slub.

James

Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Tue, 19 Apr 2011, James Bottomley wrote:

> > }
> >
> > How in the world did you get a zone setup in node 1 with a !NUMA config?
>
> I told you ... I forced an allocation into the first discontiguous
> region. That will return 1 for page_to_nid().

How? The kernel has no concept of a node 1 without CONFIG_NUMA and so you
cannot tell the page allocator to allocate from node 1.

zone_to_nid is used as a fallback mechanism for page_to_nid() and as shown
will always return 0 for !NUMA configs.

page_to_nid(x) == zone_to_nid(page_zone(x)) must hold true. It is not
here.

> > The problem seems to be that the kernel seems to allow a
> > definition of a page_to_nid() function that returns non zero in the !NUMA
> > case.
>
> This is called reality, yes.

There you have the bug. Fix that and things will work fine.

> right, that's what I told you: slub is broken because it's making a
> wrong assumption. Look in asm-generic/memory_model.h it shows how the
> page_to_nid() is used in finding the pfn array. DISCONTIGMEM uses some
> of the numa properties (including assigning zones to the discontiguous
> regions).

Bitrotted code? If it uses numa properties then it must use a zone field
in struct zone. So DISCONTIGMEM seems to require CONFIG_NUMA.

> > If you think that is broken then we have brokenness all over the kernel
> > whenever we determine the node from a page and use that to do a lookup.
>
> Not really. The rest of the kernel uses the proper macros. in
> DISCONTIGMEM but !NUMA configs, the numa macros expand correctly.
> You've cut across that with all the CONFIG_NUMA checks in slub.

What are "the proper macros"? AFAICT page_to_nid() is the proper way to
access the node of a page. If page_to_nid() returns 1 then you have a zone
that the kernel knows of as being in node 0 having a page on a different
node.

We can likely force page_to_nid to ignore the node information that have
been erroneously placed there but this looks like something deeper is
wrong here. The node field in struct page is not only used for the Linux
support of a NUMA node but also for blocks of memory. Those should be
separate things.

---
include/linux/mm.h | 4 ++++
1 file changed, 4 insertions(+)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2011-04-19 13:20:20.092521248 -0500
+++ linux-2.6/include/linux/mm.h 2011-04-19 13:21:05.962521196 -0500
@@ -665,6 +665,7 @@ static inline int zone_to_nid(struct zon
#endif
}

+#ifdef CONFIG_NUMA
#ifdef NODE_NOT_IN_PAGE_FLAGS
extern int page_to_nid(struct page *page);
#else
@@ -673,6 +674,9 @@ static inline int page_to_nid(struct pag
return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
}
#endif
+#else
+#define page_to_nid(x) 0
+#endif

static inline struct zone *page_zone(struct page *page)
{

2011-04-19 19:49:45

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Tue, 2011-04-19 at 13:35 -0500, Christoph Lameter wrote:
> On Tue, 19 Apr 2011, James Bottomley wrote:
>
> > > }
> > >
> > > How in the world did you get a zone setup in node 1 with a !NUMA config?
> >
> > I told you ... I forced an allocation into the first discontiguous
> > region. That will return 1 for page_to_nid().
>
> How? The kernel has no concept of a node 1 without CONFIG_NUMA and so you
> cannot tell the page allocator to allocate from node 1.

Yes, it does, as I explained in the email.

> zone_to_nid is used as a fallback mechanism for page_to_nid() and as shown
> will always return 0 for !NUMA configs.
>
> page_to_nid(x) == zone_to_nid(page_zone(x)) must hold true. It is not
> here.
>
> > > The problem seems to be that the kernel seems to allow a
> > > definition of a page_to_nid() function that returns non zero in the !NUMA
> > > case.
> >
> > This is called reality, yes.
>
> There you have the bug. Fix that and things will work fine.

Why don't yout file the bug against reality? I'm not sure I have enough
credibility ...

> > right, that's what I told you: slub is broken because it's making a
> > wrong assumption. Look in asm-generic/memory_model.h it shows how the
> > page_to_nid() is used in finding the pfn array. DISCONTIGMEM uses some
> > of the numa properties (including assigning zones to the discontiguous
> > regions).
>
> Bitrotted code?

Don't be silly: alpha, ia64, m32r, m68k, mips, parisc, tile and even x86
all use the discontigmem memory model in some configurations.

> If it uses numa properties then it must use a zone field
> in struct zone. So DISCONTIGMEM seems to require CONFIG_NUMA.

No ... you're giving me back your assumptions. They're not based on
what the kernel does. CONFIG_NUMA may or may not be defined with
CONFIG_DISCONTIGMEM.

Of all the above, only x86 always had NUMA with DISCONTIGMEM.

> > > If you think that is broken then we have brokenness all over the kernel
> > > whenever we determine the node from a page and use that to do a lookup.
> >
> > Not really. The rest of the kernel uses the proper macros. in
> > DISCONTIGMEM but !NUMA configs, the numa macros expand correctly.
> > You've cut across that with all the CONFIG_NUMA checks in slub.
>
> What are "the proper macros"? AFAICT page_to_nid() is the proper way to
> access the node of a page. If page_to_nid() returns 1 then you have a zone
> that the kernel knows of as being in node 0 having a page on a different
> node.

Well it depends what you want. If you only want the actual NUMA node,
then pfn_to_nid() probably isn't what you want, because in a
DISCONTIGMEM model, there may be multiple nids per actual numa node.

> We can likely force page_to_nid to ignore the node information that have
> been erroneously placed there but this looks like something deeper is
> wrong here. The node field in struct page is not only used for the Linux
> support of a NUMA node but also for blocks of memory. Those should be
> separate things.

Look, it's not wrong, it's by design. The assumption that non-numa
systems don't use nodes is the wrong one.

> ---
> include/linux/mm.h | 4 ++++
> 1 file changed, 4 insertions(+)
>
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h 2011-04-19 13:20:20.092521248 -0500
> +++ linux-2.6/include/linux/mm.h 2011-04-19 13:21:05.962521196 -0500
> @@ -665,6 +665,7 @@ static inline int zone_to_nid(struct zon
> #endif
> }
>
> +#ifdef CONFIG_NUMA
> #ifdef NODE_NOT_IN_PAGE_FLAGS
> extern int page_to_nid(struct page *page);
> #else
> @@ -673,6 +674,9 @@ static inline int page_to_nid(struct pag
> return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
> }
> #endif
> +#else
> +#define page_to_nid(x) 0
> +#endif

Don't be silly ... that breaks asm-generic/memory_model.h

James

Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Tue, 19 Apr 2011, James Bottomley wrote:

> > > I told you ... I forced an allocation into the first discontiguous
> > > region. That will return 1 for page_to_nid().
> >
> > How? The kernel has no concept of a node 1 without CONFIG_NUMA and so you
> > cannot tell the page allocator to allocate from node 1.
>
> Yes, it does, as I explained in the email.

Looked through it and canot find it. How would that be possible to do
with core kernel calls since the page allocator calls do not allow you to
specify a node under !NUMA.

> Don't be silly: alpha, ia64, m32r, m68k, mips, parisc, tile and even x86
> all use the discontigmem memory model in some configurations.

I guess DISCONTIGMEM is typically used together with NUMA. Otherwise we
would have run into this before.

> > > Not really. The rest of the kernel uses the proper macros. in
> > > DISCONTIGMEM but !NUMA configs, the numa macros expand correctly.
> > > You've cut across that with all the CONFIG_NUMA checks in slub.
> >
> > What are "the proper macros"? AFAICT page_to_nid() is the proper way to
> > access the node of a page. If page_to_nid() returns 1 then you have a zone
> > that the kernel knows of as being in node 0 having a page on a different
> > node.
>
> Well it depends what you want. If you only want the actual NUMA node,
> then pfn_to_nid() probably isn't what you want, because in a
> DISCONTIGMEM model, there may be multiple nids per actual numa node.

Right yes you got it. The notion of a node is different(!!!!!). What
matters to the core kernel is the notion of a NUMA node. If DISCONTIGMEM
runs with !NUMA then the way that "node" is used in DISCONTIGMEM is
different from the core code and refers only to the memory blocks managed
by DISCONTIGMEM. The node should be irrelevant to the core then.

> > We can likely force page_to_nid to ignore the node information that have
> > been erroneously placed there but this looks like something deeper is
> > wrong here. The node field in struct page is not only used for the Linux
> > support of a NUMA node but also for blocks of memory. Those should be
> > separate things.
>
> Look, it's not wrong, it's by design. The assumption that non-numa
> systems don't use nodes is the wrong one.

Depends on how you define the notion of a node. The way the core kernel
uses the term "node" means that there will be only one node and that is
node 0 if CONFIG_NUMA is off. Thus page_to_nid() must return 0 for !NUMA.

All sort of things in the core code will break in weird ways if you do
allow page_to_nid to return 1 under !NUMA. Just look at the usage of
page_to_nid(). Tried to use huge pages yet? And how will your version
of reality deal with the following checks in the page allocator? F.e.

VM_BUG_ON(page_to_nid(page) != zone_to_nid(zone));

Enabled CONFIG_DEBUG_VM yet?


> > Index: linux-2.6/include/linux/mm.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/mm.h 2011-04-19 13:20:20.092521248 -0500
> > +++ linux-2.6/include/linux/mm.h 2011-04-19 13:21:05.962521196 -0500
> > @@ -665,6 +665,7 @@ static inline int zone_to_nid(struct zon
> > #endif
> > }
> >
> > +#ifdef CONFIG_NUMA
> > #ifdef NODE_NOT_IN_PAGE_FLAGS
> > extern int page_to_nid(struct page *page);
> > #else
> > @@ -673,6 +674,9 @@ static inline int page_to_nid(struct pag
> > return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
> > }
> > #endif
> > +#else
> > +#define page_to_nid(x) 0
> > +#endif
>
> Don't be silly ... that breaks asm-generic/memory_model.h

Well yeah looks like in order to be clean in the !NUMA case we would then
need a page_to_discontig_node_id() there that is different from the
page_to_nid() used for the core.

2011-04-19 21:21:48

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Tue, 2011-04-19 at 15:56 -0500, Christoph Lameter wrote:
> On Tue, 19 Apr 2011, James Bottomley wrote:
>
> > > > I told you ... I forced an allocation into the first discontiguous
> > > > region. That will return 1 for page_to_nid().
> > >
> > > How? The kernel has no concept of a node 1 without CONFIG_NUMA and so you
> > > cannot tell the page allocator to allocate from node 1.
> >
> > Yes, it does, as I explained in the email.
>
> Looked through it and canot find it. How would that be possible to do
> with core kernel calls since the page allocator calls do not allow you to
> specify a node under !NUMA.

it's used under DISCONTIGMEM to identify the pfn array.

> > Don't be silly: alpha, ia64, m32r, m68k, mips, parisc, tile and even x86
> > all use the discontigmem memory model in some configurations.
>
> I guess DISCONTIGMEM is typically used together with NUMA. Otherwise we
> would have run into this before.

Which bit of my telling you that six architectures already use it this
way did you not get? I'm not really interested in reconciling your
theories with how we currently operate. If you want to require NUMA
with DISCONTIGMEM, fine, we'll just define SLUB as broken if that's not
true ... that will fix my boot panic reports.

James

---

diff --git a/init/Kconfig b/init/Kconfig
index 56240e7..a7ad8fb 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1226,6 +1226,7 @@ config SLAB
per cpu and per node queues.

config SLUB
+ depends on BROKEN || NUMA || !DISCONTIGMEM
bool "SLUB (Unqueued Allocator)"
help
SLUB is a slab allocator that minimizes cache line usage

Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Tue, 19 Apr 2011, James Bottomley wrote:

> > I guess DISCONTIGMEM is typically used together with NUMA. Otherwise we
> > would have run into this before.
>
> Which bit of my telling you that six architectures already use it this
> way did you not get? I'm not really interested in reconciling your
> theories with how we currently operate. If you want to require NUMA
> with DISCONTIGMEM, fine, we'll just define SLUB as broken if that's not
> true ... that will fix my boot panic reports.

Which part of me telling you that you will break lots of other things in
the core kernel dont you get? If you were able to get to a command prompt
with SLAB then lets all be happy for as long as it lasts.

2011-04-19 21:48:40

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Tue, 2011-04-19 at 16:39 -0500, Christoph Lameter wrote:
> On Tue, 19 Apr 2011, James Bottomley wrote:
>
> > > I guess DISCONTIGMEM is typically used together with NUMA. Otherwise we
> > > would have run into this before.
> >
> > Which bit of my telling you that six architectures already use it this
> > way did you not get? I'm not really interested in reconciling your
> > theories with how we currently operate. If you want to require NUMA
> > with DISCONTIGMEM, fine, we'll just define SLUB as broken if that's not
> > true ... that will fix my boot panic reports.
>
> Which part of me telling you that you will break lots of other things in
> the core kernel dont you get?

I get that you tell me this ... however, the systems that, according to
you, should be failing to get to boot prompt do, in fact, manage it.

> If you were able to get to a command prompt
> with SLAB then lets all be happy for as long as it lasts.

We can't re-engineer DISCONTIGMEM as a bug fix, so something like this
has to be done for stable regardless.

James

Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Tue, 19 Apr 2011, James Bottomley wrote:

> > Which part of me telling you that you will break lots of other things in
> > the core kernel dont you get?
>
> I get that you tell me this ... however, the systems that, according to
> you, should be failing to get to boot prompt do, in fact, manage it.

If you dont use certain subsystems then it may work. Also do you run with
debuggin on.

The following patch is I think what would be needed to fix it.



Subject: [PATCH] Fix discontig support for !NUMA

Under NUMA discontig nodes map directly to the kernel NUMA nodes.

However, when DISCONTIG is used without NUMA then the kernel has only
one NUMA mode (==0) but within the node there may be multiple discontig pages
on various "nodes" for page struct vector management purposes.

Define a function __page_to_nid() that always extracts the node from
the page struct. This can be used in places where we need the discontig
node. Define page_to_nid() under !NUMA to always return 0. This ensures
that the various subsystems relying on page_to_nid(page) == 0 on !NUMA
function properly.

<Untested since I do not have a PARISC system. There could be
additional occurrences that need __page_to_nid>

Signed-off-by: Christoph Lameter <[email protected]>

---
include/asm-generic/memory_model.h | 2 +-
include/linux/mm.h | 10 ++++++++--
mm/sparse.c | 2 +-
3 files changed, 10 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2011-04-19 16:43:53.822507013 -0500
+++ linux-2.6/include/linux/mm.h 2011-04-19 16:44:52.082506944 -0500
@@ -666,14 +666,20 @@ static inline int zone_to_nid(struct zon
}

#ifdef NODE_NOT_IN_PAGE_FLAGS
-extern int page_to_nid(struct page *page);
+extern int __page_to_nid(struct page *page);
#else
-static inline int page_to_nid(struct page *page)
+static inline int __page_to_nid(struct page *page)
{
return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
}
#endif

+#ifdef CONFIG_NUMA
+#define page_to_nid __page_to_nid
+#else
+#define page_to_nid(x) 0
+#endif
+
static inline struct zone *page_zone(struct page *page)
{
return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
Index: linux-2.6/include/asm-generic/memory_model.h
===================================================================
--- linux-2.6.orig/include/asm-generic/memory_model.h 2011-04-19 16:45:26.772506904 -0500
+++ linux-2.6/include/asm-generic/memory_model.h 2011-04-19 16:46:02.602506861 -0500
@@ -40,7 +40,7 @@

#define __page_to_pfn(pg) \
({ struct page *__pg = (pg); \
- struct pglist_data *__pgdat = NODE_DATA(page_to_nid(__pg)); \
+ struct pglist_data *__pgdat = NODE_DATA(__page_to_nid(__pg)); \
(unsigned long)(__pg - __pgdat->node_mem_map) + \
__pgdat->node_start_pfn; \
})
Index: linux-2.6/mm/sparse.c
===================================================================
--- linux-2.6.orig/mm/sparse.c 2011-04-19 16:44:58.432506937 -0500
+++ linux-2.6/mm/sparse.c 2011-04-19 16:45:07.332506926 -0500
@@ -40,7 +40,7 @@ static u8 section_to_node_table[NR_MEM_S
static u16 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned;
#endif

-int page_to_nid(struct page *page)
+int __page_to_nid(struct page *page)
{
return section_to_node_table[page_to_section(page)];
}

2011-04-20 01:23:40

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

> On Tue, 19 Apr 2011, James Bottomley wrote:
>
> > > Which part of me telling you that you will break lots of other things in
> > > the core kernel dont you get?
> >
> > I get that you tell me this ... however, the systems that, according to
> > you, should be failing to get to boot prompt do, in fact, manage it.
>
> If you dont use certain subsystems then it may work. Also do you run with
> debuggin on.
>
> The following patch is I think what would be needed to fix it.

I'm worry about this patch. A lot of mm code assume !NUMA systems
only have node 0. Not only SLUB.

I'm not sure why this unfortunate mismatch occur. but I think DISCONTIG
hacks makes less sense. Can we consider parisc turn NUMA on instead?


2011-04-20 02:33:44

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Tue, 2011-04-19 at 16:58 -0500, Christoph Lameter wrote:
> On Tue, 19 Apr 2011, James Bottomley wrote:
>
> > > Which part of me telling you that you will break lots of other things in
> > > the core kernel dont you get?
> >
> > I get that you tell me this ... however, the systems that, according to
> > you, should be failing to get to boot prompt do, in fact, manage it.
>
> If you dont use certain subsystems then it may work. Also do you run with
> debuggin on.
>
> The following patch is I think what would be needed to fix it.

Not really: crashes immediately on boot

[ 0.000000] FP[0] enabled: Rev 1 Model 20
[ 0.000000] The 64-bit Kernel has started...
[ 0.000000] bootconsole [ttyB0] enabled
[ 0.000000] Initialized PDC Console for debugging.
[ 0.000000] Determining PDC firmware type: 64 bit PAT.
[ 0.000000] model 00008870 00000491 00000000 00000002 3e0505e7352af710 100000f0 00000008 000000b2 000000b2
[ 0.000000] vers 00000301
[ 0.000000] CPUID vers 20 rev 4 (0x00000284)
[ 0.000000] capabilities 0x35
[ 0.000000] model 9000/800/rp3440
[ 0.000000] parisc_cache_init: Only equivalent aliasing supported!
[ 0.000000] Memory Ranges:
[ 0.000000] 0) Start 0x0000000000000000 End 0x000000003fffffff Size 1024 MB
[ 0.000000] 1) Start 0x0000004040000000 End 0x000000407fdfffff Size 1022 MB
[ 0.000000] Total Memory: 2046 MB
[ 0.000000] initrd: 7f390000-7ffedf6d
[ 0.000000] initrd: reserving 3f390000-3ffedf6d (mem_max 7fe00000)
[ 0.000000] ------------[ cut here ]------------
[ 0.000000] kernel BUG at mm/mm_init.c:127!
[ 0.000000]
[ 0.000000] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
[ 0.000000] PSW: 00001000000001001111111100001110 Not tainted
[ 0.000000] r00-03 000000ff0804ff0e 000000004076a640 0000000040798c50 0000004080000000
[ 0.000000] r04-07 0000000040746e40 0000000004040000 0000000000000001 0000000040654150
[ 0.000000] r08-11 00000000405bd540 000000000407fe00 0000000000000001 0000000000000000
[ 0.000000] r12-15 00000000405bc740 000f000000000000 00000000000001ff 000000004076a640
[ 0.000000] r16-19 00000000000000ff 0000000000000000 2000000000000000 0000000000000000
[ 0.000000] r20-23 0000004080000000 00000000405bd908 0000000000000000 0000000004040000
[ 0.000000] r24-27 0000000000000001 0000000000000000 0000004080000000 0000000040746e40
[ 0.000000] r28-31 2000000000000000 00000000405b0610 00000000405b0640 0000000000000000
[ 0.000000] sr00-03 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 0.000000] sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 0.000000]
[ 0.000000] IASQ: 0000000000000000 0000000000000000 IAOQ: 0000000040798fac 0000000040798fb0
[ 0.000000] IIR: 03ffe01f ISR: 0000000010350000 IOR: 0000010000000000
[ 0.000000] CPU: 0 CR30: 00000000405b0000 CR31: fffffff0f0e098e0
[ 0.000000] ORIG_R28: 000000000003fe00
[ 0.000000] IAOQ[0]: mminit_verify_page_links+0x84/0xa0
[ 0.000000] IAOQ[1]: mminit_verify_page_links+0x88/0xa0
[ 0.000000] RP(r2): memmap_init_zone+0x148/0x2a0
[ 0.000000] Backtrace:
[ 0.000000] [<0000000040798c50>] memmap_init_zone+0x148/0x2a0
[ 0.000000] [<0000000040777ca8>] free_area_init_node+0x3c8/0x518
[ 0.000000] [<000000004076fde0>] paging_init+0x928/0xb20
[ 0.000000] [<0000000040770a48>] setup_arch+0xe8/0x120
[ 0.000000] [<000000004076c9a0>] start_kernel+0xf0/0x830
[ 0.000000] [<000000004011f4fc>] start_parisc+0xa4/0xb8
[ 0.000000] [<00000000404b0f0c>] packet_ioctl+0x1e4/0x208
[ 0.000000] [<00000000404a79d0>] unix_ioctl+0x70/0x168
[ 0.000000] [<0000000040482d24>] ip_mc_gsfget+0x14c/0x200
[ 0.000000] [<000000004046cc20>] raw_ioctl+0xe8/0x118
[ 0.000000] [<000000004044f524>] do_tcp_getsockopt+0x5c4/0x5d0
[ 0.000000] [<0000000040432d64>] netlink_getsockopt+0x15c/0x178
[ 0.000000]
[ 0.000000] Backtrace:
[ 0.000000] [<000000004011f984>] show_stack+0x14/0x20
[ 0.000000] [<000000004011f9a8>] dump_stack+0x18/0x28
[ 0.000000] [<000000004012022c>] die_if_kernel+0x194/0x258
[ 0.000000] [<0000000040120b30>] handle_interruption+0x840/0x8f8
[ 0.000000] [<0000000040798fac>] mminit_verify_page_links+0x84/0xa0
[ 0.000000]
[ 0.000000] ---[ end trace 139ce121c98e96c9 ]---
[ 0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
[ 0.000000] Backtrace:
[ 0.000000] [<000000004011f984>] show_stack+0x14/0x20
[ 0.000000] [<000000004011f9a8>] dump_stack+0x18/0x28
[ 0.000000] [<000000004015945c>] panic+0xd4/0x368
[ 0.000000] [<000000004015f054>] do_exit+0x89c/0x9d8
[ 0.000000] [<00000000401202d4>] die_if_kernel+0x23c/0x258
[ 0.000000] [<0000000040120b30>] handle_interruption+0x840/0x8f8
[ 0.000000] [<0000000040798fac>] mminit_verify_page_links+0x84/0xa0
[ 0.000000]

There's a lot more to discontigmem than just page_to_nid ... there's the
whole pfn_to_nid() thing as well

James

2011-04-20 02:48:58

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, 2011-04-20 at 10:23 +0900, KOSAKI Motohiro wrote:
> > On Tue, 19 Apr 2011, James Bottomley wrote:
> >
> > > > Which part of me telling you that you will break lots of other things in
> > > > the core kernel dont you get?
> > >
> > > I get that you tell me this ... however, the systems that, according to
> > > you, should be failing to get to boot prompt do, in fact, manage it.
> >
> > If you dont use certain subsystems then it may work. Also do you run with
> > debuggin on.
> >
> > The following patch is I think what would be needed to fix it.
>
> I'm worry about this patch. A lot of mm code assume !NUMA systems
> only have node 0. Not only SLUB.
>
> I'm not sure why this unfortunate mismatch occur. but I think DISCONTIG
> hacks makes less sense. Can we consider parisc turn NUMA on instead?

Well, you mean a patch like this? It won't build ... obviously we need
some more machinery

CC arch/parisc/kernel/asm-offsets.s
In file included from include/linux/sched.h:78,
from arch/parisc/kernel/asm-offsets.c:31:
include/linux/topology.h:212:2: error: #error Please define an appropriate SD_NODE_INIT in include/asm/topology.h!!!
In file included from include/linux/sched.h:78,
from arch/parisc/kernel/asm-offsets.c:31:
include/linux/topology.h: In function 'numa_node_id':
include/linux/topology.h:255: error: implicit declaration of function 'cpu_to_node'

James

---

diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 69ff049..ffe4058 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -261,6 +261,9 @@ config HPUX
bool "Support for HP-UX binaries"
depends on !64BIT

+config NUMA
+ def_bool n
+
config NR_CPUS
int "Maximum number of CPUs (2-32)"
range 2 32
diff --git a/mm/Kconfig b/mm/Kconfig
index e9c0c61..17a1474 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -64,6 +64,7 @@ endchoice
config DISCONTIGMEM
def_bool y
depends on (!SELECT_MEMORY_MODEL && ARCH_DISCONTIGMEM_ENABLE) || DISCONTIGMEM_MANUAL
+ select NUMA

config SPARSEMEM
def_bool y


2011-04-20 02:57:56

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

> On Wed, 2011-04-20 at 10:23 +0900, KOSAKI Motohiro wrote:
> > > On Tue, 19 Apr 2011, James Bottomley wrote:
> > >
> > > > > Which part of me telling you that you will break lots of other things in
> > > > > the core kernel dont you get?
> > > >
> > > > I get that you tell me this ... however, the systems that, according to
> > > > you, should be failing to get to boot prompt do, in fact, manage it.
> > >
> > > If you dont use certain subsystems then it may work. Also do you run with
> > > debuggin on.
> > >
> > > The following patch is I think what would be needed to fix it.
> >
> > I'm worry about this patch. A lot of mm code assume !NUMA systems
> > only have node 0. Not only SLUB.
> >
> > I'm not sure why this unfortunate mismatch occur. but I think DISCONTIG
> > hacks makes less sense. Can we consider parisc turn NUMA on instead?
>
> Well, you mean a patch like this? It won't build ... obviously we need
> some more machinery
>
> CC arch/parisc/kernel/asm-offsets.s
> In file included from include/linux/sched.h:78,
> from arch/parisc/kernel/asm-offsets.c:31:
> include/linux/topology.h:212:2: error: #error Please define an appropriate SD_NODE_INIT in include/asm/topology.h!!!
> In file included from include/linux/sched.h:78,
> from arch/parisc/kernel/asm-offsets.c:31:
> include/linux/topology.h: In function 'numa_node_id':
> include/linux/topology.h:255: error: implicit declaration of function 'cpu_to_node'

Sorry about that. I'll see more carefully the code later. Probably long
time discontig-mem uninterest made multiple level breakage. Grr. ;-)


2011-04-20 05:53:44

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, Apr 20, 2011 at 4:23 AM, KOSAKI Motohiro
<[email protected]> wrote:
> I'm worry about this patch. A lot of mm code assume !NUMA systems
> only have node 0. Not only SLUB.

So is that a valid assumption or not? Christoph seems to think it is
and James seems to think it's not. Which way should we aim to fix it?
Would be nice if other people chimed in as we already know what James
and Christoph think.

Pekka

2011-04-20 07:15:28

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

> On Wed, Apr 20, 2011 at 4:23 AM, KOSAKI Motohiro
> <[email protected]> wrote:
> > I'm worry about this patch. A lot of mm code assume !NUMA systems
> > only have node 0. Not only SLUB.
>
> So is that a valid assumption or not? Christoph seems to think it is
> and James seems to think it's not. Which way should we aim to fix it?
> Would be nice if other people chimed in as we already know what James
> and Christoph think.

I'm sorry. I don't know it really. The fact was gone into historical myst. ;-)

Now, CONFIG_NUMA has mainly five meanings.

1) system may has !0 node id.
2) compile mm/mempolicy.c (ie enable mempolicy APIs)
3) Allocator (kmalloc, vmalloc, alloc_page, et al) awake NUMA topology.
4) enable zone-reclaim feature
5) scheduler makes per-node load balancing scheduler domain

Anyway, we have to fix this issue. I'm digging which fixing way has least risk.


btw, x86 don't have an issue. Probably it's a reason why this issue was neglected
long time.

arch/x86/Kconfig
-------------------------------------
config ARCH_DISCONTIGMEM_ENABLE
def_bool y
depends on NUMA && X86_32


2011-04-20 07:34:25

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

Hi!

On Wed, Apr 20, 2011 at 4:23 AM, KOSAKI Motohiro
>> <[email protected]> wrote:
>> > I'm worry about this patch. A lot of mm code assume !NUMA systems
>> > only have node 0. Not only SLUB.
>>
>> So is that a valid assumption or not? Christoph seems to think it is
>> and James seems to think it's not. Which way should we aim to fix it?
>> Would be nice if other people chimed in as we already know what James
>> and Christoph think.

On Wed, Apr 20, 2011 at 10:15 AM, KOSAKI Motohiro
<[email protected]> wrote:
> I'm sorry. I don't know it really. The fact was gone into historical myst. ;-)
>
> Now, CONFIG_NUMA has mainly five meanings.
>
> 1) system may has !0 node id.
> 2) compile mm/mempolicy.c (ie enable mempolicy APIs)
> 3) Allocator (kmalloc, vmalloc, alloc_page, et al) awake NUMA topology.
> 4) enable zone-reclaim feature
> 5) scheduler makes per-node load balancing scheduler domain
>
> Anyway, we have to fix this issue. ?I'm digging which fixing way has least risk.
>
>
> btw, x86 don't have an issue. Probably it's a reason why this issue was neglected
> long time.
>
> arch/x86/Kconfig
> -------------------------------------
> config ARCH_DISCONTIGMEM_ENABLE
> ? ? ? ?def_bool y
> ? ? ? ?depends on NUMA && X86_32

That part makes me think the best option is to make parisc do
CONFIG_NUMA as well regardless of the historical intent was.

Pekka

2011-04-20 08:40:31

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

> > btw, x86 don't have an issue. Probably it's a reason why this issue was neglected
> > long time.
> >
> > arch/x86/Kconfig
> > -------------------------------------
> > config ARCH_DISCONTIGMEM_ENABLE
> > ? ? ? ?def_bool y
> > ? ? ? ?depends on NUMA && X86_32
>
> That part makes me think the best option is to make parisc do
> CONFIG_NUMA as well regardless of the historical intent was.
>
> Pekka

This?

compile test only.

---
arch/parisc/Kconfig | 7 +++++++
include/asm-generic/topology.h | 4 ----
include/linux/topology.h | 2 +-
3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 69ff049..0bf9ae8 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -229,6 +229,12 @@ config HOTPLUG_CPU
default y if SMP
select HOTPLUG

+config NUMA
+ bool "NUMA support"
+ help
+ Say Y to compile the kernel to support NUMA (Non-Uniform Memory
+ Access).
+
config ARCH_SELECT_MEMORY_MODEL
def_bool y
depends on 64BIT
@@ -236,6 +242,7 @@ config ARCH_SELECT_MEMORY_MODEL
config ARCH_DISCONTIGMEM_ENABLE
def_bool y
depends on 64BIT
+ depends on NUMA

config ARCH_FLATMEM_ENABLE
def_bool y
diff --git a/include/asm-generic/topology.h b/include/asm-generic/topology.h
index fc824e2..932567b 100644
--- a/include/asm-generic/topology.h
+++ b/include/asm-generic/topology.h
@@ -27,8 +27,6 @@
#ifndef _ASM_GENERIC_TOPOLOGY_H
#define _ASM_GENERIC_TOPOLOGY_H

-#ifndef CONFIG_NUMA
-
/* Other architectures wishing to use this simple topology API should fill
in the below functions as appropriate in their own <asm/topology.h> file. */
#ifndef cpu_to_node
@@ -60,8 +58,6 @@
cpumask_of_node(pcibus_to_node(bus)))
#endif

-#endif /* CONFIG_NUMA */
-
#if !defined(CONFIG_NUMA) || !defined(CONFIG_HAVE_MEMORYLESS_NODES)

#ifndef set_numa_mem
diff --git a/include/linux/topology.h b/include/linux/topology.h
index b91a40e..e1e535b 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -209,7 +209,7 @@ int arch_update_cpu_topology(void);

#ifdef CONFIG_NUMA
#ifndef SD_NODE_INIT
-#error Please define an appropriate SD_NODE_INIT in include/asm/topology.h!!!
+#define SD_NODE_INIT SD_ALLNODES_INIT
#endif

#endif /* CONFIG_NUMA */
--
1.7.3.1



2011-04-20 11:20:24

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, Apr 20, 2011 at 10:34:23AM +0300, Pekka Enberg wrote:
> That part makes me think the best option is to make parisc do
> CONFIG_NUMA as well regardless of the historical intent was.

But it's not just parisc. It's six other architectures as well, some
of which aren't even SMP. Does !SMP && NUMA make any kind of sense?

I think really, this is just a giant horrible misunderstanding on the part
of the MM people. There's no reason why an ARM chip with 16MB of memory
at 0 and 16MB of memory at 1GB should be saddled with all the NUMA gunk.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2011-04-20 11:28:29

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

Hi Matthew,

On Wed, Apr 20, 2011 at 10:34:23AM +0300, Pekka Enberg wrote:
>> That part makes me think the best option is to make parisc do
>> CONFIG_NUMA as well regardless of the historical intent was.

On Wed, Apr 20, 2011 at 2:20 PM, Matthew Wilcox <[email protected]> wrote:
> But it's not just parisc. ?It's six other architectures as well, some
> of which aren't even SMP. ?Does !SMP && NUMA make any kind of sense?

IIRC, we actually fixed SLAB or SLUB to work on such configs in the past.

On Wed, Apr 20, 2011 at 2:20 PM, Matthew Wilcox <[email protected]> wrote:
> I think really, this is just a giant horrible misunderstanding on the part
> of the MM people. ?There's no reason why an ARM chip with 16MB of memory
> at 0 and 16MB of memory at 1GB should be saddled with all the NUMA gunk.

Right. My point was simply that since x86 doesn't support DISCONTIGMEM
without NUMA, the misunderstanding is likely very wide-spread.

Pekka

Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, 20 Apr 2011, KOSAKI Motohiro wrote:

> > from arch/parisc/kernel/asm-offsets.c:31:
> > include/linux/topology.h: In function 'numa_node_id':
> > include/linux/topology.h:255: error: implicit declaration of function 'cpu_to_node'
>
> Sorry about that. I'll see more carefully the code later. Probably long
> time discontig-mem uninterest made multiple level breakage. Grr. ;-)

True. Someone needs to go through discontig and make it work right with a
!NUMA configuration. Many pieces of the core code assume that there will
be no node on a !NUMA config today. I guess that was different in ages
past.

Maybe we should make DISCONTIG broken under !NUMA until that time?

Tejon was working on getting rid of DISCONTIG. SPARSEMEM is the favored
alternative today. So we could potentially change the arches to use SPARSE
configs in the !NUMA case.

Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, 20 Apr 2011, Matthew Wilcox wrote:

> On Wed, Apr 20, 2011 at 10:34:23AM +0300, Pekka Enberg wrote:
> > That part makes me think the best option is to make parisc do
> > CONFIG_NUMA as well regardless of the historical intent was.
>
> But it's not just parisc. It's six other architectures as well, some
> of which aren't even SMP. Does !SMP && NUMA make any kind of sense?

Of course not.

> I think really, this is just a giant horrible misunderstanding on the part
> of the MM people. There's no reason why an ARM chip with 16MB of memory
> at 0 and 16MB of memory at 1GB should be saddled with all the NUMA gunk.

DISCONTIG has fallen out of favor in the last years. SPARSEMEM has largely
replaced it. ARM uses that and does not suffer from these issue.

No one considered the issues of having a !NUMA configuration with
nodes (which DISCONTIG seems to create) when developing core code in the
last years. The implicit assumption has always been that page_to_nid(x)
etc is always zero on a !NUMA configuration.

Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, 20 Apr 2011, Pekka Enberg wrote:

> That part makes me think the best option is to make parisc do
> CONFIG_NUMA as well regardless of the historical intent was.

Another possilibity is to use SPARSEMEM instead? We can do the same for
the other arches that we have done to x86.

2011-04-20 14:15:44

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

[added linux-arch to cc since we're going to be affecting them]
On Wed, 2011-04-20 at 14:28 +0300, Pekka Enberg wrote:
> Right. My point was simply that since x86 doesn't support DISCONTIGMEM
> without NUMA, the misunderstanding is likely very wide-spread.

Why don't we approach the problem in a few separate ways then.

1. We can look at what imposing NUMA on the DISCONTIGMEM archs
would do ... the embedded ones are going to be hardest hit, but
if it's not too much extra code, it might be palatable.
2. The other is that we can audit mm to look at all the node
assumptions in the non-numa case. My suspicion is that
accidentally or otherwise, it mostly works for the normal case,
so there might not be much needed to pull it back to working
properly for DISCONTIGMEM.
3. Finally we could look at deprecating DISCONTIGMEM in favour of
SPARSEMEM, but we'd still need to fix -stable for that case.
Especially as it will take time to convert all the architectures

I'm certainly with Matthew: DISCONTIGMEM is supposed to be a lightweight
framework which allows machines with split physical memory ranges to
work. That's a very common case nowadays. Numa is supposed to be a
heavyweight framework to preserve node locality for non-uniform memory
access boxes (which none of the DISCONTIGMEM && !NUMA systems are).

James

Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, 20 Apr 2011, James Bottomley wrote:

> 1. We can look at what imposing NUMA on the DISCONTIGMEM archs
> would do ... the embedded ones are going to be hardest hit, but
> if it's not too much extra code, it might be palatable.
> 2. The other is that we can audit mm to look at all the node
> assumptions in the non-numa case. My suspicion is that
> accidentally or otherwise, it mostly works for the normal case,
> so there might not be much needed to pull it back to working
> properly for DISCONTIGMEM.

The older code may work. SLAB f.e. does not call page_to_nid() in the
!NUMA case but keeps special metadata structures around in each slab page
that records the node used for allocation. The problem is with new code
added/revised in the last 5 years or so that uses page_to_nid() and
allocates only a single structure for !NUMA. There are also VM_BUG_ONs in
the page allocator that should trigger if page_to_nid() returns strange
values. I wonder why that never occurred.

> 3. Finally we could look at deprecating DISCONTIGMEM in favour
of > SPARSEMEM, but we'd still need to fix -stable for that case.
> Especially as it will take time to convert all the architectures

The fix needed is to mark DISCONTIGMEM without NUMA as broken for now. We
need an audit of the core VM before removing that or making it contingent
on the configurations of various VM subsystems.

> I'm certainly with Matthew: DISCONTIGMEM is supposed to be a lightweight
> framework which allows machines with split physical memory ranges to
> work. That's a very common case nowadays. Numa is supposed to be a
> heavyweight framework to preserve node locality for non-uniform memory
> access boxes (which none of the DISCONTIGMEM && !NUMA systems are).

Well yes but we have SPARSE for that today. DISCONTIG with multiple per
pgdat structures in a !NUMA case is just weird and unexpected for many who
have done VM coding in the last years.

2011-04-20 15:03:08

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, 2011-04-20 at 09:50 -0500, Christoph Lameter wrote:
> On Wed, 20 Apr 2011, James Bottomley wrote:
>
> > 1. We can look at what imposing NUMA on the DISCONTIGMEM archs
> > would do ... the embedded ones are going to be hardest hit, but
> > if it's not too much extra code, it might be palatable.
> > 2. The other is that we can audit mm to look at all the node
> > assumptions in the non-numa case. My suspicion is that
> > accidentally or otherwise, it mostly works for the normal case,
> > so there might not be much needed to pull it back to working
> > properly for DISCONTIGMEM.
>
> The older code may work. SLAB f.e. does not call page_to_nid() in the
> !NUMA case but keeps special metadata structures around in each slab page
> that records the node used for allocation. The problem is with new code
> added/revised in the last 5 years or so that uses page_to_nid() and
> allocates only a single structure for !NUMA. There are also VM_BUG_ONs in
> the page allocator that should trigger if page_to_nid() returns strange
> values. I wonder why that never occurred.

Actually, I think slab got changed when discontigmem was added ...
that's why it all works OK.

> > 3. Finally we could look at deprecating DISCONTIGMEM in favour
> of > SPARSEMEM, but we'd still need to fix -stable for that case.
> > Especially as it will take time to convert all the architectures
>
> The fix needed is to mark DISCONTIGMEM without NUMA as broken for now. We
> need an audit of the core VM before removing that or making it contingent
> on the configurations of various VM subsystems.

Don't be stupid ... that would cause six architectures to get marked
broken.

> > I'm certainly with Matthew: DISCONTIGMEM is supposed to be a lightweight
> > framework which allows machines with split physical memory ranges to
> > work. That's a very common case nowadays. Numa is supposed to be a
> > heavyweight framework to preserve node locality for non-uniform memory
> > access boxes (which none of the DISCONTIGMEM && !NUMA systems are).
>
> Well yes but we have SPARSE for that today. DISCONTIG with multiple per
> pgdat structures in a !NUMA case is just weird and unexpected for many who
> have done VM coding in the last years.

Look, I'm not really interested in who understands what. The fact is we
have six architectures with the possibility for DISCONTIGMEM && !NUMA,
so that's the case we need to fix in -stable.

They oops with SLUB, as far as I can tell, there are still no oops
reports with SLAB. The simplest -stable fix seems to be to mark SLUB
broken on DISCONTIGMEM && !NUMA.

James

Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, 20 Apr 2011, James Bottomley wrote:

> > The older code may work. SLAB f.e. does not call page_to_nid() in the
> > !NUMA case but keeps special metadata structures around in each slab page
> > that records the node used for allocation. The problem is with new code
> > added/revised in the last 5 years or so that uses page_to_nid() and
> > allocates only a single structure for !NUMA. There are also VM_BUG_ONs in
> > the page allocator that should trigger if page_to_nid() returns strange
> > values. I wonder why that never occurred.
>
> Actually, I think slab got changed when discontigmem was added ...
> that's why it all works OK.

Could be. I was not around at the time.

> > > 3. Finally we could look at deprecating DISCONTIGMEM in favour
> > of > SPARSEMEM, but we'd still need to fix -stable for that case.
> > > Especially as it will take time to convert all the architectures
> >
> > The fix needed is to mark DISCONTIGMEM without NUMA as broken for now. We
> > need an audit of the core VM before removing that or making it contingent
> > on the configurations of various VM subsystems.
>
> Don't be stupid ... that would cause six architectures to get marked
> broken.

Yes they are broken right now. Marking just means showing the user that we
are aware of the situation.

> Look, I'm not really interested in who understands what. The fact is we
> have six architectures with the possibility for DISCONTIGMEM && !NUMA,
> so that's the case we need to fix in -stable.
>
> They oops with SLUB, as far as I can tell, there are still no oops
> reports with SLAB. The simplest -stable fix seems to be to mark SLUB
> broken on DISCONTIGMEM && !NUMA.

There is barely any testing going on at all of this since we have had this
issue for more than 5 years and have not noticed it. The absence of bug
reports therefore proves nothing. Code inspection of the VM shows
that this is an issue that arises in multiple subsystems and that we have
VM_BUG_ONs in the page allocator that should trigger for these situations.

Usage of DISCONTIGMEM and !NUMA is not safe and should be flagged as such.

2011-04-20 16:33:06

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, 2011-04-20 at 17:40 +0900, KOSAKI Motohiro wrote:
> > > btw, x86 don't have an issue. Probably it's a reason why this issue was neglected
> > > long time.
> > >
> > > arch/x86/Kconfig
> > > -------------------------------------
> > > config ARCH_DISCONTIGMEM_ENABLE
> > > def_bool y
> > > depends on NUMA && X86_32
> >
> > That part makes me think the best option is to make parisc do
> > CONFIG_NUMA as well regardless of the historical intent was.
> >
> > Pekka
>
> This?

I'm afraid it doesn't boot (it's another slub crash):

[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Linux version 2.6.39-rc3+ (jejb@ion) (gcc version 4.2.4 (Debian 4.2.4-6)) #30 SMP Wed Apr 20 08:52:23 PDT 2011
[ 0.000000] unwind_init: start = 0x4057a000, end = 0x405b0e80, entries = 14056
[ 0.000000] WARNING: Out of order unwind entry! 000000004057b470 and 000000004057b480
[ 0.000000] WARNING: Out of order unwind entry! 000000004057b480 and 000000004057b490
[ 0.000000] WARNING: Out of order unwind entry! 000000004057c160 and 000000004057c170
[ 0.000000] WARNING: Out of order unwind entry! 000000004057c170 and 000000004057c180
[ 0.000000] FP[0] enabled: Rev 1 Model 20
[ 0.000000] The 64-bit Kernel has started...
[ 0.000000] bootconsole [ttyB0] enabled
[ 0.000000] Initialized PDC Console for debugging.
[ 0.000000] Determining PDC firmware type: 64 bit PAT.
[ 0.000000] model 00008870 00000491 00000000 00000002 3e0505e7352af710 100000f0 00000008 000000b2 000000b2
[ 0.000000] vers 00000301
[ 0.000000] CPUID vers 20 rev 4 (0x00000284)
[ 0.000000] capabilities 0x35
[ 0.000000] model 9000/800/rp3440
[ 0.000000] parisc_cache_init: Only equivalent aliasing supported!
[ 0.000000] Memory Ranges:
[ 0.000000] 0) Start 0x0000000000000000 End 0x000000003fffffff Size 1024 MB
[ 0.000000] 1) Start 0x0000004040000000 End 0x000000407fdfffff Size 1022 MB
[ 0.000000] Total Memory: 2046 MB
[ 0.000000] initrd: 7f390000-7ffedaa1
[ 0.000000] initrd: reserving 3f390000-3ffedaa1 (mem_max 7fe00000)
[ 0.000000] PERCPU: Embedded 10 pages/cpu @00000000418f5000 s12288 r8192 d20480 u40960
[ 0.000000] SMP: bootstrap CPU ID is 0
[ 0.000000] Built 2 zonelists in Node order, mobility grouping on. Total pages: 258560
[ 0.000000] Policy zone: Normal
[ 0.000000] Kernel command line: root=/dev/sda3 panic=5 console=ttyS1 palo_kernel=1/vmlinux-test
[ 0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[ 0.000000] Memory: 2042228k/2095104k available (3849k kernel code, 52876k reserved, 1661k data, 324k init)
[ 0.000000] virtual kernel memory layout:
[ 0.000000] vmalloc : 0x0000000000008000 - 0x000000003f000000 (1007 MB)
[ 0.000000] memory : 0x0000000040000000 - 0x00000040bfe00000 (264190 MB)
[ 0.000000] .init : 0x000000004077c000 - 0x00000000407cd000 ( 324 kB)
[ 0.000000] .data : 0x00000000404c2518 - 0x0000000040661920 (1661 kB)
[ 0.000000] .text : 0x0000000040100000 - 0x00000000404c2518 (3849 kB)
[ 0.000000] SLUB: Genslabs=11, HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=8
[ 0.000000] Hierarchical RCU implementation.
[ 0.000000] CONFIG_RCU_FANOUT set to non-default value of 32
[ 0.000000] RCU-based detection of stalled CPUs is disabled.
[ 0.000000] NR_IRQS:128
[ 0.000000] Console: colour dummy device 160x64
[ 0.000000] numa_policy_init: interleaving failed
[ 0.000000] Calibrating delay loop... 1594.36 BogoMIPS (lpj=3188736)
[ 0.048000] pid_max: default: 32768 minimum: 301
[ 0.048000] Security Framework initialized
[ 0.048000] SELinux: Disabled at boot.
[ 0.060000] Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
[ 0.072000] Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
[ 0.076000] Mount-cache hash table entries: 256
[ 0.076000] Initializing cgroup subsys ns
[ 0.076000] ns_cgroup deprecated: consider using the 'clone_children' flag without the ns_cgroup.
[ 0.080000] Initializing cgroup subsys cpuacct
[ 0.092000] Initializing cgroup subsys devices
[ 0.092000] Initializing cgroup subsys freezer
[ 0.100000] Initializing cgroup subsys net_cls
[ 0.100000] Initializing cgroup subsys blkio
[ 0.200000] Backtrace:
[ 0.200000] [<000000004021c938>] add_partial+0x28/0x98
[ 0.200000] [<000000004021faa0>] __slab_free+0x1d0/0x1d8
[ 0.200000] [<000000004021fd04>] kmem_cache_free+0xc4/0x128
[ 0.200000] [<000000004033bf9c>] ida_get_new_above+0x21c/0x2c0
[ 0.200000] [<00000000402a8980>] sysfs_new_dirent+0xd0/0x238
[ 0.200000] [<00000000402a974c>] create_dir+0x5c/0x168
[ 0.200000] [<00000000402a9ab0>] sysfs_create_dir+0x98/0x128
[ 0.200000] [<000000004033d6c4>] kobject_add_internal+0x114/0x258
[ 0.200000] [<000000004033d9ac>] kobject_add_varg+0x7c/0xa0
[ 0.200000] [<000000004033df20>] kobject_add+0x50/0x90
[ 0.200000] [<000000004033dfb4>] kobject_create_and_add+0x54/0xc8
[ 0.200000] [<00000000407862a0>] cgroup_init+0x138/0x1f0
[ 0.200000] [<000000004077ce50>] start_kernel+0x5a0/0x840
[ 0.200000] [<000000004011fa3c>] start_parisc+0xa4/0xb8
[ 0.200000] [<00000000404bb034>] packet_ioctl+0x16c/0x208
[ 0.200000] [<000000004049ac30>] ip_mroute_setsockopt+0x260/0xf20
[ 0.200000]
[ 0.200000]
[ 0.200000] Kernel Fault: Code=26 regs=00000000405bca80 (Addr=0000000000000000)
[ 0.200000]
[ 0.200000] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
[ 0.200000] PSW: 00001000000001001111001000001110 Not tainted
[ 0.200000] r00-03 000000ff0804f20e 0000000040778360 000000004021c938 0000000000000000
[ 0.200000] r04-07 0000000040754b60 0000000000000001 00000000418b1440 000000007ec01280
[ 0.200000] r08-11 0000000000000000 00000000405f7380 00000000000003c0 000000007fffffff
[ 0.200000] r12-15 00000000405bc7d8 00000000405f7398 00000000405bc6e8 00000000000041ed
[ 0.200000] r16-19 00000000f0d00b0c 0000000000000000 0000000000000000 0000000000000000
[ 0.200000] r20-23 000000000800000e 0000000000000001 00000000f0000000 000000004033bf9c
[ 0.200000] r24-27 0000000000000001 00000000418b1440 0000000000000000 0000000040754b60
[ 0.200000] r28-31 000000007ec08000 00000000405bca50 00000000405bca80 000000000000001d
[ 0.200000] sr00-03 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 0.200000] sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 0.200000]
[ 0.200000] IASQ: 0000000000000000 0000000000000000 IAOQ: 000000004011c0f0 000000004011c0f4
[ 0.200000] IIR: 0f4015dc ISR: 0000000000000000 IOR: 0000000000000000
[ 0.200000] CPU: 0 CR30: 00000000405bc000 CR31: fffffff0f0e098e0
[ 0.200000] ORIG_R28: 0000000000000100
[ 0.200000] IAOQ[0]: _raw_spin_lock+0x0/0x20
[ 0.200000] IAOQ[1]: _raw_spin_lock+0x4/0x20
[ 0.200000] RP(r2): add_partial+0x28/0x98
[ 0.200000] Backtrace:
[ 0.200000] [<000000004021c938>] add_partial+0x28/0x98
[ 0.200000] [<000000004021faa0>] __slab_free+0x1d0/0x1d8
[ 0.200000] [<000000004021fd04>] kmem_cache_free+0xc4/0x128
[ 0.200000] [<000000004033bf9c>] ida_get_new_above+0x21c/0x2c0
[ 0.200000] [<00000000402a8980>] sysfs_new_dirent+0xd0/0x238
[ 0.200000] [<00000000402a974c>] create_dir+0x5c/0x168
[ 0.200000] [<00000000402a9ab0>] sysfs_create_dir+0x98/0x128
[ 0.200000] [<000000004033d6c4>] kobject_add_internal+0x114/0x258
[ 0.200000] [<000000004033d9ac>] kobject_add_varg+0x7c/0xa0
[ 0.200000] [<000000004033df20>] kobject_add+0x50/0x90
[ 0.200000] [<000000004033dfb4>] kobject_create_and_add+0x54/0xc8
[ 0.200000] [<00000000407862a0>] cgroup_init+0x138/0x1f0
[ 0.200000] [<000000004077ce50>] start_kernel+0x5a0/0x840
[ 0.200000] [<000000004011fa3c>] start_parisc+0xa4/0xb8
[ 0.200000] [<00000000404bb034>] packet_ioctl+0x16c/0x208
[ 0.200000] [<000000004049ac30>] ip_mroute_setsockopt+0x260/0xf20
[ 0.200000]
[ 0.200000] Kernel panic - not syncing: Kernel Fault
[ 0.200000] Backtrace:
[ 0.200000] [<000000004011fec4>] show_stack+0x14/0x20
[ 0.200000] [<000000004011fee8>] dump_stack+0x18/0x28
[ 0.200000] [<000000004015a9a4>] panic+0xd4/0x368
[ 0.200000] [<0000000040120564>] parisc_terminate+0x14c/0x170
[ 0.200000] [<0000000040120adc>] handle_interruption+0x2ac/0x8f8
[ 0.200000] [<000000004011c0f0>] _raw_spin_lock+0x0/0x20
[ 0.200000]
[ 0.200000] Rebooting in 5 seconds..

James

Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, 20 Apr 2011, James Bottomley wrote:

> > > That part makes me think the best option is to make parisc do
> > > CONFIG_NUMA as well regardless of the historical intent was.

Well if it never supported NUMA then this is going to be problematic.
> > >
> > > Pekka
> >
> > This?
>
> I'm afraid it doesn't boot (it's another slub crash):

Is there any simulator available that we can use to run a parisc boot?

2011-04-20 18:10:00

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, 2011-04-20 at 11:50 -0500, Christoph Lameter wrote:
> > I'm afraid it doesn't boot (it's another slub crash):
>
> Is there any simulator available that we can use to run a parisc boot?

I don't think we have a simulator. However, if you send a ssh key to

[email protected]

He can loan you remote access to one of the systems that ESIEE in France
hosts for us. (he's expecting you).

James

2011-04-20 19:25:45

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, Apr 20, 2011 at 10:22:04AM -0500, Christoph Lameter wrote:
> There is barely any testing going on at all of this since we have had this
> issue for more than 5 years and have not noticed it. The absence of bug
> reports therefore proves nothing. Code inspection of the VM shows
> that this is an issue that arises in multiple subsystems and that we have
> VM_BUG_ONs in the page allocator that should trigger for these situations.

So ... we've proven that people using these architectures use SLAB
instead of SLUB, don't enable CONFIG_DEBUG_VM and don't use hugepages
(not really a surprise ... nobody's running Oracle on these arches :-)

I don't think that qualifies as "barely any testing". I think that's
"nobody developing the Linux MM uses one of these architectures".

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2011-04-20 21:05:32

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, 20 Apr 2011, KOSAKI Motohiro wrote:

> diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
> index 69ff049..0bf9ae8 100644
> --- a/arch/parisc/Kconfig
> +++ b/arch/parisc/Kconfig
> @@ -229,6 +229,12 @@ config HOTPLUG_CPU
> default y if SMP
> select HOTPLUG
>
> +config NUMA
> + bool "NUMA support"
> + help
> + Say Y to compile the kernel to support NUMA (Non-Uniform Memory
> + Access).
> +
> config ARCH_SELECT_MEMORY_MODEL
> def_bool y
> depends on 64BIT
> @@ -236,6 +242,7 @@ config ARCH_SELECT_MEMORY_MODEL
> config ARCH_DISCONTIGMEM_ENABLE
> def_bool y
> depends on 64BIT
> + depends on NUMA
>
> config ARCH_FLATMEM_ENABLE
> def_bool y

I think this should probably be

diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -244,6 +244,9 @@ config ARCH_DISCONTIGMEM_DEFAULT
def_bool y
depends on ARCH_DISCONTIGMEM_ENABLE

+config NUMA
+ def_bool ARCH_DISCONTIGMEM_ENABLE
+
config NODES_SHIFT
int
default "3"

instead since we don't need CONFIG_NUMA for anything other than
CONFIG_PA8X00 and 64-bit enabled.

2011-04-20 21:19:00

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, 20 Apr 2011, James Bottomley wrote:

> [ 0.200000] Backtrace:
> [ 0.200000] [<000000004021c938>] add_partial+0x28/0x98
> [ 0.200000] [<000000004021faa0>] __slab_free+0x1d0/0x1d8
> [ 0.200000] [<000000004021fd04>] kmem_cache_free+0xc4/0x128
> [ 0.200000] [<000000004033bf9c>] ida_get_new_above+0x21c/0x2c0
> [ 0.200000] [<00000000402a8980>] sysfs_new_dirent+0xd0/0x238
> [ 0.200000] [<00000000402a974c>] create_dir+0x5c/0x168
> [ 0.200000] [<00000000402a9ab0>] sysfs_create_dir+0x98/0x128
> [ 0.200000] [<000000004033d6c4>] kobject_add_internal+0x114/0x258
> [ 0.200000] [<000000004033d9ac>] kobject_add_varg+0x7c/0xa0
> [ 0.200000] [<000000004033df20>] kobject_add+0x50/0x90
> [ 0.200000] [<000000004033dfb4>] kobject_create_and_add+0x54/0xc8
> [ 0.200000] [<00000000407862a0>] cgroup_init+0x138/0x1f0
> [ 0.200000] [<000000004077ce50>] start_kernel+0x5a0/0x840
> [ 0.200000] [<000000004011fa3c>] start_parisc+0xa4/0xb8
> [ 0.200000] [<00000000404bb034>] packet_ioctl+0x16c/0x208
> [ 0.200000] [<000000004049ac30>] ip_mroute_setsockopt+0x260/0xf20
> [ 0.200000]

This is probably because the parisc's DISCONTIGMEM memory ranges don't
have bits set in N_NORMAL_MEMORY.

diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
--- a/arch/parisc/mm/init.c
+++ b/arch/parisc/mm/init.c
@@ -266,8 +266,10 @@ static void __init setup_bootmem(void)
}
memset(pfnnid_map, 0xff, sizeof(pfnnid_map));

- for (i = 0; i < npmem_ranges; i++)
+ for (i = 0; i < npmem_ranges; i++) {
+ node_set_state(i, N_NORMAL_MEMORY);
node_set_online(i);
+ }
#endif

/*

2011-04-20 21:34:50

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, 20 Apr 2011, Matthew Wilcox wrote:

> > That part makes me think the best option is to make parisc do
> > CONFIG_NUMA as well regardless of the historical intent was.
>
> But it's not just parisc. It's six other architectures as well, some
> of which aren't even SMP. Does !SMP && NUMA make any kind of sense?
>

It does as long as DISCONTIGMEM is hijacking NUMA abstractions throughout
the code; for example, look at the .config that James is probably using
for testing here:

CONFIG_PA8X00=y
CONFIG_64BIT=y
CONFIG_DISCONTIGMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_NODES_SHIFT=3

and CONFIG_NUMA is not enabled. So we want CONFIG_NODES_SHIFT of 3
(because MAX_PHYSMEM_RANGES is 8) and CONFIG_NEED_MULTIPLE_NODES is
enabled because of DISCONTIGMEM:

#
# Both the NUMA code and DISCONTIGMEM use arrays of pg_data_t's
# to represent different areas of memory. This variable allows
# those dependencies to exist individually.
#
config NEED_MULTIPLE_NODES
def_bool y
depends on DISCONTIGMEM || NUMA

when in reality we should do away with CONFIG_NEED_MULTIPLE_NODES and just
force DISCONTIGMEM to enable CONFIG_NUMA at least for -stable and as a
quick fix for James.

In the long run, we'll probably want to define a lighterweight CONFIG_NUMA
as a layer that CONFIG_DISCONTIGMEM can use for memory range abstractions
and then CONFIG_NUMA is built on top of it to define proximity between
those ranges.

2011-04-20 21:42:39

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, 20 Apr 2011, Christoph Lameter wrote:

> There is barely any testing going on at all of this since we have had this
> issue for more than 5 years and have not noticed it. The absence of bug
> reports therefore proves nothing. Code inspection of the VM shows
> that this is an issue that arises in multiple subsystems and that we have
> VM_BUG_ONs in the page allocator that should trigger for these situations.
>
> Usage of DISCONTIGMEM and !NUMA is not safe and should be flagged as such.
>

We don't actually have any bug reports in front of us that show anything
else in the VM other than slub has issues with this configuration, so
marking them as broken is probably premature. The parisc config that
triggered this debugging enables CONFIG_SLAB by default, so it probably
has gone unnoticed just because nobody other than James has actually tried
it on hppa64.

Let's see if KOSAKI-san's fixes to Kconfig (even though I'd prefer the
simpler and implicit "config NUMA def_bool ARCH_DISCONTIGMEM_ENABLE" over
his config NUMA) and my fix to parisc to set the bit in N_NORMAL_MEMORY
so that CONFIG_SLUB initializes kmem_cache_node correctly works and then
address issues in the core VM as they arise. Presumably someone has been
running DISCONTIGMEM on hppa64 in the past five years without issues with
defconfig, so the issue here may just be slub.

2011-04-20 22:15:25

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, 2011-04-20 at 14:18 -0700, David Rientjes wrote:
> This is probably because the parisc's DISCONTIGMEM memory ranges don't
> have bits set in N_NORMAL_MEMORY.
>
> diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
> --- a/arch/parisc/mm/init.c
> +++ b/arch/parisc/mm/init.c
> @@ -266,8 +266,10 @@ static void __init setup_bootmem(void)
> }
> memset(pfnnid_map, 0xff, sizeof(pfnnid_map));
>
> - for (i = 0; i < npmem_ranges; i++)
> + for (i = 0; i < npmem_ranges; i++) {
> + node_set_state(i, N_NORMAL_MEMORY);
> node_set_online(i);
> + }
> #endif

Yes, this seems to be the missing piece that gets it to boot. We really
need this in generic code, unless someone wants to run through all the
other arch's doing it ...

James

2011-04-20 23:12:43

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, 20 Apr 2011, James Bottomley wrote:

> > This is probably because the parisc's DISCONTIGMEM memory ranges don't
> > have bits set in N_NORMAL_MEMORY.
> >
> > diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
> > --- a/arch/parisc/mm/init.c
> > +++ b/arch/parisc/mm/init.c
> > @@ -266,8 +266,10 @@ static void __init setup_bootmem(void)
> > }
> > memset(pfnnid_map, 0xff, sizeof(pfnnid_map));
> >
> > - for (i = 0; i < npmem_ranges; i++)
> > + for (i = 0; i < npmem_ranges; i++) {
> > + node_set_state(i, N_NORMAL_MEMORY);
> > node_set_online(i);
> > + }
> > #endif
>
> Yes, this seems to be the missing piece that gets it to boot. We really
> need this in generic code, unless someone wants to run through all the
> other arch's doing it ...
>

Looking at all other architectures that allow ARCH_DISCONTIGMEM_ENABLE, we
already know x86 is fine, avr32 disables ARCH_DISCONTIGMEM_ENABLE entirely
because its code only brings online node 0, and tile already sets the bit
in N_NORMAL_MEMORY correctly when bringing a node online, probably because
it was introduced after the various node state masks were added in
7ea1530ab3fd back in October 2007.

So we're really only talking about alpha, ia64, m32r, m68k, and mips and
it only seems to matter when using CONFIG_SLUB, which isn't surprising
when greping for it:

$ grep -r N_NORMAL_MEMORY mm/*
mm/memcontrol.c: if (!node_state(node, N_NORMAL_MEMORY))
mm/memcontrol.c: if (!node_state(node, N_NORMAL_MEMORY))
mm/page_alloc.c: [N_NORMAL_MEMORY] = { { [0] = 1UL } },
mm/page_alloc.c: node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY);
mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY) {
mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY) {
mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY) {
mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY) {
mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY) {
mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY) {
mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY) {
mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY) {
mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY) {
mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY)

Those memory controller occurrences only result in it passing a node id of
-1 to kmalloc_node() which means no specific node target, and that's fine
for DISCONTIGMEM since we don't care about any proximity between memory
ranges.

This should fix the remaining architectures so they can use CONFIG_SLUB,
but I hope it can be tested by the individual arch maintainers like you
did for parisc.

diff --git a/arch/alpha/mm/numa.c b/arch/alpha/mm/numa.c
--- a/arch/alpha/mm/numa.c
+++ b/arch/alpha/mm/numa.c
@@ -245,6 +245,7 @@ setup_memory_node(int nid, void *kernel_end)
bootmap_size, BOOTMEM_DEFAULT);
printk(" reserving pages %ld:%ld\n", bootmap_start, bootmap_start+PFN_UP(bootmap_size));

+ node_set_state(nid, N_NORMAL_MEMORY);
node_set_online(nid);
}

diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -573,6 +573,8 @@ void __init find_memory(void)
map>>PAGE_SHIFT,
bdp->node_min_pfn,
bdp->node_low_pfn);
+ if (node_present_pages(node))
+ node_set_state(node, N_NORMAL_MEMORY);
}

efi_memmap_walk(filter_rsvd_memory, free_node_bootmem);
diff --git a/arch/m32r/kernel/setup.c b/arch/m32r/kernel/setup.c
--- a/arch/m32r/kernel/setup.c
+++ b/arch/m32r/kernel/setup.c
@@ -247,7 +247,9 @@ void __init setup_arch(char **cmdline_p)

#ifdef CONFIG_DISCONTIGMEM
nodes_clear(node_online_map);
+ node_set_state(0, N_NORMAL_MEMORY); /* always has memory */
node_set_online(0);
+ node_set_state(1, N_NORMAL_MEMORY); /* always has memory */
node_set_online(1);
#endif /* CONFIG_DISCONTIGMEM */

diff --git a/arch/m68k/mm/init_mm.c b/arch/m68k/mm/init_mm.c
--- a/arch/m68k/mm/init_mm.c
+++ b/arch/m68k/mm/init_mm.c
@@ -59,6 +59,8 @@ void __init m68k_setup_node(int node)
}
#endif
pg_data_map[node].bdata = bootmem_node_data + node;
+ if (node_present_pages(node))
+ node_set_state(node, N_NORMAL_MEMORY);
node_set_online(node);
}

diff --git a/arch/mips/sgi-ip27/ip27-memory.c b/arch/mips/sgi-ip27/ip27-memory.c
--- a/arch/mips/sgi-ip27/ip27-memory.c
+++ b/arch/mips/sgi-ip27/ip27-memory.c
@@ -471,6 +471,8 @@ void __init paging_init(void)

if (end_pfn > max_low_pfn)
max_low_pfn = end_pfn;
+ if (end_pfn > start_pfn)
+ node_set_state(node, N_NORMAL_MEMORY);
}
zones_size[ZONE_NORMAL] = max_low_pfn;
free_area_init_nodes(zones_size);

2011-04-21 02:27:20

by David Rientjes

[permalink] [raw]
Subject: [patch] parisc: set memory ranges in N_NORMAL_MEMORY when onlined

When a DISCONTIGMEM memory range is brought online as a NUMA node, it
also needs to have its bet set in N_NORMAL_MEMORY. This is necessary for
generic kernel code that utilizes N_NORMAL_MEMORY as a subset of N_ONLINE
for memory savings.

These types of hacks can hopefully be removed once DISCONTIGMEM is either
removed or abstracted away from CONFIG_NUMA.

Fixes a panic in the slub code which only initializes structures for
N_NORMAL_MEMORY to save memory:

Backtrace:
[<000000004021c938>] add_partial+0x28/0x98
[<000000004021faa0>] __slab_free+0x1d0/0x1d8
[<000000004021fd04>] kmem_cache_free+0xc4/0x128
[<000000004033bf9c>] ida_get_new_above+0x21c/0x2c0
[<00000000402a8980>] sysfs_new_dirent+0xd0/0x238
[<00000000402a974c>] create_dir+0x5c/0x168
[<00000000402a9ab0>] sysfs_create_dir+0x98/0x128
[<000000004033d6c4>] kobject_add_internal+0x114/0x258
[<000000004033d9ac>] kobject_add_varg+0x7c/0xa0
[<000000004033df20>] kobject_add+0x50/0x90
[<000000004033dfb4>] kobject_create_and_add+0x54/0xc8
[<00000000407862a0>] cgroup_init+0x138/0x1f0
[<000000004077ce50>] start_kernel+0x5a0/0x840
[<000000004011fa3c>] start_parisc+0xa4/0xb8
[<00000000404bb034>] packet_ioctl+0x16c/0x208
[<000000004049ac30>] ip_mroute_setsockopt+0x260/0xf20

Tested-by: James Bottomley <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
arch/parisc/mm/init.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
--- a/arch/parisc/mm/init.c
+++ b/arch/parisc/mm/init.c
@@ -266,8 +266,10 @@ static void __init setup_bootmem(void)
}
memset(pfnnid_map, 0xff, sizeof(pfnnid_map));

- for (i = 0; i < npmem_ranges; i++)
+ for (i = 0; i < npmem_ranges; i++) {
+ node_set_state(i, N_NORMAL_MEMORY);
node_set_online(i);
+ }
#endif

/*

2011-04-21 13:03:30

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

> diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
> --- a/arch/parisc/mm/init.c
> +++ b/arch/parisc/mm/init.c
> @@ -266,8 +266,10 @@ static void __init setup_bootmem(void)
> }
> memset(pfnnid_map, 0xff, sizeof(pfnnid_map));
>
> - for (i = 0; i < npmem_ranges; i++)
> + for (i = 0; i < npmem_ranges; i++) {
> + node_set_state(i, N_NORMAL_MEMORY);
> node_set_online(i);
> + }
> #endif


I'm surprised this one. If arch code doesn't initialized N_NORMAL_MEMORY,
(or N_HIGH_MEMORY. N_HIGH_MEMORY == N_NORMAL_MEMORY if CONFIG_HIGHMEM=n)
kswapd is created only at node0. wow.

The initialization must be necessary even if !NUMA, I think. ;-)
Probably we should have revisit all arch code when commit 9422ffba4a
(Memoryless nodes: No need for kswapd) was introduced, at least.

Thank you David. and I'm sad this multi level unforunate mismatch....

2011-04-21 13:16:46

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

> On Wed, 20 Apr 2011, James Bottomley wrote:
>
> > > This is probably because the parisc's DISCONTIGMEM memory ranges don't
> > > have bits set in N_NORMAL_MEMORY.
> > >
> > > diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
> > > --- a/arch/parisc/mm/init.c
> > > +++ b/arch/parisc/mm/init.c
> > > @@ -266,8 +266,10 @@ static void __init setup_bootmem(void)
> > > }
> > > memset(pfnnid_map, 0xff, sizeof(pfnnid_map));
> > >
> > > - for (i = 0; i < npmem_ranges; i++)
> > > + for (i = 0; i < npmem_ranges; i++) {
> > > + node_set_state(i, N_NORMAL_MEMORY);
> > > node_set_online(i);
> > > + }
> > > #endif
> >
> > Yes, this seems to be the missing piece that gets it to boot. We really
> > need this in generic code, unless someone wants to run through all the
> > other arch's doing it ...
> >
>
> Looking at all other architectures that allow ARCH_DISCONTIGMEM_ENABLE, we
> already know x86 is fine, avr32 disables ARCH_DISCONTIGMEM_ENABLE entirely
> because its code only brings online node 0, and tile already sets the bit
> in N_NORMAL_MEMORY correctly when bringing a node online, probably because
> it was introduced after the various node state masks were added in
> 7ea1530ab3fd back in October 2007.
>
> So we're really only talking about alpha, ia64, m32r, m68k, and mips and
> it only seems to matter when using CONFIG_SLUB, which isn't surprising
> when greping for it:
>
> $ grep -r N_NORMAL_MEMORY mm/*
> mm/memcontrol.c: if (!node_state(node, N_NORMAL_MEMORY))
> mm/memcontrol.c: if (!node_state(node, N_NORMAL_MEMORY))
> mm/page_alloc.c: [N_NORMAL_MEMORY] = { { [0] = 1UL } },
> mm/page_alloc.c: node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY);
> mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY) {
> mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY) {
> mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY) {
> mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY) {
> mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY) {
> mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY) {
> mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY) {
> mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY) {
> mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY) {
> mm/slub.c: for_each_node_state(node, N_NORMAL_MEMORY)
>
> Those memory controller occurrences only result in it passing a node id of
> -1 to kmalloc_node() which means no specific node target, and that's fine
> for DISCONTIGMEM since we don't care about any proximity between memory
> ranges.
>
> This should fix the remaining architectures so they can use CONFIG_SLUB,
> but I hope it can be tested by the individual arch maintainers like you
> did for parisc.

ia64 and mips have CONFIG_ARCH_POPULATES_NODE_MAP and it initialize
N_NORMAL_MEMORY automatically if my understand is correct.
(plz see free_area_init_nodes)

I guess alpha and m32r have no active developrs. only m68k seems to be need
fix and we have a chance to get a review...


2011-04-21 13:33:45

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

Hey,

On Wed, Apr 20, 2011 at 08:50:15AM -0500, Christoph Lameter wrote:
> Tejon was working on getting rid of DISCONTIG. SPARSEMEM is the favored
> alternative today. So we could potentially change the arches to use SPARSE
> configs in the !NUMA case.

Well, the thing is that sparsemem w/ vmemmap is definitely better than
discontigmem on x86-64; however, on x86-32, vmemmap can't be used due
to address space shortage and there are some minor disadvantages to
sparsemem compared to discontigmem.

IIRC, the biggest was losing a bit of granuality in memsections and
possibly wasting slightly more memory on the page array. Both didn't
seem critical to me but given that the actual amount of code needed
for discontigmem in arch code was fairly small (although the amount of
added complexity for auditing/testing can be much higher) I didn't
feel sure about dropping discontigmem and thus the patchset to drop
discontigmem was posted as RFC, to which nobody commented.

http://thread.gmane.org/gmane.linux.kernel/1121321

What do you guys think?

Thanks.

--
tejun

2011-04-21 16:06:46

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Wed, 2011-04-20 at 14:42 -0700, David Rientjes wrote:
> On Wed, 20 Apr 2011, Christoph Lameter wrote:
>
> > There is barely any testing going on at all of this since we have had this
> > issue for more than 5 years and have not noticed it. The absence of bug
> > reports therefore proves nothing. Code inspection of the VM shows
> > that this is an issue that arises in multiple subsystems and that we have
> > VM_BUG_ONs in the page allocator that should trigger for these situations.
> >
> > Usage of DISCONTIGMEM and !NUMA is not safe and should be flagged as such.
> >
>
> We don't actually have any bug reports in front of us that show anything
> else in the VM other than slub has issues with this configuration, so
> marking them as broken is probably premature. The parisc config that
> triggered this debugging enables CONFIG_SLAB by default, so it probably
> has gone unnoticed just because nobody other than James has actually tried
> it on hppa64.
>
> Let's see if KOSAKI-san's fixes to Kconfig (even though I'd prefer the
> simpler and implicit "config NUMA def_bool ARCH_DISCONTIGMEM_ENABLE" over
> his config NUMA) and my fix to parisc to set the bit in N_NORMAL_MEMORY
> so that CONFIG_SLUB initializes kmem_cache_node correctly works and then
> address issues in the core VM as they arise. Presumably someone has been
> running DISCONTIGMEM on hppa64 in the past five years without issues with
> defconfig, so the issue here may just be slub.

Actually, we can fix slub. As far as all my memory hammer tests go, the
one liner below is the actual fix (it just forces slub get_node() to
return the zero node always on !NUMA). That, as far as a code
inspection goes, seems to make SLUB as good as SLAB ... as long as
no-one uses hugepages or VM DEBUG, which, I think we've demonstrated, is
the case for all the current DISCONTIGMEM users.

I think either the above or just marking slub broken in DISCONTIGMEM & !
NUMA is sufficient for stable. The fix is getting urgent, because
debian (which is what most of our users are running) has made SLUB the
default allocator, which is why we're now starting to run into these
panic reports.

The set memory range fix looks good for a backport too ... at least the
page cache is now no-longer reluctant to use my upper 1GB ...

I worry a bit more about backporting the selection of NUMA as a -stable
fix because it's a larger change (and requires changes to all the
architectures, since NUMA is an arch local Kconfig variable)

James

----

diff --git a/mm/slub.c b/mm/slub.c
index 94d2a33..243bd9c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -235,7 +235,11 @@ int slab_is_available(void)

static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
{
+#ifdef CONFIG_NUMA
return s->node[node];
+#else
+ return s->node[0];
+#endif
}

/* Verify that a pointer has an address that is valid within a slab page */

2011-04-21 16:37:31

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Thu, 2011-04-21 at 22:16 +0900, KOSAKI Motohiro wrote:
> > This should fix the remaining architectures so they can use CONFIG_SLUB,
> > but I hope it can be tested by the individual arch maintainers like you
> > did for parisc.
>
> ia64 and mips have CONFIG_ARCH_POPULATES_NODE_MAP and it initialize
> N_NORMAL_MEMORY automatically if my understand is correct.
> (plz see free_area_init_nodes)
>
> I guess alpha and m32r have no active developrs. only m68k seems to be need
> fix and we have a chance to get a review...

Actually, it's not quite a fix yet, I'm afraid. I've just been
investigating why my main 4 way box got slower with kernel builds:
Apparently userspace processes are now all stuck on CPU0, so we're
obviously tripping over some NUMA scheduling stuff that's missing.

James

Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Thu, 21 Apr 2011, James Bottomley wrote:

> On Thu, 2011-04-21 at 22:16 +0900, KOSAKI Motohiro wrote:
> > > This should fix the remaining architectures so they can use CONFIG_SLUB,
> > > but I hope it can be tested by the individual arch maintainers like you
> > > did for parisc.
> >
> > ia64 and mips have CONFIG_ARCH_POPULATES_NODE_MAP and it initialize
> > N_NORMAL_MEMORY automatically if my understand is correct.
> > (plz see free_area_init_nodes)
> >
> > I guess alpha and m32r have no active developrs. only m68k seems to be need
> > fix and we have a chance to get a review...
>
> Actually, it's not quite a fix yet, I'm afraid. I've just been
> investigating why my main 4 way box got slower with kernel builds:
> Apparently userspace processes are now all stuck on CPU0, so we're
> obviously tripping over some NUMA scheduling stuff that's missing.

The simplest solution may be to move these arches to use SPARSE instead.
AFAICT this was relatively easy for the arm guys.

Here is short guide on how to do that from the mips people:

http://www.linux-mips.org/archives/linux-mips/2008-08/msg00154.html

http://mytechkorner.blogspot.com/2010/12/sparsemem.html

Dave Hansen, Mel: Can you provide us with some help? (Its Easter and so
the europeans may be off for awhile)

2011-04-21 18:45:59

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Thu, 2011-04-21 at 13:33 -0500, Christoph Lameter wrote:
> http://www.linux-mips.org/archives/linux-mips/2008-08/msg00154.html
>
> http://mytechkorner.blogspot.com/2010/12/sparsemem.html
>
> Dave Hansen, Mel: Can you provide us with some help? (Its Easter and so
> the europeans may be off for awhile)

Yup, for sure. It's also interesting how much code ppc64 removed when
they did this:

http://lists.ozlabs.org/pipermail/linuxppc64-dev/2005-November/006646.html

Please cc me on patches. Or, if nobody else was planning on doing it, I
can take a stab at doing SPARSEMEM on one of the arches. I won't be
able to _run_ it outside of qemu, but it might be quicker than someone
starting from scratch.

Was it really just m68k and parisc that need immediate attention?

-- Dave

2011-04-21 19:33:54

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Thu, 21 Apr 2011, KOSAKI Motohiro wrote:

> ia64 and mips have CONFIG_ARCH_POPULATES_NODE_MAP and it initialize
> N_NORMAL_MEMORY automatically if my understand is correct.
> (plz see free_area_init_nodes)
>

ia64 doesn't enable CONFIG_HIGHMEM, so it never gets set via this generic
code; mips also doesn't enable it for all configs even for 32-bit.

So we'll either want to take check_for_regular_memory() out from under
CONFIG_HIGHMEM and do it for all configs or teach slub to use
N_HIGH_MEMORY rather than N_NORMAL_MEMORY.

2011-04-21 19:38:31

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Thu, 21 Apr 2011, KOSAKI Motohiro wrote:

> > diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
> > --- a/arch/parisc/mm/init.c
> > +++ b/arch/parisc/mm/init.c
> > @@ -266,8 +266,10 @@ static void __init setup_bootmem(void)
> > }
> > memset(pfnnid_map, 0xff, sizeof(pfnnid_map));
> >
> > - for (i = 0; i < npmem_ranges; i++)
> > + for (i = 0; i < npmem_ranges; i++) {
> > + node_set_state(i, N_NORMAL_MEMORY);
> > node_set_online(i);
> > + }
> > #endif
>
>
> I'm surprised this one. If arch code doesn't initialized N_NORMAL_MEMORY,
> (or N_HIGH_MEMORY. N_HIGH_MEMORY == N_NORMAL_MEMORY if CONFIG_HIGHMEM=n)
> kswapd is created only at node0. wow.
>
> The initialization must be necessary even if !NUMA, I think. ;-)
> Probably we should have revisit all arch code when commit 9422ffba4a
> (Memoryless nodes: No need for kswapd) was introduced, at least.
>

I think we may want to just convert slub (and the memory controller) to
use N_HIGH_MEMORY rather than N_NORMAL_MEMORY since nothing else uses it
and the generic code seems to handle N_HIGH_MEMORY for all configs
appropriately.

Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Thu, 21 Apr 2011, David Rientjes wrote:

> I think we may want to just convert slub (and the memory controller) to
> use N_HIGH_MEMORY rather than N_NORMAL_MEMORY since nothing else uses it
> and the generic code seems to handle N_HIGH_MEMORY for all configs
> appropriately.

In 32 bit configurations some architectures (like x86) provide nodes
that have only high memory. Slab allocators only handle normal memory.
SLAB operates in a kind of degraded mode in that case by falling back for
each allocation to the nodes that have normal memory.

2011-04-21 20:05:25

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Thu, 2011-04-21 at 13:33 -0500, Christoph Lameter wrote:
> On Thu, 21 Apr 2011, James Bottomley wrote:
>
> > On Thu, 2011-04-21 at 22:16 +0900, KOSAKI Motohiro wrote:
> > > > This should fix the remaining architectures so they can use CONFIG_SLUB,
> > > > but I hope it can be tested by the individual arch maintainers like you
> > > > did for parisc.
> > >
> > > ia64 and mips have CONFIG_ARCH_POPULATES_NODE_MAP and it initialize
> > > N_NORMAL_MEMORY automatically if my understand is correct.
> > > (plz see free_area_init_nodes)
> > >
> > > I guess alpha and m32r have no active developrs. only m68k seems to be need
> > > fix and we have a chance to get a review...
> >
> > Actually, it's not quite a fix yet, I'm afraid. I've just been
> > investigating why my main 4 way box got slower with kernel builds:
> > Apparently userspace processes are now all stuck on CPU0, so we're
> > obviously tripping over some NUMA scheduling stuff that's missing.
>
> The simplest solution may be to move these arches to use SPARSE instead.
> AFAICT this was relatively easy for the arm guys.
>
> Here is short guide on how to do that from the mips people:
>
> http://www.linux-mips.org/archives/linux-mips/2008-08/msg00154.html
>
> http://mytechkorner.blogspot.com/2010/12/sparsemem.html
>
> Dave Hansen, Mel: Can you provide us with some help? (Its Easter and so
> the europeans may be off for awhile)

It sort of depends on your definition of easy. The problem going from
DISCONTIGMEM to SPARSEMEM is sorting out the section size (the minimum
indivisible size for a sectional_mem_map array) and also deciding on
whether you need SPARSEMEM_EXTREME (discontigmem allows arbitrarily
different sizes for each contiguous region) or
ARCH_HAS_HOLES_MEMORYMODEL (allows empty mem_map regions as well). I
suspect most architectures will want SPARSEMEM_EXTREME (it means that
the section array isn't fully populated) because the gaps can be huge
(we've got a 64GB gap on parisc).

However, even though I think we can do this going forwards ... I don't
think we can backport it as a bug fix for the slub panic.

James

Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Thu, 21 Apr 2011, James Bottomley wrote:

> > Dave Hansen, Mel: Can you provide us with some help? (Its Easter and so
> > the europeans may be off for awhile)
>
> It sort of depends on your definition of easy. The problem going from
> DISCONTIGMEM to SPARSEMEM is sorting out the section size (the minimum
> indivisible size for a sectional_mem_map array) and also deciding on
> whether you need SPARSEMEM_EXTREME (discontigmem allows arbitrarily
> different sizes for each contiguous region) or
> ARCH_HAS_HOLES_MEMORYMODEL (allows empty mem_map regions as well). I
> suspect most architectures will want SPARSEMEM_EXTREME (it means that
> the section array isn't fully populated) because the gaps can be huge
> (we've got a 64GB gap on parisc).

Well my favorite is SPARSEMEM_VMEMMAP because it allows page level holes
and uses the TLB (via page tables) to avoid lookups in the SPARSE maps but
that is likely not going to be in an initial fix.

> However, even though I think we can do this going forwards ... I don't
> think we can backport it as a bug fix for the slub panic.

So far there seems to be no other solution that will fix the issues
cleanly since we have a clash of the notions of a node in !NUMA between
core and discontig. Which is a pretty basic thing to get wrong.

If we can avoid all the fancy stuff and Dave can just get a minimal SPARSE
config going then this may be the best solution for stable as well.

But then these configs have been broken for years and no one noticed. This
means the users of these arches likely have been running a subset of
kernel functionality. I suspect they have never freed memory from
DISCONTIG node 1 and higher without CONFIG_DEBUG_VM on. Otherwise I
cannot explain why the VM_BUG_ONs did not trigger in
mm/page_alloc.c:move_freepages() that should have been brought to the MM
developers attention.

This set of circumstances leads to the suspicion that there were only
tests run that showed that the kernel booted. Higher node memory was never
touched and the MM code was never truly exercised.

So I am not sure that there is any urgency in this matter. No one has
cared for years after all.

2011-04-21 21:19:55

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Thu, 21 Apr 2011, Christoph Lameter wrote:

> In 32 bit configurations some architectures (like x86) provide nodes
> that have only high memory. Slab allocators only handle normal memory.
> SLAB operates in a kind of degraded mode in that case by falling back for
> each allocation to the nodes that have normal memory.
>

Let's do this:

- parisc: James has already queued "parisc: set memory ranges in
N_NORMAL_MEMORY when onlined" for 2.6.39, so all he needs now is
to merge a hybrid of the Kconfig changes requiring CONFIG_NUMA for
CONFIG_DISCONTIGMEM from KOSAKI-san and myself which also fix the
compile issues,

- generic code: we pull check_for_regular_memory() out from under
CONFIG_HIGHMEM so that N_NORMAL_MEMORY gets set appropriately for
all callers of free_area_init_nodes() from paging_init(); this fixes
ia64 and mips,

- alpha, m32r, m68k: push the changes to those individual architectures
that I proposed earlier that set N_NORMAL_MEMORY for DISCONTINGMEM
when memory regions have memory; KOSAKI-san says a couple of these
architectures may be orphaned so hopefully Andrew can pick them up
in -mm.

I'll reply to this email with the parisc Kconfig changes for James, the
generic change to check_for_regular_memory() for Andrew, and the
arch-specific changes to the appropriate maintainers and email lists (but
may need to go through -mm if they aren't picked up).

2011-04-21 21:22:40

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Thu, 2011-04-21 at 16:07 -0500, Christoph Lameter wrote:
> On Thu, 21 Apr 2011, James Bottomley wrote:
>
> > > Dave Hansen, Mel: Can you provide us with some help? (Its Easter and so
> > > the europeans may be off for awhile)
> >
> > It sort of depends on your definition of easy. The problem going from
> > DISCONTIGMEM to SPARSEMEM is sorting out the section size (the minimum
> > indivisible size for a sectional_mem_map array) and also deciding on
> > whether you need SPARSEMEM_EXTREME (discontigmem allows arbitrarily
> > different sizes for each contiguous region) or
> > ARCH_HAS_HOLES_MEMORYMODEL (allows empty mem_map regions as well). I
> > suspect most architectures will want SPARSEMEM_EXTREME (it means that
> > the section array isn't fully populated) because the gaps can be huge
> > (we've got a 64GB gap on parisc).
>
> Well my favorite is SPARSEMEM_VMEMMAP because it allows page level holes
> and uses the TLB (via page tables) to avoid lookups in the SPARSE maps but
> that is likely not going to be in an initial fix.

Really, no ... that requires additional pte insertion logic and some
other stuff that's nasty to craft and requires significant testing.

> > However, even though I think we can do this going forwards ... I don't
> > think we can backport it as a bug fix for the slub panic.
>
> So far there seems to be no other solution that will fix the issues
> cleanly since we have a clash of the notions of a node in !NUMA between
> core and discontig. Which is a pretty basic thing to get wrong.

Yes there is ... there's the slub patch or the marking as broken.
Either are much simpler.

> If we can avoid all the fancy stuff and Dave can just get a minimal SPARSE
> config going then this may be the best solution for stable as well.
>
> But then these configs have been broken for years and no one noticed. This
> means the users of these arches likely have been running a subset of
> kernel functionality. I suspect they have never freed memory from
> DISCONTIG node 1 and higher without CONFIG_DEBUG_VM on. Otherwise I
> cannot explain why the VM_BUG_ONs did not trigger in
> mm/page_alloc.c:move_freepages() that should have been brought to the MM
> developers attention.

Yes they have. As willy said, they've just never been run with DEBUG_VM
or HUGEPAGES or, until recently, SLUB. The test boxes (at least for
parisc) get hammered quite a lot to flush out coherency issues. That's
why I'm confident this panic only triggers for slub. I found the panic
within about two days of turning SLUB on.

> This set of circumstances leads to the suspicion that there were only
> tests run that showed that the kernel booted. Higher node memory was never
> touched and the MM code was never truly exercised.

Look, try to stay on point with logic: they have been extensively
tested, just not in the slub configuration, which is the only one that
crashes. As I explained (several times) we're just now picking up slub
because debian now enables it by default.

> So I am not sure that there is any urgency in this matter. No one has
> cared for years after all.

If we didn't care, we wouldn't be making all this fuss. It's only a
couple of days since the bug was reported, which should indicate the
high importance attached to it (well, by everyone except you,
apparently).

James

2011-04-21 21:24:52

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Thu, 2011-04-21 at 14:19 -0700, David Rientjes wrote:
> On Thu, 21 Apr 2011, Christoph Lameter wrote:
>
> > In 32 bit configurations some architectures (like x86) provide nodes
> > that have only high memory. Slab allocators only handle normal memory.
> > SLAB operates in a kind of degraded mode in that case by falling back for
> > each allocation to the nodes that have normal memory.
> >
>
> Let's do this:
>
> - parisc: James has already queued "parisc: set memory ranges in
> N_NORMAL_MEMORY when onlined" for 2.6.39, so all he needs now is
> to merge a hybrid of the Kconfig changes requiring CONFIG_NUMA for
> CONFIG_DISCONTIGMEM from KOSAKI-san and myself which also fix the
> compile issues,

Not quite: if we go this route, we need to sort out our CPU scheduling
problem as well ... as I said, I don't think we've got all the necessary
numa machinery in place yet.

James

2011-04-21 21:31:36

by David Rientjes

[permalink] [raw]
Subject: [patch] parisc: enable CONFIG_NUMA for DISCONTIGMEM and fix build errors

From: KOSAKI Motohiro <[email protected]>

CONFIG_DISCONTIGMEM requires CONFIG_NUMA because it represents its memory
regions as individual nodes that lack remote proximity, hence it already
has a default CONFIG_NODES_SHIFT of 3.

This configuration causes build errors, however, so fix them as well.

This was first realized when CONFIG_DISCONTIGMEM was used with
CONFIG_SLUB which depends heavily on CONFIG_NUMA being enabled or
disabled to support these memory regions.

A long-term fix is to separate the memory region abstraction required for
DISCONTIGMEM away from NUMA and then build NUMA on top of it with
proximity domains.

[[email protected]: implicitly enable NUMA for DISCONTIGMEM, changelog]
Signed-off-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
arch/parisc/Kconfig | 3 +++
include/asm-generic/topology.h | 4 ----
include/linux/topology.h | 2 +-
3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -244,6 +244,9 @@ config ARCH_DISCONTIGMEM_DEFAULT
def_bool y
depends on ARCH_DISCONTIGMEM_ENABLE

+config NUMA
+ def_bool ARCH_DISCONTIGMEM_ENABLE
+
config NODES_SHIFT
int
default "3"
diff --git a/include/asm-generic/topology.h b/include/asm-generic/topology.h
--- a/include/asm-generic/topology.h
+++ b/include/asm-generic/topology.h
@@ -27,8 +27,6 @@
#ifndef _ASM_GENERIC_TOPOLOGY_H
#define _ASM_GENERIC_TOPOLOGY_H

-#ifndef CONFIG_NUMA
-
/* Other architectures wishing to use this simple topology API should fill
in the below functions as appropriate in their own <asm/topology.h> file. */
#ifndef cpu_to_node
@@ -60,8 +58,6 @@
cpumask_of_node(pcibus_to_node(bus)))
#endif

-#endif /* CONFIG_NUMA */
-
#if !defined(CONFIG_NUMA) || !defined(CONFIG_HAVE_MEMORYLESS_NODES)

#ifndef set_numa_mem
diff --git a/include/linux/topology.h b/include/linux/topology.h
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -209,7 +209,7 @@ int arch_update_cpu_topology(void);

#ifdef CONFIG_NUMA
#ifndef SD_NODE_INIT
-#error Please define an appropriate SD_NODE_INIT in include/asm/topology.h!!!
+#define SD_NODE_INIT SD_ALLNODES_INIT
#endif

#endif /* CONFIG_NUMA */

2011-04-21 21:34:12

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Thu, 21 Apr 2011, James Bottomley wrote:

> > - parisc: James has already queued "parisc: set memory ranges in
> > N_NORMAL_MEMORY when onlined" for 2.6.39, so all he needs now is
> > to merge a hybrid of the Kconfig changes requiring CONFIG_NUMA for
> > CONFIG_DISCONTIGMEM from KOSAKI-san and myself which also fix the
> > compile issues,
>
> Not quite: if we go this route, we need to sort out our CPU scheduling
> problem as well ... as I said, I don't think we've got all the necessary
> numa machinery in place yet.
>

Ok, it seems like there're two options for this release cycle:

(1) merge the patch that enables CONFIG_NUMA for DISCONTIGMEM but only
do so if CONFIG_SLUB is enabled to avoid the build error, or

(2) disallow CONFIG_SLUB for parisc with DISCONTIGMEM.

2011-04-21 21:41:40

by David Rientjes

[permalink] [raw]
Subject: [patch] mm: always set nodes with regular memory in N_NORMAL_MEMORY

N_NORMAL_MEMORY is intended to include all nodes that have present memory
in regular zones, that is, zones below ZONE_HIGHMEM. This should be done
regardless of whether CONFIG_HIGHMEM is set or not.

This fixes ia64 so that the nodes get set appropriately in the nodemask
for DISCONTIGMEM and mips if it does not enable CONFIG_HIGHMEM even for
32-bit kernels.

If N_NORMAL_MEMORY is not accurate, slub may encounter errors since it
relies on this nodemask to setup kmem_cache_node data structures for each
cache.

Signed-off-by: David Rientjes <[email protected]>
---
mm/page_alloc.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4727,7 +4727,6 @@ out:
/* Any regular memory on that node ? */
static void check_for_regular_memory(pg_data_t *pgdat)
{
-#ifdef CONFIG_HIGHMEM
enum zone_type zone_type;

for (zone_type = 0; zone_type <= ZONE_NORMAL; zone_type++) {
@@ -4735,7 +4734,6 @@ static void check_for_regular_memory(pg_data_t *pgdat)
if (zone->present_pages)
node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY);
}
-#endif
}

/**

2011-04-21 21:49:31

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Thu, 2011-04-21 at 14:34 -0700, David Rientjes wrote:
> On Thu, 21 Apr 2011, James Bottomley wrote:
>
> > > - parisc: James has already queued "parisc: set memory ranges in
> > > N_NORMAL_MEMORY when onlined" for 2.6.39, so all he needs now is
> > > to merge a hybrid of the Kconfig changes requiring CONFIG_NUMA for
> > > CONFIG_DISCONTIGMEM from KOSAKI-san and myself which also fix the
> > > compile issues,
> >
> > Not quite: if we go this route, we need to sort out our CPU scheduling
> > problem as well ... as I said, I don't think we've got all the necessary
> > numa machinery in place yet.
> >
>
> Ok, it seems like there're two options for this release cycle:
>
> (1) merge the patch that enables CONFIG_NUMA for DISCONTIGMEM but only
> do so if CONFIG_SLUB is enabled to avoid the build error, or

That's not an option without coming up with the rest of the numa
fixes ... we can't basically force all SMP systems to become UP.

What build error, by the way? There's only a runtime panic caused by
slub.

> (2) disallow CONFIG_SLUB for parisc with DISCONTIGMEM.

Well, that's this patch ... it will actually fix every architecture, not
just parisc.


> diff --git a/init/Kconfig b/init/Kconfig
> index 56240e7..a7ad8fb 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1226,6 +1226,7 @@ config SLAB
> per cpu and per node queues.
>
> config SLUB
> + depends on BROKEN || NUMA || !DISCONTIGMEM
> bool "SLUB (Unqueued Allocator)"
> help
> SLUB is a slab allocator that minimizes cache line usage


I already sent it to linux-arch and there's been no dissent; there have
been a few "will that fix my slub bug?" type of responses.

James

2011-04-21 22:00:28

by David Rientjes

[permalink] [raw]
Subject: [patch] alpha, mm: set all online nodes in N_NORMAL_MEMORY

For alpha, N_NORMAL_MEMORY represents all nodes that have present memory
since it does not support HIGHMEM. This patch sets the bit at the time
the node is initialized.

If N_NORMAL_MEMORY is not accurate, slub may encounter errors since it
uses this nodemask to setup per-cache kmem_cache_node data structures.

Signed-off-by: David Rientjes <[email protected]>
---
arch/alpha/mm/numa.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/alpha/mm/numa.c b/arch/alpha/mm/numa.c
--- a/arch/alpha/mm/numa.c
+++ b/arch/alpha/mm/numa.c
@@ -313,6 +313,7 @@ void __init paging_init(void)
zones_size[ZONE_DMA] = dma_local_pfn;
zones_size[ZONE_NORMAL] = (end_pfn - start_pfn) - dma_local_pfn;
}
+ node_set_state(nid, N_NORMAL_MEMORY);
free_area_init_node(nid, zones_size, start_pfn, NULL);
}

2011-04-21 22:01:33

by David Rientjes

[permalink] [raw]
Subject: [patch] m32r, mm: set all online nodes in N_NORMAL_MEMORY

For m32r, N_NORMAL_MEMORY represents all nodes that have present memory
since it does not support HIGHMEM. This patch sets the bit at the time
the node is initialized.

If N_NORMAL_MEMORY is not accurate, slub may encounter errors since it
uses this nodemask to setup per-cache kmem_cache_node data structures.

Signed-off-by: David Rientjes <[email protected]>
---
arch/m32r/mm/discontig.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/m32r/mm/discontig.c b/arch/m32r/mm/discontig.c
--- a/arch/m32r/mm/discontig.c
+++ b/arch/m32r/mm/discontig.c
@@ -149,6 +149,7 @@ unsigned long __init zone_sizes_init(void)
zholes_size[ZONE_DMA] = mp->holes;
holes += zholes_size[ZONE_DMA];

+ node_set_state(nid, N_NORMAL_MEMORY);
free_area_init_node(nid, zones_size, start_pfn, zholes_size);
}

2011-04-21 22:02:56

by David Rientjes

[permalink] [raw]
Subject: [patch] m68k, mm: set all online nodes in N_NORMAL_MEMORY

For m68k, N_NORMAL_MEMORY represents all nodes that have present memory
since it does not support HIGHMEM. This patch sets the bit at the time
the node is brought online.

If N_NORMAL_MEMORY is not accurate, slub may encounter errors since it
uses this nodemask to setup per-cache kmem_cache_node data structures.

Signed-off-by: David Rientjes <[email protected]>
---
arch/m68k/mm/init_mm.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/m68k/mm/init_mm.c b/arch/m68k/mm/init_mm.c
--- a/arch/m68k/mm/init_mm.c
+++ b/arch/m68k/mm/init_mm.c
@@ -59,6 +59,8 @@ void __init m68k_setup_node(int node)
}
#endif
pg_data_map[node].bdata = bootmem_node_data + node;
+ if (node_present_pages(node))
+ node_set_state(node, N_NORMAL_MEMORY);
node_set_online(node);
}

2011-04-21 22:12:57

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Thu, 21 Apr 2011, James Bottomley wrote:

> > Ok, it seems like there're two options for this release cycle:
> >
> > (1) merge the patch that enables CONFIG_NUMA for DISCONTIGMEM but only
> > do so if CONFIG_SLUB is enabled to avoid the build error, or
>
> That's not an option without coming up with the rest of the numa
> fixes ... we can't basically force all SMP systems to become UP.
>
> What build error, by the way? There's only a runtime panic caused by
> slub.
>

If you enable CONFIG_NUMA for ARCH_DISCONTIGMEM_ENABLE on parisc, it
results in the same build error that you identified in

http://marc.info/?l=linux-parisc&m=130326773918005

at least on my hppa64 compiler.

> > (2) disallow CONFIG_SLUB for parisc with DISCONTIGMEM.
>
> Well, that's this patch ... it will actually fix every architecture, not
> just parisc.
>
>
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 56240e7..a7ad8fb 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -1226,6 +1226,7 @@ config SLAB
> > per cpu and per node queues.
> >
> > config SLUB
> > + depends on BROKEN || NUMA || !DISCONTIGMEM
> > bool "SLUB (Unqueued Allocator)"
> > help
> > SLUB is a slab allocator that minimizes cache line usage
>
>
> I already sent it to linux-arch and there's been no dissent; there have
> been a few "will that fix my slub bug?" type of responses.
>

I was concerned about tile because it actually got all this right by using
N_NORMAL_MEMORY appropriately and it uses slub by default, but it always
enables NUMA at the moment so this won't impact it.

Acked-by: David Rientjes <[email protected]>

I agree we can now defer "parisc: enable CONFIG_NUMA for DISCONTIGMEM and
fix build errors" until parisc moves away from DISCONTIGMEM, its extracted
away from CONFIG_NUMA, or the scheduler issues are debugged.

2011-04-21 22:19:13

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Thu, 21 Apr 2011, James Bottomley wrote:

> diff --git a/mm/slub.c b/mm/slub.c
> index 94d2a33..243bd9c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -235,7 +235,11 @@ int slab_is_available(void)
>
> static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
> {
> +#ifdef CONFIG_NUMA
> return s->node[node];
> +#else
> + return s->node[0];
> +#endif
> }
>
> /* Verify that a pointer has an address that is valid within a slab page */

Looks like parisc may have been just fine before 7340cc84141d (slub:
reduce differences between SMP and NUMA), which was merged into 2.6.37?

2011-04-21 22:31:36

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Thu, 2011-04-21 at 15:19 -0700, David Rientjes wrote:
> On Thu, 21 Apr 2011, James Bottomley wrote:
>
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 94d2a33..243bd9c 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -235,7 +235,11 @@ int slab_is_available(void)
> >
> > static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
> > {
> > +#ifdef CONFIG_NUMA
> > return s->node[node];
> > +#else
> > + return s->node[0];
> > +#endif
> > }
> >
> > /* Verify that a pointer has an address that is valid within a slab page */
>
> Looks like parisc may have been just fine before 7340cc84141d (slub:
> reduce differences between SMP and NUMA), which was merged into 2.6.37?

That's possible. I've had no bug reports from the debian 2.6.32 kernel,
which is the only other one that has SLUB by default. The m68k guys
seem to think this is the cause of their problems too.

But the basic fact is that all our testing has been done on SLAB. It
wasn't until debian asked us to looks at a 2.6.38 kernel that I
accidentally picked up SLUB by importing their config into my build
environment.

James

2011-04-22 00:34:29

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

> On Thu, 21 Apr 2011, KOSAKI Motohiro wrote:
>
> > ia64 and mips have CONFIG_ARCH_POPULATES_NODE_MAP and it initialize
> > N_NORMAL_MEMORY automatically if my understand is correct.
> > (plz see free_area_init_nodes)
> >
>
> ia64 doesn't enable CONFIG_HIGHMEM, so it never gets set via this generic
> code; mips also doesn't enable it for all configs even for 32-bit.
>
> So we'll either want to take check_for_regular_memory() out from under
> CONFIG_HIGHMEM and do it for all configs or teach slub to use
> N_HIGH_MEMORY rather than N_NORMAL_MEMORY.

Hey, I already told this thing.

If CONFIG_HIGHMEM=n, N_HIGH_MEMORY and N_NORMAL_MEMORY are share the
same value. then,
node_set_state(nid, N_HIGH_MEMORY) in free_area_init_nodes()

mean set both N_HIGH_MEMORY and N_NORMAL_MEMORY.

2011-04-22 00:36:19

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch] mm: always set nodes with regular memory in N_NORMAL_MEMORY

> N_NORMAL_MEMORY is intended to include all nodes that have present memory
> in regular zones, that is, zones below ZONE_HIGHMEM. This should be done
> regardless of whether CONFIG_HIGHMEM is set or not.
>
> This fixes ia64 so that the nodes get set appropriately in the nodemask
> for DISCONTIGMEM and mips if it does not enable CONFIG_HIGHMEM even for
> 32-bit kernels.
>
> If N_NORMAL_MEMORY is not accurate, slub may encounter errors since it
> relies on this nodemask to setup kmem_cache_node data structures for each
> cache.
>
> Signed-off-by: David Rientjes <[email protected]>
> ---
> mm/page_alloc.c | 2 --
> 1 files changed, 0 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4727,7 +4727,6 @@ out:
> /* Any regular memory on that node ? */
> static void check_for_regular_memory(pg_data_t *pgdat)
> {
> -#ifdef CONFIG_HIGHMEM
> enum zone_type zone_type;
>
> for (zone_type = 0; zone_type <= ZONE_NORMAL; zone_type++) {
> @@ -4735,7 +4734,6 @@ static void check_for_regular_memory(pg_data_t *pgdat)
> if (zone->present_pages)
> node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY);
> }
> -#endif

enum node_states {
N_POSSIBLE, /* The node could become online at some point */
N_ONLINE, /* The node is online */
N_NORMAL_MEMORY, /* The node has regular memory */
#ifdef CONFIG_HIGHMEM
N_HIGH_MEMORY, /* The node has regular or high memory */
#else
N_HIGH_MEMORY = N_NORMAL_MEMORY,
#endif
N_CPU, /* The node has one or more cpus */
NR_NODE_STATES
};

Then, only node_set_state(nid, N_HIGH_MEMORY) is enough initialization, IIUC.
Can you please explain when do we need this patch?

2011-04-22 08:02:16

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Fri, Apr 22, 2011 at 1:12 AM, David Rientjes <[email protected]> wrote:
>> > diff --git a/init/Kconfig b/init/Kconfig
>> > index 56240e7..a7ad8fb 100644
>> > --- a/init/Kconfig
>> > +++ b/init/Kconfig
>> > @@ -1226,6 +1226,7 @@ config SLAB
>> > ? ? ? ? ? per cpu and per node queues.
>> >
>> > ?config SLUB
>> > + ? ? ? depends on BROKEN || NUMA || !DISCONTIGMEM
>> > ? ? ? ? bool "SLUB (Unqueued Allocator)"
>> > ? ? ? ? help
>> > ? ? ? ? ? ?SLUB is a slab allocator that minimizes cache line usage
>>
>>
>> I already sent it to linux-arch and there's been no dissent; there have
>> been a few "will that fix my slub bug?" type of responses.
>
> I was concerned about tile because it actually got all this right by using
> N_NORMAL_MEMORY appropriately and it uses slub by default, but it always
> enables NUMA at the moment so this won't impact it.
>
> Acked-by: David Rientjes <[email protected]>

I'm OK with this Kconfig patch. Can someone send a proper patch with
signoffs and such? Do we want to tag this for -stable too?

Pekka

2011-04-22 13:50:04

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Fri, 2011-04-22 at 11:02 +0300, Pekka Enberg wrote:
> On Fri, Apr 22, 2011 at 1:12 AM, David Rientjes <[email protected]> wrote:
> >> > diff --git a/init/Kconfig b/init/Kconfig
> >> > index 56240e7..a7ad8fb 100644
> >> > --- a/init/Kconfig
> >> > +++ b/init/Kconfig
> >> > @@ -1226,6 +1226,7 @@ config SLAB
> >> > per cpu and per node queues.
> >> >
> >> > config SLUB
> >> > + depends on BROKEN || NUMA || !DISCONTIGMEM
> >> > bool "SLUB (Unqueued Allocator)"
> >> > help
> >> > SLUB is a slab allocator that minimizes cache line usage
> >>
> >>
> >> I already sent it to linux-arch and there's been no dissent; there have
> >> been a few "will that fix my slub bug?" type of responses.
> >
> > I was concerned about tile because it actually got all this right by using
> > N_NORMAL_MEMORY appropriately and it uses slub by default, but it always
> > enables NUMA at the moment so this won't impact it.
> >
> > Acked-by: David Rientjes <[email protected]>
>
> I'm OK with this Kconfig patch. Can someone send a proper patch with
> signoffs and such? Do we want to tag this for -stable too?

I already did here:

http://marc.info/?l=linux-arch&m=130324857801976

I got the wrong linux-mm email address, though (I thought you were on
vger).

I've got a parisc specific patch already for this (also for stable), so
I can just queue this alongside if everyone's OK with that?

James

2011-04-22 17:01:01

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

Hi James,

On Fri, 2011-04-22 at 08:49 -0500, James Bottomley wrote:
> > I'm OK with this Kconfig patch. Can someone send a proper patch with
> > signoffs and such? Do we want to tag this for -stable too?
>
> I already did here:
>
> http://marc.info/?l=linux-arch&m=130324857801976
>
> I got the wrong linux-mm email address, though (I thought you were on
> vger).

Grr, it's a SLUB patch and you didn't CC any of the maintainers! If that
was an attempt to sneak it past me, that's not cool. And if you left it
out by mistake, that's not cool either!

> I've got a parisc specific patch already for this (also for stable), so
> I can just queue this alongside if everyone's OK with that?

Feel free, I'm not subscribed to linux-arch so I don't have the patch in
my inbox at all:

Acked-by: Pekka Enberg <[email protected]>

Pekka

2011-04-22 17:03:35

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Fri, 2011-04-22 at 20:00 +0300, Pekka Enberg wrote:
> Hi James,
>
> On Fri, 2011-04-22 at 08:49 -0500, James Bottomley wrote:
> > > I'm OK with this Kconfig patch. Can someone send a proper patch with
> > > signoffs and such? Do we want to tag this for -stable too?
> >
> > I already did here:
> >
> > http://marc.info/?l=linux-arch&m=130324857801976
> >
> > I got the wrong linux-mm email address, though (I thought you were on
> > vger).
>
> Grr, it's a SLUB patch and you didn't CC any of the maintainers! If that
> was an attempt to sneak it past me, that's not cool. And if you left it
> out by mistake, that's not cool either!

No, christoph was directly on the cc ... although I didn't realise there
were more slub maintainers.

> > I've got a parisc specific patch already for this (also for stable), so
> > I can just queue this alongside if everyone's OK with that?
>
> Feel free, I'm not subscribed to linux-arch so I don't have the patch in
> my inbox at all:
>
> Acked-by: Pekka Enberg <[email protected]>

Will do, thanks.

James

2011-04-22 18:19:23

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Thu, 2011-04-21 at 11:45 -0700, Dave Hansen wrote:
> On Thu, 2011-04-21 at 13:33 -0500, Christoph Lameter wrote:
> > http://www.linux-mips.org/archives/linux-mips/2008-08/msg00154.html
> >
> > http://mytechkorner.blogspot.com/2010/12/sparsemem.html
> >
> > Dave Hansen, Mel: Can you provide us with some help? (Its Easter and so
> > the europeans may be off for awhile)
>
> Yup, for sure. It's also interesting how much code ppc64 removed when
> they did this:
>
> http://lists.ozlabs.org/pipermail/linuxppc64-dev/2005-November/006646.html

I looked at converting parisc to sparsemem and there's one problem that
none of these cover. How do you set up bootmem? If I look at the
examples, they all seem to have enough memory in the first range to
allocate from, so there's no problem. On parisc, with discontigmem, we
set up all of our ranges as bootmem (we can do this because we
effectively have one node per range). Obviously, since sparsemem has a
single bitmap for all of the bootmem, we can no longer allocate all of
our memory to it (well, without exploding because some of our gaps are
gigabytes big). How does everyone cope with this (do you search for
your largest range and use that as bootmem or something)?

James


If

2011-04-22 20:24:58

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Fri, 2011-04-22 at 13:19 -0500, James Bottomley wrote:
> I looked at converting parisc to sparsemem and there's one problem that
> none of these cover. How do you set up bootmem? If I look at the
> examples, they all seem to have enough memory in the first range to
> allocate from, so there's no problem. On parisc, with discontigmem, we
> set up all of our ranges as bootmem (we can do this because we
> effectively have one node per range). Obviously, since sparsemem has a
> single bitmap for all of the bootmem, we can no longer allocate all of
> our memory to it (well, without exploding because some of our gaps are
> gigabytes big). How does everyone cope with this (do you search for
> your largest range and use that as bootmem or something)?

Sparsemem is purely post-bootmem. It doesn't deal with sparse
bootmem. :(

That said, I'm not sure you're in trouble. One bit of bitmap covers 4k
(with 4k pages of course) of memory, one byte covers 32k, and A 32MB
bitmap can cover 1TB of address space. It explodes, but I think it's
manageable. It hasn't been a problem enough up to this point to go fix
it.

-- Dave

2011-04-22 20:35:56

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Fri, 2011-04-22 at 13:24 -0700, Dave Hansen wrote:
> On Fri, 2011-04-22 at 13:19 -0500, James Bottomley wrote:
> > I looked at converting parisc to sparsemem and there's one problem that
> > none of these cover. How do you set up bootmem? If I look at the
> > examples, they all seem to have enough memory in the first range to
> > allocate from, so there's no problem. On parisc, with discontigmem, we
> > set up all of our ranges as bootmem (we can do this because we
> > effectively have one node per range). Obviously, since sparsemem has a
> > single bitmap for all of the bootmem, we can no longer allocate all of
> > our memory to it (well, without exploding because some of our gaps are
> > gigabytes big). How does everyone cope with this (do you search for
> > your largest range and use that as bootmem or something)?
>
> Sparsemem is purely post-bootmem. It doesn't deal with sparse
> bootmem. :(

Well, this is enabled in discontigmem, sigh.

> That said, I'm not sure you're in trouble. One bit of bitmap covers 4k
> (with 4k pages of course) of memory, one byte covers 32k, and A 32MB
> bitmap can cover 1TB of address space. It explodes, but I think it's
> manageable. It hasn't been a problem enough up to this point to go fix
> it.

I think the platform limited physical address range is 42 bits, so I
suppose that's 128MB ... hopefully we should have that as a contiguous
range from the end of the loaded kernel. We're lucky they didn't enable
the full ZX1 address range; that would have been 48 bits (or a whole
gigabyte just for the bitmap).

James

2011-04-22 21:33:14

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards

On Thu, 2011-04-21 at 11:45 -0700, Dave Hansen wrote:
> On Thu, 2011-04-21 at 13:33 -0500, Christoph Lameter wrote:
> > http://www.linux-mips.org/archives/linux-mips/2008-08/msg00154.html
> >

By the way, this reference is actively wrong for parisc (having just
debugged the problem). The basic issue is that until we start paging,
we have the kernel and some memory beyond it barely covered with the pg0
page table set up in head.S On our systems, that extends out to 16MB.
SPARSEMEM is much more bootmem resource greedy than DISCONTIGMEM, so if
we actually call sparse_init() before we have the page tables set up, we
fall off the end of our 16MB mapping and go boom. For us, therefore, we
can't call sparse_init() until we have our proper page tables in place.

James

2011-04-23 18:34:23

by James Bottomley

[permalink] [raw]
Subject: [PATCH] convert parisc to sparsemem (was Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards)

This is the preliminary conversion. It's very nasty on parisc because
the memory allocation isn't symmetric anymore: under DISCONTIGMEM, we
push all memory into bootmem and then let free_all_bootmem() do the
magic for us; now we have to do separate initialisations for ranges
because SPARSEMEM can't do multi-range boot memory. It's also got the
horrible hack that I only use the first found range for bootmem. I'm
not sure if this is correct (it won't be if the first found range can be
under about 50MB because we'll run out of bootmem during boot) ... we
might have to sort the ranges and use the larges, but that will involve
us in even more hackery around the bootmem reservations code.

The boot sequence got a few seconds slower because now all of the loops
over our pfn ranges actually have to skip through the holes (which takes
time for 64GB).

All in all, I've not been very impressed with SPARSEMEM over
DISCONTIGMEM. It seems to have a lot of rough edges (necessitating
exception code) which DISCONTIGMEM just copes with.

And before you say the code is smaller, that's because I converted us to
generic show_mem().

James

---

diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 69ff049..b416641 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -233,22 +233,17 @@ config ARCH_SELECT_MEMORY_MODEL
def_bool y
depends on 64BIT

-config ARCH_DISCONTIGMEM_ENABLE
+config ARCH_SPARSEMEM_ENABLE
def_bool y
depends on 64BIT

config ARCH_FLATMEM_ENABLE
def_bool y

-config ARCH_DISCONTIGMEM_DEFAULT
+config ARCH_SPARSEMEM_DEFAULT
def_bool y
depends on ARCH_DISCONTIGMEM_ENABLE

-config NODES_SHIFT
- int
- default "3"
- depends on NEED_MULTIPLE_NODES
-
source "kernel/Kconfig.preempt"
source "kernel/Kconfig.hz"
source "mm/Kconfig"
diff --git a/arch/parisc/include/asm/mmzone.h b/arch/parisc/include/asm/mmzone.h
index 9608d2c..8344bcb 100644
--- a/arch/parisc/include/asm/mmzone.h
+++ b/arch/parisc/include/asm/mmzone.h
@@ -1,73 +1,11 @@
#ifndef _PARISC_MMZONE_H
#define _PARISC_MMZONE_H

-#ifdef CONFIG_DISCONTIGMEM
+#ifdef CONFIG_SPARSEMEM

-#define MAX_PHYSMEM_RANGES 8 /* Fix the size for now (current known max is 3) */
-extern int npmem_ranges;
-
-struct node_map_data {
- pg_data_t pg_data;
-};
-
-extern struct node_map_data node_data[];
-
-#define NODE_DATA(nid) (&node_data[nid].pg_data)
-
-#define node_start_pfn(nid) (NODE_DATA(nid)->node_start_pfn)
-#define node_end_pfn(nid) \
-({ \
- pg_data_t *__pgdat = NODE_DATA(nid); \
- __pgdat->node_start_pfn + __pgdat->node_spanned_pages; \
-})
-
-/* We have these possible memory map layouts:
- * Astro: 0-3.75, 67.75-68, 4-64
- * zx1: 0-1, 257-260, 4-256
- * Stretch (N-class): 0-2, 4-32, 34-xxx
- */
-
-/* Since each 1GB can only belong to one region (node), we can create
- * an index table for pfn to nid lookup; each entry in pfnnid_map
- * represents 1GB, and contains the node that the memory belongs to. */
-
-#define PFNNID_SHIFT (30 - PAGE_SHIFT)
-#define PFNNID_MAP_MAX 512 /* support 512GB */
-extern unsigned char pfnnid_map[PFNNID_MAP_MAX];
-
-#ifndef CONFIG_64BIT
-#define pfn_is_io(pfn) ((pfn & (0xf0000000UL >> PAGE_SHIFT)) == (0xf0000000UL >> PAGE_SHIFT))
+#define MAX_PHYSMEM_RANGES 8 /* current max is 3 but future proof this */
#else
-/* io can be 0xf0f0f0f0f0xxxxxx or 0xfffffffff0000000 */
-#define pfn_is_io(pfn) ((pfn & (0xf000000000000000UL >> PAGE_SHIFT)) == (0xf000000000000000UL >> PAGE_SHIFT))
-#endif
-
-static inline int pfn_to_nid(unsigned long pfn)
-{
- unsigned int i;
- unsigned char r;
-
- if (unlikely(pfn_is_io(pfn)))
- return 0;
-
- i = pfn >> PFNNID_SHIFT;
- BUG_ON(i >= sizeof(pfnnid_map) / sizeof(pfnnid_map[0]));
- r = pfnnid_map[i];
- BUG_ON(r == 0xff);
-
- return (int)r;
-}
-
-static inline int pfn_valid(int pfn)
-{
- int nid = pfn_to_nid(pfn);
-
- if (nid >= 0)
- return (pfn < node_end_pfn(nid));
- return 0;
-}
-
-#else /* !CONFIG_DISCONTIGMEM */
#define MAX_PHYSMEM_RANGES 1
#endif
+
#endif /* _PARISC_MMZONE_H */
diff --git a/arch/parisc/include/asm/page.h b/arch/parisc/include/asm/page.h
index a84cc1f..654285a 100644
--- a/arch/parisc/include/asm/page.h
+++ b/arch/parisc/include/asm/page.h
@@ -139,9 +139,9 @@ extern int npmem_ranges;
#define __pa(x) ((unsigned long)(x)-PAGE_OFFSET)
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))

-#ifndef CONFIG_DISCONTIGMEM
+#ifndef CONFIG_SPARSEMEM
#define pfn_valid(pfn) ((pfn) < max_mapnr)
-#endif /* CONFIG_DISCONTIGMEM */
+#endif

#ifdef CONFIG_HUGETLB_PAGE
#define HPAGE_SHIFT 22 /* 4MB (is this fixed?) */
diff --git a/arch/parisc/kernel/parisc_ksyms.c b/arch/parisc/kernel/parisc_ksyms.c
index df65366..526122c 100644
--- a/arch/parisc/kernel/parisc_ksyms.c
+++ b/arch/parisc/kernel/parisc_ksyms.c
@@ -147,12 +147,6 @@ extern void $$dyncall(void);
EXPORT_SYMBOL($$dyncall);
#endif

-#ifdef CONFIG_DISCONTIGMEM
-#include <asm/mmzone.h>
-EXPORT_SYMBOL(node_data);
-EXPORT_SYMBOL(pfnnid_map);
-#endif
-
#ifdef CONFIG_FUNCTION_TRACER
extern void _mcount(void);
EXPORT_SYMBOL(_mcount);
diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
index 5fa1e27..69c547c 100644
--- a/arch/parisc/mm/init.c
+++ b/arch/parisc/mm/init.c
@@ -21,7 +21,6 @@
#include <linux/initrd.h>
#include <linux/swap.h>
#include <linux/unistd.h>
-#include <linux/nodemask.h> /* for node_online_map */
#include <linux/pagemap.h> /* for release_pages and page_cache_release */

#include <asm/pgalloc.h>
@@ -35,11 +34,6 @@ DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);

extern int data_start;

-#ifdef CONFIG_DISCONTIGMEM
-struct node_map_data node_data[MAX_NUMNODES] __read_mostly;
-unsigned char pfnnid_map[PFNNID_MAP_MAX] __read_mostly;
-#endif
-
static struct resource data_resource = {
.name = "Kernel data",
.flags = IORESOURCE_BUSY | IORESOURCE_MEM,
@@ -110,7 +104,7 @@ static void __init setup_bootmem(void)
unsigned long bootmap_pages;
unsigned long bootmap_start_pfn;
unsigned long bootmap_pfn;
-#ifndef CONFIG_DISCONTIGMEM
+#ifndef CONFIG_SPARSEMEM
physmem_range_t pmem_holes[MAX_PHYSMEM_RANGES - 1];
int npmem_holes;
#endif
@@ -144,7 +138,7 @@ static void __init setup_bootmem(void)
}
}

-#ifndef CONFIG_DISCONTIGMEM
+#ifndef CONFIG_SPARSEMEM
/*
* Throw out ranges that are too far apart (controlled by
* MAX_GAP).
@@ -156,7 +150,7 @@ static void __init setup_bootmem(void)
pmem_ranges[i-1].pages) > MAX_GAP) {
npmem_ranges = i;
printk("Large gap in memory detected (%ld pages). "
- "Consider turning on CONFIG_DISCONTIGMEM\n",
+ "Consider turning on CONFIG_SPARSEMEM\n",
pmem_ranges[i].start_pfn -
(pmem_ranges[i-1].start_pfn +
pmem_ranges[i-1].pages));
@@ -228,7 +222,7 @@ static void __init setup_bootmem(void)

printk(KERN_INFO "Total Memory: %ld MB\n",mem_max >> 20);

-#ifndef CONFIG_DISCONTIGMEM
+#ifndef CONFIG_SPARSEMEM
/* Merge the ranges, keeping track of the holes */

{
@@ -253,48 +247,29 @@ static void __init setup_bootmem(void)
}
#endif

- bootmap_pages = 0;
- for (i = 0; i < npmem_ranges; i++)
- bootmap_pages += bootmem_bootmap_pages(pmem_ranges[i].pages);
+ bootmap_pages = bootmem_bootmap_pages(pmem_ranges[0].pages);

bootmap_start_pfn = PAGE_ALIGN(__pa((unsigned long) &_end)) >> PAGE_SHIFT;

-#ifdef CONFIG_DISCONTIGMEM
- for (i = 0; i < MAX_PHYSMEM_RANGES; i++) {
- memset(NODE_DATA(i), 0, sizeof(pg_data_t));
- NODE_DATA(i)->bdata = &bootmem_node_data[i];
- }
- memset(pfnnid_map, 0xff, sizeof(pfnnid_map));
-
- for (i = 0; i < npmem_ranges; i++) {
- node_set_state(i, N_NORMAL_MEMORY);
- node_set_online(i);
- }
-#endif
-
/*
- * Initialize and free the full range of memory in each range.
- * Note that the only writing these routines do are to the bootmap,
- * and we've made sure to locate the bootmap properly so that they
- * won't be writing over anything important.
+ * Only initialise the first memory range to bootmem (the bootmem
+ * allocation map can't cope with large holes)
*/

bootmap_pfn = bootmap_start_pfn;
max_pfn = 0;
- for (i = 0; i < npmem_ranges; i++) {
+ {
unsigned long start_pfn;
unsigned long npages;

- start_pfn = pmem_ranges[i].start_pfn;
- npages = pmem_ranges[i].pages;
+ start_pfn = pmem_ranges[0].start_pfn;
+ npages = pmem_ranges[0].pages;

- bootmap_size = init_bootmem_node(NODE_DATA(i),
+ bootmap_size = init_bootmem_node(NODE_DATA(0),
bootmap_pfn,
start_pfn,
(start_pfn + npages) );
- free_bootmem_node(NODE_DATA(i),
- (start_pfn << PAGE_SHIFT),
- (npages << PAGE_SHIFT) );
+ free_bootmem(start_pfn << PAGE_SHIFT, npages << PAGE_SHIFT);
bootmap_pfn += (bootmap_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
if ((start_pfn + npages) > max_pfn)
max_pfn = start_pfn + npages;
@@ -323,7 +298,7 @@ static void __init setup_bootmem(void)
((bootmap_pfn - bootmap_start_pfn) << PAGE_SHIFT),
BOOTMEM_DEFAULT);

-#ifndef CONFIG_DISCONTIGMEM
+#ifndef CONFIG_SPARSEMEM

/* reserve the holes */

@@ -369,6 +344,13 @@ static void __init setup_bootmem(void)
request_resource(res, &data_resource);
}
request_resource(&sysram_resources[0], &pdcdata_resource);
+
+#ifdef CONFIG_SPARSEMEM
+ for (i = 0; i < npmem_ranges; i++) {
+ memory_present(0, pmem_ranges[i].start_pfn,
+ pmem_ranges[i].start_pfn + pmem_ranges[i].pages);
+ }
+#endif
}

static void __init map_pages(unsigned long start_vaddr,
@@ -580,7 +562,7 @@ unsigned long pcxl_dma_start __read_mostly;

void __init mem_init(void)
{
- int codesize, reservedpages, datasize, initsize;
+ int codesize, reservedpages, datasize, initsize, i;

/* Do sanity checks on page table constants */
BUILD_BUG_ON(PTE_ENTRY_SIZE != sizeof(pte_t));
@@ -589,19 +571,27 @@ void __init mem_init(void)
BUILD_BUG_ON(PAGE_SHIFT + BITS_PER_PTE + BITS_PER_PMD + BITS_PER_PGD
> BITS_PER_LONG);

- high_memory = __va((max_pfn << PAGE_SHIFT));
-
-#ifndef CONFIG_DISCONTIGMEM
- max_mapnr = page_to_pfn(virt_to_page(high_memory - 1)) + 1;
totalram_pages += free_all_bootmem();
-#else
- {
- int i;
-
- for (i = 0; i < npmem_ranges; i++)
- totalram_pages += free_all_bootmem_node(NODE_DATA(i));
+ /* free all the ranges not in bootmem */
+ for (i = 1; i < npmem_ranges; i++) {
+ unsigned long pfn = pmem_ranges[i].start_pfn;
+ unsigned long end = pfn + pmem_ranges[i].pages;
+
+ if (end > max_pfn)
+ max_pfn = end;
+
+ for (; pfn < end; pfn++) {
+ struct page *page = pfn_to_page(pfn);
+ ClearPageReserved(page);
+ init_page_count(page);
+ __free_page(page);
+ totalram_pages++;
+ }
}
-#endif
+
+ max_low_pfn = max_pfn;
+ high_memory = __va((max_pfn << PAGE_SHIFT));
+ max_mapnr = page_to_pfn(virt_to_page(high_memory - 1)) + 1;

codesize = (unsigned long)_etext - (unsigned long)_text;
datasize = (unsigned long)_edata - (unsigned long)_etext;
@@ -610,24 +600,15 @@ void __init mem_init(void)
reservedpages = 0;
{
unsigned long pfn;
-#ifdef CONFIG_DISCONTIGMEM
- int i;
-
- for (i = 0; i < npmem_ranges; i++) {
- for (pfn = node_start_pfn(i); pfn < node_end_pfn(i); pfn++) {
- if (PageReserved(pfn_to_page(pfn)))
- reservedpages++;
- }
- }
-#else /* !CONFIG_DISCONTIGMEM */
for (pfn = 0; pfn < max_pfn; pfn++) {
/*
* Only count reserved RAM pages
*/
+ if (!pfn_valid(pfn))
+ continue;
if (PageReserved(pfn_to_page(pfn)))
reservedpages++;
}
-#endif
}

#ifdef CONFIG_PA11
@@ -680,78 +661,6 @@ void __init mem_init(void)
unsigned long *empty_zero_page __read_mostly;
EXPORT_SYMBOL(empty_zero_page);

-void show_mem(unsigned int filter)
-{
- int i,free = 0,total = 0,reserved = 0;
- int shared = 0, cached = 0;
-
- printk(KERN_INFO "Mem-info:\n");
- show_free_areas();
-#ifndef CONFIG_DISCONTIGMEM
- i = max_mapnr;
- while (i-- > 0) {
- total++;
- if (PageReserved(mem_map+i))
- reserved++;
- else if (PageSwapCache(mem_map+i))
- cached++;
- else if (!page_count(&mem_map[i]))
- free++;
- else
- shared += page_count(&mem_map[i]) - 1;
- }
-#else
- for (i = 0; i < npmem_ranges; i++) {
- int j;
-
- for (j = node_start_pfn(i); j < node_end_pfn(i); j++) {
- struct page *p;
- unsigned long flags;
-
- pgdat_resize_lock(NODE_DATA(i), &flags);
- p = nid_page_nr(i, j) - node_start_pfn(i);
-
- total++;
- if (PageReserved(p))
- reserved++;
- else if (PageSwapCache(p))
- cached++;
- else if (!page_count(p))
- free++;
- else
- shared += page_count(p) - 1;
- pgdat_resize_unlock(NODE_DATA(i), &flags);
- }
- }
-#endif
- printk(KERN_INFO "%d pages of RAM\n", total);
- printk(KERN_INFO "%d reserved pages\n", reserved);
- printk(KERN_INFO "%d pages shared\n", shared);
- printk(KERN_INFO "%d pages swap cached\n", cached);
-
-
-#ifdef CONFIG_DISCONTIGMEM
- {
- struct zonelist *zl;
- int i, j;
-
- for (i = 0; i < npmem_ranges; i++) {
- zl = node_zonelist(i, 0);
- for (j = 0; j < MAX_NR_ZONES; j++) {
- struct zoneref *z;
- struct zone *zone;
-
- printk("Zone list for zone %d on node %d: ", j, i);
- for_each_zone_zonelist(zone, z, zl, j)
- printk("[%d/%s] ", zone_to_nid(zone),
- zone->name);
- printk("\n");
- }
- }
- }
-#endif
-}
-
/*
* pagetable_init() sets up the page tables
*
@@ -886,6 +795,9 @@ EXPORT_SYMBOL(map_hpux_gateway_page);
void __init paging_init(void)
{
int i;
+ unsigned long zones_size[MAX_NR_ZONES] = { 0, };
+ unsigned long holes_size[MAX_NR_ZONES] = { 0, };
+ unsigned long mem_start_pfn = ~0UL, mem_end_pfn = 0, mem_size_pfn = 0;

setup_bootmem();
pagetable_init();
@@ -893,27 +805,31 @@ void __init paging_init(void)
flush_cache_all_local(); /* start with known state */
flush_tlb_all_local(NULL);

- for (i = 0; i < npmem_ranges; i++) {
- unsigned long zones_size[MAX_NR_ZONES] = { 0, };
-
- zones_size[ZONE_NORMAL] = pmem_ranges[i].pages;
-
-#ifdef CONFIG_DISCONTIGMEM
- /* Need to initialize the pfnnid_map before we can initialize
- the zone */
- {
- int j;
- for (j = (pmem_ranges[i].start_pfn >> PFNNID_SHIFT);
- j <= ((pmem_ranges[i].start_pfn + pmem_ranges[i].pages) >> PFNNID_SHIFT);
- j++) {
- pfnnid_map[j] = i;
- }
- }
-#endif
+ /*
+ * from here, the kernel and all of the physical memory is
+ * fully covered with page table entries. This is required
+ * because sparse_init() is very memory greedy and will fall
+ * off the end of the kernel initial page mapping.
+ */
+
+ sparse_init();

- free_area_init_node(i, zones_size,
- pmem_ranges[i].start_pfn, NULL);
+ for (i = 0; i < npmem_ranges; i++) {
+ unsigned long start = pmem_ranges[i].start_pfn;
+ unsigned long size = pmem_ranges[i].pages;
+ unsigned long end = start + size;
+
+ if (mem_start_pfn > start)
+ mem_start_pfn = start;
+ if (mem_end_pfn < end)
+ mem_end_pfn = end;
+ mem_size_pfn += size;
}
+
+ zones_size[ZONE_NORMAL] = mem_end_pfn - mem_start_pfn;
+ holes_size[ZONE_NORMAL] = zones_size[ZONE_NORMAL] - mem_size_pfn;
+
+ free_area_init_node(0, zones_size, mem_start_pfn, holes_size);
}

#ifdef CONFIG_PA20

2011-04-24 01:59:55

by Michael Schmitz

[permalink] [raw]
Subject: Re: [patch] m68k, mm: set all online nodes in N_NORMAL_MEMORY

David Rientjes wrote:
> For m68k, N_NORMAL_MEMORY represents all nodes that have present memory
> since it does not support HIGHMEM. This patch sets the bit at the time
> the node is brought online.
>
> If N_NORMAL_MEMORY is not accurate, slub may encounter errors since it
> uses this nodemask to setup per-cache kmem_cache_node data structures.
>
> Signed-off-by: David Rientjes <[email protected]>
> ---
> arch/m68k/mm/init_mm.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/arch/m68k/mm/init_mm.c b/arch/m68k/mm/init_mm.c
> --- a/arch/m68k/mm/init_mm.c
> +++ b/arch/m68k/mm/init_mm.c
> @@ -59,6 +59,8 @@ void __init m68k_setup_node(int node)
> }
> #endif
> pg_data_map[node].bdata = bootmem_node_data + node;
> + if (node_present_pages(node))
> + node_set_state(node, N_NORMAL_MEMORY);
> node_set_online(node);
> }
>
>
As Andreas pointed out, node_present_pages is set in free_area_init_node
which only gets called at the very end of m68k mm paging_init.

The correct patch would be something like this - the need for the
conditional is perhaps debatable, seeing as we set the pages present
just before node_set_state.

Tested on my ARAnyM test setup so far. I'd like to wait for an
independent kernel image built by Thorsten before I test on the actual
hardware. Sorry but you'll have to restart your build Thorsten :-)

Signed-off-by: Michael Schmitz <[email protected]>
--
diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c
index 02b7a03..b806c19 100644
--- a/arch/m68k/mm/motorola.c
+++ b/arch/m68k/mm/motorola.c
@@ -300,6 +300,8 @@ void __init paging_init(void)
zones_size[ZONE_DMA] = m68k_memory[i].size >> PAGE_SHIFT;
free_area_init_node(i, zones_size,
m68k_memory[i].addr >> PAGE_SHIFT,
NULL);
+ if (node_present_pages(i))
+ node_set_state(i, N_NORMAL_MEMORY);
}
}

2011-04-24 03:15:10

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch] m68k, mm: set all online nodes in N_NORMAL_MEMORY

2011/4/24 Michael Schmitz <[email protected]>:
> David Rientjes wrote:
>>
>> For m68k, N_NORMAL_MEMORY represents all nodes that have present memory
>> since it does not support HIGHMEM. ?This patch sets the bit at the time the
>> node is brought online.
>>
>> If N_NORMAL_MEMORY is not accurate, slub may encounter errors since it
>> uses this nodemask to setup per-cache kmem_cache_node data structures.
>>
>> Signed-off-by: David Rientjes <[email protected]>
>> ---
>> ?arch/m68k/mm/init_mm.c | ? ?2 ++
>> ?1 files changed, 2 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch/m68k/mm/init_mm.c b/arch/m68k/mm/init_mm.c
>> --- a/arch/m68k/mm/init_mm.c
>> +++ b/arch/m68k/mm/init_mm.c
>> @@ -59,6 +59,8 @@ void __init m68k_setup_node(int node)
>> ? ? ? ?}
>> ?#endif
>> ? ? ? ?pg_data_map[node].bdata = bootmem_node_data + node;
>> + ? ? ? if (node_present_pages(node))
>> + ? ? ? ? ? ? ? node_set_state(node, N_NORMAL_MEMORY);
>> ? ? ? ?node_set_online(node);
>> ?}
>>
>
> As Andreas pointed out, node_present_pages is set in free_area_init_node
> which only gets called at the very end of m68k mm paging_init.
>
> The correct patch would be something like this - the need for the
> conditional is perhaps debatable, seeing as we set the pages present just
> before node_set_state.
>
> Tested on my ARAnyM test setup so far. I'd like to wait for an independent
> kernel image built by Thorsten before I test on the actual hardware. Sorry
> but you'll have to restart your build Thorsten :-)
>
> Signed-off-by: Michael Schmitz <[email protected]>
> --
> diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c
> index 02b7a03..b806c19 100644
> --- a/arch/m68k/mm/motorola.c
> +++ b/arch/m68k/mm/motorola.c
> @@ -300,6 +300,8 @@ void __init paging_init(void)
> ? ? ? ? ? ? ? zones_size[ZONE_DMA] = m68k_memory[i].size >> PAGE_SHIFT;
> ? ? ? ? ? ? ? free_area_init_node(i, zones_size,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? m68k_memory[i].addr >> PAGE_SHIFT, NULL);
> + ? ? ? ? ? ? ? ?if (node_present_pages(i))
> + ? ? ? ? ? ? ? ? ? ? ? ?node_set_state(i, N_NORMAL_MEMORY);
> ? ? ? }
> }

I think you are right. however I doubt m68k need to care memoryless node check.
probably following patch just work.

------------------------------------
diff --git a/arch/m68k/mm/init_mm.c b/arch/m68k/mm/init_mm.c
--- a/arch/m68k/mm/init_mm.c
+++ b/arch/m68k/mm/init_mm.c
@@ -59,6 +59,8 @@ void __init m68k_setup_node(int node)
? ? ? ?}
?#endif
? ? ? ?pg_data_map[node].bdata = bootmem_node_data + node;
? ? ? ?node_set_online(node);
+? ? ? node_set_state(node, N_NORMAL_MEMORY);


Maybe, the correct solution is to use CONFIG_ARCH_POPULATES_NODE_MAP feature.
the usage is simple, adding add_active_range() and replacing
free_area_init_node() with free_area_init_nodes().

Thanks.

2011-04-24 04:01:34

by Michael Schmitz

[permalink] [raw]
Subject: Re: [patch] m68k, mm: set all online nodes in N_NORMAL_MEMORY

KOSAKI Motohiro wrote:
> 2011/4/24 Michael Schmitz <[email protected]>:
>
>> David Rientjes wrote:
>>
>>> For m68k, N_NORMAL_MEMORY represents all nodes that have present memory
>>> since it does not support HIGHMEM. This patch sets the bit at the time the
>>> node is brought online.
>>>
>>> If N_NORMAL_MEMORY is not accurate, slub may encounter errors since it
>>> uses this nodemask to setup per-cache kmem_cache_node data structures.
>>>
>>> Signed-off-by: David Rientjes <[email protected]>
>>> ---
>>> arch/m68k/mm/init_mm.c | 2 ++
>>> 1 files changed, 2 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/arch/m68k/mm/init_mm.c b/arch/m68k/mm/init_mm.c
>>> --- a/arch/m68k/mm/init_mm.c
>>> +++ b/arch/m68k/mm/init_mm.c
>>> @@ -59,6 +59,8 @@ void __init m68k_setup_node(int node)
>>> }
>>> #endif
>>> pg_data_map[node].bdata = bootmem_node_data + node;
>>> + if (node_present_pages(node))
>>> + node_set_state(node, N_NORMAL_MEMORY);
>>> node_set_online(node);
>>> }
>>>
>>>
>> As Andreas pointed out, node_present_pages is set in free_area_init_node
>> which only gets called at the very end of m68k mm paging_init.
>>
>> The correct patch would be something like this - the need for the
>> conditional is perhaps debatable, seeing as we set the pages present just
>> before node_set_state.
>>
>> Tested on my ARAnyM test setup so far. I'd like to wait for an independent
>> kernel image built by Thorsten before I test on the actual hardware. Sorry
>> but you'll have to restart your build Thorsten :-)
>>
>> Signed-off-by: Michael Schmitz <[email protected]>
>> --
>> diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c
>> index 02b7a03..b806c19 100644
>> --- a/arch/m68k/mm/motorola.c
>> +++ b/arch/m68k/mm/motorola.c
>> @@ -300,6 +300,8 @@ void __init paging_init(void)
>> zones_size[ZONE_DMA] = m68k_memory[i].size >> PAGE_SHIFT;
>> free_area_init_node(i, zones_size,
>> m68k_memory[i].addr >> PAGE_SHIFT, NULL);
>> + if (node_present_pages(i))
>> + node_set_state(i, N_NORMAL_MEMORY);
>> }
>> }
>>
>
> I think you are right. however I doubt m68k need to care memoryless node check.
> probably following patch just work.
>
> ------------------------------------
> diff --git a/arch/m68k/mm/init_mm.c b/arch/m68k/mm/init_mm.c
> --- a/arch/m68k/mm/init_mm.c
> +++ b/arch/m68k/mm/init_mm.c
> @@ -59,6 +59,8 @@ void __init m68k_setup_node(int node)
> }
> #endif
> pg_data_map[node].bdata = bootmem_node_data + node;
> node_set_online(node);
> + node_set_state(node, N_NORMAL_MEMORY);
>
It does in fact - that's what I had as a workaround before.

Memory node information is provided by the respective m68k bootstrap so
memoryless nodes won't be possible indeed.
> Maybe, the correct solution is to use CONFIG_ARCH_POPULATES_NODE_MAP feature.
> the usage is simple, adding add_active_range() and replacing
> free_area_init_node() with free_area_init_nodes().
>
Either of the patches should work as long as there is not a potential
problem with setting the node state before setting node_present_pages.
Your suggestion may work as well but I cannot judge whether there may be
side effects in relation to special case handling of node zero on m68k.
Do we need to call add_active_range at the time init_bootmem_node is
called? That would be a bit earlier in paging_init, and achieve much the
same thing as the proposed patch or the earlier workaround.

Besides, your solution would set N_HIGH_MEMORY not N_NORMAL_MEMORY if I
read free_area_init_nodes() right. Does that matter?

Cheers,

Michael

2011-04-24 05:05:22

by Michael Schmitz

[permalink] [raw]
Subject: Re: [patch] m68k, mm: set all online nodes in N_NORMAL_MEMORY

Michael Schmitz wrote:
> Besides, your solution would set N_HIGH_MEMORY not N_NORMAL_MEMORY if
> I read free_area_init_nodes() right. Does that matter?
check_for_regular_memory() does take care of that - so it would be all
the same in the end. Whether it's worth it for a handful of nodes at
best I'm unsure. I'll leave that for Geert to decide.

Cheers,

Michael

2011-04-24 06:04:33

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch] m68k, mm: set all online nodes in N_NORMAL_MEMORY

2011/4/24 Michael Schmitz <[email protected]>:
> Michael Schmitz wrote:
>>
>> Besides, your solution would set N_HIGH_MEMORY not N_NORMAL_MEMORY if I
>> read free_area_init_nodes() right. Does that matter?
>
> check_for_regular_memory() does take care of that - so it would be all the
> same in the end. Whether it's worth it for a handful of nodes at best I'm
> unsure. I'll leave that for Geert to decide.

check_for_regular_memory() doesn't. It is CONFIG_HIGHMEM specific.
The trick is in node_status definition. N_NORMAL_MEMORY and
N_HIGHMEMORY share the same value. then
node_set_state(nid, N_HIGH_MEMORY) turn on both normal and
highmem bitmask.

--------------------------------------------------------------------
enum node_states {
N_POSSIBLE, /* The node could become online at some point */
N_ONLINE, /* The node is online */
N_NORMAL_MEMORY, /* The node has regular memory */
#ifdef CONFIG_HIGHMEM
N_HIGH_MEMORY, /* The node has regular or high memory */
#else
N_HIGH_MEMORY = N_NORMAL_MEMORY,
#endif
N_CPU, /* The node has one or more cpus */
NR_NODE_STATES
};

2011-04-24 11:51:34

by Thorsten Glaser

[permalink] [raw]
Subject: Re: [patch] m68k, mm: set all online nodes in N_NORMAL_MEMORY

Michael Schmitz dixit:

> Tested on my ARAnyM test setup so far. I'd like to wait for an independent
> kernel image built by Thorsten before I test on the actual hardware. Sorry but
> you'll have to restart your build Thorsten :-)

Heh okay, if you want. But only cross-compiling :D


KOSAKI Motohiro dixit:

>I think you are right. however I doubt m68k need to care memoryless node check.
>probably following patch just work.

That’s what we have now. Compiles, boots, works fine.


So, which of these do you guys want?

bye,
//mirabilos
--
13:47⎜<tobiasu> if i were omnipotent, i would divide by zero
all day long ;)
(thinking about http://lobacevski.tumblr.com/post/3260866481 by waga)

2011-04-24 12:19:35

by Thorsten Glaser

[permalink] [raw]
Subject: Re: [patch] m68k, mm: set all online nodes in N_NORMAL_MEMORY

Michael Schmitz dixit:

> Tested on my ARAnyM test setup so far. I'd like to wait for an independent
> kernel image built by Thorsten before I test on the actual hardware. Sorry but

ARAnyM too, but both this patch and the earlier one compile and boot
(on atari) and I’m building packages under both of them (two ARAnyM
instances) right now. Toolchain used:

binutils-m68k-linux-gnu 2.21.51.20110419-2
gcc-4.4-m68k-linux-gnu 4.4.6-2+m68k.1

(i.e. sid latest; gcc with only a patch to keep building libgcc2,
since we don’t have gcc-4.5 on m68k yet)

bye,
//mirabilos
--
13:47⎜<tobiasu> if i were omnipotent, i would divide by zero
all day long ;)
(thinking about http://lobacevski.tumblr.com/post/3260866481 by waga)

2011-04-24 16:27:25

by John David Anglin

[permalink] [raw]
Subject: Re: [PATCH] convert parisc to sparsemem (was Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards)

On Sat, 23 Apr 2011, James Bottomley wrote:

> The boot sequence got a few seconds slower because now all of the loops
> over our pfn ranges actually have to skip through the holes (which takes
> time for 64GB).

On my rp3440, the biggest gap seems to be 265GB:

dave@mx3210:~$ cat /proc/iomem
00000000-3fffffff : System RAM
00000000-000009ff : PDC data (Page Zero)
00100000-004acfff : Kernel code
004ad000-00661fff : Kernel data
40000000-4fffffff : IOVA Space
100000000-27fdfffff : System RAM
4040000000-40ffffffff : System RAM

Dave
--
J. David Anglin [email protected]
National Research Council of Canada (613) 990-0752 (FAX: 952-6602)

2011-04-25 02:41:50

by Michael Schmitz

[permalink] [raw]
Subject: Re: [patch] m68k, mm: set all online nodes in N_NORMAL_MEMORY

Hello Thorsten,
>> Tested on my ARAnyM test setup so far. I'd like to wait for an independent
>> kernel image built by Thorsten before I test on the actual hardware. Sorry but
>> you'll have to restart your build Thorsten :-)
>>
>
> Heh okay, if you want. But only cross-compiling :D
>
Thought so. That's fine actually. Your cross toolchain will be a lot
more recent than mine.
>
> KOSAKI Motohiro dixit:
>
>
>> I think you are right. however I doubt m68k need to care memoryless node check.
>> probably following patch just work.
>>
>
> That’s what we have now. Compiles, boots, works fine.
>
>
> So, which of these do you guys want?
>
Whatever is cleaner and easier to understand. I'm a poor judge of code
elegance.

Cheers,

Michael

2011-04-25 14:28:23

by Thorsten Glaser

[permalink] [raw]
Subject: Re: [patch] m68k, mm: set all online nodes in N_NORMAL_MEMORY

Michael Schmitz dixit:

> Thought so. That's fine actually. Your cross toolchain will be a lot more
> recent than mine.

Just these tools in the exact versions I’d use natively, too:

binutils-m68k-linux-gnu 2.21.51.20110419-2
gcc-4.4-m68k-linux-gnu 4.4.6-2+m68k.1

>> So, which of these do you guys want?
>>
> Whatever is cleaner and easier to understand. I'm a poor judge of code
> elegance.

I’d say the “correcter” one, if m68k would have to care about
memoryless nodes in the future (not sure whether it’s possible
for that to happen). But I’ve added your two-liner to Debian now.

bye,
//mirabilos
--
13:47⎜<tobiasu> if i were omnipotent, i would divide by zero
all day long ;)
(thinking about http://lobacevski.tumblr.com/post/3260866481 by waga)

2011-04-25 20:56:55

by David Rientjes

[permalink] [raw]
Subject: Re: [patch] m68k, mm: set all online nodes in N_NORMAL_MEMORY

On Sun, 24 Apr 2011, Michael Schmitz wrote:

> diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c
> index 02b7a03..b806c19 100644
> --- a/arch/m68k/mm/motorola.c
> +++ b/arch/m68k/mm/motorola.c
> @@ -300,6 +300,8 @@ void __init paging_init(void)
> zones_size[ZONE_DMA] = m68k_memory[i].size >> PAGE_SHIFT;
> free_area_init_node(i, zones_size,
> m68k_memory[i].addr >> PAGE_SHIFT, NULL);
> + if (node_present_pages(i))
> + node_set_state(i, N_NORMAL_MEMORY);
> }
> }
>

Ok, would you like to write a changelog for this similar to mine and then
propose it as an alternative?

Thanks!

2011-04-26 00:32:18

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] convert parisc to sparsemem (was Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards)

Hi James,

% make CROSS_COMPILE=hppa64-linux- ARCH=parisc
CHK include/linux/version.h
CHK include/generated/utsrelease.h
CC arch/parisc/kernel/asm-offsets.s
In file included from include/linux/topology.h:32:0,
from include/linux/sched.h:78,
from arch/parisc/kernel/asm-offsets.c:31:
include/linux/mmzone.h:916:27: fatal error: asm/sparsemem.h: No such file or directory

Parhaps, you forgot to quilt add?


> This is the preliminary conversion. It's very nasty on parisc because
> the memory allocation isn't symmetric anymore: under DISCONTIGMEM, we
> push all memory into bootmem and then let free_all_bootmem() do the
> magic for us; now we have to do separate initialisations for ranges
> because SPARSEMEM can't do multi-range boot memory. It's also got the
> horrible hack that I only use the first found range for bootmem. I'm
> not sure if this is correct (it won't be if the first found range can be
> under about 50MB because we'll run out of bootmem during boot) ... we
> might have to sort the ranges and use the larges, but that will involve
> us in even more hackery around the bootmem reservations code.
>
> The boot sequence got a few seconds slower because now all of the loops
> over our pfn ranges actually have to skip through the holes (which takes
> time for 64GB).
>
> All in all, I've not been very impressed with SPARSEMEM over
> DISCONTIGMEM. It seems to have a lot of rough edges (necessitating
> exception code) which DISCONTIGMEM just copes with.
>
> And before you say the code is smaller, that's because I converted us to
> generic show_mem().

Cool! I hoped to remove arch specific show_mem() long time.


And, nitpick comment.

Could you please use #ifdef CONFIG_FLAGMEM instead #ifndef CONFIG_SPARSEMEM?
MM gyes parse '#ifndef CONFIG_SPARSEMEM' as valid-both-flatmem-and-discontigmem.
but this code isn't.

If my quick grep is correct, all of your #ifndef SPARSEMEM can be converted
#ifdef FALTMEM.



2011-04-26 02:52:08

by Michael Schmitz

[permalink] [raw]
Subject: Re: [patch] m68k, mm: set all online nodes in N_NORMAL_MEMORY

David Rientjes wrote:
> Ok, would you like to write a changelog for this similar to mine and then
> propose it as an alternative?
>
> Thanks!
>
Will this one do?

Cheers,

Michael


commit 99b9c43cfb18a8e2222e9ef80b04a5c3e1dad254
Author: Michael Schmitz <[email protected]>
Date: Tue Apr 26 14:51:54 2011 +1200

[m68k] For m68k, N_NORMAL_MEMORY represents all nodes that have
present memory
since it does not support HIGHMEM. This patch sets the bit at
the time
node_present_pages has been set by free_area_init_node.
At the time the node is brought online, the the node state would
have to
be done unconditionally since information about present memory
has not yet
been recorded.

If N_NORMAL_MEMORY is not accurate, slub may encounter errors
since it
uses this nodemask to setup per-cache kmem_cache_node data
structures.

This pach is an alternative to the one proposed by David
Rientjes <[email protected]>
attempting to set node state immediately when bringing the node
online.

Signed-off-by: Michael Schmitz <[email protected]>
---
arch/m68k/mm/motorola.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c
index 02b7a03..b806c19 100644
--- a/arch/m68k/mm/motorola.c
+++ b/arch/m68k/mm/motorola.c
@@ -300,6 +300,8 @@ void __init paging_init(void)
zones_size[ZONE_DMA] = m68k_memory[i].size >> PAGE_SHIFT;
free_area_init_node(i, zones_size,
m68k_memory[i].addr >> PAGE_SHIFT,
NULL);
+ if (node_present_pages(i))
+ node_set_state(i, N_NORMAL_MEMORY);
}
}

2011-04-26 10:51:52

by Thorsten Glaser

[permalink] [raw]
Subject: Re: [patch] m68k, mm: set all online nodes in N_NORMAL_MEMORY

Michael Schmitz dixit:

> be done unconditionally since information about present memory has not
> yet
> been recorded.
> If N_NORMAL_MEMORY is not accurate, slub may encounter
> errors since it

Hrm…

> @@ -300,6 +300,8 @@ void __init paging_init(void)
> zones_size[ZONE_DMA] = m68k_memory[i].size >> PAGE_SHIFT;
> free_area_init_node(i, zones_size,
> m68k_memory[i].addr >> PAGE_SHIFT, NULL);
> + if (node_present_pages(i))
> + node_set_state(i, N_NORMAL_MEMORY);
> }
> }

No, this has whitespace problems (tabs are expanded to spaces).

bye,
//mirabilos
--
13:47⎜<tobiasu> if i were omnipotent, i would divide by zero
all day long ;)
(thinking about http://lobacevski.tumblr.com/post/3260866481 by waga)

2011-04-27 13:19:41

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [patch] m68k, mm: set all online nodes in N_NORMAL_MEMORY

On Tue, Apr 26, 2011 at 12:50, Thorsten Glaser <[email protected]> wrote:
> Michael Schmitz dixit:
>
>>       be done unconditionally since information about present memory has not
>> yet
>>       been recorded.
>>                    If N_NORMAL_MEMORY is not accurate, slub may encounter
>> errors since it
>
> Hrm…
>
>> @@ -300,6 +300,8 @@ void __init paging_init(void)
>>               zones_size[ZONE_DMA] = m68k_memory[i].size >> PAGE_SHIFT;
>>               free_area_init_node(i, zones_size,
>>                                   m68k_memory[i].addr >> PAGE_SHIFT, NULL);
>> +                if (node_present_pages(i))
>> +                        node_set_state(i, N_NORMAL_MEMORY);
>>       }
>> }
>
> No, this has whitespace problems (tabs are expanded to spaces).

Fixed those up, applied, and will send to Linus for 2.6.39-final.

BTW, if I enable CONFIG_SLUB on my ARAnyM setup, I don't get a crash, but it
hangs after:

| INIT: version 2.86 booting

With Michael's patch, it continues fine.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

2011-04-27 15:16:41

by Thorsten Glaser

[permalink] [raw]
Subject: Re: [patch] m68k, mm: set all online nodes in N_NORMAL_MEMORY

Geert Uytterhoeven dixit:

>Fixed those up, applied, and will send to Linus for 2.6.39-final.

Tested-by: Thorsten Glaser <[email protected]>

>BTW, if I enable CONFIG_SLUB on my ARAnyM setup, I don't get a crash, but it
>hangs after:

I think that’s toolchain, environment, etc. dependent…

>With Michael's patch, it continues fine.

Yeah, same here.

bye,
//mirabilos
--
FWIW, I'm quite impressed with mksh interactively. I thought it was much
*much* more bare bones. But it turns out it beats the living hell out of
ksh93 in that respect. I'd even consider it for my daily use if I hadn't
wasted half my life on my zsh setup. :-) -- Frank Terbeck in #!/bin/mksh

2011-04-27 16:36:56

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] convert parisc to sparsemem (was Re: [PATCH v3] mm: make expand_downwards symmetrical to expand_upwards)

On Sat, 2011-04-23 at 13:34 -0500, James Bottomley wrote:
> This is the preliminary conversion. It's very nasty on parisc because
> the memory allocation isn't symmetric anymore: under DISCONTIGMEM, we
> push all memory into bootmem and then let free_all_bootmem() do the
> magic for us;

Urg, that's unfortunate. I bet we could fairly easily teach the bootmem
allocator to allow a couple of bootmem_data's to hang off of an
individual pgdat. Put each pmem_ranges in one of those instead of a
pgdat. That would at least help with the bitmap size explosion and
extra loops.

> now we have to do separate initialisations for ranges
> because SPARSEMEM can't do multi-range boot memory. It's also got the
> horrible hack that I only use the first found range for bootmem. I'm
> not sure if this is correct (it won't be if the first found range can be
> under about 50MB because we'll run out of bootmem during boot) ... we
> might have to sort the ranges and use the larges, but that will involve
> us in even more hackery around the bootmem reservations code.
>
> The boot sequence got a few seconds slower because now all of the loops
> over our pfn ranges actually have to skip through the holes (which takes
> time for 64GB).

Which iterations were these, btw? All of the ones I saw the patch touch
seemed to be running over just a single pmem_range.

> All in all, I've not been very impressed with SPARSEMEM over
> DISCONTIGMEM. It seems to have a lot of rough edges (necessitating
> exception code) which DISCONTIGMEM just copes with.

We definitely need to look at extending it to cover bootmem-time a bit.
Is that even worth it these days with the no-bootmem bits around?

-- Dave

2011-04-28 03:05:16

by Michael Schmitz

[permalink] [raw]
Subject: Re: [patch] m68k, mm: set all online nodes in N_NORMAL_MEMORY

Hi Geert,

>> No, this has whitespace problems (tabs are expanded to spaces).
>
> Fixed those up, applied, and will send to Linus for 2.6.39-final.

Thanks, I have no idea how that happened.
>
> BTW, if I enable CONFIG_SLUB on my ARAnyM setup, I don't get a crash, but it
> hangs after:
>
> | INIT: version 2.86 booting
>
> With Michael's patch, it continues fine.

It may just need a long time to crash - forcing e2fsck is a good way
to speed up the crash usually.

May depend on how much RAM you give to ARAnyM, on the kernel size and
any number of other parameters.

Cheers,

Michael