2006-05-09 07:03:45

by Michael Ellerman

[permalink] [raw]
Subject: [PATCH] SPARSEMEM + NUMA can't handle unaligned memory regions?

I can't believe I'm the first person to see this, so I imagine I'm missing
something. Perhaps it's only an issue on powerpc?

I have a machine with some memory at 0, then a hole, and then some more memory
which doesn't start on a section boundary. This is causing the following
crash:

add_region nid 1 start_pfn 0x77c0 pages 0x840
add_region nid 1 start_pfn 0x0 pages 0x6000

...

Unable to handle kernel paging request for data at address 0x00002430
Faulting instruction address: 0xc0000000004f2940
cpu 0x4: Vector: 300 (Data Access) at [c000000000737aa0]
pc: c0000000004f2940: .__alloc_bootmem_node+0x28/0x7c
lr: c0000000000a47a0: .sparse_init+0xa8/0x138
sp: c000000000737d20
msr: 8000000000001032
dar: 2430
dsisr: 40000000
current = 0xc000000000538410
paca = 0xc000000000539780
pid = 0, comm = swapper
enter ? for help
4:mon> r
R00 = c0000000000a47a0 R16 = 0000000005ff5000
R01 = c000000000737d20 R17 = 0000000000000004
R02 = c0000000007331e0 R18 = 00000000100d0000
R03 = 0000000000000000 R19 = 00000000100b0000
R04 = 0000000000038000 R20 = 00000000100d0000
R05 = 0000000000000080 R21 = 0000000010070000

The root cause is that we have no memory at pfn 7000 and so early_pfn_to_nid()
is giving us back -1 in sparse_early_mem_map_alloc(). We then pass -1 to
NODE_DATA() which gets us NULL, and hence __alloc_bootmem_node() explodes.

AFAICT there's no logic to prevent us creating sections with no zeroth page,
and in fact my box is doing it. Therefore it's not valid to assume we can
get the nid from the zeroth page in a section. All we know is that there's
one or more pages in that section for which early_pfn_to_nid() will work.

So I came up with this hack. Loop through all pages in the section until
we get a valid nid, this should always work.

We also call early_pfn_to_nid() in node_memmap_size_bytes(), but I didn't
touch that because it's not used on powerpc so I can't test it.

With this patch my machine boots and seems to be happy.

cheers

Signed-off-by: Michael Ellerman <[email protected]>
---

mm/sparse.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)

Index: to-merge/mm/sparse.c
===================================================================
--- to-merge.orig/mm/sparse.c
+++ to-merge/mm/sparse.c
@@ -172,10 +172,26 @@ static int sparse_init_one_section(struc
return 1;
}

+static int sparse_section_nr_to_nid(unsigned long pnum)
+{
+ unsigned long pfn = section_nr_to_pfn(pnum);
+ int i, nid;
+
+ for (i = 0; i < PAGES_PER_SECTION; i++) {
+ nid = early_pfn_to_nid(pfn + i);
+ if (nid != -1)
+ break;
+ }
+
+ BUG_ON(nid == -1);
+
+ return nid;
+}
+
static struct page *sparse_early_mem_map_alloc(unsigned long pnum)
{
struct page *map;
- int nid = early_pfn_to_nid(section_nr_to_pfn(pnum));
+ int nid = sparse_section_nr_to_nid(pnum);
struct mem_section *ms = __nr_to_section(pnum);

map = alloc_remap(nid, sizeof(struct page) * PAGES_PER_SECTION);


2006-05-09 08:32:56

by Andy Whitcroft

[permalink] [raw]
Subject: Re: [PATCH] SPARSEMEM + NUMA can't handle unaligned memory regions?

Michael Ellerman wrote:
> I can't believe I'm the first person to see this, so I imagine I'm missing
> something. Perhaps it's only an issue on powerpc?
>
> I have a machine with some memory at 0, then a hole, and then some more memory
> which doesn't start on a section boundary. This is causing the following
> crash:
>
> add_region nid 1 start_pfn 0x77c0 pages 0x840
> add_region nid 1 start_pfn 0x0 pages 0x6000

Nasty, could you send me your full boot log and your config and I'll
have a look at it. I can say this code has been booted on a lot of
power boxes and I've never seen that before! :)

Anyhow, will take a look and see if we can avoid iterating over the area
to find it.

-apw

2006-05-09 09:11:20

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH] SPARSEMEM + NUMA can't handle unaligned memory regions?

On Tue, 2006-05-09 at 09:32 +0100, Andy Whitcroft wrote:
> Michael Ellerman wrote:
> > I can't believe I'm the first person to see this, so I imagine I'm missing
> > something. Perhaps it's only an issue on powerpc?
> >
> > I have a machine with some memory at 0, then a hole, and then some more memory
> > which doesn't start on a section boundary. This is causing the following
> > crash:
> >
> > add_region nid 1 start_pfn 0x77c0 pages 0x840
> > add_region nid 1 start_pfn 0x0 pages 0x6000
>
> Nasty, could you send me your full boot log and your config and I'll
> have a look at it. I can say this code has been booted on a lot of
> power boxes and I've never seen that before! :)

Ah yeah, I seem to have neglected to mention it's a kdump boot :} Sorry.
That's why we get the strangely aligned memory sections. The section
starting at 0 is for the kdump kernel, the bit at 0x77c0 covers some
firmware that's allocated there by the first kernel.

I can still give you a .config and log if you want it, but not 'til
tomorrow morning.

> Anyhow, will take a look and see if we can avoid iterating over the area
> to find it.

Yeah it's a bit nasty having the loop. I figure on most configs it'll
pop on the first iteration, but still.

cheers

--
Michael Ellerman
IBM OzLabs

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person


Attachments:
signature.asc (191.00 B)
This is a digitally signed message part

2006-05-09 10:24:52

by Andy Whitcroft

[permalink] [raw]
Subject: Re: [PATCH] SPARSEMEM + NUMA can't handle unaligned memory regions?

Michael Ellerman wrote:
> On Tue, 2006-05-09 at 09:32 +0100, Andy Whitcroft wrote:
>
>>Michael Ellerman wrote:
>>
>>>I can't believe I'm the first person to see this, so I imagine I'm missing
>>>something. Perhaps it's only an issue on powerpc?
>>>
>>>I have a machine with some memory at 0, then a hole, and then some more memory
>>>which doesn't start on a section boundary. This is causing the following
>>>crash:
>>>
>>>add_region nid 1 start_pfn 0x77c0 pages 0x840
>>>add_region nid 1 start_pfn 0x0 pages 0x6000
>>
>>Nasty, could you send me your full boot log and your config and I'll
>>have a look at it. I can say this code has been booted on a lot of
>>power boxes and I've never seen that before! :)
>
>
> Ah yeah, I seem to have neglected to mention it's a kdump boot :} Sorry.
> That's why we get the strangely aligned memory sections. The section
> starting at 0 is for the kdump kernel, the bit at 0x77c0 covers some
> firmware that's allocated there by the first kernel.
>
> I can still give you a .config and log if you want it, but not 'til
> tomorrow morning.

Heh. Well at least that makes more sense. It would be good to have
them generally. But no I'm not scared of where this odd alignment is
coming from I can look at what we can do about it.

Thanks.

-apw

2006-05-09 13:35:21

by Andy Whitcroft

[permalink] [raw]
Subject: Re: [PATCH] SPARSEMEM + NUMA can't handle unaligned memory regions?

sparsemem record nid during memory present

Record the node id as we mark sections for instantiation. Use this
nid during instantiation to direct allocations.

Signed-off-by: Andy Whitcroft <[email protected]>
---
sparse.c | 19 +++++++++++++++++--
1 files changed, 17 insertions(+), 2 deletions(-)
diff -upN reference/mm/sparse.c current/mm/sparse.c
--- reference/mm/sparse.c
+++ current/mm/sparse.c
@@ -102,6 +102,20 @@ int __section_nr(struct mem_section* ms)
return (root_nr * SECTIONS_PER_ROOT) + (ms - root);
}

+/*
+ * During early boot we need to record the nid from which we will
+ * later allocate the section mem_map. Encode this into the section
+ * pointer. Overload the section_mem_map with this information.
+ */
+static inline unsigned long sparse_encode_early_nid(int nid)
+ return (nid << SECTION_MAP_LAST_BIT);
+}
+
+static inline int sparse_early_nid(struct mem_section *section)
+ unsigned long nid = section->section_mem_map;
+ return (nid >> SECTION_MAP_LAST_BIT);
+}
+
/* Record a memory area against a node. */
void memory_present(int nid, unsigned long start, unsigned long end)
{
@@ -116,7 +130,8 @@ void memory_present(int nid, unsigned lo

ms = __nr_to_section(section);
if (!ms->section_mem_map)
- ms->section_mem_map = SECTION_MARKED_PRESENT;
+ ms->section_mem_map = sparse_encode_early_nid(nid) |
+ SECTION_MARKED_PRESENT;
}
}

@@ -175,8 +190,8 @@ static int sparse_init_one_section(struc
static struct page *sparse_early_mem_map_alloc(unsigned long pnum)
{
struct page *map;
- int nid = early_pfn_to_nid(section_nr_to_pfn(pnum));
struct mem_section *ms = __nr_to_section(pnum);
+ int nid = sparse_early_nid(ms);

map = alloc_remap(nid, sizeof(struct page) * PAGES_PER_SECTION);
if (map)


Attachments:
sparsemem-record-nid-during-memory-present (1.72 kB)

2006-05-09 14:28:32

by Andy Whitcroft

[permalink] [raw]
Subject: Re: [PATCH] SPARSEMEM + NUMA can't handle unaligned memory regions?

sparsemem record nid during memory present

Record the node id as we mark sections for instantiation. Use this
nid during instantiation to direct allocations.

Signed-off-by: Andy Whitcroft <[email protected]>
---
sparse.c | 22 ++++++++++++++++++++--
1 files changed, 20 insertions(+), 2 deletions(-)
diff -upN reference/mm/sparse.c current/mm/sparse.c
--- reference/mm/sparse.c
+++ current/mm/sparse.c
@@ -102,6 +102,22 @@ int __section_nr(struct mem_section* ms)
return (root_nr * SECTIONS_PER_ROOT) + (ms - root);
}

+/*
+ * During early boot we need to record the nid from which we will
+ * later allocate the section mem_map. Encode this into the section
+ * pointer. Overload the section_mem_map with this information.
+ */
+static inline unsigned long sparse_encode_early_nid(int nid)
+{
+ return (nid << SECTION_MAP_LAST_BIT);
+}
+
+static inline int sparse_early_nid(struct mem_section *section)
+{
+ unsigned long nid = section->section_mem_map;
+ return (nid >> SECTION_MAP_LAST_BIT);
+}
+
/* Record a memory area against a node. */
void memory_present(int nid, unsigned long start, unsigned long end)
{
@@ -116,7 +132,8 @@ void memory_present(int nid, unsigned lo

ms = __nr_to_section(section);
if (!ms->section_mem_map)
- ms->section_mem_map = SECTION_MARKED_PRESENT;
+ ms->section_mem_map = sparse_encode_early_nid(nid) |
+ SECTION_MARKED_PRESENT;
}
}

@@ -167,6 +184,7 @@ static int sparse_init_one_section(struc
if (!valid_section(ms))
return -EINVAL;

+ ms->section_mem_map &= ~SECTION_MAP_MASK;
ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum);

return 1;
@@ -175,8 +193,8 @@ static int sparse_init_one_section(struc
static struct page *sparse_early_mem_map_alloc(unsigned long pnum)
{
struct page *map;
- int nid = early_pfn_to_nid(section_nr_to_pfn(pnum));
struct mem_section *ms = __nr_to_section(pnum);
+ int nid = sparse_early_nid(ms);

map = alloc_remap(nid, sizeof(struct page) * PAGES_PER_SECTION);
if (map)


Attachments:
sparsemem-record-nid-during-memory-present (1.96 kB)

2006-05-09 16:05:15

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH] SPARSEMEM + NUMA can't handle unaligned memory regions?

On Tue, May 09, 2006 at 02:34:51PM +0100, Andy Whitcroft wrote:
> 3) record the nid -- when we record the memory present in the system we
> are passed the nid.
>
> Somehow the last of these seems the most logical given we have the
> correct information at the time we record that we need to instantiate
> the section. So I had a quick go at something which seems to have come
> out pretty clean. Attached is a completly untested patch to show what I
> am proposing.

Looks sane to me. I've always wanted to encode the nid in the section.
But, never had a compelling reason to do so.

With this code in place, we could optimize the pfn_to_nid() routines to
now obtain the nid from the section (rather than page struct). However,
I'm not sure this is worth the effort.

--
Mike

2006-05-09 16:14:59

by Andy Whitcroft

[permalink] [raw]
Subject: Re: [PATCH] SPARSEMEM + NUMA can't handle unaligned memory regions?

mike kravetz wrote:
> On Tue, May 09, 2006 at 02:34:51PM +0100, Andy Whitcroft wrote:
>
>>3) record the nid -- when we record the memory present in the system we
>>are passed the nid.
>>
>>Somehow the last of these seems the most logical given we have the
>>correct information at the time we record that we need to instantiate
>>the section. So I had a quick go at something which seems to have come
>>out pretty clean. Attached is a completly untested patch to show what I
>>am proposing.
>
>
> Looks sane to me. I've always wanted to encode the nid in the section.
> But, never had a compelling reason to do so.
>
> With this code in place, we could optimize the pfn_to_nid() routines to
> now obtain the nid from the section (rather than page struct). However,
> I'm not sure this is worth the effort.

Well its only in there temporarily during init, its not in there once we
have allocated the section mem_map.

-apw

2006-05-09 16:27:57

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] SPARSEMEM + NUMA can't handle unaligned memory regions?

On Tue, 2006-05-09 at 15:28 +0100, Andy Whitcroft wrote:
> +/*
> + * During early boot we need to record the nid from which we will
> + * later allocate the section mem_map. Encode this into the section
> + * pointer. Overload the section_mem_map with this information.
> + */

Andy, this all looks pretty good. Although, it might be nice to
document this a bit more.

First, can you update the mem_section definition comment? It has a nice
explanation of how we use section_mem_map, and it would be a shame to
miss this use.

Also, your comment says when we _record_ the nid information, but not
that it is only _used_ during early boot. I think this is what Mike K.
missed, and it might be good to clarify.

How about something like this:

/*
* During early boot, before section_mem_map is used for an actual
* mem_map, we use section_mem_map to store the section's NUMA
* node. This keeps us from having to use another data structure. The
* node information is cleared just before we store the real mem_map.
*/

-- Dave

2006-05-09 16:31:44

by Andy Whitcroft

[permalink] [raw]
Subject: Re: [PATCH] SPARSEMEM + NUMA can't handle unaligned memory regions?

Dave Hansen wrote:
> On Tue, 2006-05-09 at 15:28 +0100, Andy Whitcroft wrote:
>
>>+/*
>>+ * During early boot we need to record the nid from which we will
>>+ * later allocate the section mem_map. Encode this into the section
>>+ * pointer. Overload the section_mem_map with this information.
>>+ */
>
>
> Andy, this all looks pretty good. Although, it might be nice to
> document this a bit more.
>
> First, can you update the mem_section definition comment? It has a nice
> explanation of how we use section_mem_map, and it would be a shame to
> miss this use.
>
> Also, your comment says when we _record_ the nid information, but not
> that it is only _used_ during early boot. I think this is what Mike K.
> missed, and it might be good to clarify.
>
> How about something like this:
>
> /*
> * During early boot, before section_mem_map is used for an actual
> * mem_map, we use section_mem_map to store the section's NUMA
> * node. This keeps us from having to use another data structure. The
> * node information is cleared just before we store the real mem_map.
> */

Yep sounds very sane. Will update and resend -- best wait and see if it
fixes Michael find its works for him too :).

Thanks.

-apw

2006-05-10 06:06:18

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH] SPARSEMEM + NUMA can't handle unaligned memory regions?

On Tue, 2006-05-09 at 15:28 +0100, Andy Whitcroft wrote:
> Andy Whitcroft wrote:
> > 3) record the nid -- when we record the memory present in the system we
> > are passed the nid.
> >
> > Somehow the last of these seems the most logical given we have the
> > correct information at the time we record that we need to instantiate
> > the section. So I had a quick go at something which seems to have come
> > out pretty clean. Attached is a completly untested patch to show what I
> > am proposing.
>
> Ok. Attached is a version which builds and boots. The patch looks
> pretty simple. Michael could you give it a spin on the broken machine
> for me.

Hey, it's not broken, it's just special :D

That works a treat, boots and otherwise seems fine.

> Signed-off-by: Andy Whitcroft <[email protected]>

> +/*
> + * During early boot we need to record the nid from which we will
> + * later allocate the section mem_map. Encode this into the section
> + * pointer. Overload the section_mem_map with this information.
> + */
> +static inline unsigned long sparse_encode_early_nid(int nid)
> +{
> + return (nid << SECTION_MAP_LAST_BIT);
> +}
> +
> +static inline int sparse_early_nid(struct mem_section *section)
> +{
> + unsigned long nid = section->section_mem_map;
> + return (nid >> SECTION_MAP_LAST_BIT);
> +}

What about just for readability: (in linux/mmzone.h)

#define SECTION_MARKED_PRESENT (1UL<<0)
#define SECTION_HAS_MEM_MAP (1UL<<1)
#define SECTION_MAP_LAST_BIT (1UL<<2)
#define SECTION_NID_SHIFT SECTION_MAP_LAST_BIT

cheers

--
Michael Ellerman
IBM OzLabs

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person


Attachments:
signature.asc (191.00 B)
This is a digitally signed message part