With DISCONTIGMEM, the mapping between a pfn and its owning node is
initialized using data provided by the BIOS or from the command line.
However, the initialization may fail if the extents are not aligned
to a section boundary (64M).

The symptom of this bug is an early boot failure in pfn_to_page(),
as it tries to access NODE_DATA(__nid) using an index from an
uninitialized element of the physnode_map[] array.

While the bug is always present, it is more likely to be hit in kdump
kernels on large machines, because:

1. The memory map for a kdump kernel is specified as exactmap, and
   exactmap is more likely to be unaligned.

2. Large reservations are more likely to span across a 64M boundary.

Signed-off-by: Petr Tesarik <[email protected]>
---
arch/x86/mm/numa_32.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/arch/x86/mm/numa_32.c b/arch/x86/mm/numa_32.c
index 0342d27..f278b04 100644
--- a/arch/x86/mm/numa_32.c
+++ b/arch/x86/mm/numa_32.c
@@ -46,15 +46,16 @@ EXPORT_SYMBOL(physnode_map);

 void memory_present(int nid, unsigned long start, unsigned long end)
 {
-	unsigned long pfn;
+	unsigned long sect, endsect;

 	printk(KERN_INFO "Node: %d, start_pfn: %lx, end_pfn: %lx\n",
 			nid, start, end);
 	printk(KERN_DEBUG " Setting physnode_map array to node %d for pfns:\n", nid);
 	printk(KERN_DEBUG " ");
-	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
-		physnode_map[pfn / PAGES_PER_SECTION] = nid;
-		printk(KERN_CONT "%lx ", pfn);
+	endsect = (end - 1) / PAGES_PER_SECTION;
+	for (sect = start / PAGES_PER_SECTION; sect <= endsect; ++sect) {
+		physnode_map[sect] = nid;
+		printk(KERN_CONT "%lx ", sect * PAGES_PER_SECTION);
 	}
 	printk(KERN_CONT "\n");
 }
--
1.8.4.5

On Fri, 31 Jan 2014, Petr Tesarik wrote:
> With DISCONTIGMEM, the mapping between a pfn and its owning node is
> initialized using data provided by the BIOS or from the command line.
> However, the initialization may fail if the extents are not aligned
> to a section boundary (64M).
>
> The symptom of this bug is an early boot failure in pfn_to_page(),
> as it tries to access NODE_DATA(__nid) using an index from an
> uninitialized element of the physnode_map[] array.
>
> While the bug is always present, it is more likely to be hit in kdump
> kernels on large machines, because:
>
> 1. The memory map for a kdump kernel is specified as exactmap, and
> exactmap is more likely to be unaligned.
>
> 2. Large reservations are more likely to span across a 64M boundary.
>
> Signed-off-by: Petr Tesarik <[email protected]>

What's missing here is how you're trying to fix the issue.

> diff --git a/arch/x86/mm/numa_32.c b/arch/x86/mm/numa_32.c
> index 0342d27..f278b04 100644
> --- a/arch/x86/mm/numa_32.c
> +++ b/arch/x86/mm/numa_32.c
> @@ -46,15 +46,16 @@ EXPORT_SYMBOL(physnode_map);
>
> void memory_present(int nid, unsigned long start, unsigned long end)
> {
> - unsigned long pfn;
> + unsigned long sect, endsect;
>
> printk(KERN_INFO "Node: %d, start_pfn: %lx, end_pfn: %lx\n",
> nid, start, end);
> printk(KERN_DEBUG " Setting physnode_map array to node %d for pfns:\n", nid);
> printk(KERN_DEBUG " ");
> - for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
> - physnode_map[pfn / PAGES_PER_SECTION] = nid;
> - printk(KERN_CONT "%lx ", pfn);
> + endsect = (end - 1) / PAGES_PER_SECTION;
> + for (sect = start / PAGES_PER_SECTION; sect <= endsect; ++sect) {
> + physnode_map[sect] = nid;
> + printk(KERN_CONT "%lx ", sect * PAGES_PER_SECTION);
> }
> printk(KERN_CONT "\n");
> }

This looks more like refactoring than anything else and doesn't make it
clear at all what the fix is.

On 01/31/2014 02:05 AM, Petr Tesarik wrote:
> With DISCONTIGMEM, the mapping between a pfn and its owning node is
> initialized using data provided by the BIOS or from the command line.
> However, the initialization may fail if the extents are not aligned
> to a section boundary (64M).

So is this a problem that shows up with DISCONTIGMEM? Just curious, but
what the heck kind of 32-bit NUMA hardware is still in the wild? Did
someone buy a NUMA-Q on eBay? :)

> void memory_present(int nid, unsigned long start, unsigned long end)
> {
> - unsigned long pfn;
> + unsigned long sect, endsect;
>
> printk(KERN_INFO "Node: %d, start_pfn: %lx, end_pfn: %lx\n",
> nid, start, end);
> printk(KERN_DEBUG " Setting physnode_map array to node %d for pfns:\n", nid);
> printk(KERN_DEBUG " ");
> - for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
> - physnode_map[pfn / PAGES_PER_SECTION] = nid;
> - printk(KERN_CONT "%lx ", pfn);
> + endsect = (end - 1) / PAGES_PER_SECTION;
> + for (sect = start / PAGES_PER_SECTION; sect <= endsect; ++sect) {
> + physnode_map[sect] = nid;
> + printk(KERN_CONT "%lx ", sect * PAGES_PER_SECTION);
> }
> printk(KERN_CONT "\n");
> }

So, if start and end are not aligned to section boundaries, we will miss
setting physnode_map[] for the final section?

For instance, if we have a 64MB section size and try to call
memory_present(32MB -> 96MB), we will set 0->64MB present, but not set
the 64MB->128MB section as present.

Right?
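
To make the 32MB -> 96MB example concrete, here is a minimal userspace
sketch (not kernel code) that replays both loops from the patch on a toy
array. It assumes 4 KiB pages, so a 64MB section covers 16384 pfns.
NR_SECTIONS, MB_TO_PFN(), reset_map(), fill_old() and fill_new() are
hypothetical names made up for this illustration, and -1 stands in for
an uninitialized physnode_map[] entry.

```c
#include <stddef.h>

/* Assumptions for this sketch only: 4 KiB pages, 64 MiB sections. */
#define PAGE_SHIFT        12
#define SECTION_SHIFT     26
#define PAGES_PER_SECTION (1UL << (SECTION_SHIFT - PAGE_SHIFT))
#define NR_SECTIONS       8	/* toy array size, hypothetical */
#define MB_TO_PFN(mb)     ((unsigned long)(mb) << (20 - PAGE_SHIFT))

/* Mark every entry as "uninitialized" (-1). */
static void reset_map(signed char *map)
{
	size_t i;

	for (i = 0; i < NR_SECTIONS; i++)
		map[i] = -1;
}

/* Loop shape before the patch: steps from the (possibly unaligned) start. */
static void fill_old(signed char *map, int nid,
		     unsigned long start, unsigned long end)
{
	unsigned long pfn;

	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
		map[pfn / PAGES_PER_SECTION] = (signed char)nid;
}

/* Loop shape after the patch: iterates over section indices. */
static void fill_new(signed char *map, int nid,
		     unsigned long start, unsigned long end)
{
	unsigned long sect, endsect;

	endsect = (end - 1) / PAGES_PER_SECTION;
	for (sect = start / PAGES_PER_SECTION; sect <= endsect; ++sect)
		map[sect] = (signed char)nid;
}
```

With start = MB_TO_PFN(32) and end = MB_TO_PFN(96), fill_old() writes
only entry 0, because its second step lands exactly on end and the loop
stops, leaving entry 1 at -1. fill_new() writes entries 0 and 1.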
Can you just align 'start' down to the section's start and 'end' up to
the end of the section that contains it? I guess you do that
implicitly, but you should be able to do it without refactoring the for
loop entirely.
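
A rough userspace sketch of that alignment idea, keeping the original
loop shape, could look like the following. It is an illustration under
assumptions, not the posted patch: 4 KiB pages and 64MB sections are
assumed, and section_align_down(), section_align_up() and fill_aligned()
are hypothetical helpers written for this example. In the kernel itself
the round_down()/round_up() macros would be the natural choice.

```c
/* Assumptions for this sketch only: 4 KiB pages, 64 MiB sections. */
#define PAGE_SHIFT        12
#define SECTION_SHIFT     26
#define PAGES_PER_SECTION (1UL << (SECTION_SHIFT - PAGE_SHIFT))

/* Round a pfn down to the first pfn of its section. */
static unsigned long section_align_down(unsigned long pfn)
{
	return pfn & ~(PAGES_PER_SECTION - 1);
}

/*
 * Round a pfn up to the next section boundary (unchanged if already
 * aligned). Assumes pfn is far enough below ULONG_MAX not to overflow.
 */
static unsigned long section_align_up(unsigned long pfn)
{
	return (pfn + PAGES_PER_SECTION - 1) & ~(PAGES_PER_SECTION - 1);
}

/* Original loop, but run over the section-aligned range. */
static void fill_aligned(signed char *map, int nid,
			 unsigned long start, unsigned long end)
{
	unsigned long pfn;

	start = section_align_down(start);
	end = section_align_up(end);
	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
		map[pfn / PAGES_PER_SECTION] = (signed char)nid;
}
```

Because start is now always a multiple of PAGES_PER_SECTION, each loop
step visits exactly one section, and rounding end up makes sure the
section containing the last pfn is included.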
On Fri, 31 Jan 2014 13:14:29 -0800
Dave Hansen <[email protected]> wrote:
> On 01/31/2014 02:05 AM, Petr Tesarik wrote:
> > With DISCONTIGMEM, the mapping between a pfn and its owning node is
> > initialized using data provided by the BIOS or from the command line.
> > However, the initialization may fail if the extents are not aligned
> > to a section boundary (64M).
>
> So is this a problem that shows up with DISCONTIGMEM?

Yes, that's it.

> Just curious, but
> what the heck kind of 32-bit NUMA hardware is still in the wild? Did
> someone buy a NUMA-Q on eBay? :)

In fact, this is a patch that has been floating around in SUSE
Enterprise kernels for some time. It was originally added to pass
certification on IBM SurePOS 700 x4900-785.

When cleaning up our kernel patches, I noticed that the bug is still
present in the upstream kernel, so I posted this patch. While I don't
have any evidence that someone actually needs the fix today, it seems
wrong to leave buggy code in the kernel.

If you all agree that we rip out DISCONTIGMEM instead, I can post
patches to do that and be equally happy. ;-)

> > void memory_present(int nid, unsigned long start, unsigned long end)
> > {
> > - unsigned long pfn;
> > + unsigned long sect, endsect;
> >
> > printk(KERN_INFO "Node: %d, start_pfn: %lx, end_pfn: %lx\n",
> > nid, start, end);
> > printk(KERN_DEBUG " Setting physnode_map array to node %d for pfns:\n", nid);
> > printk(KERN_DEBUG " ");
> > - for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
> > - physnode_map[pfn / PAGES_PER_SECTION] = nid;
> > - printk(KERN_CONT "%lx ", pfn);
> > + endsect = (end - 1) / PAGES_PER_SECTION;
> > + for (sect = start / PAGES_PER_SECTION; sect <= endsect; ++sect) {
> > + physnode_map[sect] = nid;
> > + printk(KERN_CONT "%lx ", sect * PAGES_PER_SECTION);
> > }
> > printk(KERN_CONT "\n");
> > }
>
> So, if start and end are not aligned to section boundaries, we will miss
> setting physnode_map[] for the final section?

If end belongs to a different section than start, the final section
may not be initialized, yes.

> For instance, if we have a 64MB section size and try to call
> memory_present(32MB -> 96MB), we will set 0->64MB present, but not set
> the 64MB->128MB section as present.
>
> Right?

Exactly.

> Can you just align 'start' down to the section's start and 'end' up to
> the end of the section that contains it? I guess you do that
> implicitly, but you should be able to do it without refactoring the for
> loop entirely.

Works for me.

Petr Tesarik

On 02/01/2014 04:13 AM, Petr Tesarik wrote:
>> > Just curious, but
>> > what the heck kind of 32-bit NUMA hardware is still in the wild? Did
>> > someone buy a NUMA-Q on eBay? :)
> In fact, this is a patch that has been floating around in SUSE
> Enterprise kernels for some time. It was originally added to pass
> certification on IBM SurePOS 700 x4900-785.
>
> When cleaning up our kernel patches, I noticed that the bug is still
> present in the upstream kernel, so I posted this patch. While I don't
> have any evidence that someone actually needs the fix today, it seems
> wrong to leave buggy code in the kernel.
>
> If you all agree that we rip out DISCONTIGMEM instead, I can post
> patches to do that and be equally happy. ;-)

I have a soft spot in my heart for all that old 32-bit NUMA hardware.
I've been thinking about ripping the support out, but it usually sits
quietly not bothering anybody.

Your patch looks correct to me, and it's easier to tell that it is
correct if you just change the alignment. The only bummer here is that
it's going to be hard to test for correctness since it sounds like you
don't have the hardware sitting in front of you. In any case, feel free
to add my:

Acked-by: Dave Hansen <[email protected]>