2014-01-23 22:49:35

by Dave Hansen

Subject: Panic on 8-node system in memblock_virt_alloc_try_nid()

Linus's current tree doesn't boot on an 8-node/1TB NUMA system that I
have. Its reboots are *LONG*, so I haven't fully bisected it, but it's
down to just a few commits, most of which are changes to the memblock
code. Since the panic is in the memblock code, it looks like a
no-brainer. It's almost certainly the code from Santosh or Grygorii
that's triggering this.

Config and good/bad dmesg with memblock=debug are here:

http://sr71.net/~dave/intel/3.13/

Please let me know if you need it bisected further than this.

The remaining commits are these:

> commit 4883e997b26ed857da8dae6a6e6aeb12830b978d
> commit 560dca27a6b36015e4f69a4ceba0ee5be0707c17
> commit 9a28f9dc8d10b619af9a37b1e27c41ada5415629
> commit b6cb5bab263791d09abe88f24df6c2da53415320
> commit cfb665864e54ee7a160750b4815bfe6b7eb13d0d
> commit 9233d2be108f573caa21eb450411bf8fa68cadbb
> commit 4fc0bc58cb7d983e55baa8dcbb7c1a4ee54e65be
> commit 9e43aa2b8d1cb3137bd7e60d5fead83d0569de2b
> commit 999c17e3de4855af4e829c0871ad32fc76a93991
> commit 0d036e9e33df8befa9348683ba68258fee7f0a00
> commit 8b89a1169437541a2a9b62c8f7b1a5c0ceb0fbde
> commit bb016b84164554725899aef544331085e08cb402
> commit c15295001aa940df4e3cf6574808a4addca9f2e5
> commit 457ff1de2d247d9b8917c4664c2325321a35e313
> commit c2f69cdafebb3a46e43b5ac57ca12b539a2c790f
> commit 6782832eba5e8c87a749a41da8deda1c3ef67ba0
> commit 9da791dfabc60218c81904c7906b45789466e68e
> commit 098b081b50d5eb8c7e0200a4770b0bcd28eab9ce
> commit 26f09e9b3a0696f6fe20b021901300fba26fb579
> commit b115423357e0cda6d8f45d0c81df537d7b004020
> commit 87029ee9390b2297dae699d5fb135b77992116e5
> commit 79f40fab0b3a78e0e41fac79a65a9870f4b05652
> commit 869a84e1ca163b737236dae997db4a6a1e230b9b
> commit 10e89523bf5aade79081f501452fe7f1a16fa189
> commit fd615c4e671979e3e362df537d6be38f8d27aa80
> commit 5b6e529521d35e1bcaa0fe43456d1bbb335cae5d

The oops I see is this:

> [ 0.000000] Kernel panic - not syncing: : Failed to allocate 2143289344 bytes align=0x200000 nid=0 from=0x1000000 max_addr=0x0
> [ 0.000000]
> [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.13.0-slub-03995-g0dc3fd0-dirty #816
> [ 0.000000] Hardware name: FUJITSU-SV PRIMEQUEST 1800E2/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.24 09/14/2011
> [ 0.000000] 0000000001000000 ffffffff81c01ce8 ffffffff81706941 0000000000000687
> [ 0.000000] ffffffff81a30b48 ffffffff81c01d68 ffffffff817029de 0000000000000000
> [ 0.000000] 0000000000000030 ffffffff81c01d80 ffffffff81c01d18 ffffffff81c01d68
> [ 0.000000] Call Trace:
> [ 0.000000] [<ffffffff81706941>] dump_stack+0x4e/0x68
> [ 0.000000] [<ffffffff817029de>] panic+0xbb/0x1cb
> [ 0.000000] [<ffffffff81d3bef9>] memblock_virt_alloc_try_nid+0xa1/0xa1
> [ 0.000000] [<ffffffff816ff5f9>] __earlyonly_bootmem_alloc.constprop.0+0x21/0x28
> [ 0.000000] [<ffffffff81d3cf27>] sparse_mem_maps_populate_node+0x34/0x132
> [ 0.000000] [<ffffffff81d3cbd3>] ? alloc_usemap_and_memmap+0x10f/0x10f
> [ 0.000000] [<ffffffff81d3cbdc>] sparse_early_mem_maps_alloc_node+0x9/0xb
> [ 0.000000] [<ffffffff81d3cb96>] alloc_usemap_and_memmap+0xd2/0x10f
> [ 0.000000] [<ffffffff81d3ce29>] sparse_init+0x85/0x14f
> [ 0.000000] [<ffffffff81d2adbb>] paging_init+0x13/0x22
> [ 0.000000] [<ffffffff81d1b521>] setup_arch+0xb51/0xc6e
> [ 0.000000] [<ffffffff81703150>] ? printk+0x4d/0x4f
> [ 0.000000] [<ffffffff81d14b1a>] start_kernel+0x85/0x3db
> [ 0.000000] [<ffffffff81d145a8>] x86_64_start_reservations+0x2a/0x2c
> [ 0.000000] [<ffffffff81d1469a>] x86_64_start_kernel+0xf0/0xf7
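
For scale, the failing request is roughly 2 GiB: with 1 TB split across 8 nodes
each node covers about 128 GiB, and (assuming 4 KiB pages and a 64-byte struct
page, which depends on the config) its mem_map[] needs 128 GiB / 4 KiB * 64 B
= 2 GiB. 2143289344 bytes is just under that.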


2014-01-24 00:28:04

by Dave Hansen

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

I've got a second failure mode, too, also memblock-related, on the same
system but with a different config. In this one, the memblock code looks to
have returned an address for which there is no virtual mapping. The PMD
is clear.
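
The access pattern involved is roughly the following (a paraphrase of
numa_alloc_distance()/numa_set_distance(), not the exact kernel code): the
distance table is carved out of memblock as a physical range and then written
through its direct-map address, so if the direct mapping for that physical
range was never populated, the very first store faults with a clear PMD,
exactly as in the oops below.

    /* rough paraphrase only, not the exact kernel code */
    phys = memblock_find_in_range(0, PFN_PHYS(max_pfn_mapped), size, PAGE_SIZE);
    memblock_reserve(phys, size);
    numa_distance = __va(phys);             /* direct-map alias of the allocation */
    numa_distance[i * cnt + j] = distance;  /* faults if the direct map for
                                             * __va(phys) was never set up */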

> [ 0.000000] memblock_find_in_range_node():239
> [ 0.000000] __memblock_find_range_top_down():150
> [ 0.000000] __memblock_find_range_top_down():152 i: 600000001
> [ 0.000000] memblock_find_in_range_node():241 ret: 2147479552
> [ 0.000000] memblock_reserve: [0x0000007ffff000-0x0000007ffff03f] flags 0x0 numa_set_distance+0xd2/0x252
> [ 0.000000] numa_distance phys: 2147479552
> [ 0.000000] numa_distance virt: ffff88007ffff000
> [ 0.000000] numa_distance size: 64
> [ 0.000000] numa_alloc_distance() accessing numa_distance[] at byte: 0
> [ 0.000000] BUG: unable to handle kernel paging request at ffff88007ffff000
> [ 0.000000] IP: [<ffffffff81d2c1f1>] numa_set_distance+0x186/0x252
> [ 0.000000] PGD 211e067 PUD 2121067 PMD 0
> [ 0.000000] Oops: 0002 [#1] SMP
> [ 0.000000] Modules linked in:
> [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.13.0-slub-04156-g90804ed-dirty #825
> [ 0.000000] Hardware name: FUJITSU-SV PRIMEQUEST 1800E2/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.24 09/14/2011
> [ 0.000000] task: ffffffff81c104a0 ti: ffffffff81c00000 task.ti: ffffffff81c00000
> [ 0.000000] RIP: 0010:[<ffffffff81d2c1f1>] [<ffffffff81d2c1f1>] numa_set_distance+0x186/0x252
> [ 0.000000] RSP: 0000:ffffffff81c01cd8 EFLAGS: 00010002
> [ 0.000000] RAX: 000000000000000a RBX: 0000000000000000 RCX: 0000000000000000
> [ 0.000000] RDX: 0000000000000014 RSI: 0000000000000046 RDI: ffffffff81ea4f84
> [ 0.000000] RBP: ffffffff81c01d68 R08: 000000000000100d R09: ffff88007ffff000
> [ 0.000000] R10: 0000000000000127 R11: 000000000000000d R12: 0000000000000000
> [ 0.000000] R13: 000000000000000a R14: 0000000000000008 R15: 0000000000000001
> [ 0.000000] FS: 0000000000000000(0000) GS:ffffffff81d00000(0000) knlGS:0000000000000000
> [ 0.000000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 0.000000] CR2: ffff88007ffff000 CR3: 0000000001c0b000 CR4: 00000000000000b0
> [ 0.000000] Stack:
> [ 0.000000] 0000000000000000 ffffffff00000000 0000000000000000 0000004081c01dd0
> [ 0.000000] 00000000000000ff 0000000000000000 0000000000000000 0000000000000000
> [ 0.000000] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [ 0.000000] Call Trace:
> [ 0.000000] [<ffffffff81d2c480>] acpi_numa_slit_init+0x47/0x70
> [ 0.000000] [<ffffffff81d52c34>] ? acpi_table_print_srat_entry+0x26/0x26
> [ 0.000000] [<ffffffff81d52c9c>] acpi_parse_slit+0x68/0x6c
> [ 0.000000] [<ffffffff81d5156c>] acpi_table_parse+0x6c/0x82
> [ 0.000000] [<ffffffff81d52dcc>] acpi_numa_init+0x94/0xb0
> [ 0.000000] [<ffffffff81d2c6d9>] ? acpi_numa_arch_fixup+0x6/0x6
> [ 0.000000] [<ffffffff81d2c6d9>] ? acpi_numa_arch_fixup+0x6/0x6
> [ 0.000000] [<ffffffff81d2c6e2>] x86_acpi_numa_init+0x9/0x1b
> [ 0.000000] [<ffffffff81d2bbc2>] numa_init+0xe0/0x589
> [ 0.000000] [<ffffffff8108adba>] ? set_pte_vaddr_pud+0x3a/0x60
> [ 0.000000] [<ffffffff8108ae45>] ? set_pte_vaddr+0x65/0xa0
> [ 0.000000] [<ffffffff810902d5>] ? __native_set_fixmap+0x25/0x30
> [ 0.000000] [<ffffffff81d2c2d6>] x86_numa_init+0x19/0x2b
> [ 0.000000] [<ffffffff81d2c419>] initmem_init+0x9/0xb
> [ 0.000000] [<ffffffff81d1b2f3>] setup_arch+0x923/0xc6e
> [ 0.000000] [<ffffffff817032e0>] ? printk+0x4d/0x4f
> [ 0.000000] [<ffffffff81d14b1a>] start_kernel+0x85/0x3db
> [ 0.000000] [<ffffffff81d145a8>] x86_64_start_reservations+0x2a/0x2c
> [ 0.000000] [<ffffffff81d1469a>] x86_64_start_kernel+0xf0/0xf7
> [ 0.000000] Code: ff ff e8 c6 70 9d ff 8b 4d 80 4c 8b 8d 70 ff ff ff b0 0a 4c 03 0d a8 0a 17 00 ba 14 00 00 00 44 39 f9 0f 45 c2 49 ff c7 45 39 fe <41> 88 01 44 8b 85 78 ff ff ff 7f a0 ff c1 45 01 f0 44 39 f1 7c
> [ 0.000000] RIP [<ffffffff81d2c1f1>] numa_set_distance+0x186/0x252
> [ 0.000000] RSP <ffffffff81c01cd8>
> [ 0.000000] CR2: ffff88007ffff000
> [ 0.000000] ---[ end trace 1ac9854e9d9aedf2 ]---
> [ 0.000000] Kernel panic - not syncing: Attempted to kill the idle task!

2014-01-24 03:43:58

by Santosh Shilimkar

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

Dave,

On Thursday 23 January 2014 05:49 PM, Dave Hansen wrote:
> Linus's current tree doesn't boot on an 8-node/1TB NUMA system that I
> have. Its reboots are *LONG*, so I haven't fully bisected it, but it's
> down to a just a few commits, most of which are changes to the memblock
> code. Since the panic is in the memblock code, it looks like a
> no-brainer. It's almost certainly the code from Santosh or Grygorii
> that's triggering this.
>
> Config and good/bad dmesg with memblock=debug are here:
>
> http://sr71.net/~dave/intel/3.13/
>
> Please let me know if you need it bisected further than this.
>
Thanks a lot for the debug information; it's pretty useful. The oops
actually seems to be a side effect of the NUMA nodes not being set up
correctly in the first place. At least the setup_node_data() results
indicate that. setup_node_data() operates on the physical memblock
interfaces, which are untouched except for the alignment change, and
that is potentially the reason for the change in behavior.

Would you be able to revert the below commit and give it a quick try to
see if the behavior changes? The revert might impact other APIs, since
they assume SMP_CACHE_BYTES as the default alignment, but I at least
want to see whether setup_node_data() reserves the correct memory
space with the revert in place.

79f40fa mm/memblock: drop WARN and use SMP_CACHE_BYTES as a default alignment

Regards,
Santosh
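
For reference, the behavioral change 79f40fa makes is roughly this
(paraphrased from its subject line, not the exact diff): a zero alignment
passed into the memblock allocators no longer trips a WARN and instead
silently defaults to a cache line.

    /* gist of 79f40fa (paraphrased), in the memblock allocation path */
    if (!align)
            align = SMP_CACHE_BYTES;  /* previously this case warned */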

2014-01-24 05:55:16

by Yinghai Lu

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

On Thu, Jan 23, 2014 at 2:49 PM, Dave Hansen <[email protected]> wrote:
> Linus's current tree doesn't boot on an 8-node/1TB NUMA system that I
> have. Its reboots are *LONG*, so I haven't fully bisected it, but it's
> down to a just a few commits, most of which are changes to the memblock
> code. Since the panic is in the memblock code, it looks like a
> no-brainer. It's almost certainly the code from Santosh or Grygorii
> that's triggering this.
>
> Config and good/bad dmesg with memblock=debug are here:
>
> http://sr71.net/~dave/intel/3.13/
>
> Please let me know if you need it bisected further than this.

Please check attached patch, and it should fix the problem.

Yinghai


Attachments:
fix_numa_x.patch (1.82 kB)

2014-01-24 06:38:38

by Santosh Shilimkar

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

Yinghai,

On Friday 24 January 2014 12:55 AM, Yinghai Lu wrote:
> On Thu, Jan 23, 2014 at 2:49 PM, Dave Hansen <[email protected]> wrote:
>> > Linus's current tree doesn't boot on an 8-node/1TB NUMA system that I
>> > have. Its reboots are *LONG*, so I haven't fully bisected it, but it's
>> > down to a just a few commits, most of which are changes to the memblock
>> > code. Since the panic is in the memblock code, it looks like a
>> > no-brainer. It's almost certainly the code from Santosh or Grygorii
>> > that's triggering this.
>> >
>> > Config and good/bad dmesg with memblock=debug are here:
>> >
>> > http://sr71.net/~dave/intel/3.13/
>> >
>> > Please let me know if you need it bisected further than this.
> Please check attached patch, and it should fix the problem.
>

[...]

>
> Subject: [PATCH] x86: Fix numa with reverting wrong memblock setting.
>
> Dave reported Numa on x86 is broken on system with 1T memory.
>
> It turns out
> | commit 5b6e529521d35e1bcaa0fe43456d1bbb335cae5d
> | Author: Santosh Shilimkar <[email protected]>
> | Date: Tue Jan 21 15:50:03 2014 -0800
> |
> | x86: memblock: set current limit to max low memory address
>
> set limit to low wrongly.
>
> max_low_pfn_mapped is different from max_pfn_mapped.
> max_low_pfn_mapped is always under 4G.
>
> That will memblock_alloc_nid all go under 4G.
>
> Revert that offending patch.
>
> Reported-by: Dave Hansen <[email protected]>
> Signed-off-by: Yinghai Lu <[email protected]>
>
>
This will mostly fix the $subject issue, but the regression reported by
Andrew [1] will resurface with the revert. It's clear now that even
though the commit fixed that issue, it wasn't the right fix.

It would be great if you could have a look at the thread.

Regards,
Santosh

[1] http://lkml.indiana.edu/hypermail/linux/kernel/1312.1/03770.html
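
The mechanism described in the quoted commit message, in rough terms (a
paraphrase, not the actual diff of 5b6e5295 or of its revert): memblock's
"current limit" acts as the implicit max_addr for allocations that do not
pass one explicitly, so deriving it from max_low_pfn_mapped confines every
early allocation, including the ~2 GiB per-node mem_map[], to the first 4 GiB.

    /* with the offending commit (paraphrased) -- always below 4 GiB */
    memblock_set_current_limit((u64)max_low_pfn_mapped << PAGE_SHIFT);

    /* with the revert (paraphrased) -- the whole directly mapped range */
    memblock_set_current_limit((u64)max_pfn_mapped << PAGE_SHIFT);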

2014-01-24 06:56:42

by Santosh Shilimkar

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

On Friday 24 January 2014 01:38 AM, Santosh Shilimkar wrote:
> Yinghai,
>
> On Friday 24 January 2014 12:55 AM, Yinghai Lu wrote:
>> On Thu, Jan 23, 2014 at 2:49 PM, Dave Hansen <[email protected]> wrote:
>>>> Linus's current tree doesn't boot on an 8-node/1TB NUMA system that I
>>>> have. Its reboots are *LONG*, so I haven't fully bisected it, but it's
>>>> down to a just a few commits, most of which are changes to the memblock
>>>> code. Since the panic is in the memblock code, it looks like a
>>>> no-brainer. It's almost certainly the code from Santosh or Grygorii
>>>> that's triggering this.
>>>>
>>>> Config and good/bad dmesg with memblock=debug are here:
>>>>
>>>> http://sr71.net/~dave/intel/3.13/
>>>>
>>>> Please let me know if you need it bisected further than this.
>> Please check attached patch, and it should fix the problem.
>>
>
> [...]
>
>>
>> Subject: [PATCH] x86: Fix numa with reverting wrong memblock setting.
>>
>> Dave reported Numa on x86 is broken on system with 1T memory.
>>
>> It turns out
>> | commit 5b6e529521d35e1bcaa0fe43456d1bbb335cae5d
>> | Author: Santosh Shilimkar <[email protected]>
>> | Date: Tue Jan 21 15:50:03 2014 -0800
>> |
>> | x86: memblock: set current limit to max low memory address
>>
>> set limit to low wrongly.
>>
>> max_low_pfn_mapped is different from max_pfn_mapped.
>> max_low_pfn_mapped is always under 4G.
>>
>> That will memblock_alloc_nid all go under 4G.
>>
>> Revert that offending patch.
>>
>> Reported-by: Dave Hansen <[email protected]>
>> Signed-off-by: Yinghai Lu <[email protected]>
>>
>>
> This mostly will fix the $subject issue but the regression
> reported by Andrew [1] will surface with the revert. Its clear
> now that even though commit fixed the issue, it wasn't the fix.
>
> Would be great if you can have a look at the thread.
>
The patch which is now commit 457ff1d {lib/swiotlb.c: use
memblock apis for early memory allocations} was breaking the
boot on Andrew's machine. Looking back at the patch now, based on your
description above, I believe the below hunk was/is the culprit.

@@ -172,8 +172,9 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
/*
* Get the overflow emergency buffer
*/
- v_overflow_buffer = alloc_bootmem_low_pages_nopanic(
- PAGE_ALIGN(io_tlb_overflow));
+ v_overflow_buffer = memblock_virt_alloc_nopanic(
+ PAGE_ALIGN(io_tlb_overflow),
+ PAGE_SIZE);
if (!v_overflow_buffer)
return -ENOMEM;


It looks like 'v_overflow_buffer' must be allocated from low memory in this
case. Is that correct?

Regards,
Santosh
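
Context for why low memory matters here: the swiotlb buffers are bounce
buffers for devices limited to 32-bit DMA, so they have to live below 4 GiB.
alloc_bootmem_low_pages_nopanic() enforced that bound, while
memblock_virt_alloc_nopanic() only honors the global current limit. A bounded
allocation would look roughly like this (a sketch reusing the
memblock_virt_alloc_try_nid_nopanic() signature from this series; the 4 GiB
cap is an illustrative constant, and this is essentially what the helper added
later in the thread wraps):

    /* sketch only: cap the overflow buffer below 4 GiB */
    v_overflow_buffer = memblock_virt_alloc_try_nid_nopanic(
                            PAGE_ALIGN(io_tlb_overflow), PAGE_SIZE,
                            BOOTMEM_LOW_LIMIT, 0xffffffffUL /* 4 GiB */,
                            NUMA_NO_NODE);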

2014-01-24 06:57:10

by Yinghai Lu

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

On Thu, Jan 23, 2014 at 10:38 PM, Santosh Shilimkar
<[email protected]> wrote:
> Yinghai,
>
> On Friday 24 January 2014 12:55 AM, Yinghai Lu wrote:
>> On Thu, Jan 23, 2014 at 2:49 PM, Dave Hansen <[email protected]> wrote:
>>> > Linus's current tree doesn't boot on an 8-node/1TB NUMA system that I
>>> > have. Its reboots are *LONG*, so I haven't fully bisected it, but it's
>>> > down to a just a few commits, most of which are changes to the memblock
>>> > code. Since the panic is in the memblock code, it looks like a
>>> > no-brainer. It's almost certainly the code from Santosh or Grygorii
>>> > that's triggering this.
>>> >
>>> > Config and good/bad dmesg with memblock=debug are here:
>>> >
>>> > http://sr71.net/~dave/intel/3.13/
>>> >
>>> > Please let me know if you need it bisected further than this.
>> Please check attached patch, and it should fix the problem.
>>
>
> [...]
>
>>
>> Subject: [PATCH] x86: Fix numa with reverting wrong memblock setting.
>>
>> Dave reported Numa on x86 is broken on system with 1T memory.
>>
>> It turns out
>> | commit 5b6e529521d35e1bcaa0fe43456d1bbb335cae5d
>> | Author: Santosh Shilimkar <[email protected]>
>> | Date: Tue Jan 21 15:50:03 2014 -0800
>> |
>> | x86: memblock: set current limit to max low memory address
>>
>> set limit to low wrongly.
>>
>> max_low_pfn_mapped is different from max_pfn_mapped.
>> max_low_pfn_mapped is always under 4G.
>>
>> That will memblock_alloc_nid all go under 4G.
>>
>> Revert that offending patch.
>>
>> Reported-by: Dave Hansen <[email protected]>
>> Signed-off-by: Yinghai Lu <[email protected]>
>>
>>
> This mostly will fix the $subject issue but the regression
> reported by Andrew [1] will surface with the revert. Its clear
> now that even though commit fixed the issue, it wasn't the fix.
>
> Would be great if you can have a look at the thread.

>> [1] http://lkml.indiana.edu/hypermail/linux/kernel/1312.1/03770.html

Andrew,

Did you bisect which patch in that 23 patchset cause your system have problem?

Thanks

Yinghai

2014-01-24 07:01:21

by Andrew Morton

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

On Thu, 23 Jan 2014 22:57:08 -0800 Yinghai Lu <[email protected]> wrote:

> On Thu, Jan 23, 2014 at 10:38 PM, Santosh Shilimkar
> <[email protected]> wrote:
> > Yinghai,
> >
> > On Friday 24 January 2014 12:55 AM, Yinghai Lu wrote:
> >> On Thu, Jan 23, 2014 at 2:49 PM, Dave Hansen <[email protected]> wrote:
> >>> > Linus's current tree doesn't boot on an 8-node/1TB NUMA system that I
> >>> > have. Its reboots are *LONG*, so I haven't fully bisected it, but it's
> >>> > down to a just a few commits, most of which are changes to the memblock
> >>> > code. Since the panic is in the memblock code, it looks like a
> >>> > no-brainer. It's almost certainly the code from Santosh or Grygorii
> >>> > that's triggering this.
> >>> >
> >>> > Config and good/bad dmesg with memblock=debug are here:
> >>> >
> >>> > http://sr71.net/~dave/intel/3.13/
> >>> >
> >>> > Please let me know if you need it bisected further than this.
> >> Please check attached patch, and it should fix the problem.
> >>
> >
> > [...]
> >
> >>
> >> Subject: [PATCH] x86: Fix numa with reverting wrong memblock setting.
> >>
> >> Dave reported Numa on x86 is broken on system with 1T memory.
> >>
> >> It turns out
> >> | commit 5b6e529521d35e1bcaa0fe43456d1bbb335cae5d
> >> | Author: Santosh Shilimkar <[email protected]>
> >> | Date: Tue Jan 21 15:50:03 2014 -0800
> >> |
> >> | x86: memblock: set current limit to max low memory address
> >>
> >> set limit to low wrongly.
> >>
> >> max_low_pfn_mapped is different from max_pfn_mapped.
> >> max_low_pfn_mapped is always under 4G.
> >>
> >> That will memblock_alloc_nid all go under 4G.
> >>
> >> Revert that offending patch.
> >>
> >> Reported-by: Dave Hansen <[email protected]>
> >> Signed-off-by: Yinghai Lu <[email protected]>
> >>
> >>
> > This mostly will fix the $subject issue but the regression
> > reported by Andrew [1] will surface with the revert. Its clear
> > now that even though commit fixed the issue, it wasn't the fix.
> >
> > Would be great if you can have a look at the thread.
>
> >> [1] http://lkml.indiana.edu/hypermail/linux/kernel/1312.1/03770.html
>
> Andrew,
>
> Did you bisect which patch in that 23 patchset cause your system have problem?
>

Yes - it was caused by the patch which that email was replying to:
"[PATCH v3 13/23] mm/lib/swiotlb: Use memblock apis for early memory
allocations".

2014-01-24 07:04:14

by Yinghai Lu

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

On Thu, Jan 23, 2014 at 10:56 PM, Santosh Shilimkar
<[email protected]> wrote:
> On Friday 24 January 2014 01:38 AM, Santosh Shilimkar wrote:

> The patch which is now commit 457ff1d {lib/swiotlb.c: use
> memblock apis for early memory allocations} was the breaking the
> boot on Andrew's machine. Now if I look back the patch, based on your
> above description, I believe below hunk waS/is the culprit.
>
> @@ -172,8 +172,9 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
> /*
> * Get the overflow emergency buffer
> */
> - v_overflow_buffer = alloc_bootmem_low_pages_nopanic(
> - PAGE_ALIGN(io_tlb_overflow));
> + v_overflow_buffer = memblock_virt_alloc_nopanic(
> + PAGE_ALIGN(io_tlb_overflow),
> + PAGE_SIZE);
> if (!v_overflow_buffer)
> return -ENOMEM;
>
>
> Looks like 'v_overflow_buffer' must be allocated from low memory in this
> case. Is that correct ?

yes.

but shouldn't the change also cover the following hunk?

commit 457ff1de2d247d9b8917c4664c2325321a35e313
Author: Santosh Shilimkar <[email protected]>
Date: Tue Jan 21 15:50:30 2014 -0800

lib/swiotlb.c: use memblock apis for early memory allocations


@@ -215,13 +220,13 @@ swiotlb_init(int verbose)
bytes = io_tlb_nslabs << IO_TLB_SHIFT;

/* Get IO TLB memory from the low pages */
- vstart = alloc_bootmem_low_pages_nopanic(PAGE_ALIGN(bytes));
+ vstart = memblock_virt_alloc_nopanic(PAGE_ALIGN(bytes), PAGE_SIZE);
if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, verbose))
return;

2014-01-24 07:23:02

by Santosh Shilimkar

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

On Friday 24 January 2014 02:04 AM, Yinghai Lu wrote:
> On Thu, Jan 23, 2014 at 10:56 PM, Santosh Shilimkar
> <[email protected]> wrote:
>> On Friday 24 January 2014 01:38 AM, Santosh Shilimkar wrote:
>
>> The patch which is now commit 457ff1d {lib/swiotlb.c: use
>> memblock apis for early memory allocations} was the breaking the
>> boot on Andrew's machine. Now if I look back the patch, based on your
>> above description, I believe below hunk waS/is the culprit.
>>
>> @@ -172,8 +172,9 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
>> /*
>> * Get the overflow emergency buffer
>> */
>> - v_overflow_buffer = alloc_bootmem_low_pages_nopanic(
>> - PAGE_ALIGN(io_tlb_overflow));
>> + v_overflow_buffer = memblock_virt_alloc_nopanic(
>> + PAGE_ALIGN(io_tlb_overflow),
>> + PAGE_SIZE);
>> if (!v_overflow_buffer)
>> return -ENOMEM;
>>
>>
>> Looks like 'v_overflow_buffer' must be allocated from low memory in this
>> case. Is that correct ?
>
> yes.
>
> but should the change like following
>
> commit 457ff1de2d247d9b8917c4664c2325321a35e313
> Author: Santosh Shilimkar <[email protected]>
> Date: Tue Jan 21 15:50:30 2014 -0800
>
> lib/swiotlb.c: use memblock apis for early memory allocations
>
>
> @@ -215,13 +220,13 @@ swiotlb_init(int verbose)
> bytes = io_tlb_nslabs << IO_TLB_SHIFT;
>
> /* Get IO TLB memory from the low pages */
> - vstart = alloc_bootmem_low_pages_nopanic(PAGE_ALIGN(bytes));
> + vstart = memblock_virt_alloc_nopanic(PAGE_ALIGN(bytes), PAGE_SIZE);
> if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, verbose))
> return;
>
OK. So we need an '__alloc_bootmem_low()' equivalent memblock API. We will
try to come up with a patch for that. Thanks for the input.

Regards,
Santosh

2014-01-24 07:46:32

by Yinghai Lu

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

On Thu, Jan 23, 2014 at 11:22 PM, Santosh Shilimkar
<[email protected]> wrote:
> On Friday 24 January 2014 02:04 AM, Yinghai Lu wrote:
>> On Thu, Jan 23, 2014 at 10:56 PM, Santosh Shilimkar
>> <[email protected]> wrote:
>>> On Friday 24 January 2014 01:38 AM, Santosh Shilimkar wrote:
>>
>>> The patch which is now commit 457ff1d {lib/swiotlb.c: use
>>> memblock apis for early memory allocations} was the breaking the
>>> boot on Andrew's machine. Now if I look back the patch, based on your
>>> above description, I believe below hunk waS/is the culprit.
>>>
>>> @@ -172,8 +172,9 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
>>> /*
>>> * Get the overflow emergency buffer
>>> */
>>> - v_overflow_buffer = alloc_bootmem_low_pages_nopanic(
>>> - PAGE_ALIGN(io_tlb_overflow));
>>> + v_overflow_buffer = memblock_virt_alloc_nopanic(
>>> + PAGE_ALIGN(io_tlb_overflow),
>>> + PAGE_SIZE);
>>> if (!v_overflow_buffer)
>>> return -ENOMEM;
>>>
>>>
>>> Looks like 'v_overflow_buffer' must be allocated from low memory in this
>>> case. Is that correct ?
>>
>> yes.
>>
>> but should the change like following
>>
>> commit 457ff1de2d247d9b8917c4664c2325321a35e313
>> Author: Santosh Shilimkar <[email protected]>
>> Date: Tue Jan 21 15:50:30 2014 -0800
>>
>> lib/swiotlb.c: use memblock apis for early memory allocations
>>
>>
>> @@ -215,13 +220,13 @@ swiotlb_init(int verbose)
>> bytes = io_tlb_nslabs << IO_TLB_SHIFT;
>>
>> /* Get IO TLB memory from the low pages */
>> - vstart = alloc_bootmem_low_pages_nopanic(PAGE_ALIGN(bytes));
>> + vstart = memblock_virt_alloc_nopanic(PAGE_ALIGN(bytes), PAGE_SIZE);
>> if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, verbose))
>> return;
>>
> OK. So we need '__alloc_bootmem_low()' equivalent memblock API. We will try
> to come up with a patch for the same. Thanks for inputs.

Yes,

Andrew, can you try attached two patches in your setup?

Assume your system does not have intel iommu support?

Thanks

Yinghai


Attachments:
fix_numa_x.patch (1.82 kB)
revert_memblock_swiotlb_change.patch (3.40 kB)

2014-01-24 07:54:55

by Santosh Shilimkar

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

On Friday 24 January 2014 02:46 AM, Yinghai Lu wrote:
>> OK. So we need '__alloc_bootmem_low()' equivalent memblock API. We will try
>> > to come up with a patch for the same. Thanks for inputs.
> Yes,
>
> Andrew, can you try attached two patches in your setup?
>
> Assume your system does not have intel iommu support?
>
You are fast... I was cooking up a very similar patch to yours.
Thanks for the help. It should mostly fix the issue on Andrew's
box after the revert of commit 5b6e529521.

>
> ---
> arch/arm/kernel/setup.c | 2 +-
> include/linux/bootmem.h | 37 +++++++++++++++++++++++++++++++++++++
> lib/swiotlb.c | 4 ++--
> 3 files changed, 40 insertions(+), 3 deletions(-)
>
> Index: linux-2.6/include/linux/bootmem.h
> ===================================================================
> --- linux-2.6.orig/include/linux/bootmem.h
> +++ linux-2.6/include/linux/bootmem.h
> @@ -175,6 +175,27 @@ static inline void * __init memblock_vir
> NUMA_NO_NODE);
> }
>
> +#ifndef ARCH_LOW_ADDRESS_LIMIT
> +#define ARCH_LOW_ADDRESS_LIMIT 0xffffffffUL
> +#endif
> +
> +static inline void * __init memblock_virt_alloc_low(
> + phys_addr_t size, phys_addr_t align)
> +{
> + return memblock_virt_alloc_try_nid(size, align,
> + BOOTMEM_LOW_LIMIT,
> + ARCH_LOW_ADDRESS_LIMIT,
> + NUMA_NO_NODE);
> +}
> +static inline void * __init memblock_virt_alloc_low_nopanic(
> + phys_addr_t size, phys_addr_t align)
> +{
> + return memblock_virt_alloc_try_nid_nopanic(size, align,
> + BOOTMEM_LOW_LIMIT,
> + ARCH_LOW_ADDRESS_LIMIT,
> + NUMA_NO_NODE);
> +}
> +
> static inline void * __init memblock_virt_alloc_from_nopanic(
> phys_addr_t size, phys_addr_t align, phys_addr_t min_addr)
> {
> @@ -238,6 +259,22 @@ static inline void * __init memblock_vir
> return __alloc_bootmem_nopanic(size, align, BOOTMEM_LOW_LIMIT);
> }
>
> +static inline void * __init memblock_virt_alloc_low(
> + phys_addr_t size, phys_addr_t align)
> +{
> + if (!align)
> + align = SMP_CACHE_BYTES;
> + return __alloc_bootmem_low(size, align, BOOTMEM_LOW_LIMIT);
> +}
> +
> +static inline void * __init memblock_virt_alloc_low_nopanic(
> + phys_addr_t size, phys_addr_t align)
> +{
> + if (!align)
> + align = SMP_CACHE_BYTES;
> + return __alloc_bootmem_low_nopanic(size, align, BOOTMEM_LOW_LIMIT);
> +}
> +
> static inline void * __init memblock_virt_alloc_from_nopanic(
> phys_addr_t size, phys_addr_t align, phys_addr_t min_addr)
> {
> Index: linux-2.6/lib/swiotlb.c
> ===================================================================
> --- linux-2.6.orig/lib/swiotlb.c
> +++ linux-2.6/lib/swiotlb.c
> @@ -172,7 +172,7 @@ int __init swiotlb_init_with_tbl(char *t
> /*
> * Get the overflow emergency buffer
> */
> - v_overflow_buffer = memblock_virt_alloc_nopanic(
> + v_overflow_buffer = memblock_virt_alloc_low_nopanic(
> PAGE_ALIGN(io_tlb_overflow),
> PAGE_SIZE);
> if (!v_overflow_buffer)
> @@ -220,7 +220,7 @@ swiotlb_init(int verbose)
> bytes = io_tlb_nslabs << IO_TLB_SHIFT;
>
> /* Get IO TLB memory from the low pages */
> - vstart = memblock_virt_alloc_nopanic(PAGE_ALIGN(bytes), PAGE_SIZE);
> + vstart = memblock_virt_alloc_low_nopanic(PAGE_ALIGN(bytes), PAGE_SIZE);
> if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, verbose))
> return;
>
> Index: linux-2.6/arch/arm/kernel/setup.c
> ===================================================================
> --- linux-2.6.orig/arch/arm/kernel/setup.c
> +++ linux-2.6/arch/arm/kernel/setup.c
> @@ -717,7 +717,7 @@ static void __init request_standard_reso
> kernel_data.end = virt_to_phys(_end - 1);
>
> for_each_memblock(memory, region) {
> - res = memblock_virt_alloc(sizeof(*res), 0);
> + res = memblock_virt_alloc_low(sizeof(*res), 0);
> res->name = "System RAM";
> res->start = __pfn_to_phys(memblock_region_memory_base_pfn(region));
> res->end = __pfn_to_phys(memblock_region_memory_end_pfn(region)) - 1;
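
A hypothetical caller of the new helper would look like this (illustrative
only; the names follow the patch above): the allocation is bounded by
ARCH_LOW_ADDRESS_LIMIT, 4 GiB unless an architecture overrides it, which
restores the guarantee alloc_bootmem_low_pages_nopanic() used to give swiotlb.
The _nopanic variant returns NULL on failure instead of panicking.

    /* illustrative use of the helper added in the patch above */
    void *buf = memblock_virt_alloc_low_nopanic(PAGE_ALIGN(bytes), PAGE_SIZE);
    if (!buf)
            return -ENOMEM;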

2014-01-24 15:02:36

by Dave Hansen

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

On 01/23/2014 09:55 PM, Yinghai Lu wrote:
> On Thu, Jan 23, 2014 at 2:49 PM, Dave Hansen <[email protected]> wrote:
>> Linus's current tree doesn't boot on an 8-node/1TB NUMA system that I
>> have. Its reboots are *LONG*, so I haven't fully bisected it, but it's
>> down to a just a few commits, most of which are changes to the memblock
>> code. Since the panic is in the memblock code, it looks like a
>> no-brainer. It's almost certainly the code from Santosh or Grygorii
>> that's triggering this.
>>
>> Config and good/bad dmesg with memblock=debug are here:
>>
>> http://sr71.net/~dave/intel/3.13/
>>
>> Please let me know if you need it bisected further than this.
>
> Please check attached patch, and it should fix the problem.

There are two failure modes I'm seeing: one where it fails to allocate
the first node's mem_map[], and a second where it oopses accessing the
numa_distance[] table. This is the numa_distance[] one, and it happens
even with the patch you suggested applied.

> [ 0.000000] memblock_find_in_range_node():239
> [ 0.000000] __memblock_find_range_top_down():150
> [ 0.000000] __memblock_find_range_top_down():152 i: 600000001
> [ 0.000000] memblock_find_in_range_node():241 ret: 2147479552
> [ 0.000000] memblock_reserve: [0x0000007ffff000-0x0000007ffff03f] flags 0x0 numa_set_distance+0xd2/0x252
> [ 0.000000] numa_distance phys: 7ffff000
> [ 0.000000] numa_distance virt: ffff88007ffff000
> [ 0.000000] numa_distance size: 64
> [ 0.000000] numa_alloc_distance() accessing numa_distance[] at byte: 0
> [ 0.000000] BUG: unable to handle kernel paging request at ffff88007ffff000
> [ 0.000000] IP: [<ffffffff81d2c1f1>] numa_set_distance+0x186/0x252
> [ 0.000000] PGD 211e067 PUD 2121067 PMD 0
> [ 0.000000] Oops: 0002 [#1] SMP
> [ 0.000000] Modules linked in:
> [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.13.0-slub-04156-g90804ed-dirty #826
> [ 0.000000] Hardware name: FUJITSU-SV PRIMEQUEST 1800E2/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.24 09/14/2011
> [ 0.000000] task: ffffffff81c104a0 ti: ffffffff81c00000 task.ti: ffffffff81c00000
> [ 0.000000] RIP: 0010:[<ffffffff81d2c1f1>] [<ffffffff81d2c1f1>] numa_set_distance+0x186/0x252
> [ 0.000000] RSP: 0000:ffffffff81c01cd8 EFLAGS: 00010002
> [ 0.000000] RAX: 000000000000000a RBX: 0000000000000000 RCX: 0000000000000000
> [ 0.000000] RDX: 0000000000000014 RSI: 0000000000000046 RDI: ffffffff81ea4f84
> [ 0.000000] RBP: ffffffff81c01d68 R08: 000000000000100d R09: ffff88007ffff000
> [ 0.000000] R10: 0000000000000127 R11: 000000000000000d R12: 0000000000000000
> [ 0.000000] R13: 000000000000000a R14: 0000000000000008 R15: 0000000000000001
> [ 0.000000] FS: 0000000000000000(0000) GS:ffffffff81d00000(0000) knlGS:0000000000000000
> [ 0.000000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 0.000000] CR2: ffff88007ffff000 CR3: 0000000001c0b000 CR4: 00000000000000b0
> [ 0.000000] Stack:
> [ 0.000000] 0000000000000000 ffffffff00000000 0000000000000000 0000004081c01dd0
> [ 0.000000] 00000000000000ff 0000000000000000 0000000000000000 0000000000000000
> [ 0.000000] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [ 0.000000] Call Trace:
> [ 0.000000] [<ffffffff81d2c480>] acpi_numa_slit_init+0x47/0x70
> [ 0.000000] [<ffffffff81d52c34>] ? acpi_table_print_srat_entry+0x26/0x26
> [ 0.000000] [<ffffffff81d52c9c>] acpi_parse_slit+0x68/0x6c
> [ 0.000000] [<ffffffff81d5156c>] acpi_table_parse+0x6c/0x82
> [ 0.000000] [<ffffffff81d52dcc>] acpi_numa_init+0x94/0xb0
> [ 0.000000] [<ffffffff81d2c6d9>] ? acpi_numa_arch_fixup+0x6/0x6
> [ 0.000000] [<ffffffff81d2c6d9>] ? acpi_numa_arch_fixup+0x6/0x6
> [ 0.000000] [<ffffffff81d2c6e2>] x86_acpi_numa_init+0x9/0x1b
> [ 0.000000] [<ffffffff81d2bbc2>] numa_init+0xe0/0x589
> [ 0.000000] [<ffffffff8108adba>] ? set_pte_vaddr_pud+0x3a/0x60
> [ 0.000000] [<ffffffff8108ae45>] ? set_pte_vaddr+0x65/0xa0
> [ 0.000000] [<ffffffff810902d5>] ? __native_set_fixmap+0x25/0x30
> [ 0.000000] [<ffffffff81d2c2d6>] x86_numa_init+0x19/0x2b
> [ 0.000000] [<ffffffff81d2c419>] initmem_init+0x9/0xb
> [ 0.000000] [<ffffffff81d1b2f3>] setup_arch+0x923/0xc6e
> [ 0.000000] [<ffffffff817032e0>] ? printk+0x4d/0x4f
> [ 0.000000] [<ffffffff81d14b1a>] start_kernel+0x85/0x3db
> [ 0.000000] [<ffffffff81d145a8>] x86_64_start_reservations+0x2a/0x2c
> [ 0.000000] [<ffffffff81d1469a>] x86_64_start_kernel+0xf0/0xf7
> [ 0.000000] Code: ff ff e8 c6 70 9d ff 8b 4d 80 4c 8b 8d 70 ff ff ff b0 0a 4c 03 0d a8 0a 17 00 ba 14 00 00 00 44 39 f9 0f 45 c2 49 ff c7 45 39 fe <41> 88 01 44 8b 85 78 ff ff ff 7f a0 ff c1 45 01 f0 44 39 f1 7c
> [ 0.000000] RIP [<ffffffff81d2c1f1>] numa_set_distance+0x186/0x252
> [ 0.000000] RSP <ffffffff81c01cd8>
> [ 0.000000] CR2: ffff88007ffff000
> [ 0.000000] ---[ end trace 8a50456ee7e911cb ]---
> [ 0.000000] Kernel panic - not syncing: Attempted to kill the idle task!

2014-01-24 15:25:41

by Dave Hansen

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

On 01/24/2014 07:01 AM, Dave Hansen wrote:
> There are two failure modes I'm seeing: one when (failing to) allocate
> the first node's mem_map[], and a second where it oopses accessing the
> numa_distance[] table. This is the numa_distance[] one, and it happens
> even with the patch you suggested applied.

And with my second (lots of debugging enabled) config, I get the
mem_map[] oops. In other words, none of the reverts or patches are
helping either of the conditions that I'm able to trigger.

2014-01-24 17:45:23

by Yinghai Lu

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

On Fri, Jan 24, 2014 at 7:01 AM, Dave Hansen <[email protected]> wrote:
> There are two failure modes I'm seeing: one when (failing to) allocate
> the first node's mem_map[], and a second where it oopses accessing the
> numa_distance[] table. This is the numa_distance[] one, and it happens
> even with the patch you suggested applied.
>
>> [ 0.000000] memblock_find_in_range_node():239
>> [ 0.000000] __memblock_find_range_top_down():150
>> [ 0.000000] __memblock_find_range_top_down():152 i: 600000001
>> [ 0.000000] memblock_find_in_range_node():241 ret: 2147479552
>> [ 0.000000] memblock_reserve: [0x0000007ffff000-0x0000007ffff03f] flags 0x0 numa_set_distance+0xd2/0x252

That address is wrong.

Can you post the whole log with current Linus' tree plus the two patches
that I sent out yesterday?

>> [ 0.000000] numa_distance phys: 7ffff000
>> [ 0.000000] numa_distance virt: ffff88007ffff000
>> [ 0.000000] numa_distance size: 64
>> [ 0.000000] numa_alloc_distance() accessing numa_distance[] at byte: 0
>> [ 0.000000] BUG: unable to handle kernel paging request at ffff88007ffff000
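
For reference when reading the numbers above: 2147479552 is 0x7ffff000, and
the faulting virtual address ffff88007ffff000 is that physical address seen
through the kernel direct mapping (__va(0x7ffff000)); the "PMD 0" line in the
full oops says the direct-map page tables for that range were never populated.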

2014-01-24 18:10:01

by Dave Hansen

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

On 01/24/2014 09:45 AM, Yinghai Lu wrote:
> On Fri, Jan 24, 2014 at 7:01 AM, Dave Hansen <[email protected]> wrote:
>> There are two failure modes I'm seeing: one when (failing to) allocate
>> the first node's mem_map[], and a second where it oopses accessing the
>> numa_distance[] table. This is the numa_distance[] one, and it happens
>> even with the patch you suggested applied.
>>
>>> [ 0.000000] memblock_find_in_range_node():239
>>> [ 0.000000] __memblock_find_range_top_down():150
>>> [ 0.000000] __memblock_find_range_top_down():152 i: 600000001
>>> [ 0.000000] memblock_find_in_range_node():241 ret: 2147479552
>>> [ 0.000000] memblock_reserve: [0x0000007ffff000-0x0000007ffff03f] flags 0x0 numa_set_distance+0xd2/0x252
>
> that address is wrong.
>
> Can you post whole log with current linus' tree + two patches that I
> sent out yesterday?

Here you go. It's still spitting out memblock_reserve messages to the
console. I'm not sure if it's making _some_ progress or not.

https://www.sr71.net/~dave/intel/3.13/dmesg.with-2-patches

But, it's certainly not booting. Do you want to see it without
memblock=debug?

2014-01-24 18:13:56

by Yinghai Lu

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

On Fri, Jan 24, 2014 at 10:09 AM, Dave Hansen <[email protected]> wrote:
> On 01/24/2014 09:45 AM, Yinghai Lu wrote:
>> On Fri, Jan 24, 2014 at 7:01 AM, Dave Hansen <[email protected]> wrote:
>>> There are two failure modes I'm seeing: one when (failing to) allocate
>>> the first node's mem_map[], and a second where it oopses accessing the
>>> numa_distance[] table. This is the numa_distance[] one, and it happens
>>> even with the patch you suggested applied.
>>>
>>>> [ 0.000000] memblock_find_in_range_node():239
>>>> [ 0.000000] __memblock_find_range_top_down():150
>>>> [ 0.000000] __memblock_find_range_top_down():152 i: 600000001
>>>> [ 0.000000] memblock_find_in_range_node():241 ret: 2147479552
>>>> [ 0.000000] memblock_reserve: [0x0000007ffff000-0x0000007ffff03f] flags 0x0 numa_set_distance+0xd2/0x252
>>
>> that address is wrong.
>>
>> Can you post whole log with current linus' tree + two patches that I
>> sent out yesterday?
>
> Here you go. It's still spitting out memblock_reserve messages to the
> console. I'm not sure if it's making _some_ progress or not.
>
> https://www.sr71.net/~dave/intel/3.13/dmesg.with-2-patches
>
> But, it's certainly not booting. Do you want to see it without
> memblock=debug?

That looks like a different problem, and it cannot set up the memory mapping
properly.

Can you send me the .config?

Thanks

Yinghai

2014-01-24 18:19:20

by Dave Hansen

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

On 01/24/2014 10:13 AM, Yinghai Lu wrote:
> On Fri, Jan 24, 2014 at 10:09 AM, Dave Hansen <[email protected]> wrote:
>> On 01/24/2014 09:45 AM, Yinghai Lu wrote:
>> Here you go. It's still spitting out memblock_reserve messages to the
>> console. I'm not sure if it's making _some_ progress or not.
>>
>> https://www.sr71.net/~dave/intel/3.13/dmesg.with-2-patches
>>
>> But, it's certainly not booting. Do you want to see it without
>> memblock=debug?
>
> that looks like different problem. and it can not set memory mapping properly.
>
> can you send me .config ?

Here you go.

FWIW, I did turn off memblock=debug. It eventually booted, but
slooooooooooowly.

How many problems in this code are we tracking, btw? This is at least
3, right?


Attachments:
config-3.13.0-05617-g3aacd62-dirty.txt (75.16 kB)

2014-01-24 18:24:29

by Yinghai Lu

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

On Fri, Jan 24, 2014 at 10:19 AM, Dave Hansen <[email protected]> wrote:
> On 01/24/2014 10:13 AM, Yinghai Lu wrote:
>> On Fri, Jan 24, 2014 at 10:09 AM, Dave Hansen <[email protected]> wrote:
>>> On 01/24/2014 09:45 AM, Yinghai Lu wrote:
>>> Here you go. It's still spitting out memblock_reserve messages to the
>>> console. I'm not sure if it's making _some_ progress or not.
>>>
>>> https://www.sr71.net/~dave/intel/3.13/dmesg.with-2-patches
>>>
>>> But, it's certainly not booting. Do you want to see it without
>>> memblock=debug?
>>
>> that looks like different problem. and it can not set memory mapping properly.
>>
>> can you send me .config ?
>
> Here you go.
>
> FWIW, I did turn of memblock=debug. It eventually booted, but
> slooooooooooowly.

Then that is not a problem, as you are using 4k page mappings only,
and that printout is just too much spew...

>
> How many problems in this code are we tracking, btw? This is at least
> 3, right?

two problems:
1. big numa system.
2. Andrew's system with swiotlb.

The two patches should address them.

Thanks

Yinghai
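
(Presumably what makes the combination so slow: with KMEMCHECK/DEBUG_PAGEALLOC
the direct map is built from 4 KiB pages, so a 1 TB machine needs hundreds of
thousands of page-table pages, each reserved through memblock, and
memblock=debug prints a line on the console for every one of those
reservations.)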

2014-01-24 18:43:00

by Dave Hansen

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

On 01/24/2014 10:24 AM, Yinghai Lu wrote:
> On Fri, Jan 24, 2014 at 10:19 AM, Dave Hansen <[email protected]> wrote:
>> FWIW, I did turn of memblock=debug. It eventually booted, but
>> slooooooooooowly.
>
> then that is not a problem, as you are using 4k page mapping only.
> and that printout is too spew...

This means that, essentially, memblock=debug and
KMEMCHECK/DEBUG_PAGEALLOC can't be used together. That's a shame
because my DEBUG_PAGEALLOC config *broke* this code a few months ago,
right? Oh well.

>> How many problems in this code are we tracking, btw? This is at least
>> 3, right?
>
> two problems:
> 1. big numa system.
> 2. Andrew's system with swiotlb.

Can I ask politely for some more caution on your part in this area?
This is two consecutive kernels where this code has broken my system.

2014-01-24 18:51:42

by Yinghai Lu

Subject: Re: Panic on 8-node system in memblock_virt_alloc_try_nid()

On Fri, Jan 24, 2014 at 10:42 AM, Dave Hansen <[email protected]> wrote:
> On 01/24/2014 10:24 AM, Yinghai Lu wrote:
>> On Fri, Jan 24, 2014 at 10:19 AM, Dave Hansen <[email protected]> wrote:
>>> FWIW, I did turn of memblock=debug. It eventually booted, but
>>> slooooooooooowly.
>>
>> then that is not a problem, as you are using 4k page mapping only.
>> and that printout is too spew...
>
> This means that, essentially, memblock=debug and
> KMEMCHECK/DEBUG_PAGEALLOC can't be used together. That's a shame
> because my DEBUG_PAGEALLOC config *broke* this code a few months ago,
> right? Oh well.

It should only be broken when MOVABLE_NODE is enabled on a big system.

>
>>> How many problems in this code are we tracking, btw? This is at least
>>> 3, right?
>>
>> two problems:
>> 1. big numa system.
>> 2. Andrew's system with swiotlb.
>
> Can I ask politely for some more caution on your part in this area?
> This is two consecutive kernels where this code has broken my system.

I agree, the code got messy now that we have top-down and bottom-up
mapping for different configurations.

I already tried hard to push the parse-SRAT-early solution instead of
that split.

Yinghai