2010-02-18 00:46:06

by Michael Bohan

Subject: Kernel panic due to page migration accessing memory holes

Hi,

I have encountered a kernel panic on the ARM/msm platform in the mm
migration code on 2.6.29. My memory configuration has two discontiguous
banks per our ATAG definition. These banks end up on addresses that
are 1 MB aligned. I am using FLATMEM (not SPARSEMEM), but my
understanding is that SPARSEMEM should not be necessary to support this
configuration. Please correct me if I'm wrong.

The crash occurs in mm/page_alloc.c:move_freepages() when being passed a
start_page that corresponds to the last several megabytes of our first
memory bank. The code in move_freepages_block() aligns the passed in
page number to pageblock_nr_pages, which corresponds to 4 MB. It then
passes that aligned pfn as the beginning of a 4 MB range to
move_freepages(). The problem is that since our bank's end address is
not 4 MB aligned, the range passed to move_freepages() exceeds the end
of our memory bank. The code later blows up when trying to access
uninitialized page structures.
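
To make the arithmetic concrete (our bank0 ends at 0x07D00000, i.e. pfn
0x7D00 with 4 KB pages, and pageblock_nr_pages is 0x400 here), the rounding
in move_freepages_block() is roughly:

        /* move_freepages_block() rounds down to a pageblock boundary */
        start_pfn = page_to_pfn(page) & ~(pageblock_nr_pages - 1);
        end_pfn = start_pfn + pageblock_nr_pages - 1;

        /*
         * For a page at, say, pfn 0x7C50 (a hypothetical example), start_pfn
         * becomes 0x7C00 and end_pfn becomes 0x7FFF -- past the end of bank0
         * at pfn 0x7D00, yet still inside the zone span, so the existing
         * zone boundary check does not catch it.
         */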

As a temporary fix, I added some code to move_freepages_block() that
inspects whether the range exceeds our first memory bank -- returning 0
if it does. This is not a clean solution, since it requires exporting
the ARM specific meminfo structure to extract the bank information.
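
Roughly, the hack looks like this (a sketch only; the direct use of meminfo
and bank_pfn_end() is exactly the ARM-specific part that makes it ugly):

        /* HACK: in move_freepages_block(), after computing start_pfn and
         * end_pfn, refuse ranges that run past the end of our first bank. */
        if (start_pfn < bank_pfn_end(&meminfo.bank[0]) &&
            end_pfn >= bank_pfn_end(&meminfo.bank[0]))
                return 0;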

I see an option exists called CONFIG_HOLES_IN_ZONE, which has control
over the definition of pfn_valid_within() used in move_freepages().
This option seems relevant to the problem. The ia64 architecture has a
special version of pfn_valid() called ia64_pfn_valid() that is used in
conjunction with this option. It appears to inspect the page
structure's state in a safe way that does not cause a crash, and can
presumably be used to determine whether the page structure is
initialized properly. The ARM version of pfn_valid() used in the
FLATMEM scenario does not appear to be memory hole aware, and will
blindly return true in this case.
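
For reference, pfn_valid_within() is defined in include/linux/mmzone.h as:

        #ifdef CONFIG_HOLES_IN_ZONE
        #define pfn_valid_within(pfn) pfn_valid(pfn)
        #else
        #define pfn_valid_within(pfn) (1)
        #endif

so without CONFIG_HOLES_IN_ZONE the per-page validity check in
move_freepages() compiles away to nothing.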

I have looked on linux-next, and at least the functions mentioned above
have not changed.

I was curious whether there is a stated requirement that memory banks must
end on 4 MB aligned addresses. Although I found this problem on ARM, it
appears upon inspection that the problem could occur on other
architectures as well, given the memory map assumptions stated above.
I'm hoping that some mm experts might understand the problem in greater
detail.

Thanks,
Michael


2010-02-18 01:07:12

by Kamezawa Hiroyuki

Subject: Re: Kernel panic due to page migration accessing memory holes

On Wed, 17 Feb 2010 16:45:54 -0800
Michael Bohan <[email protected]> wrote:

> Hi,
>
> I have encountered a kernel panic on the ARM/msm platform in the mm
> migration code on 2.6.29. My memory configuration has two discontiguous
> banks per our ATAG definition. These banks end up on addresses that
> are 1 MB aligned. I am using FLATMEM (not SPARSEMEM), but my
> understanding is that SPARSEMEM should not be necessary to support this
> configuration. Please correct me if I'm wrong.
>
> The crash occurs in mm/page_alloc.c:move_freepages() when being passed a
> start_page that corresponds to the last several megabytes of our first
> memory bank. The code in move_freepages_block() aligns the passed in
> page number to pageblock_nr_pages, which corresponds to 4 MB. It then
> passes that aligned pfn as the beginning of a 4 MB range to
> move_freepages(). The problem is that since our bank's end address is
> not 4 MB aligned, the range passed to move_freepages() exceeds the end
> of our memory bank. The code later blows up when trying to access
> uninitialized page structures.
>
That should be aligned, I think.

> As a temporary fix, I added some code to move_freepages_block() that
> inspects whether the range exceeds our first memory bank -- returning 0
> if it does. This is not a clean solution, since it requires exporting
> the ARM specific meminfo structure to extract the bank information.
>
Hmm, my first impression is...

- With FLATMEM, the memmap is created for the exact number of pages, so its
size need not be aligned.
- With SPARSEMEM, the memmap is created for an aligned number of pages.

Then, the range [zone->start_pfn ... zone->start_pfn + zone->spanned_pages]
should always be checked.


803 static int move_freepages_block(struct zone *zone, struct page *page,
804                                 int migratetype)
805 {
...
816         if (start_pfn < zone->zone_start_pfn)
817                 start_page = page;
818         if (end_pfn >= zone->zone_start_pfn + zone->spanned_pages)
819                 return 0;
820
821         return move_freepages(zone, start_page, end_page, migratetype);
822 }

"(end_pfn >= zone->zone_start_pfn + zone->spanned_pages)" is checked.
What zone->spanned_pages is set ? The zone's range is
[zone->start_pfn ... zone->start_pfn+zone->spanned_pages], so this
area should have initialized memmap. I wonder zone->spanned_pages is too big.

Could you check ? (maybe /proc/zoneinfo can show it.)
Dump of /proc/zoneinfo or dmesg will be helpful.

Thanks,
-Kame

2010-02-18 08:22:27

by Michael Bohan

Subject: Re: Kernel panic due to page migration accessing memory holes

On 2/17/2010 5:03 PM, KAMEZAWA Hiroyuki wrote:
> On Wed, 17 Feb 2010 16:45:54 -0800
> Michael Bohan<[email protected]> wrote:
>> As a temporary fix, I added some code to move_freepages_block() that
>> inspects whether the range exceeds our first memory bank -- returning 0
>> if it does. This is not a clean solution, since it requires exporting
>> the ARM specific meminfo structure to extract the bank information.
>>
>>
> Hmm, my first impression is...
>
> - With FLATMEM, the memmap is created for the exact number of pages, so its
> size need not be aligned.
> - With SPARSEMEM, the memmap is created for an aligned number of pages.
>
> Then, the range [zone->start_pfn ... zone->start_pfn + zone->spanned_pages]
> should always be checked.
>
>
> 803 static int move_freepages_block(struct zone *zone, struct page *page,
> 804                                 int migratetype)
> 805 {
> ...
> 816         if (start_pfn < zone->zone_start_pfn)
> 817                 start_page = page;
> 818         if (end_pfn >= zone->zone_start_pfn + zone->spanned_pages)
> 819                 return 0;
> 820
> 821         return move_freepages(zone, start_page, end_page, migratetype);
> 822 }
>
> "(end_pfn>= zone->zone_start_pfn + zone->spanned_pages)" is checked.
> What zone->spanned_pages is set ? The zone's range is
> [zone->start_pfn ... zone->start_pfn+zone->spanned_pages], so this
> area should have initialized memmap. I wonder zone->spanned_pages is too big.
>

In the block of code above running on my target, zone_start_pfn is 0x200
and spanned_pages is 0x44100. This is consistent with the values shown in
the zoneinfo output below. It is also consistent with my memory map:

bank0:
start: 0x00200000
size: 0x07B00000

bank1:
start: 0x40000000
size: 0x04300000

Thus, spanned_pages here is the highest address reached minus the start
address of the lowest bank (i.e. 0x40000000 + 0x04300000 - 0x00200000).
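
Working that out: 0x44300000 - 0x00200000 = 0x44100000 bytes, which with
4 KB pages is 0x44100 pfns, i.e. 278784 -- matching the 'spanned' value in
the zoneinfo output below.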

Both of these banks live in the same zone. This means the span check in
move_freepages_block() will never catch ranges that overlap the prohibited
pfns, since the zone itself spans those invalid pfns. Should each bank be
associated with its own zone?

> Could you check ? (maybe /proc/zoneinfo can show it.)
> Dump of /proc/zoneinfo or dmesg will be helpful.
>

Here is what I believe to be the relevant pieces from the kernel log:

<7>[ 0.000000] On node 0 totalpages: 48640
<7>[ 0.000000] free_area_init_node: node 0, pgdat 804875bc, node_mem_map 805af000
<7>[ 0.000000] Normal zone: 2178 pages used for memmap
<7>[ 0.000000] Normal zone: 0 pages reserved
<7>[ 0.000000] Normal zone: 46462 pages, LIFO batch:15
<4>[ 0.000000] Built 1 zonelists in Zone order, mobility grouping on. Total pages: 46462

# cat /proc/zoneinfo
Node 0, zone   Normal
  pages free     678
        min      431
        low      538
        high     646
        scanned  0 (aa: 0 ia: 0 af: 0 if: 0)
        spanned  278784
        present  46462
    mem_notify_status 0
    nr_free_pages 678
    nr_inactive_anon 8494
    nr_active_anon 8474
    nr_inactive_file 3234
    nr_active_file 2653
    nr_unevictable 71
    nr_mlock     0
    nr_anon_pages 12488
    nr_mapped    7237
    nr_file_pages 10446
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 293
    nr_slab_unreclaimable 942
    nr_page_table_pages 1365
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
    nr_writeback_temp 0
  protection: (0, 0)
  pagesets
    cpu: 0
              count: 42
              high:  90
              batch: 15
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         512
  inactive_ratio:    1

Thanks,
Michael

2010-02-18 08:53:54

by Russell King - ARM Linux

Subject: Re: Kernel panic due to page migration accessing memory holes

On Wed, Feb 17, 2010 at 04:45:54PM -0800, Michael Bohan wrote:
> I have encountered a kernel panic on the ARM/msm platform in the mm
> migration code on 2.6.29. My memory configuration has two discontiguous
> banks per our ATAG definition. These banks end up on addresses that
> are 1 MB aligned. I am using FLATMEM (not SPARSEMEM), but my
> understanding is that SPARSEMEM should not be necessary to support this
> configuration. Please correct me if I'm wrong.

Make sure you have ARCH_HAS_HOLES_MEMORYMODEL enabled.

2010-02-18 09:39:38

by Kamezawa Hiroyuki

Subject: Re: Kernel panic due to page migration accessing memory holes

On Thu, 18 Feb 2010 00:22:24 -0800
Michael Bohan <[email protected]> wrote:

> On 2/17/2010 5:03 PM, KAMEZAWA Hiroyuki wrote:
> > On Wed, 17 Feb 2010 16:45:54 -0800
> > Michael Bohan<[email protected]> wrote:
> >> As a temporary fix, I added some code to move_freepages_block() that
> >> inspects whether the range exceeds our first memory bank -- returning 0
> >> if it does. This is not a clean solution, since it requires exporting
> >> the ARM specific meminfo structure to extract the bank information.
> >>
> >>
> > Hmm, my first impression is...
> >
> > - With FLATMEM, the memmap is created for the exact number of pages, so its
> > size need not be aligned.
> > - With SPARSEMEM, the memmap is created for an aligned number of pages.
> >
> > Then, the range [zone->start_pfn ... zone->start_pfn + zone->spanned_pages]
> > should always be checked.
> >
> >
> > 803 static int move_freepages_block(struct zone *zone, struct page *page,
> > 804                                 int migratetype)
> > 805 {
> > ...
> > 816         if (start_pfn < zone->zone_start_pfn)
> > 817                 start_page = page;
> > 818         if (end_pfn >= zone->zone_start_pfn + zone->spanned_pages)
> > 819                 return 0;
> > 820
> > 821         return move_freepages(zone, start_page, end_page, migratetype);
> > 822 }
> >
> > "(end_pfn>= zone->zone_start_pfn + zone->spanned_pages)" is checked.
> > What zone->spanned_pages is set ? The zone's range is
> > [zone->start_pfn ... zone->start_pfn+zone->spanned_pages], so this
> > area should have initialized memmap. I wonder zone->spanned_pages is too big.
> >
>
> In the block of code above running on my target, zone_start_pfn is 0x200
> and spanned_pages is 0x44100. This is consistent with the values shown in
> the zoneinfo output below. It is also consistent with my memory map:
>
> bank0:
> start: 0x00200000
> size: 0x07B00000
>
> bank1:
> start: 0x40000000
> size: 0x04300000
>
> Thus, spanned_pages here is the highest address reached minus the start
> address of the lowest bank (i.e. 0x40000000 + 0x04300000 - 0x00200000).
>
> Both of these banks live in the same zone. This means the span check in
> move_freepages_block() will never catch ranges that overlap the prohibited
> pfns, since the zone itself spans those invalid pfns. Should each bank be
> associated with its own zone?
>

Hmm. Okay then.. (CCing Mel.)

[Fact]
- There are 2 banks of memory and a memory hole on your machine:
      0x00200000 - 0x07D00000
      0x40000000 - 0x44300000
- Both banks are in the same zone.
- You use FLATMEM.
- You see a panic in move_freepages().
- Your host's MAX_ORDER is 11, so the buddy allocator's alignment is 0x400000.
  Then, it seems the 1st bank is not aligned.
- When you added a special range check for bank0 in move_freepages(), no panic.

So, it seems the kernel sees something bad when accessing the memmap for the
memory hole between bank0 and bank1.


When you use FLATMEM, the memmap and migrate-type bitmap are allocated for
the whole range [start_pfn ... max_pfn), regardless of memory holes.
So I think you have a memmap even for the memory hole [0x07D00000 ... 0x40000000).

Then, the question is why move_freepages() panics when accessing the *unused*
memmap for the memory hole. All memmap entries (struct page) are initialized in
memmap_init()
 -> memmap_init_zone()
 -> ....
Here, all page structs are initialized (page->flags and page->lru are set up).

Then, looking back into move_freepages().
==
778         for (page = start_page; page <= end_page;) {
779                 /* Make sure we are not inadvertently changing nodes */
780                 VM_BUG_ON(page_to_nid(page) != zone_to_nid(zone));
781
782                 if (!pfn_valid_within(page_to_pfn(page))) {
783                         page++;
784                         continue;
785                 }
786
787                 if (!PageBuddy(page)) {
788                         page++;
789                         continue;
790                 }
791
792                 order = page_order(page);
793                 list_del(&page->lru);
794                 list_add(&page->lru,
795                         &zone->free_area[order].free_list[migratetype]);
796                 page += 1 << order;
797                 pages_moved += 1 << order;
798         }
==
Assume that accessing the page struct itself doesn't cause a panic.
Touching a page struct member such as page->lru is what causes it,
so PageBuddy must have been set.

Then, there are two candidates:
1. page_to_nid(page) != zone_to_nid(zone).
2. PageBuddy() is set by mistake.
   (A PG_reserved page should never have PG_buddy set.)

Either way, something is corrupted in the unused memmap area.
There are two possibilities:
(1) the memmap for the memory hole was not initialized correctly, or
(2) something corrupted the memmap afterwards (by overwriting it).

I suspect (2) rather than (1).

One difficulty here is that your kernel is 2.6.29. Could you try 2.6.32 and
reproduce the problem? Or could you check the page flags for the memory holes?
For holes, the nid should be zero, PG_buddy should not be set, and PG_reserved
should be set...

And checking the memmap initialization of memory holes in memmap_init_zone()
may be a good starting point for debugging, I guess.
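
Something like this, just as a rough sketch (hole_start_pfn and hole_end_pfn
are placeholders for the bounds of your hole):

        unsigned long pfn;

        for (pfn = hole_start_pfn; pfn < hole_end_pfn; pfn++) {
                struct page *page = pfn_to_page(pfn);

                /* hole memmap should be reserved, node 0, never in buddy */
                if (!PageReserved(page) || PageBuddy(page) ||
                    page_to_nid(page) != 0)
                        printk(KERN_ERR "bad hole memmap: pfn %lx flags %lx\n",
                               pfn, page->flags);
        }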


Off topic:
BTW, the memory hole seems huge relative to your amount of memory... using
SPARSEMEM is an option.

Regards,
-Kame

2010-02-18 10:04:49

by Mel Gorman

Subject: Re: Kernel panic due to page migration accessing memory holes

On Thu, Feb 18, 2010 at 06:36:04PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 18 Feb 2010 00:22:24 -0800
> Michael Bohan <[email protected]> wrote:
>
> > On 2/17/2010 5:03 PM, KAMEZAWA Hiroyuki wrote:
> > > On Wed, 17 Feb 2010 16:45:54 -0800
> > > Michael Bohan<[email protected]> wrote:
> > >> As a temporary fix, I added some code to move_freepages_block() that
> > >> inspects whether the range exceeds our first memory bank -- returning 0
> > >> if it does. This is not a clean solution, since it requires exporting
> > >> the ARM specific meminfo structure to extract the bank information.
> > >>
> > >>
> > > Hmm, my first impression is...
> > >
> > > - With FLATMEM, the memmap is created for the exact number of pages, so its
> > > size need not be aligned.
> > > - With SPARSEMEM, the memmap is created for an aligned number of pages.
> > >
> > > Then, the range [zone->start_pfn ... zone->start_pfn + zone->spanned_pages]
> > > should always be checked.
> > >
> > >
> > > 803 static int move_freepages_block(struct zone *zone, struct page *page,
> > > 804                                 int migratetype)
> > > 805 {
> > > ...
> > > 816         if (start_pfn < zone->zone_start_pfn)
> > > 817                 start_page = page;
> > > 818         if (end_pfn >= zone->zone_start_pfn + zone->spanned_pages)
> > > 819                 return 0;
> > > 820
> > > 821         return move_freepages(zone, start_page, end_page, migratetype);
> > > 822 }
> > >
> > > "(end_pfn>= zone->zone_start_pfn + zone->spanned_pages)" is checked.
> > > What zone->spanned_pages is set ? The zone's range is
> > > [zone->start_pfn ... zone->start_pfn+zone->spanned_pages], so this
> > > area should have initialized memmap. I wonder zone->spanned_pages is too big.
> > >
> >
> > In the block of code above running on my target, zone_start_pfn is 0x200
> > and spanned_pages is 0x44100. This is consistent with the values shown in
> > the zoneinfo output below. It is also consistent with my memory map:
> >
> > bank0:
> > start: 0x00200000
> > size: 0x07B00000
> >
> > bank1:
> > start: 0x40000000
> > size: 0x04300000
> >
> > Thus, spanned_pages here is the highest address reached minus the start
> > address of the lowest bank (i.e. 0x40000000 + 0x04300000 - 0x00200000).
> >
> > Both of these banks live in the same zone. This means the span check in
> > move_freepages_block() will never catch ranges that overlap the prohibited
> > pfns, since the zone itself spans those invalid pfns. Should each bank be
> > associated with its own zone?
> >
>
> Hmm. Okay then.. (CCing Mel.)
>
> [Fact]
> - There are 2 banks of memory and a memory hole on your machine:
>       0x00200000 - 0x07D00000
>       0x40000000 - 0x44300000
> - Both banks are in the same zone.
> - You use FLATMEM.
> - You see a panic in move_freepages().
> - Your host's MAX_ORDER is 11, so the buddy allocator's alignment is 0x400000.
>   Then, it seems the 1st bank is not aligned.

It's not, and assumptions are made about it being aligned.

> - When you added a special range check for bank0 in move_freepages(), no panic.
> So, it seems the kernel sees something bad when accessing the memmap for the
> memory hole between bank0 and bank1.
>
>
> When you use FLATMEM, the memmap and migrate-type bitmap are allocated for
> the whole range [start_pfn ... max_pfn), regardless of memory holes.
> So I think you have a memmap even for the memory hole [0x07D00000 ... 0x40000000).
>

It would have at the start, but then ....


> Then, the question is why move_freepages() panics when accessing the *unused*
> memmap for the memory hole. All memmap entries (struct page) are initialized in
> memmap_init()
>  -> memmap_init_zone()
>  -> ....
> Here, all page structs are initialized (page->flags and page->lru are set up).
>

ARM frees unused portions of the memmap to save memory. It's why memmap_valid_within()
exists under CONFIG_ARCH_HAS_HOLES_MEMORYMODEL, although previously only
reads of /proc/pagetypeinfo cared.

In that case, the FLATMEM memory map had unexpected holes, which "never"
happens, and that was the workaround. The problem here is that there are
unaligned zones but no pfn_valid() implementation that can identify
them as you'd have with SPARSEMEM. My expectation is that you are using
the pfn_valid() implementation from asm-generic

#define pfn_valid(pfn) ((pfn) < max_mapnr)

which is insufficient in your case.

> Then, looking back into move_freepages().
> ==
> 778         for (page = start_page; page <= end_page;) {
> 779                 /* Make sure we are not inadvertently changing nodes */
> 780                 VM_BUG_ON(page_to_nid(page) != zone_to_nid(zone));
> 781
> 782                 if (!pfn_valid_within(page_to_pfn(page))) {
> 783                         page++;
> 784                         continue;
> 785                 }
> 786
> 787                 if (!PageBuddy(page)) {
> 788                         page++;
> 789                         continue;
> 790                 }
> 791
> 792                 order = page_order(page);
> 793                 list_del(&page->lru);
> 794                 list_add(&page->lru,
> 795                         &zone->free_area[order].free_list[migratetype]);
> 796                 page += 1 << order;
> 797                 pages_moved += 1 << order;
> 798         }
> ==
> Assume that accessing the page struct itself doesn't cause a panic.
> Touching a page struct member such as page->lru is what causes it,
> so PageBuddy must have been set.
>
> Then, there are two candidates:
> 1. page_to_nid(page) != zone_to_nid(zone).
> 2. PageBuddy() is set by mistake.
>    (A PG_reserved page should never have PG_buddy set.)
>
> Either way, something is corrupted in the unused memmap area.
> There are two possibilities:
> (1) the memmap for the memory hole was not initialized correctly, or
> (2) something corrupted the memmap afterwards (by overwriting it).
>
> I suspect (2) rather than (1).
>

I think it's more likely that the memmap he is accessing has been
freed and is effectively random data.

> One difficulty here is that your kernel is 2.6.29. Could you try 2.6.32 and
> reproduce the problem? Or could you check the page flags for the memory holes?
> For holes, the nid should be zero, PG_buddy should not be set, and PG_reserved
> should be set...
>
> And checking the memmap initialization of memory holes in memmap_init_zone()
> may be a good starting point for debugging, I guess.
>
> Off topic:
> BTW, the memory hole seems huge relative to your amount of memory... using
> SPARSEMEM is an option.
>

SPARSEMEM would give you an implementation of pfn_valid() that you could
use here. The choices that spring to mind are:

1. reduce MAX_ORDER so the banks are aligned (easiest; see the arithmetic
   below)
2. use SPARSEMEM (easy, but not necessarily what you want to do; might
   waste memory unless you drop MAX_ORDER as well)
3. implement a pfn_valid() that can handle the holes and set
   CONFIG_HOLES_IN_ZONE so it's called in move_freepages() to
   deal with the holes (should pass this by someone more familiar
   with ARM than I)
4. call memmap_valid_within() in move_freepages() (very very ugly, not
   suitable for upstream merging)
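
For option 1, the arithmetic would be (assuming 4K pages): the banks are
1MB aligned, so a pageblock of 2^(MAX_ORDER-1) pages must be at most 1MB,
i.e. 2^(MAX_ORDER-1) * 4K <= 1M, giving MAX_ORDER <= 9.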

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-02-19 01:47:31

by Michael Bohan

Subject: Re: Kernel panic due to page migration accessing memory holes

On 2/18/2010 2:04 AM, Mel Gorman wrote:
> On Thu, Feb 18, 2010 at 06:36:04PM +0900, KAMEZAWA Hiroyuki wrote:
>
>> [Fact]
>> - There are 2 banks of memory and a memory hole on your machine:
>>       0x00200000 - 0x07D00000
>>       0x40000000 - 0x44300000
>> - Both banks are in the same zone.
>> - You use FLATMEM.
>> - You see a panic in move_freepages().
>> - Your host's MAX_ORDER is 11, so the buddy allocator's alignment is 0x400000.
>>   Then, it seems the 1st bank is not aligned.
>>
> It's not, and assumptions are made about it being aligned.
>

Would it be prudent to have the ARM mm init code detect unaligned,
discontiguous banks and print a warning message if
CONFIG_ARCH_HAS_HOLES_MEMORYMODEL is not configured? Should we take it
a step further and even BUG()?

> ARM frees unused portions of the memmap to save memory. It's why memmap_valid_within()
> exists under CONFIG_ARCH_HAS_HOLES_MEMORYMODEL, although previously only
> reads of /proc/pagetypeinfo cared.
>
> In that case, the FLATMEM memory map had unexpected holes, which "never"
> happens, and that was the workaround. The problem here is that there are
> unaligned zones but no pfn_valid() implementation that can identify
> them as you'd have with SPARSEMEM. My expectation is that you are using
> the pfn_valid() implementation from asm-generic
>
> #define pfn_valid(pfn) ((pfn) < max_mapnr)
>
> which is insufficient in your case.
>

I am actually using the pfn_valid() implementation for FLATMEM in
arch/arm/include/asm/memory.h. It is very similar to the asm-generic
one, and has no knowledge of the holes.
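
If memory serves, it is roughly:

        #define pfn_valid(pfn)  ((pfn) >= PHYS_PFN_OFFSET && \
                                 (pfn) < (PHYS_PFN_OFFSET + max_mapnr))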

> I think it's more likely that the memmap he is accessing has been
> freed and is effectively random data.
>
>

I also think this is the case.

> SPARSEMEM would give you an implementation of pfn_valid() that you could
> use here. The choices that spring to mind are:
>
> 1. reduce MAX_ORDER so the banks are aligned (easiest)
>

Is it safe to assume that reducing MAX_ORDER will hurt performance?

> 2. use SPARSEMEM (easy, but not necessarily what you want to do; might
> waste memory unless you drop MAX_ORDER as well)
>

We intend to use SPARSEMEM, but we'd also like to maintain FLATMEM
compatibility for some configurations. My guess is that there are other
ARM users that may want this support as well.

> 3. implement a pfn_valid() that can handle the holes and set
> CONFIG_HOLES_IN_ZONE so it's called in move_freepages() to
> deal with the holes (should pass this by someone more familiar
> with ARM than I)
>

This option seems the best to me. We should be able to implement an ARM
specific pfn_valid() that walks the ARM meminfo struct to ensure the pfn
is not within a hole.

My only concern with this is a comment in __rmqueue_fallback() after
calling move_freepages_block() that states "Claim the whole block if
over half of it is free". Suppose only 1 MB is beyond the bank limit.
That means that over half of the pages of the 4 MB block will be
reported by move_freepages() as free -- but 1 MB of those pages are
invalid. Won't this cause problems if these pages are assumed to be
part of an active block?

It seems like we should have an additional check in
move_freepages_block() using pfn_valid_within() on the last page in the
block (i.e. end_pfn) before calling move_freepages(). If the last page
is not valid, then shouldn't we return 0, as in the zone span check?
This would also skip the extra burden of checking each individual page
when we already know the proposed range is invalid.
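
In other words, something like this sketch (untested):

        /* in move_freepages_block(), just before calling move_freepages():
         * bail out if the last page of the block has no memmap behind it. */
        if (!pfn_valid_within(end_pfn))
                return 0;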

Assuming we did return 0 in this case, would that sub-block of pages
ever be usable for anything else, or would it be effectively wasted? If
this memory were wasted, then adjusting MAX_ORDER would have an
advantage in this sense -- ignoring any performance implications.

Thanks,
Michael

2010-02-19 02:03:36

by Kamezawa Hiroyuki

Subject: Re: Kernel panic due to page migration accessing memory holes

On Thu, 18 Feb 2010 17:47:28 -0800
Michael Bohan <[email protected]> wrote:

> On 2/18/2010 2:04 AM, Mel Gorman wrote:
> > On Thu, Feb 18, 2010 at 06:36:04PM +0900, KAMEZAWA Hiroyuki wrote:
> >
> >> [Fact]
> >> - There are 2 banks of memory and a memory hole on your machine:
> >>       0x00200000 - 0x07D00000
> >>       0x40000000 - 0x44300000
> >> - Both banks are in the same zone.
> >> - You use FLATMEM.
> >> - You see a panic in move_freepages().
> >> - Your host's MAX_ORDER is 11, so the buddy allocator's alignment is 0x400000.
> >>   Then, it seems the 1st bank is not aligned.
> >>
> > It's not, and assumptions are made about it being aligned.
> >
>
> Would it be prudent to have the ARM mm init code detect unaligned,
> discontiguous banks and print a warning message if
> CONFIG_ARCH_HAS_HOLES_MEMORYMODEL is not configured? Should we take it
> a step further and even BUG()?
>
> > ARM frees unused portions of the memmap to save memory. It's why memmap_valid_within()
> > exists under CONFIG_ARCH_HAS_HOLES_MEMORYMODEL, although previously only
> > reads of /proc/pagetypeinfo cared.
> >
> > In that case, the FLATMEM memory map had unexpected holes, which "never"
> > happens, and that was the workaround. The problem here is that there are
> > unaligned zones but no pfn_valid() implementation that can identify
> > them as you'd have with SPARSEMEM. My expectation is that you are using
> > the pfn_valid() implementation from asm-generic
> >
> > #define pfn_valid(pfn) ((pfn) < max_mapnr)
> >
> > which is insufficient in your case.
> >
>
> I am actually using the pfn_valid() implementation for FLATMEM in
> arch/arm/include/asm/memory.h. It is very similar to the asm-generic
> one, and has no knowledge of the holes.
>
That means, in FLATMEM, the memmap is allocated for [start ... max_pfn].
pfn_valid() doesn't mean "there is memory" but "there is a memmap".



> > I think it's more likely that the memmap he is accessing has been
> > freed and is effectively random data.
> >
> >
>
> I also think this is the case.
>
Then, please check that free_bootmem() and friends don't free the memmap
pages in a memory hole.


> > SPARSEMEM would give you an implementation of pfn_valid() that you could
> > use here. The choices that spring to mind are:
> >
> > 1. reduce MAX_ORDER so the banks are aligned (easiest)
> >
>
> Is it safe to assume that reducing MAX_ORDER will hurt performance?
>
> > 2. use SPARSEMEM (easy, but not necessarily what you want to do; might
> > waste memory unless you drop MAX_ORDER as well)
> >
>
> We intend to use SPARSEMEM, but we'd also like to maintain FLATMEM
> compatibility for some configurations. My guess is that there are other
> ARM users that may want this support as well.
>
> > 3. implement a pfn_valid() that can handle the holes and set
> > CONFIG_HOLES_IN_ZONE so it's called in move_freepages() to
> > deal with the holes (should pass this by someone more familiar
> > with ARM than I)
> >
>
> This option seems the best to me. We should be able to implement an ARM
> specific pfn_valid() that walks the ARM meminfo struct to ensure the pfn
> is not within a hole.
>
> My only concern with this is a comment in __rmqueue_fallback() after
> calling move_freepages_block() that states "Claim the whole block if
> over half of it is free". Suppose only 1 MB is beyond the bank limit.
> That means that over half of the pages of the 4 MB block will be
> reported by move_freepages() as free -- but 1 MB of those pages are
> invalid. Won't this cause problems if these pages are assumed to be
> part of an active block?
>
The memmap for memory holes should be marked PG_reserved and never freed
by free_bootmem(). Then, the memmap for memory holes will not end up in the
buddy allocator.

Again, pfn_valid() just says "there is a memmap", not "there is a valid page".

> It seems like we should have an additional check in
> move_freepages_block() using pfn_valid_within() on the last page in the
> block (i.e. end_pfn) before calling move_freepages(). If the last page
> is not valid, then shouldn't we return 0, as in the zone span check?
> This would also skip the extra burden of checking each individual page
> when we already know the proposed range is invalid.
>
You don't need that. Please check why PG_reserved is not set for your
memory holes.

> Assuming we did return 0 in this case, would that sub-block of pages
> ever be usable for anything else, or would it be effectively wasted? If
> this memory were wasted, then adjusting MAX_ORDER would have an
> advantage in this sense -- ignoring any performance implications.
>

Even if you do that, you still have to fix the "someone corrupts memory" or
"someone frees PG_reserved memory" issue anyway.

Thanks,
-Kame

2010-02-19 05:48:40

by Michael Bohan

Subject: Re: Kernel panic due to page migration accessing memory holes

On 2/18/2010 6:00 PM, KAMEZAWA Hiroyuki wrote:
> The memmap for memory holes should be marked PG_reserved and never freed
> by free_bootmem(). Then, the memmap for memory holes will not end up in the
> buddy allocator.
>
> Again, pfn_valid() just says "there is a memmap", not "there is a valid page".
>

ARM seems to have been freeing the memmap for holes for a long time. I'm
pretty sure there would be a lot of pushback if we tried to change
that. For example, in my memory map running FLATMEM, I would be
consuming an extra ~7 MB of memory if these structures were not freed.

As a compromise, perhaps we could free everything except the first
'pageblock_nr_pages' of memmap in a hole? This would guarantee that
move_freepages() doesn't dereference any memory that doesn't belong to the
memmap -- but still waste only a relatively small amount of memory. For
a 4 MB page block, it should only consume an extra 32 KB per hole in the
memory map.
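
The arithmetic, assuming 4 KB pages and a 32-byte struct page: a 4 MB
pageblock is 1024 pages, and 1024 * 32 bytes = 32 KB of memmap kept per
hole. By comparison, our whole hole [0x07D00000 ... 0x40000000) spans
0x38300000 bytes = 230144 pages, whose memmap is about 7 MB -- which is
where the figure above comes from.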

Thanks,
Michael

2010-02-19 06:13:48

by Kamezawa Hiroyuki

Subject: Re: Kernel panic due to page migration accessing memory holes

On Thu, 18 Feb 2010 21:48:37 -0800
Michael Bohan <[email protected]> wrote:

> On 2/18/2010 6:00 PM, KAMEZAWA Hiroyuki wrote:
> > The memmap for memory holes should be marked PG_reserved and never freed
> > by free_bootmem(). Then, the memmap for memory holes will not end up in the
> > buddy allocator.
> >
> > Again, pfn_valid() just says "there is a memmap", not "there is a valid page".
> >
>
> ARM seems to have been freeing the memmap for holes for a long time.
Ouch.

> I'm pretty sure there would be a lot of pushback if we tried to change
> that. For example, in my memory map running FLATMEM, I would be
> consuming an extra ~7 MB of memory if these structures were not freed.
>
> As a compromise, perhaps we could free everything except the first
> 'pageblock_nr_pages' of memmap in a hole? This would guarantee that
> move_freepages() doesn't dereference any memory that doesn't belong to the
> memmap -- but still waste only a relatively small amount of memory. For
> a 4 MB page block, it should only consume an extra 32 KB per hole in the
> memory map.
>
No. Even if you do that, you have to implement pfn_valid() to return the
correct value, as in "pfn_valid() returns true if there is a memmap".
Otherwise, many things will go bad.

You have 2 or 3 ways.

1. Re-implement pfn_valid() so that it returns the correct value.
   Maybe not difficult, but please take care with defining CONFIG_HOLES_IN_ZONE
   etc.

2. Use DISCONTIGMEM and treat each bank as a NUMA node.
   There will be no wasted memmap, but you get the other complications of
   CONFIG_NUMA.

3. Use SPARSEMEM.
   You even have 2 choices here:
   a - set your MAX_ORDER and SECTION_SIZE to proper values.
   b - waste some amount of memory for the memmap at the edge of each section
       (and don't free the memmap for the edge).

Thanks,
-Kame

2010-02-19 08:24:58

by Kamezawa Hiroyuki

Subject: Re: Kernel panic due to page migration accessing memory holes

On Fri, 19 Feb 2010 15:10:12 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> 1. Re-implement pfn_valid() so that it returns the correct value.
>    Maybe not difficult, but please take care with defining CONFIG_HOLES_IN_ZONE
>    etc.
>
> 2. Use DISCONTIGMEM and treat each bank as a NUMA node.
>    There will be no wasted memmap, but you get the other complications of
>    CONFIG_NUMA.
>
> 3. Use SPARSEMEM.
>    You even have 2 choices here:
>    a - set your MAX_ORDER and SECTION_SIZE to proper values.
>    b - waste some amount of memory for the memmap at the edge of each section
>        (and don't free the memmap for the edge).
>

I read ARM's code briefly. In 2.6.32, I think (1) is already implemented:
==

#ifndef CONFIG_SPARSEMEM
int pfn_valid(unsigned long pfn)
{
        struct meminfo *mi = &meminfo;
        unsigned int left = 0, right = mi->nr_banks;

        do {
                unsigned int mid = (right + left) / 2;
                struct membank *bank = &mi->bank[mid];

                if (pfn < bank_pfn_start(bank))
                        right = mid;
                else if (pfn >= bank_pfn_end(bank))
                        left = mid + 1;
                else
                        return 1;
        } while (left < right);
        return 0;
}
EXPORT_SYMBOL(pfn_valid);
==
So, what you should do is upgrade to 2.6.32 or backport this one.

See this.

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b7cfda9fc3d7aa60cffab5367f2a72a4a70060cd

Thanks,
-Kame



2010-02-19 08:30:53

by Russell King - ARM Linux

Subject: Re: Kernel panic due to page migration accessing memory holes

On Thu, Feb 18, 2010 at 05:47:28PM -0800, Michael Bohan wrote:
> I am actually using the pfn_valid() implementation for FLATMEM in
> arch/arm/include/asm/memory.h. It is very similar to the asm-generic
> one, and has no knowledge of the holes.

Later kernels have a pfn_valid() which does have hole functionality.

2010-02-19 13:48:50

by Mel Gorman

Subject: Re: Kernel panic due to page migration accessing memory holes

On Thu, Feb 18, 2010 at 05:47:28PM -0800, Michael Bohan wrote:
> On 2/18/2010 2:04 AM, Mel Gorman wrote:
>> On Thu, Feb 18, 2010 at 06:36:04PM +0900, KAMEZAWA Hiroyuki wrote:
>>
>>> [Fact]
>>> - There are 2 banks of memory and a memory hole on your machine:
>>>       0x00200000 - 0x07D00000
>>>       0x40000000 - 0x44300000
>>> - Both banks are in the same zone.
>>> - You use FLATMEM.
>>> - You see a panic in move_freepages().
>>> - Your host's MAX_ORDER is 11, so the buddy allocator's alignment is 0x400000.
>>>   Then, it seems the 1st bank is not aligned.
>>>
>> It's not, and assumptions are made about it being aligned.
>>
>
> Would it be prudent to have the ARM mm init code detect unaligned,
> discontiguous banks and print a warning message if
> CONFIG_ARCH_HAS_HOLES_MEMORYMODEL is not configured? Should we take it
> a step further and even BUG()?
>

I guess it wouldn't hurt. I wouldn't get too side-tracked though as it's
not the most important issue here.

>> ARM frees unused portions of the memmap to save memory. It's why memmap_valid_within()
>> exists under CONFIG_ARCH_HAS_HOLES_MEMORYMODEL, although previously only
>> reads of /proc/pagetypeinfo cared.
>>
>> In that case, the FLATMEM memory map had unexpected holes, which "never"
>> happens, and that was the workaround. The problem here is that there are
>> unaligned zones but no pfn_valid() implementation that can identify
>> them as you'd have with SPARSEMEM. My expectation is that you are using
>> the pfn_valid() implementation from asm-generic
>>
>> #define pfn_valid(pfn) ((pfn) < max_mapnr)
>>
>> which is insufficient in your case.
>>
>
> I am actually using the pfn_valid() implementation for FLATMEM in
> arch/arm/include/asm/memory.h. It is very similar to the asm-generic
> one, and has no knowledge of the holes.
>

The same problem applies, so.

>> I think it's more likely that the memmap he is accessing has been
>> freed and is effectively random data.
>>
>>
>
> I also think this is the case.
>
>> SPARSEMEM would give you an implementation of pfn_valid() that you could
>> use here. The choices that spring to mind are:
>>
>> 1. reduce MAX_ORDER so the banks are aligned (easiest)
>>
>
> Is it safe to assume that reducing MAX_ORDER will hurt performance?
>

No, it does not necessarily reduce performance. In some circumstances it
might even help although I wouldn't chase after it.

Downside one is that some hash tables might get hurt if you have a
very large amount of memory (look for "hash table entries:" in dmesg after
booting to see what order is being used).

Downside two is that if some drivers require large contiguous memory
early in boot, they might be hurt by MAX_ORDER being lower. If you
require CONFIG_HUGETLB_PAGE, it might not be possible to reduce
MAX_ORDER depending on the size of the huge page.

>> 2. use SPARSEMEM (easy, but not necessarily what you want to do; might
>> waste memory unless you drop MAX_ORDER as well)
>>
>
> We intend to use SPARSEMEM, but we'd also like to maintain FLATMEM
> compatibility for some configurations. My guess is that there are other
> ARM users that may want this support as well.
>
>> 3. implement a pfn_valid() that can handle the holes and set
>> CONFIG_HOLES_IN_ZONE so it's called in move_freepages() to
>> deal with the holes (should pass this by someone more familiar
>> with ARM than I)
>>
>
> This option seems the best to me. We should be able to implement an ARM
> specific pfn_valid() that walks the ARM meminfo struct to ensure the pfn
> is not within a hole.
>

Be sure to check your performance before and after. pfn_valid_within()
is used in a fair few places and you are likely enabling it.

> My only concern with this is a comment in __rmqueue_fallback() after
> calling move_freepages_block() that states "Claim the whole block if
> over half of it is free". Suppose only 1 MB is beyond the bank limit.
> That means that over half of the pages of the 4 MB block will be
> reported by move_freepages() as free -- but 1 MB of those pages are
> invalid. Won't this cause problems if these pages are assumed to be
> part of an active block?
>

The only operation taking place there is updating a bitmap so I doubt
you'll hit snags there.

> It seems like we should have an additional check in
> move_freepages_block() using pfn_valid_within() on the last page in the
> block (i.e. end_pfn) before calling move_freepages(). If the last page
> is not valid, then shouldn't we return 0, as in the zone span check?
> This would also skip the extra burden of checking each individual page
> when we already know the proposed range is invalid.
>

You don't know where the holes are going to be, so the code is paranoid
rather than making assumptions about where architectures put holes.

> Assuming we did return 0 in this case, would that sub-block of pages
> ever be usable for anything else, or would it be effectively wasted?

They're still usable.

> If
> this memory were wasted, then adjusting MAX_ORDER would have an
> advantage in this sense -- ignoring any performance implications.
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab