2008-03-14 14:38:53

by Yasunori Goto

Subject: [PATCH 0/3 (RFC)](memory hotplug) freeing pages allocated by bootmem for hotremove


Hello.

I would like to post a patch set to free pages that were allocated by bootmem,
for memory hot-remove.

My basic idea is to use the otherwise unused members of struct page to record
information about the bootmem user (section number or node id). When a section
is being removed, the kernel can check this information, and several issues
can be solved with it.

1) When the memmap of a removing section was allocated on another
section by bootmem, it should (and can) be freed.
2) When the memmap of a removing section was allocated on the
same section, it must not be freed: the section has already been
offlined and all of its pages are isolated from the page
allocator, so the kernel keeps it as is.
3) When a removing section holds another section's memmap, the
kernel can easily show the user which section must be removed
first. (Not implemented yet.)
4) In case 2) above, the page migrator can recognize the memmap and
skip it during page isolation at offline time. Current page
migration fails here because the memmap is just a reserved page
and cannot be distinguished from pages that are not removable;
with this patch it can be. (Not implemented yet.)
5) Node information such as pgdat has similar issues, which can be
solved the same way. (Not implemented yet, but the node id is
already remembered in the pages.)

Fortunately, the current bootmem allocator only sets the PageReserved flag
and does not use any other members of struct page; neither do the users
of bootmem.
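The trick can be modeled in plain C. Below is a hedged userland sketch, not
the kernel code: struct fake_page and the helper names are stand-ins I made
up for illustration; only the magic values are taken from patch 1/3.

```c
/* Userland sketch of the idea (NOT kernel code): while a bootmem page is
 * reserved, its _mapcount and private fields are otherwise unused, so
 * they can carry a magic tag plus the section number or node id. */

#define SECTION_MAGIC    0xfffffffeU   /* values from patch 1/3 */
#define NODE_INFO_MAGIC  0xfffffffdU
#define SECTION_INFO 0
#define NODE_INFO    1

struct fake_page {
    unsigned int mapcount;   /* stands in for page->_mapcount */
    unsigned long private;   /* stands in for page->private   */
};

static void set_bootmem_info(struct fake_page *p, unsigned long info,
                             unsigned long flag)
{
    p->mapcount = (flag == SECTION_INFO) ? SECTION_MAGIC : NODE_INFO_MAGIC;
    p->private = info;       /* section number or node id */
}

/* Returns 0 on success, -1 if the page was never tagged (the kernel
 * version in patch 1/3 would BUG() instead). */
static int clear_bootmem_info(struct fake_page *p)
{
    if (p->mapcount != SECTION_MAGIC && p->mapcount != NODE_INFO_MAGIC)
        return -1;
    p->private = 0;
    p->mapcount = 0;
    return 0;
}
```

At hot-remove time, reading back the tag tells the kernel which section or
node owned the page, without needing any extra per-page storage.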

This patch set needs Badari-san's generic __remove_pages() support patch.
http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-03/msg02881.html

This patch set is not perfect: some of the section/node structures are
smaller than one page, so the bootmem allocator may mix other data into
the same page. The patch is still a trial, but I think it is a good
starting point for everyone to understand what is necessary.

Comments are welcome.

Other Todo:
- SPARSEMEM_VMEMMAP support.
Freeing vmemmap pages is more difficult than with normal sparsemem,
because not only the memmap pages but also pages such as page tables
must be removed. If a removing section holds pages for page tables,
they must be migrated too, so relocatable page tables are necessary.

- Compile with other configs.
This version is just a request for comments.
If this approach is accepted, I'll check the other configurations.
- Follow Yinghai Lu-san's bootmem fixes.
(This patch set is still based on 2.6.25-rc3-mm1 with Badari-san's patch.)

Thanks.



--
Yasunori Goto


2008-03-14 14:46:48

by Yasunori Goto

Subject: [PATCH 2/3 (RFC)](memory hotplug) free pages allocated by bootmem for hotremove


This patch frees the memmap and usemap by using the registered information.

Signed-off-by: Yasunori Goto <[email protected]>

---
mm/internal.h | 3 +--
mm/page_alloc.c | 2 +-
mm/sparse.c | 47 +++++++++++++++++++++++++++++++++++++++++------
3 files changed, 43 insertions(+), 9 deletions(-)

Index: current/mm/sparse.c
===================================================================
--- current.orig/mm/sparse.c 2008-03-10 22:24:46.000000000 +0900
+++ current/mm/sparse.c 2008-03-10 22:31:03.000000000 +0900
@@ -8,6 +8,7 @@
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/vmalloc.h>
+#include "internal.h"
#include <asm/dma.h>
#include <asm/pgalloc.h>
#include <asm/pgtable.h>
@@ -361,28 +362,62 @@
free_pages((unsigned long)memmap,
get_order(sizeof(struct page) * nr_pages));
}
+
+static void free_maps_by_bootmem(struct page *map, unsigned long nr_pages)
+{
+ unsigned long maps_section_nr, removing_section_nr, i;
+ struct page *page = map;
+
+ for (i = 0; i < nr_pages; i++, page++) {
+ maps_section_nr = pfn_to_section_nr(page_to_pfn(page));
+ removing_section_nr = page->private;
+
+ /*
+ * If the removing section's memmap is placed on another section,
+ * it must be freed.
+ * Otherwise nothing is necessary: the memmap is already isolated
+ * from the page allocator and is not used any more.
+ */
+ if (maps_section_nr != removing_section_nr) {
+ clear_page_bootmem_info(page);
+ __free_pages_bootmem(page, 0);
+ }
+ }
+}
#endif /* CONFIG_SPARSEMEM_VMEMMAP */

static void free_section_usemap(struct page *memmap, unsigned long *usemap)
{
+ struct page *usemap_page;
+ unsigned long nr_pages;
+
if (!usemap)
return;

+ usemap_page = virt_to_page(usemap);
/*
* Check to see if allocation came from hot-plug-add
*/
- if (PageSlab(virt_to_page(usemap))) {
+ if (PageSlab(usemap_page)) {
kfree(usemap);
if (memmap)
__kfree_section_memmap(memmap, PAGES_PER_SECTION);
return;
}

- /*
- * TODO: Allocations came from bootmem - how do I free up ?
- */
- printk(KERN_WARNING "Not freeing up allocations from bootmem "
- "- leaking memory\n");
+ /* free maps came from bootmem */
+ nr_pages = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
+ free_maps_by_bootmem(usemap_page, nr_pages);
+
+ if (memmap) {
+ struct page *memmap_page;
+ memmap_page = virt_to_page(memmap);
+
+ nr_pages = PAGE_ALIGN(PAGES_PER_SECTION * sizeof(struct page))
+ >> PAGE_SHIFT;
+
+ free_maps_by_bootmem(memmap_page, nr_pages);
+ }
}

/*
Index: current/mm/page_alloc.c
===================================================================
--- current.orig/mm/page_alloc.c 2008-03-10 22:24:46.000000000 +0900
+++ current/mm/page_alloc.c 2008-03-10 22:29:20.000000000 +0900
@@ -564,7 +564,7 @@
/*
* permit the bootmem allocator to evade page validation on high-order frees
*/
-void __init __free_pages_bootmem(struct page *page, unsigned int order)
+void __free_pages_bootmem(struct page *page, unsigned int order)
{
if (order == 0) {
__ClearPageReserved(page);
Index: current/mm/internal.h
===================================================================
--- current.orig/mm/internal.h 2008-03-10 22:24:46.000000000 +0900
+++ current/mm/internal.h 2008-03-10 22:29:20.000000000 +0900
@@ -34,8 +34,7 @@
atomic_dec(&page->_count);
}

-extern void __init __free_pages_bootmem(struct page *page,
- unsigned int order);
+extern void __free_pages_bootmem(struct page *page, unsigned int order);

/*
* function for dealing with page's order in buddy system.

--
Yasunori Goto

2008-03-14 14:49:10

by Yasunori Goto

Subject: [PATCH 1/3 (RFC)](memory hotplug) remember section_nr and node id for removing


This patch registers the information needed to be able to remove section
and node structures later.

Signed-off-by: Yasunori Goto <[email protected]>

include/linux/memory_hotplug.h | 10 ++++
include/linux/mmzone.h | 1
mm/bootmem.c | 1
mm/memory_hotplug.c | 97 ++++++++++++++++++++++++++++++++++++++++-
mm/sparse.c | 3 -
5 files changed, 109 insertions(+), 3 deletions(-)

Index: current/mm/bootmem.c
===================================================================
--- current.orig/mm/bootmem.c 2008-03-10 16:42:54.000000000 +0900
+++ current/mm/bootmem.c 2008-03-10 22:24:46.000000000 +0900
@@ -401,6 +401,7 @@

unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
{
+ register_page_bootmem_info_node(pgdat);
return free_all_bootmem_core(pgdat);
}

Index: current/include/linux/memory_hotplug.h
===================================================================
--- current.orig/include/linux/memory_hotplug.h 2008-03-10 16:42:54.000000000 +0900
+++ current/include/linux/memory_hotplug.h 2008-03-10 16:42:57.000000000 +0900
@@ -11,6 +11,11 @@
struct mem_section;

#ifdef CONFIG_MEMORY_HOTPLUG
+
+#define SECTION_MAGIC 0xfffffffe
+#define NODE_INFO_MAGIC 0xfffffffd
+#define SECTION_INFO 0
+#define NODE_INFO 1
/*
* pgdat resizing functions
*/
@@ -145,6 +150,9 @@
#endif /* CONFIG_NUMA */
#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */

+extern void register_page_bootmem_info_node(struct pglist_data *pgdat);
+extern void clear_page_bootmem_info(struct page *page);
+
#else /* ! CONFIG_MEMORY_HOTPLUG */
/*
* Stub functions for when hotplug is off
@@ -192,5 +200,7 @@
extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
int nr_pages);
extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms);
+extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
+ unsigned long pnum);

#endif /* __LINUX_MEMORY_HOTPLUG_H */
Index: current/include/linux/mmzone.h
===================================================================
--- current.orig/include/linux/mmzone.h 2008-03-10 16:42:54.000000000 +0900
+++ current/include/linux/mmzone.h 2008-03-10 16:42:57.000000000 +0900
@@ -938,6 +938,7 @@
return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
}
extern int __section_nr(struct mem_section* ms);
+extern unsigned long usemap_size(void);

/*
* We use the lower bits of the mem_map pointer to store
Index: current/mm/memory_hotplug.c
===================================================================
--- current.orig/mm/memory_hotplug.c 2008-03-10 16:42:54.000000000 +0900
+++ current/mm/memory_hotplug.c 2008-03-10 22:22:25.000000000 +0900
@@ -59,8 +59,103 @@
return;
}

-
#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
+static void set_page_bootmem_info(unsigned long info, struct page *page,
+ unsigned long flag)
+{
+
+ if (flag == SECTION_INFO)
+ atomic_set(&page->_mapcount, SECTION_MAGIC);
+ else
+ atomic_set(&page->_mapcount, NODE_INFO_MAGIC);
+
+ SetPagePrivate(page);
+ set_page_private(page, info);
+
+}
+
+void clear_page_bootmem_info(struct page *page)
+{
+ int magic;
+
+ magic = atomic_read(&page->_mapcount);
+ if (magic != SECTION_MAGIC && magic != NODE_INFO_MAGIC)
+ BUG();
+
+ ClearPagePrivate(page);
+ set_page_private(page, 0);
+ reset_page_mapcount(page);
+}
+
+void register_page_bootmem_info_section(unsigned long start_pfn)
+{
+ unsigned long *usemap, mapsize, section_nr, i;
+ struct page *page, *memmap;
+
+ if (!pfn_valid(start_pfn))
+ return;
+
+ section_nr = pfn_to_section_nr(start_pfn);
+
+ memmap = pfn_to_page(start_pfn); /* memmap for the section */
+
+ /*
+ * Get page for the memmap's phys address
+ * XXX: need more consideration for sparse_vmemmap...
+ */
+ page = virt_to_page(memmap);
+ mapsize = sizeof(struct page) * PAGES_PER_SECTION;
+ mapsize = PAGE_ALIGN(mapsize) >> PAGE_SHIFT;
+
+ /* remember memmap's page */
+ for (i = 0; i < mapsize; i++, page++)
+ set_page_bootmem_info(section_nr, page, SECTION_INFO);
+
+ usemap = __nr_to_section(section_nr)->pageblock_flags;
+ page = virt_to_page(usemap);
+
+ mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
+
+ for (i = 0; i < mapsize; i++, page++)
+ set_page_bootmem_info(section_nr, page, SECTION_INFO);
+
+}
+
+void register_page_bootmem_info_node(struct pglist_data *pgdat)
+{
+ unsigned long i, pfn, end_pfn, nr_pages;
+ int node = pgdat->node_id;
+ struct page *page;
+ struct zone *zone;
+
+ nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
+ page = virt_to_page(pgdat);
+
+ for (i = 0; i < nr_pages; i++, page++)
+ set_page_bootmem_info(node, page, NODE_INFO);
+
+ zone = &pgdat->node_zones[0];
+ for (; zone < pgdat->node_zones + MAX_NR_ZONES - 1; zone++) {
+ if (zone->wait_table) {
+ nr_pages = zone->wait_table_hash_nr_entries
+ * sizeof(wait_queue_head_t);
+ nr_pages = PAGE_ALIGN(nr_pages) >> PAGE_SHIFT;
+ page = virt_to_page(zone->wait_table);
+
+ for (i = 0; i < nr_pages; i++, page++)
+ set_page_bootmem_info(node, page, NODE_INFO);
+ }
+ }
+
+ pfn = pgdat->node_start_pfn;
+ end_pfn = pfn + pgdat->node_spanned_pages;
+
+ /* register_section info */
+ for (; pfn < end_pfn; pfn += PAGES_PER_SECTION)
+ register_page_bootmem_info_section(pfn);
+
+}
+
static int __add_zone(struct zone *zone, unsigned long phys_start_pfn)
{
struct pglist_data *pgdat = zone->zone_pgdat;
Index: current/mm/sparse.c
===================================================================
--- current.orig/mm/sparse.c 2008-03-10 16:42:54.000000000 +0900
+++ current/mm/sparse.c 2008-03-10 22:24:46.000000000 +0900
@@ -200,7 +200,6 @@
/*
* Decode mem_map from the coded memmap
*/
-static
struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pnum)
{
/* mask off the extra low bits of information */
@@ -223,7 +222,7 @@
return 1;
}

-static unsigned long usemap_size(void)
+unsigned long usemap_size(void)
{
unsigned long size_bytes;
size_bytes = roundup(SECTION_BLOCKFLAGS_BITS, 8) / 8;

--
Yasunori Goto

2008-03-14 14:53:31

by Yasunori Goto

Subject: [PATCH 3/3 (RFC)](memory hotplug) align maps for easy removing


To make freeing the memmap and usemap easier, this patch aligns both maps
to page size.

I know the usemap is much smaller than a page, so aligning it wastes
memory; there may be a better way than this.

Below are the pros and cons of my alternative ideas.
I'm not sure which way is better.

a) Pack many sections' usemaps into one page, and count how many sections
use it in page_count.
Pros.
- Avoids wasting memory.
Cons.
- The usemap's page becomes hard (or impossible) to remove due to
the dependency, so it should be allocated on an unmovable zone/node.
(I'm not sure about the performance impact.)
- The node structures might have to be packed like the usemap???

b) Pack the memmap and usemap into one allocation.
Pros.
- May avoid wasting memory if the sizes fit.
Cons.
- If the sizes do not fit, it is the same as this patch.
- This way is not good for SPARSEMEM_VMEMMAP.
At least, it goes against Yinghai-san's fix.

c) This patch's way.
Pros.
- Very easy to remove.
Cons.
- Wastes memory.

Any other idea is welcome.
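To make the trade-off in (c) concrete, here is a small userland sketch of
the per-section overhead. The constants are assumptions for illustration
(4 KiB pages; a ~24-byte usemap), and PAGE_ALIGN just mirrors the kernel
macro of the same name:

```c
/* Userland sketch with hypothetical constants: 4 KiB pages. */
#define PAGE_SIZE 4096UL
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

/* Bytes wasted per section when a usemap is rounded up to a full page,
 * as approach (c) does. */
static unsigned long usemap_waste(unsigned long usemap_size)
{
    return PAGE_ALIGN(usemap_size) - usemap_size;
}
```

With a 24-byte usemap this wastes 4072 of every 4096 bytes per section,
which is why approaches (a) and (b) are worth considering.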


Signed-off-by: Yasunori Goto <[email protected]>

---
mm/sparse.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

Index: current/mm/sparse.c
===================================================================
--- current.orig/mm/sparse.c 2008-03-11 20:15:41.000000000 +0900
+++ current/mm/sparse.c 2008-03-11 20:58:18.000000000 +0900
@@ -244,7 +244,8 @@
struct mem_section *ms = __nr_to_section(pnum);
int nid = sparse_early_nid(ms);

- usemap = alloc_bootmem_node(NODE_DATA(nid), usemap_size());
+ usemap = alloc_bootmem_pages_node(NODE_DATA(nid),
+ PAGE_ALIGN(usemap_size()));
if (usemap)
return usemap;

@@ -264,8 +265,8 @@
if (map)
return map;

- map = alloc_bootmem_node(NODE_DATA(nid),
- sizeof(struct page) * PAGES_PER_SECTION);
+ map = alloc_bootmem_pages_node(NODE_DATA(nid),
+ PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION));
return map;
}
#endif /* !CONFIG_SPARSEMEM_VMEMMAP */

--
Yasunori Goto

2008-03-14 16:26:18

by Yinghai Lu

Subject: Re: [PATCH 3/3 (RFC)](memory hotplug) align maps for easy removing

On Fri, Mar 14, 2008 at 7:44 AM, Yasunori Goto <[email protected]> wrote:
> [...]
> Index: current/mm/sparse.c
> ===================================================================
> --- current.orig/mm/sparse.c 2008-03-11 20:15:41.000000000 +0900
> +++ current/mm/sparse.c 2008-03-11 20:58:18.000000000 +0900
> @@ -244,7 +244,8 @@
> struct mem_section *ms = __nr_to_section(pnum);
> int nid = sparse_early_nid(ms);
>
> - usemap = alloc_bootmem_node(NODE_DATA(nid), usemap_size());
> + usemap = alloc_bootmem_pages_node(NODE_DATA(nid),
> + PAGE_ALIGN(usemap_size()));

If we allocate usemaps contiguously, the old way lets different usemaps
share one page. A usemap is only about 24 bytes, and bootmem aligns it
to 128 bytes (the SMP cache line size):

sparse_early_usemap_alloc: usemap = ffff810024e00000 size = 24
sparse_early_usemap_alloc: usemap = ffff810024e00080 size = 24
sparse_early_usemap_alloc: usemap = ffff810024e00100 size = 24
sparse_early_usemap_alloc: usemap = ffff810024e00180 size = 24
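The spacing in the log above can be reproduced with a tiny userland model.
Assumptions for illustration only (not the actual bootmem code): 128-byte
SMP cache lines and 4 KiB pages.

```c
/* Userland model of the layout observed above: bootmem returns
 * cache-line-aligned chunks, so each ~24-byte usemap occupies one
 * 128-byte slot, and 32 of them can share a single 4 KiB page. */
#define SMP_CACHE_BYTES 128UL
#define PAGE_SIZE 4096UL

/* Address of the next usemap when they are allocated back to back:
 * each one rounds up to a full cache line. */
static unsigned long next_usemap(unsigned long prev)
{
    return prev + SMP_CACHE_BYTES;
}

static unsigned long usemaps_per_page(void)
{
    return PAGE_SIZE / SMP_CACHE_BYTES;
}
```

This matches the ...000/...080/...100/...180 progression in the log: the
usemaps are 0x80 bytes apart, all within one page.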


YH

2008-03-15 04:13:25

by Yasunori Goto

Subject: Re: [PATCH 3/3 (RFC)](memory hotplug) align maps for easy removing

> > Index: current/mm/sparse.c
> > ===================================================================
> > --- current.orig/mm/sparse.c 2008-03-11 20:15:41.000000000 +0900
> > +++ current/mm/sparse.c 2008-03-11 20:58:18.000000000 +0900
> > @@ -244,7 +244,8 @@
> > struct mem_section *ms = __nr_to_section(pnum);
> > int nid = sparse_early_nid(ms);
> >
> > - usemap = alloc_bootmem_node(NODE_DATA(nid), usemap_size());
> > + usemap = alloc_bootmem_pages_node(NODE_DATA(nid),
> > + PAGE_ALIGN(usemap_size()));
>
> If we allocate usemaps contiguously, the old way lets different usemaps
> share one page. A usemap is only about 24 bytes, and bootmem aligns it
> to 128 bytes (the SMP cache line size):
>
> sparse_early_usemap_alloc: usemap = ffff810024e00000 size = 24
> sparse_early_usemap_alloc: usemap = ffff810024e00080 size = 24
> sparse_early_usemap_alloc: usemap = ffff810024e00100 size = 24
> sparse_early_usemap_alloc: usemap = ffff810024e00180 size = 24


Yes, they can share one page.

I was afraid that such a page would be hard to remove.
If all sections' usemaps are allocated on section A, then the other
sections (B through Z) must be removed before section A; if even one of
them is busy, section A cannot be removed. So I disliked that dependency.

But I reconsidered it after reading your mail.
The node structures such as pgdat have the same property: if a section
holds the pgdat for its node, it must wait for the other sections on the
node to be removed first. So, I'll try to keep the pgdat and the shared
usemap page on the same section.
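The dependency can be put in code. This is only a toy userland model of
the reasoning, with made-up types and names, not anything from the patches:
a section stays pinned while any other live section keeps its usemap there.

```c
/* Toy model of the removal dependency: section 'a' cannot be removed
 * while another still-present section stores its usemap in section a's
 * memory.  All names here are hypothetical. */
#define NR_SECTIONS 8

struct toy_section {
    int present;     /* section still populated? */
    int usemap_on;   /* index of the section holding this usemap */
};

/* 1 if section 'a' holds no foreign usemaps and may be removed. */
static int section_removable(const struct toy_section s[], int a)
{
    for (int i = 0; i < NR_SECTIONS; i++)
        if (i != a && s[i].present && s[i].usemap_on == a)
            return 0;   /* another section's usemap pins section a */
    return 1;
}
```

Packing the shared usemap page next to the pgdat means both pins resolve
at the same time: the section holding them is the natural last one out.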

Anyway, thanks for your comments.

Bye.

--
Yasunori Goto

2008-03-21 16:05:47

by Badari Pulavarty

Subject: Re: [PATCH 0/3 (RFC)](memory hotplug) freeing pages allocated by bootmem for hotremove

On Fri, 2008-03-14 at 23:36 +0900, Yasunori Goto wrote:
> Hello.
>
> I would like to post patch set to free pages which is allocated by bootmem
> for memory-hotremove.
> [...]

Do you have any updates on this? I am getting the following boot panic
while testing it. Before I debug it, I want to make sure it's not
already fixed. Please let me know.

Thanks,
Badari

Linux version 2.6.25-rc5-mm1 (root@elm3b155) (gcc version 3.3.3 (SuSE Linux)) #2 SMP Fri Mar 21 07:48:29 PST 2008
[boot]0012 Setup Arch
NUMA associativity depth for CPU/Memory: 3
adding cpu 0 to node 0
node 0
NODE_DATA() = c000000071fea100
start_paddr = 0
end_paddr = 72000000
bootmap_paddr = 71fdb000
reserve_bootmem 0 7cc000
reserve_bootmem 23d0000 10000
reserve_bootmem 77b6000 84a000
reserve_bootmem 71fdb000 f000
reserve_bootmem 71fea100 1e00
reserve_bootmem 71febf68 14098
PCI host bridge /pci@800000020000002 ranges:
IO 0x000003fe00200000..0x000003fe002fffff -> 0x0000000000000000
MEM 0x0000040080000000..0x00000400bfffffff -> 0x00000000c0000000
PCI host bridge /pci@800000020000003 ranges:
IO 0x000003fe00700000..0x000003fe007fffff -> 0x0000000000000000
MEM 0x00000401c0000000..0x00000401ffffffff -> 0x00000000c0000000
EEH: PCI Enhanced I/O Error Handling Enabled
PPC64 nvram contains 7168 bytes
Zone PFN ranges:
DMA 0 -> 466944
Normal 466944 -> 466944
Movable zone start PFN for each node
Node 0: 262144
early_node_map[1] active PFN ranges
0: 0 -> 466944
[boot]0015 Setup Done
Built 1 zonelists in Node order, mobility grouping on. Total pages: 451440
Policy zone: DMA
Kernel command line: root=/dev/sda3 selinux=0 elevator=cfq numa=debug kernelcore=1024M
[boot]0020 XICS Init
[boot]0021 XICS Done
PID hash table entries: 4096 (order: 12, 32768 bytes)
clocksource: timebase mult[1352e86] shift[22] registered
Console: colour dummy device 80x25
console handover: boot [udbg-1] -> real [hvc0]
Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
freeing bootmem node 0
Unable to handle kernel paging request for data at address 0xcf7f80000000000c
Faulting instruction address: 0xc0000000000ce3e8
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=32 NUMA pSeries
Modules linked in:
NIP: c0000000000ce3e8 LR: c0000000000cf714 CTR: 800000000013f270
REGS: c0000000007639f0 TRAP: 0300 Not tainted (2.6.25-rc5-mm1)
MSR: 8000000000009032 <EE,ME,IR,DR> CR: 44002022 XER: 00000008
DAR: cf7f80000000000c, DSISR: 0000000042010000
TASK = c000000000689910[0] 'swapper' THREAD: c000000000760000 CPU: 0
GPR00: fffffffffffffffd c000000000763c70 c000000000761be0 0000000000000000
GPR04: cf7f800000000000 0000000000000000 0000000000000000 0000000000000001
GPR08: 0000000000000000 fffffffffffffffe 0000000000000088 cf00000000000000
GPR12: 0000000000004000 c00000000068a380 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 4000000001c00000 0000000000000000 0000000002241ed8 0000000000000000
GPR24: 0000000002242148 0000000000000000 c000000071feb000 0000000000000000
GPR28: c000000071feb000 0000000000000001 c0000000006e2bd8 cf7f800000000000
NIP [c0000000000ce3e8] .set_page_bootmem_info+0x10/0x38
LR [c0000000000cf714] .register_page_bootmem_info_section+0xc4/0x17c
Call Trace:
[c000000000763c70] [000000000000001a] 0x1a (unreliable)
[c000000000763d10] [c0000000000cf8f0] .register_page_bootmem_info_node+0x124/0x158
[c000000000763dc0] [c0000000006290e4] .free_all_bootmem_node+0x1c/0x3c
[c000000000763e50] [c00000000061d618] .mem_init+0xbc/0x260
[c000000000763ee0] [c00000000060bbcc] .start_kernel+0x2f4/0x3f4
[c000000000763f90] [c000000000008594] .start_here_common+0x54/0xc0
Instruction dump:
eb61ffd8 eb81ffe0 eba1ffe8 7c0803a6 ebc1fff0 ebe1fff8 7d808120 4e800020
2fa50000 3920fffe 3800fffd 409e000c <9124000c> 48000008 9004000c 38000800
---[ end trace 31fd0ba7d8756001 ]---
Kernel panic - not syncing: Attempted to kill the idle task!

2008-03-22 00:12:49

by Yasunori Goto

[permalink] [raw]
Subject: Re: [PATCH 0/3 (RFC)](memory hotplug) freeing pages allocated by bootmem for hotremove

>
> Do you have any updates to this. I am getting following boot panic while
> testing this. Before I debug it, I want to make sure its not already
> fixed. Please let me know.

Hmmmm. No, I don't. Could you debug it?
This may come from the powerpc environment.

Thanks.



--
Yasunori Goto

2008-03-26 21:08:48

by Badari Pulavarty

Subject: Re: [PATCH 0/3 (RFC)](memory hotplug) freeing pages allocated by bootmem for hotremove

On Sat, 2008-03-22 at 09:09 +0900, Yasunori Goto wrote:
> >
> > Do you have any updates to this. I am getting following boot panic while
> > testing this. Before I debug it, I want to make sure its not already
> > fixed. Please let me know.
>
> Hmmmm. No, I don't. Could you debug it?
> This may come from powerpc environment.
>
> Thanks.
>
>

Okay, it's an issue with CONFIG_SPARSEMEM_VMEMMAP=y.

I disabled it for now.

Thanks,
Badari