When memory add was merged into mainline in 2.6.14, there were
various bits and pieces missing that prevent it from working on
ppc64. The following patches are against 2.6.14-git7 and address
all but one of the known issues.
1) Create hptes for new sections
2) Clear page count before freeing new pages
3) Kludge to add new memory to node 0
4) Ensure probe file is created for memory add via sysfs
--
Mike
Add the create_section_mapping() routine to create hptes for memory
sections dynamically added after system boot.
Signed-off-by: Mike Kravetz <[email protected]>
diff -Naupr linux-2.6.14-git7/arch/powerpc/mm/hash_utils_64.c linux-2.6.14-git7.work/arch/powerpc/mm/hash_utils_64.c
--- linux-2.6.14-git7/arch/powerpc/mm/hash_utils_64.c 2005-11-04 21:21:05.000000000 +0000
+++ linux-2.6.14-git7.work/arch/powerpc/mm/hash_utils_64.c 2005-11-04 22:05:06.000000000 +0000
@@ -176,6 +176,15 @@ static unsigned long get_hashtable_size(
return pteg_count << 7;
}
+#ifdef CONFIG_MEMORY_HOTPLUG
+void create_section_mapping(unsigned long start, unsigned long end)
+{
+ create_pte_mapping(start, end,
+ _PAGE_ACCESSED | _PAGE_COHERENT | PP_RWXX,
+ cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE ? 1 : 0);
+}
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
void __init htab_initialize(void)
{
unsigned long table, htab_size_bytes;
diff -Naupr linux-2.6.14-git7/arch/powerpc/mm/mem.c linux-2.6.14-git7.work/arch/powerpc/mm/mem.c
--- linux-2.6.14-git7/arch/powerpc/mm/mem.c 2005-11-04 21:21:05.000000000 +0000
+++ linux-2.6.14-git7.work/arch/powerpc/mm/mem.c 2005-11-04 22:05:06.000000000 +0000
@@ -124,6 +124,9 @@ int __devinit add_memory(u64 start, u64
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
+ start += KERNELBASE;
+ create_section_mapping(start, start + size);
+
/* this should work for most non-highmem platforms */
zone = pgdata->node_zones;
diff -Naupr linux-2.6.14-git7/include/asm-ppc64/sparsemem.h linux-2.6.14-git7.work/include/asm-ppc64/sparsemem.h
--- linux-2.6.14-git7/include/asm-ppc64/sparsemem.h 2005-10-28 00:02:08.000000000 +0000
+++ linux-2.6.14-git7.work/include/asm-ppc64/sparsemem.h 2005-11-04 22:05:06.000000000 +0000
@@ -11,6 +11,10 @@
#define MAX_PHYSADDR_BITS 38
#define MAX_PHYSMEM_BITS 36
+#ifdef CONFIG_MEMORY_HOTPLUG
+extern void create_section_mapping(unsigned long start, unsigned long end);
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
#endif /* CONFIG_SPARSEMEM */
#endif /* _ASM_PPC64_SPARSEMEM_H */
memmap_init_zone() sets page count to 1. Before 'freeing' the
page, we need to clear the count. This is the same as what is done
in free_all_bootmem_core() for memory discovered at boot time.
Signed-off-by: Mike Kravetz <[email protected]>
diff -Naupr linux-2.6.14-git7/arch/powerpc/mm/mem.c linux-2.6.14-git7.work/arch/powerpc/mm/mem.c
--- linux-2.6.14-git7/arch/powerpc/mm/mem.c 2005-11-04 21:21:05.000000000 +0000
+++ linux-2.6.14-git7.work/arch/powerpc/mm/mem.c 2005-11-04 22:09:59.000000000 +0000
@@ -107,6 +107,7 @@ EXPORT_SYMBOL(phys_mem_access_prot);
void online_page(struct page *page)
{
ClearPageReserved(page);
+ set_page_count(page, 0);
free_cold_page(page);
totalram_pages++;
num_physpages++;
This is a temporary kludge that supports adding all new memory to
node 0. I will provide a more complete solution similar to that
used for dynamically added CPUs in a few days.
Signed-off-by: Mike Kravetz <[email protected]>
diff -Naupr linux-2.6.14-git7/include/asm-ppc64/mmzone.h linux-2.6.14-git7.work/include/asm-ppc64/mmzone.h
--- linux-2.6.14-git7/include/asm-ppc64/mmzone.h 2005-11-04 21:21:09.000000000 +0000
+++ linux-2.6.14-git7.work/include/asm-ppc64/mmzone.h 2005-11-04 22:10:44.000000000 +0000
@@ -33,6 +33,9 @@ extern int numa_cpu_lookup_table[];
extern char *numa_memory_lookup_table;
extern cpumask_t numa_cpumask_lookup_table[];
extern int nr_cpus_in_node[];
+#ifdef CONFIG_MEMORY_HOTPLUG
+extern unsigned long max_pfn;
+#endif
/* 16MB regions */
#define MEMORY_INCREMENT_SHIFT 24
@@ -45,6 +48,11 @@ static inline int pa_to_nid(unsigned lon
{
int nid;
+#ifdef CONFIG_MEMORY_HOTPLUG
+ /* kludge hot added sections default to node 0 */
+ if (pa >= (max_pfn << PAGE_SHIFT))
+ return 0;
+#endif
nid = numa_memory_lookup_table[pa >> MEMORY_INCREMENT_SHIFT];
#ifdef DEBUG_NUMA
ppc64 needs a special sysfs probe file for adding new memory.
Signed-off-by: Mike Kravetz <[email protected]>
diff -Naupr linux-2.6.14-git7/arch/ppc64/Kconfig linux-2.6.14-git7.work/arch/ppc64/Kconfig
--- linux-2.6.14-git7/arch/ppc64/Kconfig 2005-11-04 21:21:06.000000000 +0000
+++ linux-2.6.14-git7.work/arch/ppc64/Kconfig 2005-11-04 22:11:16.000000000 +0000
@@ -277,6 +277,10 @@ config HAVE_ARCH_EARLY_PFN_TO_NID
def_bool y
depends on NEED_MULTIPLE_NODES
+config ARCH_MEMORY_PROBE
+ def_bool y
+ depends on MEMORY_HOTPLUG
+
# Some NUMA nodes have memory ranges that span
# other nodes. Even though a pfn is valid and
# between a node's start and end pfns, it may not
On Fri, 2005-11-04 at 15:18 -0800, Mike Kravetz wrote:
> Add the create_section_mapping() routine to create hptes for memory
> sections dynamically added after system boot.
>
> Signed-off-by: Mike Kravetz <[email protected]>
This patch will have to be slightly reworked on top of the 64k pages
one. It should be trivial though.
Ben.
> diff -Naupr linux-2.6.14-git7/arch/powerpc/mm/hash_utils_64.c linux-2.6.14-git7.work/arch/powerpc/mm/hash_utils_64.c
> --- linux-2.6.14-git7/arch/powerpc/mm/hash_utils_64.c 2005-11-04 21:21:05.000000000 +0000
> +++ linux-2.6.14-git7.work/arch/powerpc/mm/hash_utils_64.c 2005-11-04 22:05:06.000000000 +0000
> @@ -176,6 +176,15 @@ static unsigned long get_hashtable_size(
> return pteg_count << 7;
> }
>
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +void create_section_mapping(unsigned long start, unsigned long end)
> +{
> + create_pte_mapping(start, end,
> + _PAGE_ACCESSED | _PAGE_COHERENT | PP_RWXX,
> + cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE ? 1 : 0);
> +}
> +#endif /* CONFIG_MEMORY_HOTPLUG */
> +
> void __init htab_initialize(void)
> {
> unsigned long table, htab_size_bytes;
> diff -Naupr linux-2.6.14-git7/arch/powerpc/mm/mem.c linux-2.6.14-git7.work/arch/powerpc/mm/mem.c
> --- linux-2.6.14-git7/arch/powerpc/mm/mem.c 2005-11-04 21:21:05.000000000 +0000
> +++ linux-2.6.14-git7.work/arch/powerpc/mm/mem.c 2005-11-04 22:05:06.000000000 +0000
> @@ -124,6 +124,9 @@ int __devinit add_memory(u64 start, u64
> unsigned long start_pfn = start >> PAGE_SHIFT;
> unsigned long nr_pages = size >> PAGE_SHIFT;
>
> + start += KERNELBASE;
> + create_section_mapping(start, start + size);
> +
> /* this should work for most non-highmem platforms */
> zone = pgdata->node_zones;
>
> diff -Naupr linux-2.6.14-git7/include/asm-ppc64/sparsemem.h linux-2.6.14-git7.work/include/asm-ppc64/sparsemem.h
> --- linux-2.6.14-git7/include/asm-ppc64/sparsemem.h 2005-10-28 00:02:08.000000000 +0000
> +++ linux-2.6.14-git7.work/include/asm-ppc64/sparsemem.h 2005-11-04 22:05:06.000000000 +0000
> @@ -11,6 +11,10 @@
> #define MAX_PHYSADDR_BITS 38
> #define MAX_PHYSMEM_BITS 36
>
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +extern void create_section_mapping(unsigned long start, unsigned long end);
> +#endif /* CONFIG_MEMORY_HOTPLUG */
> +
> #endif /* CONFIG_SPARSEMEM */
>
> #endif /* _ASM_PPC64_SPARSEMEM_H */
On Sat, Nov 05, 2005 at 11:04:30AM +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2005-11-04 at 15:18 -0800, Mike Kravetz wrote:
> > Add the create_section_mapping() routine to create hptes for memory
> > sections dynamically added after system boot.
> >
> > Signed-off-by: Mike Kravetz <[email protected]>
>
> This patch will have to be slightly reworked on top of the 64k pages
> one. It should be trivial though.
OK. I'll respin on top of your patch at:
http://gate.crashing.org/~benh/ppc64-64k-pages.diff
Let me know if there is a different version going upstream.
--
Mike
On Fri, 2005-11-04 at 16:35 -0800, Mike Kravetz wrote:
> On Sat, Nov 05, 2005 at 11:04:30AM +1100, Benjamin Herrenschmidt wrote:
> > On Fri, 2005-11-04 at 15:18 -0800, Mike Kravetz wrote:
> > > Add the create_section_mapping() routine to create hptes for memory
> > > sections dynamically added after system boot.
> > >
> > > Signed-off-by: Mike Kravetz <[email protected]>
> >
> > This patch will have to be slightly reworked on top of the 64k pages
> > one. It should be trivial though.
>
> OK. I'll respin on top of your patch at:
>
> http://gate.crashing.org/~benh/ppc64-64k-pages.diff
>
> Let me know if there is a different version going upstream
I'll check whether it still applies after Linus pulls the next round of
ppc updates.
Ben.
Mike Kravetz writes:
> ppc64 needs a special sysfs probe file for adding new memory.
>
> Signed-off-by: Mike Kravetz <[email protected]>
>
> diff -Naupr linux-2.6.14-git7/arch/ppc64/Kconfig linux-2.6.14-git7.work/arch/ppc64/Kconfig
> --- linux-2.6.14-git7/arch/ppc64/Kconfig 2005-11-04 21:21:06.000000000 +0000
> +++ linux-2.6.14-git7.work/arch/ppc64/Kconfig 2005-11-04 22:11:16.000000000 +0000
> @@ -277,6 +277,10 @@ config HAVE_ARCH_EARLY_PFN_TO_NID
> def_bool y
> depends on NEED_MULTIPLE_NODES
>
> +config ARCH_MEMORY_PROBE
> + def_bool y
> + depends on MEMORY_HOTPLUG
> +
Does arch/powerpc/Kconfig need a similar fix then?
Paul.
On Mon, Nov 07, 2005 at 11:59:42AM +1100, Paul Mackerras wrote:
> Mike Kravetz writes:
> > ppc64 needs a special sysfs probe file for adding new memory.
>
> Does arch/powerpc/Kconfig need a similar fix then?
Yes it does. Sorry, I haven't been paying as much attention to the
merge as I should. Here is a new version.
ppc64 needs a special sysfs probe file for adding new memory.
Signed-off-by: Mike Kravetz <[email protected]>
diff -Naupr linux-2.6.14-git7/arch/powerpc/Kconfig linux-2.6.14-git7.work/arch/powerpc/Kconfig
--- linux-2.6.14-git7/arch/powerpc/Kconfig 2005-11-04 21:21:05.000000000 +0000
+++ linux-2.6.14-git7.work/arch/powerpc/Kconfig 2005-11-07 17:32:45.000000000 +0000
@@ -569,6 +569,10 @@ config HAVE_ARCH_EARLY_PFN_TO_NID
def_bool y
depends on NEED_MULTIPLE_NODES
+config ARCH_MEMORY_PROBE
+ def_bool y
+ depends on MEMORY_HOTPLUG
+
# Some NUMA nodes have memory ranges that span
# other nodes. Even though a pfn is valid and
# between a node's start and end pfns, it may not
diff -Naupr linux-2.6.14-git7/arch/ppc64/Kconfig linux-2.6.14-git7.work/arch/ppc64/Kconfig
--- linux-2.6.14-git7/arch/ppc64/Kconfig 2005-11-04 21:21:06.000000000 +0000
+++ linux-2.6.14-git7.work/arch/ppc64/Kconfig 2005-11-07 17:31:51.000000000 +0000
@@ -277,6 +277,10 @@ config HAVE_ARCH_EARLY_PFN_TO_NID
def_bool y
depends on NEED_MULTIPLE_NODES
+config ARCH_MEMORY_PROBE
+ def_bool y
+ depends on MEMORY_HOTPLUG
+
# Some NUMA nodes have memory ranges that span
# other nodes. Even though a pfn is valid and
# between a node's start and end pfns, it may not
On Sat, Nov 05, 2005 at 11:04:30AM +1100, Benjamin Herrenschmidt wrote:
> This patch will have to be slightly reworked on top of the 64k pages
> one. It should be trivial though.
Ran into an issue with the interaction of SPARSEMEM and 64k pages.
SPARSEMEM defines the ppc64 section size to be 16MB which corresponds
to the smallest LMB size. There is a check in the SPARSEMEM code
to ensure that MAX_ORDER (actually MAX_ORDER-1) block size is not
greater than section size. Within the Kconfig file, there is this:
# We optimistically allocate largepages from the VM, so make the limit
# large enough (16MB). This badly named config option is actually
# max order + 1
config FORCE_MAX_ZONEORDER
int
depends on PPC64
default "13"
Just curious if we still want to boost MAX_ORDER like this with 64k
pages? Doesn't that make the MAX_ORDER block size 256MB in this case?
Also, not quite sure what happens if memory size (a 16 MB multiple)
does not align with a MAX_ORDER block size (a 256MB multiple in this
case). My 'guess' is that the page allocator would not use it as it
would not fit within the buddy system.
cc'ing SPARSEMEM author Andy Whitcroft.
--
Mike
> Just curious if we still want to boost MAX_ORDER like this with 64k
> pages? Doesn't that make the MAX_ORDER block size 256MB in this case?
> Also, not quite sure what happens if memory size (a 16 MB multiple)
> does not align with a MAX_ORDER block size (a 256MB multiple in this
> case). My 'guess' is that the page allocator would not use it as it
> would not fit within the buddy system.
>
> cc'ing SPARSEMEM author Andy Whitcroft.
Yes, MAX_ORDER should indeed be different. But can Kconfig do that?
That is, can the default value differ based on a Kconfig option?
I don't see how... We may have to do things differently here.
Ben.
On Tue, Nov 08, 2005 at 08:12:56AM +1100, Benjamin Herrenschmidt wrote:
> Yes, the MAX_ORDER should be different indeed. But can Kconfig do that ?
> That is have the default value be different based on a Kconfig option ?
> I don't see that ... We may have to do things differently here...
This seems to be done in other parts of the Kconfig file. Using those
as an example, this should keep the MAX_ORDER block size at 16MB.
Signed-off-by: Mike Kravetz <[email protected]>
diff -Naupr linux-2.6.14-git7.64k/arch/powerpc/Kconfig linux-2.6.14-git7.64k.work/arch/powerpc/Kconfig
--- linux-2.6.14-git7.64k/arch/powerpc/Kconfig 2005-11-07 18:38:50.000000000 +0000
+++ linux-2.6.14-git7.64k.work/arch/powerpc/Kconfig 2005-11-07 21:37:21.000000000 +0000
@@ -463,6 +463,7 @@ source "fs/Kconfig.binfmt"
config FORCE_MAX_ZONEORDER
int
depends on PPC64
+ default "9" if PPC_64K_PAGES
default "13"
config MATH_EMULATION
diff -Naupr linux-2.6.14-git7.64k/arch/ppc64/Kconfig linux-2.6.14-git7.64k.work/arch/ppc64/Kconfig
--- linux-2.6.14-git7.64k/arch/ppc64/Kconfig 2005-11-07 18:38:50.000000000 +0000
+++ linux-2.6.14-git7.64k.work/arch/ppc64/Kconfig 2005-11-07 21:36:42.000000000 +0000
@@ -56,6 +56,7 @@ config PPC_STD_MMU
# max order + 1
config FORCE_MAX_ZONEORDER
int
+ default "9" if PPC_64K_PAGES
default "13"
source "init/Kconfig"
On Sat, Nov 05, 2005 at 11:04:30AM +1100, Benjamin Herrenschmidt wrote:
> This patch will have to be slightly reworked on top of the 64k pages
> one. It should be trivial though.
Here is a new version of the patch on top of 64k page support (actually
2.6.14-git10). One filename also changed due to more merge changes.
Add the create_section_mapping() routine to create hptes for memory
sections dynamically added after system boot.
Signed-off-by: Mike Kravetz <[email protected]>
diff -Naupr linux-2.6.14-git10/arch/powerpc/mm/hash_utils_64.c linux-2.6.14-git10.work/arch/powerpc/mm/hash_utils_64.c
--- linux-2.6.14-git10/arch/powerpc/mm/hash_utils_64.c 2005-11-08 00:04:15.784924264 +0000
+++ linux-2.6.14-git10.work/arch/powerpc/mm/hash_utils_64.c 2005-11-08 00:06:46.992964608 +0000
@@ -385,6 +385,15 @@ static unsigned long __init htab_get_tab
return pteg_count << 7;
}
+#ifdef CONFIG_MEMORY_HOTPLUG
+void create_section_mapping(unsigned long start, unsigned long end)
+{
+ BUG_ON(htab_bolt_mapping(start, end, start,
+ _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_COHERENT | PP_RWXX,
+ mmu_linear_psize));
+}
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
void __init htab_initialize(void)
{
unsigned long table, htab_size_bytes;
diff -Naupr linux-2.6.14-git10/arch/powerpc/mm/mem.c linux-2.6.14-git10.work/arch/powerpc/mm/mem.c
--- linux-2.6.14-git10/arch/powerpc/mm/mem.c 2005-11-08 00:04:15.798922136 +0000
+++ linux-2.6.14-git10.work/arch/powerpc/mm/mem.c 2005-11-08 00:06:46.993964456 +0000
@@ -127,6 +127,9 @@ int __devinit add_memory(u64 start, u64
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
+ start += KERNELBASE;
+ create_section_mapping(start, start + size);
+
/* this should work for most non-highmem platforms */
zone = pgdata->node_zones;
diff -Naupr linux-2.6.14-git10/include/asm-powerpc/sparsemem.h linux-2.6.14-git10.work/include/asm-powerpc/sparsemem.h
--- linux-2.6.14-git10/include/asm-powerpc/sparsemem.h 2005-11-08 00:04:28.486988472 +0000
+++ linux-2.6.14-git10.work/include/asm-powerpc/sparsemem.h 2005-11-08 00:07:39.138891344 +0000
@@ -11,6 +11,10 @@
#define MAX_PHYSADDR_BITS 38
#define MAX_PHYSMEM_BITS 36
+#ifdef CONFIG_MEMORY_HOTPLUG
+extern void create_section_mapping(unsigned long start, unsigned long end);
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
#endif /* CONFIG_SPARSEMEM */
#endif /* _ASM_POWERPC_SPARSEMEM_H */
On Mon, 2005-11-07 at 13:48 -0800, Mike Kravetz wrote:
> On Tue, Nov 08, 2005 at 08:12:56AM +1100, Benjamin Herrenschmidt wrote:
> > Yes, the MAX_ORDER should be different indeed. But can Kconfig do that ?
> > That is have the default value be different based on a Kconfig option ?
> > I don't see that ... We may have to do things differently here...
>
> This seems to be done in other parts of the Kconfig file. Using those
> as an example, this should keep the MAX_ORDER block size at 16MB.
Ok, I verified it does the right thing with Kconfig, thanks.
Paul, can you add to the merge tree too ?
Ben.
Hi Mike,
> When memory add was merged into mainline in 2.6.14, there were
> various bits and pieces missing that prevent it from working on
> ppc64. The following patches are against 2.6.14-git7 and address
> all but one of the known issues.
>
> 1) Create hptes for new sections
> 2) Clear page count before freeing new pages
> 3) Kludge to add new memory to node 0
> 4) Ensure probe file is created for memory add via sysfs
I've got a patch that reworks our numa code and it might reject with
your stuff. I'll send them out for review this afternoon.
Anton
On Tue, Nov 08, 2005 at 11:39:01AM +1100, Anton Blanchard wrote:
> I've got a patch that reworks our numa code and it might reject with
> your stuff. I'll send them out for review this afternoon.
Interesting, in that I was about to start reworking some of the
numa code to make it play nice with hot add. I doubt this patch
set will impact your changes. This set is not very intelligent
WRT numa and doesn't really modify any of the real code.
--
Mike
Mike Kravetz writes:
> Here is a new version of the patch on top of 64k page support (actually
> 2.6.14-git10). One filename also changed due to more merge changes.
So, should I send this on to Linus along with the original 2/4 and 3/4
you posted and the revised 4/4?
Paul.
On Tue, Nov 08, 2005 at 01:07:13PM +1100, Paul Mackerras wrote:
> So, should I send this on to Linus along with the original 2/4 and 3/4
> you posted and the revised 4/4?
Yes, those should provide basic memory add support for ppc64.
Thanks,
--
Mike
Mike Kravetz wrote:
> Just curious if we still want to boost MAX_ORDER like this with 64k
> pages? Doesn't that make the MAX_ORDER block size 256MB in this case?
> Also, not quite sure what happens if memory size (a 16 MB multiple)
> does not align with a MAX_ORDER block size (a 256MB multiple in this
> case). My 'guess' is that the page allocator would not use it as it
> would not fit within the buddy system.
The buddy system and the SPARSEMEM mem_map are really separate. The key
limitation is that a MAX_ORDER chunk must fit within the SPARSEMEM block
size; it cannot span two blocks. This is because the algorithm by which
the buddy system finds buddies for a returning allocation assumes that
mem_map is contiguous up to the maximum buddy size (MAX_ORDER); it
assumes it can use relative addressing to locate them.
The buddy system doesn't really care about the alignment of any of its
blocks. The allocator is built empty and all existent pages are freed
back to it. If there is a chunk of memory which can never coalesce back
to MAX_ORDER it will simply sit lower in the tree 'waiting' for these
nonexistent buddies and will never merge. It will still be usable.
-apw