2008-10-09 19:21:42

by Gary Hade

Subject: [PATCH 1/2] [REPOST] mm: show node to memory section relationship with symlinks in sysfs


Show node to memory section relationship with symlinks in sysfs

Add /sys/devices/system/node/nodeX/memoryY symlinks for all
the memory sections located on nodeX. For example:
/sys/devices/system/node/node1/memory135 -> ../../memory/memory135
indicates that memory section 135 resides on node1.
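
As an illustration only (this is not part of the patch), a user-space tool
could recover the section numbers for a node from the link names. The sketch
below assumes the memoryY naming shown above; parse_memory_link_name() and
list_node_sections() are hypothetical helper names, and the directory to scan
is passed in by the caller.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse a node-directory entry name such as "memory135".
 * Returns 0 and stores the section number on success, -1 otherwise.
 */
int parse_memory_link_name(const char *name, unsigned long *section)
{
	static const char prefix[] = "memory";
	const size_t prefix_len = sizeof(prefix) - 1;
	char *end;

	if (strncmp(name, prefix, prefix_len) != 0)
		return -1;
	/* Require at least one digit so that "memory" alone is rejected. */
	if (!isdigit((unsigned char)name[prefix_len]))
		return -1;
	*section = strtoul(name + prefix_len, &end, 10);
	return *end == '\0' ? 0 : -1;
}

/*
 * Print the section number of every memoryY entry found in node_dir.
 * Returns the number of entries printed, or -1 if node_dir cannot be opened.
 */
int list_node_sections(const char *node_dir)
{
	DIR *dir = opendir(node_dir);
	struct dirent *ent;
	int count = 0;

	if (!dir)
		return -1;
	while ((ent = readdir(dir)) != NULL) {
		unsigned long section;

		if (parse_memory_link_name(ent->d_name, &section) == 0) {
			printf("section %lu\n", section);
			count++;
		}
	}
	closedir(dir);
	return count;
}
```

Calling list_node_sections("/sys/devices/system/node/node1") on a kernel with
this patch applied would be expected to list the sections linked under node1;
entries such as "meminfo" or "cpu0" are skipped by the name check.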

Tested on 2-node x86_64, 2-node ppc64, and 2-node ia64 systems.

Also revises documentation to cover this change as well as updating
Documentation/ABI/testing/sysfs-devices-memory to include descriptions
of memory hotremove files 'phys_device', 'phys_index', and 'state'
that were previously not described there.

Supersedes the "mm: show memory section to node relationship in sysfs"
patch posted on 05 Sept 2008 which created node ID containing 'node'
files in /sys/devices/system/memory/memoryX instead of symlinks.
Changed from files to symlinks due to feedback that symlinks were
more consistent with the sysfs way.

Also supersedes the "mm: show node to memory section relationship
with symlinks in sysfs" patch posted on 29 Sept 2008 to address a
problem reported by Yasunori Goto where an incorrect symlink was created
due to a range of uninitialized pages at the beginning of a section.
This problem, which produced a symlink in /sys/devices/system/node/node0
that incorrectly referenced a mem section located on node1, is corrected
in this version. This version also covers the case where a mem section
could span multiple nodes.
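
The rule described above (skip uninitialized pages, link the section under
every node that owns at least one of its pages) can be modeled in user space.
This is a simplified sketch, not the kernel code: the per-page node IDs are
passed in as a plain array, a negative value stands in for an uninitialized
page, and section_spans_node() is a hypothetical name.

```c
#include <stddef.h>

/*
 * page_nids[i] holds the node ID of page i in the section, or a negative
 * value if the page is uninitialized (assumption made for this model).
 * Returns 1 if at least one initialized page belongs to nid, 0 otherwise.
 */
int section_spans_node(const int *page_nids, size_t nr_pages, int nid)
{
	size_t i;

	for (i = 0; i < nr_pages; i++) {
		if (page_nids[i] < 0)
			continue;	/* uninitialized page: no vote */
		if (page_nids[i] == nid)
			return 1;	/* section has a page on this node */
	}
	return 0;
}
```

With pages { -1, -1, 1, 1 } the section is reported for node 1 only, so the
leading hole no longer produces a link under node 0. With pages { 0, 0, 1, 1 }
it is reported for both node 0 and node 1.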

Signed-off-by: Gary Hade <[email protected]>
Signed-off-by: Badari Pulavarty <[email protected]>

---
Documentation/ABI/testing/sysfs-devices-memory | 51 +++++++
Documentation/memory-hotplug.txt | 16 +-
arch/ia64/mm/init.c | 2
arch/powerpc/mm/mem.c | 2
arch/s390/mm/init.c | 2
arch/sh/mm/init.c | 3
arch/x86/mm/init_32.c | 2
arch/x86/mm/init_64.c | 2
drivers/base/memory.c | 19 +-
drivers/base/node.c | 100 +++++++++++++++
include/linux/memory.h | 6
include/linux/memory_hotplug.h | 2
include/linux/node.h | 13 +
mm/memory_hotplug.c | 9 -
14 files changed, 205 insertions(+), 24 deletions(-)

Index: linux-2.6.27-rc8/Documentation/ABI/testing/sysfs-devices-memory
===================================================================
--- linux-2.6.27-rc8.orig/Documentation/ABI/testing/sysfs-devices-memory 2008-10-06 11:18:46.000000000 -0700
+++ linux-2.6.27-rc8/Documentation/ABI/testing/sysfs-devices-memory 2008-10-06 11:20:19.000000000 -0700
@@ -6,7 +6,6 @@
internal state of the kernel memory blocks. Files could be
added or removed dynamically to represent hot-add/remove
operations.
-
Users: hotplug memory add/remove tools
https://w3.opensource.ibm.com/projects/powerpc-utils/

@@ -19,6 +18,56 @@
This is useful for a user-level agent to determine
identify removable sections of the memory before attempting
potentially expensive hot-remove memory operation
+Users: hotplug memory remove tools
+ https://w3.opensource.ibm.com/projects/powerpc-utils/
+
+What: /sys/devices/system/memory/memoryX/phys_device
+Date: September 2008
+Contact: Badari Pulavarty <[email protected]>
+Description:
+ The file /sys/devices/system/memory/memoryX/phys_device
+ is read-only and is designed to show the name of physical
+ memory device. Implementation is currently incomplete.

+What: /sys/devices/system/memory/memoryX/phys_index
+Date: September 2008
+Contact: Badari Pulavarty <[email protected]>
+Description:
+ The file /sys/devices/system/memory/memoryX/phys_index
+ is read-only and contains the section ID in hexadecimal
+ which is equivalent to decimal X contained in the
+ memory section directory name.
+
+What: /sys/devices/system/memory/memoryX/state
+Date: September 2008
+Contact: Badari Pulavarty <[email protected]>
+Description:
+ The file /sys/devices/system/memory/memoryX/state
+ is read-write. When read, its contents show the
+ online/offline state of the memory section. When written,
+ root can toggle the online/offline state of a removable
+ memory section (see removable file description above)
+ using the following commands.
+ # echo online > /sys/devices/system/memory/memoryX/state
+ # echo offline > /sys/devices/system/memory/memoryX/state
+
+ For example, if /sys/devices/system/memory/memory22/removable
+ contains a value of 1 and
+ /sys/devices/system/memory/memory22/state contains the
+ string "online", the following command can be executed by
+ root to offline that section.
+ # echo offline > /sys/devices/system/memory/memory22/state
Users: hotplug memory remove tools
https://w3.opensource.ibm.com/projects/powerpc-utils/
+
+What: /sys/devices/system/node/nodeX/memoryY
+Date: September 2008
+Contact: Gary Hade <[email protected]>
+Description:
+ When CONFIG_NUMA is enabled
+ /sys/devices/system/node/nodeX/memoryY is a symbolic link that
+ points to the corresponding /sys/devices/system/memory/memoryY
+ memory section directory. For example, the following symbolic
+ link is created for memory section 9 on node0.
+ /sys/devices/system/node/node0/memory9 -> ../../memory/memory9
+
Index: linux-2.6.27-rc8/Documentation/memory-hotplug.txt
===================================================================
--- linux-2.6.27-rc8.orig/Documentation/memory-hotplug.txt 2008-10-06 11:18:45.000000000 -0700
+++ linux-2.6.27-rc8/Documentation/memory-hotplug.txt 2008-10-06 11:20:42.000000000 -0700
@@ -124,7 +124,7 @@
This option can be kernel module too.

--------------------------------
-3 sysfs files for memory hotplug
+4 sysfs files for memory hotplug
--------------------------------
All sections have their device information under /sys/devices/system/memory as

@@ -138,11 +138,12 @@
(0x100000000 / 1Gib = 4)
This device covers address range [0x100000000 ... 0x140000000)

-Under each section, you can see 3 files.
+Under each section, you can see 4 files.

/sys/devices/system/memory/memoryXXX/phys_index
/sys/devices/system/memory/memoryXXX/phys_device
/sys/devices/system/memory/memoryXXX/state
+/sys/devices/system/memory/memoryXXX/removable

'phys_index' : read-only and contains section id, same as XXX.
'state' : read-write
@@ -150,10 +151,20 @@
at write: user can specify "online", "offline" command
'phys_device': read-only: designed to show the name of physical memory device.
This is not well implemented now.
+'removable' : read-only: contains an integer value indicating
+ whether the memory section is removable or not
+ removable. A value of 1 indicates that the memory
+ section is removable and a value of 0 indicates that
+ it is not removable.

NOTE:
These directories/files appear after physical memory hotplug phase.

+If CONFIG_NUMA is enabled the
+/sys/devices/system/memory/memoryXXX memory section
+directories can also be accessed via symbolic links located in
+the /sys/devices/system/node/node* directories. For example:
+/sys/devices/system/node/node0/memory9 -> ../../memory/memory9

--------------------------------
4. Physical memory hot-add phase
@@ -365,7 +376,6 @@
- allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
sysctl or new control file.
- showing memory section and physical device relationship.
- - showing memory section and node relationship (maybe good for NUMA)
- showing memory section is under ZONE_MOVABLE or not
- test and make it better memory offlining.
- support HugeTLB page migration and offlining.
Index: linux-2.6.27-rc8/arch/ia64/mm/init.c
===================================================================
--- linux-2.6.27-rc8.orig/arch/ia64/mm/init.c 2008-10-06 11:18:45.000000000 -0700
+++ linux-2.6.27-rc8/arch/ia64/mm/init.c 2008-10-06 11:19:41.000000000 -0700
@@ -693,7 +693,7 @@
pgdat = NODE_DATA(nid);

zone = pgdat->node_zones + ZONE_NORMAL;
- ret = __add_pages(zone, start_pfn, nr_pages);
+ ret = __add_pages(nid, zone, start_pfn, nr_pages);

if (ret)
printk("%s: Problem encountered in __add_pages() as ret=%d\n",
Index: linux-2.6.27-rc8/arch/powerpc/mm/mem.c
===================================================================
--- linux-2.6.27-rc8.orig/arch/powerpc/mm/mem.c 2008-10-06 11:18:44.000000000 -0700
+++ linux-2.6.27-rc8/arch/powerpc/mm/mem.c 2008-10-06 11:19:41.000000000 -0700
@@ -133,7 +133,7 @@
/* this should work for most non-highmem platforms */
zone = pgdata->node_zones;

- return __add_pages(zone, start_pfn, nr_pages);
+ return __add_pages(nid, zone, start_pfn, nr_pages);
}

#ifdef CONFIG_MEMORY_HOTREMOVE
Index: linux-2.6.27-rc8/arch/s390/mm/init.c
===================================================================
--- linux-2.6.27-rc8.orig/arch/s390/mm/init.c 2008-10-06 11:18:45.000000000 -0700
+++ linux-2.6.27-rc8/arch/s390/mm/init.c 2008-10-06 11:19:41.000000000 -0700
@@ -183,7 +183,7 @@
rc = vmem_add_mapping(start, size);
if (rc)
return rc;
- rc = __add_pages(zone, PFN_DOWN(start), PFN_DOWN(size));
+ rc = __add_pages(nid, zone, PFN_DOWN(start), PFN_DOWN(size));
if (rc)
vmem_remove_mapping(start, size);
return rc;
Index: linux-2.6.27-rc8/arch/sh/mm/init.c
===================================================================
--- linux-2.6.27-rc8.orig/arch/sh/mm/init.c 2008-10-06 11:18:45.000000000 -0700
+++ linux-2.6.27-rc8/arch/sh/mm/init.c 2008-10-06 11:19:41.000000000 -0700
@@ -276,7 +276,8 @@
pgdat = NODE_DATA(nid);

/* We only have ZONE_NORMAL, so this is easy.. */
- ret = __add_pages(pgdat->node_zones + ZONE_NORMAL, start_pfn, nr_pages);
+ ret = __add_pages(nid, pgdat->node_zones + ZONE_NORMAL,
+ start_pfn, nr_pages);
if (unlikely(ret))
printk("%s: Failed, __add_pages() == %d\n", __func__, ret);

Index: linux-2.6.27-rc8/arch/x86/mm/init_32.c
===================================================================
--- linux-2.6.27-rc8.orig/arch/x86/mm/init_32.c 2008-10-06 11:18:45.000000000 -0700
+++ linux-2.6.27-rc8/arch/x86/mm/init_32.c 2008-10-06 11:19:41.000000000 -0700
@@ -995,7 +995,7 @@
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;

- return __add_pages(zone, start_pfn, nr_pages);
+ return __add_pages(nid, zone, start_pfn, nr_pages);
}
#endif

Index: linux-2.6.27-rc8/drivers/base/memory.c
===================================================================
--- linux-2.6.27-rc8.orig/drivers/base/memory.c 2008-10-06 11:18:44.000000000 -0700
+++ linux-2.6.27-rc8/drivers/base/memory.c 2008-10-06 11:19:41.000000000 -0700
@@ -345,8 +345,9 @@
* section belongs to...
*/

-static int add_memory_block(unsigned long node_id, struct mem_section *section,
- unsigned long state, int phys_device)
+static int add_memory_block(int nid, struct mem_section *section,
+ unsigned long state, int phys_device,
+ enum mem_add_context context)
{
struct memory_block *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
int ret = 0;
@@ -368,6 +369,10 @@
ret = mem_create_simple_file(mem, phys_device);
if (!ret)
ret = mem_create_simple_file(mem, removable);
+ if (!ret) {
+ if (context == HOTPLUG)
+ ret = register_mem_sect_under_node(mem, nid);
+ }

return ret;
}
@@ -380,7 +385,7 @@
*
* This could be made generic for all sysdev classes.
*/
-static struct memory_block *find_memory_block(struct mem_section *section)
+struct memory_block *find_memory_block(struct mem_section *section)
{
struct kobject *kobj;
struct sys_device *sysdev;
@@ -409,6 +414,7 @@
struct memory_block *mem;

mem = find_memory_block(section);
+ unregister_mem_sect_under_nodes(mem);
mem_remove_simple_file(mem, phys_index);
mem_remove_simple_file(mem, state);
mem_remove_simple_file(mem, phys_device);
@@ -422,9 +428,9 @@
* need an interface for the VM to add new memory regions,
* but without onlining it.
*/
-int register_new_memory(struct mem_section *section)
+int register_new_memory(int nid, struct mem_section *section)
{
- return add_memory_block(0, section, MEM_OFFLINE, 0);
+ return add_memory_block(nid, section, MEM_OFFLINE, 0, HOTPLUG);
}

int unregister_memory_section(struct mem_section *section)
@@ -456,7 +462,8 @@
for (i = 0; i < NR_MEM_SECTIONS; i++) {
if (!present_section_nr(i))
continue;
- err = add_memory_block(0, __nr_to_section(i), MEM_ONLINE, 0);
+ err = add_memory_block(0, __nr_to_section(i), MEM_ONLINE,
+ 0, BOOT);
if (!ret)
ret = err;
}
Index: linux-2.6.27-rc8/drivers/base/node.c
===================================================================
--- linux-2.6.27-rc8.orig/drivers/base/node.c 2008-10-06 11:18:44.000000000 -0700
+++ linux-2.6.27-rc8/drivers/base/node.c 2008-10-06 11:19:41.000000000 -0700
@@ -6,6 +6,7 @@
#include <linux/module.h>
#include <linux/init.h>
#include <linux/mm.h>
+#include <linux/memory.h>
#include <linux/node.h>
#include <linux/hugetlb.h>
#include <linux/cpumask.h>
@@ -225,6 +226,102 @@
return 0;
}

+#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
+#define page_initialized(page) ((page)->lru.next)
+
+static int get_nid_for_pfn(unsigned long pfn)
+{
+ struct page *page;
+
+ if (!pfn_valid_within(pfn))
+ return -1;
+ page = pfn_to_page(pfn);
+ if (!page_initialized(page))
+ return -1;
+ return pfn_to_nid(pfn);
+}
+
+/* register memory section under specified node if it spans that node */
+int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
+{
+ unsigned long pfn, sect_start_pfn, sect_end_pfn;
+
+ if (!mem_blk)
+ return -EFAULT;
+ if (!node_online(nid))
+ return 0;
+ sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
+ sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
+ for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
+ int page_nid;
+
+ page_nid = get_nid_for_pfn(pfn);
+ if (page_nid < 0)
+ continue;
+ if (page_nid != nid)
+ continue;
+ return sysfs_create_link_nowarn(&node_devices[nid].sysdev.kobj,
+ &mem_blk->sysdev.kobj,
+ kobject_name(&mem_blk->sysdev.kobj));
+ }
+ /* mem section does not span the specified node */
+ return 0;
+}
+
+/* unregister memory section under all nodes that it spans */
+int unregister_mem_sect_under_nodes(struct memory_block *mem_blk)
+{
+ nodemask_t unlinked_nodes;
+ unsigned long pfn, sect_start_pfn, sect_end_pfn;
+
+ if (!mem_blk)
+ return -EFAULT;
+ nodes_clear(unlinked_nodes);
+ sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
+ sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
+ for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
+ int nid;
+
+ nid = get_nid_for_pfn(pfn);
+ if (nid < 0)
+ continue;
+ if (!node_online(nid))
+ continue;
+ if (node_test_and_set(nid, unlinked_nodes))
+ continue;
+ sysfs_remove_link(&node_devices[nid].sysdev.kobj,
+ kobject_name(&mem_blk->sysdev.kobj));
+ }
+ return 0;
+}
+
+static int link_mem_sections(int nid)
+{
+ unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
+ unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
+ unsigned long pfn;
+ int err = 0;
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+ unsigned long section_nr = pfn_to_section_nr(pfn);
+ struct mem_section *mem_sect;
+ struct memory_block *mem_blk;
+ int ret;
+
+ if (!present_section_nr(section_nr))
+ continue;
+ mem_sect = __nr_to_section(section_nr);
+ mem_blk = find_memory_block(mem_sect);
+ ret = register_mem_sect_under_node(mem_blk, nid);
+ if (!err)
+ err = ret;
+ }
+ return err;
+}
+#else
+static int link_mem_sections(int nid) { return 0; }
+#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
+
int register_one_node(int nid)
{
int error = 0;
@@ -244,6 +341,9 @@
if (cpu_to_node(cpu) == nid)
register_cpu_under_node(cpu, nid);
}
+
+ /* link memory sections under this node */
+ error = link_mem_sections(nid);
}

return error;
Index: linux-2.6.27-rc8/include/linux/memory.h
===================================================================
--- linux-2.6.27-rc8.orig/include/linux/memory.h 2008-10-06 11:18:44.000000000 -0700
+++ linux-2.6.27-rc8/include/linux/memory.h 2008-10-06 11:19:41.000000000 -0700
@@ -79,14 +79,14 @@
#else
extern int register_memory_notifier(struct notifier_block *nb);
extern void unregister_memory_notifier(struct notifier_block *nb);
-extern int register_new_memory(struct mem_section *);
+extern int register_new_memory(int, struct mem_section *);
extern int unregister_memory_section(struct mem_section *);
extern int memory_dev_init(void);
extern int remove_memory_block(unsigned long, struct mem_section *, int);
extern int memory_notify(unsigned long val, void *v);
+extern struct memory_block *find_memory_block(struct mem_section *);
#define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION<<PAGE_SHIFT)
-
-
+enum mem_add_context { BOOT, HOTPLUG };
#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */

#ifdef CONFIG_MEMORY_HOTPLUG
Index: linux-2.6.27-rc8/include/linux/memory_hotplug.h
===================================================================
--- linux-2.6.27-rc8.orig/include/linux/memory_hotplug.h 2008-10-06 11:18:44.000000000 -0700
+++ linux-2.6.27-rc8/include/linux/memory_hotplug.h 2008-10-06 11:19:41.000000000 -0700
@@ -72,7 +72,7 @@
extern int offline_pages(unsigned long, unsigned long, unsigned long);

/* reasonably generic interface to expand the physical pages in a zone */
-extern int __add_pages(struct zone *zone, unsigned long start_pfn,
+extern int __add_pages(int nid, struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages);
extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages);
Index: linux-2.6.27-rc8/include/linux/node.h
===================================================================
--- linux-2.6.27-rc8.orig/include/linux/node.h 2008-10-06 11:18:44.000000000 -0700
+++ linux-2.6.27-rc8/include/linux/node.h 2008-10-06 11:19:41.000000000 -0700
@@ -26,6 +26,7 @@
struct sys_device sysdev;
};

+struct memory_block;
extern struct node node_devices[];

extern int register_node(struct node *, int, struct node *);
@@ -35,6 +36,9 @@
extern void unregister_one_node(int nid);
extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
+extern int register_mem_sect_under_node(struct memory_block *mem_blk,
+ int nid);
+extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk);
#else
static inline int register_one_node(int nid)
{
@@ -52,6 +56,15 @@
{
return 0;
}
+static inline int register_mem_sect_under_node(struct memory_block *mem_blk,
+ int nid)
+{
+ return 0;
+}
+static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk)
+{
+ return 0;
+}
#endif

#define to_node(sys_device) container_of(sys_device, struct node, sysdev)
Index: linux-2.6.27-rc8/mm/memory_hotplug.c
===================================================================
--- linux-2.6.27-rc8.orig/mm/memory_hotplug.c 2008-10-06 11:18:44.000000000 -0700
+++ linux-2.6.27-rc8/mm/memory_hotplug.c 2008-10-06 11:19:41.000000000 -0700
@@ -216,7 +216,8 @@
return 0;
}

-static int __add_section(struct zone *zone, unsigned long phys_start_pfn)
+static int __add_section(int nid, struct zone *zone,
+ unsigned long phys_start_pfn)
{
int nr_pages = PAGES_PER_SECTION;
int ret;
@@ -234,7 +235,7 @@
if (ret < 0)
return ret;

- return register_new_memory(__pfn_to_section(phys_start_pfn));
+ return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
}

#ifdef CONFIG_SPARSEMEM_VMEMMAP
@@ -273,7 +274,7 @@
* call this function after deciding the zone to which to
* add the new pages.
*/
-int __add_pages(struct zone *zone, unsigned long phys_start_pfn,
+int __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
unsigned long nr_pages)
{
unsigned long i;
@@ -284,7 +285,7 @@
end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1);

for (i = start_sec; i <= end_sec; i++) {
- err = __add_section(zone, i << PFN_SECTION_SHIFT);
+ err = __add_section(nid, zone, i << PFN_SECTION_SHIFT);

/*
* EEXIST is finally dealt with by ioresource collision
Index: linux-2.6.27-rc8/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.27-rc8.orig/arch/x86/mm/init_64.c 2008-10-06 11:18:45.000000000 -0700
+++ linux-2.6.27-rc8/arch/x86/mm/init_64.c 2008-10-06 11:24:09.000000000 -0700
@@ -725,7 +725,7 @@
if (last_mapped_pfn > max_pfn_mapped)
max_pfn_mapped = last_mapped_pfn;

- ret = __add_pages(zone, start_pfn, nr_pages);
+ ret = __add_pages(nid, zone, start_pfn, nr_pages);
WARN_ON(1);

return ret;
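
For readers of the ABI text in this patch: phys_index is described as holding
the section ID in hexadecimal, while the memoryX directory name uses decimal.
A minimal user-space sketch of that conversion follows. The helper names
parse_phys_index() and format_section_dir() are hypothetical, and the input is
assumed to be the text read from the phys_index file.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Convert the hexadecimal text read from phys_index into a section number.
 * A single trailing newline is accepted. Returns 0 on success, -1 on error.
 */
int parse_phys_index(const char *text, unsigned long *section)
{
	char *end;
	unsigned long value = strtoul(text, &end, 16);

	if (end == text)
		return -1;	/* no hex digits at all */
	if (*end != '\0' && *end != '\n')
		return -1;	/* trailing garbage */
	*section = value;
	return 0;
}

/*
 * Build the decimal directory name "memory<section>" in buf.
 * Returns 0 on success, -1 if buf is too small.
 */
int format_section_dir(unsigned long section, char *buf, size_t len)
{
	int n = snprintf(buf, len, "memory%lu", section);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For example, phys_index text "00000087" parses to 135, which corresponds to
the directory name memory135 used in the commit message.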


2008-10-10 10:56:44

by Yasunori Goto

Subject: Re: [PATCH 1/2] [REPOST] mm: show node to memory section relationship with symlinks in sysfs


Looks good to me.

Acked-by: Yasunori Goto <[email protected]>


>
> Show node to memory section relationship with symlinks in sysfs
>
> Add /sys/devices/system/node/nodeX/memoryY symlinks for all
> the memory sections located on nodeX. For example:
> /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
> indicates that memory section 135 resides on node1.
>
> Tested on 2-node x86_64, 2-node ppc64, and 2-node ia64 systems.
>
> Also revises documentation to cover this change as well as updating
> Documentation/ABI/testing/sysfs-devices-memory to include descriptions
> of memory hotremove files 'phys_device', 'phys_index', and 'state'
> that were previously not described there.
>
> Supersedes the "mm: show memory section to node relationship in sysfs"
> patch posted on 05 Sept 2008 which created node ID containing 'node'
> files in /sys/devices/system/memory/memoryX instead of symlinks.
> Changed from files to symlinks due to feedback that symlinks were
> more consistent with the sysfs way.
>
> Also supersedes the "mm: show node to memory section relationship
> with symlinks in sysfs" patch posted on 29 Sept 2008 to address a
> problem reported by Yasunori Goto where an incorrect symlink was created
> due to a range of uninitialized pages at the beginning of a section.
> This problem, which produced a symlink in /sys/devices/system/node/node0
> that incorrectly referenced a mem section located on node1, is corrected
> in this version. This version also covers the case where a mem section
> could span multiple nodes.
>
> Signed-off-by: Gary Hade <[email protected]>
> Signed-off-by: Badari Pulavarty <[email protected]>
>
> ---
> Documentation/ABI/testing/sysfs-devices-memory | 51 +++++++
> Documentation/memory-hotplug.txt | 16 +-
> arch/ia64/mm/init.c | 2
> arch/powerpc/mm/mem.c | 2
> arch/s390/mm/init.c | 2
> arch/sh/mm/init.c | 3
> arch/x86/mm/init_32.c | 2
> arch/x86/mm/init_64.c | 2
> drivers/base/memory.c | 19 +-
> drivers/base/node.c | 100 +++++++++++++++
> include/linux/memory.h | 6
> include/linux/memory_hotplug.h | 2
> include/linux/node.h | 13 +
> mm/memory_hotplug.c | 9 -
> 14 files changed, 205 insertions(+), 24 deletions(-)
>
> Index: linux-2.6.27-rc8/Documentation/ABI/testing/sysfs-devices-memory
> ===================================================================
> --- linux-2.6.27-rc8.orig/Documentation/ABI/testing/sysfs-devices-memory 2008-10-06 11:18:46.000000000 -0700
> +++ linux-2.6.27-rc8/Documentation/ABI/testing/sysfs-devices-memory 2008-10-06 11:20:19.000000000 -0700
> @@ -6,7 +6,6 @@
> internal state of the kernel memory blocks. Files could be
> added or removed dynamically to represent hot-add/remove
> operations.
> -
> Users: hotplug memory add/remove tools
> https://w3.opensource.ibm.com/projects/powerpc-utils/
>
> @@ -19,6 +18,56 @@
> This is useful for a user-level agent to determine
> identify removable sections of the memory before attempting
> potentially expensive hot-remove memory operation
> +Users: hotplug memory remove tools
> + https://w3.opensource.ibm.com/projects/powerpc-utils/
> +
> +What: /sys/devices/system/memory/memoryX/phys_device
> +Date: September 2008
> +Contact: Badari Pulavarty <[email protected]>
> +Description:
> + The file /sys/devices/system/memory/memoryX/phys_device
> + is read-only and is designed to show the name of physical
> + memory device. Implementation is currently incomplete.
>
> +What: /sys/devices/system/memory/memoryX/phys_index
> +Date: September 2008
> +Contact: Badari Pulavarty <[email protected]>
> +Description:
> + The file /sys/devices/system/memory/memoryX/phys_index
> + is read-only and contains the section ID in hexadecimal
> + which is equivalent to decimal X contained in the
> + memory section directory name.
> +
> +What: /sys/devices/system/memory/memoryX/state
> +Date: September 2008
> +Contact: Badari Pulavarty <[email protected]>
> +Description:
> + The file /sys/devices/system/memory/memoryX/state
> + is read-write. When read, its contents show the
> + online/offline state of the memory section. When written,
> + root can toggle the online/offline state of a removable
> + memory section (see removable file description above)
> + using the following commands.
> + # echo online > /sys/devices/system/memory/memoryX/state
> + # echo offline > /sys/devices/system/memory/memoryX/state
> +
> + For example, if /sys/devices/system/memory/memory22/removable
> + contains a value of 1 and
> + /sys/devices/system/memory/memory22/state contains the
> + string "online", the following command can be executed by
> + root to offline that section.
> + # echo offline > /sys/devices/system/memory/memory22/state
> Users: hotplug memory remove tools
> https://w3.opensource.ibm.com/projects/powerpc-utils/
> +
> +What: /sys/devices/system/node/nodeX/memoryY
> +Date: September 2008
> +Contact: Gary Hade <[email protected]>
> +Description:
> + When CONFIG_NUMA is enabled
> + /sys/devices/system/node/nodeX/memoryY is a symbolic link that
> + points to the corresponding /sys/devices/system/memory/memoryY
> + memory section directory. For example, the following symbolic
> + link is created for memory section 9 on node0.
> + /sys/devices/system/node/node0/memory9 -> ../../memory/memory9
> +
> Index: linux-2.6.27-rc8/Documentation/memory-hotplug.txt
> ===================================================================
> --- linux-2.6.27-rc8.orig/Documentation/memory-hotplug.txt 2008-10-06 11:18:45.000000000 -0700
> +++ linux-2.6.27-rc8/Documentation/memory-hotplug.txt 2008-10-06 11:20:42.000000000 -0700
> @@ -124,7 +124,7 @@
> This option can be kernel module too.
>
> --------------------------------
> -3 sysfs files for memory hotplug
> +4 sysfs files for memory hotplug
> --------------------------------
> All sections have their device information under /sys/devices/system/memory as
>
> @@ -138,11 +138,12 @@
> (0x100000000 / 1Gib = 4)
> This device covers address range [0x100000000 ... 0x140000000)
>
> -Under each section, you can see 3 files.
> +Under each section, you can see 4 files.
>
> /sys/devices/system/memory/memoryXXX/phys_index
> /sys/devices/system/memory/memoryXXX/phys_device
> /sys/devices/system/memory/memoryXXX/state
> +/sys/devices/system/memory/memoryXXX/removable
>
> 'phys_index' : read-only and contains section id, same as XXX.
> 'state' : read-write
> @@ -150,10 +151,20 @@
> at write: user can specify "online", "offline" command
> 'phys_device': read-only: designed to show the name of physical memory device.
> This is not well implemented now.
> +'removable' : read-only: contains an integer value indicating
> + whether the memory section is removable or not
> + removable. A value of 1 indicates that the memory
> + section is removable and a value of 0 indicates that
> + it is not removable.
>
> NOTE:
> These directories/files appear after physical memory hotplug phase.
>
> +If CONFIG_NUMA is enabled the
> +/sys/devices/system/memory/memoryXXX memory section
> +directories can also be accessed via symbolic links located in
> +the /sys/devices/system/node/node* directories. For example:
> +/sys/devices/system/node/node0/memory9 -> ../../memory/memory9
>
> --------------------------------
> 4. Physical memory hot-add phase
> @@ -365,7 +376,6 @@
> - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
> sysctl or new control file.
> - showing memory section and physical device relationship.
> - - showing memory section and node relationship (maybe good for NUMA)
> - showing memory section is under ZONE_MOVABLE or not
> - test and make it better memory offlining.
> - support HugeTLB page migration and offlining.
> Index: linux-2.6.27-rc8/arch/ia64/mm/init.c
> ===================================================================
> --- linux-2.6.27-rc8.orig/arch/ia64/mm/init.c 2008-10-06 11:18:45.000000000 -0700
> +++ linux-2.6.27-rc8/arch/ia64/mm/init.c 2008-10-06 11:19:41.000000000 -0700
> @@ -693,7 +693,7 @@
> pgdat = NODE_DATA(nid);
>
> zone = pgdat->node_zones + ZONE_NORMAL;
> - ret = __add_pages(zone, start_pfn, nr_pages);
> + ret = __add_pages(nid, zone, start_pfn, nr_pages);
>
> if (ret)
> printk("%s: Problem encountered in __add_pages() as ret=%d\n",
> Index: linux-2.6.27-rc8/arch/powerpc/mm/mem.c
> ===================================================================
> --- linux-2.6.27-rc8.orig/arch/powerpc/mm/mem.c 2008-10-06 11:18:44.000000000 -0700
> +++ linux-2.6.27-rc8/arch/powerpc/mm/mem.c 2008-10-06 11:19:41.000000000 -0700
> @@ -133,7 +133,7 @@
> /* this should work for most non-highmem platforms */
> zone = pgdata->node_zones;
>
> - return __add_pages(zone, start_pfn, nr_pages);
> + return __add_pages(nid, zone, start_pfn, nr_pages);
> }
>
> #ifdef CONFIG_MEMORY_HOTREMOVE
> Index: linux-2.6.27-rc8/arch/s390/mm/init.c
> ===================================================================
> --- linux-2.6.27-rc8.orig/arch/s390/mm/init.c 2008-10-06 11:18:45.000000000 -0700
> +++ linux-2.6.27-rc8/arch/s390/mm/init.c 2008-10-06 11:19:41.000000000 -0700
> @@ -183,7 +183,7 @@
> rc = vmem_add_mapping(start, size);
> if (rc)
> return rc;
> - rc = __add_pages(zone, PFN_DOWN(start), PFN_DOWN(size));
> + rc = __add_pages(nid, zone, PFN_DOWN(start), PFN_DOWN(size));
> if (rc)
> vmem_remove_mapping(start, size);
> return rc;
> Index: linux-2.6.27-rc8/arch/sh/mm/init.c
> ===================================================================
> --- linux-2.6.27-rc8.orig/arch/sh/mm/init.c 2008-10-06 11:18:45.000000000 -0700
> +++ linux-2.6.27-rc8/arch/sh/mm/init.c 2008-10-06 11:19:41.000000000 -0700
> @@ -276,7 +276,8 @@
> pgdat = NODE_DATA(nid);
>
> /* We only have ZONE_NORMAL, so this is easy.. */
> - ret = __add_pages(pgdat->node_zones + ZONE_NORMAL, start_pfn, nr_pages);
> + ret = __add_pages(nid, pgdat->node_zones + ZONE_NORMAL,
> + start_pfn, nr_pages);
> if (unlikely(ret))
> printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
>
> Index: linux-2.6.27-rc8/arch/x86/mm/init_32.c
> ===================================================================
> --- linux-2.6.27-rc8.orig/arch/x86/mm/init_32.c 2008-10-06 11:18:45.000000000 -0700
> +++ linux-2.6.27-rc8/arch/x86/mm/init_32.c 2008-10-06 11:19:41.000000000 -0700
> @@ -995,7 +995,7 @@
> unsigned long start_pfn = start >> PAGE_SHIFT;
> unsigned long nr_pages = size >> PAGE_SHIFT;
>
> - return __add_pages(zone, start_pfn, nr_pages);
> + return __add_pages(nid, zone, start_pfn, nr_pages);
> }
> #endif
>
> Index: linux-2.6.27-rc8/drivers/base/memory.c
> ===================================================================
> --- linux-2.6.27-rc8.orig/drivers/base/memory.c 2008-10-06 11:18:44.000000000 -0700
> +++ linux-2.6.27-rc8/drivers/base/memory.c 2008-10-06 11:19:41.000000000 -0700
> @@ -345,8 +345,9 @@
> * section belongs to...
> */
>
> -static int add_memory_block(unsigned long node_id, struct mem_section *section,
> - unsigned long state, int phys_device)
> +static int add_memory_block(int nid, struct mem_section *section,
> + unsigned long state, int phys_device,
> + enum mem_add_context context)
> {
> struct memory_block *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> int ret = 0;
> @@ -368,6 +369,10 @@
> ret = mem_create_simple_file(mem, phys_device);
> if (!ret)
> ret = mem_create_simple_file(mem, removable);
> + if (!ret) {
> + if (context == HOTPLUG)
> + ret = register_mem_sect_under_node(mem, nid);
> + }
>
> return ret;
> }
> @@ -380,7 +385,7 @@
> *
> * This could be made generic for all sysdev classes.
> */
> -static struct memory_block *find_memory_block(struct mem_section *section)
> +struct memory_block *find_memory_block(struct mem_section *section)
> {
> struct kobject *kobj;
> struct sys_device *sysdev;
> @@ -409,6 +414,7 @@
> struct memory_block *mem;
>
> mem = find_memory_block(section);
> + unregister_mem_sect_under_nodes(mem);
> mem_remove_simple_file(mem, phys_index);
> mem_remove_simple_file(mem, state);
> mem_remove_simple_file(mem, phys_device);
> @@ -422,9 +428,9 @@
> * need an interface for the VM to add new memory regions,
> * but without onlining it.
> */
> -int register_new_memory(struct mem_section *section)
> +int register_new_memory(int nid, struct mem_section *section)
> {
> - return add_memory_block(0, section, MEM_OFFLINE, 0);
> + return add_memory_block(nid, section, MEM_OFFLINE, 0, HOTPLUG);
> }
>
> int unregister_memory_section(struct mem_section *section)
> @@ -456,7 +462,8 @@
> for (i = 0; i < NR_MEM_SECTIONS; i++) {
> if (!present_section_nr(i))
> continue;
> - err = add_memory_block(0, __nr_to_section(i), MEM_ONLINE, 0);
> + err = add_memory_block(0, __nr_to_section(i), MEM_ONLINE,
> + 0, BOOT);
> if (!ret)
> ret = err;
> }
> Index: linux-2.6.27-rc8/drivers/base/node.c
> ===================================================================
> --- linux-2.6.27-rc8.orig/drivers/base/node.c 2008-10-06 11:18:44.000000000 -0700
> +++ linux-2.6.27-rc8/drivers/base/node.c 2008-10-06 11:19:41.000000000 -0700
> @@ -6,6 +6,7 @@
> #include <linux/module.h>
> #include <linux/init.h>
> #include <linux/mm.h>
> +#include <linux/memory.h>
> #include <linux/node.h>
> #include <linux/hugetlb.h>
> #include <linux/cpumask.h>
> @@ -225,6 +226,102 @@
> return 0;
> }
>
> +#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
> +#define page_initialized(page) (page->lru.next)
> +
> +static int get_nid_for_pfn(unsigned long pfn)
> +{
> + struct page *page;
> +
> + if (!pfn_valid_within(pfn))
> + return -1;
> + page = pfn_to_page(pfn);
> + if (!page_initialized(page))
> + return -1;
> + return pfn_to_nid(pfn);
> +}
> +
> +/* register memory section under specified node if it spans that node */
> +int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
> +{
> + unsigned long pfn, sect_start_pfn, sect_end_pfn;
> +
> + if (!mem_blk)
> + return -EFAULT;
> + if (!node_online(nid))
> + return 0;
> + sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
> + sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
> + for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
> + int page_nid;
> +
> + page_nid = get_nid_for_pfn(pfn);
> + if (page_nid < 0)
> + continue;
> + if (page_nid != nid)
> + continue;
> + return sysfs_create_link_nowarn(&node_devices[nid].sysdev.kobj,
> + &mem_blk->sysdev.kobj,
> + kobject_name(&mem_blk->sysdev.kobj));
> + }
> + /* mem section does not span the specified node */
> + return 0;
> +}
> +
> +/* unregister memory section under all nodes that it spans */
> +int unregister_mem_sect_under_nodes(struct memory_block *mem_blk)
> +{
> + nodemask_t unlinked_nodes;
> + unsigned long pfn, sect_start_pfn, sect_end_pfn;
> +
> + if (!mem_blk)
> + return -EFAULT;
> + nodes_clear(unlinked_nodes);
> + sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
> + sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
> + for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
> + int nid;
> +
> + nid = get_nid_for_pfn(pfn);
> + if (nid < 0)
> + continue;
> + if (!node_online(nid))
> + continue;
> + if (node_test_and_set(nid, unlinked_nodes))
> + continue;
> + sysfs_remove_link(&node_devices[nid].sysdev.kobj,
> + kobject_name(&mem_blk->sysdev.kobj));
> + }
> + return 0;
> +}
> +
> +static int link_mem_sections(int nid)
> +{
> + unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
> + unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
> + unsigned long pfn;
> + int err = 0;
> +
> + for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
> + unsigned long section_nr = pfn_to_section_nr(pfn);
> + struct mem_section *mem_sect;
> + struct memory_block *mem_blk;
> + int ret;
> +
> + if (!present_section_nr(section_nr))
> + continue;
> + mem_sect = __nr_to_section(section_nr);
> + mem_blk = find_memory_block(mem_sect);
> + ret = register_mem_sect_under_node(mem_blk, nid);
> + if (!err)
> + err = ret;
> + }
> + return err;
> +}
> +#else
> +static int link_mem_sections(int nid) { return 0; }
> +#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
> +
> int register_one_node(int nid)
> {
> int error = 0;
> @@ -244,6 +341,9 @@
> if (cpu_to_node(cpu) == nid)
> register_cpu_under_node(cpu, nid);
> }
> +
> + /* link memory sections under this node */
> + error = link_mem_sections(nid);
> }
>
> return error;
> Index: linux-2.6.27-rc8/include/linux/memory.h
> ===================================================================
> --- linux-2.6.27-rc8.orig/include/linux/memory.h 2008-10-06 11:18:44.000000000 -0700
> +++ linux-2.6.27-rc8/include/linux/memory.h 2008-10-06 11:19:41.000000000 -0700
> @@ -79,14 +79,14 @@
> #else
> extern int register_memory_notifier(struct notifier_block *nb);
> extern void unregister_memory_notifier(struct notifier_block *nb);
> -extern int register_new_memory(struct mem_section *);
> +extern int register_new_memory(int, struct mem_section *);
> extern int unregister_memory_section(struct mem_section *);
> extern int memory_dev_init(void);
> extern int remove_memory_block(unsigned long, struct mem_section *, int);
> extern int memory_notify(unsigned long val, void *v);
> +extern struct memory_block *find_memory_block(struct mem_section *);
> #define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION<<PAGE_SHIFT)
> -
> -
> +enum mem_add_context { BOOT, HOTPLUG };
> #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
>
> #ifdef CONFIG_MEMORY_HOTPLUG
> Index: linux-2.6.27-rc8/include/linux/memory_hotplug.h
> ===================================================================
> --- linux-2.6.27-rc8.orig/include/linux/memory_hotplug.h 2008-10-06 11:18:44.000000000 -0700
> +++ linux-2.6.27-rc8/include/linux/memory_hotplug.h 2008-10-06 11:19:41.000000000 -0700
> @@ -72,7 +72,7 @@
> extern int offline_pages(unsigned long, unsigned long, unsigned long);
>
> /* reasonably generic interface to expand the physical pages in a zone */
> -extern int __add_pages(struct zone *zone, unsigned long start_pfn,
> +extern int __add_pages(int nid, struct zone *zone, unsigned long start_pfn,
> unsigned long nr_pages);
> extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
> unsigned long nr_pages);
> Index: linux-2.6.27-rc8/include/linux/node.h
> ===================================================================
> --- linux-2.6.27-rc8.orig/include/linux/node.h 2008-10-06 11:18:44.000000000 -0700
> +++ linux-2.6.27-rc8/include/linux/node.h 2008-10-06 11:19:41.000000000 -0700
> @@ -26,6 +26,7 @@
> struct sys_device sysdev;
> };
>
> +struct memory_block;
> extern struct node node_devices[];
>
> extern int register_node(struct node *, int, struct node *);
> @@ -35,6 +36,9 @@
> extern void unregister_one_node(int nid);
> extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
> extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
> +extern int register_mem_sect_under_node(struct memory_block *mem_blk,
> + int nid);
> +extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk);
> #else
> static inline int register_one_node(int nid)
> {
> @@ -52,6 +56,15 @@
> {
> return 0;
> }
> +static inline int register_mem_sect_under_node(struct memory_block *mem_blk,
> + int nid)
> +{
> + return 0;
> +}
> +static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk)
> +{
> + return 0;
> +}
> #endif
>
> #define to_node(sys_device) container_of(sys_device, struct node, sysdev)
> Index: linux-2.6.27-rc8/mm/memory_hotplug.c
> ===================================================================
> --- linux-2.6.27-rc8.orig/mm/memory_hotplug.c 2008-10-06 11:18:44.000000000 -0700
> +++ linux-2.6.27-rc8/mm/memory_hotplug.c 2008-10-06 11:19:41.000000000 -0700
> @@ -216,7 +216,8 @@
> return 0;
> }
>
> -static int __add_section(struct zone *zone, unsigned long phys_start_pfn)
> +static int __add_section(int nid, struct zone *zone,
> + unsigned long phys_start_pfn)
> {
> int nr_pages = PAGES_PER_SECTION;
> int ret;
> @@ -234,7 +235,7 @@
> if (ret < 0)
> return ret;
>
> - return register_new_memory(__pfn_to_section(phys_start_pfn));
> + return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
> }
>
> #ifdef CONFIG_SPARSEMEM_VMEMMAP
> @@ -273,7 +274,7 @@
> * call this function after deciding the zone to which to
> * add the new pages.
> */
> -int __add_pages(struct zone *zone, unsigned long phys_start_pfn,
> +int __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
> unsigned long nr_pages)
> {
> unsigned long i;
> @@ -284,7 +285,7 @@
> end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1);
>
> for (i = start_sec; i <= end_sec; i++) {
> - err = __add_section(zone, i << PFN_SECTION_SHIFT);
> + err = __add_section(nid, zone, i << PFN_SECTION_SHIFT);
>
> /*
> * EEXIST is finally dealt with by ioresource collision
> Index: linux-2.6.27-rc8/arch/x86/mm/init_64.c
> ===================================================================
> --- linux-2.6.27-rc8.orig/arch/x86/mm/init_64.c 2008-10-06 11:18:45.000000000 -0700
> +++ linux-2.6.27-rc8/arch/x86/mm/init_64.c 2008-10-06 11:24:09.000000000 -0700
> @@ -725,7 +725,7 @@
> if (last_mapped_pfn > max_pfn_mapped)
> max_pfn_mapped = last_mapped_pfn;
>
> - ret = __add_pages(zone, start_pfn, nr_pages);
> + ret = __add_pages(nid, zone, start_pfn, nr_pages);
> WARN_ON(1);
>
> return ret;
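
For illustration only, the scan that register_mem_sect_under_node()
performs can be modelled in plain userspace C. This is a sketch under
stated assumptions, not kernel code: DEMO_PAGES_PER_SECTION,
demo_section_on_node() and the page_nid[] array are made-up stand-ins
for PAGES_PER_SECTION, the loop in the patch, and the values that
get_nid_for_pfn() would return.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PAGES_PER_SECTION; real values are far larger. */
#define DEMO_PAGES_PER_SECTION 8

/*
 * Return 1 if any initialized page of the section belongs to nid, else 0.
 * page_nid[i] is the node ID of page i of the section, or -1 when the page
 * is invalid or uninitialized (the case get_nid_for_pfn() reports as -1).
 */
int demo_section_on_node(const int *page_nid, int nid)
{
    size_t i;

    assert(page_nid != NULL);
    for (i = 0; i < DEMO_PAGES_PER_SECTION; i++) {
        if (page_nid[i] < 0)
            continue;   /* skip holes rather than trusting them */
        if (page_nid[i] == nid)
            return 1;   /* the patch creates the symlink at this point */
    }
    return 0;           /* section does not span this node */
}
```

A section whose first pages are uninitialized and whose remaining pages
sit on node 1 is reported for node 1 and not for node 0, and a section
split between nodes 0 and 1 is reported for both.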

--
Yasunori Goto

2008-10-10 19:48:55

by Andrew Morton

Subject: Re: [PATCH 1/2] [REPOST] mm: show node to memory section relationship with symlinks in sysfs

On Thu, 9 Oct 2008 12:21:15 -0700
Gary Hade <[email protected]> wrote:

> Show node to memory section relationship with symlinks in sysfs
>
> Add /sys/devices/system/node/nodeX/memoryY symlinks for all
> the memory sections located on nodeX. For example:
> /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
> indicates that memory section 135 resides on node1.

I'm not seeing here a description of why the kernel needs this feature.
Why is it useful? How will it be used? What value does it have to
our users?

2008-10-10 21:34:28

by Gary Hade

Subject: Re: [PATCH 1/2] [REPOST] mm: show node to memory section relationship with symlinks in sysfs

On Fri, Oct 10, 2008 at 12:42:39PM -0700, Andrew Morton wrote:
> On Thu, 9 Oct 2008 12:21:15 -0700
> Gary Hade <[email protected]> wrote:
>
> > Show node to memory section relationship with symlinks in sysfs
> >
> > Add /sys/devices/system/node/nodeX/memoryY symlinks for all
> > the memory sections located on nodeX. For example:
> > /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
> > indicates that memory section 135 resides on node1.
>
> I'm not seeing here a description of why the kernel needs this feature.
> Why is it useful? How will it be used? What value does it have to
> our users?

Sorry, I should have included that. In our case, it is another
small step towards eventual total node removal. We will need to
know which memory sections to offline for whatever node is targeted
for removal. However, I suspect that exposing the node to section
information to user-level could be useful for other purposes.
For example, I have been thinking that using memory hotremove
functionality to modify the amount of available memory on specific
nodes without having to physically add/remove DIMMs might be useful
to those that test application or benchmark performance on a
multi-node system in various memory configurations.
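
As a rough sketch of how a userspace tool might consume the proposed
links, the helper below recognizes nodeX directory entries of the form
"memoryY" and extracts the section number. The function name
demo_parse_memory_link() is hypothetical, and the sketch assumes only
the naming shown in the patch description.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse a nodeX directory entry name of the form "memoryY" and store Y in
 * *section. Returns 1 on a match, 0 for any other entry name.
 */
int demo_parse_memory_link(const char *name, unsigned long *section)
{
    static const char prefix[] = "memory";
    const size_t plen = sizeof(prefix) - 1;
    char *end;

    assert(name != NULL && section != NULL);
    if (strncmp(name, prefix, plen) != 0)
        return 0;
    if (name[plen] < '0' || name[plen] > '9')
        return 0;       /* reject "memory", "memory-1", "memoryX" */
    *section = strtoul(name + plen, &end, 10);
    return *end == '\0';
}
```

A caller could feed it each name returned by readdir() on
/sys/devices/system/node/node1 to collect the sections that would have
to be offlined before that node is removed.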

Gary

--
Gary Hade
System x Enablement
IBM Linux Technology Center
503-578-4503 IBM T/L: 775-4503
[email protected]
http://www.ibm.com/linux/ltc

2008-10-10 22:02:32

by Andrew Morton

Subject: Re: [PATCH 1/2] [REPOST] mm: show node to memory section relationship with symlinks in sysfs

On Fri, 10 Oct 2008 14:33:57 -0700
Gary Hade <[email protected]> wrote:

> On Fri, Oct 10, 2008 at 12:42:39PM -0700, Andrew Morton wrote:
> > On Thu, 9 Oct 2008 12:21:15 -0700
> > Gary Hade <[email protected]> wrote:
> >
> > > Show node to memory section relationship with symlinks in sysfs
> > >
> > > Add /sys/devices/system/node/nodeX/memoryY symlinks for all
> > > the memory sections located on nodeX. For example:
> > > /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
> > > indicates that memory section 135 resides on node1.
> >
> > I'm not seeing here a description of why the kernel needs this feature.
> > Why is it useful? How will it be used? What value does it have to
> > our users?
>
> Sorry, I should have included that. In our case, it is another
> small step towards eventual total node removal. We will need to
> know which memory sections to offline for whatever node is targeted
> for removal. However, I suspect that exposing the node to section
> information to user-level could be useful for other purposes.
> For example, I have been thinking that using memory hotremove
> functionality to modify the amount of available memory on specific
> nodes without having to physically add/remove DIMMs might be useful
> to those that test application or benchmark performance on a
> multi-node system in various memory configurations.
>

hm, OK, thanks. It does sound a bit thin, and if we merge this then
not only do we get a porkier kernel, we also get a new userspace
interface which we're then locked into.

So I'm inclined to skip this change until we have a stronger need?

2008-10-10 23:19:00

by Gary Hade

Subject: Re: [PATCH 1/2] [REPOST] mm: show node to memory section relationship with symlinks in sysfs

On Fri, Oct 10, 2008 at 02:59:50PM -0700, Andrew Morton wrote:
> On Fri, 10 Oct 2008 14:33:57 -0700
> Gary Hade <[email protected]> wrote:
>
> > On Fri, Oct 10, 2008 at 12:42:39PM -0700, Andrew Morton wrote:
> > > On Thu, 9 Oct 2008 12:21:15 -0700
> > > Gary Hade <[email protected]> wrote:
> > >
> > > > Show node to memory section relationship with symlinks in sysfs
> > > >
> > > > Add /sys/devices/system/node/nodeX/memoryY symlinks for all
> > > > the memory sections located on nodeX. For example:
> > > > /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
> > > > indicates that memory section 135 resides on node1.
> > >
> > > I'm not seeing here a description of why the kernel needs this feature.
> > > Why is it useful? How will it be used? What value does it have to
> > > our users?
> >
> > Sorry, I should have included that. In our case, it is another
> > small step towards eventual total node removal. We will need to
> > know which memory sections to offline for whatever node is targeted
> > for removal. However, I suspect that exposing the node to section
> > information to user-level could be useful for other purposes.
> > For example, I have been thinking that using memory hotremove
> > functionality to modify the amount of available memory on specific
> > nodes without having to physically add/remove DIMMs might be useful
> > to those that test application or benchmark performance on a
> > multi-node system in various memory configurations.
> >
>
> hm, OK, thanks. It does sound a bit thin, and if we merge this then
> not only do we get a porkier kernel,

Would you feel the same about the size increase if patch 2/2 (include
memory section subtree in sysfs with only sparsemem enabled) was
withdrawn?

Without patch 2/2 the size increase for non-Sparsemem or Sparsemem
wo/memory hotplug kernels is extremely small. Even for memory hotplug
enabled kernels there is only a little extra code in ./drivers/base/node.o
which only gets linked into NUMA enabled kernels. I can gather some numbers
if necessary.

> we also get a new userspace interface which we're then locked into.

True.

>
> So I'm inclined to skip this change until we have a stronger need?

Of course, I'm not. :)

Gary

--
Gary Hade
System x Enablement
IBM Linux Technology Center
503-578-4503 IBM T/L: 775-4503
[email protected]
http://www.ibm.com/linux/ltc

2008-10-10 23:38:47

by Andrew Morton

Subject: Re: [PATCH 1/2] [REPOST] mm: show node to memory section relationship with symlinks in sysfs

On Fri, 10 Oct 2008 16:18:44 -0700
Gary Hade <[email protected]> wrote:

> On Fri, Oct 10, 2008 at 02:59:50PM -0700, Andrew Morton wrote:
> > On Fri, 10 Oct 2008 14:33:57 -0700
> > Gary Hade <[email protected]> wrote:
> >
> > > On Fri, Oct 10, 2008 at 12:42:39PM -0700, Andrew Morton wrote:
> > > > On Thu, 9 Oct 2008 12:21:15 -0700
> > > > Gary Hade <[email protected]> wrote:
> > > >
> > > > > Show node to memory section relationship with symlinks in sysfs
> > > > >
> > > > > Add /sys/devices/system/node/nodeX/memoryY symlinks for all
> > > > > the memory sections located on nodeX. For example:
> > > > > /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
> > > > > indicates that memory section 135 resides on node1.
> > > >
> > > > I'm not seeing here a description of why the kernel needs this feature.
> > > > Why is it useful? How will it be used? What value does it have to
> > > > our users?
> > >
> > > Sorry, I should have included that. In our case, it is another
> > > small step towards eventual total node removal. We will need to
> > > know which memory sections to offline for whatever node is targeted
> > > for removal. However, I suspect that exposing the node to section
> > > information to user-level could be useful for other purposes.
> > > For example, I have been thinking that using memory hotremove
> > > functionality to modify the amount of available memory on specific
> > > nodes without having to physically add/remove DIMMs might be useful
> > > to those that test application or benchmark performance on a
> > > multi-node system in various memory configurations.
> > >
> >
> > hm, OK, thanks. It does sound a bit thin, and if we merge this then
> > not only do we get a porkier kernel,
>
> Would you feel the same about the size increase if patch 2/2 (include
> memory section subtree in sysfs with only sparsemem enabled) was
> withdrawn?
>
> Without patch 2/2 the size increase for non-Sparsemem or Sparsemem
> wo/memory hotplug kernels is extremely small. Even for memory hotplug
> enabled kernels there is only a little extra code in ./drivers/base/node.o
> which only gets linked into NUMA enabled kernels. I can gather some numbers
> if necessary.

Size is probably a minor issue on memory-hotpluggable machines.

> > we also get a new userspace interface which we're then locked into.
>
> True.

That's a bigger issue. The later we leave this sort of thing, the more
information we have.

2008-10-13 16:34:34

by Gary Hade

Subject: Re: [PATCH 1/2] [REPOST] mm: show node to memory section relationship with symlinks in sysfs

On Fri, Oct 10, 2008 at 04:32:30PM -0700, Andrew Morton wrote:
> On Fri, 10 Oct 2008 16:18:44 -0700
> Gary Hade <[email protected]> wrote:
>
> > On Fri, Oct 10, 2008 at 02:59:50PM -0700, Andrew Morton wrote:
> > > On Fri, 10 Oct 2008 14:33:57 -0700
> > > Gary Hade <[email protected]> wrote:
> > >
> > > > On Fri, Oct 10, 2008 at 12:42:39PM -0700, Andrew Morton wrote:
> > > > > On Thu, 9 Oct 2008 12:21:15 -0700
> > > > > Gary Hade <[email protected]> wrote:
> > > > >
> > > > > > Show node to memory section relationship with symlinks in sysfs
> > > > > >
> > > > > > Add /sys/devices/system/node/nodeX/memoryY symlinks for all
> > > > > > the memory sections located on nodeX. For example:
> > > > > > /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
> > > > > > indicates that memory section 135 resides on node1.
> > > > >
> > > > > I'm not seeing here a description of why the kernel needs this feature.
> > > > > Why is it useful? How will it be used? What value does it have to
> > > > > our users?
> > > >
> > > > Sorry, I should have included that. In our case, it is another
> > > > small step towards eventual total node removal. We will need to
> > > > know which memory sections to offline for whatever node is targeted
> > > > for removal. However, I suspect that exposing the node to section
> > > > information to user-level could be useful for other purposes.
> > > > For example, I have been thinking that using memory hotremove
> > > > functionality to modify the amount of available memory on specific
> > > > nodes without having to physically add/remove DIMMs might be useful
> > > > to those that test application or benchmark performance on a
> > > > multi-node system in various memory configurations.
> > > >
> > >
> > > hm, OK, thanks. It does sound a bit thin, and if we merge this then
> > > not only do we get a porkier kernel,
> >
> > Would you feel the same about the size increase if patch 2/2 (include
> > memory section subtree in sysfs with only sparsemem enabled) was
> > withdrawn?
> >
> > Without patch 2/2 the size increase for non-Sparsemem or Sparsemem
> > wo/memory hotplug kernels is extremely small. Even for memory hotplug
> > enabled kernels there is only a little extra code in ./drivers/base/node.o
> > which only gets linked into NUMA enabled kernels. I can gather some numbers
> > if necessary.
>
> Size is probably a minor issue on memory-hotpluggable machines.
>
> > > we also get a new userspace interface which we're then locked into.
> >
> > True.
>
> That's a bigger issue. The later we leave this sort of thing, the more
> information we have.

I understand your concerns about adding possibly frivolous interfaces
but in this case we are simply eliminating a very obvious hole in the
existing set of memory hot-add/remove interfaces. In general, it
makes absolutely no sense to provide a resource add/remove mechanism
without telling the user where the resource is physically located.
i.e. providing the _maximum_ possible amount of location information
available for the add/remove controllable resource. This is especially
critical for large multi-node systems and for resources that can impact
application or overall system performance.

The kernel already exports node location information for CPUs
(e.g. /sys/devices/system/node/node0/cpu0 -> ../../cpu/cpu0) and
PCI devices (e.g. ./devices/pci0000:00/0000:00:00.0/numa_node).
Why should memory be treated any differently?

The memory hot-add/remove interfaces include physical device files
(e.g. /sys/devices/system/memory/memory0/phys_device) which are not
yet fully implemented. When systems that support removable memory
modules force this interface to mature, node location information
will become even more critical. This feature will not be very useful
on multi-node systems if the user does not know what node a specific
memory module is installed in. It may be possible to encode the
node ID into the string provided by the phys_device file but a
more general node to memory section association as provided by this
patch is better since it can be used for other purposes.

Gary

--
Gary Hade
System x Enablement
IBM Linux Technology Center
503-578-4503 IBM T/L: 775-4503
[email protected]
http://www.ibm.com/linux/ltc

2008-10-13 16:40:48

by Dave Hansen

Subject: Re: [PATCH 1/2] [REPOST] mm: show node to memory section relationship with symlinks in sysfs

On Mon, 2008-10-13 at 09:34 -0700, Gary Hade wrote:
> I understand your concerns about adding possibly frivolous interfaces
> but in this case we are simply eliminating a very obvious hole in the
> existing set of memory hot-add/remove interfaces. In general, it
> makes absolutely no sense to provide a resource add/remove mechanism
> without telling the user where the resource is physically located.

Does it help that we export the phys_index (basically the section
number) as part of the section directory?

I don't think we export the physical memory ranges of NUMA nodes. But,
if we did that as well, it would allow userspace to do this association
without troubling the kernel with maintaining it.
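
The userspace association suggested here could look roughly like the
sketch below. It rests on assumptions: struct demo_node_range stands in
for a per-node physical range export that does not exist today, the
section size is passed in by the caller, and a section is assumed to lie
entirely within one node, which the patch itself does not assume.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical export: physical address range [start, end) owned by a node. */
struct demo_node_range {
    int nid;
    unsigned long long start;
    unsigned long long end;
};

/*
 * Return the node whose range contains the first byte of the section with
 * the given phys_index, or -1 if no range matches. section_size is the size
 * of one memory section in bytes.
 */
int demo_section_to_node(unsigned long phys_index,
                         unsigned long long section_size,
                         const struct demo_node_range *ranges,
                         size_t nr_ranges)
{
    unsigned long long addr = (unsigned long long)phys_index * section_size;
    size_t i;

    assert(ranges != NULL || nr_ranges == 0);
    for (i = 0; i < nr_ranges; i++) {
        if (addr >= ranges[i].start && addr < ranges[i].end)
            return ranges[i].nid;
    }
    return -1;
}
```

Because only the first byte of the section is checked, this approach
cannot report a section that spans two nodes, whereas the symlink scheme
in the patch can link one section under several nodes.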

-- Dave

2008-10-14 11:54:35

by Yasunori Goto

Subject: Re: [PATCH 1/2] [REPOST] mm: show node to memory section relationship with symlinks in sysfs

> On Fri, Oct 10, 2008 at 04:32:30PM -0700, Andrew Morton wrote:
> > On Fri, 10 Oct 2008 16:18:44 -0700
> > Gary Hade <[email protected]> wrote:
> >
> > > On Fri, Oct 10, 2008 at 02:59:50PM -0700, Andrew Morton wrote:
> > > > On Fri, 10 Oct 2008 14:33:57 -0700
> > > > Gary Hade <[email protected]> wrote:
> > > >
> > > > > On Fri, Oct 10, 2008 at 12:42:39PM -0700, Andrew Morton wrote:
> > > > > > On Thu, 9 Oct 2008 12:21:15 -0700
> > > > > > Gary Hade <[email protected]> wrote:
> > > > > >
> > > > > > > Show node to memory section relationship with symlinks in sysfs
> > > > > > >
> > > > > > > Add /sys/devices/system/node/nodeX/memoryY symlinks for all
> > > > > > > the memory sections located on nodeX. For example:
> > > > > > > /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
> > > > > > > indicates that memory section 135 resides on node1.
> > > > > >
> > > > > > I'm not seeing here a description of why the kernel needs this feature.
> > > > > > Why is it useful? How will it be used? What value does it have to
> > > > > > our users?
> > > > >
> > > > > Sorry, I should have included that. In our case, it is another
> > > > > small step towards eventual total node removal. We will need to
> > > > > know which memory sections to offline for whatever node is targeted
> > > > > for removal. However, I suspect that exposing the node to section
> > > > > information to user-level could be useful for other purposes.
> > > > > For example, I have been thinking that using memory hotremove
> > > > > functionality to modify the amount of available memory on specific
> > > > > nodes without having to physically add/remove DIMMs might be useful
> > > > > to those that test application or benchmark performance on a
> > > > > multi-node system in various memory configurations.
> > > > >
> > > >
> > > > hm, OK, thanks. It does sound a bit thin, and if we merge this then
> > > > not only do we get a porkier kernel,
> > >
> > > Would you feel the same about the size increase if patch 2/2 (include
> > > memory section subtree in sysfs with only sparsemem enabled) was
> > > withdrawn?
> > >
> > > Without patch 2/2 the size increase for non-Sparsemem or Sparsemem
> > > wo/memory hotplug kernels is extremely small. Even for memory hotplug
> > > enabled kernels there is only a little extra code in ./drivers/base/node.o
> > > which only gets linked into NUMA enabled kernels. I can gather some numbers
> > > if necessary.
> >
> > Size is probably a minor issue on memory-hotpluggable machines.
> >
> > > > we also get a new userspace interface which we're then locked into.
> > >
> > > True.
> >
> > That's a bigger issue. The later we leave this sort of thing, the more
> > information we have.
>
> I understand your concerns about adding possibly frivolous interfaces
> but in this case we are simply eliminating a very obvious hole in the
> existing set of memory hot-add/remove interfaces. In general, it
> makes absolutely no sense to provide a resource add/remove mechanism
> without telling the user where the resource is physically located.
> i.e. providing the _maximum_ possible amount of location information
> available for the add/remove controllable resource. This is especially
> critical for large multi-node systems and for resources that can impact
> application or overall system performance.
>
> The kernel already exports node location information for CPUs
> (e.g. /sys/devices/system/node/node0/cpu0 -> ../../cpu/cpu0) and
> PCI devices (e.g. ./devices/pci0000:00/0000:00:00.0/numa_node).
> Why should memory be treated any differently?
>
> The memory hot-add/remove interfaces include physical device files
> (e.g. /sys/devices/system/memory/memory0/phys_device) which are not
> yet fully implemented. When systems that support removable memory
> modules force this interface to mature, node location information
> will become even more critical. This feature will not be very useful
> on multi-node systems if the user does not know what node a specific
> memory module is installed in. It may be possible to encode the
> node ID into the string provided by the phys_device file but a
> more general node to memory section association as provided by this
> patch is better since it can be used for other purposes.

Sorry for the late response.

Our Fujitsu box can hot-add a node. This means a user or script has to
find which memory sections and CPUs belong to the added node when a
node hot-add is executed.

My current hotplug script is very poor. It onlines all offlined CPUs and
memory. However, if the user intentionally offlined one memory section
because of a memory error message, the script cannot tell that this was
intended, and it onlines the faulty section again. I think this is one of
the reasons why this link is necessary.
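
A minimal sketch of the selection such a script could make once the
links exist: online only the offline sections that are linked under the
newly added node. The function name demo_sections_to_online() and the
array-based interface are hypothetical; a real script would read the
section lists from sysfs.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy into out[] every offline section that is also linked under the newly
 * added node, and return how many were copied. out[] must have room for
 * nr_offline entries. Offline sections that belong to other nodes (for
 * example, ones an administrator offlined on purpose) are left alone.
 */
size_t demo_sections_to_online(const unsigned long *offline, size_t nr_offline,
                               const unsigned long *node_sections,
                               size_t nr_node, unsigned long *out)
{
    size_t i, j, n = 0;

    assert(out != NULL);
    for (i = 0; i < nr_offline; i++) {
        for (j = 0; j < nr_node; j++) {
            if (offline[i] == node_sections[j]) {
                out[n++] = offline[i];
                break;
            }
        }
    }
    return n;
}
```

With offline sections {12, 135, 136, 137} and a new node that links
{135, 136, 137}, section 12 stays offline.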



I would like to show not only the node id but also the ACPI _PXM value,
to specify the physical position of the node, because the node id is
decided by the OS at boot time (and at hot-add time) to make it
consecutive. (This is a historical inheritance from when there was no
macro like for_each_online_cpus().)

If a system has 2 nodes whose _PXM values are 0 and 3 when it boots,
the kernel assigns them node ids 0 and 1. When a node whose _PXM value
is 1 is then hot-added, the new node id will be 2.

_PXM 0 1 3
node id 0 2 1

When the user reboots the system, the node ids will be as follows.
The user will be confused by this.
_PXM 0 1 3
node id 0 1 2
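
The two tables above can be reproduced with a small model in which each
newly seen _PXM value receives the next consecutive node id. This is a
sketch of the behaviour described in this mail, not the kernel's actual
mapping code; struct demo_node_map, demo_map_init() and demo_map_add()
are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_PXM 8

/* pxm_to_nid[pxm] is the node id given to that proximity domain, or -1. */
struct demo_node_map {
    int pxm_to_nid[DEMO_MAX_PXM];
    int next_nid;
};

void demo_map_init(struct demo_node_map *m)
{
    size_t i;

    assert(m != NULL);
    for (i = 0; i < DEMO_MAX_PXM; i++)
        m->pxm_to_nid[i] = -1;
    m->next_nid = 0;
}

/* Hand out the next consecutive node id the first time a _PXM is seen. */
int demo_map_add(struct demo_node_map *m, int pxm)
{
    assert(m != NULL && pxm >= 0 && pxm < DEMO_MAX_PXM);
    if (m->pxm_to_nid[pxm] < 0)
        m->pxm_to_nid[pxm] = m->next_nid++;
    return m->pxm_to_nid[pxm];
}
```

Adding _PXM 0, 3 and then 1 gives node ids 0, 1 and 2 (the hot-add
case), while adding 0, 1 and 3 in order after a reboot gives _PXM 1 the
node id 1 and _PXM 3 the node id 2.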

The current kernel may allow "node id = _PXM", because for_each_xxx_node()
works well now. But I'm not sure....

Bye.

--
Yasunori Goto

2008-10-14 21:06:46

by Gary Hade

Subject: Re: [PATCH 1/2] [REPOST] mm: show node to memory section relationship with symlinks in sysfs

On Tue, Oct 14, 2008 at 08:54:21PM +0900, Yasunori Goto wrote:
> > On Fri, Oct 10, 2008 at 04:32:30PM -0700, Andrew Morton wrote:
> > > On Fri, 10 Oct 2008 16:18:44 -0700
> > > Gary Hade <[email protected]> wrote:
> > >
> > > > On Fri, Oct 10, 2008 at 02:59:50PM -0700, Andrew Morton wrote:
> > > > > On Fri, 10 Oct 2008 14:33:57 -0700
> > > > > Gary Hade <[email protected]> wrote:
> > > > >
> > > > > > On Fri, Oct 10, 2008 at 12:42:39PM -0700, Andrew Morton wrote:
> > > > > > > On Thu, 9 Oct 2008 12:21:15 -0700
> > > > > > > Gary Hade <[email protected]> wrote:
> > > > > > >
> > > > > > > > Show node to memory section relationship with symlinks in sysfs
> > > > > > > >
> > > > > > > > Add /sys/devices/system/node/nodeX/memoryY symlinks for all
> > > > > > > > the memory sections located on nodeX. For example:
> > > > > > > > /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
> > > > > > > > indicates that memory section 135 resides on node1.
> > > > > > >
> > > > > > > I'm not seeing here a description of why the kernel needs this feature.
> > > > > > > Why is it useful? How will it be used? What value does it have to
> > > > > > > our users?
> > > > > >
> > > > > > Sorry, I should have included that. In our case, it is another
> > > > > > small step towards eventual total node removal. We will need to
> > > > > > know which memory sections to offline for whatever node is targeted
> > > > > > for removal. However, I suspect that exposing the node to section
> > > > > > information to user-level could be useful for other purposes.
> > > > > > For example, I have been thinking that using memory hotremove
> > > > > > functionality to modify the amount of available memory on specific
> > > > > > nodes without having to physically add/remove DIMMs might be useful
> > > > > > to those that test application or benchmark performance on a
> > > > > > multi-node system in various memory configurations.
> > > > > >
> > > > >
> > > > > hm, OK, thanks. It does sound a bit thin, and if we merge this then
> > > > > not only do we get a porkier kernel,
> > > >
> > > > Would you feel the same about the size increase if patch 2/2 (include
> > > > memory section subtree in sysfs with only sparsemem enabled) was
> > > > withdrawn?
> > > >
> > > > Without patch 2/2 the size increase for non-Sparsemem or Sparsemem
> > > > wo/memory hotplug kernels is extremely small. Even for memory hotplug
> > > > enabled kernels there is only a little extra code in ./drivers/base/node.o
> > > > which only gets linked into NUMA enabled kernels. I can gather some numbers
> > > > if necessary.
> > >
> > > Size is probably a minor issue on memory-hotpluggable machines.
> > >
> > > > > we also get a new userspace interface which we're then locked into.
> > > >
> > > > True.
> > >
> > > That's a bigger issue. The later we leave this sort of thing, the more
> > > information we have.
> >
> > I understand your concerns about adding possibly frivolous interfaces
> > but in this case we are simply eliminating a very obvious hole in the
> > existing set of memory hot-add/remove interfaces. In general, it
> > makes absolutely no sense to provide a resource add/remove mechanism
> > without telling the user where the resource is physically located.
> > i.e. providing the _maximum_ possible amount of location information
> > available for the add/remove controllable resource. This is especially
> > critical for large multi-node systems and for resources that can impact
> > application or overall system performance.
> >
> > The kernel already exports node location information for CPUs
> > (e.g. /sys/devices/system/node/node0/cpu0 -> ../../cpu/cpu0) and
> > PCI devices (e.g. ./devices/pci0000:00/0000:00:00.0/numa_node).
> > Why should memory be treated any differently?
> >
> > The memory hot-add/remove interfaces include physical device files
> > (e.g. /sys/devices/system/memory/memory0/phys_device) which are not
> > yet fully implemented. When systems that support removable memory
> > modules force this interface to mature, node location information
> > will become even more critical. This feature will not be very useful
> > on multi-node systems if the user does not know what node a specific
> > memory module is installed in. It may be possible to encode the
> > node ID into the string provided by the phys_device file but a
> > more general node to memory section association as provided by this
> > patch is better since it can be used for other purposes.
>
> Sorry for late responce.
>
> Our fujitsu box can hot-add a node. This means a user/script has to
> find which memory sections and cpus belong to added node when node hot-add
> is executed.
>
> Current my hotplug script is very poor. It onlines all offlined cpus and memories.
> However if user offlined one memory section intentionally due to
> memory error message, the script can't understand it is intended, and hot-add
> the error section. I think this is one of reason why this link is necessary.

When it comes time to replace that misbehaving memory I bet the
user will be delighted to know which node it is in. This sort
of benefit will of course _not_ be limited to the small community
of node hot-add/remove capable systems.
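
As a rough illustration of how user space might answer the "which node
is it in" question, here is a minimal sketch that assumes the
nodeX/memoryY symlink layout proposed in this patch. The function name
and the sysfs_root parameter are hypothetical and exist only to make
the example self-contained. It returns a list because a section may
span more than one node.

```python
import os
import re


def nodes_for_section(section, sysfs_root="/sys/devices/system"):
    """Return the sorted node IDs whose directory has a memory<section> entry.

    Assumes the proposed layout: <sysfs_root>/node/nodeX/memoryY.
    """
    node_dir = os.path.join(sysfs_root, "node")
    found = []
    for name in os.listdir(node_dir):
        match = re.fullmatch(r"node(\d+)", name)
        if not match:
            continue  # skip non-node entries in the node directory
        entry = os.path.join(node_dir, name, f"memory{section}")
        # lexists() is true for the symlink itself, even if its target is gone
        if os.path.lexists(entry):
            found.append(int(match.group(1)))
    return sorted(found)
```

For the example in the patch description, nodes_for_section(135) would
be expected to report node 1.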

Gary

--
Gary Hade
System x Enablement
IBM Linux Technology Center
503-578-4503 IBM T/L: 775-4503
[email protected]
http://www.ibm.com/linux/ltc

2008-10-15 10:38:17

by Yasunori Goto

Subject: Re: [PATCH 1/2] [REPOST] mm: show node to memory section relationship with symlinks in sysfs

> On Tue, Oct 14, 2008 at 08:54:21PM +0900, Yasunori Goto wrote:
> > > On Fri, Oct 10, 2008 at 04:32:30PM -0700, Andrew Morton wrote:
> > > > On Fri, 10 Oct 2008 16:18:44 -0700
> > > > Gary Hade <[email protected]> wrote:
> > > >
> > > > > On Fri, Oct 10, 2008 at 02:59:50PM -0700, Andrew Morton wrote:
> > > > > > On Fri, 10 Oct 2008 14:33:57 -0700
> > > > > > Gary Hade <[email protected]> wrote:
> > > > > >
> > > > > > > On Fri, Oct 10, 2008 at 12:42:39PM -0700, Andrew Morton wrote:
> > > > > > > > On Thu, 9 Oct 2008 12:21:15 -0700
> > > > > > > > Gary Hade <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Show node to memory section relationship with symlinks in sysfs
> > > > > > > > >
> > > > > > > > > Add /sys/devices/system/node/nodeX/memoryY symlinks for all
> > > > > > > > > the memory sections located on nodeX. For example:
> > > > > > > > > /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
> > > > > > > > > indicates that memory section 135 resides on node1.
> > > > > > > >
> > > > > > > > I'm not seeing here a description of why the kernel needs this feature.
> > > > > > > > Why is it useful? How will it be used? What value does it have to
> > > > > > > > our users?
> > > > > > >
> > > > > > > Sorry, I should have included that. In our case, it is another
> > > > > > > small step towards eventual total node removal. We will need to
> > > > > > > know which memory sections to offline for whatever node is targeted
> > > > > > > for removal. However, I suspect that exposing the node to section
> > > > > > > information to user-level could be useful for other purposes.
> > > > > > > For example, I have been thinking that using memory hotremove
> > > > > > > functionality to modify the amount of available memory on specific
> > > > > > > nodes without having to physically add/remove DIMMs might be useful
> > > > > > > to those that test application or benchmark performance on a
> > > > > > > multi-node system in various memory configurations.
> > > > > > >
> > > > > >
> > > > > > hm, OK, thanks. It does sound a bit thin, and if we merge this then
> > > > > > not only do we get a porkier kernel,
> > > > >
> > > > > Would you feel the same about the size increase if patch 2/2 (include
> > > > > memory section subtree in sysfs with only sparsemem enabled) was
> > > > > withdrawn?
> > > > >
> > > > > Without patch 2/2 the size increase for non-Sparsemem or Sparsemem
> > > > > wo/memory hotplug kernels is extremely small. Even for memory hotplug
> > > > > enabled kernels there is only a little extra code in ./drivers/base/node.o
> > > > > which only gets linked into NUMA enabled kernels. I can gather some numbers
> > > > > if necessary.
> > > >
> > > > Size is probably a minor issue on memory-hotpluggable machines.
> > > >
> > > > > > we also get a new userspace interface which we're then locked into.
> > > > >
> > > > > True.
> > > >
> > > > That's a bigger issue. The later we leave this sort of thing, the more
> > > > information we have.
> > >
> > > I understand your concerns about adding possibly frivolous interfaces
> > > but in this case we are simply eliminating a very obvious hole in the
> > > existing set of memory hot-add/remove interfaces. In general, it
> > > makes absolutely no sense to provide a resource add/remove mechanism
> > > without telling the user where the resource is physically located.
> > > i.e. providing the _maximum_ possible amount of location information
> > > available for the add/remove controllable resource. This is especially
> > > critical for large multi-node systems and for resources that can impact
> > > application or overall system performance.
> > >
> > > The kernel already exports node location information for CPUs
> > > (e.g. /sys/devices/system/node/node0/cpu0 -> ../../cpu/cpu0) and
> > > PCI devices (e.g. ./devices/pci0000:00/0000:00:00.0/numa_node).
> > > Why should memory be treated any differently?
> > >
> > > The memory hot-add/remove interfaces include physical device files
> > > (e.g. /sys/devices/system/memory/memory0/phys_device) which are not
> > > yet fully implemented. When systems that support removable memory
> > > modules force this interface to mature, node location information
> > > will become even more critical. This feature will not be very useful
> > > on multi-node systems if the user does not know what node a specific
> > > memory module is installed in. It may be possible to encode the
> > > node ID into the string provided by the phys_device file but a
> > > more general node to memory section association as provided by this
> > > patch is better since it can be used for other purposes.
> >
> > Sorry for late responce.
> >
> > Our fujitsu box can hot-add a node. This means a user/script has to
> > find which memory sections and cpus belong to added node when node hot-add
> > is executed.
> >
> > Current my hotplug script is very poor. It onlines all offlined cpus and memories.
> > However if user offlined one memory section intentionally due to
> > memory error message, the script can't understand it is intended, and hot-add
> > the error section. I think this is one of reason why this link is necessary.
>
> When it comes time to replace that misbehaving memory I bet the
> user will be delighted to know which node it is in.

Even if the user knows which node is broken, the following case can still happen.
(I may be paranoid...)

1) When part of the memory on a node is broken, the user may just offline
the broken section, because there may be no spare node at that time.

2) After a new spare node arrives, the user may want to hot-add the
new node first, because he is afraid of a temporary reduction in resources.
(One node of our box has 8 CPU cores and a lot of memory now.)
He will then remove the broken old node after hot-adding the new node.

3) The user kicks the hot-add command to add the new node.
My script is executed automatically by the "acpi container driver"
via udev when the hot-add event occurs. The script will then online
the broken section.
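
With the node symlinks, a hot-add script could avoid step 3 by
onlining only the sections that belong to the newly added node. The
following is a minimal sketch, not my actual script. It assumes the
nodeX/memoryY symlinks from this patch and the existing per-section
'state' file; the function name and the sysfs_root parameter are
hypothetical.

```python
import glob
import os


def online_node_sections(node, sysfs_root="/sys/devices/system"):
    """Online every offline memory section linked under node<node>.

    Sections on other nodes are never touched, so a section that was
    offlined on purpose elsewhere stays offline. Returns the names of
    the sections whose state was changed.
    """
    changed = []
    pattern = os.path.join(sysfs_root, "node", f"node{node}", "memory[0-9]*")
    for link in sorted(glob.glob(pattern)):
        state_path = os.path.join(link, "state")
        with open(state_path) as f:
            state = f.read().strip()
        if state == "offline":
            # Writing "online" to the state file requests onlining.
            with open(state_path, "w") as f:
                f.write("online")
            changed.append(os.path.basename(link))
    return changed
```

The script would still need to learn the ID of the new node from the
hot-add event; that part is not shown here.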


> This sort of benefit will of course _not_ be limited to the small community
> of node hot-add/remove capable systems.

I agree.

Bye.

--
Yasunori Goto