Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1760687AbYJITVm (ORCPT ); Thu, 9 Oct 2008 15:21:42 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1758575AbYJITV1 (ORCPT ); Thu, 9 Oct 2008 15:21:27 -0400 Received: from e36.co.us.ibm.com ([32.97.110.154]:54472 "EHLO e36.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1760790AbYJITVZ (ORCPT ); Thu, 9 Oct 2008 15:21:25 -0400 Date: Thu, 9 Oct 2008 12:21:15 -0700 From: Gary Hade To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Andrew Morton , Yasunori Goto , Badari Pulavarty , Mel Gorman , Chris McDermott , Gary Hade , Ingo Molnar , Greg KH , Dave Hansen , Nish Aravamudan Subject: [PATCH 1/2] [REPOST] mm: show node to memory section relationship with symlinks in sysfs Message-ID: <20081009192115.GB8793@us.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.17+20080114 (2008-01-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 21228 Lines: 564 Show node to memory section relationship with symlinks in sysfs Add /sys/devices/system/node/nodeX/memoryY symlinks for all the memory sections located on nodeX. For example: /sys/devices/system/node/node1/memory135 -> ../../memory/memory135 indicates that memory section 135 resides on node1. Tested on 2-node x86_64, 2-node ppc64, and 2-node ia64 systems. Also revises documentation to cover this change as well as updating Documentation/ABI/testing/sysfs-devices-memory to include descriptions of memory hotremove files 'phys_device', 'phys_index', and 'state' that were previously not described there. Supersedes the "mm: show memory section to node relationship in sysfs" patch posted on 05 Sept 2008 which created node ID containing 'node' files in /sys/devices/system/memory/memoryX instead of symlinks. Changed from files to symlinks due to feedback that symlinks were more consistent with the sysfs way. Also supercedes the "mm: show node to memory section relationship with symlinks in sysfs" patch posted on 29 Sept 2008 to address a Yasunori Goto reported problem where an incorrect symlink was created due to a range of uninitialized pages at the beginning of a section. This problem which produced a symlink in /sys/devices/system/node/node0 that incorrectly referenced a mem section located on node1 is corrected in this version. This version also covers the case were a mem section could span multiple nodes. Signed-off-by: Gary Hade Signed-off-by: Badari Pulavarty --- Documentation/ABI/testing/sysfs-devices-memory | 51 +++++++ Documentation/memory-hotplug.txt | 16 +- arch/ia64/mm/init.c | 2 arch/powerpc/mm/mem.c | 2 arch/s390/mm/init.c | 2 arch/sh/mm/init.c | 3 arch/x86/mm/init_32.c | 2 arch/x86/mm/init_64.c | 2 drivers/base/memory.c | 19 +- drivers/base/node.c | 100 +++++++++++++++ include/linux/memory.h | 6 include/linux/memory_hotplug.h | 2 include/linux/node.h | 13 + mm/memory_hotplug.c | 9 - 14 files changed, 205 insertions(+), 24 deletions(-) Index: linux-2.6.27-rc8/Documentation/ABI/testing/sysfs-devices-memory =================================================================== --- linux-2.6.27-rc8.orig/Documentation/ABI/testing/sysfs-devices-memory 2008-10-06 11:18:46.000000000 -0700 +++ linux-2.6.27-rc8/Documentation/ABI/testing/sysfs-devices-memory 2008-10-06 11:20:19.000000000 -0700 @@ -6,7 +6,6 @@ internal state of the kernel memory blocks. Files could be added or removed dynamically to represent hot-add/remove operations. - Users: hotplug memory add/remove tools https://w3.opensource.ibm.com/projects/powerpc-utils/ @@ -19,6 +18,56 @@ This is useful for a user-level agent to determine identify removable sections of the memory before attempting potentially expensive hot-remove memory operation +Users: hotplug memory remove tools + https://w3.opensource.ibm.com/projects/powerpc-utils/ + +What: /sys/devices/system/memory/memoryX/phys_device +Date: September 2008 +Contact: Badari Pulavarty +Description: + The file /sys/devices/system/memory/memoryX/phys_device + is read-only and is designed to show the name of physical + memory device. Implementation is currently incomplete. +What: /sys/devices/system/memory/memoryX/phys_index +Date: September 2008 +Contact: Badari Pulavarty +Description: + The file /sys/devices/system/memory/memoryX/phys_index + is read-only and contains the section ID in hexadecimal + which is equivalent to decimal X contained in the + memory section directory name. + +What: /sys/devices/system/memory/memoryX/state +Date: September 2008 +Contact: Badari Pulavarty +Description: + The file /sys/devices/system/memory/memoryX/state + is read-write. When read, it's contents show the + online/offline state of the memory section. When written, + root can toggle the the online/offline state of a removable + memory section (see removable file description above) + using the following commands. + # echo online > /sys/devices/system/memory/memoryX/state + # echo offline > /sys/devices/system/memory/memoryX/state + + For example, if /sys/devices/system/memory/memory22/removable + contains a value of 1 and + /sys/devices/system/memory/memory22/state contains the + string "online" the following command can be executed by + by root to offline that section. + # echo offline > /sys/devices/system/memory/memory22/state Users: hotplug memory remove tools https://w3.opensource.ibm.com/projects/powerpc-utils/ + +What: /sys/devices/system/node/nodeX/memoryY +Date: September 2008 +Contact: Gary Hade +Description: + When CONFIG_NUMA is enabled + /sys/devices/system/node/nodeX/memoryY is a symbolic link that + points to the corresponding /sys/devices/system/memory/memoryY + memory section directory. For example, the following symbolic + link is created for memory section 9 on node0. + /sys/devices/system/node/node0/memory9 -> ../../memory/memory9 + Index: linux-2.6.27-rc8/Documentation/memory-hotplug.txt =================================================================== --- linux-2.6.27-rc8.orig/Documentation/memory-hotplug.txt 2008-10-06 11:18:45.000000000 -0700 +++ linux-2.6.27-rc8/Documentation/memory-hotplug.txt 2008-10-06 11:20:42.000000000 -0700 @@ -124,7 +124,7 @@ This option can be kernel module too. -------------------------------- -3 sysfs files for memory hotplug +4 sysfs files for memory hotplug -------------------------------- All sections have their device information under /sys/devices/system/memory as @@ -138,11 +138,12 @@ (0x100000000 / 1Gib = 4) This device covers address range [0x100000000 ... 0x140000000) -Under each section, you can see 3 files. +Under each section, you can see 4 files. /sys/devices/system/memory/memoryXXX/phys_index /sys/devices/system/memory/memoryXXX/phys_device /sys/devices/system/memory/memoryXXX/state +/sys/devices/system/memory/memoryXXX/removable 'phys_index' : read-only and contains section id, same as XXX. 'state' : read-write @@ -150,10 +151,20 @@ at write: user can specify "online", "offline" command 'phys_device': read-only: designed to show the name of physical memory device. This is not well implemented now. +'removable' : read-only: contains an integer value indicating + whether the memory section is removable or not + removable. A value of 1 indicates that the memory + section is removable and a value of 0 indicates that + it is not removable. NOTE: These directories/files appear after physical memory hotplug phase. +If CONFIG_NUMA is enabled the +/sys/devices/system/memory/memoryXXX memory section +directories can also be accessed via symbolic links located in +the /sys/devices/system/node/node* directories. For example: +/sys/devices/system/node/node0/memory9 -> ../../memory/memory9 -------------------------------- 4. Physical memory hot-add phase @@ -365,7 +376,6 @@ - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like sysctl or new control file. - showing memory section and physical device relationship. - - showing memory section and node relationship (maybe good for NUMA) - showing memory section is under ZONE_MOVABLE or not - test and make it better memory offlining. - support HugeTLB page migration and offlining. Index: linux-2.6.27-rc8/arch/ia64/mm/init.c =================================================================== --- linux-2.6.27-rc8.orig/arch/ia64/mm/init.c 2008-10-06 11:18:45.000000000 -0700 +++ linux-2.6.27-rc8/arch/ia64/mm/init.c 2008-10-06 11:19:41.000000000 -0700 @@ -693,7 +693,7 @@ pgdat = NODE_DATA(nid); zone = pgdat->node_zones + ZONE_NORMAL; - ret = __add_pages(zone, start_pfn, nr_pages); + ret = __add_pages(nid, zone, start_pfn, nr_pages); if (ret) printk("%s: Problem encountered in __add_pages() as ret=%d\n", Index: linux-2.6.27-rc8/arch/powerpc/mm/mem.c =================================================================== --- linux-2.6.27-rc8.orig/arch/powerpc/mm/mem.c 2008-10-06 11:18:44.000000000 -0700 +++ linux-2.6.27-rc8/arch/powerpc/mm/mem.c 2008-10-06 11:19:41.000000000 -0700 @@ -133,7 +133,7 @@ /* this should work for most non-highmem platforms */ zone = pgdata->node_zones; - return __add_pages(zone, start_pfn, nr_pages); + return __add_pages(nid, zone, start_pfn, nr_pages); } #ifdef CONFIG_MEMORY_HOTREMOVE Index: linux-2.6.27-rc8/arch/s390/mm/init.c =================================================================== --- linux-2.6.27-rc8.orig/arch/s390/mm/init.c 2008-10-06 11:18:45.000000000 -0700 +++ linux-2.6.27-rc8/arch/s390/mm/init.c 2008-10-06 11:19:41.000000000 -0700 @@ -183,7 +183,7 @@ rc = vmem_add_mapping(start, size); if (rc) return rc; - rc = __add_pages(zone, PFN_DOWN(start), PFN_DOWN(size)); + rc = __add_pages(nid, zone, PFN_DOWN(start), PFN_DOWN(size)); if (rc) vmem_remove_mapping(start, size); return rc; Index: linux-2.6.27-rc8/arch/sh/mm/init.c =================================================================== --- linux-2.6.27-rc8.orig/arch/sh/mm/init.c 2008-10-06 11:18:45.000000000 -0700 +++ linux-2.6.27-rc8/arch/sh/mm/init.c 2008-10-06 11:19:41.000000000 -0700 @@ -276,7 +276,8 @@ pgdat = NODE_DATA(nid); /* We only have ZONE_NORMAL, so this is easy.. */ - ret = __add_pages(pgdat->node_zones + ZONE_NORMAL, start_pfn, nr_pages); + ret = __add_pages(nid, pgdat->node_zones + ZONE_NORMAL, + start_pfn, nr_pages); if (unlikely(ret)) printk("%s: Failed, __add_pages() == %d\n", __func__, ret); Index: linux-2.6.27-rc8/arch/x86/mm/init_32.c =================================================================== --- linux-2.6.27-rc8.orig/arch/x86/mm/init_32.c 2008-10-06 11:18:45.000000000 -0700 +++ linux-2.6.27-rc8/arch/x86/mm/init_32.c 2008-10-06 11:19:41.000000000 -0700 @@ -995,7 +995,7 @@ unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; - return __add_pages(zone, start_pfn, nr_pages); + return __add_pages(nid, zone, start_pfn, nr_pages); } #endif Index: linux-2.6.27-rc8/drivers/base/memory.c =================================================================== --- linux-2.6.27-rc8.orig/drivers/base/memory.c 2008-10-06 11:18:44.000000000 -0700 +++ linux-2.6.27-rc8/drivers/base/memory.c 2008-10-06 11:19:41.000000000 -0700 @@ -345,8 +345,9 @@ * section belongs to... */ -static int add_memory_block(unsigned long node_id, struct mem_section *section, - unsigned long state, int phys_device) +static int add_memory_block(int nid, struct mem_section *section, + unsigned long state, int phys_device, + enum mem_add_context context) { struct memory_block *mem = kzalloc(sizeof(*mem), GFP_KERNEL); int ret = 0; @@ -368,6 +369,10 @@ ret = mem_create_simple_file(mem, phys_device); if (!ret) ret = mem_create_simple_file(mem, removable); + if (!ret) { + if (context == HOTPLUG) + ret = register_mem_sect_under_node(mem, nid); + } return ret; } @@ -380,7 +385,7 @@ * * This could be made generic for all sysdev classes. */ -static struct memory_block *find_memory_block(struct mem_section *section) +struct memory_block *find_memory_block(struct mem_section *section) { struct kobject *kobj; struct sys_device *sysdev; @@ -409,6 +414,7 @@ struct memory_block *mem; mem = find_memory_block(section); + unregister_mem_sect_under_nodes(mem); mem_remove_simple_file(mem, phys_index); mem_remove_simple_file(mem, state); mem_remove_simple_file(mem, phys_device); @@ -422,9 +428,9 @@ * need an interface for the VM to add new memory regions, * but without onlining it. */ -int register_new_memory(struct mem_section *section) +int register_new_memory(int nid, struct mem_section *section) { - return add_memory_block(0, section, MEM_OFFLINE, 0); + return add_memory_block(nid, section, MEM_OFFLINE, 0, HOTPLUG); } int unregister_memory_section(struct mem_section *section) @@ -456,7 +462,8 @@ for (i = 0; i < NR_MEM_SECTIONS; i++) { if (!present_section_nr(i)) continue; - err = add_memory_block(0, __nr_to_section(i), MEM_ONLINE, 0); + err = add_memory_block(0, __nr_to_section(i), MEM_ONLINE, + 0, BOOT); if (!ret) ret = err; } Index: linux-2.6.27-rc8/drivers/base/node.c =================================================================== --- linux-2.6.27-rc8.orig/drivers/base/node.c 2008-10-06 11:18:44.000000000 -0700 +++ linux-2.6.27-rc8/drivers/base/node.c 2008-10-06 11:19:41.000000000 -0700 @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -225,6 +226,102 @@ return 0; } +#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE +#define page_initialized(page) (page->lru.next) + +static int get_nid_for_pfn(unsigned long pfn) +{ + struct page *page; + + if (!pfn_valid_within(pfn)) + return -1; + page = pfn_to_page(pfn); + if (!page_initialized(page)) + return -1; + return pfn_to_nid(pfn); +} + +/* register memory section under specified node if it spans that node */ +int register_mem_sect_under_node(struct memory_block *mem_blk, int nid) +{ + unsigned long pfn, sect_start_pfn, sect_end_pfn; + + if (!mem_blk) + return -EFAULT; + if (!node_online(nid)) + return 0; + sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index); + sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1; + for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) { + int page_nid; + + page_nid = get_nid_for_pfn(pfn); + if (page_nid < 0) + continue; + if (page_nid != nid) + continue; + return sysfs_create_link_nowarn(&node_devices[nid].sysdev.kobj, + &mem_blk->sysdev.kobj, + kobject_name(&mem_blk->sysdev.kobj)); + } + /* mem section does not span the specified node */ + return 0; +} + +/* unregister memory section under all nodes that it spans */ +int unregister_mem_sect_under_nodes(struct memory_block *mem_blk) +{ + nodemask_t unlinked_nodes; + unsigned long pfn, sect_start_pfn, sect_end_pfn; + + if (!mem_blk) + return -EFAULT; + nodes_clear(unlinked_nodes); + sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index); + sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1; + for (pfn = sect_start_pfn; pfn < sect_end_pfn; pfn++) { + unsigned int nid; + + nid = get_nid_for_pfn(pfn); + if (nid < 0) + continue; + if (!node_online(nid)) + continue; + if (node_test_and_set(nid, unlinked_nodes)) + continue; + sysfs_remove_link(&node_devices[nid].sysdev.kobj, + kobject_name(&mem_blk->sysdev.kobj)); + } + return 0; +} + +static int link_mem_sections(int nid) +{ + unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn; + unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages; + unsigned long pfn; + int err = 0; + + for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) { + unsigned long section_nr = pfn_to_section_nr(pfn); + struct mem_section *mem_sect; + struct memory_block *mem_blk; + int ret; + + if (!present_section_nr(section_nr)) + continue; + mem_sect = __nr_to_section(section_nr); + mem_blk = find_memory_block(mem_sect); + ret = register_mem_sect_under_node(mem_blk, nid); + if (!err) + err = ret; + } + return err; +} +#else +static int link_mem_sections(int nid) { return 0; } +#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */ + int register_one_node(int nid) { int error = 0; @@ -244,6 +341,9 @@ if (cpu_to_node(cpu) == nid) register_cpu_under_node(cpu, nid); } + + /* link memory sections under this node */ + error = link_mem_sections(nid); } return error; Index: linux-2.6.27-rc8/include/linux/memory.h =================================================================== --- linux-2.6.27-rc8.orig/include/linux/memory.h 2008-10-06 11:18:44.000000000 -0700 +++ linux-2.6.27-rc8/include/linux/memory.h 2008-10-06 11:19:41.000000000 -0700 @@ -79,14 +79,14 @@ #else extern int register_memory_notifier(struct notifier_block *nb); extern void unregister_memory_notifier(struct notifier_block *nb); -extern int register_new_memory(struct mem_section *); +extern int register_new_memory(int, struct mem_section *); extern int unregister_memory_section(struct mem_section *); extern int memory_dev_init(void); extern int remove_memory_block(unsigned long, struct mem_section *, int); extern int memory_notify(unsigned long val, void *v); +extern struct memory_block *find_memory_block(struct mem_section *); #define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION< max_pfn_mapped) max_pfn_mapped = last_mapped_pfn; - ret = __add_pages(zone, start_pfn, nr_pages); + ret = __add_pages(nid, zone, start_pfn, nr_pages); WARN_ON(1); return ret; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/