2010-07-12 15:27:24

by Nathan Fontenot

[permalink] [raw]
Subject: [PATCH 0/7] De-couple sysfs memory directories from memory sections

This set of patches de-couples the idea that there is a single
directory in sysfs for each memory section. The intent of the
patches is to reduce the number of sysfs directories created to
resolve a boot-time performance issue. On very large systems
boot time are getting very long (as seen on powerpc hardware)
due to the enormous number of sysfs directories being created.
On a system with 1 TB of memory we create ~63,000 directories.
For even larger systems boot times are being measured in hours.

This set of patches allows for each directory created in sysfs
to cover more than one memory section. The default behavior for
sysfs directory creation is the same, in that each directory
represents a single memory section. This update also adds to
new files to each sysfs memory directory. The file 'end_phys_index'
contains the physical id of the last memory section covered by
the directory. The file 'split' allows for splitting the
directory in two, with each new directory covering half as many
memory sections as the previous directory.

Thanks,

Nathan Fontenot


2010-07-12 15:42:11

by Nathan Fontenot

[permalink] [raw]
Subject: [PATCH 1/7] Split the memory_block structure

This patch splits the memory_block struct into a memory_block
struct to cover each sysfs directory and a new memory_block_section
struct for each memory section covered by the sysfs directory.

This also updates the routine handling memory_block creation
and manipulation to use these updated structures.

Signed -off-by: Nathan Fontenot <[email protected]>
---
drivers/base/memory.c | 228 +++++++++++++++++++++++++++++++++++--------------
include/linux/memory.h | 11 +-
2 files changed, 172 insertions(+), 67 deletions(-)

Index: linux-2.6/drivers/base/memory.c
===================================================================
--- linux-2.6.orig/drivers/base/memory.c 2010-07-08 11:27:21.000000000 -0500
+++ linux-2.6/drivers/base/memory.c 2010-07-09 14:23:09.000000000 -0500
@@ -28,6 +28,14 @@
#include <asm/uaccess.h>

#define MEMORY_CLASS_NAME "memory"
+#define MIN_MEMORY_BLOCK_SIZE (1 << SECTION_SIZE_BITS)
+
+static int sections_per_block;
+
+static inline int base_memory_block_id(int section_nr)
+{
+ return (section_nr / sections_per_block) * sections_per_block;
+}

static struct sysdev_class memory_sysdev_class = {
.name = MEMORY_CLASS_NAME,
@@ -94,10 +102,9 @@
}

static void
-unregister_memory(struct memory_block *memory, struct mem_section *section)
+unregister_memory(struct memory_block *memory)
{
BUG_ON(memory->sysdev.cls != &memory_sysdev_class);
- BUG_ON(memory->sysdev.id != __section_nr(section));

/* drop the ref. we got in remove_memory_block() */
kobject_put(&memory->sysdev.kobj);
@@ -123,13 +130,20 @@
static ssize_t show_mem_removable(struct sys_device *dev,
struct sysdev_attribute *attr, char *buf)
{
- unsigned long start_pfn;
- int ret;
- struct memory_block *mem =
- container_of(dev, struct memory_block, sysdev);
+ struct list_head *pos, *tmp;
+ struct memory_block *mem;
+ int ret = 1;
+
+ mem = container_of(dev, struct memory_block, sysdev);
+ list_for_each_safe(pos, tmp, &mem->sections) {
+ struct memory_block_section *mbs;
+ unsigned long start_pfn;
+
+ mbs = list_entry(pos, struct memory_block_section, next);
+ start_pfn = section_nr_to_pfn(mbs->phys_index);
+ ret &= is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
+ }

- start_pfn = section_nr_to_pfn(mem->phys_index);
- ret = is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
return sprintf(buf, "%d\n", ret);
}

@@ -182,16 +196,16 @@
* OK to have direct references to sparsemem variables in here.
*/
static int
-memory_block_action(struct memory_block *mem, unsigned long action)
+memory_block_action(struct memory_block_section *mbs, unsigned long action)
{
int i;
unsigned long psection;
unsigned long start_pfn, start_paddr;
struct page *first_page;
int ret;
- int old_state = mem->state;
ot-option-to-disable-memory-hotplug.patch
+ int old_state = mbs->state;

- psection = mem->phys_index;
+ psection = mbs->phys_index;
first_page = pfn_to_page(psection << PFN_SECTION_SHIFT);

/*
@@ -217,18 +231,18 @@
ret = online_pages(start_pfn, PAGES_PER_SECTION);
break;
case MEM_OFFLINE:
- mem->state = MEM_GOING_OFFLINE;
+ mbs->state = MEM_GOING_OFFLINE;
start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
ret = remove_memory(start_paddr,
PAGES_PER_SECTION << PAGE_SHIFT);
if (ret) {
- mem->state = old_state;
+ mbs->state = old_state;
break;
}
break;
default:
WARN(1, KERN_WARNING "%s(%p, %ld) unknown action: %ld\n",
- __func__, mem, action, action);
+ __func__, mbs, action, action);
ret = -EINVAL;
}

@@ -238,19 +252,40 @@
static int memory_block_change_state(struct memory_block *mem,
unsigned long to_state, unsigned long from_state_req)
{
+ struct memory_block_section *mbs;
+ struct list_head *pos;
int ret = 0;
+
mutex_lock(&mem->state_mutex);

- if (mem->state != from_state_req) {
- ret = -EINVAL;
- goto out;
+ list_for_each(pos, &mem->sections) {
+ mbs = list_entry(pos, struct memory_block_section, next);
+
+ if (mbs->state != from_state_req)
+ continue;
+
+ ret = memory_block_action(mbs, to_state);
+ if (ret)
+ break;
+ }
+
+ if (ret) {
+ list_for_each(pos, &mem->sections) {
+ mbs = list_entry(pos, struct memory_block_section,
+ next);
+
+ if (mbs->state == from_state_req)
+ continue;
+
+ if (memory_block_action(mbs, to_state))
+ printk(KERN_ERR "Could not re-enable memory "
+ "section %lx\n", mbs->phys_index);
+ }
}

- ret = memory_block_action(mem, to_state);
if (!ret)
mem->state = to_state;

-out:
mutex_unlock(&mem->state_mutex);
return ret;
}
@@ -260,20 +295,15 @@
struct sysdev_attribute *attr, const char *buf, size_t count)
{
struct memory_block *mem;
- unsigned int phys_section_nr;
int ret = -EINVAL;

mem = container_of(dev, struct memory_block, sysdev);
- phys_section_nr = mem->phys_index;
-
- if (!present_section_nr(phys_section_nr))
- goto out;

if (!strncmp(buf, "online", min((int)count, 6)))
ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
else if(!strncmp(buf, "offline", min((int)count, 7)))
ret = memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE);
-out:
+
if (ret)
return ret;
return count;
@@ -435,39 +465,6 @@
return 0;
}

-static int add_memory_block(int nid, struct mem_section *section,
- unsigned long state, enum mem_add_context context)
-{
- struct memory_block *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
- unsigned long start_pfn;
- int ret = 0;
-
- if (!mem)
- return -ENOMEM;
-
- mem->phys_index = __section_nr(section);
- mem->state = state;
- mutex_init(&mem->state_mutex);
- start_pfn = section_nr_to_pfn(mem->phys_index);
- mem->phys_device = arch_get_memory_phys_device(start_pfn);
-
- ret = register_memory(mem, section);
- if (!ret)
- ret = mem_create_simple_file(mem, phys_index);
- if (!ret)
- ret = mem_create_simple_file(mem, state);
- if (!ret)
- ret = mem_create_simple_file(mem, phys_device);
- if (!ret)
- ret = mem_create_simple_file(mem, removable);
- if (!ret) {
- if (context == HOTPLUG)
- ret = register_mem_sect_under_node(mem, nid);
- }
-
- return ret;
-}
-
/*
* For now, we have a linear search to go find the appropriate
* memory_block corresponding to a particular phys_index. If
@@ -482,12 +479,13 @@
struct sys_device *sysdev;
struct memory_block *mem;
char name[sizeof(MEMORY_CLASS_NAME) + 9 + 1];
+ int block_id = base_memory_block_id(__section_nr(section));

/*
* This only works because we know that section == sysdev->id
* slightly redundant with sysdev_register()
*/
- sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, __section_nr(section));
+ sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, block_id);

kobj = kset_find_obj(&memory_sysdev_class.kset, name);
if (!kobj)
@@ -498,19 +496,97 @@

return mem;
}
+static int add_mem_block_section(struct memory_block *mem,
+ int section_nr, unsigned long state)
+{
+ struct memory_block_section *mbs;
+
+ mbs = kzalloc(sizeof(*mbs), GFP_KERNEL);
+ if (!mbs)
+ return -ENOMEM;
+
+ mbs->phys_index = section_nr;
+ mbs->state = state;
+
+ list_add(&mbs->next, &mem->sections);
+ return 0;
+}
+
+static int add_memory_block(int nid, struct mem_section *section,
+ unsigned long state, enum mem_add_context context)
+{
+ struct memory_block *mem;
+ int ret = 0;
+
+ mem = find_memory_block(section);
+ if (!mem) {
+ unsigned long start_pfn;
+
+ mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+ if (!mem)
+ return -ENOMEM;
+
+ mem->state = state;
+ mutex_init(&mem->state_mutex);
+ start_pfn = section_nr_to_pfn(__section_nr(section));
+ mem->phys_device = arch_get_memory_phys_device(start_pfn);
+ INIT_LIST_HEAD(&mem->sections);
+
+ ret = register_memory(mem, section);
+ if (!ret)
+ ret = mem_create_simple_file(mem, phys_index);
+ if (!ret)
+ ret = mem_create_simple_file(mem, state);
+ if (!ret)
+ ret = mem_create_simple_file(mem, phys_device);
+ if (!ret)
+ ret = mem_create_simple_file(mem, removable);
+ if (!ret) {
+ if (context == HOTPLUG)
+ ret = register_mem_sect_under_node(mem, nid);
+ }
+ } else {
+ kobject_put(&mem->sysdev.kobj);
+ }
+
+ if (!ret)
+ ret = add_mem_block_section(mem, __section_nr(section), state);
+
+ return ret;
+}

int remove_memory_block(unsigned long node_id, struct mem_section *section,
int phys_device)
{
struct memory_block *mem;
+ struct memory_block_section *mbs;
+ struct list_head *pos, *tmp;
+ int section_nr = __section_nr(section);

mem = find_memory_block(section);
- unregister_mem_sect_under_nodes(mem);
- mem_remove_simple_file(mem, phys_index);
- mem_remove_simple_file(mem, state);
- mem_remove_simple_file(mem, phys_device);
- mem_remove_simple_file(mem, removable);
- unregister_memory(mem, section);
+ mutex_lock(&mem->state_mutex);
+
+ /* remove the specified section */
+ list_for_each_safe(pos, tmp, &mem->sections) {
+ mbs = list_entry(pos, struct memory_block_section, next);
+
+ if (mbs->phys_index == section_nr) {
+ list_del(&mbs->next);
+ kfree(mbs);
+ }
+ }
+
+ mutex_unlock(&mem->state_mutex);
+
+ if (list_empty(&mem->sections)) {
+ unregister_mem_sect_under_nodes(mem);
+ mem_remove_simple_file(mem, phys_index);
+ mem_remove_simple_file(mem, state);
+ mem_remove_simple_file(mem, phys_device);
+ mem_remove_simple_file(mem, removable);
+ unregister_memory(mem);
+ kfree(mem);
+ }

return 0;
}
@@ -532,6 +608,24 @@
return remove_memory_block(0, section, 0);
}

+u32 __weak memory_block_size(void)
+{
+ return MIN_MEMORY_BLOCK_SIZE;
+}
+
+static u32 get_memory_block_size(void)
+{
+ u32 blk_sz;
+
+ blk_sz = memory_block_size();
+
+ /* Validate blk_sz is a power of 2 and not less than section size */
+ if ((blk_sz & (blk_sz - 1)) || (blk_sz < MIN_MEMORY_BLOCK_SIZE))
+ blk_sz = MIN_MEMORY_BLOCK_SIZE;
+
+ return blk_sz;
+}
+
/*
* Initialize the sysfs support for memory devices...
*/
@@ -540,12 +634,16 @@
unsigned int i;
int ret;
int err;
+ int block_sz;

memory_sysdev_class.kset.uevent_ops = &memory_uevent_ops;
ret = sysdev_class_register(&memory_sysdev_class);
if (ret)
goto out;

+ block_sz = get_memory_block_size();
+ sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
+
/*
* Create entries for memory sections that were found
* during boot and have been initialized
Index: linux-2.6/include/linux/memory.h
===================================================================
--- linux-2.6.orig/include/linux/memory.h 2010-07-08 11:27:21.000000000 -0500
+++ linux-2.6/include/linux/memory.h 2010-07-09 14:22:44.000000000 -0500
@@ -19,9 +19,15 @@
#include <linux/node.h>
#include <linux/compiler.h>
#include <linux/mutex.h>
+#include <linux/list.h>

-struct memory_block {
+struct memory_block_section {
+ unsigned long state;
unsigned long phys_index;
+ struct list_head next;
+};
+
+struct memory_block {
unsigned long state;
/*
* This serializes all state change requests. It isn't
@@ -34,6 +40,7 @@
void *hw; /* optional pointer to fw/hw data */
int (*phys_callback)(struct memory_block *);
struct sys_device sysdev;
+ struct list_head sections;
};

int arch_get_memory_phys_device(unsigned long start_pfn);
@@ -113,7 +120,7 @@
extern int remove_memory_block(unsigned long, struct mem_section *, int);
extern int memory_notify(unsigned long val, void *v);
extern int memory_isolate_notify(unsigned long val, void *v);
-extern struct memory_block *find_memory_block(unsigned long);
+extern struct memory_block *find_memory_block(struct mem_section *);
extern int memory_is_hidden(struct mem_section *);
#define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION<<PAGE_SHIFT)
enum mem_add_context { BOOT, HOTPLUG };

2010-07-12 15:43:11

by Nathan Fontenot

[permalink] [raw]
Subject: [PATCH 2/7] Create the new 'end_phys_index' file

This patch adds a new 'end_phys_index' file to each memory sysfs
directory to report the physical index of the last memory section
coverd by the sysfs directory.

Signed -off-by: Nathan Fontenot <[email protected]>

---
drivers/base/memory.c | 14 +++++++++++++-
include/linux/memory.h | 3 +++
2 files changed, 16 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/base/memory.c
===================================================================
--- linux-2.6.orig/drivers/base/memory.c 2010-07-09 14:23:09.000000000 -0500
+++ linux-2.6/drivers/base/memory.c 2010-07-09 14:23:17.000000000 -0500
@@ -121,7 +121,15 @@
{
struct memory_block *mem =
container_of(dev, struct memory_block, sysdev);
- return sprintf(buf, "%08lx\n", mem->phys_index);
+ return sprintf(buf, "%08lx\n", mem->start_phys_index);
+}
+
+static ssize_t show_mem_end_phys_index(struct sys_device *dev,
+ struct sysdev_attribute *attr, char *buf)
+{
+ struct memory_block *mem =
+ container_of(dev, struct memory_block, sysdev);
+ return sprintf(buf, "%08lx\n", mem->end_phys_index);
}

/*
@@ -327,6 +335,7 @@
}

static SYSDEV_ATTR(phys_index, 0444, show_mem_phys_index, NULL);
+static SYSDEV_ATTR(end_phys_index, 0444, show_mem_end_phys_index, NULL);
static SYSDEV_ATTR(state, 0644, show_mem_state, store_mem_state);
static SYSDEV_ATTR(phys_device, 0444, show_phys_device, NULL);
static SYSDEV_ATTR(removable, 0444, show_mem_removable, NULL);
@@ -536,6 +545,8 @@
if (!ret)
ret = mem_create_simple_file(mem, phys_index);
if (!ret)
+ ret = mem_create_simple_file(mem, end_phys_index);
+ if (!ret)
ret = mem_create_simple_file(mem, state);
if (!ret)
ret = mem_create_simple_file(mem, phys_device);
@@ -581,6 +592,7 @@
if (list_empty(&mem->sections)) {
unregister_mem_sect_under_nodes(mem);
mem_remove_simple_file(mem, phys_index);
+ mem_remove_simple_file(mem, end_phys_index);
mem_remove_simple_file(mem, state);
mem_remove_simple_file(mem, phys_device);
mem_remove_simple_file(mem, removable);
Index: linux-2.6/include/linux/memory.h
===================================================================
--- linux-2.6.orig/include/linux/memory.h 2010-07-09 14:22:44.000000000 -0500
+++ linux-2.6/include/linux/memory.h 2010-07-09 14:23:17.000000000 -0500
@@ -29,6 +29,9 @@

struct memory_block {
unsigned long state;
+ unsigned long start_phys_index;
+ unsigned long end_phys_index;
+
/*
* This serializes all state change requests. It isn't
* held during creation because the control files are

2010-07-12 15:44:33

by Nathan Fontenot

[permalink] [raw]
Subject: [PATCH 3/7] Update the [register,unregister]_memory routines

This patch moves the register/unregister_memory routines to
avoid a forward declaration. It also moves the sysfs file
creation and deletion for each directory into the register/
unregister routines to avoid duplicating it with these updates.

Signed-off-by: Nathan Fontenot <[email protected]>
---
drivers/base/memory.c | 93 +++++++++++++++++++++++++-------------------------
1 file changed, 48 insertions(+), 45 deletions(-)

Index: linux-2.6/drivers/base/memory.c
===================================================================
--- linux-2.6.orig/drivers/base/memory.c 2010-07-09 14:23:17.000000000 -0500
+++ linux-2.6/drivers/base/memory.c 2010-07-09 14:23:20.000000000 -0500
@@ -87,31 +87,6 @@
EXPORT_SYMBOL(unregister_memory_isolate_notifier);

/*
- * register_memory - Setup a sysfs device for a memory block
- */
-static
-int register_memory(struct memory_block *memory, struct mem_section *section)
-{
- int error;
-
- memory->sysdev.cls = &memory_sysdev_class;
- memory->sysdev.id = __section_nr(section);
-
- error = sysdev_register(&memory->sysdev);
- return error;
-}
-
-static void
-unregister_memory(struct memory_block *memory)
-{
- BUG_ON(memory->sysdev.cls != &memory_sysdev_class);
-
- /* drop the ref. we got in remove_memory_block() */
- kobject_put(&memory->sysdev.kobj);
- sysdev_unregister(&memory->sysdev);
-}
-
-/*
* use this as the physical section index that this memsection
* uses.
*/
@@ -346,6 +321,53 @@
sysdev_remove_file(&mem->sysdev, &attr_##attr_name)

/*
+ * register_memory - Setup a sysfs device for a memory block
+ */
+static
+int register_memory(struct memory_block *memory, struct mem_section *section,
+ int nid, enum mem_add_context context)
+{
+ int ret;
+
+ memory->sysdev.cls = &memory_sysdev_class;
+ memory->sysdev.id = __section_nr(section);
+
+ ret = sysdev_register(&memory->sysdev);
+ if (!ret)
+ ret = mem_create_simple_file(memory, phys_index);
+ if (!ret)
+ ret = mem_create_simple_file(memory, end_phys_index);
+ if (!ret)
+ ret = mem_create_simple_file(memory, state);
+ if (!ret)
+ ret = mem_create_simple_file(memory, phys_device);
+ if (!ret)
+ ret = mem_create_simple_file(memory, removable);
+ if (!ret) {
+ if (context == HOTPLUG)
+ ret = register_mem_sect_under_node(memory, nid);
+ }
+
+ return ret;
+}
+
+static void
+unregister_memory(struct memory_block *memory)
+{
+ BUG_ON(memory->sysdev.cls != &memory_sysdev_class);
+
+ mem_remove_simple_file(memory, phys_index);
+ mem_remove_simple_file(memory, end_phys_index);
+ mem_remove_simple_file(memory, state);
+ mem_remove_simple_file(memory, phys_device);
+ mem_remove_simple_file(memory, removable);
+
+ /* drop the ref. we got in remove_memory_block() */
+ kobject_put(&memory->sysdev.kobj);
+ sysdev_unregister(&memory->sysdev);
+}
+
+/*
* Block size attribute stuff
*/
static ssize_t
@@ -541,21 +563,7 @@
mem->phys_device = arch_get_memory_phys_device(start_pfn);
INIT_LIST_HEAD(&mem->sections);

- ret = register_memory(mem, section);
- if (!ret)
- ret = mem_create_simple_file(mem, phys_index);
- if (!ret)
- ret = mem_create_simple_file(mem, end_phys_index);
- if (!ret)
- ret = mem_create_simple_file(mem, state);
- if (!ret)
- ret = mem_create_simple_file(mem, phys_device);
- if (!ret)
- ret = mem_create_simple_file(mem, removable);
- if (!ret) {
- if (context == HOTPLUG)
- ret = register_mem_sect_under_node(mem, nid);
- }
+ ret = register_memory(mem, section, nid, context);
} else {
kobject_put(&mem->sysdev.kobj);
}
@@ -591,11 +599,6 @@

if (list_empty(&mem->sections)) {
unregister_mem_sect_under_nodes(mem);
- mem_remove_simple_file(mem, phys_index);
- mem_remove_simple_file(mem, end_phys_index);
- mem_remove_simple_file(mem, state);
- mem_remove_simple_file(mem, phys_device);
- mem_remove_simple_file(mem, removable);
unregister_memory(mem);
kfree(mem);
}

2010-07-12 15:45:29

by Nathan Fontenot

[permalink] [raw]
Subject: [PATCH 4/7] Allow sysfs memory directories to be split

This patch introduces the new 'split' file in each memory sysfs
directory and the associated routines needed to handle splitting
a directory.

Signed-off-by; Nathan Fontenot <[email protected]>
---
drivers/base/memory.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 98 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/base/memory.c
===================================================================
--- linux-2.6.orig/drivers/base/memory.c 2010-07-09 14:23:20.000000000 -0500
+++ linux-2.6/drivers/base/memory.c 2010-07-09 14:38:09.000000000 -0500
@@ -32,6 +32,9 @@

static int sections_per_block;

+static int register_memory(struct memory_block *, struct mem_section *,
+ int, enum mem_add_context);
+
static inline int base_memory_block_id(int section_nr)
{
return (section_nr / sections_per_block) * sections_per_block;
@@ -309,11 +312,100 @@
return sprintf(buf, "%d\n", mem->phys_device);
}

+static void update_memory_block_phys_indexes(struct memory_block *mem)
+{
+ struct list_head *pos;
+ struct memory_block_section *mbs;
+ unsigned long min_index = 0xffffffff;
+ unsigned long max_index = 0;
+
+ list_for_each(pos, &mem->sections) {
+ mbs = list_entry(pos, struct memory_block_section, next);
+
+ if (mbs->phys_index < min_index)
+ min_index = mbs->phys_index;
+
+ if (mbs->phys_index > max_index)
+ max_index = mbs->phys_index;
+ }
+
+ mem->start_phys_index = min_index;
+ mem->end_phys_index = max_index;
+}
+
+static ssize_t
+store_mem_split_block(struct sys_device *dev, struct sysdev_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct memory_block *mem, *new_mem_blk;
+ struct memory_block_section *mbs;
+ struct list_head *pos, *tmp;
+ struct mem_section *section;
+ int min_scn_nr = 0;
+ int max_scn_nr = 0;
+ int total_scns = 0;
+ int new_blk_min, new_blk_total;
+ int ret = -EINVAL;
+
+ mem = container_of(dev, struct memory_block, sysdev);
+
+ if (list_is_singular(&mem->sections))
+ return -EINVAL;
+
+ mutex_lock(&mem->state_mutex);
+
+ list_for_each(pos, &mem->sections) {
+ mbs = list_entry(pos, struct memory_block_section, next);
+
+ total_scns++;
+
+ if (min_scn_nr > mbs->phys_index)
+ min_scn_nr = mbs->phys_index;
+
+ if (max_scn_nr < mbs->phys_index)
+ max_scn_nr = mbs->phys_index;
+ }
+
+ new_mem_blk = kzalloc(sizeof(*new_mem_blk), GFP_KERNEL);
+ if (!new_mem_blk)
+ return -ENOMEM;
+
+ mutex_init(&new_mem_blk->state_mutex);
+ INIT_LIST_HEAD(&new_mem_blk->sections);
+ new_mem_blk->state = mem->state;
+
+ mutex_lock(&new_mem_blk->state_mutex);
+
+ new_blk_total = total_scns / 2;
+ new_blk_min = max_scn_nr - new_blk_total + 1;
+
+ section = __nr_to_section(new_blk_min);
+ ret = register_memory(new_mem_blk, section, 0, HOTPLUG);
+
+ list_for_each_safe(pos, tmp, &mem->sections) {
+ unsigned long scn_nr;
+
+ mbs = list_entry(pos, struct memory_block_section, next);
+ scn_nr = mbs->phys_index;
+
+ if ((scn_nr >= new_blk_min) && (scn_nr <= max_scn_nr))
+ list_move(&mbs->next, &new_mem_blk->sections);
+ }
+
+ update_memory_block_phys_indexes(mem);
+ update_memory_block_phys_indexes(new_mem_blk);
+
+ mutex_unlock(&new_mem_blk->state_mutex);
+ mutex_unlock(&mem->state_mutex);
+ return count;
+}
+
static SYSDEV_ATTR(phys_index, 0444, show_mem_phys_index, NULL);
static SYSDEV_ATTR(end_phys_index, 0444, show_mem_end_phys_index, NULL);
static SYSDEV_ATTR(state, 0644, show_mem_state, store_mem_state);
static SYSDEV_ATTR(phys_device, 0444, show_phys_device, NULL);
static SYSDEV_ATTR(removable, 0444, show_mem_removable, NULL);
+static SYSDEV_ATTR(split, 0200, NULL, store_mem_split_block);

#define mem_create_simple_file(mem, attr_name) \
sysdev_create_file(&mem->sysdev, &attr_##attr_name)
@@ -343,6 +435,8 @@
ret = mem_create_simple_file(memory, phys_device);
if (!ret)
ret = mem_create_simple_file(memory, removable);
+ if (!ret)
+ ret = mem_create_simple_file(memory, split);
if (!ret) {
if (context == HOTPLUG)
ret = register_mem_sect_under_node(memory, nid);
@@ -361,6 +455,7 @@
mem_remove_simple_file(memory, state);
mem_remove_simple_file(memory, phys_device);
mem_remove_simple_file(memory, removable);
+ mem_remove_simple_file(memory, split);

/* drop the ref. we got in remove_memory_block() */
kobject_put(&memory->sysdev.kobj);
@@ -568,8 +663,10 @@
kobject_put(&mem->sysdev.kobj);
}

- if (!ret)
+ if (!ret) {
ret = add_mem_block_section(mem, __section_nr(section), state);
+ update_memory_block_phys_indexes(mem);
+ }

return ret;
}

2010-07-12 15:46:27

by Nathan Fontenot

[permalink] [raw]
Subject: [PATCH 5/7] update the mutex name in the memory_block struct

This patch updates the name of the memory_block mutex
since it is now used for more than just gating changes
to the status of the memory section coveerd by the memory
sysfs directory.

Signed-off-by: Nathan Fontenot <[email protected]>
---
drivers/base/memory.c | 20 ++++++++++----------
include/linux/memory.h | 9 +--------
2 files changed, 11 insertions(+), 18 deletions(-)

Index: linux-2.6/drivers/base/memory.c
===================================================================
--- linux-2.6.orig/drivers/base/memory.c 2010-07-09 14:38:09.000000000 -0500
+++ linux-2.6/drivers/base/memory.c 2010-07-09 14:38:18.000000000 -0500
@@ -242,7 +242,7 @@
struct list_head *pos;
int ret = 0;

- mutex_lock(&mem->state_mutex);
+ mutex_lock(&mem->mutex);

list_for_each(pos, &mem->sections) {
mbs = list_entry(pos, struct memory_block_section, next);
@@ -272,7 +272,7 @@
if (!ret)
mem->state = to_state;

- mutex_unlock(&mem->state_mutex);
+ mutex_unlock(&mem->mutex);
return ret;
}

@@ -352,7 +352,7 @@
if (list_is_singular(&mem->sections))
return -EINVAL;

- mutex_lock(&mem->state_mutex);
+ mutex_lock(&mem->mutex);

list_for_each(pos, &mem->sections) {
mbs = list_entry(pos, struct memory_block_section, next);
@@ -370,11 +370,11 @@
if (!new_mem_blk)
return -ENOMEM;

- mutex_init(&new_mem_blk->state_mutex);
+ mutex_init(&new_mem_blk->mutex);
INIT_LIST_HEAD(&new_mem_blk->sections);
new_mem_blk->state = mem->state;

- mutex_lock(&new_mem_blk->state_mutex);
+ mutex_lock(&new_mem_blk->mutex);

new_blk_total = total_scns / 2;
new_blk_min = max_scn_nr - new_blk_total + 1;
@@ -395,8 +395,8 @@
update_memory_block_phys_indexes(mem);
update_memory_block_phys_indexes(new_mem_blk);

- mutex_unlock(&new_mem_blk->state_mutex);
- mutex_unlock(&mem->state_mutex);
+ mutex_unlock(&new_mem_blk->mutex);
+ mutex_unlock(&mem->mutex);
return count;
}

@@ -653,7 +653,7 @@
return -ENOMEM;

mem->state = state;
- mutex_init(&mem->state_mutex);
+ mutex_init(&mem->mutex);
start_pfn = section_nr_to_pfn(__section_nr(section));
mem->phys_device = arch_get_memory_phys_device(start_pfn);
INIT_LIST_HEAD(&mem->sections);
@@ -680,7 +680,7 @@
int section_nr = __section_nr(section);

mem = find_memory_block(section);
- mutex_lock(&mem->state_mutex);
+ mutex_lock(&mem->mutex);

/* remove the specified section */
list_for_each_safe(pos, tmp, &mem->sections) {
@@ -692,7 +692,7 @@
}
}

- mutex_unlock(&mem->state_mutex);
+ mutex_unlock(&mem->mutex);

if (list_empty(&mem->sections)) {
unregister_mem_sect_under_nodes(mem);
Index: linux-2.6/include/linux/memory.h
===================================================================
--- linux-2.6.orig/include/linux/memory.h 2010-07-09 14:36:54.000000000 -0500
+++ linux-2.6/include/linux/memory.h 2010-07-09 14:38:18.000000000 -0500
@@ -31,14 +31,7 @@
unsigned long state;
unsigned long start_phys_index;
unsigned long end_phys_index;
-
- /*
- * This serializes all state change requests. It isn't
- * held during creation because the control files are
- * created long after the critical areas during
- * initialization.
- */
- struct mutex state_mutex;
+ struct mutex mutex;
int phys_device; /* to which fru does this belong? */
void *hw; /* optional pointer to fw/hw data */
int (*phys_callback)(struct memory_block *);

2010-07-12 15:47:15

by Nathan Fontenot

[permalink] [raw]
Subject: [PATCH 6/7] Update sysfs node routines for new sysfs memory directories

This patch updates the node sysfs directory routines that create
links to the memory sections under each node. This update makes
the node code aware that a memory sysfs directory can cover
multiple memory sections.

Signed-off-by: Nathan Fontenot <[email protected]>

---
drivers/base/node.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

Index: linux-2.6/drivers/base/node.c
===================================================================
--- linux-2.6.orig/drivers/base/node.c 2010-07-09 14:36:53.000000000 -0500
+++ linux-2.6/drivers/base/node.c 2010-07-09 14:38:22.000000000 -0500
@@ -346,8 +346,10 @@
return -EFAULT;
if (!node_online(nid))
return 0;
- sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
- sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
+
+ sect_start_pfn = section_nr_to_pfn(mem_blk->start_phys_index);
+ sect_end_pfn = section_nr_to_pfn(mem_blk->end_phys_index);
+ sect_end_pfn += PAGES_PER_SECTION - 1;
for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
int page_nid;

@@ -383,8 +385,10 @@
if (!unlinked_nodes)
return -ENOMEM;
nodes_clear(*unlinked_nodes);
- sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
- sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
+
+ sect_start_pfn = section_nr_to_pfn(mem_blk->start_phys_index);
+ sect_end_pfn = section_nr_to_pfn(mem_blk->end_phys_index);
+ sect_end_pfn += PAGES_PER_SECTION - 1;
for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
int nid;

2010-07-12 15:48:43

by Nathan Fontenot

[permalink] [raw]
Subject: [PATCH 7/7] Enable multiple memory sections per sysfs memory directory for powerpc/pseries

This patch updates the powerpc/pseries code to initialize
the memory sysfs directory creation to create sysfs directories
that each cover an LMB's worth of memory.

Signed-off-by; Nathan Fontenot <[email protected]>
---
arch/powerpc/platforms/pseries/hotplug-memory.c | 66 +++++++++++++++++++-----
1 file changed, 53 insertions(+), 13 deletions(-)

Index: linux-2.6/arch/powerpc/platforms/pseries/hotplug-memory.c
===================================================================
--- linux-2.6.orig/arch/powerpc/platforms/pseries/hotplug-memory.c 2010-07-09 14:36:52.000000000 -0500
+++ linux-2.6/arch/powerpc/platforms/pseries/hotplug-memory.c 2010-07-09 14:38:25.000000000 -0500
@@ -17,6 +17,54 @@
#include <asm/pSeries_reconfig.h>
#include <asm/sparsemem.h>

+static u32 get_lmb_size(void)
+{
+ struct device_node *np;
+ unsigned int lmb_size = 0;
+
+ np = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
+ if (np) {
+ const unsigned int *size;
+
+ size = of_get_property(np, "ibm,lmb-size", NULL);
+ lmb_size = size ? *size : 0;
+
+ of_node_put(np);
+ } else {
+ const unsigned int *regs;
+
+ np = of_find_node_by_path("/memory@0");
+ if (np) {
+ regs = of_get_property(np, "reg", NULL);
+ lmb_size = regs ? regs[3] : 0;
+ of_node_put(np);
+ }
+
+ if (lmb_size) {
+ /* We now know the size of memory@0, use this to find
+ * the first lmb and get its size.
+ */
+ char buf[64];
+
+ sprintf(buf, "/memory@%x", lmb_size);
+ np = of_find_node_by_path(buf);
+ if (np) {
+ regs = of_get_property(np, "reg", NULL);
+ lmb_size = regs ? regs[3] : 0;
+ of_node_put(np);
+ } else
+ lmb_size = 0;
+ }
+ }
+
+ return lmb_size;
+}
+
+u32 memory_block_size(void)
+{
+ return get_lmb_size();
+}
+
static int pseries_remove_lmb(unsigned long base, unsigned int lmb_size)
{
unsigned long start, start_pfn;
@@ -127,30 +175,22 @@

static int pseries_drconf_memory(unsigned long *base, unsigned int action)
{
- struct device_node *np;
- const unsigned long *lmb_size;
+ unsigned long lmb_size;
int rc;

- np = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
- if (!np)
+ lmb_size = get_lmb_size();
+ if (!lmb_size)
return -EINVAL;

- lmb_size = of_get_property(np, "ibm,lmb-size", NULL);
- if (!lmb_size) {
- of_node_put(np);
- return -EINVAL;
- }
-
if (action == PSERIES_DRCONF_MEM_ADD) {
- rc = lmb_add(*base, *lmb_size);
+ rc = lmb_add(*base, lmb_size);
rc = (rc < 0) ? -EINVAL : 0;
} else if (action == PSERIES_DRCONF_MEM_REMOVE) {
- rc = pseries_remove_lmb(*base, *lmb_size);
+ rc = pseries_remove_lmb(*base, lmb_size);
} else {
rc = -EINVAL;
}

- of_node_put(np);
return rc;
}

2010-07-12 19:50:03

by Bodo Eggert

[permalink] [raw]
Subject: Re: [PATCH 0/7] De-couple sysfs memory directories from memory sections

Nathan Fontenot <[email protected]> wrote:

> The file 'split' allows for splitting the
> directory in two, with each new directory covering half as many
> memory sections as the previous directory.

Just some random thoughts:

1) Why is it needed/helpful?
2) If it is needed, why not take an int to split after n entries?

2010-07-13 02:28:21

by Nathan Fontenot

[permalink] [raw]
Subject: Re: [PATCH 0/7] De-couple sysfs memory directories from memory sections

On 07/12/2010 02:30 PM, Bodo Eggert wrote:
> Nathan Fontenot <[email protected]> wrote:
>
>> The file 'split' allows for splitting the
>> directory in two, with each new directory covering half as many
>> memory sections as the previous directory.
>
> Just some random thoughts:
> 1) Why is it needed/helpful?

This is needed if someone needed to perform an action (add/remove) a
single memory section. The 'split' option allows users to isolate
a memory section so these operations could be performed.

> 2) If it is needed, why not take an int to split after n entries?

The idea of being able to split a directory came out of a previous
discussion on how to resolve the issue this patch set is trying to solve.
I included the split functionality in this patch set since it
was suggested. I will leave the decision of whether or not this
functionality is needed up to the community.

-Nathan

2010-07-13 06:23:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/7] Split the memory_block structure


plz cc linux-mm in the next time...
And please incudes updates for Documentation/memory-hotplug.txt.


On Mon, 12 Jul 2010 10:42:06 -0500
Nathan Fontenot <[email protected]> wrote:

> This patch splits the memory_block struct into a memory_block
> struct to cover each sysfs directory and a new memory_block_section
> struct for each memory section covered by the sysfs directory.
>
> This also updates the routine handling memory_block creation
> and manipulation to use these updated structures.
>

Could you clarify the number of memory_block_section per memory_block ?


> Signed -off-by: Nathan Fontenot <[email protected]>
> ---
> drivers/base/memory.c | 228 +++++++++++++++++++++++++++++++++++--------------
> include/linux/memory.h | 11 +-
> 2 files changed, 172 insertions(+), 67 deletions(-)
>
> Index: linux-2.6/drivers/base/memory.c
> ===================================================================
> --- linux-2.6.orig/drivers/base/memory.c 2010-07-08 11:27:21.000000000 -0500
> +++ linux-2.6/drivers/base/memory.c 2010-07-09 14:23:09.000000000 -0500
> @@ -28,6 +28,14 @@
> #include <asm/uaccess.h>
>
> #define MEMORY_CLASS_NAME "memory"
> +#define MIN_MEMORY_BLOCK_SIZE (1 << SECTION_SIZE_BITS)
> +
> +static int sections_per_block;
> +
some default value, plz. Does this can be determined only by .config ?


> +static inline int base_memory_block_id(int section_nr)
> +{
> + return (section_nr / sections_per_block) * sections_per_block;
> +}
>
> static struct sysdev_class memory_sysdev_class = {
> .name = MEMORY_CLASS_NAME,
> @@ -94,10 +102,9 @@
> }
>
> static void
> -unregister_memory(struct memory_block *memory, struct mem_section *section)
> +unregister_memory(struct memory_block *memory)
> {
> BUG_ON(memory->sysdev.cls != &memory_sysdev_class);
> - BUG_ON(memory->sysdev.id != __section_nr(section));
>
> /* drop the ref. we got in remove_memory_block() */
> kobject_put(&memory->sysdev.kobj);
> @@ -123,13 +130,20 @@
> static ssize_t show_mem_removable(struct sys_device *dev,
> struct sysdev_attribute *attr, char *buf)
> {
> - unsigned long start_pfn;
> - int ret;
> - struct memory_block *mem =
> - container_of(dev, struct memory_block, sysdev);
> + struct list_head *pos, *tmp;
> + struct memory_block *mem;
> + int ret = 1;
> +
> + mem = container_of(dev, struct memory_block, sysdev);
> + list_for_each_safe(pos, tmp, &mem->sections) {
> + struct memory_block_section *mbs;
> + unsigned long start_pfn;
> +
> + mbs = list_entry(pos, struct memory_block_section, next);

list_for_each_entry ?



> + start_pfn = section_nr_to_pfn(mbs->phys_index);
> + ret &= is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
> + }

Hmm, them, only when the whole memory block is removable, it's shown as
removable. Right ?
Does it meets ppc guy's requirements ?

>
> - start_pfn = section_nr_to_pfn(mem->phys_index);
> - ret = is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
> return sprintf(buf, "%d\n", ret);
> }

Hmm...can't you print removable information as bitmap, here ?
overkill ?


>
> @@ -182,16 +196,16 @@
> * OK to have direct references to sparsemem variables in here.
> */
> static int
> -memory_block_action(struct memory_block *mem, unsigned long action)
> +memory_block_action(struct memory_block_section *mbs, unsigned long action)
> {
> int i;
> unsigned long psection;
> unsigned long start_pfn, start_paddr;
> struct page *first_page;
> int ret;
> - int old_state = mem->state;
> ot-option-to-disable-memory-hotplug.patch
> + int old_state = mbs->state;

Where is this noise from ?

>
> - psection = mem->phys_index;
> + psection = mbs->phys_index;
> first_page = pfn_to_page(psection << PFN_SECTION_SHIFT);
>
> /*
> @@ -217,18 +231,18 @@
> ret = online_pages(start_pfn, PAGES_PER_SECTION);
> break;
> case MEM_OFFLINE:
> - mem->state = MEM_GOING_OFFLINE;
> + mbs->state = MEM_GOING_OFFLINE;
> start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
> ret = remove_memory(start_paddr,
> PAGES_PER_SECTION << PAGE_SHIFT);
> if (ret) {
> - mem->state = old_state;
> + mbs->state = old_state;
> break;
> }
> break;
> default:
> WARN(1, KERN_WARNING "%s(%p, %ld) unknown action: %ld\n",
> - __func__, mem, action, action);
> + __func__, mbs, action, action);
> ret = -EINVAL;
> }
>
> @@ -238,19 +252,40 @@
> static int memory_block_change_state(struct memory_block *mem,
> unsigned long to_state, unsigned long from_state_req)
> {
> + struct memory_block_section *mbs;
> + struct list_head *pos;
> int ret = 0;
> +
> mutex_lock(&mem->state_mutex);
>
> - if (mem->state != from_state_req) {
> - ret = -EINVAL;
> - goto out;
> + list_for_each(pos, &mem->sections) {
> + mbs = list_entry(pos, struct memory_block_section, next);
> +

list_for_each_entry() ?

> + if (mbs->state != from_state_req)
> + continue;
> +
> + ret = memory_block_action(mbs, to_state);
> + if (ret)
> + break;
> + }

Then, all actions will be affect all memory sections under memory block ?
(Hmm..maybe have to see following patches ?)


> +
> + if (ret) {
> + list_for_each(pos, &mem->sections) {
> + mbs = list_entry(pos, struct memory_block_section,
> + next);
> +
list_for_each_entry() ?

> + if (mbs->state == from_state_req)
> + continue;
> +
> + if (memory_block_action(mbs, to_state))
> + printk(KERN_ERR "Could not re-enable memory "
> + "section %lx\n", mbs->phys_index);
> + }
> }
>
> - ret = memory_block_action(mem, to_state);
> if (!ret)
> mem->state = to_state;
>
> -out:
> mutex_unlock(&mem->state_mutex);
> return ret;
> }
> @@ -260,20 +295,15 @@
> struct sysdev_attribute *attr, const char *buf, size_t count)
> {

Hmm, store_mem_state() ? What diff option are you using ?


> struct memory_block *mem;
> - unsigned int phys_section_nr;
> int ret = -EINVAL;
>
> mem = container_of(dev, struct memory_block, sysdev);
> - phys_section_nr = mem->phys_index;
> -
> - if (!present_section_nr(phys_section_nr))
> - goto out;
>
> if (!strncmp(buf, "online", min((int)count, 6)))
> ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
> else if(!strncmp(buf, "offline", min((int)count, 7)))
> ret = memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE);
> -out:
> +
> if (ret)
> return ret;
> return count;
> @@ -435,39 +465,6 @@
> return 0;
> }
>
> -static int add_memory_block(int nid, struct mem_section *section,
> - unsigned long state, enum mem_add_context context)
> -{
> - struct memory_block *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> - unsigned long start_pfn;
> - int ret = 0;
> -
> - if (!mem)
> - return -ENOMEM;
> -
> - mem->phys_index = __section_nr(section);
> - mem->state = state;
> - mutex_init(&mem->state_mutex);
> - start_pfn = section_nr_to_pfn(mem->phys_index);
> - mem->phys_device = arch_get_memory_phys_device(start_pfn);
> -
> - ret = register_memory(mem, section);
> - if (!ret)
> - ret = mem_create_simple_file(mem, phys_index);
> - if (!ret)
> - ret = mem_create_simple_file(mem, state);
> - if (!ret)
> - ret = mem_create_simple_file(mem, phys_device);
> - if (!ret)
> - ret = mem_create_simple_file(mem, removable);
> - if (!ret) {
> - if (context == HOTPLUG)
> - ret = register_mem_sect_under_node(mem, nid);
> - }
> -
> - return ret;
> -}
> -


please divide clean-up and logic-change patches into their own..


> /*
> * For now, we have a linear search to go find the appropriate
> * memory_block corresponding to a particular phys_index. If
> @@ -482,12 +479,13 @@
> struct sys_device *sysdev;
> struct memory_block *mem;
> char name[sizeof(MEMORY_CLASS_NAME) + 9 + 1];
> + int block_id = base_memory_block_id(__section_nr(section));
>
> /*
> * This only works because we know that section == sysdev->id
> * slightly redundant with sysdev_register()
> */
> - sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, __section_nr(section));
> + sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, block_id);

Hmm. Then, the user has to calculate block-id in addtion to section-id.
Can't we use memory block name as memory%d-%d(start-end) ?



>
> kobj = kset_find_obj(&memory_sysdev_class.kset, name);
> if (!kobj)
> @@ -498,19 +496,97 @@
>
> return mem;
> }
> +static int add_mem_block_section(struct memory_block *mem,
> + int section_nr, unsigned long state)
> +{
> + struct memory_block_section *mbs;
> +
> + mbs = kzalloc(sizeof(*mbs), GFP_KERNEL);
> + if (!mbs)
> + return -ENOMEM;
> +
> + mbs->phys_index = section_nr;
> + mbs->state = state;
> +
> + list_add(&mbs->next, &mem->sections);
> + return 0;
> +}
> +
> +static int add_memory_block(int nid, struct mem_section *section,
> + unsigned long state, enum mem_add_context context)
> +{
> + struct memory_block *mem;
> + int ret = 0;
> +
> + mem = find_memory_block(section);

I guess you need to add changes to find_memory_block. section-ID != block-ID.


> + if (!mem) {
> + unsigned long start_pfn;
> +
> + mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> + if (!mem)
> + return -ENOMEM;
> +
> + mem->state = state;
> + mutex_init(&mem->state_mutex);
> + start_pfn = section_nr_to_pfn(__section_nr(section));
> + mem->phys_device = arch_get_memory_phys_device(start_pfn);
> + INIT_LIST_HEAD(&mem->sections);

I'm not sure this phys_device is properly set in any arch...but this changes in
granule will not affect ?

> +
> + ret = register_memory(mem, section);
> + if (!ret)
> + ret = mem_create_simple_file(mem, phys_index);
> + if (!ret)
> + ret = mem_create_simple_file(mem, state);
> + if (!ret)
> + ret = mem_create_simple_file(mem, phys_device);
> + if (!ret)
> + ret = mem_create_simple_file(mem, removable);
> + if (!ret) {
> + if (context == HOTPLUG)
> + ret = register_mem_sect_under_node(mem, nid);
> + }
> + } else {
> + kobject_put(&mem->sysdev.kobj);
> + }
> +
> + if (!ret)
> + ret = add_mem_block_section(mem, __section_nr(section), state);
> +
> + return ret;
> +}





>
> int remove_memory_block(unsigned long node_id, struct mem_section *section,
> int phys_device)
> {
> struct memory_block *mem;
> + struct memory_block_section *mbs;
> + struct list_head *pos, *tmp;
> + int section_nr = __section_nr(section);
>
> mem = find_memory_block(section);

ditto.

> - unregister_mem_sect_under_nodes(mem);
> - mem_remove_simple_file(mem, phys_index);
> - mem_remove_simple_file(mem, state);
> - mem_remove_simple_file(mem, phys_device);
> - mem_remove_simple_file(mem, removable);
> - unregister_memory(mem, section);
> + mutex_lock(&mem->state_mutex);
> +
> + /* remove the specified section */
> + list_for_each_safe(pos, tmp, &mem->sections) {
> + mbs = list_entry(pos, struct memory_block_section, next);
> +
list_for_each_entry_safe ?

> + if (mbs->phys_index == section_nr) {
> + list_del(&mbs->next);
> + kfree(mbs);
> + }
> + }
> +
> + mutex_unlock(&mem->state_mutex);
> +
> + if (list_empty(&mem->sections)) {
> + unregister_mem_sect_under_nodes(mem);
> + mem_remove_simple_file(mem, phys_index);
> + mem_remove_simple_file(mem, state);
> + mem_remove_simple_file(mem, phys_device);
> + mem_remove_simple_file(mem, removable);
> + unregister_memory(mem);
> + kfree(mem);
> + }
>
> return 0;
> }
> @@ -532,6 +608,24 @@
> return remove_memory_block(0, section, 0);
> }
>
> +u32 __weak memory_block_size(void)
> +{
> + return MIN_MEMORY_BLOCK_SIZE;
> +}
> +
> +static u32 get_memory_block_size(void)
> +{
> + u32 blk_sz;
> +
> + blk_sz = memory_block_size();
> +
> + /* Validate blk_sz is a power of 2 and not less than section size */
> + if ((blk_sz & (blk_sz - 1)) || (blk_sz < MIN_MEMORY_BLOCK_SIZE))
> + blk_sz = MIN_MEMORY_BLOCK_SIZE;
> +
> + return blk_sz;
> +}
> +
> /*
> * Initialize the sysfs support for memory devices...
> */
> @@ -540,12 +634,16 @@
> unsigned int i;
> int ret;
> int err;
> + int block_sz;
>
> memory_sysdev_class.kset.uevent_ops = &memory_uevent_ops;
> ret = sysdev_class_register(&memory_sysdev_class);
> if (ret)
> goto out;
>
> + block_sz = get_memory_block_size();
> + sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
> +
> /*
> * Create entries for memory sections that were found
> * during boot and have been initialized
> Index: linux-2.6/include/linux/memory.h
> ===================================================================
> --- linux-2.6.orig/include/linux/memory.h 2010-07-08 11:27:21.000000000 -0500
> +++ linux-2.6/include/linux/memory.h 2010-07-09 14:22:44.000000000 -0500
> @@ -19,9 +19,15 @@
> #include <linux/node.h>
> #include <linux/compiler.h>
> #include <linux/mutex.h>
> +#include <linux/list.h>
>
> -struct memory_block {
> +struct memory_block_section {
> + unsigned long state;
> unsigned long phys_index;
> + struct list_head next;
> +};
> +
> +struct memory_block {
> unsigned long state;
> /*
> * This serializes all state change requests. It isn't
> @@ -34,6 +40,7 @@
> void *hw; /* optional pointer to fw/hw data */
> int (*phys_callback)(struct memory_block *);
> struct sys_device sysdev;
> + struct list_head sections;
> };
>
> int arch_get_memory_phys_device(unsigned long start_pfn);
> @@ -113,7 +120,7 @@
> extern int remove_memory_block(unsigned long, struct mem_section *, int);
> extern int memory_notify(unsigned long val, void *v);
> extern int memory_isolate_notify(unsigned long val, void *v);
> -extern struct memory_block *find_memory_block(unsigned long);
> +extern struct memory_block *find_memory_block(struct mem_section *);
> extern int memory_is_hidden(struct mem_section *);
> #define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION<<PAGE_SHIFT)
> enum mem_add_context { BOOT, HOTPLUG };
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2010-07-13 06:25:26

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/7] Update the [register,unregister]_memory routines

On Mon, 12 Jul 2010 10:44:10 -0500
Nathan Fontenot <[email protected]> wrote:

> This patch moves the register/unregister_memory routines to
> avoid a forward declaration. It also moves the sysfs file
> creation and deletion for each directory into the register/
> unregister routines to avoid duplicating it with these updates.
>
> Signed-off-by: Nathan Fontenot <[email protected]>
> ---
> drivers/base/memory.c | 93 +++++++++++++++++++++++++-------------------------
> 1 file changed, 48 insertions(+), 45 deletions(-)
>
> Index: linux-2.6/drivers/base/memory.c
> ===================================================================
> --- linux-2.6.orig/drivers/base/memory.c 2010-07-09 14:23:17.000000000 -0500
> +++ linux-2.6/drivers/base/memory.c 2010-07-09 14:23:20.000000000 -0500
> @@ -87,31 +87,6 @@
> EXPORT_SYMBOL(unregister_memory_isolate_notifier);
>
> /*
> - * register_memory - Setup a sysfs device for a memory block
> - */
> -static
> -int register_memory(struct memory_block *memory, struct mem_section *section)
> -{
> - int error;
> -
> - memory->sysdev.cls = &memory_sysdev_class;
> - memory->sysdev.id = __section_nr(section);
> -
> - error = sysdev_register(&memory->sysdev);
> - return error;
> -}
> -
> -static void
> -unregister_memory(struct memory_block *memory)
> -{
> - BUG_ON(memory->sysdev.cls != &memory_sysdev_class);
> -
> - /* drop the ref. we got in remove_memory_block() */
> - kobject_put(&memory->sysdev.kobj);
> - sysdev_unregister(&memory->sysdev);
> -}
> -
> -/*
> * use this as the physical section index that this memsection
> * uses.
> */
> @@ -346,6 +321,53 @@
> sysdev_remove_file(&mem->sysdev, &attr_##attr_name)
>
> /*
> + * register_memory - Setup a sysfs device for a memory block
> + */
> +static
> +int register_memory(struct memory_block *memory, struct mem_section *section,
> + int nid, enum mem_add_context context)
> +{
> + int ret;
> +
> + memory->sysdev.cls = &memory_sysdev_class;
> + memory->sysdev.id = __section_nr(section);
> +
Why not block-ID but section-ID ?

-Kame

2010-07-13 06:33:36

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 4/7] Allow sysfs memory directories to be split

On Mon, 12 Jul 2010 10:45:25 -0500
Nathan Fontenot <[email protected]> wrote:

> This patch introduces the new 'split' file in each memory sysfs
> directory and the associated routines needed to handle splitting
> a directory.
>
> Signed-off-by; Nathan Fontenot <[email protected]>
> ---

pleae check diff option...


> drivers/base/memory.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 98 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/drivers/base/memory.c
> ===================================================================
> --- linux-2.6.orig/drivers/base/memory.c 2010-07-09 14:23:20.000000000 -0500
> +++ linux-2.6/drivers/base/memory.c 2010-07-09 14:38:09.000000000 -0500
> @@ -32,6 +32,9 @@
>
> static int sections_per_block;
>
> +static int register_memory(struct memory_block *, struct mem_section *,
> + int, enum mem_add_context);
> +
> static inline int base_memory_block_id(int section_nr)
> {
> return (section_nr / sections_per_block) * sections_per_block;
> @@ -309,11 +312,100 @@
> return sprintf(buf, "%d\n", mem->phys_device);
> }
>
> +static void update_memory_block_phys_indexes(struct memory_block *mem)
> +{
> + struct list_head *pos;
> + struct memory_block_section *mbs;
> + unsigned long min_index = 0xffffffff;
> + unsigned long max_index = 0;
> +
> + list_for_each(pos, &mem->sections) {
> + mbs = list_entry(pos, struct memory_block_section, next);
> +
> + if (mbs->phys_index < min_index)
> + min_index = mbs->phys_index;
> +
> + if (mbs->phys_index > max_index)
> + max_index = mbs->phys_index;
> + }
> +
> + mem->start_phys_index = min_index;
> + mem->end_phys_index = max_index;
> +}
> +
> +static ssize_t
> +store_mem_split_block(struct sys_device *dev, struct sysdev_attribute *attr,
> + const char *buf, size_t count)
> +{
> + struct memory_block *mem, *new_mem_blk;
> + struct memory_block_section *mbs;
> + struct list_head *pos, *tmp;
> + struct mem_section *section;
> + int min_scn_nr = 0;
> + int max_scn_nr = 0;
> + int total_scns = 0;
> + int new_blk_min, new_blk_total;
> + int ret = -EINVAL;
> +
> + mem = container_of(dev, struct memory_block, sysdev);
> +
> + if (list_is_singular(&mem->sections))
> + return -EINVAL;

What this means ?


> +
> + mutex_lock(&mem->state_mutex);
> +
> + list_for_each(pos, &mem->sections) {
> + mbs = list_entry(pos, struct memory_block_section, next);
> +
> + total_scns++;
> +
> + if (min_scn_nr > mbs->phys_index)
> + min_scn_nr = mbs->phys_index;
> +
> + if (max_scn_nr < mbs->phys_index)
> + max_scn_nr = mbs->phys_index;
> + }
> +
> + new_mem_blk = kzalloc(sizeof(*new_mem_blk), GFP_KERNEL);
> + if (!new_mem_blk)
> + return -ENOMEM;
> +
> + mutex_init(&new_mem_blk->state_mutex);
> + INIT_LIST_HEAD(&new_mem_blk->sections);
> + new_mem_blk->state = mem->state;
> +
> + mutex_lock(&new_mem_blk->state_mutex);
> +
> + new_blk_total = total_scns / 2;
> + new_blk_min = max_scn_nr - new_blk_total + 1;
> +
> + section = __nr_to_section(new_blk_min);
> + ret = register_memory(new_mem_blk, section, 0, HOTPLUG);
> +
'nid' is always 0 ?

And for what purpose this interface is ? Does this split memory block into 2 pieces
of the same size ?? sounds __very__ strange interface to me.

If this is necessary, I hope move the whole things to configfs rather than
something tricky.

Bye.
-Kame

2010-07-13 14:00:51

by Brian King

[permalink] [raw]
Subject: Re: [PATCH 1/7] Split the memory_block structure

On 07/12/2010 10:42 AM, Nathan Fontenot wrote:
> @@ -123,13 +130,20 @@
> static ssize_t show_mem_removable(struct sys_device *dev,
> struct sysdev_attribute *attr, char *buf)
> {
> - unsigned long start_pfn;
> - int ret;
> - struct memory_block *mem =
> - container_of(dev, struct memory_block, sysdev);
> + struct list_head *pos, *tmp;
> + struct memory_block *mem;
> + int ret = 1;
> +
> + mem = container_of(dev, struct memory_block, sysdev);
> + list_for_each_safe(pos, tmp, &mem->sections) {
> + struct memory_block_section *mbs;
> + unsigned long start_pfn;
> +
> + mbs = list_entry(pos, struct memory_block_section, next);
> + start_pfn = section_nr_to_pfn(mbs->phys_index);
> + ret &= is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
> + }

I don't see you deleting anyting from the list in this loop. Why do you need
to use list_for_each_safe? That won't protect you if someone else is messing
with the list.

>
> - start_pfn = section_nr_to_pfn(mem->phys_index);
> - ret = is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
> return sprintf(buf, "%d\n", ret);
> }
>


> @@ -238,19 +252,40 @@
> static int memory_block_change_state(struct memory_block *mem,
> unsigned long to_state, unsigned long from_state_req)
> {
> + struct memory_block_section *mbs;
> + struct list_head *pos;
> int ret = 0;
> +
> mutex_lock(&mem->state_mutex);
>
> - if (mem->state != from_state_req) {
> - ret = -EINVAL;
> - goto out;
> + list_for_each(pos, &mem->sections) {
> + mbs = list_entry(pos, struct memory_block_section, next);
> +
> + if (mbs->state != from_state_req)
> + continue;
> +
> + ret = memory_block_action(mbs, to_state);
> + if (ret)
> + break;
> + }

Would it be better here to loop through all the sections and ensure they
are in the proper state first before starting to change the state of any
of them? Then you could easily return -EINVAL if one or more is in
the incorrect state and wouldn't need to the code below.

> + if (ret) {
> + list_for_each(pos, &mem->sections) {
> + mbs = list_entry(pos, struct memory_block_section,
> + next);
> +
> + if (mbs->state == from_state_req)
> + continue;
> +
> + if (memory_block_action(mbs, to_state))
> + printk(KERN_ERR "Could not re-enable memory "
> + "section %lx\n", mbs->phys_index);
> + }
> }
>
> - ret = memory_block_action(mem, to_state);
> if (!ret)
> mem->state = to_state;
>
> -out:
> mutex_unlock(&mem->state_mutex);
> return ret;
> }


> @@ -498,19 +496,97 @@
>
> return mem;
> }
> +static int add_mem_block_section(struct memory_block *mem,
> + int section_nr, unsigned long state)
> +{
> + struct memory_block_section *mbs;
> +
> + mbs = kzalloc(sizeof(*mbs), GFP_KERNEL);
> + if (!mbs)
> + return -ENOMEM;
> +
> + mbs->phys_index = section_nr;
> + mbs->state = state;
> +
> + list_add(&mbs->next, &mem->sections);

I don't think there is sufficient protection for this list. Don't we
need to be holding a lock of some sort when adding/deleting/iterating
through this list?

> + return 0;
> +}

--
Brian King
Linux on Power Virtualization
IBM Linux Technology Center

2010-07-13 15:44:24

by Nathan Fontenot

[permalink] [raw]
Subject: Re: [PATCH 1/7] Split the memory_block structure

Thanks for the review, answers below...

-Nathan

On 07/13/2010 01:18 AM, KAMEZAWA Hiroyuki wrote:
>
> plz cc linux-mm in the next time...
> And please incudes updates for Documentation/memory-hotplug.txt.
>

will do.
>
> On Mon, 12 Jul 2010 10:42:06 -0500
> Nathan Fontenot <[email protected]> wrote:
>
>> This patch splits the memory_block struct into a memory_block
>> struct to cover each sysfs directory and a new memory_block_section
>> struct for each memory section covered by the sysfs directory.
>>
>> This also updates the routine handling memory_block creation
>> and manipulation to use these updated structures.
>>
>
> Could you clarify the number of memory_block_section per memory_block ?

The default number of memory_block_sections per memory block is 1. The
memory_block_size() routine (defined as __weak) sets the size of the
memory block, and thus the number of memory_block_sections.

The current view of memory in sysfs where each directory covers a single
memory section should still hold for everyone, unless a arch defines
their own memory_section_size routine to alter the behavior.

>
>
>> Signed -off-by: Nathan Fontenot <[email protected]>
>> ---
>> drivers/base/memory.c | 228 +++++++++++++++++++++++++++++++++++--------------
>> include/linux/memory.h | 11 +-
>> 2 files changed, 172 insertions(+), 67 deletions(-)
>>
>> Index: linux-2.6/drivers/base/memory.c
>> ===================================================================
>> --- linux-2.6.orig/drivers/base/memory.c 2010-07-08 11:27:21.000000000 -0500
>> +++ linux-2.6/drivers/base/memory.c 2010-07-09 14:23:09.000000000 -0500
>> @@ -28,6 +28,14 @@
>> #include <asm/uaccess.h>
>>
>> #define MEMORY_CLASS_NAME "memory"
>> +#define MIN_MEMORY_BLOCK_SIZE (1 << SECTION_SIZE_BITS)
>> +
>> +static int sections_per_block;
>> +
> some default value, plz. Does this can be determined only by .config ?

The default is 1. This is determined in the get_memory_block_size() which
is called from the memory sysfs init routine.

>
>
>> +static inline int base_memory_block_id(int section_nr)
>> +{
>> + return (section_nr / sections_per_block) * sections_per_block;
>> +}
>>
>> static struct sysdev_class memory_sysdev_class = {
>> .name = MEMORY_CLASS_NAME,
>> @@ -94,10 +102,9 @@
>> }
>>
>> static void
>> -unregister_memory(struct memory_block *memory, struct mem_section *section)
>> +unregister_memory(struct memory_block *memory)
>> {
>> BUG_ON(memory->sysdev.cls != &memory_sysdev_class);
>> - BUG_ON(memory->sysdev.id != __section_nr(section));
>>
>> /* drop the ref. we got in remove_memory_block() */
>> kobject_put(&memory->sysdev.kobj);
>> @@ -123,13 +130,20 @@
>> static ssize_t show_mem_removable(struct sys_device *dev,
>> struct sysdev_attribute *attr, char *buf)
>> {
>> - unsigned long start_pfn;
>> - int ret;
>> - struct memory_block *mem =
>> - container_of(dev, struct memory_block, sysdev);
>> + struct list_head *pos, *tmp;
>> + struct memory_block *mem;
>> + int ret = 1;
>> +
>> + mem = container_of(dev, struct memory_block, sysdev);
>> + list_for_each_safe(pos, tmp, &mem->sections) {
>> + struct memory_block_section *mbs;
>> + unsigned long start_pfn;
>> +
>> + mbs = list_entry(pos, struct memory_block_section, next);
>
> list_for_each_entry ?

I went with list_for_each_safe() here since I am not holding the mutex
while walking the list. Perhaps this should be changed to take the
mutex and use list_for_each_entry().

>
>
>
>> + start_pfn = section_nr_to_pfn(mbs->phys_index);
>> + ret &= is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
>> + }
>
> Hmm, them, only when the whole memory block is removable, it's shown as
> removable. Right ?
> Does it meets ppc guy's requirements ?

Yes, and yes.

>
>>
>> - start_pfn = section_nr_to_pfn(mem->phys_index);
>> - ret = is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
>> return sprintf(buf, "%d\n", ret);
>> }
>
> Hmm...can't you print removable information as bitmap, here ?
> overkill ?

We could print it as a bitmap, but I think it would be overkill. The
memory add/remove routines work on a memory_block such that all
memory_block_sections in the memory_block are added/removed as a whole.
Given this, I figured we only needed to know if the entire memory_block
is removable.

>
>
>>
>> @@ -182,16 +196,16 @@
>> * OK to have direct references to sparsemem variables in here.
>> */
>> static int
>> -memory_block_action(struct memory_block *mem, unsigned long action)
>> +memory_block_action(struct memory_block_section *mbs, unsigned long action)
>> {
>> int i;
>> unsigned long psection;
>> unsigned long start_pfn, start_paddr;
>> struct page *first_page;
>> int ret;
>> - int old_state = mem->state;
>> ot-option-to-disable-memory-hotplug.patch
>> + int old_state = mbs->state;
>
> Where is this noise from ?

Yuck! I'll take a look. That shouldn't be there obviously.

>
>>
>> - psection = mem->phys_index;
>> + psection = mbs->phys_index;
>> first_page = pfn_to_page(psection << PFN_SECTION_SHIFT);
>>
>> /*
>> @@ -217,18 +231,18 @@
>> ret = online_pages(start_pfn, PAGES_PER_SECTION);
>> break;
>> case MEM_OFFLINE:
>> - mem->state = MEM_GOING_OFFLINE;
>> + mbs->state = MEM_GOING_OFFLINE;
>> start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
>> ret = remove_memory(start_paddr,
>> PAGES_PER_SECTION << PAGE_SHIFT);
>> if (ret) {
>> - mem->state = old_state;
>> + mbs->state = old_state;
>> break;
>> }
>> break;
>> default:
>> WARN(1, KERN_WARNING "%s(%p, %ld) unknown action: %ld\n",
>> - __func__, mem, action, action);
>> + __func__, mbs, action, action);
>> ret = -EINVAL;
>> }
>>
>> @@ -238,19 +252,40 @@
>> static int memory_block_change_state(struct memory_block *mem,
>> unsigned long to_state, unsigned long from_state_req)
>> {
>> + struct memory_block_section *mbs;
>> + struct list_head *pos;
>> int ret = 0;
>> +
>> mutex_lock(&mem->state_mutex);
>>
>> - if (mem->state != from_state_req) {
>> - ret = -EINVAL;
>> - goto out;
>> + list_for_each(pos, &mem->sections) {
>> + mbs = list_entry(pos, struct memory_block_section, next);
>> +
>
> list_for_each_entry() ?

That could be done here.

>
>> + if (mbs->state != from_state_req)
>> + continue;
>> +
>> + ret = memory_block_action(mbs, to_state);
>> + if (ret)
>> + break;
>> + }
>
> Then, all actions will be affect all memory sections under memory block ?
> (Hmm..maybe have to see following patches ?)

Correct. Add/remove actions will work on a memory_block as a whole.

>
>
>> +
>> + if (ret) {
>> + list_for_each(pos, &mem->sections) {
>> + mbs = list_entry(pos, struct memory_block_section,
>> + next);
>> +
> list_for_each_entry() ?

got it. :)

>
>> + if (mbs->state == from_state_req)
>> + continue;
>> +
>> + if (memory_block_action(mbs, to_state))
>> + printk(KERN_ERR "Could not re-enable memory "
>> + "section %lx\n", mbs->phys_index);
>> + }
>> }
>>
>> - ret = memory_block_action(mem, to_state);
>> if (!ret)
>> mem->state = to_state;
>>
>> -out:
>> mutex_unlock(&mem->state_mutex);
>> return ret;
>> }
>> @@ -260,20 +295,15 @@
>> struct sysdev_attribute *attr, const char *buf, size_t count)
>> {
>
> Hmm, store_mem_state() ? What diff option are you using ?

Yes, this is store_mem_state.

Patches were generated with quilt.

>
>
>> struct memory_block *mem;
>> - unsigned int phys_section_nr;
>> int ret = -EINVAL;
>>
>> mem = container_of(dev, struct memory_block, sysdev);
>> - phys_section_nr = mem->phys_index;
>> -
>> - if (!present_section_nr(phys_section_nr))
>> - goto out;
>>
>> if (!strncmp(buf, "online", min((int)count, 6)))
>> ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
>> else if(!strncmp(buf, "offline", min((int)count, 7)))
>> ret = memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE);
>> -out:
>> +
>> if (ret)
>> return ret;
>> return count;
>> @@ -435,39 +465,6 @@
>> return 0;
>> }
>>
>> -static int add_memory_block(int nid, struct mem_section *section,
>> - unsigned long state, enum mem_add_context context)
>> -{
>> - struct memory_block *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
>> - unsigned long start_pfn;
>> - int ret = 0;
>> -
>> - if (!mem)
>> - return -ENOMEM;
>> -
>> - mem->phys_index = __section_nr(section);
>> - mem->state = state;
>> - mutex_init(&mem->state_mutex);
>> - start_pfn = section_nr_to_pfn(mem->phys_index);
>> - mem->phys_device = arch_get_memory_phys_device(start_pfn);
>> -
>> - ret = register_memory(mem, section);
>> - if (!ret)
>> - ret = mem_create_simple_file(mem, phys_index);
>> - if (!ret)
>> - ret = mem_create_simple_file(mem, state);
>> - if (!ret)
>> - ret = mem_create_simple_file(mem, phys_device);
>> - if (!ret)
>> - ret = mem_create_simple_file(mem, removable);
>> - if (!ret) {
>> - if (context == HOTPLUG)
>> - ret = register_mem_sect_under_node(mem, nid);
>> - }
>> -
>> - return ret;
>> -}
>> -
>
>
> please divide clean-up and logic-change patches into their own..

ok.
>
>
>> /*
>> * For now, we have a linear search to go find the appropriate
>> * memory_block corresponding to a particular phys_index. If
>> @@ -482,12 +479,13 @@
>> struct sys_device *sysdev;
>> struct memory_block *mem;
>> char name[sizeof(MEMORY_CLASS_NAME) + 9 + 1];
>> + int block_id = base_memory_block_id(__section_nr(section));
>>
>> /*
>> * This only works because we know that section == sysdev->id
>> * slightly redundant with sysdev_register()
>> */
>> - sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, __section_nr(section));
>> + sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, block_id);
>
> Hmm. Then, the user has to calculate block-id in addtion to section-id.
> Can't we use memory block name as memory%d-%d(start-end) ?

I am not attached to a particular name for the directories. I think
keeping the memory%d, where %d is the starting id makes the code that splits
the directory cleaner.

In a later patch I add a new file for each directory that has the ending
id in it so users can easily determine the start and end id's of the
memory block.

>
>
>
>>
>> kobj = kset_find_obj(&memory_sysdev_class.kset, name);
>> if (!kobj)
>> @@ -498,19 +496,97 @@
>>
>> return mem;
>> }
>> +static int add_mem_block_section(struct memory_block *mem,
>> + int section_nr, unsigned long state)
>> +{
>> + struct memory_block_section *mbs;
>> +
>> + mbs = kzalloc(sizeof(*mbs), GFP_KERNEL);
>> + if (!mbs)
>> + return -ENOMEM;
>> +
>> + mbs->phys_index = section_nr;
>> + mbs->state = state;
>> +
>> + list_add(&mbs->next, &mem->sections);
>> + return 0;
>> +}
>> +
>> +static int add_memory_block(int nid, struct mem_section *section,
>> + unsigned long state, enum mem_add_context context)
>> +{
>> + struct memory_block *mem;
>> + int ret = 0;
>> +
>> + mem = find_memory_block(section);
>
> I guess you need to add changes to find_memory_block. section-ID != block-ID.

That is above, see the line

>> + int block_id = base_memory_block_id(__section_nr(section));

>
>
>> + if (!mem) {
>> + unsigned long start_pfn;
>> +
>> + mem = kzalloc(sizeof(*mem), GFP_KERNEL);
>> + if (!mem)
>> + return -ENOMEM;
>> +
>> + mem->state = state;
>> + mutex_init(&mem->state_mutex);
>> + start_pfn = section_nr_to_pfn(__section_nr(section));
>> + mem->phys_device = arch_get_memory_phys_device(start_pfn);
>> + INIT_LIST_HEAD(&mem->sections);
>
> I'm not sure this phys_device is properly set in any arch...but this changes in
> granule will not affect ?

I don't think so, hopefully someone will speak up if this causes an issue.

>
>> +
>> + ret = register_memory(mem, section);
>> + if (!ret)
>> + ret = mem_create_simple_file(mem, phys_index);
>> + if (!ret)
>> + ret = mem_create_simple_file(mem, state);
>> + if (!ret)
>> + ret = mem_create_simple_file(mem, phys_device);
>> + if (!ret)
>> + ret = mem_create_simple_file(mem, removable);
>> + if (!ret) {
>> + if (context == HOTPLUG)
>> + ret = register_mem_sect_under_node(mem, nid);
>> + }
>> + } else {
>> + kobject_put(&mem->sysdev.kobj);
>> + }
>> +
>> + if (!ret)
>> + ret = add_mem_block_section(mem, __section_nr(section), state);
>> +
>> + return ret;
>> +}
>
>
>
>
>
>>
>> int remove_memory_block(unsigned long node_id, struct mem_section *section,
>> int phys_device)
>> {
>> struct memory_block *mem;
>> + struct memory_block_section *mbs;
>> + struct list_head *pos, *tmp;
>> + int section_nr = __section_nr(section);
>>
>> mem = find_memory_block(section);
>
> ditto.
>
>> - unregister_mem_sect_under_nodes(mem);
>> - mem_remove_simple_file(mem, phys_index);
>> - mem_remove_simple_file(mem, state);
>> - mem_remove_simple_file(mem, phys_device);
>> - mem_remove_simple_file(mem, removable);
>> - unregister_memory(mem, section);
>> + mutex_lock(&mem->state_mutex);
>> +
>> + /* remove the specified section */
>> + list_for_each_safe(pos, tmp, &mem->sections) {
>> + mbs = list_entry(pos, struct memory_block_section, next);
>> +
> list_for_each_entry_safe ?

yep. :)

>
>> + if (mbs->phys_index == section_nr) {
>> + list_del(&mbs->next);
>> + kfree(mbs);
>> + }
>> + }
>> +
>> + mutex_unlock(&mem->state_mutex);
>> +
>> + if (list_empty(&mem->sections)) {
>> + unregister_mem_sect_under_nodes(mem);
>> + mem_remove_simple_file(mem, phys_index);
>> + mem_remove_simple_file(mem, state);
>> + mem_remove_simple_file(mem, phys_device);
>> + mem_remove_simple_file(mem, removable);
>> + unregister_memory(mem);
>> + kfree(mem);
>> + }
>>
>> return 0;
>> }
>> @@ -532,6 +608,24 @@
>> return remove_memory_block(0, section, 0);
>> }
>>
>> +u32 __weak memory_block_size(void)
>> +{
>> + return MIN_MEMORY_BLOCK_SIZE;
>> +}
>> +
>> +static u32 get_memory_block_size(void)
>> +{
>> + u32 blk_sz;
>> +
>> + blk_sz = memory_block_size();
>> +
>> + /* Validate blk_sz is a power of 2 and not less than section size */
>> + if ((blk_sz & (blk_sz - 1)) || (blk_sz < MIN_MEMORY_BLOCK_SIZE))
>> + blk_sz = MIN_MEMORY_BLOCK_SIZE;
>> +
>> + return blk_sz;
>> +}
>> +
>> /*
>> * Initialize the sysfs support for memory devices...
>> */
>> @@ -540,12 +634,16 @@
>> unsigned int i;
>> int ret;
>> int err;
>> + int block_sz;
>>
>> memory_sysdev_class.kset.uevent_ops = &memory_uevent_ops;
>> ret = sysdev_class_register(&memory_sysdev_class);
>> if (ret)
>> goto out;
>>
>> + block_sz = get_memory_block_size();
>> + sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
>> +
>> /*
>> * Create entries for memory sections that were found
>> * during boot and have been initialized
>> Index: linux-2.6/include/linux/memory.h
>> ===================================================================
>> --- linux-2.6.orig/include/linux/memory.h 2010-07-08 11:27:21.000000000 -0500
>> +++ linux-2.6/include/linux/memory.h 2010-07-09 14:22:44.000000000 -0500
>> @@ -19,9 +19,15 @@
>> #include <linux/node.h>
>> #include <linux/compiler.h>
>> #include <linux/mutex.h>
>> +#include <linux/list.h>
>>
>> -struct memory_block {
>> +struct memory_block_section {
>> + unsigned long state;
>> unsigned long phys_index;
>> + struct list_head next;
>> +};
>> +
>> +struct memory_block {
>> unsigned long state;
>> /*
>> * This serializes all state change requests. It isn't
>> @@ -34,6 +40,7 @@
>> void *hw; /* optional pointer to fw/hw data */
>> int (*phys_callback)(struct memory_block *);
>> struct sys_device sysdev;
>> + struct list_head sections;
>> };
>>
>> int arch_get_memory_phys_device(unsigned long start_pfn);
>> @@ -113,7 +120,7 @@
>> extern int remove_memory_block(unsigned long, struct mem_section *, int);
>> extern int memory_notify(unsigned long val, void *v);
>> extern int memory_isolate_notify(unsigned long val, void *v);
>> -extern struct memory_block *find_memory_block(unsigned long);
>> +extern struct memory_block *find_memory_block(struct mem_section *);
>> extern int memory_is_hidden(struct mem_section *);
>> #define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION<<PAGE_SHIFT)
>> enum mem_add_context { BOOT, HOTPLUG };
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>>
>

2010-07-13 15:47:14

by Nathan Fontenot

[permalink] [raw]
Subject: Re: [PATCH 3/7] Update the [register,unregister]_memory routines

On 07/13/2010 01:20 AM, KAMEZAWA Hiroyuki wrote:
> On Mon, 12 Jul 2010 10:44:10 -0500
> Nathan Fontenot <[email protected]> wrote:
>
>> This patch moves the register/unregister_memory routines to
>> avoid a forward declaration. It also moves the sysfs file
>> creation and deletion for each directory into the register/
>> unregister routines to avoid duplicating it with these updates.
>>
>> Signed-off-by: Nathan Fontenot <[email protected]>
>> ---
>> drivers/base/memory.c | 93 +++++++++++++++++++++++++-------------------------
>> 1 file changed, 48 insertions(+), 45 deletions(-)
>>
>> Index: linux-2.6/drivers/base/memory.c
>> ===================================================================
>> --- linux-2.6.orig/drivers/base/memory.c 2010-07-09 14:23:17.000000000 -0500
>> +++ linux-2.6/drivers/base/memory.c 2010-07-09 14:23:20.000000000 -0500
>> @@ -87,31 +87,6 @@
>> EXPORT_SYMBOL(unregister_memory_isolate_notifier);
>>
>> /*
>> - * register_memory - Setup a sysfs device for a memory block
>> - */
>> -static
>> -int register_memory(struct memory_block *memory, struct mem_section *section)
>> -{
>> - int error;
>> -
>> - memory->sysdev.cls = &memory_sysdev_class;
>> - memory->sysdev.id = __section_nr(section);
>> -
>> - error = sysdev_register(&memory->sysdev);
>> - return error;
>> -}
>> -
>> -static void
>> -unregister_memory(struct memory_block *memory)
>> -{
>> - BUG_ON(memory->sysdev.cls != &memory_sysdev_class);
>> -
>> - /* drop the ref. we got in remove_memory_block() */
>> - kobject_put(&memory->sysdev.kobj);
>> - sysdev_unregister(&memory->sysdev);
>> -}
>> -
>> -/*
>> * use this as the physical section index that this memsection
>> * uses.
>> */
>> @@ -346,6 +321,53 @@
>> sysdev_remove_file(&mem->sysdev, &attr_##attr_name)
>>
>> /*
>> + * register_memory - Setup a sysfs device for a memory block
>> + */
>> +static
>> +int register_memory(struct memory_block *memory, struct mem_section *section,
>> + int nid, enum mem_add_context context)
>> +{
>> + int ret;
>> +
>> + memory->sysdev.cls = &memory_sysdev_class;
>> + memory->sysdev.id = __section_nr(section);
>> +

> Why not block-ID but section-ID ?

Using the beginning section id as the id here makes the splitting of
memory_block's easier since we can assume that the id is unique.

>
> -Kame
>

2010-07-13 15:52:10

by Nathan Fontenot

[permalink] [raw]
Subject: Re: [PATCH 4/7] Allow sysfs memory directories to be split

On 07/13/2010 01:28 AM, KAMEZAWA Hiroyuki wrote:
> On Mon, 12 Jul 2010 10:45:25 -0500
> Nathan Fontenot <[email protected]> wrote:
>
>> This patch introduces the new 'split' file in each memory sysfs
>> directory and the associated routines needed to handle splitting
>> a directory.
>>
>> Signed-off-by; Nathan Fontenot <[email protected]>
>> ---
>
> pleae check diff option...
>
>
>> drivers/base/memory.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 98 insertions(+), 1 deletion(-)
>>
>> Index: linux-2.6/drivers/base/memory.c
>> ===================================================================
>> --- linux-2.6.orig/drivers/base/memory.c 2010-07-09 14:23:20.000000000 -0500
>> +++ linux-2.6/drivers/base/memory.c 2010-07-09 14:38:09.000000000 -0500
>> @@ -32,6 +32,9 @@
>>
>> static int sections_per_block;
>>
>> +static int register_memory(struct memory_block *, struct mem_section *,
>> + int, enum mem_add_context);
>> +
>> static inline int base_memory_block_id(int section_nr)
>> {
>> return (section_nr / sections_per_block) * sections_per_block;
>> @@ -309,11 +312,100 @@
>> return sprintf(buf, "%d\n", mem->phys_device);
>> }
>>
>> +static void update_memory_block_phys_indexes(struct memory_block *mem)
>> +{
>> + struct list_head *pos;
>> + struct memory_block_section *mbs;
>> + unsigned long min_index = 0xffffffff;
>> + unsigned long max_index = 0;
>> +
>> + list_for_each(pos, &mem->sections) {
>> + mbs = list_entry(pos, struct memory_block_section, next);
>> +
>> + if (mbs->phys_index < min_index)
>> + min_index = mbs->phys_index;
>> +
>> + if (mbs->phys_index > max_index)
>> + max_index = mbs->phys_index;
>> + }
>> +
>> + mem->start_phys_index = min_index;
>> + mem->end_phys_index = max_index;
>> +}
>> +
>> +static ssize_t
>> +store_mem_split_block(struct sys_device *dev, struct sysdev_attribute *attr,
>> + const char *buf, size_t count)
>> +{
>> + struct memory_block *mem, *new_mem_blk;
>> + struct memory_block_section *mbs;
>> + struct list_head *pos, *tmp;
>> + struct mem_section *section;
>> + int min_scn_nr = 0;
>> + int max_scn_nr = 0;
>> + int total_scns = 0;
>> + int new_blk_min, new_blk_total;
>> + int ret = -EINVAL;
>> +
>> + mem = container_of(dev, struct memory_block, sysdev);
>> +
>> + if (list_is_singular(&mem->sections))
>> + return -EINVAL;
>
> What this means ?

list_is_singular() will return true if there is only one item
on the list. In this case we cannot split a memory_block with
only one memory_block_section.

>
>
>> +
>> + mutex_lock(&mem->state_mutex);
>> +
>> + list_for_each(pos, &mem->sections) {
>> + mbs = list_entry(pos, struct memory_block_section, next);
>> +
>> + total_scns++;
>> +
>> + if (min_scn_nr > mbs->phys_index)
>> + min_scn_nr = mbs->phys_index;
>> +
>> + if (max_scn_nr < mbs->phys_index)
>> + max_scn_nr = mbs->phys_index;
>> + }
>> +
>> + new_mem_blk = kzalloc(sizeof(*new_mem_blk), GFP_KERNEL);
>> + if (!new_mem_blk)
>> + return -ENOMEM;
>> +
>> + mutex_init(&new_mem_blk->state_mutex);
>> + INIT_LIST_HEAD(&new_mem_blk->sections);
>> + new_mem_blk->state = mem->state;
>> +
>> + mutex_lock(&new_mem_blk->state_mutex);
>> +
>> + new_blk_total = total_scns / 2;
>> + new_blk_min = max_scn_nr - new_blk_total + 1;
>> +
>> + section = __nr_to_section(new_blk_min);
>> + ret = register_memory(new_mem_blk, section, 0, HOTPLUG);
>> +
> 'nid' is always 0 ?

Ahh.. good catch. it may not be. I'll look into finding the correct nid.

>
> And for what purpose this interface is ? Does this split memory block into 2 pieces
> of the same size ?? sounds __very__ strange interface to me.

Yes, this splits the memory_block into two blocks of the same size. This was
suggested as something we may want to do. From ppc perspective I am not sure we
would use this.

The split functionality is not required. The main goal of the patch set is to
reduce the number of memory sysfs directories created. From a ppc perspective
the split functionality is not really needed.

>
> If this is necessary, I hope move the whole things to configfs rather than
> something tricky.
>
> Bye.
> -Kame
>

2010-07-13 15:59:56

by Nathan Fontenot

[permalink] [raw]
Subject: Re: [PATCH 1/7] Split the memory_block structure

On 07/13/2010 09:00 AM, Brian King wrote:
> On 07/12/2010 10:42 AM, Nathan Fontenot wrote:
>> @@ -123,13 +130,20 @@
>> static ssize_t show_mem_removable(struct sys_device *dev,
>> struct sysdev_attribute *attr, char *buf)
>> {
>> - unsigned long start_pfn;
>> - int ret;
>> - struct memory_block *mem =
>> - container_of(dev, struct memory_block, sysdev);
>> + struct list_head *pos, *tmp;
>> + struct memory_block *mem;
>> + int ret = 1;
>> +
>> + mem = container_of(dev, struct memory_block, sysdev);
>> + list_for_each_safe(pos, tmp, &mem->sections) {
>> + struct memory_block_section *mbs;
>> + unsigned long start_pfn;
>> +
>> + mbs = list_entry(pos, struct memory_block_section, next);
>> + start_pfn = section_nr_to_pfn(mbs->phys_index);
>> + ret &= is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
>> + }
>
> I don't see you deleting anyting from the list in this loop. Why do you need
> to use list_for_each_safe? That won't protect you if someone else is messing
> with the list.

Yes, Kame pointed this out too. I think I'll need to update the patches to
always take the mutex when walking the list and use list_for_each_entry

>
>>
>> - start_pfn = section_nr_to_pfn(mem->phys_index);
>> - ret = is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
>> return sprintf(buf, "%d\n", ret);
>> }
>>
>
>
>> @@ -238,19 +252,40 @@
>> static int memory_block_change_state(struct memory_block *mem,
>> unsigned long to_state, unsigned long from_state_req)
>> {
>> + struct memory_block_section *mbs;
>> + struct list_head *pos;
>> int ret = 0;
>> +
>> mutex_lock(&mem->state_mutex);
>>
>> - if (mem->state != from_state_req) {
>> - ret = -EINVAL;
>> - goto out;
>> + list_for_each(pos, &mem->sections) {
>> + mbs = list_entry(pos, struct memory_block_section, next);
>> +
>> + if (mbs->state != from_state_req)
>> + continue;
>> +
>> + ret = memory_block_action(mbs, to_state);
>> + if (ret)
>> + break;
>> + }
>
> Would it be better here to loop through all the sections and ensure they
> are in the proper state first before starting to change the state of any
> of them? Then you could easily return -EINVAL if one or more is in
> the incorrect state and wouldn't need to the code below.

The code below is needed in cases where the add/remove of one of the
memory_block_sections fails. The code can then return all of the
memory_block_sections in the memory_block to the original state.

>
>> + if (ret) {
>> + list_for_each(pos, &mem->sections) {
>> + mbs = list_entry(pos, struct memory_block_section,
>> + next);
>> +
>> + if (mbs->state == from_state_req)
>> + continue;
>> +
>> + if (memory_block_action(mbs, to_state))
>> + printk(KERN_ERR "Could not re-enable memory "
>> + "section %lx\n", mbs->phys_index);
>> + }
>> }
>>
>> - ret = memory_block_action(mem, to_state);
>> if (!ret)
>> mem->state = to_state;
>>
>> -out:
>> mutex_unlock(&mem->state_mutex);
>> return ret;
>> }
>
>
>> @@ -498,19 +496,97 @@
>>
>> return mem;
>> }
>> +static int add_mem_block_section(struct memory_block *mem,
>> + int section_nr, unsigned long state)
>> +{
>> + struct memory_block_section *mbs;
>> +
>> + mbs = kzalloc(sizeof(*mbs), GFP_KERNEL);
>> + if (!mbs)
>> + return -ENOMEM;
>> +
>> + mbs->phys_index = section_nr;
>> + mbs->state = state;
>> +
>> + list_add(&mbs->next, &mem->sections);
>
> I don't think there is sufficient protection for this list. Don't we
> need to be holding a lock of some sort when adding/deleting/iterating
> through this list?

You're right. we should be holding the mutex.

I think there are a couple other places that I missed with this. I'll fix
it for a v2 of the patches.

>
>> + return 0;
>> +}
>

2010-07-14 00:40:31

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 4/7] Allow sysfs memory directories to be split

On Tue, 13 Jul 2010 10:51:58 -0500
Nathan Fontenot <[email protected]> wrote:

> >
> > And for what purpose this interface is ? Does this split memory block into 2 pieces
> > of the same size ?? sounds __very__ strange interface to me.
>
> Yes, this splits the memory_block into two blocks of the same size. This was
> suggested as something we may want to do. From ppc perspective I am not sure we
> would use this.
>
> The split functionality is not required. The main goal of the patch set is to
> reduce the number of memory sysfs directories created. From a ppc perspective
> the split functionality is not really needed.
>

Okay, this is an offer from me.

1. I think you can add an boot option as "don't create memory sysfs".
please do.

2. I'd like to write a configfs module for handling memory hotplug even when
sysfs directroy is not created.
Because configfs support rmdir/mkdir, the user (ppc's daemon?) has to do

When offlining section X.
# insmod configfs_memory.ko
# mount -t configfs none /configfs
# mkdir /configfs/memoryX
# echo offline > /configfs/memoryX/state
# rmdir /configfs/memoryX

And making this operation as the default bahavior for all arch's memory hotplug may
be better...

Dave, how do you think ? Because ppc guys uses "probe" interface already,
this can be handled... no ?

One problem is that I don't have enough knowledge about configfs..it seems complex.

Thanks,
-Kame

2010-07-14 03:18:14

by Nathan Fontenot

[permalink] [raw]
Subject: Re: [PATCH 4/7] Allow sysfs memory directories to be split

On 07/13/2010 07:35 PM, KAMEZAWA Hiroyuki wrote:
> On Tue, 13 Jul 2010 10:51:58 -0500
> Nathan Fontenot <[email protected]> wrote:
>
>>>
>>> And for what purpose this interface is ? Does this split memory block into 2 pieces
>>> of the same size ?? sounds __very__ strange interface to me.
>>
>> Yes, this splits the memory_block into two blocks of the same size. This was
>> suggested as something we may want to do. From ppc perspective I am not sure we
>> would use this.
>>
>> The split functionality is not required. The main goal of the patch set is to
>> reduce the number of memory sysfs directories created. From a ppc perspective
>> the split functionality is not really needed.
>>
>
> Okay, this is an offer from me.
>
> 1. I think you can add an boot option as "don't create memory sysfs".
> please do.

I posted a patch to do that a week or so ago, it didn't go over very well.

>
> 2. I'd like to write a configfs module for handling memory hotplug even when
> sysfs directroy is not created.
> Because configfs support rmdir/mkdir, the user (ppc's daemon?) has to do
>
> When offlining section X.
> # insmod configfs_memory.ko
> # mount -t configfs none /configfs
> # mkdir /configfs/memoryX
> # echo offline > /configfs/memoryX/state
> # rmdir /configfs/memoryX
>
> And making this operation as the default bahavior for all arch's memory hotplug may
> be better...
>
> Dave, how do you think ? Because ppc guys uses "probe" interface already,
> this can be handled... no ?

ppc would still require the existance of the 'probe' interface.

Are you objecting to the 'split' functionality? If so I do not see any reason from ppc
perspective that it is needed. This was something Dave suggested, unless I am missing
something.

Since ppc needs the 'probe' interface in sysfs, and for ppc having mutliple
memory_block_sections reside under a single memory_block makes memory hotplug
simpler. On ppc we do emory hotplug operations on an LMB size basis. With my
patches this now lets us set each memory_block to span an LMB's worth of
memory. Now we could do emory hotplug in a single operation instead of multiple
operations to offline/online all of the memory sections in an LMB.

Of course the easy solution would be to increase SECTION_SIZE_BITS, but we need
support hardware that can have LMB's from 16 MB to 256 MB in size so the
SECTION_SIZE_BITS value has to remain small.

>
> One problem is that I don't have enough knowledge about configfs..it seems complex.

Me neither, thoug I will take a look at it.

-Nathan

2010-07-14 03:27:00

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 4/7] Allow sysfs memory directories to be split

On Wed, 2010-07-14 at 09:35 +0900, KAMEZAWA Hiroyuki wrote:
> 2. I'd like to write a configfs module for handling memory hotplug even when
> sysfs directroy is not created.
> Because configfs support rmdir/mkdir, the user (ppc's daemon?) has to do
>
> When offlining section X.
> # insmod configfs_memory.ko
> # mount -t configfs none /configfs
> # mkdir /configfs/memoryX
> # echo offline > /configfs/memoryX/state
> # rmdir /configfs/memoryX
>
> And making this operation as the default bahavior for all arch's memory hotplug may
> be better...
>
> Dave, how do you think ? Because ppc guys uses "probe" interface already,
> this can be handled... no ?

I think creating a interface to duplicate the existing sysfs one is a
bad idea. I also think removing the existing sysfs one isn't feasible
since there are users, and it's truly part of the ABI. So, I'm not
really a fan on the configfs interface. :(

I really do think the sysfs interface is fixable. We should at least
give it a good shot before largely duplicating its functionality.

-- Dave

2010-07-14 03:29:50

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 4/7] Allow sysfs memory directories to be split

On Tue, 13 Jul 2010 22:18:03 -0500
Nathan Fontenot <[email protected]> wrote:

> On 07/13/2010 07:35 PM, KAMEZAWA Hiroyuki wrote:
> > On Tue, 13 Jul 2010 10:51:58 -0500
> > Nathan Fontenot <[email protected]> wrote:
> >
> >>>
> >>> And for what purpose this interface is ? Does this split memory block into 2 pieces
> >>> of the same size ?? sounds __very__ strange interface to me.
> >>
> >> Yes, this splits the memory_block into two blocks of the same size. This was
> >> suggested as something we may want to do. From ppc perspective I am not sure we
> >> would use this.
> >>
> >> The split functionality is not required. The main goal of the patch set is to
> >> reduce the number of memory sysfs directories created. From a ppc perspective
> >> the split functionality is not really needed.
> >>
> >
> > Okay, this is an offer from me.
> >
> > 1. I think you can add an boot option as "don't create memory sysfs".
> > please do.
>
> I posted a patch to do that a week or so ago, it didn't go over very well.
>
> >
> > 2. I'd like to write a configfs module for handling memory hotplug even when
> > sysfs directroy is not created.
> > Because configfs support rmdir/mkdir, the user (ppc's daemon?) has to do
> >
> > When offlining section X.
> > # insmod configfs_memory.ko
> > # mount -t configfs none /configfs
> > # mkdir /configfs/memoryX
> > # echo offline > /configfs/memoryX/state
> > # rmdir /configfs/memoryX
> >
> > And making this operation as the default bahavior for all arch's memory hotplug may
> > be better...
> >
> > Dave, how do you think ? Because ppc guys uses "probe" interface already,
> > this can be handled... no ?
>
> ppc would still require the existance of the 'probe' interface.
>
> Are you objecting to the 'split' functionality?
yes.

> If so I do not see any reason from ppc
> perspective that it is needed. This was something Dave suggested, unless I am missing
> something.
>
> Since ppc needs the 'probe' interface in sysfs, and for ppc having mutliple
> memory_block_sections reside under a single memory_block makes memory hotplug
> simpler. On ppc we do emory hotplug operations on an LMB size basis. With my
> patches this now lets us set each memory_block to span an LMB's worth of
> memory. Now we could do emory hotplug in a single operation instead of multiple
> operations to offline/online all of the memory sections in an LMB.
>

Why per-section memory offlining is provided is for allowing good success-rate of
memory offlining. Because memory-hotplug has to "migrate or free" all used page
under a section, possibility of memory unplug depends on usage of memory.
If a section contains unmovable page(kernel page), we can't offline sectin.

For example, comparing
1. offlining 128MB of memory at once
2. offlining 8 chunks of 16MB memory
"2" can get very good possibility and system-busy time can be much reduced.

IIUC, ppc's 1st requirement is "resizing" not "hot-removing some memory device",
"2" is much welcomed. So, some fine-grained interface to section_size is
appreciated. So, "multiple operations" is much better than single operation.

As I posted show/hide patch, I'm writing it in configfs. I think it meets IBM's
requirements.
_But_, it's IBM's issue not Fujitsu's. So, final decistion will depend on you guys.

Anyway, I don't like a too fancy interface as "split".

Thanks,
-Kame

2010-07-14 08:34:56

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 4/7] Allow sysfs memory directories to be split

On Wed, 14 Jul 2010 12:25:03 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Tue, 13 Jul 2010 22:18:03 -0500
> Nathan Fontenot <[email protected]> wrote:
>
> > On 07/13/2010 07:35 PM, KAMEZAWA Hiroyuki wrote:
> > > On Tue, 13 Jul 2010 10:51:58 -0500
> > > Nathan Fontenot <[email protected]> wrote:
> > >
> > >>>
> > >>> And for what purpose this interface is ? Does this split memory block into 2 pieces
> > >>> of the same size ?? sounds __very__ strange interface to me.
> > >>
> > >> Yes, this splits the memory_block into two blocks of the same size. This was
> > >> suggested as something we may want to do. From ppc perspective I am not sure we
> > >> would use this.
> > >>
> > >> The split functionality is not required. The main goal of the patch set is to
> > >> reduce the number of memory sysfs directories created. From a ppc perspective
> > >> the split functionality is not really needed.
> > >>
> > >
> > > Okay, this is an offer from me.
> > >
> > > 1. I think you can add an boot option as "don't create memory sysfs".
> > > please do.
> >
> > I posted a patch to do that a week or so ago, it didn't go over very well.
> >
> > >
> > > 2. I'd like to write a configfs module for handling memory hotplug even when
> > > sysfs directroy is not created.
> > > Because configfs support rmdir/mkdir, the user (ppc's daemon?) has to do
> > >
> > > When offlining section X.
> > > # insmod configfs_memory.ko
> > > # mount -t configfs none /configfs
> > > # mkdir /configfs/memoryX
> > > # echo offline > /configfs/memoryX/state
> > > # rmdir /configfs/memoryX
> > >
> > > And making this operation as the default bahavior for all arch's memory hotplug may
> > > be better...
> > >
> > > Dave, how do you think ? Because ppc guys uses "probe" interface already,
> > > this can be handled... no ?
> >
> > ppc would still require the existance of the 'probe' interface.
> >
> > Are you objecting to the 'split' functionality?
> yes.
>
> > If so I do not see any reason from ppc
> > perspective that it is needed. This was something Dave suggested, unless I am missing
> > something.
> >
> > Since ppc needs the 'probe' interface in sysfs, and for ppc having mutliple
> > memory_block_sections reside under a single memory_block makes memory hotplug
> > simpler. On ppc we do emory hotplug operations on an LMB size basis. With my
> > patches this now lets us set each memory_block to span an LMB's worth of
> > memory. Now we could do emory hotplug in a single operation instead of multiple
> > operations to offline/online all of the memory sections in an LMB.
> >
>
> Why per-section memory offlining is provided is for allowing good success-rate of
> memory offlining. Because memory-hotplug has to "migrate or free" all used page
> under a section, possibility of memory unplug depends on usage of memory.
> If a section contains unmovable page(kernel page), we can't offline sectin.
>
> For example, comparing
> 1. offlining 128MB of memory at once
> 2. offlining 8 chunks of 16MB memory
> "2" can get very good possibility and system-busy time can be much reduced.
>
> IIUC, ppc's 1st requirement is "resizing" not "hot-removing some memory device",
> "2" is much welcomed. So, some fine-grained interface to section_size is
> appreciated. So, "multiple operations" is much better than single operation.
>
> As I posted show/hide patch, I'm writing it in configfs. I think it meets IBM's
> requirements.
> _But_, it's IBM's issue not Fujitsu's. So, final decistion will depend on you guys.
>
> Anyway, I don't like a too fancy interface as "split".
>

This is a sample configfs for handling memory hotplug.
I wrote this just for my fun and study. code-duplication was not as
big as expected...most of codes are for configfs management.

you can ignore this. but please avoid changing existing interace in fancy way.

==
[root@bluextal kamezawa]# mount -t configfs none /configfs/
[root@bluextal kamezawa]# mkdir /configfs/memory/72
[root@bluextal kamezawa]# cat /configfs/memory/72/phys_index
00000048
[root@bluextal kamezawa]# cat /sys/devices/system/memory/memory72/phys_index
00000048
[root@bluextal kamezawa]# echo offline > /configfs/memory/72/state
[root@bluextal kamezawa]# cat /configfs/memory/72/state
offline
[root@bluextal kamezawa]# cat /sys/devices/system/memory/memory72/state
offline
[root@bluextal kamezawa]# echo online > /configfs/memory/72/state
[root@bluextal kamezawa]# cat /sys/devices/system/memory/memory72/state
online

No sign.

---
drivers/base/Makefile | 2
drivers/base/memory.c | 87 +++++++++++++++++--
drivers/base/memory_config.c | 192 +++++++++++++++++++++++++++++++++++++++++++
include/linux/memory.h | 10 ++
mm/Kconfig | 1
5 files changed, 280 insertions(+), 12 deletions(-)

Index: mmotm-2.6.35-0701/drivers/base/memory.c
===================================================================
--- mmotm-2.6.35-0701.orig/drivers/base/memory.c
+++ mmotm-2.6.35-0701/drivers/base/memory.c
@@ -23,12 +23,15 @@
#include <linux/mutex.h>
#include <linux/stat.h>
#include <linux/slab.h>
+#include <linux/radix-tree.h>

#include <asm/atomic.h>
#include <asm/uaccess.h>

#define MEMORY_CLASS_NAME "memory"

+
+
static struct sysdev_class memory_sysdev_class = {
.name = MEMORY_CLASS_NAME,
};
@@ -104,17 +107,57 @@ unregister_memory(struct memory_block *m
sysdev_unregister(&memory->sysdev);
}

+/* routine for remember memory's status when configs is not used. */
+
+RADIX_TREE(hidden_mems, GFP_KERNEL);
+DEFINE_MUTEX(hidden_mems_mutex);
+int record_memory_state(unsigned long section_nr, int status)
+{
+ int ret = -ENOMEM;
+ long lstat = status << 8; /* for avoid radix'trees special handling */
+ mutex_lock(&hidden_mems_mutex);
+ radix_tree_delete(&hidden_mems, section_nr);
+ if (radix_tree_preload(GFP_KERNEL))
+ goto out;
+ ret = radix_tree_insert(&hidden_mems, section_nr, (void*)lstat);
+ radix_tree_preload_end();
+out:
+ mutex_unlock(&hidden_mems_mutex);
+ return ret;
+}
+
+int lookup_memory_state(unsigned long section_nr)
+{
+ void *ptr;
+ /* we already have big mutex */
+ ptr= radix_tree_lookup(&hidden_mems, section_nr);
+ /* treate not-recorded mems'state as ONLINE...? */
+ return ((long)ptr) >> 8;
+}
+
+void forget_memory_state(unsigned long section_nr)
+{
+ radix_tree_delete(&hidden_mems, section_nr);
+}
+
+
/*
* use this as the physical section index that this memsection
* uses.
*/

+ssize_t show_memoryblock_phys_index(struct memory_block *mem,
+ char *buf)
+{
+ return sprintf(buf, "%08lx\n", mem->phys_index);
+}
+
static ssize_t show_mem_phys_index(struct sys_device *dev,
struct sysdev_attribute *attr, char *buf)
{
struct memory_block *mem =
container_of(dev, struct memory_block, sysdev);
- return sprintf(buf, "%08lx\n", mem->phys_index);
+ return show_memoryblock_phys_index(mem, buf);
}

/*
@@ -136,11 +179,8 @@ static ssize_t show_mem_removable(struct
/*
* online, offline, going offline, etc.
*/
-static ssize_t show_mem_state(struct sys_device *dev,
- struct sysdev_attribute *attr, char *buf)
+ssize_t show_memoryblock_state(struct memory_block *mem, char *buf)
{
- struct memory_block *mem =
- container_of(dev, struct memory_block, sysdev);
ssize_t len = 0;

/*
@@ -167,6 +207,15 @@ static ssize_t show_mem_state(struct sys
return len;
}

+ssize_t show_mem_state(struct sys_device *dev,
+ struct sysdev_attribute *attr, char *buf)
+{
+ struct memory_block *mem =
+ container_of(dev, struct memory_block, sysdev);
+ mem->state = lookup_memory_state(mem->phys_index);
+ return show_memoryblock_state(mem, buf);
+}
+
int memory_notify(unsigned long val, void *v)
{
return blocking_notifier_call_chain(&memory_chain, val, v);
@@ -218,11 +267,14 @@ memory_block_action(struct memory_block
break;
case MEM_OFFLINE:
mem->state = MEM_GOING_OFFLINE;
+ record_memory_state(mem->phys_index, mem->state);
start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
ret = remove_memory(start_paddr,
PAGES_PER_SECTION << PAGE_SHIFT);
if (ret) {
mem->state = old_state;
+ record_memory_state(mem->phys_index,
+ mem->state);
break;
}
break;
@@ -241,6 +293,8 @@ static int memory_block_change_state(str
int ret = 0;
mutex_lock(&mem->state_mutex);

+ mem->state = lookup_memory_state(mem->phys_index);
+
if (mem->state != from_state_req) {
ret = -EINVAL;
goto out;
@@ -249,21 +303,18 @@ static int memory_block_change_state(str
ret = memory_block_action(mem, to_state);
if (!ret)
mem->state = to_state;
-
+ record_memory_state(mem->phys_index, mem->state);
out:
mutex_unlock(&mem->state_mutex);
return ret;
}

-static ssize_t
-store_mem_state(struct sys_device *dev,
- struct sysdev_attribute *attr, const char *buf, size_t count)
+ssize_t store_memoryblock_state(struct memory_block *mem,
+ const char *buf, size_t count)
{
- struct memory_block *mem;
unsigned int phys_section_nr;
int ret = -EINVAL;

- mem = container_of(dev, struct memory_block, sysdev);
phys_section_nr = mem->phys_index;

if (!present_section_nr(phys_section_nr))
@@ -279,6 +330,16 @@ out:
return count;
}

+static ssize_t
+store_mem_state(struct sys_device *dev,
+ struct sysdev_attribute *attr, const char *buf, size_t count)
+{
+ struct memory_block *mem;
+
+ mem = container_of(dev, struct memory_block, sysdev);
+ return store_memoryblock_state(mem, buf, count);
+}
+
/*
* phys_device is a bad name for this. What I really want
* is a way to differentiate between memory ranges that
@@ -451,6 +512,8 @@ static int add_memory_block(int nid, str
start_pfn = section_nr_to_pfn(mem->phys_index);
mem->phys_device = arch_get_memory_phys_device(start_pfn);

+ record_memory_state(mem->phys_index, state);
+
ret = register_memory(mem, section);
if (!ret)
ret = mem_create_simple_file(mem, phys_index);
@@ -505,6 +568,7 @@ int remove_memory_block(unsigned long no
struct memory_block *mem;

mem = find_memory_block(section);
+ forget_memory_state(mem->phys_index);
unregister_mem_sect_under_nodes(mem);
mem_remove_simple_file(mem, phys_index);
mem_remove_simple_file(mem, state);
@@ -573,3 +637,4 @@ out:
printk(KERN_ERR "%s() failed: %d\n", __func__, ret);
return ret;
}
+
Index: mmotm-2.6.35-0701/drivers/base/memory_config.c
===================================================================
--- /dev/null
+++ mmotm-2.6.35-0701/drivers/base/memory_config.c
@@ -0,0 +1,192 @@
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/memory.h>
+#include <linux/slab.h>
+#include <linux/radix-tree.h>
+#include <linux/configfs.h>
+
+
+
+struct memory_hp_block {
+ struct config_group group;
+ struct memory_block mb;
+};
+struct memory_hp_attribute {
+ struct configfs_attribute attr;
+ ssize_t (*show)(struct memory_hp_block *, char *);
+ ssize_t (*store)(struct memory_hp_block *, const char *);
+};
+
+static struct memory_hp_block *to_mhp_block(struct config_item *item)
+{
+ if (!item)
+ return NULL;
+ return container_of(to_config_group(item),
+ struct memory_hp_block, group);
+}
+
+static ssize_t
+memory_hp_phys_index_read(struct memory_hp_block *mhb, char *page)
+{
+ return show_memoryblock_phys_index(&mhb->mb, page);
+}
+
+static struct memory_hp_attribute memory_hp_phys_index = {
+ .attr = { .ca_owner = THIS_MODULE,
+ .ca_name = "phys_index",
+ .ca_mode = S_IRUGO },
+ .show = memory_hp_phys_index_read,
+};
+
+static ssize_t
+memory_hp_state_read(struct memory_hp_block *mhb, char *page)
+{
+ /* synchronize */
+ printk("lookup section %ld\n", mhb->mb.phys_index);
+ mhb->mb.state = lookup_memory_state(mhb->mb.phys_index);
+ return show_memoryblock_state(&mhb->mb, page);
+}
+
+static ssize_t
+memory_hp_state_store(struct memory_hp_block *mhb, const char *page)
+{
+ int len = strnlen(page, 8);
+ printk("length %d str %s\n", len, page);
+ if (len > 8) /* online/offline */
+ return -EINVAL;
+ /* synchronize */
+ mhb->mb.state = lookup_memory_state(mhb->mb.phys_index);
+ return store_memoryblock_state(&mhb->mb, page, len);
+}
+
+static struct memory_hp_attribute memory_hp_state = {
+ .attr = { .ca_owner = THIS_MODULE,
+ .ca_name = "state",
+ .ca_mode = S_IRUGO|S_IWUSR },
+ .show = memory_hp_state_read,
+ .store = memory_hp_state_store,
+};
+
+static struct configfs_attribute *memory_hp_attrs[] = {
+ &memory_hp_phys_index.attr,
+ &memory_hp_state.attr,
+ NULL,
+};
+
+static ssize_t memory_hp_attr_show(struct config_item *item,
+ struct configfs_attribute *attr,
+ char *page)
+{
+ struct memory_hp_block *mhb = to_mhp_block(item);
+ struct memory_hp_attribute *memhp_attr =
+ container_of(attr, struct memory_hp_attribute, attr);
+ ssize_t ret = 0;
+
+ if (memhp_attr->show)
+ ret = memhp_attr->show(mhb, page);
+ return ret;
+}
+
+static ssize_t memory_hp_attr_store(struct config_item *item,
+ struct configfs_attribute *attr,
+ const char *page, size_t count)
+{
+ struct memory_hp_block *mhb = to_mhp_block(item);
+ struct memory_hp_attribute *memhp_attr =
+ container_of(attr, struct memory_hp_attribute, attr);
+ ssize_t ret = 0;
+
+ if (memhp_attr->store)
+ ret = memhp_attr->store(mhb, page);
+ else
+ ret = -EINVAL;
+ return ret;
+}
+
+static struct configfs_item_operations memory_hp_item_ops = {
+ .show_attribute = memory_hp_attr_show,
+ .store_attribute = memory_hp_attr_store,
+};
+
+static struct config_item_type memory_hp_type = {
+ .ct_item_ops = &memory_hp_item_ops,
+ .ct_attrs = memory_hp_attrs,
+ .ct_owner = THIS_MODULE,
+};
+
+static struct config_group *
+memory_hp_make_group(struct config_group *group, const char *name)
+{
+ struct memory_hp_block *mhb;
+ unsigned long long section_id;
+
+
+ if (strict_strtoull(name, 10, &section_id))
+ return ERR_PTR(-EINVAL);
+
+ if (!valid_section_nr(section_id))
+ return ERR_PTR(-EINVAL);
+
+ mhb = kzalloc(sizeof(*mhb), GFP_KERNEL);
+ if (!mhb)
+ return NULL;
+
+ config_group_init_type_name(&mhb->group, name, &memory_hp_type);
+
+ mhb->mb.phys_index = section_id;
+ mutex_init(&mhb->mb.state_mutex);
+ mhb->mb.state = lookup_memory_state(section_id);
+
+ return &mhb->group;
+}
+
+static void memory_hp_drop_item(struct config_group *group,
+ struct config_item *item)
+{
+ struct memory_hp_block *mhb;
+
+ mhb = container_of(group, struct memory_hp_block, group);
+ config_item_put(item);
+}
+
+static struct configfs_group_operations memory_hp_group_ops = {
+ .make_group = memory_hp_make_group,
+ .drop_item = memory_hp_drop_item,
+};
+
+static struct config_item_type memory_hp_subsys_type = {
+ .ct_group_ops = &memory_hp_group_ops,
+ .ct_owner = THIS_MODULE,
+};
+
+static struct configfs_subsystem memory_hp_subsys = {
+ .su_group = {
+ .cg_item = {
+ .ci_namebuf = "memory",
+ .ci_type = &memory_hp_subsys_type,
+ },
+ },
+};
+
+static int __init memory_config_init(void)
+{
+ int ret;
+
+ config_group_init(&memory_hp_subsys.su_group);
+ mutex_init(&memory_hp_subsys.su_mutex);
+ ret = configfs_register_subsystem(&memory_hp_subsys);
+ if (ret) {
+ printk(KERN_ERR "Error %d while registering memory configfs\n", ret);
+ return ret;
+ }
+ return 0;
+}
+
+#if 0
+static void __exit memory_config_exit(void)
+{
+ configfs_unregister_subsystem(&memory_hp_subsys);
+}
+#endif
+late_initcall(memory_config_init);
Index: mmotm-2.6.35-0701/drivers/base/Makefile
===================================================================
--- mmotm-2.6.35-0701.orig/drivers/base/Makefile
+++ mmotm-2.6.35-0701/drivers/base/Makefile
@@ -11,7 +11,7 @@ obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT)
obj-$(CONFIG_ISA) += isa.o
obj-$(CONFIG_FW_LOADER) += firmware_class.o
obj-$(CONFIG_NUMA) += node.o
-obj-$(CONFIG_MEMORY_HOTPLUG_SPARSE) += memory.o
+obj-$(CONFIG_MEMORY_HOTPLUG_SPARSE) += memory.o memory_config.o
obj-$(CONFIG_SMP) += topology.o
obj-$(CONFIG_IOMMU_API) += iommu.o
ifeq ($(CONFIG_SYSFS),y)
Index: mmotm-2.6.35-0701/mm/Kconfig
===================================================================
--- mmotm-2.6.35-0701.orig/mm/Kconfig
+++ mmotm-2.6.35-0701/mm/Kconfig
@@ -138,6 +138,7 @@ config MEMORY_HOTPLUG
config MEMORY_HOTPLUG_SPARSE
def_bool y
depends on SPARSEMEM && MEMORY_HOTPLUG
+ select CONFIGFS_FS

config MEMORY_HOTREMOVE
bool "Allow for memory hot remove"
Index: mmotm-2.6.35-0701/include/linux/memory.h
===================================================================
--- mmotm-2.6.35-0701.orig/include/linux/memory.h
+++ mmotm-2.6.35-0701/include/linux/memory.h
@@ -68,6 +68,16 @@ struct memory_isolate_notify {
struct notifier_block;
struct mem_section;

+
+ssize_t show_memoryblock_phys_index(struct memory_block *mb, char *buf);
+ssize_t show_memoryblock_state(struct memory_block *mb, char *buf);
+ssize_t store_memoryblock_state(struct memory_block *mb,
+ const char *buf, size_t count);
+
+int record_memory_state(unsigned long section, int state);
+int lookup_memory_state(unsigned long section);
+void forget_memory_state(unsigned long section);
+
/*
* Priorities for the hotplug memory callback routines (stored in decreasing
* order in the callback chain)




2010-07-14 17:17:14

by Nathan Fontenot

[permalink] [raw]
Subject: Re: [PATCH 4/7] Allow sysfs memory directories to be split

On 07/13/2010 10:26 PM, Dave Hansen wrote:
> On Wed, 2010-07-14 at 09:35 +0900, KAMEZAWA Hiroyuki wrote:
>> 2. I'd like to write a configfs module for handling memory hotplug even when
>> sysfs directroy is not created.
>> Because configfs support rmdir/mkdir, the user (ppc's daemon?) has to do
>>
>> When offlining section X.
>> # insmod configfs_memory.ko
>> # mount -t configfs none /configfs
>> # mkdir /configfs/memoryX
>> # echo offline > /configfs/memoryX/state
>> # rmdir /configfs/memoryX
>>
>> And making this operation as the default bahavior for all arch's memory hotplug may
>> be better...
>>
>> Dave, how do you think ? Because ppc guys uses "probe" interface already,
>> this can be handled... no ?
>
> I think creating a interface to duplicate the existing sysfs one is a
> bad idea. I also think removing the existing sysfs one isn't feasible
> since there are users, and it's truly part of the ABI. So, I'm not
> really a fan on the configfs interface. :(
>
> I really do think the sysfs interface is fixable. We should at least
> give it a good shot before largely duplicating its functionality.

I agree with Dave, I don't think another memory hotplug interface is needed.

I am working to update the patchset to remove the split functionality and
fix other items commented on. this new patch will just split the memory_block
structure so that a memory_block can span multiple memory sections.

Kame, I understand that offlining 16 MB is easier than 256 MB. From the ppc
perspective though, we are still offlining 256 MB. We do memory add/remove
on LMB size chunks, so we have to add/remove all of the memory sections contained
in an LMB. If any one memory section covered by a LMB fails to add/remove, we
restore the memory sections to their orignal state an fail the add/remove operation.
NOTE: the code doing this is not in the kernel, but in the user-space drmgr
command (from powerpc-utils package).

-Nathan

2010-07-16 07:13:01

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/7] De-couple sysfs memory directories from memory sections

On Mon, Jul 12, 2010 at 10:27:02AM -0500, Nathan Fontenot wrote:
> This set of patches de-couples the idea that there is a single
> directory in sysfs for each memory section.

Any reason you didn't cc: the sysfs maintainer on these patches? If
not, I'll gladly ignore them...

(hint, scripts/get_maintainer.pl is your friend...)

greg k-h

2010-07-16 15:41:53

by Nathan Fontenot

[permalink] [raw]
Subject: Re: [PATCH 0/7] De-couple sysfs memory directories from memory sections

On 07/16/2010 02:13 AM, Greg KH wrote:
> On Mon, Jul 12, 2010 at 10:27:02AM -0500, Nathan Fontenot wrote:
>> This set of patches de-couples the idea that there is a single
>> directory in sysfs for each memory section.
>
> Any reason you didn't cc: the sysfs maintainer on these patches? If
> not, I'll gladly ignore them...

Ummm.... I have no good excuse. :)

Will add you to cc for v3 of the patch.

>
> (hint, scripts/get_maintainer.pl is your friend...)
Thanks.

-Nathan
>
> greg k-h