LinuxLists.cc - [Patch 000/002] Rearrange notifier of memory hotplug

2007-10-12 02:19:19

Subject: [Patch 000/002] Rearrange notifier of memory hotplug

Hello.

This patch set is to rearrange event notifier for memory hotplug,
because the old notifier has some defects. For example, there is no
information like new memory's pfn and # of pages for callback functions.

Fortunately, nothing uses this notifier so far, there is no impact by
this change. (SLUB will use this after this patch set to make
kmem_cache_node structure).

In addition, descriptions of notifer is added to memory hotplug
document.

This patch was a part of patch set to make kmem_cache_node of SLUB
to avoid panic of memory online. But, I think this change becomes
not only for SLUB but also for others. So, I extracted this from it.

This patch set is for 2.6.23-rc8-mm2.
I tested this patch on my ia64 box.

Please apply.

Bye.

--
Yasunori Goto

2007-10-12 02:21:38

by Yasunori Goto

[permalink] [raw]

Subject: [Patch 001/002] Make description of memory hotplug notifier in document

Add description about event notification callback routine to the document.

Signed-off-by: Yasunori Goto <[email protected]>

---
Documentation/memory-hotplug.txt | 56 ++++++++++++++++++++++++++++++++++++---
1 file changed, 53 insertions(+), 3 deletions(-)

Index: current/Documentation/memory-hotplug.txt
===================================================================
--- current.orig/Documentation/memory-hotplug.txt
+++ current/Documentation/memory-hotplug.txt
@@ -2,7 +2,8 @@
Memory Hotplug
==============

-Last Updated: Jul 28 2007
+Created: Jul 28 2007
+Add description of notifier of memory hotplug Oct 11 2007

This document is about memory hotplug including how-to-use and current status.
Because Memory Hotplug is still under development, contents of this text will
@@ -24,7 +25,8 @@ be changed often.
6.1 Memory offline and ZONE_MOVABLE
6.2. How to offline memory
7. Physical memory remove
-8. Future Work List
+8. Memory hotplug event notifier
+9. Future Work List

Note(1): x86_64's has special implementation for memory hotplug.
This text does not describe it.
@@ -307,8 +309,68 @@ Need more implementation yet....
- Notification completion of remove works by OS to firmware.
- Guard from remove if not yet.

+--------------------------------
+8. Memory hotplug event notifier
+--------------------------------
+Memory hotplug has event notifer. There are 6 types of notification.
+
+MEMORY_GOING_ONLINE
+ This is notified before memory online. If some structures must be prepared
+ for new memory, it should be done at this event's callback.
+ The new onlining memory can't be used yet.
+
+MEMORY_CANCEL_ONLINE
+ If memory online fails, this event is notified for rollback of setting at
+ MEMORY_GOING_ONLINE.
+ (Currently, this event is notified only the case which a callback routine
+ of MEMORY_GOING_ONLINE fails).
+
+MEMORY_ONLINE
+ This event is called when memory online is completed. The page allocator uses
+ new memory area before this notification. In other words, callback routine
+ use new memory area via page allocator.
+ The failures of callbacks of this notification will be ignored.
+
+MEMORY_GOING_OFFLINE
+ This is notified on halfway of memory offline. The offlining pages are
+ isolated. In other words, the page allocater doesn't allocate new pages from
+ offlining memory area at this time. If callback routine freed some pages,
+ they are not used by the page allocator.
+ This is good place for shrinking cache. (If possible, it is desirable to
+ migrate to other area.)
+
+MEMORY_CANCEL_OFFLINE
+ If memory offline fails, this event is notified for rollback against
+ MEMORY_GOING_OFFLINE. The page allocator will use target memory area after
+ this callback again.
+
+MEMORY_OFFLINE
+ This is notified after memory offline completed. The failures of callbacks
+ of this notification will be ignored. Callback routine can return structures
+ for offlined memory.
+ If the node which has offlined memory,
+
+A callback routine can be registered by
+ hotplug_memory_notifier(callback_func, priority).
+
+The second argument of callback function (action) is event types of above.
+The third argument is passed by pointer of struct memory_notify.
+
+struct memory_notify {
+ unsigned long start_pfn;
+ unsigned long nr_pages;
+ int status_change_nid;
+};
+start_pfn is start pfn of online/offline memory.
+nr_pages is # of pages of online/offline memory.
+status_change_nid is set node id when N_HIGH_MEMORY of nodemask is (will be)
+set/clear. It means a new(memoryless) node gets new memory by online and a
+node lose all memory. If this is -1, then nodemask status is not changed.
+If status_changed_nid >= 0, callback should create/discard structures for the
+node if necessary.
+
--------------
-8. Future Work
+9. Future Work
--------------
- allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
sysctl or new control file.

--
Yasunori Goto

2007-10-12 02:23:20

by Yasunori Goto

[permalink] [raw]

Subject: [Patch 002/002] rearrange patch for notifier of memory hotplug

Current memory notifier has some defects yet. (Fortunately, nothing uses it.)
This patch is to fix and rearrange for them.

- Add information of start_pfn, nr_pages, and node id if node status is
changes from/to memoryless node for callback functions.
Callbacks can't do anything without those information.
- Add notification going-online status.
It is necessary for creating per node structure before the node's
pages are available.
- Move GOING_OFFLINE status notification after page isolation.
It is good place for return memory like cache for callback,
because returned page is not used again.
- Make CANCEL events for rollingback when error occurs.
- Delete MEM_MAPPING_INVALID notification. It will be not used.
- Fix compile error of (un)register_memory_notifier().

Signed-off-by: Yasunori Goto <[email protected]>

---
drivers/base/memory.c | 9 +--------
include/linux/memory.h | 27 +++++++++++++++------------
mm/memory_hotplug.c | 48 +++++++++++++++++++++++++++++++++++++++++++++---
3 files changed, 61 insertions(+), 23 deletions(-)

Index: current/drivers/base/memory.c
===================================================================
--- current.orig/drivers/base/memory.c 2007-10-11 14:33:02.000000000 +0900
+++ current/drivers/base/memory.c 2007-10-11 14:33:07.000000000 +0900
@@ -137,7 +137,7 @@ static ssize_t show_mem_state(struct sys
return len;
}

-static inline int memory_notify(unsigned long val, void *v)
+int memory_notify(unsigned long val, void *v)
{
return blocking_notifier_call_chain(&memory_chain, val, v);
}
@@ -183,7 +183,6 @@ memory_block_action(struct memory_block
break;
case MEM_OFFLINE:
mem->state = MEM_GOING_OFFLINE;
- memory_notify(MEM_GOING_OFFLINE, NULL);
start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
ret = remove_memory(start_paddr,
PAGES_PER_SECTION << PAGE_SHIFT);
@@ -191,7 +190,6 @@ memory_block_action(struct memory_block
mem->state = old_state;
break;
}
- memory_notify(MEM_MAPPING_INVALID, NULL);
break;
default:
printk(KERN_WARNING "%s(%p, %ld) unknown action: %ld\n",
@@ -199,11 +197,6 @@ memory_block_action(struct memory_block
WARN_ON(1);
ret = -EINVAL;
}
- /*
- * For now, only notify on successful memory operations
- */
- if (!ret)
- memory_notify(action, NULL);

return ret;
}
Index: current/include/linux/memory.h
===================================================================
--- current.orig/include/linux/memory.h 2007-10-11 14:33:02.000000000 +0900
+++ current/include/linux/memory.h 2007-10-11 15:19:31.000000000 +0900
@@ -41,18 +41,15 @@ struct memory_block {
#define MEM_ONLINE (1<<0) /* exposed to userspace */
#define MEM_GOING_OFFLINE (1<<1) /* exposed to userspace */
#define MEM_OFFLINE (1<<2) /* exposed to userspace */
+#define MEM_GOING_ONLINE (1<<3)
+#define MEM_CANCEL_ONLINE (1<<4)
+#define MEM_CANCEL_OFFLINE (1<<5)

-/*
- * All of these states are currently kernel-internal for notifying
- * kernel components and architectures.
- *
- * For MEM_MAPPING_INVALID, all notifier chains with priority >0
- * are called before pfn_to_page() becomes invalid. The priority=0
- * entry is reserved for the function that actually makes
- * pfn_to_page() stop working. Any notifiers that want to be called
- * after that should have priority <0.
- */
-#define MEM_MAPPING_INVALID (1<<3)
+struct memory_notify {
+ unsigned long start_pfn;
+ unsigned long nr_pages;
+ int status_change_nid;
+};

struct notifier_block;
struct mem_section;
@@ -69,12 +66,18 @@ static inline int register_memory_notifi
static inline void unregister_memory_notifier(struct notifier_block *nb)
{
}
+static inline int memory_notify(unsigned long val, void *v)
+{
+ return 0;
+}
#else
+extern int register_memory_notifier(struct notifier_block *nb);
+extern void unregister_memory_notifier(struct notifier_block *nb);
extern int register_new_memory(struct mem_section *);
extern int unregister_memory_section(struct mem_section *);
extern int memory_dev_init(void);
extern int remove_memory_block(unsigned long, struct mem_section *, int);
-
+extern int memory_notify(unsigned long val, void *v);
#define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION<<PAGE_SHIFT)

Index: current/mm/memory_hotplug.c
===================================================================
--- current.orig/mm/memory_hotplug.c 2007-10-11 14:33:02.000000000 +0900
+++ current/mm/memory_hotplug.c 2007-10-11 15:38:50.000000000 +0900
@@ -187,7 +187,24 @@ int online_pages(unsigned long pfn, unsi
unsigned long onlined_pages = 0;
struct zone *zone;
int need_zonelists_rebuild = 0;
+ int nid;
+ int ret;
+ struct memory_notify arg;
+
+ arg.start_pfn = pfn;
+ arg.nr_pages = nr_pages;
+ arg.status_change_nid = -1;
+
+ nid = page_to_nid(pfn_to_page(pfn));
+ if (node_present_pages(nid) == 0)
+ arg.status_change_nid = nid;

+ ret = memory_notify(MEM_GOING_ONLINE, &arg);
+ ret = notifier_to_errno(ret);
+ if (ret) {
+ memory_notify(MEM_CANCEL_ONLINE, &arg);
+ return ret;
+ }
/*
* This doesn't need a lock to do pfn_to_page().
* The section can't be removed here because of the
@@ -222,6 +239,10 @@ int online_pages(unsigned long pfn, unsi
build_all_zonelists();
vm_total_pages = nr_free_pagecache_pages();
writeback_set_ratelimit();
+
+ if (onlined_pages)
+ memory_notify(MEM_ONLINE, &arg);
+
return 0;
}
#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
@@ -467,8 +488,9 @@ int offline_pages(unsigned long start_pf
{
unsigned long pfn, nr_pages, expire;
long offlined_pages;
- int ret, drain, retry_max;
+ int ret, drain, retry_max, node;
struct zone *zone;
+ struct memory_notify arg;

BUG_ON(start_pfn >= end_pfn);
/* at least, alignment against pageblock is necessary */
@@ -480,11 +502,27 @@ int offline_pages(unsigned long start_pf
we assume this for now. .*/
if (!test_pages_in_a_zone(start_pfn, end_pfn))
return -EINVAL;
+
+ zone = page_zone(pfn_to_page(start_pfn));
+ node = zone_to_nid(zone);
+ nr_pages = end_pfn - start_pfn;
+
/* set above range as isolated */
ret = start_isolate_page_range(start_pfn, end_pfn);
if (ret)
return ret;
- nr_pages = end_pfn - start_pfn;
+
+ arg.start_pfn = start_pfn;
+ arg.nr_pages = nr_pages;
+ arg.status_change_nid = -1;
+ if (nr_pages >= node_present_pages(node))
+ arg.status_change_nid = node;
+
+ ret = memory_notify(MEM_GOING_OFFLINE, &arg);
+ ret = notifier_to_errno(ret);
+ if (ret)
+ goto failed_removal;
+
pfn = start_pfn;
expire = jiffies + timeout;
drain = 0;
@@ -539,20 +577,24 @@ repeat:
/* reset pagetype flags */
start_isolate_page_range(start_pfn, end_pfn);
/* removal success */
- zone = page_zone(pfn_to_page(start_pfn));
zone->present_pages -= offlined_pages;
zone->zone_pgdat->node_present_pages -= offlined_pages;
totalram_pages -= offlined_pages;
num_physpages -= offlined_pages;
+
vm_total_pages = nr_free_pagecache_pages();
writeback_set_ratelimit();
+
+ memory_notify(MEM_OFFLINE, &arg);
return 0;

failed_removal:
printk(KERN_INFO "memory offlining %lx to %lx failed\n",
start_pfn, end_pfn);
+ memory_notify(MEM_CANCEL_OFFLINE, &arg);
/* pushback to free area */
undo_isolate_page_range(start_pfn, end_pfn);
+
return ret;
}
#else

--
Yasunori Goto

2007-10-12 02:25:28

by Yasunori Goto

[permalink] [raw]

Subject: [Patch 000/002] Make kmem_cache_node for SLUB on memory online to avoid panic(take 2)

This patch set is to fix panic due to access NULL pointer of SLUB.

When new memory is hot-added on the new node (or memory less node),
kmem_cache_node for the new node is not prepared,
and panic occurs by it. So, kmem_cache_node should be created for the node
before new memory is available on the node.
Incidentally, it is freed on memory offline if it becomes not necessary.

This is the first user of the callback of memory notifier, and
requires its rearrange patch set.

This patch set is for 2.6.23-rc8-mm2.
I tested this patch on my ia64 box.

Please apply.

Bye.

--
Yasunori Goto

2007-10-12 02:28:19

by Yasunori Goto

[permalink] [raw]

Subject: [Patch 001/002] extract kmem_cache_shrink

Make kmem_cache_shrink_node() for callback routine of memory hotplug
notifier. This is just extract a part of kmem_cache_shrink().

Signed-off-by: Yasunori Goto <[email protected]>

---
mm/slub.c | 111 ++++++++++++++++++++++++++++++++++----------------------------
1 file changed, 61 insertions(+), 50 deletions(-)

Index: current/mm/slub.c
===================================================================
--- current.orig/mm/slub.c 2007-10-11 20:30:45.000000000 +0900
+++ current/mm/slub.c 2007-10-11 21:58:47.000000000 +0900
@@ -2626,6 +2626,56 @@ void kfree(const void *x)
}
EXPORT_SYMBOL(kfree);

+static inline void __kmem_cache_shrink_node(struct kmem_cache *s, int node,
+ struct list_head *slabs_by_inuse)
+{
+ struct kmem_cache_node *n;
+ int i;
+ struct page *page;
+ struct page *t;
+ unsigned long flags;
+
+ n = get_node(s, node);
+
+ if (!n->nr_partial)
+ return;
+
+ for (i = 0; i < s->objects; i++)
+ INIT_LIST_HEAD(slabs_by_inuse + i);
+
+ spin_lock_irqsave(&n->list_lock, flags);
+
+ /*
+ * Build lists indexed by the items in use in each slab.
+ *
+ * Note that concurrent frees may occur while we hold the
+ * list_lock. page->inuse here is the upper limit.
+ */
+ list_for_each_entry_safe(page, t, &n->partial, lru) {
+ if (!page->inuse && slab_trylock(page)) {
+ /*
+ * Must hold slab lock here because slab_free
+ * may have freed the last object and be
+ * waiting to release the slab.
+ */
+ list_del(&page->lru);
+ n->nr_partial--;
+ slab_unlock(page);
+ discard_slab(s, page);
+ } else
+ list_move(&page->lru, slabs_by_inuse + page->inuse);
+ }
+
+ /*
+ * Rebuild the partial list with the slabs filled up most
+ * first and the least used slabs at the end.
+ */
+ for (i = s->objects - 1; i >= 0; i--)
+ list_splice(slabs_by_inuse + i, n->partial.prev);
+
+ spin_unlock_irqrestore(&n->list_lock, flags);
+}
+
/*
* kmem_cache_shrink removes empty slabs from the partial lists and sorts
* the remaining slabs by the number of items in use. The slabs with the
@@ -2636,68 +2686,29 @@ EXPORT_SYMBOL(kfree);
* being allocated from last increasing the chance that the last objects
* are freed in them.
*/
-int kmem_cache_shrink(struct kmem_cache *s)
+int kmem_cache_shrink_node(struct kmem_cache *s, int node)
{
- int node;
- int i;
- struct kmem_cache_node *n;
- struct page *page;
- struct page *t;
struct list_head *slabs_by_inuse =
kmalloc(sizeof(struct list_head) * s->objects, GFP_KERNEL);
- unsigned long flags;

if (!slabs_by_inuse)
return -ENOMEM;

flush_all(s);
- for_each_node_state(node, N_NORMAL_MEMORY) {
- n = get_node(s, node);
-
- if (!n->nr_partial)
- continue;
-
- for (i = 0; i < s->objects; i++)
- INIT_LIST_HEAD(slabs_by_inuse + i);
-
- spin_lock_irqsave(&n->list_lock, flags);
-
- /*
- * Build lists indexed by the items in use in each slab.
- *
- * Note that concurrent frees may occur while we hold the
- * list_lock. page->inuse here is the upper limit.
- */
- list_for_each_entry_safe(page, t, &n->partial, lru) {
- if (!page->inuse && slab_trylock(page)) {
- /*
- * Must hold slab lock here because slab_free
- * may have freed the last object and be
- * waiting to release the slab.
- */
- list_del(&page->lru);
- n->nr_partial--;
- slab_unlock(page);
- discard_slab(s, page);
- } else {
- list_move(&page->lru,
- slabs_by_inuse + page->inuse);
- }
- }
-
- /*
- * Rebuild the partial list with the slabs filled up most
- * first and the least used slabs at the end.
- */
- for (i = s->objects - 1; i >= 0; i--)
- list_splice(slabs_by_inuse + i, n->partial.prev);
-
- spin_unlock_irqrestore(&n->list_lock, flags);
- }
+ if (node >= 0)
+ __kmem_cache_shrink_node(s, node, slabs_by_inuse);
+ else
+ for_each_node_state(node, N_NORMAL_MEMORY)
+ __kmem_cache_shrink_node(s, node, slabs_by_inuse);

kfree(slabs_by_inuse);
return 0;
}
+
+int kmem_cache_shrink(struct kmem_cache *s)
+{
+ return kmem_cache_shrink_node(s, -1);
+}
EXPORT_SYMBOL(kmem_cache_shrink);

/********************************************************************

--
Yasunori Goto

2007-10-12 02:30:18

by Yasunori Goto

[permalink] [raw]

Subject: [Patch 002/002] Create/delete kmem_cache_node for SLUB on memory online callback

This is to make kmem_cache_nodes of all SLUBs for new node when
memory-hotadd is called. This fixes panic due to access NULL pointer at
discard_slab() after memory hot-add.

If pages on the new node available, slub can use it before making
new kmem_cache_nodes. So, this callback should be called
BEFORE pages on the node are available.

When memory online is called, slab_mem_going_online_callback() is
called to make kmem_cache_node(). if it (or other callbacks) fails,
then slab_mem_offline_callback() is called for rollback.

In memory offline, slab_mem_going_offline_callback() is called to
shrink cache, then slab_mem_offline_callback() is called later.

Signed-off-by: Yasunori Goto <[email protected]>

---
mm/slub.c | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 117 insertions(+)

Index: current/mm/slub.c
===================================================================
--- current.orig/mm/slub.c 2007-10-11 20:31:37.000000000 +0900
+++ current/mm/slub.c 2007-10-11 21:58:10.000000000 +0900
@@ -20,6 +20,7 @@
#include <linux/mempolicy.h>
#include <linux/ctype.h>
#include <linux/kallsyms.h>
+#include <linux/memory.h>

/*
* Lock order:
@@ -2711,6 +2712,120 @@ int kmem_cache_shrink(struct kmem_cache
}
EXPORT_SYMBOL(kmem_cache_shrink);

+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static int slab_mem_going_offline_callback(void *arg)
+{
+ struct kmem_cache *s;
+ struct memory_notify *marg = arg;
+ int local_node, offline_node = marg->status_change_nid;
+
+ if (offline_node < 0)
+ /* node has memory yet. nothing to do. */
+ return 0;
+
+ down_read(&slub_lock);
+ list_for_each_entry(s, &slab_caches, list) {
+ local_node = page_to_nid(virt_to_page(s));
+ if (local_node == offline_node)
+ /* This slub is on the offline node. */
+ return -EBUSY;
+ }
+ up_read(&slub_lock);
+
+ kmem_cache_shrink_node(s, offline_node);
+
+ return 0;
+}
+
+static void slab_mem_offline_callback(void *arg)
+{
+ struct kmem_cache_node *n;
+ struct kmem_cache *s;
+ struct memory_notify *marg = arg;
+ int offline_node;
+
+ offline_node = marg->status_change_nid;
+
+ if (offline_node < 0)
+ /* node has memory yet. nothing to do. */
+ return;
+
+ down_read(&slub_lock);
+ list_for_each_entry(s, &slab_caches, list) {
+ n = get_node(s, offline_node);
+ if (n) {
+ /*
+ * if n->nr_slabs > 0, offline_pages() must be fail,
+ * because the node is used by slub yet.
+ */
+ BUG_ON(atomic_read(&n->nr_slabs));
+
+ s->node[offline_node] = NULL;
+ kmem_cache_free(kmalloc_caches, n);
+ }
+ }
+ up_read(&slub_lock);
+}
+
+static int slab_mem_going_online_callback(void *arg)
+{
+ struct kmem_cache_node *n;
+ struct kmem_cache *s;
+ struct memory_notify *marg = arg;
+ int nid = marg->status_change_nid;
+
+ /* If the node already has memory, then nothing is necessary. */
+ if (nid < 0)
+ return 0;
+
+ /*
+ * New memory will be onlined on the node which has no memory so far.
+ * New kmem_cache_node is necssary for it.
+ */
+ down_read(&slub_lock);
+ list_for_each_entry(s, &slab_caches, list) {
+ /*
+ * XXX: The new node's memory can't be allocated yet,
+ * kmem_cache_node will be allocated other node.
+ */
+ n = kmem_cache_alloc(kmalloc_caches, GFP_KERNEL);
+ if (!n)
+ return -ENOMEM;
+ init_kmem_cache_node(n);
+ s->node[nid] = n;
+ }
+ up_read(&slub_lock);
+
+ return 0;
+}
+
+static int slab_memory_callback(struct notifier_block *self,
+ unsigned long action, void *arg)
+{
+ int ret = 0;
+
+ switch (action) {
+ case MEM_GOING_ONLINE:
+ ret = slab_mem_going_online_callback(arg);
+ break;
+ case MEM_GOING_OFFLINE:
+ ret = slab_mem_going_offline_callback(arg);
+ break;
+ case MEM_OFFLINE:
+ case MEM_CANCEL_ONLINE:
+ slab_mem_offline_callback(arg);
+ break;
+ case MEM_ONLINE:
+ case MEM_CANCEL_OFFLINE:
+ break;
+ }
+
+ ret = notifier_from_errno(ret);
+ return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
/********************************************************************
* Basic setup of slabs
*******************************************************************/
@@ -2741,6 +2856,8 @@ void __init kmem_cache_init(void)
sizeof(struct kmem_cache_node), GFP_KERNEL);
kmalloc_caches[0].refcount = -1;
caches++;
+
+ hotplug_memory_notifier(slab_memory_callback, 1);
#endif

/* Able to allocate the per node structures */

--
Yasunori Goto

2007-10-12 04:09:20

by Christoph Lameter

[permalink] [raw]

Subject: Re: [Patch 002/002] Create/delete kmem_cache_node for SLUB on memory online callback

On Fri, 12 Oct 2007, Yasunori Goto wrote:

> If pages on the new node available, slub can use it before making
> new kmem_cache_nodes. So, this callback should be called
> BEFORE pages on the node are available.

If its called before pages on the node are available then it must
fallback and cannot use the pages.

> +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
> +static int slab_mem_going_offline_callback(void *arg)
> +{
> + struct kmem_cache *s;
> + struct memory_notify *marg = arg;
> + int local_node, offline_node = marg->status_change_nid;
> +
> + if (offline_node < 0)
> + /* node has memory yet. nothing to do. */

Please clarify the comment. This seems to indicate that we should not
do anything because the node still has memory?

Doesnt the node always have memory before offlining?

> + return 0;
> +
> + down_read(&slub_lock);
> + list_for_each_entry(s, &slab_caches, list) {
> + local_node = page_to_nid(virt_to_page(s));
> + if (local_node == offline_node)
> + /* This slub is on the offline node. */
> + return -EBUSY;
> + }
> + up_read(&slub_lock);

So this checks if the any kmem_cache structure is on the offlined node? If
so then we cannot offline the node?

> + kmem_cache_shrink_node(s, offline_node);

kmem_cache_shrink(s) would be okay here I would think. The function is
reasonably fast. Offlining is a rare event.

> +static void slab_mem_offline_callback(void *arg)

We call this after we have established that no kmem_cache structures are
on this and after we have shrunk the slabs. Is there any guarantee that
no slab operations have occurrent since then?

> +{
> + struct kmem_cache_node *n;
> + struct kmem_cache *s;
> + struct memory_notify *marg = arg;
> + int offline_node;
> +
> + offline_node = marg->status_change_nid;
> +
> + if (offline_node < 0)
> + /* node has memory yet. nothing to do. */
> + return;

Does this mean that the node still has memory?

> + down_read(&slub_lock);
> + list_for_each_entry(s, &slab_caches, list) {
> + n = get_node(s, offline_node);
> + if (n) {
> + /*
> + * if n->nr_slabs > 0, offline_pages() must be fail,
> + * because the node is used by slub yet.
> + */

It may be clearer to say:

"If nr_slabs > 0 then slabs still exist on the node that is going down.
We were unable to free them so we must fail."

> +static int slab_mem_going_online_callback(void *arg)
> +{
> + struct kmem_cache_node *n;
> + struct kmem_cache *s;
> + struct memory_notify *marg = arg;
> + int nid = marg->status_change_nid;
> +
> + /* If the node already has memory, then nothing is necessary. */
> + if (nid < 0)
> + return 0;

The node must have memory???? Or we have already brought up the code?

> + /*
> + * New memory will be onlined on the node which has no memory so far.
> + * New kmem_cache_node is necssary for it.

"We are bringing a node online. No memory is available yet. We must
allocate a kmem_cache_node structure in order to bring the node online." ?

> + */
> + down_read(&slub_lock);
> + list_for_each_entry(s, &slab_caches, list) {
> + /*
> + * XXX: The new node's memory can't be allocated yet,
> + * kmem_cache_node will be allocated other node.
> + */

"kmem_cache_alloc node will fallback to other nodes since memory is
not yet available from the node that is brought up.? ?

> + n = kmem_cache_alloc(kmalloc_caches, GFP_KERNEL);
> + if (!n)
> + return -ENOMEM;
> + init_kmem_cache_node(n);
> + s->node[nid] = n;
> + }
> + up_read(&slub_lock);
> +
> + return 0;
> +}

2007-10-12 04:09:35

by Christoph Lameter

[permalink] [raw]

Subject: Re: [Patch 001/002] extract kmem_cache_shrink

On Fri, 12 Oct 2007, Yasunori Goto wrote:

> Make kmem_cache_shrink_node() for callback routine of memory hotplug
> notifier. This is just extract a part of kmem_cache_shrink().

Could we just call kmem_cache_shrink? It will do the shrink on every node
but memory hotplug is rare?

2007-10-12 04:19:09

by Christoph Lameter

[permalink] [raw]

Subject: Re: [Patch 001/002] Make description of memory hotplug notifier in document

Looks good. Some suggestions on improving the wording.

On Fri, 12 Oct 2007, Yasunori Goto wrote:

> +MEMORY_GOING_ONLINE
> + This is notified before memory online. If some structures must be prepared
> + for new memory, it should be done at this event's callback.
> + The new onlining memory can't be used yet.

Generated before new memory becomes available in order to be able to
prepare subsystems to handle memory. The page allocator is still unable
to allocate from the new memory.

> +MEMORY_CANCEL_ONLINE
> + If memory online fails, this event is notified for rollback of setting at
> + MEMORY_GOING_ONLINE.
> + (Currently, this event is notified only the case which a callback routine
> + of MEMORY_GOING_ONLINE fails).

Generated if MEMORY_GOING_ONLINE fails.

> +MEMORY_ONLINE
> + This event is called when memory online is completed. The page allocator uses
> + new memory area before this notification. In other words, callback routine
> + use new memory area via page allocator.
> + The failures of callbacks of this notification will be ignored.

Generated when memory has succesfully brought online. The callback may
allocate from the new memory.

> +MEMORY_GOING_OFFLINE
> + This is notified on halfway of memory offline. The offlining pages are
> + isolated. In other words, the page allocater doesn't allocate new pages from
> + offlining memory area at this time. If callback routine freed some pages,
> + they are not used by the page allocator.
> + This is good place for shrinking cache. (If possible, it is desirable to
> + migrate to other area.)

Generated to begin the process of offlining memory. Allocations are no
longer possible from the memory but some of the memory to be offlined
is still in use. The callback can be used to free memory known to a
subsystem from the indicated node.

> +MEMORY_CANCEL_OFFLINE
> + If memory offline fails, this event is notified for rollback against
> + MEMORY_GOING_OFFLINE. The page allocator will use target memory area after
> + this callback again.

Generated if MEMORY_GOING_OFFLINE fails. Memory is available again from
the node that we attempted to offline.

> + +MEMORY_OFFLINE

Generated after offlining memory is complete.

2007-10-12 04:34:41

by Yasunori Goto

[permalink] [raw]

Subject: Re: [Patch 001/002] Make description of memory hotplug notifier in document

> Looks good. Some suggestions on improving the wording.

Thanks! I'll fix them.

Bye.

>
> On Fri, 12 Oct 2007, Yasunori Goto wrote:
>
> > +MEMORY_GOING_ONLINE
> > + This is notified before memory online. If some structures must be prepared
> > + for new memory, it should be done at this event's callback.
> > + The new onlining memory can't be used yet.
>
> Generated before new memory becomes available in order to be able to
> prepare subsystems to handle memory. The page allocator is still unable
> to allocate from the new memory.
>
> > +MEMORY_CANCEL_ONLINE
> > + If memory online fails, this event is notified for rollback of setting at
> > + MEMORY_GOING_ONLINE.
> > + (Currently, this event is notified only the case which a callback routine
> > + of MEMORY_GOING_ONLINE fails).
>
> Generated if MEMORY_GOING_ONLINE fails.
>
> > +MEMORY_ONLINE
> > + This event is called when memory online is completed. The page allocator uses
> > + new memory area before this notification. In other words, callback routine
> > + use new memory area via page allocator.
> > + The failures of callbacks of this notification will be ignored.
>
> Generated when memory has succesfully brought online. The callback may
> allocate from the new memory.
>
> > +MEMORY_GOING_OFFLINE
> > + This is notified on halfway of memory offline. The offlining pages are
> > + isolated. In other words, the page allocater doesn't allocate new pages from
> > + offlining memory area at this time. If callback routine freed some pages,
> > + they are not used by the page allocator.
> > + This is good place for shrinking cache. (If possible, it is desirable to
> > + migrate to other area.)
>
> Generated to begin the process of offlining memory. Allocations are no
> longer possible from the memory but some of the memory to be offlined
> is still in use. The callback can be used to free memory known to a
> subsystem from the indicated node.
>
> > +MEMORY_CANCEL_OFFLINE
> > + If memory offline fails, this event is notified for rollback against
> > + MEMORY_GOING_OFFLINE. The page allocator will use target memory area after
> > + this callback again.
>
> Generated if MEMORY_GOING_OFFLINE fails. Memory is available again from
> the node that we attempted to offline.
>
> > + +MEMORY_OFFLINE
>
> Generated after offlining memory is complete.

--
Yasunori Goto

2007-10-12 04:42:36

by Yasunori Goto

[permalink] [raw]

Subject: Re: [Patch 001/002] extract kmem_cache_shrink

> On Fri, 12 Oct 2007, Yasunori Goto wrote:
>
> > Make kmem_cache_shrink_node() for callback routine of memory hotplug
> > notifier. This is just extract a part of kmem_cache_shrink().
>
> Could we just call kmem_cache_shrink? It will do the shrink on every node
> but memory hotplug is rare?

Yes it is. Memory hotplug is rare.
Ok. I'll do it.

Thanks.
--
Yasunori Goto

2007-10-12 06:16:38

by Yasunori Goto

[permalink] [raw]

Subject: Re: [Patch 002/002] Create/delete kmem_cache_node for SLUB on memory online callback

> On Fri, 12 Oct 2007, Yasunori Goto wrote:
>
> > If pages on the new node available, slub can use it before making
> > new kmem_cache_nodes. So, this callback should be called
> > BEFORE pages on the node are available.
>
> If its called before pages on the node are available then it must
> fallback and cannot use the pages.

Hmm. My description may be wrong. I would like to just
mention that kmem_cache_node should be created before the node's page
can be allocated. If not, it will cause of panic.

> > +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
> > +static int slab_mem_going_offline_callback(void *arg)
> > +{
> > + struct kmem_cache *s;
> > + struct memory_notify *marg = arg;
> > + int local_node, offline_node = marg->status_change_nid;
> > +
> > + if (offline_node < 0)
> > + /* node has memory yet. nothing to do. */
>
> Please clarify the comment. This seems to indicate that we should not
> do anything because the node still has memory?

Yes. kmem_cache_node is still necessary for remaining memory on the
node.

> Doesnt the node always have memory before offlining?

If node doesn't have memory and offline_pages() called for it,
it must be check and fail. This callback shouldn't be called.
If not, it is bug of memory hotplug, I think.

> > + return 0;
> > +
> > + down_read(&slub_lock);
> > + list_for_each_entry(s, &slab_caches, list) {
> > + local_node = page_to_nid(virt_to_page(s));
> > + if (local_node == offline_node)
> > + /* This slub is on the offline node. */
> > + return -EBUSY;
> > + }
> > + up_read(&slub_lock);
>
> So this checks if the any kmem_cache structure is on the offlined node? If
> so then we cannot offline the node?

Right. If slabs' migration is possible, here would be good place for
doing it. But, it is not possible (at least now).

> > + kmem_cache_shrink_node(s, offline_node);
>
> kmem_cache_shrink(s) would be okay here I would think. The function is
> reasonably fast. Offlining is a rare event.

Ok. I'll fix it.

> > +static void slab_mem_offline_callback(void *arg)
>
> We call this after we have established that no kmem_cache structures are
> on this and after we have shrunk the slabs. Is there any guarantee that
> no slab operations have occurrent since then?

If slabs still exist, it can't be migrated and offline_pages() has
to give up offline. This means MEM_OFFLINE event is not generated when
slabs are on the removing node.
In other word, when this event is generated, all of pages on
this section is isolated and there are no used pages(slabs).

>
> > +{
> > + struct kmem_cache_node *n;
> > + struct kmem_cache *s;
> > + struct memory_notify *marg = arg;
> > + int offline_node;
> > +
> > + offline_node = marg->status_change_nid;
> > +
> > + if (offline_node < 0)
> > + /* node has memory yet. nothing to do. */
> > + return;
>
> Does this mean that the node still has memory?

Yes.

> > + down_read(&slub_lock);
> > + list_for_each_entry(s, &slab_caches, list) {
> > + n = get_node(s, offline_node);
> > + if (n) {
> > + /*
> > + * if n->nr_slabs > 0, offline_pages() must be fail,
> > + * because the node is used by slub yet.
> > + */
>
> It may be clearer to say:
>
> "If nr_slabs > 0 then slabs still exist on the node that is going down.
> We were unable to free them so we must fail."

Again. If nr_slabs > 0, offline_pages must be fail due to slabs
remaining on the node before. So, this callback isn't called.

> > +static int slab_mem_going_online_callback(void *arg)
> > +{
> > + struct kmem_cache_node *n;
> > + struct kmem_cache *s;
> > + struct memory_notify *marg = arg;
> > + int nid = marg->status_change_nid;
> > +
> > + /* If the node already has memory, then nothing is necessary. */
> > + if (nid < 0)
> > + return 0;
>
> The node must have memory???? Or we have already brought up the code?

kmem_cache_node is created at boot time if the node has memory.
(Or, it is created by this callback on first added memory on the node).

When nid = - 1, kmem_cache_node is created before this node due to
node has memory.

>
> > + /*
> > + * New memory will be onlined on the node which has no memory so far.
> > + * New kmem_cache_node is necssary for it.
>
> "We are bringing a node online. No memory is available yet. We must
> allocate a kmem_cache_node structure in order to bring the node online." ?

Your mention might be ok.
But. I would like to prefer to define status of node hotplug for
exactitude like followings

A)Node online -- pgdat is created and can be accessed for this node.
but there are no gurantee that cpu or memory is onlined.
This status is very close from memory-less node.
But this might be halfway status for node hotplug.
Node online bit is set. But N_HIGH_MEMORY
(or N_NORMAL_MEMORY) might be not set.

B)Node has memory--
one or more sections memory is onlined on the node.
N_HIGH_MEMORY (or N_NORMAL_MEMORY) is set.

If first memory is onlined on the node, the node status changes
from A) to B).

I feel this is very useful to manage "halfway status" of node
hotplug. (So, memory-less node patch is very helpful for me.)

So, I would like to avoid using the word "node online" at here.
But, if above definition is messy for others, I'll change it.

> > + */
> > + down_read(&slub_lock);
> > + list_for_each_entry(s, &slab_caches, list) {
> > + /*
> > + * XXX: The new node's memory can't be allocated yet,
> > + * kmem_cache_node will be allocated other node.
> > + */
>
> "kmem_cache_alloc node will fallback to other nodes since memory is
> not yet available from the node that is brought up."?

Yes.

Thanks for your comment.

--
Yasunori Goto

2007-10-12 17:20:44

by Christoph Lameter

[permalink] [raw]

Subject: Re: [Patch 002/002] Create/delete kmem_cache_node for SLUB on memory online callback

On Fri, 12 Oct 2007, Yasunori Goto wrote:

> > > + down_read(&slub_lock);
> > > + list_for_each_entry(s, &slab_caches, list) {
> > > + local_node = page_to_nid(virt_to_page(s));
> > > + if (local_node == offline_node)
> > > + /* This slub is on the offline node. */
> > > + return -EBUSY;
> > > + }
> > > + up_read(&slub_lock);
> >
> > So this checks if the any kmem_cache structure is on the offlined node? If
> > so then we cannot offline the node?
>
> Right. If slabs' migration is possible, here would be good place for
> doing it. But, it is not possible (at least now).

I think you can avoid this check. The kmem_cache structures are allocated
from the kmalloc array. The check if the kmalloc slabs are empty will fail
if kmem_cache structures still exist on the node.

> > > + * because the node is used by slub yet.
> > > + */
> >
> > It may be clearer to say:
> >
> > "If nr_slabs > 0 then slabs still exist on the node that is going down.
> > We were unable to free them so we must fail."
>
> Again. If nr_slabs > 0, offline_pages must be fail due to slabs
> remaining on the node before. So, this callback isn't called.

Ok then we can remove these checks?

> > > +static int slab_mem_going_online_callback(void *arg)
> > > +{
> > > + struct kmem_cache_node *n;
> > > + struct kmem_cache *s;
> > > + struct memory_notify *marg = arg;
> > > + int nid = marg->status_change_nid;
> > > +
> > > + /* If the node already has memory, then nothing is necessary. */
> > > + if (nid < 0)
> > > + return 0;
> >
> > The node must have memory???? Or we have already brought up the code?
>
> kmem_cache_node is created at boot time if the node has memory.
> (Or, it is created by this callback on first added memory on the node).
>
> When nid = - 1, kmem_cache_node is created before this node due to
> node has memory.

So the function can be called for a node that is already online?

> > > + * New memory will be onlined on the node which has no memory so far.
> > > + * New kmem_cache_node is necssary for it.
> >
> > "We are bringing a node online. No memory is available yet. We must
> > allocate a kmem_cache_node structure in order to bring the node online." ?
>
> Your mention might be ok.
> But. I would like to prefer to define status of node hotplug for
> exactitude like followings
>
>
> A)Node online -- pgdat is created and can be accessed for this node.
> but there are no gurantee that cpu or memory is onlined.
> This status is very close from memory-less node.
> But this might be halfway status for node hotplug.
> Node online bit is set. But N_HIGH_MEMORY
> (or N_NORMAL_MEMORY) might be not set.

Ahh.. Okay.

> B)Node has memory--
> one or more sections memory is onlined on the node.
> N_HIGH_MEMORY (or N_NORMAL_MEMORY) is set.
>
> If first memory is onlined on the node, the node status changes
> from A) to B).
>
> I feel this is very useful to manage "halfway status" of node
> hotplug. (So, memory-less node patch is very helpful for me.)
>
> So, I would like to avoid using the word "node online" at here.
> But, if above definition is messy for others, I'll change it.

Ok can we talk about this as

node online

and

node memory available?

2007-10-13 05:01:21

by Yasunori Goto

[permalink] [raw]

Subject: Re: [Patch 002/002] Create/delete kmem_cache_node for SLUB on memory online callback

> On Fri, 12 Oct 2007, Yasunori Goto wrote:
>
> > > > + down_read(&slub_lock);
> > > > + list_for_each_entry(s, &slab_caches, list) {
> > > > + local_node = page_to_nid(virt_to_page(s));
> > > > + if (local_node == offline_node)
> > > > + /* This slub is on the offline node. */
> > > > + return -EBUSY;
> > > > + }
> > > > + up_read(&slub_lock);
> > >
> > > So this checks if the any kmem_cache structure is on the offlined node? If
> > > so then we cannot offline the node?
> >
> > Right. If slabs' migration is possible, here would be good place for
> > doing it. But, it is not possible (at least now).
>
> I think you can avoid this check. The kmem_cache structures are allocated
> from the kmalloc array. The check if the kmalloc slabs are empty will fail
> if kmem_cache structures still exist on the node.

Ah, Ok.

>
> > > > + * because the node is used by slub yet.
> > > > + */
> > >
> > > It may be clearer to say:
> > >
> > > "If nr_slabs > 0 then slabs still exist on the node that is going down.
> > > We were unable to free them so we must fail."
> >
> > Again. If nr_slabs > 0, offline_pages must be fail due to slabs
> > remaining on the node before. So, this callback isn't called.
>
> Ok then we can remove these checks?

Hmm. Yes. I'll remove it.

>
> > > > +static int slab_mem_going_online_callback(void *arg)
> > > > +{
> > > > + struct kmem_cache_node *n;
> > > > + struct kmem_cache *s;
> > > > + struct memory_notify *marg = arg;
> > > > + int nid = marg->status_change_nid;
> > > > +
> > > > + /* If the node already has memory, then nothing is necessary. */
> > > > + if (nid < 0)
> > > > + return 0;
> > >
> > > The node must have memory???? Or we have already brought up the code?
> >
> > kmem_cache_node is created at boot time if the node has memory.
> > (Or, it is created by this callback on first added memory on the node).
> >
> > When nid = - 1, kmem_cache_node is created before this node due to
> > node has memory.
>
> So the function can be called for a node that is already online?

already "node memory available", accurately ;-)

>
> > > > + * New memory will be onlined on the node which has no memory so far.
> > > > + * New kmem_cache_node is necssary for it.
> > >
> > > "We are bringing a node online. No memory is available yet. We must
> > > allocate a kmem_cache_node structure in order to bring the node online." ?
> >
> > Your mention might be ok.
> > But. I would like to prefer to define status of node hotplug for
> > exactitude like followings
> >
> >
> > A)Node online -- pgdat is created and can be accessed for this node.
> > but there are no gurantee that cpu or memory is onlined.
> > This status is very close from memory-less node.
> > But this might be halfway status for node hotplug.
> > Node online bit is set. But N_HIGH_MEMORY
> > (or N_NORMAL_MEMORY) might be not set.
>
> Ahh.. Okay.
>
> > B)Node has memory--
> > one or more sections memory is onlined on the node.
> > N_HIGH_MEMORY (or N_NORMAL_MEMORY) is set.
> >
> > If first memory is onlined on the node, the node status changes
> > from A) to B).
> >
> > I feel this is very useful to manage "halfway status" of node
> > hotplug. (So, memory-less node patch is very helpful for me.)
> >
> > So, I would like to avoid using the word "node online" at here.
> > But, if above definition is messy for others, I'll change it.
>
> Ok can we talk about this as
>
> node online
>
> and
>
> node memory available?

Yes. Thanks.

--
Yasunori Goto