2008-07-31 11:54:38

by Yasunori Goto

Subject: [RFC:Patch: 000/008](memory hotplug) rough idea of pgdat removing

Hello.

This patch set is a first trial, and describes my rough idea of
"how to remove a pgdat".

I would like to use this post to confirm whether my current idea is a
good approach. The patches are incomplete and not tested yet; if the
approach is good, I'll continue to develop and test them.

I think pgdat removal is a difficult issue, because currently no code
expects a pgdat to disappear, and everything accesses pgdats without any
locking. But the pgdat remover must wait for those accesses to finish,
because the node may be removed electrically soon afterward.

My current idea is to use the RCU feature for this waiting, because it
has the least impact on reader performance, and the pgdat remover can
wait for readers to finish accessing the pgdat being removed via
synchronize_sched().

So, I made the following read locks for accessing pgdats:
- pgdat_remove_read_lock()/unlock()
- pgdat_remove_read_lock_sleepable()/unlock_sleepable()
These definitions use rcu_read_lock() and srcu_read_lock().

The writer uses node_set_offline(), which uses clear_bit(),
and build_all_zonelists() with stop_machine_run().


There are a few types of pgdat access.

1) Via the node online bitmap.
Much code uses for_each_xxx_node(), for_each_zone(), and so on.
Such code must be wrapped in pgdat_remove_read_lock()/unlock().

2) mempolicy
There is a callback interface invoked when memory offlining runs.
mempolicy must use these callbacks to stop using the node being removed.
This patch set includes a quite simple (sample) patch to show what will
be required. However, a more detailed specification will be necessary.
(For example, when the preferred node of a mempolicy is being removed,
what should the kernel do?)

3) zonelist
alloc_pages() accesses zones via a zonelist. However, zones may also be
removed by the pgdat remover. So the code must allow for zones being
removed while it walks a zonelist, and the walk must be guarded by
pgdat_remove_read_lock() and unlock().

4) Via NODE_DATA() with a node id.
This type of access is done with numa_node_id() in many cases.
Basically, the CPUs on a node must be removed before the node itself is
removed, so I used BUG_ON() for the case where numa_node_id() points to
an offlined node.

If the node id is obtained some other way, node_online() must be
checked, and the code must bail out when the node is offline.


If my idea is a bad approach, the other ways I can think of are:
- a read/write lock (it probably shouldn't be used...)
- collecting all pgdats on one node (depends on performance)

If you have a better idea, please let me know.


Note:
- I don't add pgdat_remove_read_lock() in boot code, because pgdat
hot-removal cannot happen at boot time. (But I may still have overlooked
some places that need pgdat_remove_read_lock().)


Thanks.


--
Yasunori Goto


2008-07-31 11:58:18

by Yasunori Goto

Subject: [RFC:Patch: 002/008](memory hotplug) pgdat_remove_read_lock/unlock


These are the definitions of pgdat_remove_read_lock() and
pgdat_remove_read_lock_sleepable().


Signed-off-by: Yasunori Goto <[email protected]>


---
include/linux/memory_hotplug.h | 25 +++++++++++++++++++++++++
mm/memory_hotplug.c | 12 ++++++++++++
2 files changed, 37 insertions(+)

Index: current/include/linux/memory_hotplug.h
===================================================================
--- current.orig/include/linux/memory_hotplug.h 2008-07-29 21:19:13.000000000 +0900
+++ current/include/linux/memory_hotplug.h 2008-07-29 21:19:17.000000000 +0900
@@ -20,6 +20,31 @@ struct mem_section;
#define MIX_SECTION_INFO (-1 - 2)
#define NODE_INFO (-1 - 3)

+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTREMOVE)
+/*
+ * pgdat removing lock
+ */
+extern struct srcu_struct pgdat_remove_srcu;
+#define pgdat_remove_read_lock() rcu_read_lock()
+#define pgdat_remove_read_unlock() rcu_read_unlock()
+#define pgdat_remove_read_lock_sleepable() srcu_read_lock(&pgdat_remove_srcu)
+#define pgdat_remove_read_unlock_sleepable(idx) \
+ srcu_read_unlock(&pgdat_remove_srcu, idx)
+#else
+static inline void pgdat_remove_read_lock(void)
+{
+}
+static inline void pgdat_remove_read_unlock(void)
+{
+}
+static inline int pgdat_remove_read_lock_sleepable(void)
+{
+ return 0;
+}
+static inline void pgdat_remove_read_unlock_sleepable(int idx)
+{
+}
+#endif
+
/*
* pgdat resizing functions
*/
Index: current/mm/memory_hotplug.c
===================================================================
--- current.orig/mm/memory_hotplug.c 2008-07-29 21:19:13.000000000 +0900
+++ current/mm/memory_hotplug.c 2008-07-29 22:17:38.000000000 +0900
@@ -31,6 +31,10 @@

#include "internal.h"

+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTREMOVE)
+struct srcu_struct pgdat_remove_srcu;
+#endif
+
/* add this memory to iomem resource */
static struct resource *register_memory_resource(u64 start, u64 size)
{
@@ -850,6 +854,14 @@ failed_removal:

return ret;
}
+
+static int __init init_pgdat_remove_lock_sleepable(void)
+{
+ init_srcu_struct(&pgdat_remove_srcu);
+ return 0;
+}
+
+subsys_initcall(init_pgdat_remove_lock_sleepable);
#else
int remove_memory(u64 start, u64 size)
{

--
Yasunori Goto

2008-07-31 11:59:50

by Yasunori Goto

Subject: [RFC:Patch: 003/008](memory hotplug) check node online in __alloc_pages


This patch adds pgdat_remove_read_lock()/unlock() around the zonelist
walks in __alloc_pages_internal().
Since the node might be removed before pgdat_remove_read_lock() is
taken, node_online() must be checked first. If the node is offline, the
zonelist must not be walked.

Signed-off-by: Yasunori Goto <[email protected]>

---
mm/page_alloc.c | 36 ++++++++++++++++++++++++++++++++++--
1 file changed, 34 insertions(+), 2 deletions(-)

Index: current/mm/page_alloc.c
===================================================================
--- current.orig/mm/page_alloc.c 2008-07-31 19:01:46.000000000 +0900
+++ current/mm/page_alloc.c 2008-07-31 19:19:19.000000000 +0900
@@ -1394,10 +1394,22 @@ get_page_from_freelist(gfp_t gfp_mask, n
int zlc_active = 0; /* set if using zonelist_cache */
int did_zlc_setup = 0; /* just call zlc_setup() one time */

+ pgdat_remove_read_lock();
+ if (unlikely(!node_online(zonelist_nid))) {
+ /*
+ * Pgdat removing worked before here.
+ * Don't touch pgdat/zone/zonelist any more.
+ */
+ pgdat_remove_read_unlock();
+ return NULL;
+ }
+
(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
&preferred_zone);
- if (!preferred_zone)
+ if (!preferred_zone) {
+ pgdat_remove_read_unlock();
return NULL;
+ }

classzone_idx = zone_idx(preferred_zone);

@@ -1451,6 +1463,7 @@ try_next_zone:
zlc_active = 0;
goto zonelist_scan;
}
+ pgdat_remove_read_unlock();
return page;
}

@@ -1536,10 +1549,21 @@ __alloc_pages_internal(gfp_t gfp_mask, u
return NULL;

restart:
+ pgdat_remove_read_lock();
+ if (unlikely(!node_online(zonelist_nid))) {
+ /*
+ * pgdat removing worked before here.
+ * zone & zonelist can't be touched.
+ */
+ pgdat_remove_read_unlock();
+ goto nopage;
+ }
zonelist = node_zonelist(zonelist_nid, gfp_mask);
z = zonelist->_zonerefs; /* the list of zones suitable for gfp_mask */
+ zone = z->zone;
+ pgdat_remove_read_unlock();

- if (unlikely(!z->zone)) {
+ if (unlikely(!zone)) {
/*
* Happens if we have an empty zonelist as a result of
* GFP_THISNODE being used on a memoryless node
@@ -1565,9 +1589,17 @@ restart:
if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
goto nopage;

+ pgdat_remove_read_lock();
+
+ if (unlikely(!node_online(zonelist_nid))) {
+ pgdat_remove_read_unlock();
+ goto nopage;
+ }
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
wakeup_kswapd(zone, order);

+ pgdat_remove_read_unlock();
+
/*
* OK, we're below the kswapd watermark and have kicked background
* reclaim. Now things get more complex, so set up alloc_flags according

--
Yasunori Goto

2008-07-31 12:02:19

by Yasunori Goto

Subject: [RFC:Patch: 004/008](memory hotplug) Use lock for for_each_online_node


This patch adds pgdat_remove_read_lock() and unlock() around users of
for_each_online_node() (and for_each_node_state()).

(for_each_zone() also needs the same lock, but I have not implemented
that yet.)


Signed-off-by: Yasunori Goto <[email protected]>

---
fs/buffer.c | 4 +++-
mm/mempolicy.c | 9 ++++++++-
mm/page-writeback.c | 2 ++
mm/page_alloc.c | 9 ++++++++-
mm/vmscan.c | 2 ++
mm/vmstat.c | 3 +++
6 files changed, 26 insertions(+), 3 deletions(-)

Index: current/mm/page_alloc.c
===================================================================
--- current.orig/mm/page_alloc.c 2008-07-29 21:21:33.000000000 +0900
+++ current/mm/page_alloc.c 2008-07-29 22:17:44.000000000 +0900
@@ -2345,6 +2345,7 @@ static int default_zonelist_order(void)
/* Is there ZONE_NORMAL ? (ex. ppc has only DMA zone..) */
low_kmem_size = 0;
total_size = 0;
+ pgdat_remove_read_lock();
for_each_online_node(nid) {
for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
z = &NODE_DATA(nid)->node_zones[zone_type];
@@ -2355,6 +2356,7 @@ static int default_zonelist_order(void)
}
}
}
+ pgdat_remove_read_unlock();
if (!low_kmem_size || /* there are no DMA area. */
low_kmem_size > total_size/2) /* DMA/DMA32 is big. */
return ZONELIST_ORDER_NODE;
@@ -2365,6 +2367,8 @@ static int default_zonelist_order(void)
*/
average_size = total_size /
(nodes_weight(node_states[N_HIGH_MEMORY]) + 1);
+
+ pgdat_remove_read_lock();
for_each_online_node(nid) {
low_kmem_size = 0;
total_size = 0;
@@ -2378,9 +2382,12 @@ static int default_zonelist_order(void)
}
if (low_kmem_size &&
total_size > average_size && /* ignore small node */
- low_kmem_size > total_size * 70/100)
+ low_kmem_size > total_size * 70/100){
+ pgdat_remove_read_unlock();
return ZONELIST_ORDER_NODE;
+ }
}
+ pgdat_remove_read_unlock();
return ZONELIST_ORDER_ZONE;
}

Index: current/mm/vmscan.c
===================================================================
--- current.orig/mm/vmscan.c 2008-07-29 21:20:42.000000000 +0900
+++ current/mm/vmscan.c 2008-07-29 22:17:44.000000000 +0900
@@ -2170,6 +2170,7 @@ static int __devinit cpu_callback(struct
int nid;

if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN) {
+ pgdat_remove_read_lock();
for_each_node_state(nid, N_HIGH_MEMORY) {
pg_data_t *pgdat = NODE_DATA(nid);
node_to_cpumask_ptr(mask, pgdat->node_id);
@@ -2178,6 +2179,7 @@ static int __devinit cpu_callback(struct
/* One of our CPUs online: restore mask */
set_cpus_allowed_ptr(pgdat->kswapd, mask);
}
+ pgdat_remove_read_unlock();
}
return NOTIFY_OK;
}
Index: current/mm/page-writeback.c
===================================================================
--- current.orig/mm/page-writeback.c 2008-07-29 21:20:42.000000000 +0900
+++ current/mm/page-writeback.c 2008-07-29 21:23:11.000000000 +0900
@@ -325,12 +325,14 @@ static unsigned long highmem_dirtyable_m
int node;
unsigned long x = 0;

+ pgdat_remove_read_lock();
for_each_node_state(node, N_HIGH_MEMORY) {
struct zone *z =
&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];

x += zone_page_state(z, NR_FREE_PAGES) + zone_lru_pages(z);
}
+ pgdat_remove_read_unlock();
/*
* Make sure that the number of highmem pages is never larger
* than the number of the total dirtyable memory. This can only
Index: current/mm/mempolicy.c
===================================================================
--- current.orig/mm/mempolicy.c 2008-07-29 21:20:42.000000000 +0900
+++ current/mm/mempolicy.c 2008-07-29 22:17:44.000000000 +0900
@@ -129,15 +129,19 @@ static int is_valid_nodemask(const nodem
/* Check that there is something useful in this mask */
k = policy_zone;

+ pgdat_remove_read_lock();
for_each_node_mask(nd, *nodemask) {
struct zone *z;

for (k = 0; k <= policy_zone; k++) {
z = &NODE_DATA(nd)->node_zones[k];
- if (z->present_pages > 0)
+ if (z->present_pages > 0) {
+ pgdat_remove_read_unlock();
return 1;
+ }
}
}
+ pgdat_remove_read_unlock();

return 0;
}
@@ -1930,6 +1934,8 @@ void __init numa_policy_init(void)
* fall back to the largest node if they're all smaller.
*/
nodes_clear(interleave_nodes);
+
+ pgdat_remove_read_lock(); /* node_present_pages accesses pgdat */
for_each_node_state(nid, N_HIGH_MEMORY) {
unsigned long total_pages = node_present_pages(nid);

@@ -1943,6 +1949,7 @@ void __init numa_policy_init(void)
if ((total_pages << PAGE_SHIFT) >= (16 << 20))
node_set(nid, interleave_nodes);
}
+ pgdat_remove_read_unlock();

/* All too small, use the largest */
if (unlikely(nodes_empty(interleave_nodes)))
Index: current/fs/buffer.c
===================================================================
--- current.orig/fs/buffer.c 2008-07-29 21:20:42.000000000 +0900
+++ current/fs/buffer.c 2008-07-29 21:23:11.000000000 +0900
@@ -369,11 +369,12 @@ void invalidate_bdev(struct block_device
static void free_more_memory(void)
{
struct zone *zone;
- int nid;
+ int nid, idx;

wakeup_pdflush(1024);
yield();

+ idx = pgdat_remove_read_lock_sleepable();
for_each_online_node(nid) {
(void)first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
gfp_zone(GFP_NOFS), NULL,
@@ -382,6 +383,7 @@ static void free_more_memory(void)
try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
GFP_NOFS);
}
+ pgdat_remove_read_unlock_sleepable(idx);
}

/*
Index: current/mm/vmstat.c
===================================================================
--- current.orig/mm/vmstat.c 2008-07-29 22:06:46.000000000 +0900
+++ current/mm/vmstat.c 2008-07-29 22:07:13.000000000 +0900
@@ -400,6 +400,8 @@ static void *frag_start(struct seq_file
{
pg_data_t *pgdat;
loff_t node = *pos;
+
+ pgdat_remove_read_lock();
for (pgdat = first_online_pgdat();
pgdat && node;
pgdat = next_online_pgdat(pgdat))
@@ -418,6 +420,7 @@ static void *frag_next(struct seq_file *

static void frag_stop(struct seq_file *m, void *arg)
{
+ pgdat_remove_read_unlock();
}

/* Walk all the zones in a node and print using a callback */

--
Yasunori Goto

2008-07-31 12:02:53

by Yasunori Goto

Subject: [RFC:Patch: 001/008](memory hotplug) change parameter from pointer of zonelist to node id


This is a preparation patch for the later patches.
The current code passes a zonelist pointer to __alloc_pages() to specify
which zonelist should be used. That parameter effectively also says
which node's (pgdat's) zonelist should be walked.

This patch changes the interface from a zonelist pointer to a node id
(zonelist_nid) identifying the node that owns the target zonelist,
because a node id makes it easy to check whether the node is online or
offline.


Signed-off-by: Yasunori Goto <[email protected]>

---
include/linux/gfp.h | 12 +++++------
include/linux/mempolicy.h | 2 -
mm/hugetlb.c | 4 ++-
mm/mempolicy.c | 50 +++++++++++++++++++++++-----------------------
mm/page_alloc.c | 21 ++++++++++++-------
5 files changed, 49 insertions(+), 40 deletions(-)

Index: current/include/linux/gfp.h
===================================================================
--- current.orig/include/linux/gfp.h 2008-07-31 18:54:09.000000000 +0900
+++ current/include/linux/gfp.h 2008-07-31 18:54:18.000000000 +0900
@@ -175,20 +175,20 @@ static inline void arch_alloc_page(struc

struct page *
__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
- struct zonelist *zonelist, nodemask_t *nodemask);
+ int zonelist_nid, nodemask_t *nodemask);

static inline struct page *
__alloc_pages(gfp_t gfp_mask, unsigned int order,
- struct zonelist *zonelist)
+ int zonelist_nid)
{
- return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
+ return __alloc_pages_internal(gfp_mask, order, zonelist_nid, NULL);
}

static inline struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
- struct zonelist *zonelist, nodemask_t *nodemask)
+ int zonelist_nid, nodemask_t *nodemask)
{
- return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
+ return __alloc_pages_internal(gfp_mask, order, zonelist_nid, nodemask);
}


@@ -202,7 +202,7 @@ static inline struct page *alloc_pages_n
if (nid < 0)
nid = numa_node_id();

- return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
+ return __alloc_pages(gfp_mask, order, nid);
}

#ifdef CONFIG_NUMA
Index: current/mm/mempolicy.c
===================================================================
--- current.orig/mm/mempolicy.c 2008-07-31 18:54:09.000000000 +0900
+++ current/mm/mempolicy.c 2008-07-31 18:54:59.000000000 +0900
@@ -1329,7 +1329,7 @@ static nodemask_t *policy_nodemask(gfp_t
}

/* Return a zonelist indicated by gfp for node representing a mempolicy */
-static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy)
+static int policy_node(gfp_t gfp, struct mempolicy *policy)
{
int nd = numa_node_id();

@@ -1354,7 +1354,7 @@ static struct zonelist *policy_zonelist(
default:
BUG();
}
- return node_zonelist(nd, gfp);
+ return nd;
}

/* Do dynamic interleaving for a process */
@@ -1459,36 +1459,35 @@ static inline unsigned interleave_nid(st

#ifdef CONFIG_HUGETLBFS
/*
- * huge_zonelist(@vma, @addr, @gfp_flags, @mpol)
+ * huge_node(@vma, @addr, @gfp_flags, @mpol)
* @vma = virtual memory area whose policy is sought
* @addr = address in @vma for shared policy lookup and interleave policy
* @gfp_flags = for requested zone
* @mpol = pointer to mempolicy pointer for reference counted mempolicy
* @nodemask = pointer to nodemask pointer for MPOL_BIND nodemask
*
- * Returns a zonelist suitable for a huge page allocation and a pointer
+ * Returns node id suitable for a huge page allocation and a pointer
* to the struct mempolicy for conditional unref after allocation.
* If the effective policy is 'BIND, returns a pointer to the mempolicy's
* @nodemask for filtering the zonelist.
*/
-struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
- gfp_t gfp_flags, struct mempolicy **mpol,
- nodemask_t **nodemask)
+int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
+ struct mempolicy **mpol, nodemask_t **nodemask)
{
- struct zonelist *zl;
+ int nid;

*mpol = get_vma_policy(current, vma, addr);
*nodemask = NULL; /* assume !MPOL_BIND */

- if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) {
- zl = node_zonelist(interleave_nid(*mpol, vma, addr,
- huge_page_shift(hstate_vma(vma))), gfp_flags);
- } else {
- zl = policy_zonelist(gfp_flags, *mpol);
+ if (unlikely((*mpol)->mode == MPOL_INTERLEAVE))
+ nid = interleave_nid(*mpol, vma, addr,
+ huge_page_shift(hstate_vma(vma)));
+ else {
+ nid = policy_node(gfp_flags, *mpol);
if ((*mpol)->mode == MPOL_BIND)
*nodemask = &(*mpol)->v.nodes;
}
- return zl;
+ return nid;
}
#endif

@@ -1497,13 +1496,15 @@ struct zonelist *huge_zonelist(struct vm
static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
unsigned nid)
{
- struct zonelist *zl;
struct page *page;

- zl = node_zonelist(nid, gfp);
- page = __alloc_pages(gfp, order, zl);
- if (page && page_zone(page) == zonelist_zone(&zl->_zonerefs[0]))
- inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
+ page = __alloc_pages(gfp, order, nid);
+ if (page) {
+ struct zonelist *zl;
+ zl = node_zonelist(nid, gfp);
+ if (page_zone(page) == zonelist_zone(&zl->_zonerefs[0]))
+ inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
+ }
return page;
}

@@ -1533,31 +1534,30 @@ struct page *
alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
{
struct mempolicy *pol = get_vma_policy(current, vma, addr);
- struct zonelist *zl;
+ int nid;

cpuset_update_task_memory_state();

if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
- unsigned nid;

nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
mpol_cond_put(pol);
return alloc_page_interleave(gfp, 0, nid);
}
- zl = policy_zonelist(gfp, pol);
+ nid = policy_node(gfp, pol);
if (unlikely(mpol_needs_cond_ref(pol))) {
/*
* slow path: ref counted shared policy
*/
struct page *page = __alloc_pages_nodemask(gfp, 0,
- zl, policy_nodemask(gfp, pol));
+ nid, policy_nodemask(gfp, pol));
__mpol_put(pol);
return page;
}
/*
* fast path: default or task policy
*/
- return __alloc_pages_nodemask(gfp, 0, zl, policy_nodemask(gfp, pol));
+ return __alloc_pages_nodemask(gfp, 0, nid, policy_nodemask(gfp, pol));
}

/**
@@ -1595,7 +1595,7 @@ struct page *alloc_pages_current(gfp_t g
if (pol->mode == MPOL_INTERLEAVE)
return alloc_page_interleave(gfp, order, interleave_nodes(pol));
return __alloc_pages_nodemask(gfp, order,
- policy_zonelist(gfp, pol), policy_nodemask(gfp, pol));
+ policy_node(gfp, pol), policy_nodemask(gfp, pol));
}
EXPORT_SYMBOL(alloc_pages_current);

Index: current/mm/page_alloc.c
===================================================================
--- current.orig/mm/page_alloc.c 2008-07-31 18:54:09.000000000 +0900
+++ current/mm/page_alloc.c 2008-07-31 19:01:46.000000000 +0900
@@ -1383,7 +1383,8 @@ static void zlc_mark_zone_full(struct zo
*/
static struct page *
get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
- struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
+ struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
+ int zonelist_nid)
{
struct zoneref *z;
struct page *page = NULL;
@@ -1514,7 +1515,7 @@ static void set_page_owner(struct page *
*/
struct page *
__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
- struct zonelist *zonelist, nodemask_t *nodemask)
+ int zonelist_nid, nodemask_t *nodemask)
{
const gfp_t wait = gfp_mask & __GFP_WAIT;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
@@ -1527,6 +1528,7 @@ __alloc_pages_internal(gfp_t gfp_mask, u
int alloc_flags;
unsigned long did_some_progress;
unsigned long pages_reclaimed = 0;
+ struct zonelist *zonelist;

might_sleep_if(wait);

@@ -1534,6 +1536,7 @@ __alloc_pages_internal(gfp_t gfp_mask, u
return NULL;

restart:
+ zonelist = node_zonelist(zonelist_nid, gfp_mask);
z = zonelist->_zonerefs; /* the list of zones suitable for gfp_mask */

if (unlikely(!z->zone)) {
@@ -1545,7 +1548,9 @@ restart:
}

page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
- zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+ zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
+ zonelist_nid);
+
if (page)
goto got_pg;

@@ -1590,7 +1595,7 @@ restart:
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
- high_zoneidx, alloc_flags);
+ high_zoneidx, alloc_flags, zonelist_nid);
if (page)
goto got_pg;

@@ -1603,7 +1608,8 @@ rebalance:
nofail_alloc:
/* go through the zonelist yet again, ignoring mins */
page = get_page_from_freelist(gfp_mask, nodemask, order,
- zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
+ zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
+ zonelist_nid);
if (page)
goto got_pg;
if (gfp_mask & __GFP_NOFAIL) {
@@ -1638,7 +1644,8 @@ nofail_alloc:

if (likely(did_some_progress)) {
page = get_page_from_freelist(gfp_mask, nodemask, order,
- zonelist, high_zoneidx, alloc_flags);
+ zonelist, high_zoneidx, alloc_flags,
+ zonelist_nid);
if (page)
goto got_pg;
} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
@@ -1655,7 +1662,7 @@ nofail_alloc:
*/
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
order, zonelist, high_zoneidx,
- ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+ ALLOC_WMARK_HIGH|ALLOC_CPUSET, zonelist_nid);
if (page) {
clear_zonelist_oom(zonelist, gfp_mask);
goto got_pg;
Index: current/mm/hugetlb.c
===================================================================
--- current.orig/mm/hugetlb.c 2008-07-31 18:54:09.000000000 +0900
+++ current/mm/hugetlb.c 2008-07-31 18:54:18.000000000 +0900
@@ -411,8 +411,9 @@ static struct page *dequeue_huge_page_vm
struct page *page = NULL;
struct mempolicy *mpol;
nodemask_t *nodemask;
- struct zonelist *zonelist = huge_zonelist(vma, address,
+ int zonelist_nid = huge_node(vma, address,
htlb_alloc_mask, &mpol, &nodemask);
+ struct zonelist *zonelist;
struct zone *zone;
struct zoneref *z;

@@ -429,6 +430,7 @@ static struct page *dequeue_huge_page_vm
if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
return NULL;

+ zonelist = node_zonelist(zonelist_nid, htlb_alloc_mask);
for_each_zone_zonelist_nodemask(zone, z, zonelist,
MAX_NR_ZONES - 1, nodemask) {
nid = zone_to_nid(zone);
Index: current/include/linux/mempolicy.h
===================================================================
--- current.orig/include/linux/mempolicy.h 2008-07-31 18:54:09.000000000 +0900
+++ current/include/linux/mempolicy.h 2008-07-31 18:54:18.000000000 +0900
@@ -197,7 +197,7 @@ extern void mpol_rebind_task(struct task
extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
extern void mpol_fix_fork_child_flag(struct task_struct *p);

-extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
+extern int huge_node(struct vm_area_struct *vma,
unsigned long addr, gfp_t gfp_flags,
struct mempolicy **mpol, nodemask_t **nodemask);
extern unsigned slab_node(struct mempolicy *policy);

--
Yasunori Goto

2008-07-31 12:03:43

by Yasunori Goto

Subject: [RFC:Patch: 006/008](memory hotplug) kswapd_stop() definition


This patch adds kswapd_stop().
kswapd must be stopped before its node is removed.


Signed-off-by: Yasunori Goto <[email protected]>

---
include/linux/swap.h | 3 +++
mm/vmscan.c | 13 +++++++++++++
2 files changed, 16 insertions(+)

Index: current/mm/vmscan.c
===================================================================
--- current.orig/mm/vmscan.c 2008-07-29 22:17:16.000000000 +0900
+++ current/mm/vmscan.c 2008-07-29 22:17:16.000000000 +0900
@@ -1985,6 +1985,9 @@ static int kswapd(void *p)
}
finish_wait(&pgdat->kswapd_wait, &wait);

+ if (kthread_should_stop())
+ break;
+
if (!try_to_freeze()) {
/* We can speed up thawing tasks if we don't call
* balance_pgdat after returning from the refrigerator
@@ -2216,6 +2219,16 @@ int kswapd_run(int nid)
return ret;
}

+#ifdef CONFIG_MEMORY_HOTREMOVE
+void kswapd_stop(int nid)
+{
+ pg_data_t *pgdat = NODE_DATA(nid);
+
+ if (pgdat->kswapd)
+ kthread_stop(pgdat->kswapd);
+}
+#endif
+
static int __init kswapd_init(void)
{
int nid;
Index: current/include/linux/swap.h
===================================================================
--- current.orig/include/linux/swap.h 2008-07-29 21:20:02.000000000 +0900
+++ current/include/linux/swap.h 2008-07-29 22:17:16.000000000 +0900
@@ -262,6 +262,9 @@ static inline void scan_unevictable_unre
#endif

extern int kswapd_run(int nid);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+extern void kswapd_stop(int nid);
+#endif

#ifdef CONFIG_MMU
/* linux/mm/shmem.c */

--
Yasunori Goto

2008-07-31 12:04:49

by Yasunori Goto

Subject: [RFC:Patch: 007/008](memory hotplug) callback routine for mempolicy


This patch is very incomplete (it includes dummy code), but I would like
to show what mempolicy has to do when a node is removed.

Basically, the user should change any mempolicy that uses a node before
removing that node. However, the user may not even know a mempolicy is
in use, because software may set one automatically. The kernel must
guarantee that a removed node will no longer be used.

There is a callback invoked when memory goes offline, in which mempolicy
can update each task's policies.

There are some open issues:
- If nodes_weight(pol->v.nodes) drops to 0 because of node removal, the
kernel will not be able to allocate any pages. What should the kernel
do? Kill the process?
- If the preferred node is being removed, which node should become the
next preferred node?

Signed-off-by: Yasunori Goto <[email protected]>

---
mm/mempolicy.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)

Index: current/mm/mempolicy.c
===================================================================
--- current.orig/mm/mempolicy.c 2008-07-29 22:17:25.000000000 +0900
+++ current/mm/mempolicy.c 2008-07-29 22:17:29.000000000 +0900
@@ -2345,3 +2345,35 @@ out:
m->version = (vma != priv->tail_vma) ? vma->vm_start : 0;
return 0;
}
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static int mempolicy_mem_offline_callback(void *arg)
+{
+ int offline_node;
+ struct mempolicy *pol; /* walked by the (dummy) for_each_policy() below */
+ struct memory_notify *marg = arg;
+
+ offline_node = marg->status_change_nid;
+
+ /*
+ * If the node still has available memory, we keep policies.
+ */
+ if (offline_node < 0)
+ return 0;
+
+ /*
+ * Disable all offline node's bit for each node mask.
+ */
+ for_each_policy(pol) {
+ switch (pol->mode) {
+ case MPOL_BIND:
+ case MPOL_INTERLEAVE:
+ /* Force disable node bit */
+ node_clear(offline_node, pol->v.nodes);
+ break;
+ case MPOL_PREFERRED:
+ /* TBD */
+ default:
+ break;
+ }
+ }
+ return 0;
+}
+#endif

--
Yasunori Goto

2008-07-31 12:05:56

by Yasunori Goto

Subject: [RFC:Patch: 008/008](memory hotplug) remove_pgdat() function


remove_pgdat() is the main routine for pgdat removal.
It should be called on node hot-remove, but nothing calls it yet;
a sysfs interface (or something else?) will be necessary.

In addition, offline_pages() now has to rebuild the zonelists and clear
N_HIGH_MEMORY when the node has no present pages left, and stop kswapd.


Signed-off-by: Yasunori Goto <[email protected]>


---
mm/memory_hotplug.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 84 insertions(+), 1 deletion(-)

Index: current/mm/memory_hotplug.c
===================================================================
--- current.orig/mm/memory_hotplug.c 2008-07-29 22:17:24.000000000 +0900
+++ current/mm/memory_hotplug.c 2008-07-29 22:17:32.000000000 +0900
@@ -241,6 +241,82 @@ static int __add_section(struct zone *zo
return register_new_memory(__pfn_to_section(phys_start_pfn));
}

+static int cpus_busy_on_node(int nid)
+{
+ cpumask_t tmp = node_to_cpumask(nid);
+ int cpu, ret = 0;
+
+ for_each_cpu_mask(cpu, tmp) {
+ if (cpu_online(cpu)) {
+ printk(KERN_INFO "cpu %d is busy\n", cpu);
+ ret = 1;
+ }
+ }
+ return ret;
+}
+
+static int sections_busy_on_node(struct pglist_data *pgdat)
+{
+ unsigned long section_nr, num, i;
+ int ret = 0;
+
+ section_nr = pfn_to_section_nr(pgdat->node_start_pfn);
+ num = pfn_to_section_nr(pgdat->node_start_pfn + pgdat->node_spanned_pages);
+
+ for (i = section_nr; i < num; i++) {
+ if (present_section_nr(i)) {
+ printk(KERN_INFO "section %ld is busy\n", i);
+ ret = 1;
+ }
+ }
+ return ret;
+}
+
+void free_pgdat(int offline_nid, struct pglist_data *pgdat)
+{
+ struct page *page = virt_to_page(pgdat);
+
+ arch_refresh_nodedata(offline_nid, NULL);
+
+ if (PageSlab(page)) {
+ /* This pgdat is allocated on other node via hot-add */
+ arch_free_nodedata(pgdat);
+ return;
+ }
+
+ if (offline_nid != page_to_nid(page)) {
+ /* This pgdat is allocated on other node as memoryless node */
+ put_page_bootmem(page);
+ return;
+ }
+
+ /*
+ * OK: this pgdat lives on the node that is being offlined.
+ * Don't free it, because this area will be removed physically
+ * in the next step.
+ */
+}
+
+int remove_pgdat(int nid)
+{
+ struct pglist_data *pgdat = NODE_DATA(nid);
+
+ if (cpus_busy_on_node(nid))
+ return -EBUSY;
+
+ if (sections_busy_on_node(pgdat))
+ return -EBUSY;
+
+ node_set_offline(nid);
+ synchronize_sched();
+ synchronize_srcu(&pgdat_remove_srcu);
+
+ free_pgdat(nid, pgdat);
+
+ return 0;
+}
+
#ifdef CONFIG_SPARSEMEM_VMEMMAP
static int __remove_section(struct zone *zone, struct mem_section *ms)
{
@@ -473,7 +549,6 @@ static void rollback_node_hotadd(int nid
return;
}

-
int add_memory(int nid, u64 start, u64 size)
{
pg_data_t *pgdat = NULL;
@@ -842,6 +917,14 @@ repeat:
vm_total_pages = nr_free_pagecache_pages();
writeback_set_ratelimit();

+ if (zone->present_pages == 0)
+ build_all_zonelists();
+
+ if (zone->zone_pgdat->node_present_pages == 0) {
+ int nid = zone_to_nid(zone);
+
+ node_clear_state(nid, N_HIGH_MEMORY);
+ kswapd_stop(nid);
+ }
+
memory_notify(MEM_OFFLINE, &arg);
return 0;


--
Yasunori Goto

2008-07-31 12:09:35

by Yasunori Goto

Subject: [RFC:Patch: 005/008](memory hotplug) check node online before NODE_DATA and so on


When the kernel uses NODE_DATA(nid), it must check whether that node is
really online. In addition, numa_node_id() returning an offlined node
must be a bug, because CPU offline on the node has to be executed before
node offline.
This patch adds these checks, and adds read locks around some other
small places that walk zones/zonelists.


Signed-off-by: Yasunori Goto <[email protected]>

---
mm/mempolicy.c | 11 +++++++++--
mm/page_alloc.c | 19 +++++++++++++++++--
mm/quicklist.c | 8 +++++++-
mm/slub.c | 7 ++++++-
mm/vmscan.c | 22 +++++++++++++++++++---
5 files changed, 58 insertions(+), 9 deletions(-)

Index: current/mm/page_alloc.c
===================================================================
--- current.orig/mm/page_alloc.c 2008-07-29 22:06:46.000000000 +0900
+++ current/mm/page_alloc.c 2008-07-29 22:17:16.000000000 +0900
@@ -1884,7 +1884,12 @@ static unsigned int nr_free_zone_pages(i
/* Just pick one node, since fallback list is circular */
unsigned int sum = 0;

- struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
+ struct zonelist *zonelist;
+ int node = numa_node_id();
+
+ pgdat_remove_read_lock();
+ BUG_ON(!node_online(node));
+ zonelist = node_zonelist(node, GFP_KERNEL);

for_each_zone_zonelist(zone, z, zonelist, offset) {
unsigned long size = zone->present_pages;
@@ -1892,6 +1897,7 @@ static unsigned int nr_free_zone_pages(i
if (size > high)
sum += size - high;
}
+ pgdat_remove_read_unlock();

return sum;
}
@@ -1935,7 +1941,14 @@ EXPORT_SYMBOL(si_meminfo);
#ifdef CONFIG_NUMA
void si_meminfo_node(struct sysinfo *val, int nid)
{
- pg_data_t *pgdat = NODE_DATA(nid);
+ pg_data_t *pgdat;
+
+ pgdat_remove_read_lock();
+ if (unlikely(!node_online(nid))) {
+ pgdat_remove_read_unlock();
+ return;
+ }
+ pgdat = NODE_DATA(nid);

val->totalram = pgdat->node_present_pages;
val->freeram = node_page_state(nid, NR_FREE_PAGES);
@@ -1947,6 +1960,8 @@ void si_meminfo_node(struct sysinfo *val
val->totalhigh = 0;
val->freehigh = 0;
#endif
+ pgdat_remove_read_unlock();
+
val->mem_unit = PAGE_SIZE;
}
#endif
Index: current/mm/quicklist.c
===================================================================
--- current.orig/mm/quicklist.c 2008-07-29 22:06:46.000000000 +0900
+++ current/mm/quicklist.c 2008-07-29 22:17:16.000000000 +0900
@@ -26,7 +26,12 @@ DEFINE_PER_CPU(struct quicklist, quickli
static unsigned long max_pages(unsigned long min_pages)
{
unsigned long node_free_pages, max;
- struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
+ struct zone *zones;
+ int node = numa_node_id();
+
+ pgdat_remove_read_lock();
+ BUG_ON(!node_online(node));
+ zones = NODE_DATA(node)->node_zones;

node_free_pages =
#ifdef CONFIG_ZONE_DMA
@@ -37,6 +42,7 @@ static unsigned long max_pages(unsigned
#endif
zone_page_state(&zones[ZONE_NORMAL], NR_FREE_PAGES);

+ pgdat_remove_read_unlock();
max = node_free_pages / FRACTION_OF_NODE_MEM;
return max(max, min_pages);
}
Index: current/mm/vmscan.c
===================================================================
--- current.orig/mm/vmscan.c 2008-07-29 22:06:46.000000000 +0900
+++ current/mm/vmscan.c 2008-07-29 22:17:42.000000000 +0900
@@ -1710,11 +1710,21 @@ unsigned long try_to_free_mem_cgroup_pag
.isolate_pages = mem_cgroup_isolate_pages,
};
struct zonelist *zonelist;
+ unsigned long ret;
+ int node = numa_node_id();

sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
- zonelist = NODE_DATA(numa_node_id())->node_zonelists;
- return do_try_to_free_pages(zonelist, &sc);
+
+ pgdat_remove_read_lock_sleepable();
+ if (unlikely(!node_online(node))) {
+ pgdat_remove_read_unlock_sleepable();
+ return 0;
+ }
+ zonelist = NODE_DATA(node)->node_zonelists;
+ ret = do_try_to_free_pages(zonelist, &sc);
+ pgdat_remove_read_unlock_sleepable();
+ return ret;
}
#endif

@@ -2636,19 +2646,25 @@ static ssize_t read_scan_unevictable_nod
static ssize_t write_scan_unevictable_node(struct sys_device *dev,
const char *buf, size_t count)
{
- struct zone *node_zones = NODE_DATA(dev->id)->node_zones;
+ struct zone *node_zones;
struct zone *zone;
unsigned long res;
unsigned long req = strict_strtoul(buf, 10, &res);
+ int node = dev->id;

if (!req)
return 1; /* zero is no-op */

+ pgdat_remove_read_lock();
+ BUG_ON(!node_online(node));
+
+ node_zones = NODE_DATA(node)->node_zones;
for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
if (!populated_zone(zone))
continue;
scan_zone_unevictable_pages(zone);
}
+ pgdat_remove_read_unlock();
return 1;
}

Index: current/mm/slub.c
===================================================================
--- current.orig/mm/slub.c 2008-07-29 22:06:46.000000000 +0900
+++ current/mm/slub.c 2008-07-29 22:17:16.000000000 +0900
@@ -1300,6 +1300,7 @@ static struct page *get_any_partial(stru
struct zone *zone;
enum zone_type high_zoneidx = gfp_zone(flags);
struct page *page;
+ int node;

/*
* The defrag ratio allows a configuration of the tradeoffs between
@@ -1323,7 +1324,10 @@ static struct page *get_any_partial(stru
get_cycles() % 1024 > s->remote_node_defrag_ratio)
return NULL;

- zonelist = node_zonelist(slab_node(current->mempolicy), flags);
+ pgdat_remove_read_lock();
+ node = slab_node(current->mempolicy);
+ BUG_ON(!node_online(node));
+ zonelist = node_zonelist(node, flags);
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
struct kmem_cache_node *n;

@@ -1336,6 +1340,7 @@ static struct page *get_any_partial(stru
return page;
}
}
+ pgdat_remove_read_unlock();
#endif
return NULL;
}
Index: current/mm/mempolicy.c
===================================================================
--- current.orig/mm/mempolicy.c 2008-07-29 22:06:46.000000000 +0900
+++ current/mm/mempolicy.c 2008-07-29 22:17:40.000000000 +0900
@@ -1407,11 +1407,18 @@ unsigned slab_node(struct mempolicy *pol
struct zonelist *zonelist;
struct zone *zone;
enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
- zonelist = &NODE_DATA(numa_node_id())->node_zonelists[0];
+ int node = numa_node_id();
+
+ pgdat_remove_read_lock();
+ BUG_ON(!node_online(node));
+ zonelist = &NODE_DATA(node)->node_zonelists[0];
(void)first_zones_zonelist(zonelist, highest_zoneidx,
&policy->v.nodes,
&zone);
- return zone->node;
+ node = zone->node;
+ pgdat_remove_read_unlock();
+
+ return node;
}

default:

--
Yasunori Goto

2008-07-31 14:05:47

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC:Patch: 000/008](memory hotplug) rough idea of pgdat removing

Yasunori Goto wrote:

> Current my idea is using RCU feature for waiting them.
> Because it is the least impact against reader's performance,
> and pgdat remover can wait finish of reader's access to pgdat
> which is removing by synchronize_sched().

The use of RCU disables preemption which has implications as to what can be done in a loop over nodes or zones. This would also potentially add more overhead to the page allocator hotpaths.


> If you have better idea, please let me know.

Use stop_machine()? The removal of a zone or node is a pretty rare event after all and it would avoid having to deal with rcu etc etc.

2008-08-01 09:56:28

by Yasunori Goto

Subject: Re: [RFC:Patch: 000/008](memory hotplug) rough idea of pgdat removing

> Yasunori Goto wrote:
>
> > Current my idea is using RCU feature for waiting them.
> > Because it is the least impact against reader's performance,
> > and pgdat remover can wait finish of reader's access to pgdat
> > which is removing by synchronize_sched().
>
> The use of RCU disables preemption which has implications as to
> what can be done in a loop over nodes or zones.

Yes. That is one of the (big) cons.

> This would also potentially add more overhead to the page allocator hotpaths.

Agree.

To tell the truth, before this post I tried hackbench with the 3rd patch,
which adds rcu_read_lock() to the hot path, to make a rough estimate of its impact.

%hackbench 100 process 2000

without patch:
39.93

with patch:
39.99
(Both are averages of 10 runs.)

I guess this result includes the effect of disabling preemption.
So the throughput looks not so bad, but latency would probably be worse,
as you point out.

Kame-san advised me that I should run some other benchmarks which can measure
memory performance. I'll do that next week.

> > If you have better idea, please let me know.
>
> Use stop_machine()? The removal of a zone or node is a pretty rare event
> after all and it would avoid having to deal with rcu etc etc.
>

I thought of it at first, but isn't there the following worst case?


CPU 0 CPU 1
-------------------------------------------------------
__alloc_pages()

parsing_zonelist()
:
enter page_reclaim()
sleep (and remember zone) :
:
update zonelist and node_online_map
with stop_machine_run()
free pgdat().
remove the Node electrically.

wake up and touch remembered
zone, but it is removed
(Oops!!!)



Anyway, I'm happy if there is a better way than my poor idea. :-)

Thanks for your comment.


--
Yasunori Goto

2008-08-01 13:53:29

by Christoph Lameter

Subject: Re: [RFC:Patch: 000/008](memory hotplug) rough idea of pgdat removing

Yasunori Goto wrote:

> I thought it at first, but are there the following worst case?
>
>
> CPU 0 CPU 1
> -------------------------------------------------------
> __alloc_pages()
>
> parsing_zonelist()
> :
> enter page_reclaim()
> sleep (and remember zone) :
> :
> update zonelist and node_online_map
> with stop_machine_run()
> free pgdat().
> remove the Node electrically.
>
> wake up and touch remembered
> zone, but it is removed
> (Oops!!!)
>
>
>
> Anyway, I'm happy if there is better way than my poor idea. :-)
>
> Thanks for your comment.

Duh. Then the use of RCU would also mean that all of reclaim must be in a rcu period. So reclaim cannot sleep anymore.


2008-08-02 00:28:11

by Yasunori Goto

Subject: Re: [RFC:Patch: 000/008](memory hotplug) rough idea of pgdat removing

> Yasunori Goto wrote:
>
> > I thought it at first, but are there the following worst case?
> >
> >
> > CPU 0 CPU 1
> > -------------------------------------------------------
> > __alloc_pages()
> >
> > parsing_zonelist()
> > :
> > enter page_reclaim()
> > sleep (and remember zone) :
> > :
> > update zonelist and node_online_map
> > with stop_machine_run()
> > free pgdat().
> > remove the Node electrically.
> >
> > wake up and touch remembered
> > zone, but it is removed
> > (Oops!!!)
> >
> >
> >
> > Anyway, I'm happy if there is better way than my poor idea. :-)
> >
> > Thanks for your comment.
>
> Duh. Then the use of RCU would also mean that all of reclaim must
> be in a rcu period. So reclaim cannot sleep anymore.

I use srcu_read_lock() (the sleepable RCU lock) when the kernel must sleep
during page reclaim. So, the basic idea of my patch is the following.


CPU 0 CPU 1
-------------------------------------------------------
__alloc_pages()

rcu_read_lock() and check
online bitmap
parsing_zonelist()
rcu_read_unlock()
:
enter page_reclaim()
srcu_read_lock()
parse zone/zonelist.
sleep (and remember zone) :
:
update zonelist and node_online_map
with stop_machine_run()

wake up and touch remembered zone,
srcu_read_unlock()
synchronize_sched().
free_pgdat()


Thanks.

--
Yasunori Goto

2008-08-04 13:26:21

by Christoph Lameter

Subject: Re: [RFC:Patch: 000/008](memory hotplug) rough idea of pgdat removing

Yasunori Goto wrote:

>>> Thanks for your comment.
>> Duh. Then the use of RCU would also mean that all of reclaim must
>> be in a rcu period. So reclaim cannot sleep anymore.
>
> I use srcu_read_lock() (sleepable rcu lock) if kernel must be sleep for
> page reclaim. So, my patch basic idea is followings.

But that introduces more overhead in __alloc_pages.

2008-08-05 06:43:23

by Yasunori Goto

Subject: Re: [RFC:Patch: 000/008](memory hotplug) rough idea of pgdat removing


> >> Duh. Then the use of RCU would also mean that all of reclaim must
> >> be in a rcu period. So reclaim cannot sleep anymore.
> >
> > I use srcu_read_lock() (sleepable rcu lock) if kernel must be sleep for
> > page reclaim. So, my patch basic idea is followings.
>
> But that introduces more overhead in __alloc_pages.

Hmmm. I think SRCU should be used when the kernel has to sleep and the
sleep time will be bigger than SRCU's overhead.....

The following are results of unixbench and lmbench.
I suppose my patch impacts latency rather than throughput.
In these results, the 100fd select and page fault latencies of lmbench became worse.
So I can't say there is no problem with my patches.

Anyway, I'll try to find another way with less impact, if there is one,
and compare its benchmark results against this one.

Bye.

------------

Unixbench
-----

Normal 2.6.27-rc1-mm1


BYTE UNIX Benchmarks (Version 4.1.0)
System -- Linux localhost.localdomain 2.6.27-rc1-mm1 #1 SMP Mon Aug 4 16:08:48 JST 2008 ia64 ia64 ia64 GNU/Linux
Start Benchmark Run: Tuesday, August 5, 2008 10:24:35 JST
1 interactive users.
10:24:35 up 9 min, 1 user, load average: 0.16, 0.08, 0.03
lrwxrwxrwx 1 root root 4 2008-02-25 15:48 /bin/sh -> bash
/bin/sh: symbolic link to `bash'
/dev/sda5 33792348 18360424 13687672 58% /home
Execl Throughput 2954.0 lps (29.8 secs, 3 samples)
File Read 1024 bufsize 2000 maxblocks 1211570.0 KBps (30.0 secs, 3 samples)
File Write 1024 bufsize 2000 maxblocks 281599.0 KBps (30.0 secs, 3 samples)
File Copy 1024 bufsize 2000 maxblocks 218859.0 KBps (30.0 secs, 3 samples)
File Read 256 bufsize 500 maxblocks 328725.0 KBps (30.0 secs, 3 samples)
File Write 256 bufsize 500 maxblocks 72850.0 KBps (30.0 secs, 3 samples)
File Copy 256 bufsize 500 maxblocks 57095.0 KBps (30.0 secs, 3 samples)
File Read 4096 bufsize 8000 maxblocks 3883690.0 KBps (30.0 secs, 3 samples)
File Write 4096 bufsize 8000 maxblocks 1050752.0 KBps (30.0 secs, 3 samples)
File Copy 4096 bufsize 8000 maxblocks 564703.0 KBps (30.0 secs, 3 samples)
Pipe Throughput 462027.5 lps (10.0 secs, 10 samples)
Pipe-based Context Switching 105824.3 lps (10.0 secs, 10 samples)
Process Creation 2242.9 lps (30.0 secs, 3 samples)
System Call Overhead 1320907.8 lps (10.0 secs, 10 samples)
Shell Scripts (1 concurrent) 4442.1 lpm (60.0 secs, 3 samples)
Shell Scripts (8 concurrent) 1810.0 lpm (60.0 secs, 3 samples)
Shell Scripts (16 concurrent) 1042.7 lpm (60.0 secs, 3 samples)


INDEX VALUES
TEST BASELINE RESULT INDEX

Execl Throughput 43.0 2954.0 687.0
File Copy 1024 bufsize 2000 maxblocks 3960.0 218859.0 552.7
File Copy 256 bufsize 500 maxblocks 1655.0 57095.0 345.0
File Copy 4096 bufsize 8000 maxblocks 5800.0 564703.0 973.6
Pipe Throughput 12440.0 462027.5 371.4
Pipe-based Context Switching 4000.0 105824.3 264.6
Process Creation 126.0 2242.9 178.0
Shell Scripts (8 concurrent) 6.0 1810.0 3016.7
System Call Overhead 15000.0 1320907.8 880.6
=========
FINAL SCORE 565.6



2.6.27-rc1-mm1 with my patch


BYTE UNIX Benchmarks (Version 4.1.0)
System -- Linux localhost.localdomain 2.6.27-rc1-mm1-goto-test #2 SMP Mon Aug 4 18:50:56 JST 2008 ia64 ia64 ia64 GNU/Linux
Start Benchmark Run: Monday, August 4, 2008 20:35:11 JST
1 interactive users.
20:35:11 up 1:37, 1 user, load average: 0.00, 0.29, 0.71
lrwxrwxrwx 1 root root 4 2008-02-25 15:48 /bin/sh -> bash
/bin/sh: symbolic link to `bash'
/dev/sda5 33792348 18360420 13687676 58% /home
Execl Throughput 2949.0 lps (29.7 secs, 3 samples)
File Read 1024 bufsize 2000 maxblocks 1317211.0 KBps (30.0 secs, 3 samples)
File Write 1024 bufsize 2000 maxblocks 282643.0 KBps (30.0 secs, 3 samples)
File Copy 1024 bufsize 2000 maxblocks 220360.0 KBps (30.0 secs, 3 samples)
File Read 256 bufsize 500 maxblocks 361448.0 KBps (30.0 secs, 3 samples)
File Write 256 bufsize 500 maxblocks 73172.0 KBps (30.0 secs, 3 samples)
File Copy 256 bufsize 500 maxblocks 57489.0 KBps (30.0 secs, 3 samples)
File Read 4096 bufsize 8000 maxblocks 3819448.0 KBps (30.0 secs, 3 samples)
File Write 4096 bufsize 8000 maxblocks 1026563.0 KBps (30.0 secs, 3 samples)
File Copy 4096 bufsize 8000 maxblocks 585218.0 KBps (30.0 secs, 3 samples)
Pipe Throughput 482681.7 lps (10.0 secs, 10 samples)
Pipe-based Context Switching 101437.7 lps (10.0 secs, 10 samples)
Process Creation 2237.5 lps (30.0 secs, 3 samples)
System Call Overhead 1282198.4 lps (10.0 secs, 10 samples)
Shell Scripts (1 concurrent) 4447.7 lpm (60.0 secs, 3 samples)
Shell Scripts (8 concurrent) 1812.7 lpm (60.0 secs, 3 samples)
Shell Scripts (16 concurrent) 1041.7 lpm (60.0 secs, 3 samples)


INDEX VALUES
TEST BASELINE RESULT INDEX

Execl Throughput 43.0 2949.0 685.8
File Copy 1024 bufsize 2000 maxblocks 3960.0 220360.0 556.5
File Copy 256 bufsize 500 maxblocks 1655.0 57489.0 347.4
File Copy 4096 bufsize 8000 maxblocks 5800.0 585218.0 1009.0
Pipe Throughput 12440.0 482681.7 388.0
Pipe-based Context Switching 4000.0 101437.7 253.6
Process Creation 126.0 2237.5 177.6
Shell Scripts (8 concurrent) 6.0 1812.7 3021.2
System Call Overhead 15000.0 1282198.4 854.8
=========
FINAL SCORE 566.8





LMBENCH

The first lines are results of normal 2.6.27-rc1-mm1.
The second lines are results with my patch.



L M B E N C H 3 . 0 S U M M A R Y
------------------------------------
(Alpha software, do not distribute)

Basic system parameters
------------------------------------------------------------------------------
Host OS Description Mhz tlb cache mem scal
pages line par load
bytes
--------- ------------- ----------------------- ---- ----- ----- ------ ----
localhost Linux 2.6.27- ia64-linux-gnu 1600 128 1
localhost Linux 2.6.27- ia64-linux-gnu 1600 128 1

Processor, Processes - times in microseconds - smaller is better
------------------------------------------------------------------------------
Host OS Mhz null null open slct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
localhost Linux 2.6.27- 1600 0.03 0.23 3.12 4.45 6.73 0.27 1.75 227. 463. 2219
localhost Linux 2.6.27- 1600 0.03 0.23 3.13 4.44 6.74 0.27 1.73 207. 448. 2230

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
localhost Linux 2.6.27- 11.3 11.4 11.5 11.5 12.7 11.8 14.6
localhost Linux 2.6.27- 11.5 11.4 11.5 11.6 12.8 11.9 14.7

*Local* Communication latencies in microseconds - smaller is better
---------------------------------------------------------------------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
localhost Linux 2.6.27- 11.3 8.464 28.3 13.4 28.7 46.
localhost Linux 2.6.27- 11.5 8.470 28.3 13.4 32.2 46.

File & VM system latencies in microseconds - smaller is better
-------------------------------------------------------------------------------
Host OS 0K File 10K File Mmap Prot Page 100fd
Create Delete Create Delete Latency Fault Fault selct
--------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
localhost Linux 2.6.27- 15.1 13.4 45.6 25.4 24.0K 0.384 0.23850 2.804 <---!!!
localhost Linux 2.6.27- 15.8 13.3 43.0 26.0 24.1K 0.401 0.25150 2.835 <----!!!

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
UNIX reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
localhost Linux 2.6.27- 4814 4100 1188 2087.4 523.2 549.6 274.9 458. 523.5
localhost Linux 2.6.27- 4811 4111 1219 2090.8 523.1 549.4 276.1 458. 523.5
(END)



--
Yasunori Goto

2008-08-05 11:15:08

by Mel Gorman

Subject: Re: [RFC:Patch: 000/008](memory hotplug) rough idea of pgdat removing

On (05/08/08 15:39), Yasunori Goto didst pronounce:
>
> > >> Duh. Then the use of RCU would also mean that all of reclaim must
> > >> be in a rcu period. So reclaim cannot sleep anymore.
> > >
> > > I use srcu_read_lock() (sleepable rcu lock) if kernel must be sleep for
> > > page reclaim. So, my patch basic idea is followings.
> >
> > But that introduces more overhead in __alloc_pages.
>
> Hmmm. I think SRCU should be used when kernel has to sleep, and sleep time
> will be bigger than SRCU's overhead.....
>
> The followings are results of unixbench and lmbench.
> I suppose my patch impacts latency rather than throughput.
> In these results, 100fd select and page fault latencies of lmbench became worse.
> So I can't say there is no problem in my patches.
>
> Anyway, I'll retry to find other less impact way if there is,
> and compare benchmark results with this way.
>

Maybe I am missing something, but what is wrong with stop_machine during
memory hot-remove?

> [unixbench and lmbench results snipped]

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-08-05 17:14:39

by Christoph Lameter

Subject: Re: [RFC:Patch: 000/008](memory hotplug) rough idea of pgdat removing

Mel Gorman wrote:

> Maybe I am missing something, but what is wrong with stop_machine during
> memory hot-remove?

Reclaim can sleep while going down a zonelist. There would need to be some
form of synchronization to avoid removing a zone from the zonelist that we are
just scanning.