2006-02-09 18:54:33

by Paul Jackson

Subject: [PATCH v2 01/07] cpuset cleanup not not operators

From: Paul Jackson <[email protected]>

Since the test_bit() bit operator is boolean (it returns 0 or 1),
the double-not "!!" operations used to convert a scalar
(zero or non-zero) value to a boolean are not needed.
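
A minimal illustrative sketch (not part of the patch, with made-up
values) of why "!!" matters for a raw mask test but not for test_bit(),
relying on the changelog's observation that test_bit() returns 0 or 1:

        /* illustration only -- 'flags' and bit 4 are arbitrary */
        unsigned long flags = 0x10;

        int a = !!(flags & 0x10);    /* "!!" needed: (flags & 0x10) is 0x10, not 1 */
        int b = test_bit(4, &flags); /* already 0 or 1, so no "!!" needed */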

Signed-off-by: Paul Jackson <[email protected]>

---

Andrew,

Please drop the patches:

cpuset-memory-spread-basic-implementation.patch
cpuset-memory-spread-basic-implementation-fix.patch
cpuset-memory-spread-page-cache-implementation-and-hooks.patch
cpuset-memory-spread-slab-cache-implementation.patch
cpuset-memory-spread-slab-cache-optimizations.patch
cpuset-memory-spread-slab-cache-optimizations-tidy.patch
cpuset-memory-spread-slab-cache-hooks.patch

and replace with this patch set:

[PATCH v2 01/07] cpuset_cleanup_not_not_ops
[PATCH v2 02/07] cpuset_combine_atomic_inc_calls
[PATCH v2 03/07] cpuset_mem_spread_new
[PATCH v2 04/07] cpuset_mem_spread_page_cache
[PATCH v2 05/07] cpuset_mem_spread_slab_cache
[PATCH v2 06/07] cpuset_mem_spread_slab_optimize
[PATCH v2 07/07] cpuset_mem_spread_mark_some_slab_caches

Changes in patch set from first version to this version v2:
* dropped double not (!!) operators on cpuset flag bit tests
* use more optimal atomic_inc_return() calls
* change from one flag for all spreading to two
flags, one for page cache, one for slab cache
* change name of the easily misused mpol_set_task_struct_flag()
to the less inviting mpol_fix_fork_child_flag(), to minimize the
risk that someone will call it on another task unsafely.
* a tidy fix and a missing EXPORT_SYMBOL, thanks to Andrew
* added xfs slab cache hooks for slab cache memory spreading
* comment anticipating other file systems may need same hooks
* further optimize kernel text size, for both NUMA and non-NUMA
cases of the page_cache_alloc*() memory spreading hooks
* comment on why the !in_interrupt() test is needed to use memory spreading
* comment on why we don't need to check that the node returned from
cpuset_mem_spread_node() is online

kernel/cpuset.c | 10 +++++-----
1 files changed, 5 insertions(+), 5 deletions(-)

--- 2.6.16-rc1-mm5.orig/kernel/cpuset.c 2006-02-06 23:17:44.536051878 -0800
+++ 2.6.16-rc1-mm5/kernel/cpuset.c 2006-02-06 23:37:02.709481506 -0800
@@ -114,27 +114,27 @@ typedef enum {
/* convenient tests for these bits */
static inline int is_cpu_exclusive(const struct cpuset *cs)
{
- return !!test_bit(CS_CPU_EXCLUSIVE, &cs->flags);
+ return test_bit(CS_CPU_EXCLUSIVE, &cs->flags);
}

static inline int is_mem_exclusive(const struct cpuset *cs)
{
- return !!test_bit(CS_MEM_EXCLUSIVE, &cs->flags);
+ return test_bit(CS_MEM_EXCLUSIVE, &cs->flags);
}

static inline int is_removed(const struct cpuset *cs)
{
- return !!test_bit(CS_REMOVED, &cs->flags);
+ return test_bit(CS_REMOVED, &cs->flags);
}

static inline int notify_on_release(const struct cpuset *cs)
{
- return !!test_bit(CS_NOTIFY_ON_RELEASE, &cs->flags);
+ return test_bit(CS_NOTIFY_ON_RELEASE, &cs->flags);
}

static inline int is_memory_migrate(const struct cpuset *cs)
{
- return !!test_bit(CS_MEMORY_MIGRATE, &cs->flags);
+ return test_bit(CS_MEMORY_MIGRATE, &cs->flags);
}

/*

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373


2006-02-09 18:54:44

by Paul Jackson

Subject: [PATCH v2 02/07] cpuset use combined atomic_inc_return calls

From: Paul Jackson <[email protected]>

Replace pairs of calls to atomic_inc() and atomic_read() with a single
call to atomic_inc_return(), saving a few bytes of source and kernel text.
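
A condensed before/after sketch of the pattern, taken from the hunks
below; as a side effect, the combined call also returns the value
produced by this particular increment, which the separate read does
not guarantee if another increment races in between:

        /* before: two calls */
        atomic_inc(&cpuset_mems_generation);
        cs->mems_generation = atomic_read(&cpuset_mems_generation);

        /* after: one combined call */
        cs->mems_generation = atomic_inc_return(&cpuset_mems_generation);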

Signed-off-by: Paul Jackson <[email protected]>

---

kernel/cpuset.c | 11 ++++-------
1 files changed, 4 insertions(+), 7 deletions(-)

--- 2.6.16-rc1-mm5.orig/kernel/cpuset.c 2006-02-07 16:26:03.511639777 -0800
+++ 2.6.16-rc1-mm5/kernel/cpuset.c 2006-02-07 16:26:43.563843169 -0800
@@ -858,8 +858,7 @@ static int update_nodemask(struct cpuset

mutex_lock(&callback_mutex);
cs->mems_allowed = trialcs.mems_allowed;
- atomic_inc(&cpuset_mems_generation);
- cs->mems_generation = atomic_read(&cpuset_mems_generation);
+ cs->mems_generation = atomic_inc_return(&cpuset_mems_generation);
mutex_unlock(&callback_mutex);

set_cpuset_being_rebound(cs); /* causes mpol_copy() rebind */
@@ -1770,8 +1769,7 @@ static long cpuset_create(struct cpuset
atomic_set(&cs->count, 0);
INIT_LIST_HEAD(&cs->sibling);
INIT_LIST_HEAD(&cs->children);
- atomic_inc(&cpuset_mems_generation);
- cs->mems_generation = atomic_read(&cpuset_mems_generation);
+ cs->mems_generation = atomic_inc_return(&cpuset_mems_generation);
fmeter_init(&cs->fmeter);

cs->parent = parent;
@@ -1861,7 +1859,7 @@ int __init cpuset_init_early(void)
struct task_struct *tsk = current;

tsk->cpuset = &top_cpuset;
- tsk->cpuset->mems_generation = atomic_read(&cpuset_mems_generation);
+ tsk->cpuset->mems_generation = atomic_inc_return(&cpuset_mems_generation);
return 0;
}

@@ -1880,8 +1878,7 @@ int __init cpuset_init(void)
top_cpuset.mems_allowed = NODE_MASK_ALL;

fmeter_init(&top_cpuset.fmeter);
- atomic_inc(&cpuset_mems_generation);
- top_cpuset.mems_generation = atomic_read(&cpuset_mems_generation);
+ top_cpuset.mems_generation = atomic_inc_return(&cpuset_mems_generation);

init_task.cpuset = &top_cpuset;


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2006-02-09 18:55:12

by Paul Jackson

Subject: [PATCH v2 07/07] cpuset memory spread slab cache hooks

From: Paul Jackson <[email protected]>

Change the kmem_cache_create calls for certain slab caches to
support cpuset memory spreading.

See the previous patches, cpuset_mem_spread, for an explanation
of cpuset memory spreading, and cpuset_mem_spread_slab_cache
for the slab cache support for memory spreading.

The slab caches marked for now are: dentry_cache, inode_cache,
some xfs slab caches, and buffer_head. This list may change
over time. In particular, other file system types that are
used extensively on large NUMA systems may want to allow for
spreading their directory and inode slab cache entries.

Signed-off-by: Paul Jackson

---

fs/buffer.c | 7 +++++--
fs/dcache.c | 3 ++-
fs/inode.c | 9 +++++++--
fs/xfs/linux-2.6/kmem.h | 2 +-
4 files changed, 15 insertions(+), 6 deletions(-)

--- v2.6.16-rc2.orig/fs/dcache.c 2006-02-08 22:48:26.000000000 -0800
+++ v2.6.16-rc2/fs/dcache.c 2006-02-08 22:48:31.000000000 -0800
@@ -1682,7 +1682,8 @@ static void __init dcache_init(unsigned
dentry_cache = kmem_cache_create("dentry_cache",
sizeof(struct dentry),
0,
- SLAB_RECLAIM_ACCOUNT|SLAB_PANIC,
+ (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
+ SLAB_MEM_SPREAD),
NULL, NULL);

set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
--- v2.6.16-rc2.orig/fs/inode.c 2006-02-08 22:48:26.000000000 -0800
+++ v2.6.16-rc2/fs/inode.c 2006-02-08 22:48:31.000000000 -0800
@@ -1375,8 +1375,13 @@ void __init inode_init(unsigned long mem
int loop;

/* inode slab cache */
- inode_cachep = kmem_cache_create("inode_cache", sizeof(struct inode),
- 0, SLAB_RECLAIM_ACCOUNT|SLAB_PANIC, init_once, NULL);
+ inode_cachep = kmem_cache_create("inode_cache",
+ sizeof(struct inode),
+ 0,
+ (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
+ SLAB_MEM_SPREAD),
+ init_once,
+ NULL);
set_shrinker(DEFAULT_SEEKS, shrink_icache_memory);

/* Hash may have been set up in inode_init_early */
--- v2.6.16-rc2.orig/fs/buffer.c 2006-02-08 22:48:26.000000000 -0800
+++ v2.6.16-rc2/fs/buffer.c 2006-02-08 22:48:31.000000000 -0800
@@ -3203,8 +3203,11 @@ void __init buffer_init(void)
int nrpages;

bh_cachep = kmem_cache_create("buffer_head",
- sizeof(struct buffer_head), 0,
- SLAB_RECLAIM_ACCOUNT|SLAB_PANIC, init_buffer_head, NULL);
+ sizeof(struct buffer_head), 0,
+ (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
+ SLAB_MEM_SPREAD),
+ init_buffer_head,
+ NULL);

/*
* Limit the bh occupancy to 10% of ZONE_NORMAL
--- v2.6.16-rc2.orig/fs/xfs/linux-2.6/kmem.h 2006-02-08 23:21:56.000000000 -0800
+++ v2.6.16-rc2/fs/xfs/linux-2.6/kmem.h 2006-02-08 23:38:05.000000000 -0800
@@ -102,7 +102,7 @@ extern void kmem_free(void *, size_t);

#define KM_ZONE_HWALIGN SLAB_HWCACHE_ALIGN
#define KM_ZONE_RECLAIM SLAB_RECLAIM_ACCOUNT
-#define KM_ZONE_SPREAD 0
+#define KM_ZONE_SPREAD SLAB_MEM_SPREAD

#define kmem_zone kmem_cache
#define kmem_zone_t struct kmem_cache

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2006-02-09 18:55:11

by Paul Jackson

Subject: [PATCH v2 03/07] cpuset memory spread basic implementation

From: Paul Jackson <[email protected]>

This patch provides the implementation and cpuset interface for
an alternative memory allocation policy that can be applied to
certain kinds of memory allocations, such as the page cache (file
system buffers) and some slab caches (such as inode caches).

The policy is called "memory spreading." If enabled, it
spreads out these kinds of memory allocations over all the
nodes allowed to a task, instead of preferring to place them
on the node where the task is executing.

All other kinds of allocations, including anonymous pages for
a task's stack and data regions, are not affected by this policy
choice, and continue to be allocated preferring the node local
to execution, as modified by the NUMA mempolicy.

There are two boolean flag files per cpuset that control
where the kernel allocates pages for the file system buffers
and related in-kernel data structures. They are called
'memory_spread_page' and 'memory_spread_slab'.

If the per-cpuset boolean flag file 'memory_spread_page' is set,
then the kernel will spread the file system buffers (page cache)
evenly over all the nodes that the faulting task is allowed
to use, instead of preferring to put those pages on the node
where the task is running.

If the per-cpuset boolean flag file 'memory_spread_slab' is set,
then the kernel will spread some file system related slab caches,
such as those for inodes and dentries, evenly over all the nodes that
the faulting task is allowed to use, instead of preferring to
put those pages on the node where the task is running.

The implementation is simple. Setting the cpuset flags
'memory_spread_page' or 'memory_spread_slab' turns on the
per-process flags PF_SPREAD_PAGE or PF_SPREAD_SLAB, respectively,
for each task that is in the cpuset or subsequently joins that
cpuset. In subsequent patches, the page allocation calls for
the affected page cache and slab caches are modified to perform
an inline check for these flags, and if set, a call to a new
routine cpuset_mem_spread_node() returns the node to prefer
for the allocation.

The cpuset_mem_spread_node() routine is also simple. It uses
the value of a per-task rotor cpuset_mem_spread_rotor to select
the next node in the current task's mems_allowed to prefer for
the allocation.
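
A condensed sketch of how the later patches use these hooks, taken
from the page cache hook added in patch 04 of this series (the slab
cache hook in patch 05 follows the same shape):

        struct page *page_cache_alloc(struct address_space *x)
        {
                if (cpuset_do_page_mem_spread()) {
                        int n = cpuset_mem_spread_node();
                        return alloc_pages_node(n, mapping_gfp_mask(x), 0);
                }
                return alloc_pages(mapping_gfp_mask(x), 0);
        }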

This policy can provide substantial improvements for jobs that
need to place thread-local data on the corresponding node, but
that need to access large file system data sets that have to
be spread across the several nodes in the job's cpuset in order
to fit. Without this patch, especially for jobs that might
have one thread reading in the data set, the memory allocation
across the nodes in the job's cpuset can become very uneven.

A couple of Copyright year ranges are updated as well. And a
couple of email addresses that can be found in the MAINTAINERS
file are removed.

Signed-off-by: Paul Jackson <[email protected]>


---

Documentation/cpusets.txt | 76 ++++++++++++++++++++++++++++++++-
include/linux/cpuset.h | 29 ++++++++++++
include/linux/sched.h | 3 +
kernel/cpuset.c | 104 +++++++++++++++++++++++++++++++++++++++++++---
4 files changed, 203 insertions(+), 9 deletions(-)

--- v2.6.16-rc2.orig/Documentation/cpusets.txt 2006-02-08 22:46:56.000000000 -0800
+++ v2.6.16-rc2/Documentation/cpusets.txt 2006-02-08 22:47:01.000000000 -0800
@@ -17,7 +17,8 @@ CONTENTS:
1.4 What are exclusive cpusets ?
1.5 What does notify_on_release do ?
1.6 What is memory_pressure ?
- 1.7 How do I use cpusets ?
+ 1.7 What is memory spread ?
+ 1.8 How do I use cpusets ?
2. Usage Examples and Syntax
2.1 Basic Usage
2.2 Adding/removing cpus
@@ -315,7 +316,78 @@ the tasks in the cpuset, in units of rec
times 1000.


-1.7 How do I use cpusets ?
+1.7 What is memory spread ?
+---------------------------
+There are two boolean flag files per cpuset that control where the
+kernel allocates pages for the file system buffers and related in
+kernel data structures. They are called 'memory_spread_page' and
+'memory_spread_slab'.
+
+If the per-cpuset boolean flag file 'memory_spread_page' is set, then
+the kernel will spread the file system buffers (page cache) evenly
+over all the nodes that the faulting task is allowed to use, instead
+of preferring to put those pages on the node where the task is running.
+
+If the per-cpuset boolean flag file 'memory_spread_slab' is set,
+then the kernel will spread some file system related slab caches,
+such as for inodes and dentries evenly over all the nodes that the
+faulting task is allowed to use, instead of preferring to put those
+pages on the node where the task is running.
+
+The setting of these flags does not affect anonymous data segment or
+stack segment pages of a task.
+
+By default, both kinds of memory spreading are off, and memory
+pages are allocated on the node local to where the task is running,
+except perhaps as modified by the tasks NUMA mempolicy or cpuset
+configuration, so long as sufficient free memory pages are available.
+
+When new cpusets are created, they inherit the memory spread settings
+of their parent.
+
+Setting memory spreading causes allocations for the affected page
+or slab caches to ignore the tasks NUMA mempolicy and be spread
+instead. Tasks using mbind() or set_mempolicy() calls to set NUMA
+mempolicies will not notice any change in these calls as a result of
+their containing tasks memory spread settings. If memory spreading
+is turned off, then the currently specified NUMA mempolicy once again
+applies to memory page allocations.
+
+Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag
+files. By default they contain "0", meaning that the feature is off
+for that cpuset. If a "1" is written to that file, then that turns
+the named feature on.
+
+The implementation is simple.
+
+Setting the flag 'memory_spread_page' turns on a per-process flag
+PF_SPREAD_PAGE for each task that is in that cpuset or subsequently
+joins that cpuset. The page allocation calls for the page cache
+is modified to perform an inline check for this PF_SPREAD_PAGE task
+flag, and if set, a call to a new routine cpuset_mem_spread_node()
+returns the node to prefer for the allocation.
+
+Similarly, setting 'memory_spread_cache' turns on the flag
+PF_SPREAD_SLAB, and appropriately marked slab caches will allocate
+pages from the node returned by cpuset_mem_spread_node().
+
+The cpuset_mem_spread_node() routine is also simple. It uses the
+value of a per-task rotor cpuset_mem_spread_rotor to select the next
+node in the current tasks mems_allowed to prefer for the allocation.
+
+This memory placement policy is also known (in other contexts) as
+round-robin or interleave.
+
+This policy can provide substantial improvements for jobs that need
+to place thread local data on the corresponding node, but that need
+to access large file system data sets that need to be spread across
+the several nodes in the jobs cpuset in order to fit. Without this
+policy, especially for jobs that might have one thread reading in the
+data set, the memory allocation across the nodes in the jobs cpuset
+can become very uneven.
+
+
+1.8 How do I use cpusets ?
--------------------------

In order to minimize the impact of cpusets on critical kernel
--- v2.6.16-rc2.orig/include/linux/cpuset.h 2006-02-08 22:46:56.000000000 -0800
+++ v2.6.16-rc2/include/linux/cpuset.h 2006-02-08 22:47:01.000000000 -0800
@@ -4,7 +4,7 @@
* cpuset interface
*
* Copyright (C) 2003 BULL SA
- * Copyright (C) 2004 Silicon Graphics, Inc.
+ * Copyright (C) 2004-2006 Silicon Graphics, Inc.
*
*/

@@ -51,6 +51,18 @@ extern char *cpuset_task_status_allowed(
extern void cpuset_lock(void);
extern void cpuset_unlock(void);

+extern int cpuset_mem_spread_node(void);
+
+static inline int cpuset_do_page_mem_spread(void)
+{
+ return current->flags & PF_SPREAD_PAGE;
+}
+
+static inline int cpuset_do_slab_mem_spread(void)
+{
+ return current->flags & PF_SPREAD_SLAB;
+}
+
#else /* !CONFIG_CPUSETS */

static inline int cpuset_init_early(void) { return 0; }
@@ -99,6 +111,21 @@ static inline char *cpuset_task_status_a
static inline void cpuset_lock(void) {}
static inline void cpuset_unlock(void) {}

+static inline int cpuset_mem_spread_node(void)
+{
+ return 0;
+}
+
+static inline int cpuset_do_page_mem_spread(void)
+{
+ return 0;
+}
+
+static inline int cpuset_do_slab_mem_spread(void)
+{
+ return 0;
+}
+
#endif /* !CONFIG_CPUSETS */

#endif /* _LINUX_CPUSET_H */
--- v2.6.16-rc2.orig/include/linux/sched.h 2006-02-08 22:46:56.000000000 -0800
+++ v2.6.16-rc2/include/linux/sched.h 2006-02-08 22:47:01.000000000 -0800
@@ -874,6 +874,7 @@ struct task_struct {
struct cpuset *cpuset;
nodemask_t mems_allowed;
int cpuset_mems_generation;
+ int cpuset_mem_spread_rotor;
#endif
atomic_t fs_excl; /* holding fs exclusive resources */
struct rcu_head rcu;
@@ -935,6 +936,8 @@ static inline void put_task_struct(struc
#define PF_BORROWED_MM 0x00400000 /* I am a kthread doing use_mm */
#define PF_RANDOMIZE 0x00800000 /* randomize virtual address space */
#define PF_SWAPWRITE 0x01000000 /* Allowed to write to swap */
+#define PF_SPREAD_PAGE 0x04000000 /* Spread page cache over cpuset */
+#define PF_SPREAD_SLAB 0x08000000 /* Spread some slab caches over cpuset */

/*
* Only the _current_ task can read/write to tsk->flags, but other
--- v2.6.16-rc2.orig/kernel/cpuset.c 2006-02-08 22:47:00.000000000 -0800
+++ v2.6.16-rc2/kernel/cpuset.c 2006-02-08 22:47:01.000000000 -0800
@@ -4,15 +4,14 @@
* Processor and Memory placement constraints for sets of tasks.
*
* Copyright (C) 2003 BULL SA.
- * Copyright (C) 2004 Silicon Graphics, Inc.
+ * Copyright (C) 2004-2006 Silicon Graphics, Inc.
*
* Portions derived from Patrick Mochel's sysfs code.
* sysfs is Copyright (c) 2001-3 Patrick Mochel
- * Portions Copyright (c) 2004 Silicon Graphics, Inc.
*
- * 2003-10-10 Written by Simon Derr <[email protected]>
+ * 2003-10-10 Written by Simon Derr.
* 2003-10-22 Updates by Stephen Hemminger.
- * 2004 May-July Rework by Paul Jackson <[email protected]>
+ * 2004 May-July Rework by Paul Jackson.
*
* This file is subject to the terms and conditions of the GNU General Public
* License. See the file COPYING in the main directory of the Linux
@@ -108,7 +107,9 @@ typedef enum {
CS_MEM_EXCLUSIVE,
CS_MEMORY_MIGRATE,
CS_REMOVED,
- CS_NOTIFY_ON_RELEASE
+ CS_NOTIFY_ON_RELEASE,
+ CS_SPREAD_PAGE,
+ CS_SPREAD_SLAB,
} cpuset_flagbits_t;

/* convenient tests for these bits */
@@ -137,6 +138,16 @@ static inline int is_memory_migrate(cons
return test_bit(CS_MEMORY_MIGRATE, &cs->flags);
}

+static inline int is_spread_page(const struct cpuset *cs)
+{
+ return test_bit(CS_SPREAD_PAGE, &cs->flags);
+}
+
+static inline int is_spread_slab(const struct cpuset *cs)
+{
+ return test_bit(CS_SPREAD_SLAB, &cs->flags);
+}
+
/*
* Increment this atomic integer everytime any cpuset changes its
* mems_allowed value. Users of cpusets can track this generation
@@ -657,6 +668,14 @@ void cpuset_update_task_memory_state(voi
cs = tsk->cpuset; /* Maybe changed when task not locked */
guarantee_online_mems(cs, &tsk->mems_allowed);
tsk->cpuset_mems_generation = cs->mems_generation;
+ if (is_spread_page(cs))
+ tsk->flags |= PF_SPREAD_PAGE;
+ else
+ tsk->flags &= ~PF_SPREAD_PAGE;
+ if (is_spread_slab(cs))
+ tsk->flags |= PF_SPREAD_SLAB;
+ else
+ tsk->flags &= ~PF_SPREAD_SLAB;
task_unlock(tsk);
mutex_unlock(&callback_mutex);
mpol_rebind_task(tsk, &tsk->mems_allowed);
@@ -956,7 +975,8 @@ static int update_memory_pressure_enable
/*
* update_flag - read a 0 or a 1 in a file and update associated flag
* bit: the bit to update (CS_CPU_EXCLUSIVE, CS_MEM_EXCLUSIVE,
- * CS_NOTIFY_ON_RELEASE, CS_MEMORY_MIGRATE)
+ * CS_NOTIFY_ON_RELEASE, CS_MEMORY_MIGRATE,
+ * CS_SPREAD_PAGE, CS_SPREAD_SLAB)
* cs: the cpuset to update
* buf: the buffer where we read the 0 or 1
*
@@ -1187,6 +1207,8 @@ typedef enum {
FILE_NOTIFY_ON_RELEASE,
FILE_MEMORY_PRESSURE_ENABLED,
FILE_MEMORY_PRESSURE,
+ FILE_SPREAD_PAGE,
+ FILE_SPREAD_SLAB,
FILE_TASKLIST,
} cpuset_filetype_t;

@@ -1246,6 +1268,14 @@ static ssize_t cpuset_common_file_write(
case FILE_MEMORY_PRESSURE:
retval = -EACCES;
break;
+ case FILE_SPREAD_PAGE:
+ retval = update_flag(CS_SPREAD_PAGE, cs, buffer);
+ cs->mems_generation = atomic_inc_return(&cpuset_mems_generation);
+ break;
+ case FILE_SPREAD_SLAB:
+ retval = update_flag(CS_SPREAD_SLAB, cs, buffer);
+ cs->mems_generation = atomic_inc_return(&cpuset_mems_generation);
+ break;
case FILE_TASKLIST:
retval = attach_task(cs, buffer, &pathbuf);
break;
@@ -1355,6 +1385,12 @@ static ssize_t cpuset_common_file_read(s
case FILE_MEMORY_PRESSURE:
s += sprintf(s, "%d", fmeter_getrate(&cs->fmeter));
break;
+ case FILE_SPREAD_PAGE:
+ *s++ = is_spread_page(cs) ? '1' : '0';
+ break;
+ case FILE_SPREAD_SLAB:
+ *s++ = is_spread_slab(cs) ? '1' : '0';
+ break;
default:
retval = -EINVAL;
goto out;
@@ -1718,6 +1754,16 @@ static struct cftype cft_memory_pressure
.private = FILE_MEMORY_PRESSURE,
};

+static struct cftype cft_spread_page = {
+ .name = "memory_spread_page",
+ .private = FILE_SPREAD_PAGE,
+};
+
+static struct cftype cft_spread_slab = {
+ .name = "memory_spread_slab",
+ .private = FILE_SPREAD_SLAB,
+};
+
static int cpuset_populate_dir(struct dentry *cs_dentry)
{
int err;
@@ -1736,6 +1782,10 @@ static int cpuset_populate_dir(struct de
return err;
if ((err = cpuset_add_file(cs_dentry, &cft_memory_pressure)) < 0)
return err;
+ if ((err = cpuset_add_file(cs_dentry, &cft_spread_page)) < 0)
+ return err;
+ if ((err = cpuset_add_file(cs_dentry, &cft_spread_slab)) < 0)
+ return err;
if ((err = cpuset_add_file(cs_dentry, &cft_tasks)) < 0)
return err;
return 0;
@@ -1764,6 +1814,10 @@ static long cpuset_create(struct cpuset
cs->flags = 0;
if (notify_on_release(parent))
set_bit(CS_NOTIFY_ON_RELEASE, &cs->flags);
+ if (is_spread_page(parent))
+ set_bit(CS_SPREAD_PAGE, &cs->flags);
+ if (is_spread_slab(parent))
+ set_bit(CS_SPREAD_SLAB, &cs->flags);
cs->cpus_allowed = CPU_MASK_NONE;
cs->mems_allowed = NODE_MASK_NONE;
atomic_set(&cs->count, 0);
@@ -2168,6 +2222,44 @@ void cpuset_unlock(void)
}

/**
+ * cpuset_mem_spread_node() - On which node to begin search for a page
+ *
+ * If a task is marked PF_SPREAD_PAGE or PF_SPREAD_SLAB (as for
+ * tasks in a cpuset with is_spread_page or is_spread_slab set),
+ * and if the memory allocation used cpuset_mem_spread_node()
+ * to determine on which node to start looking, as it will for
+ * certain page cache or slab cache pages such as used for file
+ * system buffers and inode caches, then instead of starting on the
+ * local node to look for a free page, rather spread the starting
+ * node around the tasks mems_allowed nodes.
+ *
+ * We don't have to worry about the returned node being offline
+ * because "it can't happen", and even if it did, it would be ok.
+ *
+ * The routines calling guarantee_online_mems() are careful to
+ * only set nodes in task->mems_allowed that are online. So it
+ * should not be possible for the following code to return an
+ * offline node. But if it did, that would be ok, as this routine
+ * is not returning the node where the allocation must be, only
+ * the node where the search should start. The zonelist passed to
+ * __alloc_pages() will include all nodes. If the slab allocator
+ * is passed an offline node, it will fall back to the local node.
+ * See kmem_cache_alloc_node().
+ */
+
+int cpuset_mem_spread_node(void)
+{
+ int node;
+
+ node = next_node(current->cpuset_mem_spread_rotor, current->mems_allowed);
+ if (node == MAX_NUMNODES)
+ node = first_node(current->mems_allowed);
+ current->cpuset_mem_spread_rotor = node;
+ return node;
+}
+EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
+
+/**
* cpuset_excl_nodes_overlap - Do we overlap @p's mem_exclusive ancestors?
* @p: pointer to task_struct of some other task.
*

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2006-02-09 18:55:58

by Paul Jackson

Subject: [PATCH v2 05/07] cpuset memory spread slab cache implementation

From: Paul Jackson <[email protected]>

Provide the slab cache infrastructure to support cpuset memory
spreading.

See the previous patches, cpuset_mem_spread, for an explanation
of cpuset memory spreading.

This patch provides a slab cache SLAB_MEM_SPREAD flag. If set
in the kmem_cache_create() call defining a slab cache, then
any task marked with the process state flag PF_SPREAD_SLAB will
spread memory page allocations for that cache over all the
allowed nodes, instead of preferring the local (faulting) node.

On systems not configured with CONFIG_NUMA, this results in no
change to the page allocation code path for slab caches.

On systems with cpusets configured in the kernel, but the
"memory_spread_slab" cpuset option not enabled for the current
task's cpuset, this adds an inline call to a cpuset routine whose
bit test of the process state flag PF_SPREAD_SLAB fails.

For tasks so marked, a second inline test is done for the
slab cache flag SLAB_MEM_SPREAD, and if that is set and if
the allocation is not in_interrupt(), this adds a call to a
cpuset routine that computes which of the task's mems_allowed
nodes should be preferred for this allocation.

==> This patch adds another hook into the performance-critical
code path for allocating objects from the slab cache, in the
____cache_alloc() chunk below. The next patch optimizes this
hook, reducing the impact of the combined mempolicy plus memory
spreading hooks on this critical code path to a single check
against the task's task_struct flags word.

This patch provides the generic slab flags and logic needed to
apply memory spreading to a particular slab.

A subsequent patch will mark a few specific slab caches for this
placement policy.
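
For example, patch 07 of this series marks the dentry cache this way
(excerpted from that patch; the same flag is added there to
inode_cache, buffer_head and some xfs caches):

        dentry_cache = kmem_cache_create("dentry_cache",
                                         sizeof(struct dentry),
                                         0,
                                         (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
                                          SLAB_MEM_SPREAD),
                                         NULL, NULL);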

Signed-off-by: Paul Jackson

---

include/linux/slab.h | 1 +
mm/slab.c | 13 +++++++++++--
2 files changed, 12 insertions(+), 2 deletions(-)

--- v2.6.16-rc2.orig/include/linux/slab.h 2006-02-08 22:46:56.000000000 -0800
+++ v2.6.16-rc2/include/linux/slab.h 2006-02-08 22:47:23.000000000 -0800
@@ -47,6 +47,7 @@ typedef struct kmem_cache kmem_cache_t;
what is reclaimable later*/
#define SLAB_PANIC 0x00040000UL /* panic if kmem_cache_create() fails */
#define SLAB_DESTROY_BY_RCU 0x00080000UL /* defer freeing pages to RCU */
+#define SLAB_MEM_SPREAD 0x00100000UL /* Spread some memory over cpuset */

/* flags passed to a constructor func */
#define SLAB_CTOR_CONSTRUCTOR 0x001UL /* if not set, then deconstructor */
--- v2.6.16-rc2.orig/mm/slab.c 2006-02-08 22:46:56.000000000 -0800
+++ v2.6.16-rc2/mm/slab.c 2006-02-08 22:47:23.000000000 -0800
@@ -94,6 +94,7 @@
#include <linux/interrupt.h>
#include <linux/init.h>
#include <linux/compiler.h>
+#include <linux/cpuset.h>
#include <linux/seq_file.h>
#include <linux/notifier.h>
#include <linux/kallsyms.h>
@@ -173,12 +174,12 @@
SLAB_NO_REAP | SLAB_CACHE_DMA | \
SLAB_MUST_HWCACHE_ALIGN | SLAB_STORE_USER | \
SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | \
- SLAB_DESTROY_BY_RCU)
+ SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD)
#else
# define CREATE_MASK (SLAB_HWCACHE_ALIGN | SLAB_NO_REAP | \
SLAB_CACHE_DMA | SLAB_MUST_HWCACHE_ALIGN | \
SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | \
- SLAB_DESTROY_BY_RCU)
+ SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD)
#endif

/*
@@ -2759,6 +2760,14 @@ static inline void *____cache_alloc(stru
if (nid != numa_node_id())
return __cache_alloc_node(cachep, flags, nid);
}
+ if (unlikely(cpuset_do_slab_mem_spread() &&
+ (cachep->flags & SLAB_MEM_SPREAD) &&
+ !in_interrupt())) {
+ int nid = cpuset_mem_spread_node();
+
+ if (nid != numa_node_id())
+ return __cache_alloc_node(cachep, flags, nid);
+ }
#endif

check_irq_off();

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2006-02-09 18:55:58

by Paul Jackson

Subject: [PATCH v2 06/07] cpuset memory spread slab cache optimizations

From: Paul Jackson <[email protected]>

The hooks in the slab cache allocator code path for support of
NUMA mempolicies and cpuset memory spreading are in an important
code path. Many systems will use neither feature.

This patch optimizes those hooks down to a single check of
some bits in the current task's task_struct flags. For non-NUMA
systems, this hook and related code are already ifdef'd out.

The optimization is done by using another task flag, set if the
task is using a non-default NUMA mempolicy. Taking this flag
bit along with the PF_SPREAD_PAGE and PF_SPREAD_SLAB flag bits
added earlier in this 'cpuset memory spreading' patch set, one
can check for the combination of any of these special-case
memory placement mechanisms with a single test of the current
task's task_struct flags.
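
That single test is the check this patch puts in ____cache_alloc()
(excerpted from the hunk below):

        if (unlikely(current->flags & (PF_SPREAD_PAGE | PF_SPREAD_SLAB |
                                        PF_MEMPOLICY))) {
                objp = alternate_node_alloc(cachep, flags);
                if (objp != NULL)
                        return objp;
        }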

This patch also tightens up the code, to save a few bytes of
kernel text space, and moves some of it out of line. Due to
the nested inlines called from multiple places, we were ending
up with three copies of this code, which once we get off the
main code path (for local node allocation) seems a bit wasteful
of instruction memory.

Signed-off-by: Paul Jackson <[email protected]>

---

include/linux/mempolicy.h | 5 +++++
include/linux/sched.h | 1 +
kernel/fork.c | 1 +
mm/mempolicy.c | 32 ++++++++++++++++++++++++++++++++
mm/slab.c | 41 ++++++++++++++++++++++++++++-------------
5 files changed, 67 insertions(+), 13 deletions(-)

--- 2.6.16-rc2-mm1.orig/kernel/fork.c 2006-02-09 00:09:17.785558470 -0800
+++ 2.6.16-rc2-mm1/kernel/fork.c 2006-02-09 01:38:12.885936624 -0800
@@ -1018,6 +1018,7 @@ static task_t *copy_process(unsigned lon
p->mempolicy = NULL;
goto bad_fork_cleanup_cpuset;
}
+ mpol_fix_fork_child_flag(p);
#endif

#ifdef CONFIG_DEBUG_MUTEXES
--- 2.6.16-rc2-mm1.orig/mm/mempolicy.c 2006-02-09 00:09:17.785558470 -0800
+++ 2.6.16-rc2-mm1/mm/mempolicy.c 2006-02-09 00:29:45.204510838 -0800
@@ -411,6 +411,37 @@ static int contextualize_policy(int mode
return mpol_check_policy(mode, nodes);
}

+
+/*
+ * Update task->flags PF_MEMPOLICY bit: set iff non-default
+ * mempolicy. Allows more rapid checking of this (combined perhaps
+ * with other PF_* flag bits) on memory allocation hot code paths.
+ *
+ * If called from outside this file, the task 'p' should -only- be
+ * a newly forked child not yet visible on the task list, because
+ * manipulating the task flags of a visible task is not safe.
+ *
+ * The above limitation is why this routine has the funny name
+ * mpol_fix_fork_child_flag().
+ *
+ * It is also safe to call this with a task pointer of current,
+ * which the static wrapper mpol_set_task_struct_flag() does,
+ * for use within this file.
+ */
+
+void mpol_fix_fork_child_flag(struct task_struct *p)
+{
+ if (p->mempolicy)
+ p->flags |= PF_MEMPOLICY;
+ else
+ p->flags &= ~PF_MEMPOLICY;
+}
+
+static void mpol_set_task_struct_flag(void)
+{
+ mpol_fix_fork_child_flag(current);
+}
+
/* Set the process memory policy */
long do_set_mempolicy(int mode, nodemask_t *nodes)
{
@@ -423,6 +454,7 @@ long do_set_mempolicy(int mode, nodemask
return PTR_ERR(new);
mpol_free(current->mempolicy);
current->mempolicy = new;
+ mpol_set_task_struct_flag();
if (new && new->policy == MPOL_INTERLEAVE)
current->il_next = first_node(new->v.nodes);
return 0;
--- 2.6.16-rc2-mm1.orig/include/linux/mempolicy.h 2006-02-09 00:09:17.699619995 -0800
+++ 2.6.16-rc2-mm1/include/linux/mempolicy.h 2006-02-09 00:14:01.868839884 -0800
@@ -147,6 +147,7 @@ extern void mpol_rebind_policy(struct me
extern void mpol_rebind_task(struct task_struct *tsk,
const nodemask_t *new);
extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
+extern void mpol_fix_fork_child_flag(struct task_struct *p);
#define set_cpuset_being_rebound(x) (cpuset_being_rebound = (x))

#ifdef CONFIG_CPUSET
@@ -248,6 +249,10 @@ static inline void mpol_rebind_mm(struct
{
}

+static inline void mpol_fix_fork_child_flag(struct task_struct *p)
+{
+}
+
#define set_cpuset_being_rebound(x) do {} while (0)

static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
--- 2.6.16-rc2-mm1.orig/mm/slab.c 2006-02-09 00:14:01.833683242 -0800
+++ 2.6.16-rc2-mm1/mm/slab.c 2006-02-09 01:45:12.558625414 -0800
@@ -859,6 +859,7 @@ static struct array_cache *alloc_arrayca

#ifdef CONFIG_NUMA
static void *__cache_alloc_node(struct kmem_cache *, gfp_t, int);
+static void *alternate_node_alloc(struct kmem_cache *, gfp_t);

static struct array_cache **alloc_alien_cache(int node, int limit)
{
@@ -2754,19 +2755,11 @@ static inline void *____cache_alloc(stru
struct array_cache *ac;

#ifdef CONFIG_NUMA
- if (unlikely(current->mempolicy && !in_interrupt())) {
- int nid = slab_node(current->mempolicy);
-
- if (nid != numa_node_id())
- return __cache_alloc_node(cachep, flags, nid);
- }
- if (unlikely(cpuset_do_slab_mem_spread() &&
- (cachep->flags & SLAB_MEM_SPREAD) &&
- !in_interrupt())) {
- int nid = cpuset_mem_spread_node();
-
- if (nid != numa_node_id())
- return __cache_alloc_node(cachep, flags, nid);
+ if (unlikely(current->flags & (PF_SPREAD_PAGE | PF_SPREAD_SLAB |
+ PF_MEMPOLICY))) {
+ objp = alternate_node_alloc(cachep, flags);
+ if (objp != NULL)
+ return objp;
}
#endif

@@ -2802,6 +2795,28 @@ static __always_inline void *__cache_all

#ifdef CONFIG_NUMA
/*
+ * Try allocating on another node if PF_SPREAD_PAGE|PF_SPREAD_SLAB|PF_MEMPOLICY.
+ *
+ * If we are in_interrupt, then process context, including cpusets and
+ * mempolicy, may not apply and should not be used for allocation policy.
+ */
+static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
+{
+ int nid_alloc, nid_here;
+
+ if (in_interrupt())
+ return NULL;
+ nid_alloc = nid_here = numa_node_id();
+ if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD))
+ nid_alloc = cpuset_mem_spread_node();
+ else if (current->mempolicy)
+ nid_alloc = slab_node(current->mempolicy);
+ if (nid_alloc != nid_here)
+ return __cache_alloc_node(cachep, flags, nid_alloc);
+ return NULL;
+}
+
+/*
* A interface to enable slab creation on nodeid
*/
static void *__cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
--- 2.6.16-rc2-mm1.orig/include/linux/sched.h 2006-02-09 00:14:01.739932197 -0800
+++ 2.6.16-rc2-mm1/include/linux/sched.h 2006-02-09 01:38:13.631062204 -0800
@@ -938,6 +938,7 @@ static inline void put_task_struct(struc
#define PF_SWAPWRITE 0x01000000 /* Allowed to write to swap */
#define PF_SPREAD_PAGE 0x04000000 /* Spread page cache over cpuset */
#define PF_SPREAD_SLAB 0x08000000 /* Spread some slab caches over cpuset */
+#define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */

/*
* Only the _current_ task can read/write to tsk->flags, but other

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2006-02-09 18:55:13

by Paul Jackson

Subject: [PATCH v2 04/07] cpuset memory spread page cache implementation and hooks

From: Paul Jackson <[email protected]>

Change the page cache allocation calls to support cpuset memory
spreading.

See the previous patch, cpuset_mem_spread, for an explanation
of cpuset memory spreading.

On systems without cpusets configured in the kernel, this results
in no change.

On systems with cpusets configured in the kernel, but the
"memory_spread_page" cpuset option not enabled for the current
task's cpuset, this adds an inline call to a cpuset routine whose
bit test of the process state flag PF_SPREAD_PAGE fails.
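
The bit test in question is the inline added by patch 03 of this
series (excerpted from include/linux/cpuset.h in that patch):

        static inline int cpuset_do_page_mem_spread(void)
        {
                return current->flags & PF_SPREAD_PAGE;
        }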

For tasks in cpusets with "memory_spread_page" enabled, this adds
a call to a cpuset routine that computes which of the task's
mems_allowed nodes should be preferred for this allocation.

If memory spreading applies to a particular allocation, then
any other NUMA mempolicy does not apply.

Signed-off-by: Paul Jackson

---

include/linux/pagemap.h | 5 +++++
mm/filemap.c | 23 +++++++++++++++++++++++
2 files changed, 28 insertions(+)

--- v2.6.16-rc2.orig/include/linux/pagemap.h 2006-02-08 22:46:56.000000000 -0800
+++ v2.6.16-rc2/include/linux/pagemap.h 2006-02-08 22:47:04.000000000 -0800
@@ -51,6 +51,10 @@ static inline void mapping_set_gfp_mask(
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);

+#ifdef CONFIG_NUMA
+extern struct page *page_cache_alloc(struct address_space *x);
+extern struct page *page_cache_alloc_cold(struct address_space *x);
+#else
static inline struct page *page_cache_alloc(struct address_space *x)
{
return alloc_pages(mapping_gfp_mask(x), 0);
@@ -60,6 +64,7 @@ static inline struct page *page_cache_al
{
return alloc_pages(mapping_gfp_mask(x)|__GFP_COLD, 0);
}
+#endif

typedef int filler_t(void *, struct page *);

--- v2.6.16-rc2.orig/mm/filemap.c 2006-02-08 22:46:56.000000000 -0800
+++ v2.6.16-rc2/mm/filemap.c 2006-02-08 22:47:04.000000000 -0800
@@ -30,6 +30,7 @@
#include <linux/blkdev.h>
#include <linux/security.h>
#include <linux/syscalls.h>
+#include <linux/cpuset.h>
#include "filemap.h"
/*
* FIXME: remove all knowledge of the buffer layer from the core VM
@@ -425,6 +426,28 @@ int add_to_page_cache_lru(struct page *p
return ret;
}

+#ifdef CONFIG_NUMA
+struct page *page_cache_alloc(struct address_space *x)
+{
+ if (cpuset_do_page_mem_spread()) {
+ int n = cpuset_mem_spread_node();
+ return alloc_pages_node(n, mapping_gfp_mask(x), 0);
+ }
+ return alloc_pages(mapping_gfp_mask(x), 0);
+}
+EXPORT_SYMBOL(page_cache_alloc);
+
+struct page *page_cache_alloc_cold(struct address_space *x)
+{
+ if (cpuset_do_page_mem_spread()) {
+ int n = cpuset_mem_spread_node();
+ return alloc_pages_node(n, mapping_gfp_mask(x)|__GFP_COLD, 0);
+ }
+ return alloc_pages(mapping_gfp_mask(x)|__GFP_COLD, 0);
+}
+EXPORT_SYMBOL(page_cache_alloc_cold);
+#endif
+
/*
* In order to wait for pages to become available there must be
* waitqueues associated with pages. By using a hash table of

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2006-02-09 19:23:57

by Michael Buesch

Subject: Re: [PATCH v2 02/07] cpuset use combined atomic_inc_return calls

On Thursday 09 February 2006 19:54, you wrote:
> @@ -1770,8 +1769,7 @@ static long cpuset_create(struct cpuset
> atomic_set(&cs->count, 0);
> INIT_LIST_HEAD(&cs->sibling);
> INIT_LIST_HEAD(&cs->children);
> - atomic_inc(&cpuset_mems_generation);
> - cs->mems_generation = atomic_read(&cpuset_mems_generation);
> + cs->mems_generation = atomic_inc_return(&cpuset_mems_generation);
> fmeter_init(&cs->fmeter);
>
> cs->parent = parent;
> @@ -1861,7 +1859,7 @@ int __init cpuset_init_early(void)
> struct task_struct *tsk = current;
>
> tsk->cpuset = &top_cpuset;
> - tsk->cpuset->mems_generation = atomic_read(&cpuset_mems_generation);
> + tsk->cpuset->mems_generation = atomic_inc_return(&cpuset_mems_generation);
> return 0;
> }

Is this hunk really correct? I did not look into the code, but from
the patch context it seems like it adds an inc here.

--
Greetings Michael.



2006-02-09 21:08:16

by Paul Jackson

Subject: Re: [PATCH v2 02/07] cpuset use combined atomic_inc_return calls

Michael wrote:
> Is this hunk really correct? I did not look into the code, but from
> the patch context it seems like it adds an inc here.

You have sharp eyes.

It doesn't matter much either way whether we increment or not here.

That's because this is in cpuset_init_early(), which is the -very- first
use of the global cpuset_mems_generation counter. All other uses must
increment, so that they don't reuse a generation value someone else
used. But we're first, so no possibility of reuse.

We can start with the first value, zero (0), or we can increment and
use one (1).

I changed it to atomic_inc_return() just to be consistent with all the
other locations that read this. That way, if anyone else ever has to
get a value of cpuset_mems_generation and looks around to see how to do
it, they will see that it is always done using atomic_inc_return(), and
probably do it the same way, with minimum risk of confusion.

Just avoiding gratuitous differences in code that don't have a good
reason.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401