Adds a per-cpuset `memory_slab_hardwall' flag.
The slab allocator interface for determining whether an object is allowed
is

        int current_cpuset_object_allowed(int node, gfp_t flags)
This returns non-zero when the object is allowed, either because
current's cpuset does not have memory_slab_hardwall enabled or because
it allows allocation on the node. Otherwise, it returns zero.
This interface is lockless because a task's cpuset can always be safely
dereferenced atomically.
For slab, if the physical node id of the cpu cache is not from an
allowable node, the allocation will fail. If an allocation is targeted
for a node that is not allowed, we allocate from an appropriate one
instead of failing.
For slob, if the page from the slob list is not from an allowable node,
we continue to scan for an appropriate slab. If none can be used, a new
slab is allocated.
For slub, if the cpu slab is not from an allowable node, the partial list
is scanned for a replacement. If none can be used, a new slab is
allocated.
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Matt Mackall <[email protected]>
Cc: Paul Menage <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/cgroups/cpusets.txt | 54 ++++++++++++++++++++++++-------------
include/linux/cpuset.h | 6 ++++
kernel/cpuset.c | 34 +++++++++++++++++++++++
mm/slab.c | 4 +++
mm/slob.c | 6 +++-
mm/slub.c | 12 +++++---
6 files changed, 91 insertions(+), 25 deletions(-)
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -14,20 +14,21 @@ CONTENTS:
=========
1. Cpusets
- 1.1 What are cpusets ?
- 1.2 Why are cpusets needed ?
- 1.3 How are cpusets implemented ?
- 1.4 What are exclusive cpusets ?
- 1.5 What is memory_pressure ?
- 1.6 What is memory spread ?
- 1.7 What is sched_load_balance ?
- 1.8 What is sched_relax_domain_level ?
- 1.9 How do I use cpusets ?
+ 1.1 What are cpusets ?
+ 1.2 Why are cpusets needed ?
+ 1.3 How are cpusets implemented ?
+ 1.4 What are exclusive cpusets ?
+ 1.5 What is memory_pressure ?
+ 1.6 What is memory spread ?
+ 1.7 What is sched_load_balance ?
+ 1.8 What is sched_relax_domain_level ?
+ 1.9 What is memory_slab_hardwall ?
+ 1.10 How do I use cpusets ?
2. Usage Examples and Syntax
- 2.1 Basic Usage
- 2.2 Adding/removing cpus
- 2.3 Setting flags
- 2.4 Attaching processes
+ 2.1 Basic Usage
+ 2.2 Adding/removing cpus
+ 2.3 Setting flags
+ 2.4 Attaching processes
3. Questions
4. Contact
@@ -581,8 +582,22 @@ If your situation is:
then increasing 'sched_relax_domain_level' would benefit you.
-1.9 How do I use cpusets ?
---------------------------
+1.9 What is memory_slab_hardwall ?
+----------------------------------
+
+A cpuset may require that slab object allocations all originate from
+its set of mems, either for memory isolation or NUMA optimizations. Slab
+allocators normally optimize allocations in the fastpath by returning
+objects from a cpu slab. These objects do not necessarily originate from
+slabs allocated on a cpuset's mems.
+
+When memory_slab_hardwall is set, all objects are allocated from slabs on
+the cpuset's set of mems. This may incur a performance penalty if the
+cpu slab must be swapped for a different slab.
+
+
+1.10 How do I use cpusets ?
+---------------------------
In order to minimize the impact of cpusets on critical kernel
code, such as the scheduler, and due to the fact that the kernel
@@ -725,10 +740,11 @@ Now you want to do something with this cpuset.
In this directory you can find several files:
# ls
-cpu_exclusive memory_migrate mems tasks
-cpus memory_pressure notify_on_release
-mem_exclusive memory_spread_page sched_load_balance
-mem_hardwall memory_spread_slab sched_relax_domain_level
+cpu_exclusive memory_pressure notify_on_release
+cpus memory_slab_hardwall sched_load_balance
+mem_exclusive memory_spread_page sched_relax_domain_level
+mem_hardwall memory_spread_slab tasks
+memory_migrate mems
Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -87,6 +87,7 @@ static inline int cpuset_do_slab_mem_spread(void)
}
extern int current_cpuset_is_being_rebound(void);
+extern int current_cpuset_object_allowed(int node, gfp_t flags);
extern void rebuild_sched_domains(void);
@@ -179,6 +180,11 @@ static inline int current_cpuset_is_being_rebound(void)
return 0;
}
+static inline int current_cpuset_object_allowed(int node, gfp_t flags)
+{
+ return 1;
+}
+
static inline void rebuild_sched_domains(void)
{
partition_sched_domains(1, NULL, NULL);
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -142,6 +142,7 @@ typedef enum {
CS_SCHED_LOAD_BALANCE,
CS_SPREAD_PAGE,
CS_SPREAD_SLAB,
+ CS_SLAB_HARDWALL,
} cpuset_flagbits_t;
/* convenient tests for these bits */
@@ -180,6 +181,11 @@ static inline int is_spread_slab(const struct cpuset *cs)
return test_bit(CS_SPREAD_SLAB, &cs->flags);
}
+static inline int is_slab_hardwall(const struct cpuset *cs)
+{
+ return test_bit(CS_SLAB_HARDWALL, &cs->flags);
+}
+
/*
* Increment this integer everytime any cpuset changes its
* mems_allowed value. Users of cpusets can track this generation
@@ -1190,6 +1196,19 @@ int current_cpuset_is_being_rebound(void)
return task_cs(current) == cpuset_being_rebound;
}
+/**
+ * current_cpuset_object_allowed - can a slab object be allocated on a node?
+ * @node: the node for object allocation
+ * @flags: allocation flags
+ *
+ * Return non-zero if object is allowed, zero otherwise.
+ */
+int current_cpuset_object_allowed(int node, gfp_t flags)
+{
+ return !is_slab_hardwall(task_cs(current)) ||
+ cpuset_node_allowed_hardwall(node, flags);
+}
+
static int update_relax_domain_level(struct cpuset *cs, s64 val)
{
if (val < -1 || val >= SD_LV_MAX)
@@ -1417,6 +1436,7 @@ typedef enum {
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
+ FILE_SLAB_HARDWALL,
} cpuset_filetype_t;
static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
@@ -1458,6 +1478,9 @@ static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
retval = update_flag(CS_SPREAD_SLAB, cs, val);
cs->mems_generation = cpuset_mems_generation++;
break;
+ case FILE_SLAB_HARDWALL:
+ retval = update_flag(CS_SLAB_HARDWALL, cs, val);
+ break;
default:
retval = -EINVAL;
break;
@@ -1614,6 +1637,8 @@ static u64 cpuset_read_u64(struct cgroup *cont, struct cftype *cft)
return is_spread_page(cs);
case FILE_SPREAD_SLAB:
return is_spread_slab(cs);
+ case FILE_SLAB_HARDWALL:
+ return is_slab_hardwall(cs);
default:
BUG();
}
@@ -1721,6 +1746,13 @@ static struct cftype files[] = {
.write_u64 = cpuset_write_u64,
.private = FILE_SPREAD_SLAB,
},
+
+ {
+ .name = "memory_slab_hardwall",
+ .read_u64 = cpuset_read_u64,
+ .write_u64 = cpuset_write_u64,
+ .private = FILE_SLAB_HARDWALL,
+ },
};
static struct cftype cft_memory_pressure_enabled = {
@@ -1814,6 +1846,8 @@ static struct cgroup_subsys_state *cpuset_create(
set_bit(CS_SPREAD_PAGE, &cs->flags);
if (is_spread_slab(parent))
set_bit(CS_SPREAD_SLAB, &cs->flags);
+ if (is_slab_hardwall(parent))
+ set_bit(CS_SLAB_HARDWALL, &cs->flags);
set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
cpumask_clear(cs->cpus_allowed);
nodes_clear(cs->mems_allowed);
diff --git a/mm/slab.c b/mm/slab.c
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3124,6 +3124,8 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
check_irq_off();
ac = cpu_cache_get(cachep);
+ if (!current_cpuset_object_allowed(numa_node_id(), flags))
+ return NULL;
if (likely(ac->avail)) {
STATS_INC_ALLOCHIT(cachep);
ac->touched = 1;
@@ -3249,6 +3251,8 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
void *obj;
int x;
+ if (!current_cpuset_object_allowed(nodeid, flags))
+ nodeid = cpuset_mem_spread_node();
l3 = cachep->nodelists[nodeid];
BUG_ON(!l3);
diff --git a/mm/slob.c b/mm/slob.c
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -319,14 +319,18 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
spin_lock_irqsave(&slob_lock, flags);
/* Iterate through each partially free page, try to find room */
list_for_each_entry(sp, slob_list, list) {
+ int slab_node = page_to_nid(&sp->page);
+
#ifdef CONFIG_NUMA
/*
* If there's a node specification, search for a partial
* page with a matching node id in the freelist.
*/
- if (node != -1 && page_to_nid(&sp->page) != node)
+ if (node != -1 && slab_node != node)
continue;
#endif
+ if (!current_cpuset_object_allowed(slab_node, gfp))
+ continue;
/* Enough room on this page? */
if (sp->units < SLOB_UNITS(size))
continue;
diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1353,6 +1353,8 @@ static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
struct page *page;
int searchnode = (node == -1) ? numa_node_id() : node;
+ if (!current_cpuset_object_allowed(node, flags))
+ searchnode = cpuset_mem_spread_node();
page = get_partial_node(get_node(s, searchnode));
if (page || (flags & __GFP_THISNODE))
return page;
@@ -1475,15 +1477,15 @@ static void flush_all(struct kmem_cache *s)
/*
* Check if the objects in a per cpu structure fit numa
- * locality expectations.
+ * locality expectations and is allowed in current's cpuset.
*/
-static inline int node_match(struct kmem_cache_cpu *c, int node)
+static inline int check_node(struct kmem_cache_cpu *c, int node, gfp_t flags)
{
#ifdef CONFIG_NUMA
if (node != -1 && c->node != node)
return 0;
#endif
- return 1;
+ return current_cpuset_object_allowed(node, flags);
}
/*
@@ -1517,7 +1519,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto new_slab;
slab_lock(c->page);
- if (unlikely(!node_match(c, node)))
+ if (unlikely(!check_node(c, node, gfpflags)))
goto another_slab;
stat(c, ALLOC_REFILL);
@@ -1604,7 +1606,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
local_irq_save(flags);
c = get_cpu_slab(s, smp_processor_id());
objsize = c->objsize;
- if (unlikely(!c->freelist || !node_match(c, node)))
+ if (unlikely(!c->freelist || !check_node(c, node, gfpflags)))
object = __slab_alloc(s, gfpflags, node, addr, c);
On Sun, Mar 8, 2009 at 9:27 AM, David Rientjes <[email protected]> wrote:
> +/**
> + * current_cpuset_object_allowed - can a slab object be allocated on a node?
> + * @node: the node for object allocation
> + * @flags: allocation flags
> + *
> + * Return non-zero if object is allowed, zero otherwise.
> + */
> +int current_cpuset_object_allowed(int node, gfp_t flags)
> +{
> +       return !is_slab_hardwall(task_cs(current)) ||
> +              cpuset_node_allowed_hardwall(node, flags);
> +}
> +
This should be in rcu_read_lock()/rcu_read_unlock() in order to safely
dereference the result of task_cs(current)
I'll leave the actual memory allocator changes for others to comment on.
Paul
On Sun, 2009-03-08 at 09:27 -0700, David Rientjes wrote:
> Adds a per-cpuset `memory_slab_hardwall' flag.
>
> The slab allocator interface for determining whether an object is allowed
> is
>
> int current_cpuset_object_allowed(int node, gfp_t flags)
>
> This returns non-zero when the object is allowed, either because
> current's cpuset does not have memory_slab_hardwall enabled or because
> it allows allocation on the node. Otherwise, it returns zero.
>
> This interface is lockless because a task's cpuset can always be safely
> dereferenced atomically.
>
> For slab, if the physical node id of the cpu cache is not from an
> allowable node, the allocation will fail. If an allocation is targeted
> for a node that is not allowed, we allocate from an appropriate one
> instead of failing.
>
> For slob, if the page from the slob list is not from an allowable node,
> we continue to scan for an appropriate slab. If none can be used, a new
> slab is allocated.
Looks fine to me, if a little expensive. We'll be needing SLQB support
though.
--
http://selenic.com : development and support for Mercurial and Linux
On Sun, 8 Mar 2009, Paul Menage wrote:
> > +/**
> > + * current_cpuset_object_allowed - can a slab object be allocated on a node?
> > + * @node: the node for object allocation
> > + * @flags: allocation flags
> > + *
> > + * Return non-zero if object is allowed, zero otherwise.
> > + */
> > +int current_cpuset_object_allowed(int node, gfp_t flags)
> > +{
> > +       return !is_slab_hardwall(task_cs(current)) ||
> > +              cpuset_node_allowed_hardwall(node, flags);
> > +}
> > +
>
> This should be in rcu_read_lock()/rcu_read_unlock() in order to safely
> dereference the result of task_cs(current)
>
I've folded the following into the patch, thanks Paul.
---
kernel/cpuset.c | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1205,8 +1205,12 @@ int current_cpuset_is_being_rebound(void)
*/
int current_cpuset_object_allowed(int node, gfp_t flags)
{
- return !is_slab_hardwall(task_cs(current)) ||
- cpuset_node_allowed_hardwall(node, flags);
+ int is_hardwall;
+
+ rcu_read_lock();
+ is_hardwall = is_slab_hardwall(task_cs(current));
+ rcu_read_unlock();
+ return !is_hardwall || cpuset_node_allowed_hardwall(node, flags);
}
static int update_relax_domain_level(struct cpuset *cs, s64 val)
On Sun, 8 Mar 2009, Matt Mackall wrote:
> > For slob, if the page from the slob list is not from an allowable node,
> > we continue to scan for an appropriate slab. If none can be used, a new
> > slab is allocated.
>
> Looks fine to me, if a little expensive.
Is that your acked-by? :)
It's not expensive for cpusets that do not set memory_slab_hardwall, which
is disabled by default, other than some cacheline pollution. If the
option is set, then the performance penalty is described in the
documentation and should be assumed by the user.
We currently have a couple different ways to check for a task's cpuset
options:
- per-task flags such as PF_SPREAD_PAGE and PF_SPREAD_SLAB used in the
hotpath, and
- rcu dereferencing current's cpuset and atomically checking a cpuset
flag bit.
It would be nice to unify these to free up some task flag bits.
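For comparison, the two styles look roughly like the following fragments
(illustrative only, not part of this patch; the helper names are made up,
but the first is essentially what cpuset_do_slab_mem_spread() does and the
second mirrors the rcu-protected task_cs() access inside kernel/cpuset.c):

        /* style 1: per-task flag, a single bit test in the hotpath */
        static inline int check_via_task_flag(void)
        {
                return current->flags & PF_SPREAD_SLAB;
        }

        /* style 2: dereference current's cpuset under rcu and test a
         * cpuset flag bit; heavier, but it needs no task flag bit */
        static inline int check_via_cpuset_flag(void)
        {
                int ret;

                rcu_read_lock();
                ret = is_spread_slab(task_cs(current));
                rcu_read_unlock();
                return ret;
        }

The task flag test is obviously cheaper in the hotpath, but each new
option burns one of the limited PF_* bits.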
> We'll be needing SLQB support
> though.
>
Yeah, I'd like to add the necessary slqb support from Pekka's git tree but
this patch depends on
cpusets-replace-zone-allowed-functions-with-node-allowed.patch in -mm, so
we'll need to know the route by which this should be pushed.
> Adds a per-cpuset `memory_slab_hardwall' flag.
>
> The slab allocator interface for determining whether an object is allowed
> is
>
> int current_cpuset_object_allowed(int node, gfp_t flags)
>
> This returns non-zero when the object is allowed, either because
> current's cpuset does not have memory_slab_hardwall enabled or because
> it allows allocation on the node. Otherwise, it returns zero.
>
> This interface is lockless because a task's cpuset can always be safely
> dereferenced atomically.
>
> For slab, if the physical node id of the cpu cache is not from an
> allowable node, the allocation will fail. If an allocation is targeted
> for a node that is not allowed, we allocate from an appropriate one
> instead of failing.
>
> For slob, if the page from the slob list is not from an allowable node,
> we continue to scan for an appropriate slab. If none can be used, a new
> slab is allocated.
>
> For slub, if the cpu slab is not from an allowable node, the partial list
> is scanned for a replacement. If none can be used, a new slab is
> allocated.
Hmmm,
this description only explains how to implement this,
but not why this patch is useful.
Could you please explain who needs it, and why?
Another thought - it would probably be better to call this flag
kernel_mem_hardwall or mem_hardwall_kernel, to avoid hard-coding its
name to be slab-specific.
Paul
On Sun, Mar 8, 2009 at 2:38 PM, David Rientjes <[email protected]> wrote:
> On Sun, 8 Mar 2009, Paul Menage wrote:
>
>> > +/**
>> > + * current_cpuset_object_allowed - can a slab object be allocated on a node?
>> > + * @node: the node for object allocation
>> > + * @flags: allocation flags
>> > + *
>> > + * Return non-zero if object is allowed, zero otherwise.
>> > + */
>> > +int current_cpuset_object_allowed(int node, gfp_t flags)
>> > +{
>> > +       return !is_slab_hardwall(task_cs(current)) ||
>> > +              cpuset_node_allowed_hardwall(node, flags);
>> > +}
>> > +
>>
>> This should be in rcu_read_lock()/rcu_read_unlock() in order to safely
>> dereference the result of task_cs(current)
>>
>
> I've folded the following into the patch, thanks Paul.
> ---
>  kernel/cpuset.c |    8 ++++++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -1205,8 +1205,12 @@ int current_cpuset_is_being_rebound(void)
>  */
>  int current_cpuset_object_allowed(int node, gfp_t flags)
>  {
> -       return !is_slab_hardwall(task_cs(current)) ||
> -              cpuset_node_allowed_hardwall(node, flags);
> +       int is_hardwall;
> +
> +       rcu_read_lock();
> +       is_hardwall = is_slab_hardwall(task_cs(current));
> +       rcu_read_unlock();
> +       return !is_hardwall || cpuset_node_allowed_hardwall(node, flags);
>  }
>
>  static int update_relax_domain_level(struct cpuset *cs, s64 val)
On Mon, 9 Mar 2009, KOSAKI Motohiro wrote:
> Hmmm,
> this description only explains how to implement this,
> but not why this patch is useful.
>
> Could you please explain who needs it, and why?
>
The change to Documentation/cgroups/cpusets.txt should have explained it.
This is for two cases: true memory isolation (now including slab
allocations at the object level) and NUMA optimizations.
Prior to this change, it was possible for slabs to be allocated in a
cpuset while its objects were largely consumed by disjoint cpusets. We
can fix that by only allocating objects from slabs that are found on
current->mems_allowed. While this incurs a performance penalty, some
users may find that true isolation outweighs the cache optimizations.
It is also helpful for long-lived objects that require NUMA affinity to a
certain cpu or group of cpus. That is, after all, the reasoning behind
cpusets in the first place. If slab objects were all allocated from a
node with remote affinity to the cpus that will be addressing it, it
negates a significant advantage that cpusets provides to the user.
On Mon, 9 Mar 2009, Paul Menage wrote:
> Another thought - it would probably be better to call this flag
> kernel_mem_hardwall or mem_hardwall_kernel, to avoid hard-coding its
> name to be slab-specific.
>
The change only affects slab allocations, it doesn't affect all kernel
memory allocations. With slub, for example, allocations that are larger
than SLUB_MAX_ORDER (formerly PAGE_SIZE) simply use compound pages
from the page allocator where the cpuset memory policy was already
enforced.
While there are a few different options for slab allocators in mainline
and slqb on the way, these are still generally referred to as "slab"
allocations regardless of which one is configured.
Prefixing the name with `memory_' just seemed natural considering the
tunables already in place such as memory_spread_page and
memory_spread_slab. The user also already understands `hardwall' since
mem_hardwall has existed for quite some time.
> On Mon, 9 Mar 2009, KOSAKI Motohiro wrote:
>
> > Hmmm,
> > this description only explains how to implement this,
> > but not why this patch is useful.
> >
> > Could you please explain who needs it, and why?
> >
>
> The change to Documentation/cgroups/cpusets.txt should have explained it.
>
> This is for two cases: true memory isolation (now including slab
> allocations at the object level) and NUMA optimizations.
>
> Prior to this change, it was possible for slabs to be allocated in a
> cpuset while its objects were largely consumed by disjoint cpusets. We
> can fix that by only allocating objects from slabs that are found on
> current->mems_allowed. While this incurs a performance penalty, some
> users may find that true isolation outweighs the cache optimizations.
>
> It is also helpful for long-lived objects that require NUMA affinity to a
> certain cpu or group of cpus. That is, after all, the reasoning behind
> cpusets in the first place. If slab objects were all allocated from a
> node with remote affinity to the cpus that will be addressing it, it
> negates a significant advantage that cpusets provides to the user.
My question means: why does anyone need this isolation?
Your patch inserts a new branch into the hotpath,
so it makes the hotpath a bit slower even for users who don't use this feature.
Typically, slab caches don't need strict node binding because
inodes/dentries are touched from multiple cpus.
In addition, on large NUMA systems, the slab cache is relatively small
compared to the page cache, so this feature's improvement seems relatively small too.
If you have a strong reason, I don't oppose this proposal,
but I don't think your explanation gives a convincing enough reason.
Btw, have you seen the "immediate values" patch series? I think it
could make the patch zero cost for non-cpuset users.
After that patch is merged, I won't oppose this patch even though
your reasoning isn't strong.
Again these are fastpath modifications.
Scanning the partial list for matching nodes is an expensive operation.
Adding RCU into the fast paths is also another big worry.
On Mon, 9 Mar 2009, Christoph Lameter wrote:
> Again these are fastpath modifications.
>
The nature of the change requires the logic to be placed in the fastpath
to determine whether a cpu slab's node is allowed by the allocating task's
cpuset.
You have previously stated that you would prefer that this feature be
tunable from userspace. This patch adds the `memory_slab_hardwall' cpuset
flag which defaults to off.
> Scanning the partial list for matching nodes is an expensive operation.
>
It depends on how long you scan for a matching node, but again: this
should be assumed by the user if the option has been enabled.
> Adding RCU into the fast paths is also another big worry.
>
This could be mitigated by adding a PF_SLAB_HARDWALL flag similar to
PF_SPREAD_PAGE and PF_SPREAD_SLAB. I'd prefer not to add additional
cpuset-specific task flags, but this would address your concern.
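If we went that route, the flag would presumably be kept in sync from
kernel/cpuset.c much as the existing spread flags are. A minimal sketch
only; PF_SLAB_HARDWALL does not exist yet and the helper name below is
made up:

        /*
         * Sketch: refresh a task's hypothetical PF_SLAB_HARDWALL bit
         * from its cpuset, analogous to how PF_SPREAD_PAGE and
         * PF_SPREAD_SLAB are refreshed when a task's cpuset state is
         * updated.
         */
        static void cpuset_update_task_slab_hardwall(struct task_struct *tsk,
                                                     const struct cpuset *cs)
        {
                if (is_slab_hardwall(cs))
                        tsk->flags |= PF_SLAB_HARDWALL;
                else
                        tsk->flags &= ~PF_SLAB_HARDWALL;
        }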
On Mon, 9 Mar 2009, KOSAKI Motohiro wrote:
> My question means: why does anyone need this isolation?
> Your patch inserts a new branch into the hotpath,
> so it makes the hotpath a bit slower even for users who don't use this feature.
>
On large NUMA machines, it is currently possible for a very large
percentage (if not all) of your slab allocations to come from memory that
is distant from your application's set of allowable cpus. Such
allocations that are long-lived would benefit from having affinity to
those processors. Again, this is the typical use case for cpusets: to
bind memory nodes to groups of cpus with affinity to it for the tasks
attached to the cpuset.
> Typically, slab caches don't need strict node binding because
> inodes/dentries are touched from multiple cpus.
>
This change would obviously require inode and dentry objects to originate
from a node on the cpuset's set of mems_allowed. That would incur a
performance penalty if the cpu slab is not from such a node, but that is
assumed by the user who has enabled the option.
> In addition, on large NUMA systems, the slab cache is relatively small
> compared to the page cache, so this feature's improvement seems relatively small too.
>
That's irrelevant; large NUMA machines may still require memory affinity
to a specific group of cpus, the size of the global slab cache isn't
important if that's the goal. When the option is enabled for cpusets
that require that memory locality, we happily trade off partial list
fragmentation and increased slab allocations for the long-lived local
allocations.
On Mon, 9 Mar 2009, David Rientjes wrote:
> On Mon, 9 Mar 2009, KOSAKI Motohiro wrote:
>
> > My question means: why does anyone need this isolation?
> > Your patch inserts a new branch into the hotpath,
> > so it makes the hotpath a bit slower even for users who don't use this feature.
> On large NUMA machines, it is currently possible for a very large
> percentage (if not all) of your slab allocations to come from memory that
> is distant from your application's set of allowable cpus. Such
> allocations that are long-lived would benefit from having affinity to
> those processors. Again, this is the typical use case for cpusets: to
> bind memory nodes to groups of cpus with affinity to it for the tasks
> attached to the cpuset.
Can you show us a real workload that suffers from this issue?
If you want to make sure that an allocation comes from a certain node then
specifying the node in kmalloc_node() will give you what you want.
> > Typically, slab caches don't need strict node binding because
> > inodes/dentries are touched from multiple cpus.
> This change would obviously require inode and dentry objects to originate
> from a node on the cpuset's set of mems_allowed. That would incur a
> performance penalty if the cpu slab is not from such a node, but that is
> assumed by the user who has enabled the option.
The usage of kernel objects may not be cpuset specific. This is true for
objects other than inodes and dentries as well.
> > In addition, on large NUMA systems, the slab cache is relatively small
> > compared to the page cache, so this feature's improvement seems relatively small too.
> That's irrelevant; large NUMA machines may still require memory affinity
> to a specific group of cpus, the size of the global slab cache isn't
> important if that's the goal. When the option is enabled for cpusets
> that require that memory locality, we happily trade off partial list
> fragmentation and increased slab allocations for the long-lived local
> allocations.
Other memory may spill over too. F.e. two processes from disjunct cpu sets
cause faults in the same address range (it's rather common for this to
happen to glibc code f.e.). Two processes may use another kernel feature
that buffers objects (are you going to want to search the LRU lists for objects
from the right node?)
NUMA affinity is there in the large picture. In detail, the allocation
strategies over nodes etc. may be disturbed by this and that, in
particular if processes with disjoint cpusets run on the same processor.
Just don't do it. Dedicate a cpu to a cpuset. Overlapping cpusets can cause
other strange things as well.
On Mon, 9 Mar 2009, Christoph Lameter wrote:
> > On large NUMA machines, it is currently possible for a very large
> > percentage (if not all) of your slab allocations to come from memory that
> > is distant from your application's set of allowable cpus. Such
> > allocations that are long-lived would benefit from having affinity to
> > those processors. Again, this is the typical use case for cpusets: to
> > bind memory nodes to groups of cpus with affinity to it for the tasks
> > attached to the cpuset.
>
> Can you show us a real workload that suffers from this issue?
>
We're more interested in the isolation characteristic, but that also
benefits large NUMA machines by keeping nodes free of egregious amounts of
slab allocated for remote cpus.
> If you want to make sure that an allocation comes from a certain node then
> specifying the node in kmalloc_node() will give you what you want.
>
That's essentially what the change does implicitly: it changes all
kmalloc() calls to kmalloc_node() for current->mems_allowed.
> > This change would obviously require inode and dentry objects to originate
> > from a node on the cpuset's set of mems_allowed. That would incur a
> > performance penalty if the cpu slab is not from such a node, but that is
> > assumed by the user who has enabled the option.
>
> The usage of kernel objects may not be cpuset specific. This is true for
> objects other than inodes and dentries as well.
>
Yes, and that's why we require the cpuset hardwall on a configurable
per-cpuset basis. If a cpuset has set this option for its workload, then
it is demanding object allocations from local memory. Other cpusets that
do not have memory_slab_hardwall set can still allocate from any cpu slab
or partial slab, including those allocated for the hardwall cpuset.
> Other memory may spill over too. F.e. two processes from disjunct cpu sets
> cause faults in the same address range (it's rather common for this to
> happen to glibc code f.e.). Two processes may use another kernel feature
> that buffers objects (are you going to want to search the LRU lists for objects
> from the right node?)
>
If a workload is demanding node local object allocation, then an object
buffer probably isn't in its best interest if the objects are not all from nodes
with affinity.
> NUMA affinity is there in the large picture.
It depends heavily on the allocation and freeing pattern; it is quite
possible that NUMA affinity will never be realized through slub if all
slabs are consistently allocated on a single node just because we get an
alloc when the current cpu slab must be replaced.
> On Mon, 9 Mar 2009, Christoph Lameter wrote:
>
> > Again these are fastpath modifications.
> >
>
> The nature of the change requires the logic to be placed in the fastpath
> to determine whether a cpu slab's node is allowed by the allocating task's
> cpuset.
>
> You have previously stated that you would prefer that this feature be
> tunable from userspace. This patch adds the `memory_slab_hardwall' cpuset
> flag which defaults to off.
That's pointless.
Again, any fastpath modification should have a good reason.
We are looking for your explanation.
I have 6+ years of experience in each of the embedded, HPC, and high-end server areas,
but I haven't heard of this requirement. I still can't imagine who would use this feature.
On Tue, 10 Mar 2009, KOSAKI Motohiro wrote:
> That's pointless.
> Again, any fastpath modification should have a good reason.
> We are looking for your explanation.
>
The fastpath modification simply checks if the hardwall bit is set in the
allocating task's cpuset flags. If it's disabled, there is no additional
overhead.
This requirement was mandated during the first review of the patch by
Christoph, who requested that it be configurable. Before that it was
possible to simply check if the global `number_of_cpusets' count was > 1.
If not, cpuset_node_allowed_hardwall() would always return true. If the
system had more than one cpuset, it would have reduced to checking

        return in_interrupt() || (gfp_mask & __GFP_THISNODE) ||
                node_isset(node, current->mems_allowed);
As I already mentioned, if fastpath optimization is your only concern,
we could simply add a PF_SLAB_HARDWALL task flag that would reduce this
to

        return current->flags & PF_SLAB_HARDWALL;
So the fastpath cost can be mitigated at the expense of an additional task
flag.
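To make that concrete, a rough sketch of the allocator-side check (not
code from the posted patch, since PF_SLAB_HARDWALL is still hypothetical):

        /*
         * Sketch: with a PF_SLAB_HARDWALL task flag, the common
         * (non-hardwall) case is a single bit test on current->flags and
         * the rcu dereference of current's cpuset leaves the fastpath.
         */
        int current_cpuset_object_allowed(int node, gfp_t flags)
        {
                if (likely(!(current->flags & PF_SLAB_HARDWALL)))
                        return 1;
                return cpuset_node_allowed_hardwall(node, flags);
        }

The cost for tasks outside a memory_slab_hardwall cpuset then stays at a
single, likely-predicted branch.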
On Mon, 2009-03-09 at 19:01 -0700, David Rientjes wrote:
> On Tue, 10 Mar 2009, KOSAKI Motohiro wrote:
>
> > That's pointless.
> > Again, any fastpath modification should have a good reason.
> > We are looking for your explanation.
> >
>
> The fastpath modification simply checks if the hardwall bit is set in the
> allocating task's cpuset flags. If it's disabled, there is no additional
> overhead.
Ok, I for one understand perfectly the desire for this feature.
But we are still extremely sensitive to adding potential branches to one
of the most important fast-paths in the kernel, especially for a feature
with a fairly narrow use case. We've invested an awful lot of time into
micro-optimizing SLAB (by rewriting it as SLUB/SLQB) so any steps
backward at this stage are cause for concern. Also, remember 99%+ of
users will never care about this feature.
For SLOB, I think the code is fine as it stands, but we probably want to
be a bit more clever for the others. At the very minimum, we'd like this
to be in an unlikely path. Better still if the initial test can somehow
be hidden with another test. It might also be possible to use the
patching code used by markers to enable the path only when one or more
tasks needs it.
--
http://selenic.com : development and support for Mercurial and Linux
On Mon, 9 Mar 2009, Matt Mackall wrote:
> But we are still extremely sensitive to adding potential branches to one
> of the most important fast-paths in the kernel, especially for a feature
> with a fairly narrow use case. We've invested an awful lot of time into
> micro-optimizing SLAB (by rewriting it as SLUB/SLQB) so any steps
> backward at this stage are cause for concern. Also, remember 99%+ of
> users will never care about this feature.
>
My latest proposal simply checks for !(current->flags & PF_SLAB_HARDWALL)
before determining whether the set of allowable nodes needs to be checked.
For slub, this is in addition to the preexisting logic that checks whether
the object can be from any node (node == -1) in slab_alloc() or the cpu
slab is from the node requested for kmalloc_node() users for CONFIG_NUMA
kernels.
You could argue that, in the slub example, check_node() should do this:
static inline int check_node(struct kmem_cache_cpu *c, int node,
                             gfp_t flags)
{
#ifdef CONFIG_NUMA
        if (node != -1 && c->node != node)
                return 0;
        if (likely(!(current->flags & PF_SLAB_HARDWALL)))
                return 1;
#endif
        return current_cpuset_object_allowed(node, flags);
}
This would, however, penalize the case where current's cpuset has
memory_slab_hardwall enabled but the cpu slab is still allowed because it
originated from current->mems_allowed.
If checking for the PF_SLAB_HARDWALL bit in current->flags really is
unacceptable in my latest proposal, then a viable solution probably
doesn't exist for such workloads that want hardwall object allocations.
On Mon, 9 Mar 2009, David Rientjes wrote:
> On Mon, 9 Mar 2009, Christoph Lameter wrote:
>
> > > On large NUMA machines, it is currently possible for a very large
> > > percentage (if not all) of your slab allocations to come from memory that
> > > is distant from your application's set of allowable cpus. Such
> > > allocations that are long-lived would benefit from having affinity to
> > > those processors. Again, this is the typical use case for cpusets: to
> > > bind memory nodes to groups of cpus with affinity to it for the tasks
> > > attached to the cpuset.
> >
> > Can you show us a real workload that suffers from this issue?
> >
>
> We're more interested in the isolation characteristic, but that also
> benefits large NUMA machines by keeping nodes free of egregious amounts of
> slab allocated for remote cpus.
So no real workload just some isolation idea.
> > If you want to make sure that an allocation comes from a certain node then
> > specifying the node in kmalloc_node() will give you what you want.
> >
>
> That's essentially what the change does implicitly: it changes all
> kmalloc() calls to kmalloc_node() for current->mems_allowed.
Ok then you can use kmalloc_node?
> > The usage of kernel objects may not be cpuset specific. This is true for
> > objects other than inodes and dentries as well.
> >
>
> Yes, and that's why we require the cpuset hardwall on a configurable
> per-cpuset basis. If a cpuset has set this option for its workload, then
> it is demanding object allocations from local memory. Other cpusets that
> do not have memory_slab_hardwall set can still allocate from any cpu slab
> or partial slab, including those allocated for the hardwall cpuset.
You cannot hardwall something that is used in a shared way by processes in
multiple cpusets.
On Tue, 2009-03-10 at 16:50 -0400, Christoph Lameter wrote:
> On Mon, 9 Mar 2009, David Rientjes wrote:
>
> > On Mon, 9 Mar 2009, Christoph Lameter wrote:
> >
> > > > On large NUMA machines, it is currently possible for a very large
> > > > percentage (if not all) of your slab allocations to come from memory that
> > > > is distant from your application's set of allowable cpus. Such
> > > > allocations that are long-lived would benefit from having affinity to
> > > > those processors. Again, this is the typical use case for cpusets: to
> > > > bind memory nodes to groups of cpus with affinity to it for the tasks
> > > > attached to the cpuset.
> > >
> > > Can you show us a real workload that suffers from this issue?
> > >
> >
> > We're more interested in the isolation characteristic, but that also
> > benefits large NUMA machines by keeping nodes free of egregious amounts of
> > slab allocated for remote cpus.
>
> So no real workload just some isolation idea.
>
> > > If you want to make sure that an allocation comes from a certain node then
> > > specifying the node in kmalloc_node() will give you what you want.
> > >
> >
> > That's essentially what the change does implicitly: it changes all
> > kmalloc() calls to kmalloc_node() for current->mems_allowed.
>
> Ok then you can use kmalloc_node?
Yes, he certainly could change every single kmalloc that a process might
ever reach to kmalloc_node. But I don't think that's optimal.
>
> > > The usage of kernel objects may not be cpuset specific. This is true for
> > > objects other than inodes and dentries as well.
> > >
> >
> > Yes, and that's why we require the cpuset hardwall on a configurable
> > per-cpuset basis. If a cpuset has set this option for its workload, then
> > it is demanding object allocations from local memory. Other cpusets that
> > do not have memory_slab_hardwall set can still allocate from any cpu slab
> > or partial slab, including those allocated for the hardwall cpuset.
>
> You cannot hardwall something that is used in a shared way by processes in
> multiple cpusets.
He can enforce that every allocation made when a given task is current
conforms. His patch demonstrates that.
--
http://selenic.com : development and support for Mercurial and Linux
On Tue, Mar 10, 2009 at 1:50 PM, Christoph Lameter
<[email protected]> wrote:
>
> So no real workload just some isolation idea.
We definitely have real workloads where a job is allocating lots of
slab memory (e.g. network socket buffers, dentry/inode objects, etc)
and we want to be able to account the memory usage to each job rather
than having all the slab scattered around unidentifiably, and to
reduce fragmentation (so when a job finishes, all its sockets close
and all its files are deleted, there's a better chance that we'll be
able to reclaim some slab memory). We could probably turn those into
more synthetic benchmarkable loads if necessary for demonstration.
Paul
On Mon, Mar 09, 2009 at 02:50:06PM -0400, Christoph Lameter wrote:
> Again these are fastpath modifications.
>
> Scanning the partial list for matching nodes is an expensive operation.
>
> Adding RCU into the fast paths is also another big worry.
Hello, Christoph,
Adding synchronize_rcu() into a fast path would certainly be a problem,
but call_rcu() should be OK. If the data structure is updated often
(old elements removed and new elements added), then the cache misses
from elements that were removed, went cache-cold, and then were added
again could potentially cause trouble, but read-mostly data structures
should be OK.
Or were you worried about some other aspect of RCU overhead?
Thanx, Paul
On Wed, 11 Mar 2009, Paul E. McKenney wrote:
> Adding synchronize_rcu() into a fast path would certainly be a problem,
> but call_rcu() should be OK. If the data structure is updated often
> (old elements removed and new elements added), then the cache misses
> from elements that were removed, went cache-cold, and then were added
> again could potentially cause trouble, but read-mostly data structures
> should be OK.
>
> Or were you worried about some other aspect of RCU overhead?
>
Thanks for looking at this, Paul. My latest proposal actually replaces
the need for the rcu with a per-task flag called PF_SLAB_HARDWALL (see
http://marc.info/?l=linux-kernel&m=123665181400366).
On Tue, 10 Mar 2009, Paul Menage wrote:
> We definitely have real workloads where a job is allocating lots of
> slab memory (e.g. network socket buffers, dentry/inode objects, etc)
> and we want to be able to account the memory usage to each job rather
> than having all the slab scattered around unidentifiably, and to
> reduce fragmentation (so when a job finishes, all its sockets close
> and all its files are deleted, there's a better chance that we'll be
> able to reclaim some slab memory). We could probably turn those into
> more synthetic benchmarkable loads if necessary for demonstration.
So this is about memory accounting? The kernel tracks all memory used by a
process and releases it independently of this patch.
The resources that you are mentioning are resources that are typically
shared by multiple processes. There no task owning these items. It is
accidental that a certain process is exclusively using one of these at a
time.
The real workloads are running in cpusets that are overlapping? Why would
this be done? The point of cpusets is typically to segment the
processors for a certain purpose.
On Tue, 10 Mar 2009, Matt Mackall wrote:
> > > Yes, and that's why we require the cpuset hardwall on a configurable
> > > per-cpuset basis. If a cpuset has set this option for its workload, then
> > > it is demanding object allocations from local memory. Other cpusets that
> > > do not have memory_slab_hardwall set can still allocate from any cpu slab
> > > or partial slab, including those allocated for the hardwall cpuset.
> >
> > You cannot hardwall something that is used in a shared way by processes in
> > multiple cpusets.
>
> He can enforce that every allocation made when a given task is current
> conforms. His patch demonstrates that.
Of course. But that may just be a subset of the data used by a task. If an
inode, dentry and so on was already allocated in the context of another
process then the locality of that allocation will not be changed. The
hardwall will have no effect.
On Thu, 2009-03-12 at 12:08 -0400, Christoph Lameter wrote:
> On Tue, 10 Mar 2009, Matt Mackall wrote:
>
> > > > Yes, and that's why we require the cpuset hardwall on a configurable
> > > > per-cpuset basis. If a cpuset has set this option for its workload, then
> > > > it is demanding object allocations from local memory. Other cpusets that
> > > > do not have memory_slab_hardwall set can still allocate from any cpu slab
> > > > or partial slab, including those allocated for the hardwall cpuset.
> > >
> > > You cannot hardwall something that is used in a shared way by processes in
> > > multiple cpusets.
> >
> > He can enforce that every allocation made when a given task is current
> > conforms. His patch demonstrates that.
>
> Of course. But that may just be a subset of the data used by a task. If an
> inode, dentry and so on was already allocated in the context of another
> process then the locality of that allocation will not be changed. The
> hardwall will have no effect.
It will if he's also using a namespace. This is part of a larger puzzle.
--
http://selenic.com : development and support for Mercurial and Linux