2013-06-11 16:14:38

by Alex Thorlton

Subject: [PATCH v2] Make transparent hugepages cpuset aware

This patch adds the ability to control THPs on a per cpuset basis. Please see
the additions to Documentation/cgroups/cpusets.txt for more information.

Signed-off-by: Alex Thorlton <[email protected]>
Reviewed-by: Robin Holt <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Xiao Guangrong <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes since last patch version:
- Modified transparent_hugepage_enable to always check the vma for the
VM_NOHUGEPAGE flag and to always check is_vma_temporary_stack.
- Moved cpuset_update_child_thp_flags above cpuset_update_top_thp_flags

Documentation/cgroups/cpusets.txt | 50 ++++++++++-
include/linux/cpuset.h | 5 ++
include/linux/huge_mm.h | 27 +++++-
kernel/cpuset.c | 181 ++++++++++++++++++++++++++++++++++++++
mm/huge_memory.c | 3 +
5 files changed, 263 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 12e01d4..b7b2c83 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -22,12 +22,14 @@ CONTENTS:
1.6 What is memory spread ?
1.7 What is sched_load_balance ?
1.8 What is sched_relax_domain_level ?
- 1.9 How do I use cpusets ?
+ 1.9 What is thp_enabled ?
+ 1.10 How do I use cpusets ?
2. Usage Examples and Syntax
2.1 Basic Usage
2.2 Adding/removing cpus
2.3 Setting flags
2.4 Attaching processes
+ 2.5 Setting thp_enabled flags
3. Questions
4. Contact

@@ -581,7 +583,34 @@ If your situation is:
then increasing 'sched_relax_domain_level' would benefit you.


-1.9 How do I use cpusets ?
+1.9 What is thp_enabled ?
+-----------------------
+
+The thp_enabled file contained within each cpuset controls how transparent
+hugepages are handled within that cpuset.
+
+The root cpuset's thp_enabled flags mirror the flags set in
+/sys/kernel/mm/transparent_hugepage/enabled. The flags in the root cpuset can
+only be modified by changing /sys/kernel/mm/transparent_hugepage/enabled. The
+thp_enabled file for the root cpuset is read only. These flags cause the
+root cpuset to behave as one might expect:
+
+- When set to always, THPs are used whenever practical
+- When set to madvise, THPs are used only on chunks of memory that have the
+ MADV_HUGEPAGE flag set
+- When set to never, THPs are never allowed for tasks in this cpuset
+
+The behavior of thp_enabled for children of the root cpuset is where things
+become a bit more interesting. The child cpusets accept the same flags as the
+root, but also have a default flag, which, when set, causes a cpuset to use the
+behavior of its parent. When a child cpuset is created, its default flag is
+always initially set.
+
+Since the flags on child cpusets are allowed to differ from the flags on their
+parents, we are able to enable THPs for tasks in specific cpusets, and disable
+them in others.
+
+1.10 How do I use cpusets ?
--------------------------

In order to minimize the impact of cpusets on critical kernel
@@ -733,6 +762,7 @@ cpuset.cpus cpuset.sched_load_balance
cpuset.mem_exclusive cpuset.sched_relax_domain_level
cpuset.mem_hardwall notify_on_release
cpuset.memory_migrate tasks
+thp_enabled

Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using
@@ -814,6 +844,22 @@ If you have several tasks to attach, you have to do it one after another:
...
# /bin/echo PIDn > tasks

+2.5 Setting thp_enabled flags
+-----------------------------
+
+The syntax for setting these flags is similar to setting thp flags in
+/sys/kernel/mm/transparent_hugepage/enabled. In order to change the flags you
+simply echo the name of the flag you want to set to the thp_enabled file of the
+desired cpuset:
+
+# /bin/echo always > thp_enabled -> always use THPs when practical
+# /bin/echo madvise > thp_enabled -> only use THPs in madvise sections
+# /bin/echo never > thp_enabled -> never use THPs
+# /bin/echo default > thp_enabled -> use parent cpuset's THP flags
+
+Note that the flags on the root cpuset cannot be changed in /dev/cpuset. These
+flags are mirrored from /sys/kernel/mm/transparent_hugepage/enabled and can only
+be modified there.

3. Questions
============
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index cc1b01c..624aafd 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -19,9 +19,12 @@ extern int number_of_cpusets; /* How many cpusets are defined in system? */

extern int cpuset_init(void);
extern void cpuset_init_smp(void);
+extern void cpuset_update_top_thp_flags(void);
extern void cpuset_update_active_cpus(bool cpu_online);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
+extern int cpuset_thp_always(struct task_struct *p);
+extern int cpuset_thp_madvise(struct task_struct *p);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
@@ -122,6 +125,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
static inline int cpuset_init(void) { return 0; }
static inline void cpuset_init_smp(void) {}

+static inline void cpuset_update_top_thp_flags(void) {}
+
static inline void cpuset_update_active_cpus(bool cpu_online)
{
partition_sched_domains(1, NULL, NULL);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 528454c..d6a8bb3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -66,7 +66,7 @@ extern pmd_t *page_check_address_pmd(struct page *page,

extern bool is_vma_temporary_stack(struct vm_area_struct *vma);

-#define transparent_hugepage_enabled(__vma) \
+#define _transparent_hugepage_enabled(__vma) \
((transparent_hugepage_flags & \
(1<<TRANSPARENT_HUGEPAGE_FLAG) || \
(transparent_hugepage_flags & \
@@ -177,6 +177,31 @@ static inline struct page *compound_trans_head(struct page *page)
return page;
}

+#ifdef CONFIG_CPUSETS
+extern int cpuset_thp_always(struct task_struct *p);
+extern int cpuset_thp_madvise(struct task_struct *p);
+
+static inline int transparent_hugepage_enabled(struct vm_area_struct *vma)
+{
+ if (cpuset_thp_always(current) &&
+ !((vma)->vm_flags & VM_NOHUGEPAGE) &&
+ !is_vma_temporary_stack(vma))
+ return 1;
+ else if (cpuset_thp_madvise(current) &&
+ ((vma)->vm_flags & VM_HUGEPAGE) &&
+ !((vma)->vm_flags & VM_NOHUGEPAGE) &&
+ !is_vma_temporary_stack(vma))
+ return 1;
+ else
+ return 0;
+}
+#else
+static inline int transparent_hugepage_enabled(struct vm_area_struct *vma)
+{
+ return _transparent_hugepage_enabled(vma);
+}
+#endif
+
extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t pmd, pmd_t *pmdp);

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 64b3f79..d596e50 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -150,6 +150,9 @@ typedef enum {
CS_SCHED_LOAD_BALANCE,
CS_SPREAD_PAGE,
CS_SPREAD_SLAB,
+ CS_THP_MADVISE,
+ CS_THP_ALWAYS,
+ CS_THP_DEFAULT,
} cpuset_flagbits_t;

/* convenient tests for these bits */
@@ -193,6 +196,56 @@ static inline int is_spread_slab(const struct cpuset *cs)
return test_bit(CS_SPREAD_SLAB, &cs->flags);
}

+static inline int is_thp_always(const struct cpuset *cs)
+{
+ return test_bit(CS_THP_ALWAYS, &cs->flags);
+}
+
+static inline int is_thp_madvise(const struct cpuset *cs)
+{
+ return test_bit(CS_THP_MADVISE, &cs->flags);
+}
+
+static inline int is_thp_default(const struct cpuset *cs)
+{
+ return test_bit(CS_THP_DEFAULT, &cs->flags);
+}
+
+/* convenient sets for thp flags */
+static inline void set_thp_always(struct cpuset *cs)
+{
+ set_bit(CS_THP_ALWAYS, &cs->flags);
+ clear_bit(CS_THP_MADVISE, &cs->flags);
+}
+
+static inline void set_thp_madvise(struct cpuset *cs)
+{
+ set_bit(CS_THP_MADVISE, &cs->flags);
+ clear_bit(CS_THP_ALWAYS, &cs->flags);
+}
+
+static inline void set_thp_never(struct cpuset *cs)
+{
+ clear_bit(CS_THP_ALWAYS, &cs->flags);
+ clear_bit(CS_THP_MADVISE, &cs->flags);
+}
+
+static inline void copy_thp_flags(struct cpuset *source_cs,
+ struct cpuset *dest_cs)
+{
+ /*
+ * The CS_THP_DEFAULT flag isn't copied here because it is
+ * set by default when a new cpuset is created, and after
+ * that point, should only be changed by the user.
+ */
+ if (is_thp_always(source_cs))
+ set_thp_always(dest_cs);
+ else if (is_thp_madvise(source_cs))
+ set_thp_madvise(dest_cs);
+ else
+ set_thp_never(dest_cs);
+}
+
static struct cpuset top_cpuset = {
.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
(1 << CS_MEM_EXCLUSIVE)),
@@ -1496,6 +1549,7 @@ typedef enum {
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
+ FILE_THP_ENABLED,
} cpuset_filetype_t;

static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
@@ -1624,6 +1678,71 @@ out_unlock:
return retval;
}

+static void cpuset_update_child_thp_flags(struct cpuset *parent_cs)
+{
+ struct cpuset *child_cs;
+ struct cgroup *pos_cg;
+
+ rcu_read_lock();
+ cpuset_for_each_child(child_cs, pos_cg, parent_cs) {
+ if (test_bit(CS_THP_DEFAULT, &child_cs->flags)) {
+ copy_thp_flags(parent_cs, child_cs);
+ cpuset_update_child_thp_flags(child_cs);
+ }
+ }
+ rcu_read_unlock();
+}
+
+void cpuset_update_top_thp_flags(void)
+{
+ if (test_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags)) {
+ set_thp_always(&top_cpuset);
+ } else if (test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+ &transparent_hugepage_flags)) {
+ set_thp_madvise(&top_cpuset);
+ } else {
+ set_thp_never(&top_cpuset);
+ }
+
+ cpuset_update_child_thp_flags(&top_cpuset);
+}
+
+static int cpuset_thp_enabled_write(struct cgroup *cgrp, struct cftype *cft,
+ const char *buf)
+{
+ struct cpuset *cs = cgroup_cs(cgrp);
+ int retval = strlen(buf) + 1;
+
+ if (cs == &top_cpuset) {
+ retval = -EPERM;
+ goto out;
+ }
+
+ if (cft->private == FILE_THP_ENABLED) {
+ if (!memcmp("always", buf, sizeof("always")-1)) {
+ clear_bit(CS_THP_DEFAULT, &cs->flags);
+ set_thp_always(cs);
+ } else if (!memcmp("madvise", buf, sizeof("madvise")-1)) {
+ clear_bit(CS_THP_DEFAULT, &cs->flags);
+ set_thp_madvise(cs);
+ } else if (!memcmp("never", buf, sizeof("never")-1)) {
+ clear_bit(CS_THP_DEFAULT, &cs->flags);
+ set_thp_never(cs);
+ } else if (!memcmp("default", buf, sizeof("default")-1)) {
+ set_bit(CS_THP_DEFAULT, &cs->flags);
+ copy_thp_flags(parent_cs(cs), cs);
+ } else {
+ retval = -EINVAL;
+ goto out;
+ }
+ }
+
+ cpuset_update_child_thp_flags(cs);
+
+out:
+ return retval;
+}
+
/*
* These ascii lists should be read in a single call, by using a user
* buffer large enough to hold the entire map. If read in smaller
@@ -1658,6 +1777,39 @@ static size_t cpuset_sprintf_memlist(char *page, struct cpuset *cs)
return count;
}

+static size_t cpuset_sprintf_thp(char *buf, struct cpuset *cs)
+{
+ if (test_bit(CS_THP_ALWAYS, &cs->flags)) {
+ VM_BUG_ON(test_bit(CS_THP_MADVISE, &cs->flags));
+ return sprintf(buf, "[always] madvise never %s",
+ test_bit(CS_THP_DEFAULT, &cs->flags) ?
+ "(default from parent)" :
+ "(overrides parent)");
+ } else if (test_bit(CS_THP_MADVISE, &cs->flags)) {
+ VM_BUG_ON(test_bit(CS_THP_ALWAYS, &cs->flags));
+ return sprintf(buf, "always [madvise] never %s",
+ test_bit(CS_THP_DEFAULT, &cs->flags) ?
+ "(default from parent)" :
+ "(overrides parent)");
+ } else
+ return sprintf(buf, "always madvise [never] %s",
+ test_bit(CS_THP_DEFAULT, &cs->flags) ?
+ "(default from parent)" :
+ "(overrides parent)");
+}
+
+static size_t cpuset_sprintf_thp_top(char *buf, struct cpuset *cs)
+{
+ if (test_bit(CS_THP_ALWAYS, &cs->flags)) {
+ VM_BUG_ON(test_bit(CS_THP_MADVISE, &cs->flags));
+ return sprintf(buf, "[always] madvise never");
+ } else if (test_bit(CS_THP_MADVISE, &cs->flags)) {
+ VM_BUG_ON(test_bit(CS_THP_ALWAYS, &cs->flags));
+ return sprintf(buf, "always [madvise] never");
+ } else
+ return sprintf(buf, "always madvise [never]");
+}
+
static ssize_t cpuset_common_file_read(struct cgroup *cont,
struct cftype *cft,
struct file *file,
@@ -1682,6 +1834,12 @@ static ssize_t cpuset_common_file_read(struct cgroup *cont,
case FILE_MEMLIST:
s += cpuset_sprintf_memlist(s, cs);
break;
+ case FILE_THP_ENABLED:
+ if (&top_cpuset == cs)
+ s += cpuset_sprintf_thp_top(s, cs);
+ else
+ s += cpuset_sprintf_thp(s, cs);
+ break;
default:
retval = -EINVAL;
goto out;
@@ -1834,6 +1992,13 @@ static struct cftype files[] = {
.private = FILE_MEMORY_PRESSURE_ENABLED,
},

+ {
+ .name = "thp_enabled",
+ .read = cpuset_common_file_read,
+ .write_string = cpuset_thp_enabled_write,
+ .private = FILE_THP_ENABLED,
+ },
+
{ } /* terminate */
};

@@ -1880,11 +2045,14 @@ static int cpuset_css_online(struct cgroup *cgrp)
mutex_lock(&cpuset_mutex);

set_bit(CS_ONLINE, &cs->flags);
+ set_bit(CS_THP_DEFAULT, &cs->flags);
if (is_spread_page(parent))
set_bit(CS_SPREAD_PAGE, &cs->flags);
if (is_spread_slab(parent))
set_bit(CS_SPREAD_SLAB, &cs->flags);

+ copy_thp_flags(parent, cs);
+
number_of_cpusets++;

if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &cgrp->flags))
@@ -1982,6 +2150,9 @@ int __init cpuset_init(void)

fmeter_init(&top_cpuset.fmeter);
set_bit(CS_SCHED_LOAD_BALANCE, &top_cpuset.flags);
+
+ cpuset_update_top_thp_flags();
+
top_cpuset.relax_domain_level = -1;

err = register_filesystem(&cpuset_fs_type);
@@ -2276,6 +2447,16 @@ void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
*/
}

+int cpuset_thp_always(struct task_struct *p)
+{
+ return is_thp_always(task_cs(p));
+}
+
+int cpuset_thp_madvise(struct task_struct *p)
+{
+ return is_thp_madvise(task_cs(p));
+}
+
void cpuset_init_current_mems_allowed(void)
{
nodes_setall(current->mems_allowed);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 03a89a2..d48f0b5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -21,6 +21,7 @@
#include <linux/pagemap.h>
#include <linux/migrate.h>
#include <linux/hashtable.h>
+#include <linux/cpuset.h>

#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -268,6 +269,8 @@ static ssize_t double_flag_store(struct kobject *kobj,
} else
return -EINVAL;

+ cpuset_update_top_thp_flags();
+
return count;
}

--
1.7.12.4


2013-06-11 22:20:18

by David Rientjes

Subject: Re: [PATCH v2] Make transparent hugepages cpuset aware

On Tue, 11 Jun 2013, Alex Thorlton wrote:

> This patch adds the ability to control THPs on a per cpuset basis. Please see
> the additions to Documentation/cgroups/cpusets.txt for more information.
>

What's missing from both this changelog and the documentation you point to
is why this change is needed.

I can understand how you would want a subset of processes to not use thp
when it is enabled. This is typically where MADV_NOHUGEPAGE is used with
some type of malloc hook.

I don't think we need to do this on a cpuset level, so unfortunately I
think this needs to be reworked. Would it make sense to add a per-process
tunable to always get MADV_NOHUGEPAGE behavior for all of its sbrk() and
mmap() calls? Perhaps, but then you would need to justify why it can't be
done with a malloc hook in userspace.
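
To be concrete about the hook approach, here is a minimal sketch. It
assumes LD_PRELOAD interposition (only one of several ways to implement
such a hook) and deliberately ignores brk()-grown heaps, so take it as
an illustration rather than a complete solution:

/*
 * Sketch: mark every new anonymous mmap() region MADV_NOHUGEPAGE so the
 * mmap()-based malloc arenas of the preloaded process are never backed
 * by thp.
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/mman.h>

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset)
{
    static void *(*real_mmap)(void *, size_t, int, int, int, off_t);
    void *p;

    if (!real_mmap)
        real_mmap = dlsym(RTLD_NEXT, "mmap");

    p = real_mmap(addr, length, prot, flags, fd, offset);
    if (p != MAP_FAILED && (flags & MAP_ANONYMOUS))
        madvise(p, length, MADV_NOHUGEPAGE);
    return p;
}

Build that as a shared object and start the program with LD_PRELOAD
pointing at it.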

This seems to just be working around a userspace issue or for a matter of
convenience, right?

2013-06-18 16:45:29

by Alex Thorlton

Subject: Re: [PATCH v2] Make transparent hugepages cpuset aware

On Tue, Jun 11, 2013 at 03:20:09PM -0700, David Rientjes wrote:
> On Tue, 11 Jun 2013, Alex Thorlton wrote:
>
> > This patch adds the ability to control THPs on a per cpuset basis. Please see
> > the additions to Documentation/cgroups/cpusets.txt for more information.
> >
>
> What's missing from both this changelog and the documentation you point to
> is why this change is needed.
>
> I can understand how you would want a subset of processes to not use thp
> when it is enabled. This is typically where MADV_NOHUGEPAGE is used with
> some type of malloc hook.
>
> I don't think we need to do this on a cpuset level, so unfortunately I
> think this needs to be reworked. Would it make sense to add a per-process
> tunable to always get MADV_NOHUGEPAGE behavior for all of its sbrk() and
> mmap() calls? Perhaps, but then you would need to justify why it can't be
> done with a malloc hook in userspace.
>
> This seems to just be working around a userspace issue or for a matter of
> convenience, right?

David,

Thanks for your input, however, I believe the method of using a malloc
hook falls apart when it comes to static binaries, since we wont' have
any shared libraries to hook into. Although using a malloc hook is a
perfectly suitable solution for most cases, we're looking to implement a
solution that can be used in all situations.

Aside from that particular shortcoming of the malloc hook solution,
there are some other situations where having a cpuset-based option is a
much simpler and more efficient solution than the alternatives. One
such situation that comes to mind would be an environment where a batch
scheduler is in use to ration system resources. If an administrator
determines that a user's jobs run more efficiently with thp always on,
the administrator can simply set the user's jobs to always run with that
setting, instead of having to coordinate with that user to get them to
run their jobs in a different way. I feel that, for cases such as this,
this additional flag is in line with the other capabilities that
cgroups and cpusets provide.

While there are likely numerous other situations where having a flag to
control thp on the cpuset level makes things a bit easier to manage, the
one real show-stopper here is that we really have no other options when
it comes to static binaries.

- Alex Thorlton

2013-06-19 00:01:28

by David Rientjes

Subject: Re: [PATCH v2] Make transparent hugepages cpuset aware

On Tue, 18 Jun 2013, Alex Thorlton wrote:

> Thanks for your input, however, I believe the method of using a malloc
> hook falls apart when it comes to static binaries, since we won't have
> any shared libraries to hook into. Although using a malloc hook is a
> perfectly suitable solution for most cases, we're looking to implement a
> solution that can be used in all situations.
>

I guess the question would be why you don't want your malloc memory backed
by thp pages for certain static binaries and not others? Is it because of
an increased rss due to khugepaged collapsing memory because of its
default max_ptes_none value?

> Aside from that particular shortcoming of the malloc hook solution,
> there are some other situations where having a cpuset-based option is a
> much simpler and more efficient solution than the alternatives.

Sure, but why should this be a cpuset based solution? What is special
about cpusets that make certain statically allocated binaries not want
memory backed by thp while others do? This still seems based solely on
convenience instead of any hard requirement.

> One
> such situation that comes to mind would be an environment where a batch
> scheduler is in use to ration system resources. If an administrator
> determines that a user's jobs run more efficiently with thp always on,
> the administrator can simply set the user's jobs to always run with that
> setting, instead of having to coordinate with that user to get them to
> run their jobs in a different way. I feel that, for cases such as this,
> this additional flag is in line with the other capabilities that
> cgroups and cpusets provide.
>

That sounds like a memcg, i.e. container, type of an issue, not a cpuset
issue which is more geared toward NUMA optimizations. User jobs should
always run more efficiently with thp always on, the worst-case scenario
should be if they run with the same performance as thp set to never. In
other words, there shouldn't be any regression that requires certain
cpusets to disable thp because of a performance regression. If there are
any, we'd like to investigate that separately from this patch.

2013-06-19 09:32:18

by Robin Holt

Subject: Re: [PATCH v2] Make transparent hugepages cpuset aware

On Tue, Jun 18, 2013 at 05:01:23PM -0700, David Rientjes wrote:
> On Tue, 18 Jun 2013, Alex Thorlton wrote:
>
> > Thanks for your input, however, I believe the method of using a malloc
> > hook falls apart when it comes to static binaries, since we won't have
> > any shared libraries to hook into. Although using a malloc hook is a
> > perfectly suitable solution for most cases, we're looking to implement a
> > solution that can be used in all situations.
> >
>
> I guess the question would be why you don't want your malloc memory backed
> by thp pages for certain static binaries and not others? Is it because of
> an increased rss due to khugepaged collapsing memory because of its
> default max_ptes_none value?
>
> > Aside from that particular shortcoming of the malloc hook solution,
> > there are some other situations where having a cpuset-based option is a
> > much simpler and more efficient solution than the alternatives.
>
> Sure, but why should this be a cpuset based solution? What is special
> about cpusets that make certain statically allocated binaries not want
> memory backed by thp while others do? This still seems based solely on
> convenience instead of any hard requirement.

The convenience being that many batch schedulers have added cpuset
support. They create the cpusets and configure them as appropriate
for the job as determined by a mixture of input from the submitting
user but still under the control of the administrator. That seems like
a fairly significant convenience given that it took years to get the
batch schedulers to adopt cpusets in the first place. At this point,
expanding their use of cpusets is under the control of the system
administrator and would not require any additional development on
the batch scheduler developers' part.

> > One
> > such situation that comes to mind would be an environment where a batch
> > scheduler is in use to ration system resources. If an administrator
> > determines that a user's jobs run more efficiently with thp always on,
> > the administrator can simply set the user's jobs to always run with that
> > setting, instead of having to coordinate with that user to get them to
> > run their jobs in a different way. I feel that, for cases such as this,
> > this additional flag is in line with the other capabilities that
> > cgroups and cpusets provide.
> >
>
> That sounds like a memcg, i.e. container, type of an issue, not a cpuset
> issue which is more geared toward NUMA optimizations. User jobs should
> always run more efficiently with thp always on, the worst-case scenario
> should be if they run with the same performance as thp set to never. In
> other words, there shouldn't be any regression that requires certain
> cpusets to disable thp because of a performance regression. If there are
> any, we'd like to investigate that separately from this patch.

Here are the entries in the cpuset:
cgroup.event_control mem_exclusive memory_pressure_enabled notify_on_release tasks
cgroup.procs mem_hardwall memory_spread_page release_agent
cpu_exclusive memory_migrate memory_spread_slab sched_load_balance
cpus memory_pressure mems sched_relax_domain_level

There are scheduler, slab allocator, page_cache layout, etc controls.
Why _NOT_ add a thp control to that nicely contained central location?
It is a concise set of controls for the job.

Maybe I am misunderstanding. Are you saying you want to put memcg
information into the cpuset or something like that?

Robin

2013-06-19 21:24:12

by David Rientjes

Subject: Re: [PATCH v2] Make transparent hugepages cpuset aware

On Wed, 19 Jun 2013, Robin Holt wrote:

> The convenience being that many batch schedulers have added cpuset
> support. They create the cpuset's and configure them as appropriate
> for the job as determined by a mixture of input from the submitting
> user but still under the control of the administrator. That seems like
> a fairly significant convenience given that it took years to get the
> batch schedulers to adopt cpusets in the first place. At this point,
> expanding their use of cpusets is under the control of the system
> administrator and would not require any additional development on
> the batch scheduler developers part.
>

You can't say the same for memcg?

> Here are the entries in the cpuset:
> cgroup.event_control mem_exclusive memory_pressure_enabled notify_on_release tasks
> cgroup.procs mem_hardwall memory_spread_page release_agent
> cpu_exclusive memory_migrate memory_spread_slab sched_load_balance
> cpus memory_pressure mems sched_relax_domain_level
>
> There are scheduler, slab allocator, page_cache layout, etc controls.

I think this is mostly for historical reasons since cpusets were
introduced before cgroups.

> Why _NOT_ add a thp control to that nicely contained central location?
> It is a concise set of controls for the job.
>

All of the above seem to be for cpusets primary purpose, i.e. NUMA
optimizations. It has nothing to do with transparent hugepages. (I'm not
saying thp has anything to do with memcg either, but a "memory controller"
seems more appropriate for controlling thp behavior.)

> Maybe I am misunderstanding. Are you saying you want to put memcg
> information into the cpuset or something like that?
>

I'm saying there's absolutely no reason to have thp controlled by a
cpuset, or ANY cgroup for that matter, since you chose not to respond to
the question I asked: why do you want to control thp behavior for certain
static binaries and not others? Where is the performance regression or
the downside? Is it because of max_ptes_none for certain jobs blowing up
the rss? We need information, and even if it were justifiable then it
wouldn't have anything to do with ANY cgroup but rather a per-process
control. It has nothing to do with cpusets whatsoever.

(And I'm very curious why you didn't even cc the cpusets maintainer on
this patch in the first place who would probably say the same thing.)

2013-06-20 02:27:45

by Robin Holt

Subject: Re: [PATCH v2] Make transparent hugepages cpuset aware

On Wed, Jun 19, 2013 at 02:24:07PM -0700, David Rientjes wrote:
> On Wed, 19 Jun 2013, Robin Holt wrote:
>
> > The convenience being that many batch schedulers have added cpuset
> > support. They create the cpuset's and configure them as appropriate
> > for the job as determined by a mixture of input from the submitting
> > user but still under the control of the administrator. That seems like
> > a fairly significant convenience given that it took years to get the
> > batch schedulers to adopt cpusets in the first place. At this point,
> > expanding their use of cpusets is under the control of the system
> > administrator and would not require any additional development on
> > the batch scheduler developers part.
> >
>
> You can't say the same for memcg?

I am not aware of batch scheduler support for memory controllers.
The request came from our benchmarking group.

> > Here are the entries in the cpuset:
> > cgroup.event_control mem_exclusive memory_pressure_enabled notify_on_release tasks
> > cgroup.procs mem_hardwall memory_spread_page release_agent
> > cpu_exclusive memory_migrate memory_spread_slab sched_load_balance
> > cpus memory_pressure mems sched_relax_domain_level
> >
> > There are scheduler, slab allocator, page_cache layout, etc controls.
>
> I think this is mostly for historical reasons since cpusets were
> introduced before cgroups.
>
> > Why _NOT_ add a thp control to that nicely contained central location?
> > It is a concise set of controls for the job.
> >
>
> All of the above seem to be for cpusets primary purpose, i.e. NUMA
> optimizations. It has nothing to do with transparent hugepages. (I'm not
> saying thp has anything to do with memcg either, but a "memory controller"
> seems more appropriate for controlling thp behavior.)

cpusets was not for NUMA. It has no preference for "nodes" or anything like
that. It was for splitting a machine into layered smaller groups. Usually,
we see one cpuset which contains the batch scheduler. The batch scheduler then
creates cpusets for jobs it starts. Has nothing to do with nodes. That is
more an administrator issue. They set the minimum grouping of resources
for scheduled jobs.

> > Maybe I am misunderstanding. Are you saying you want to put memcg
> > information into the cpuset or something like that?
> >
>
> I'm saying there's absolutely no reason to have thp controlled by a
> cpuset, or ANY cgroup for that matter, since you chose not to respond to
> the question I asked: why do you want to control thp behavior for certain
> static binaries and not others? Where is the performance regression or
> the downside? Is it because of max_ptes_none for certain jobs blowing up
> the rss? We need information, and even if it were justifiable then it
> wouldn't have anything to do with ANY cgroup but rather a per-process
> control. It has nothing to do with cpusets whatsoever.

It was a request from our benchmarking group that has found some jobs
benefit from thp, while other are harmed. Let me ask them for more
details.

> (And I'm very curious why you didn't even cc the cpusets maintainer on
> this patch in the first place who would probably say the same thing.)

I didn't know there was a cpuset maintainer. Paul Jackson (SGI retired)
had originally worked to get cpusets introduced and then converted to
use cgroups. I had never known there was a maintainer after him. Sorry
for that.

Robin

2013-06-20 02:43:28

by David Rientjes

Subject: Re: [PATCH v2] Make transparent hugepages cpuset aware

On Wed, 19 Jun 2013, Robin Holt wrote:

> cpusets was not for NUMA. It has no preference for "nodes" or anything like
> that. It was for splitting a machine into layered smaller groups. Usually,
> we see one cpuset which contains the batch scheduler. The batch scheduler then
> creates cpusets for jobs it starts. Has nothing to do with nodes. That is
> more an administrator issue. They set the minimum grouping of resources
> for scheduled jobs.
>

I disagree with all of the above, it's not what Paul Jackson developed
cpusets for, it's not what he wrote in Documentation/cgroups/cpusets.txt,
and it's not why libnuma immediately supported it. Cpusets is for NUMA,
like it or not.

> > I'm saying there's absolutely no reason to have thp controlled by a
> > cpuset, or ANY cgroup for that matter, since you chose not to respond to
> > the question I asked: why do you want to control thp behavior for certain
> > static binaries and not others? Where is the performance regression or
> > the downside? Is it because of max_ptes_none for certain jobs blowing up
> > the rss? We need information, and even if it were justifiable then it
> > wouldn't have anything to do with ANY cgroup but rather a per-process
> > control. It has nothing to do with cpusets whatsoever.
>
> It was a request from our benchmarking group that has found some jobs
> benefit from thp, while other are harmed. Let me ask them for more
> details.
>

Yes, please, because if some jobs are harmed by thp then we need to fix
that regression and not paper around with it with some cpuset-based
solution. People should be able to run with CONFIG_TRANSPARENT_HUGEPAGE
enabled and not be required to enable CONFIG_CPUSETS for optimal behavior.
I'm suspecting that you're referring to enlarged rss because of
khugepaged's max_ptes_none and because you're abusing the purpose of
cpusets for containerization.

2013-06-20 03:11:08

by Mike Galbraith

Subject: Re: [PATCH v2] Make transparent hugepages cpuset aware

On Wed, 2013-06-19 at 19:43 -0700, David Rientjes wrote:
>
> I'm suspecting that you're referring to enlarged rss because of
> khugepaged's max_ptes_none and because you're abusing the purpose of
> cpusets for containerization.

Why is containerization an abuse? What's wrong with renting out chunks
of a big box farm on the fly like a task motel? If a realtime customer
checks in, he's not gonna be thrilled about sharing a room.

-Mike

2013-06-20 03:35:03

by Zefan Li

Subject: Re: [PATCH v2] Make transparent hugepages cpuset aware

Cc: Tejun, and cgroup ML

>> Here are the entries in the cpuset:
>> cgroup.event_control mem_exclusive memory_pressure_enabled notify_on_release tasks
>> cgroup.procs mem_hardwall memory_spread_page release_agent
>> cpu_exclusive memory_migrate memory_spread_slab sched_load_balance
>> cpus memory_pressure mems sched_relax_domain_level
>>
>> There are scheduler, slab allocator, page_cache layout, etc controls.
>
> I think this is mostly for historical reasons since cpusets were
> introduced before cgroups.
>
>> Why _NOT_ add a thp control to that nicely contained central location?
>> It is a concise set of controls for the job.
>>
>
> All of the above seem to be for cpusets primary purpose, i.e. NUMA
> optimizations. It has nothing to do with transparent hugepages. (I'm not
> saying thp has anything to do with memcg either, but a "memory controller"
> seems more appropriate for controlling thp behavior.)
>
>> Maybe I am misunderstanding. Are you saying you want to put memcg
>> information into the cpuset or something like that?
>>
>
> I'm saying there's absolutely no reason to have thp controlled by a
> cpuset, or ANY cgroup for that matter, since you chose not to respond to
> the question I asked: why do you want to control thp behavior for certain
> static binaries and not others? Where is the performance regression or
> the downside? Is it because of max_ptes_none for certain jobs blowing up
> the rss? We need information, and even if it were justifiable then it
> wouldn't have anything to do with ANY cgroup but rather a per-process
> control. It has nothing to do with cpusets whatsoever.
>
> (And I'm very curious why you didn't even cc the cpusets maintainer on
> this patch in the first place who would probably say the same thing.)
> .

Don't know whom you were referring to here. It's Paul Jackson who invented
cpusets, and then Paul Menage took over the maintainership but he wasn't
doing much maintainer's work. Now it's me and Tejun maintaining cpusets.
(long ago Ingo once requested cpuset patches should be cced to him and
Peter.)

Back to this patch, I'm definitely on your side. This feature doesn't
interact with existing cpuset features, and it doesn't need anything
that cpuset provides. In a word, it has nothing to do with cpusets, hence
it shouldn't belong to cpusets.

We're cleaning up all the messes in cgroups, and this patch goes in the
opposite direction.

2013-06-20 20:37:43

by David Rientjes

Subject: Re: [PATCH v2] Make transparent hugepages cpuset aware

On Thu, 20 Jun 2013, Mike Galbraith wrote:

> > I'm suspecting that you're referring to enlarged rss because of
> > khugepaged's max_ptes_none and because you're abusing the purpose of
> > cpusets for containerization.
>
> Why is containerization an abuse? What's wrong with renting out chunks
> of a big box farm on the fly like a task motel? If a realtime customer
> checks in, he's not gonna be thrilled about sharing a room.
>

We "abused" cpusets for containerization for years, I'm not implying any
negative connotation to it, see the
Documentation/x86/x86_64/fake-numa-for-cpusets document that I wrote. It
doesn't suggest that we should be controlling thp through cpusets; if he's
complaining about static binaries where he can't use a malloc hook then
why not make it per-process so users don't have to configure cpusets to
control it?

2013-07-29 19:42:45

by Alex Thorlton

Subject: Re: [PATCH v2] Make transparent hugepages cpuset aware

We've re-evaluated the need for a patch to support some sort of finer
grained control over thp, and, based on some tests performed by our
benchmarking team, we're seeing that we'd definitely still like to
implement some method to support this. Here's an e-mail from John
Baron ([email protected]), on our benchmarking team, containing some data
which shows a decrease in performance for some SPEC OMP benchmarks when
thp is enabled:

> Here are results for SPEC OMP benchmarks on UV2 using 512 threads / 64
> sockets. These show the performance ratio for jobs run with THP
> disabled versus THP enabled (so > 1.0 means THP disabled is faster).
> One possible reason for lower performance with THP enabled is that the
> larger page granularity can result in more remote data accesses.
>
>
> SPEC OMP2012:
>
> 350.md 1.0
> 351.bwaves 1.3
> 352.nab 1.0
> 357.bt331 0.9
> 358.botsalgn 1.0
> 359.botsspar 1.1
> 360.ilbdc 1.8
> 362.fma3d 1.0
> 363.swim 1.4
> 367.imagick 0.9
> 370.mgrid331 1.1
> 371.applu331 0.9
> 372.smithwa 1.0
> 376.kdtree 1.0
>
> SPEC OMPL2001:
>
> 311.wupwise_l 1.1
> 313.swim_l 1.5
> 315.mgrid_l 1.0
> 317.applu_l 1.1
> 321.equake_l 5.8
> 325.apsi_l 1.5
> 327.gafort_l 1.0
> 329.fma3d_l 1.0
> 331.art_l 0.8
>
> One could argue that real-world applications could be modified to avoid
> these kinds of effects, but (a) it is not always possible to modify code
> (e.g. in benchmark situations) and (b) even if it is possible to do so,
> it is not necessarily easy to do so (e.g. for customers with large
> legacy Fortran codes).
>
> We have also observed on Intel Sandy Bridge processors that, as
> counter-intuitive as it may seem, local memory bandwidth is actually
> slightly lower with THP enabled (1-2%), even with unit stride data
> accesses. This may not seem like much of a performance hit but
> it is important for HPC workloads. No code modification will help here.

In light of the previous issues discussed in this thread, and some
suggestions from David Rientjes:

> why not make it per-process so users don't have to configure
> cpusets to control it?

Robin and I have come up with a proposal for a way to replicate behavior
similar to what this patch introduced, only on a per-process level
instead of at the cpuset level.

Our idea would be to add a flag somewhere in the task_struct to keep
track of whether or not thp is enabled for each task. The flag would be
controlled by an additional option included in prctl(), allowing
programmers to set/unset this flag via the prctl() syscall. We would
also introduce some code into the clone() syscall to ensure that this
flag is copied down from parent to child tasks when necessary. The flag
would be checked in the same place that the per-cpuset flag was checked
in my original patch, thereby allowing the same behavior to be
replicated on a per-process level.
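
Roughly, the check itself would collapse to something like the sketch
below (the thp_disabled field name is just a placeholder for wherever
we end up storing the per-task flag):

/* Sketch only - names are placeholders for the proposal above. */
static inline int transparent_hugepage_enabled(struct vm_area_struct *vma)
{
    if (current->thp_disabled)  /* hypothetical per-task flag set via prctl() */
        return 0;
    return _transparent_hugepage_enabled(vma);
}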

In this way, we will also be able to get static binaries to behave
appropriately by setting this flag in a userland program, and then
having that program exec the static binary for which we need to disable
thp.
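
As an illustration, a userland wrapper along these lines would do the
job (PR_SET_THP_DISABLE is a placeholder name and value for whatever
prctl() option this turns into):

#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>

#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41   /* placeholder value for the proposed option */
#endif

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
        return 1;
    }
    /* Set the proposed per-task flag; it would survive the exec() below. */
    if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
        perror("prctl");
    execvp(argv[1], &argv[1]);
    perror("execvp");
    return 1;
}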

This solution allows us to incorporate the behavior that we're looking
for into the kernel, without abusing cpusets for the purpose of
containerization.

Please let me know if anyone has any objections to this approach, or if
you have any suggestions as to how we could improve upon this idea.

Thanks!

- Alex