2018-05-18 04:28:20

by TSUKADA Koutaro

Subject: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

Thanks to Mike Kravetz for his comments on the previous version of this patch-set.

The purpose of this patch-set is to make it possible to control whether or
not surplus hugetlb pages obtained by overcommitting are charged to the
memory cgroup. In the future, I am trying to make it possible to limit the
memory usage of applications that use both normal pages and hugetlb pages
with the memory cgroup alone (not using the hugetlb cgroup).

Applications that use shared libraries like libhugetlbfs.so use both normal
pages and hugetlb pages, but we do not know in advance how much of each
they will use. Suppose you want to manage the memory usage of such an
application with cgroups: how do you set the memory cgroup and hugetlb
cgroup limits when you want to limit its memory usage to 10GB?

If you set a limit of 10GB for each, the user can use a total of 20GB of
memory, so the limit is not effective. Since it is difficult to estimate
the ratio of normal pages to hugetlb pages that a user will consume,
setting limits of 2GB for the memory cgroup and 8GB for the hugetlb cgroup
is not a very good idea either. In such a case, I thought that by using my
patch-set we could manage resources just by setting 10GB as the limit of
the memory cgroup (with no limit on the hugetlb cgroup).

This patch-set introduces charge_surplus_huge_pages (a boolean) in struct
hstate. If it is true, surplus hugepages are charged to the memory cgroup
to which the task that obtained them belongs. If it is false, nothing is
done, as before; the default value is false. charge_surplus_huge_pages can
be controlled via procfs or sysfs interfaces.
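
As a minimal usage sketch, using the interfaces added by patches 4 and 5 of
this series (the hugepages-2048kB directory assumes 2MB huge pages on x86):

  # enable charging for the default huge page size via sysctl
  echo 1 > /proc/sys/vm/charge_surplus_hugepages

  # or per huge page size via sysfs
  echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/charge_surplus_hugepages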

Since THP is very effective in environments with a kernel base page size
of 4KB, such as x86, there is little reason to actively use HugeTLBfs
there, so I think there is no situation in which charge_surplus_huge_pages
would need to be enabled. However, in some distributions, such as arm64
builds where the kernel page size is 64KB, the THP size is as huge as
512MB, making THP difficult to use. HugeTLBfs can support multiple huge
page sizes, and in such a special environment there is a desire to use
HugeTLBfs.

The patch set is based on 4.17.0-rc3+. I do not know whether the patch-set
is acceptable or not, so I have only done a simple test.

Thanks,
Tsukada

TSUKADA Koutaro (7):
hugetlb: introduce charge_surplus_huge_pages to struct hstate
hugetlb: supports migrate charging for surplus hugepages
memcg: use compound_order rather than hpage_nr_pages
mm, sysctl: make charging surplus hugepages controllable
hugetlb: add charge_surplus_hugepages attribute
Documentation, hugetlb: describe about charge_surplus_hugepages
memcg: supports movement of surplus hugepages statistics

Documentation/vm/hugetlbpage.txt | 6 +
include/linux/hugetlb.h | 4 +
kernel/sysctl.c | 7 +
mm/hugetlb.c | 148 +++++++++++++++++++++++++++++++++++++++
mm/memcontrol.c | 109 +++++++++++++++++++++++++++-
5 files changed, 269 insertions(+), 5 deletions(-)

--
Tsukada




2018-05-18 04:30:22

by TSUKADA Koutaro

Subject: [PATCH v2 1/7] hugetlb: introduce charge_surplus_huge_pages to struct hstate

The charge_surplus_huge_pages flag indicates that surplus huge pages
obtained from the normal page pool are to be charged to the memory cgroup.
The default value is false.

This patch implements the core part of charging surplus hugepages. The
private and mem_cgroup members of the second page of a compound hugepage
are used for surplus hugepage charging.

A surplus hugepage is marked when it is obtained from the normal pool, and
is charged to the memory cgroup in alloc_huge_page. Once the mapping of the
page is decided, the charge is committed. The charge is uncharged or
cancelled in free_huge_page.
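
A sketch of the resulting lifecycle, in terms of the hooks added below:

  alloc_surplus_huge_page()     mark:    surplus_hugepage_set_charge()
  alloc_huge_page()             charge:  surplus_hugepage_try_charge()
  rmap / page cache insertion   commit:  surplus_hugepage_commit_charge()
  free_huge_page()              finish:  surplus_hugepage_finalize_charge()
                                         (cancel if never committed,
                                          uncharge otherwise)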

Signed-off-by: TSUKADA Koutaro <[email protected]>
---
include/linux/hugetlb.h | 2
mm/hugetlb.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 102 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 36fa6a2..33fe5be 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -158,6 +158,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);

bool is_hugetlb_entry_migration(pte_t pte);
+bool PageSurplusCharge(struct page *page);

#else /* !CONFIG_HUGETLB_PAGE */

@@ -338,6 +339,7 @@ struct hstate {
unsigned int nr_huge_pages_node[MAX_NUMNODES];
unsigned int free_huge_pages_node[MAX_NUMNODES];
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+ bool charge_surplus_huge_pages; /* default to off */
#ifdef CONFIG_CGROUP_HUGETLB
/* cgroup control files */
struct cftype cgroup_files[5];
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2186791..679c151f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -36,6 +36,7 @@
#include <linux/node.h>
#include <linux/userfaultfd_k.h>
#include <linux/page_owner.h>
+#include <linux/memcontrol.h>
#include "internal.h"

int hugetlb_max_hstate __read_mostly;
@@ -1236,6 +1237,90 @@ static inline void ClearPageHugeTemporary(struct page *page)
page[2].mapping = NULL;
}

+#define HUGETLB_SURPLUS_CHARGE 1UL
+
+bool PageSurplusCharge(struct page *page)
+{
+ if (!PageHuge(page))
+ return false;
+ return page[1].private == HUGETLB_SURPLUS_CHARGE;
+}
+
+static inline void SetPageSurplusCharge(struct page *page)
+{
+ page[1].private = HUGETLB_SURPLUS_CHARGE;
+}
+
+static inline void ClearPageSurplusCharge(struct page *page)
+{
+ page[1].private = 0;
+}
+
+static inline void
+set_surplus_hugepage_memcg(struct page *page, struct mem_cgroup *memcg)
+{
+ page[1].mem_cgroup = memcg;
+}
+
+static inline struct mem_cgroup *get_surplus_hugepage_memcg(struct page *page)
+{
+ return page[1].mem_cgroup;
+}
+
+static void surplus_hugepage_set_charge(struct hstate *h, struct page *page)
+{
+ if (likely(!h->charge_surplus_huge_pages))
+ return;
+ if (unlikely(!page))
+ return;
+ SetPageSurplusCharge(page);
+}
+
+static int surplus_hugepage_try_charge(struct page *page, struct mm_struct *mm)
+{
+ struct mem_cgroup *memcg;
+
+ if (likely(!PageSurplusCharge(page)))
+ return 0;
+
+ if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, true)) {
+ /* mem_cgroup oom invoked */
+ ClearPageSurplusCharge(page);
+ return -ENOMEM;
+ }
+ set_surplus_hugepage_memcg(page, memcg);
+
+ return 0;
+}
+
+static void surplus_hugepage_commit_charge(struct page *page)
+{
+ struct mem_cgroup *memcg;
+
+ if (likely(!PageSurplusCharge(page)))
+ return;
+
+ memcg = get_surplus_hugepage_memcg(page);
+ mem_cgroup_commit_charge(page, memcg, false, true);
+ set_surplus_hugepage_memcg(page, NULL);
+}
+
+static void surplus_hugepage_finalize_charge(struct page *page)
+{
+ struct mem_cgroup *memcg;
+
+ if (likely(!PageSurplusCharge(page)))
+ return;
+
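+ /*
+ * surplus_hugepage_commit_charge() clears the stashed memcg, so a
+ * non-NULL memcg here means the charge was never committed: cancel it.
+ * Otherwise the committed charge must be uncharged.
+ */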
+ memcg = get_surplus_hugepage_memcg(page);
+ if (memcg)
+ mem_cgroup_cancel_charge(page, memcg, true);
+ else
+ mem_cgroup_uncharge(page);
+ set_surplus_hugepage_memcg(page, NULL);
+ ClearPageSurplusCharge(page);
+}
+
void free_huge_page(struct page *page)
{
/*
@@ -1248,6 +1333,8 @@ void free_huge_page(struct page *page)
(struct hugepage_subpool *)page_private(page);
bool restore_reserve;

+ surplus_hugepage_finalize_charge(page);
+
set_page_private(page, 0);
page->mapping = NULL;
VM_BUG_ON_PAGE(page_count(page), page);
@@ -1583,6 +1670,8 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
out_unlock:
spin_unlock(&hugetlb_lock);

+ surplus_hugepage_set_charge(h, page);
+
return page;
}

@@ -2062,6 +2151,11 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page);
spin_unlock(&hugetlb_lock);

+ if (unlikely(surplus_hugepage_try_charge(page, vma->vm_mm))) {
+ put_page(page);
+ return ERR_PTR(-ENOMEM);
+ }
+
set_page_private(page, (unsigned long)spool);

map_commit = vma_commit_reservation(h, vma, addr);
@@ -3610,6 +3704,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
make_huge_pte(vma, new_page, 1));
page_remove_rmap(old_page, true);
hugepage_add_new_anon_rmap(new_page, vma, address);
+ surplus_hugepage_commit_charge(new_page);
/* Make the old page be freed below */
new_page = old_page;
}
@@ -3667,6 +3762,9 @@ int huge_add_to_page_cache(struct page *page, struct address_space *mapping,

if (err)
return err;
+
+ surplus_hugepage_commit_charge(page);
+
ClearPagePrivate(page);

spin_lock(&inode->i_lock);
@@ -3809,6 +3907,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (anon_rmap) {
ClearPagePrivate(page);
hugepage_add_new_anon_rmap(page, vma, address);
+ surplus_hugepage_commit_charge(page);
} else
page_dup_rmap(page, true);
new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
@@ -4108,6 +4207,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
} else {
ClearPagePrivate(page);
hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
+ surplus_hugepage_commit_charge(page);
}

_dst_pte = make_huge_pte(dst_vma, page, dst_vma->vm_flags & VM_WRITE);
--
Tsukada


2018-05-18 04:33:16

by TSUKADA Koutaro

Subject: [PATCH v2 2/7] hugetlb: support migrate charging for surplus hugepages

Surplus hugepages allocated for migration are also charged to the memory cgroup.

Signed-off-by: TSUKADA Koutaro <[email protected]>
---
hugetlb.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 679c151f..2e7b543 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1687,6 +1687,8 @@ static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
if (!page)
return NULL;

+ surplus_hugepage_set_charge(h, page);
+
/*
* We do not account these pages as surplus because they are only
* temporary and will be released properly on the last reference

--
Tsukada


2018-05-18 04:35:23

by TSUKADA Koutaro

Subject: [PATCH v2 3/7] memcg: use compound_order rather than hpage_nr_pages

The current memcg implementation assumes that a compound page is THP. In
order to be able to charge surplus hugepages of any size, use
compound_order() instead.
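
For example, on x86-64 with 4KB base pages (a sketch of the reasoning, not
part of the patch): hpage_nr_pages() returns HPAGE_PMD_NR == 512 for any
page it considers huge, which is only correct for PMD-sized THP:

  2MB THP:          1 << compound_order(page) == 1 << 9  == 512    (unchanged)
  1GB hugetlb page: 1 << compound_order(page) == 1 << 18 == 262144
                    (hpage_nr_pages() would have returned 512 here)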

Signed-off-by: TSUKADA Koutaro <[email protected]>
---
memcontrol.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2bd3df3..a8f1ff8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4483,7 +4483,7 @@ static int mem_cgroup_move_account(struct page *page,
struct mem_cgroup *to)
{
unsigned long flags;
- unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
+ unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;
int ret;
bool anon;

@@ -5417,7 +5417,7 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
bool compound)
{
struct mem_cgroup *memcg = NULL;
- unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
+ unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;
int ret = 0;

if (mem_cgroup_disabled())
@@ -5478,7 +5478,7 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
bool lrucare, bool compound)
{
- unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
+ unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;

VM_BUG_ON_PAGE(!page->mapping, page);
VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
@@ -5522,7 +5522,7 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
bool compound)
{
- unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
+ unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;

if (mem_cgroup_disabled())
return;
@@ -5729,7 +5729,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)

/* Force-charge the new page. The old one will be freed soon */
compound = PageTransHuge(newpage);
- nr_pages = compound ? hpage_nr_pages(newpage) : 1;
+ nr_pages = compound ? (1 << compound_order(newpage)) : 1;

page_counter_charge(&memcg->memory, nr_pages);
if (do_memsw_account())

--
Tsukada


2018-05-18 04:36:43

by TSUKADA Koutaro

Subject: [PATCH v2 4/7] mm, sysctl: make charging surplus hugepages controllable

Make charging of surplus hugepages for the default hstate controllable via
/proc/sys/vm/charge_surplus_hugepages.

Signed-off-by: TSUKADA Koutaro <[email protected]>
---
include/linux/hugetlb.h | 2 ++
kernel/sysctl.c | 7 +++++++
mm/hugetlb.c | 21 +++++++++++++++++++++
3 files changed, 30 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 33fe5be..9314b07 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -80,6 +80,8 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
+int hugetlb_charge_surplus_handler(struct ctl_table *, int, void __user *,
+ size_t *, loff_t *);
int hugetlb_treat_movable_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);

#ifdef CONFIG_NUMA
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6a78cf7..d562d64 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1394,6 +1394,13 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
.mode = 0644,
.proc_handler = hugetlb_overcommit_handler,
},
+ {
+ .procname = "charge_surplus_hugepages",
+ .data = NULL,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = hugetlb_charge_surplus_handler,
+ },
#endif
{
.procname = "lowmem_reserve_ratio",
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2e7b543..9a9549c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3069,6 +3069,27 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
return ret;
}

+int hugetlb_charge_surplus_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ struct hstate *h = &default_hstate;
+ int tmp, ret;
+
+ if (!hugepages_supported())
+ return -EOPNOTSUPP;
+
+ tmp = h->charge_surplus_huge_pages ? 1 : 0;
+ table->data = &tmp;
+ table->maxlen = sizeof(int);
+ ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ if (ret)
+ goto out;
+
+ if (write)
+ h->charge_surplus_huge_pages = tmp ? true : false;
+out:
+ return ret;
+}
#endif /* CONFIG_SYSCTL */

void hugetlb_report_meminfo(struct seq_file *m)

--
Tsukada


2018-05-18 04:39:14

by TSUKADA Koutaro

Subject: [PATCH v2 5/7] hugetlb: add charge_surplus_hugepages attribute

Add an entry for charge_surplus_hugepages to sysfs.

Signed-off-by: TSUKADA Koutaro <[email protected]>
---
hugetlb.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9a9549c..2f9bdbc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2662,6 +2662,30 @@ static ssize_t surplus_hugepages_show(struct kobject *kobj,
}
HSTATE_ATTR_RO(surplus_hugepages);

+static ssize_t charge_surplus_hugepages_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct hstate *h = kobj_to_hstate(kobj, NULL);
+ return sprintf(buf, "%d\n", h->charge_surplus_huge_pages);
+}
+
+static ssize_t charge_surplus_hugepages_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t len)
+{
+ int err;
+ unsigned long input;
+ struct hstate *h = kobj_to_hstate(kobj, NULL);
+
+ err = kstrtoul(buf, 10, &input);
+ if (err)
+ return err;
+
+ h->charge_surplus_huge_pages = input ? true : false;
+
+ return len;
+}
+HSTATE_ATTR(charge_surplus_hugepages);
+
static struct attribute *hstate_attrs[] = {
&nr_hugepages_attr.attr,
&nr_overcommit_hugepages_attr.attr,
@@ -2671,6 +2695,7 @@ static ssize_t surplus_hugepages_show(struct kobject *kobj,
#ifdef CONFIG_NUMA
&nr_hugepages_mempolicy_attr.attr,
#endif
+ &charge_surplus_hugepages_attr.attr,
NULL,
};

--
Tsukada



2018-05-18 04:41:16

by TSUKADA Koutaro

Subject: [PATCH v2 6/7] Documentation, hugetlb: describe about charge_surplus_hugepages

Add a description about charge_surplus_hugepages.

Signed-off-by: TSUKADA Koutaro <[email protected]>
---
hugetlbpage.txt | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt
index faf077d..af8d112 100644
--- a/Documentation/vm/hugetlbpage.txt
+++ b/Documentation/vm/hugetlbpage.txt
@@ -129,6 +129,11 @@ number of "surplus" huge pages from the kernel's normal page pool, when the
persistent huge page pool is exhausted. As these surplus huge pages become
unused, they are freed back to the kernel's normal page pool.

+/proc/sys/vm/charge_surplus_hugepages indicates whether to charge "surplus"
+huge pages obtained from the normal page pool to the memory cgroup. If true,
+the amount that can be overcommitted is limited by the memory usage allowed
+by the memory cgroup to which the task belongs. The default value is false.
+
When increasing the huge page pool size via nr_hugepages, any existing surplus
pages will first be promoted to persistent huge pages. Then, additional
huge pages will be allocated, if necessary and if possible, to fulfill
@@ -169,6 +174,7 @@ Inside each of these directories, the same set of files will exist:
free_hugepages
resv_hugepages
surplus_hugepages
+ charge_surplus_hugepages

which function as described above for the default huge page-sized case.

--
Tsukada



2018-05-18 04:41:53

by TSUKADA Koutaro

Subject: [PATCH v2 7/7] memcg: supports movement of surplus hugepages statistics

When a task that has charged surplus hugepages moves to a different memory
cgroup, update the statistics correctly.
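
For example (a sketch, assuming 4KB base pages): one charged 2MB surplus
hugepage contributes 1 << compound_order(page) == 512 to mc.precharge in
the counting phase, and mem_cgroup_move_account() then moves those 512 page
charges from mc.from to mc.to in a single step.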

Signed-off-by: TSUKADA Koutaro <[email protected]>
---
memcontrol.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 99 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a8f1ff8..63f0922 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4698,12 +4698,110 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
return 0;
}

+#ifdef CONFIG_HUGETLB_PAGE
+static enum mc_target_type get_mctgt_type_hugetlb(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *pte, union mc_target *target)
+{
+ struct page *page = NULL;
+ pte_t entry;
+ enum mc_target_type ret = MC_TARGET_NONE;
+
+ if (!(mc.flags & MOVE_ANON))
+ return ret;
+
+ entry = huge_ptep_get(pte);
+ if (!pte_present(entry))
+ return ret;
+
+ page = pte_page(entry);
+ VM_BUG_ON_PAGE(!page || !PageHead(page), page);
+ if (likely(!PageSurplusCharge(page)))
+ return ret;
+ if (page->mem_cgroup == mc.from) {
+ ret = MC_TARGET_PAGE;
+ if (target) {
+ get_page(page);
+ target->page = page;
+ }
+ }
+
+ return ret;
+}
+
+static int hugetlb_count_precharge_pte_range(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ struct vm_area_struct *vma = walk->vma;
+ struct mm_struct *mm = walk->mm;
+ spinlock_t *ptl;
+ union mc_target target;
+
+ ptl = huge_pte_lock(hstate_vma(vma), mm, pte);
+ if (get_mctgt_type_hugetlb(vma, addr, pte, &target) == MC_TARGET_PAGE) {
+ mc.precharge += (1 << compound_order(target.page));
+ put_page(target.page);
+ }
+ spin_unlock(ptl);
+
+ return 0;
+}
+
+static int hugetlb_move_charge_pte_range(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ struct vm_area_struct *vma = walk->vma;
+ struct mm_struct *mm = walk->mm;
+ spinlock_t *ptl;
+ enum mc_target_type target_type;
+ union mc_target target;
+ struct page *page;
+ unsigned long nr_pages;
+
+ ptl = huge_pte_lock(hstate_vma(vma), mm, pte);
+ target_type = get_mctgt_type_hugetlb(vma, addr, pte, &target);
+ if (target_type == MC_TARGET_PAGE) {
+ page = target.page;
+ nr_pages = (1 << compound_order(page));
+ if (mc.precharge < nr_pages) {
+ put_page(page);
+ goto unlock;
+ }
+ if (!mem_cgroup_move_account(page, true, mc.from, mc.to)) {
+ mc.precharge -= nr_pages;
+ mc.moved_charge += nr_pages;
+ }
+ put_page(page);
+ }
+unlock:
+ spin_unlock(ptl);
+
+ return 0;
+}
+#else
+static int hugetlb_count_precharge_pte_range(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ return 0;
+}
+
+static int hugetlb_move_charge_pte_range(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ return 0;
+}
+#endif
+
static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
{
unsigned long precharge;

struct mm_walk mem_cgroup_count_precharge_walk = {
.pmd_entry = mem_cgroup_count_precharge_pte_range,
+ .hugetlb_entry = hugetlb_count_precharge_pte_range,
.mm = mm,
};
down_read(&mm->mmap_sem);
@@ -4981,6 +5079,7 @@ static void mem_cgroup_move_charge(void)
{
struct mm_walk mem_cgroup_move_charge_walk = {
.pmd_entry = mem_cgroup_move_charge_pte_range,
+ .hugetlb_entry = hugetlb_move_charge_pte_range,
.mm = mc.mm,
};

--
Tsukada


2018-05-18 17:48:56

by Punit Agrawal

Subject: Re: [PATCH v2 3/7] memcg: use compound_order rather than hpage_nr_pages

Tsukada-san,

I am not familiar with memcg so can't comment about whether the patchset
is the right way to solve the problem outlined in the cover letter but
had a couple of comments about this patch.

TSUKADA Koutaro <[email protected]> writes:

> The current memcg implementation assumes that the compound page is THP.
> In order to be able to charge surplus hugepage, we use compound_order.
>
> Signed-off-by: TSUKADA Koutaro <[email protected]>

Please move this before Patch 1/7. This is to prevent wrong accounting
of pages to memcg for size != PMD_SIZE.

> ---
> memcontrol.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2bd3df3..a8f1ff8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4483,7 +4483,7 @@ static int mem_cgroup_move_account(struct page *page,
> struct mem_cgroup *to)
> {
> unsigned long flags;
> - unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
> + unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;

Instead of replacing calls to hpage_nr_pages(), is it possible to modify
it to do the calculation?

Thanks,
Punit

> int ret;
> bool anon;
>
> @@ -5417,7 +5417,7 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
> bool compound)
> {
> struct mem_cgroup *memcg = NULL;
> - unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
> + unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;
> int ret = 0;
>
> if (mem_cgroup_disabled())
> @@ -5478,7 +5478,7 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
> void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
> bool lrucare, bool compound)
> {
> - unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
> + unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;
>
> VM_BUG_ON_PAGE(!page->mapping, page);
> VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
> @@ -5522,7 +5522,7 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
> void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
> bool compound)
> {
> - unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
> + unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;
>
> if (mem_cgroup_disabled())
> return;
> @@ -5729,7 +5729,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
>
> /* Force-charge the new page. The old one will be freed soon */
> compound = PageTransHuge(newpage);
> - nr_pages = compound ? hpage_nr_pages(newpage) : 1;
> + nr_pages = compound ? (1 << compound_order(newpage)) : 1;
>
> page_counter_charge(&memcg->memory, nr_pages);
> if (do_memsw_account())

2018-05-18 17:51:56

by Punit Agrawal

Subject: Re: [PATCH v2 3/7] memcg: use compound_order rather than hpage_nr_pages

Punit Agrawal <[email protected]> writes:

> Tsukada-san,
>
> I am not familiar with memcg so can't comment about whether the patchset
> is the right way to solve the problem outlined in the cover letter but
> had a couple of comments about this patch.
>
> TSUKADA Koutaro <[email protected]> writes:
>
>> The current memcg implementation assumes that the compound page is THP.
>> In order to be able to charge surplus hugepage, we use compound_order.
>>
>> Signed-off-by: TSUKADA Koutaro <[email protected]>
>
> Please move this before Patch 1/7. This is to prevent wrong accounting
> of pages to memcg for size != PMD_SIZE.

I just noticed that the default state is off so the change isn't enabled
until the sysfs node is exposed in the next patch. Please ignore this
comment.

One below still applies.

>
>> ---
>> memcontrol.c | 10 +++++-----
>> 1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 2bd3df3..a8f1ff8 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -4483,7 +4483,7 @@ static int mem_cgroup_move_account(struct page *page,
>> struct mem_cgroup *to)
>> {
>> unsigned long flags;
>> - unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
>> + unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;
>
> Instead of replacing calls to hpage_nr_pages(), is it possible to modify
> it to do the calculation?
>
> Thanks,
> Punit
>
>> int ret;
>> bool anon;
>>
>> @@ -5417,7 +5417,7 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
>> bool compound)
>> {
>> struct mem_cgroup *memcg = NULL;
>> - unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
>> + unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;
>> int ret = 0;
>>
>> if (mem_cgroup_disabled())
>> @@ -5478,7 +5478,7 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
>> void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
>> bool lrucare, bool compound)
>> {
>> - unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
>> + unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;
>>
>> VM_BUG_ON_PAGE(!page->mapping, page);
>> VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
>> @@ -5522,7 +5522,7 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
>> void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
>> bool compound)
>> {
>> - unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
>> + unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;
>>
>> if (mem_cgroup_disabled())
>> return;
>> @@ -5729,7 +5729,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
>>
>> /* Force-charge the new page. The old one will be freed soon */
>> compound = PageTransHuge(newpage);
>> - nr_pages = compound ? hpage_nr_pages(newpage) : 1;
>> + nr_pages = compound ? (1 << compound_order(newpage)) : 1;
>>
>> page_counter_charge(&memcg->memory, nr_pages);
>> if (do_memsw_account())

2018-05-21 03:48:50

by TSUKADA Koutaro

Subject: Re: [PATCH v2 3/7] memcg: use compound_order rather than hpage_nr_pages

On 2018/05/19 2:51, Punit Agrawal wrote:
> Punit Agrawal <[email protected]> writes:
>
>> Tsukada-san,
>>
>> I am not familiar with memcg so can't comment about whether the patchset
>> is the right way to solve the problem outlined in the cover letter but
>> had a couple of comments about this patch.
>>
>> TSUKADA Koutaro <[email protected]> writes:
>>
>>> The current memcg implementation assumes that the compound page is THP.
>>> In order to be able to charge surplus hugepage, we use compound_order.
>>>
>>> Signed-off-by: TSUKADA Koutaro <[email protected]>
>>
>> Please move this before Patch 1/7. This is to prevent wrong accounting
>> of pages to memcg for size != PMD_SIZE.
>
> I just noticed that the default state is off so the change isn't enabled
> until the sysfs node is exposed in the next patch. Please ignore this
> comment.
>
> One below still applies.
>
>>
>>> ---
>>> memcontrol.c | 10 +++++-----
>>> 1 file changed, 5 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 2bd3df3..a8f1ff8 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -4483,7 +4483,7 @@ static int mem_cgroup_move_account(struct page *page,
>>> struct mem_cgroup *to)
>>> {
>>> unsigned long flags;
>>> - unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
>>> + unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;
>>
>> Instead of replacing calls to hpage_nr_pages(), is it possible to modify
>> it to do the calculation?

Thank you for reviewing my code, and please just call me Tsukada.

I think it is possible to modify hpage_nr_pages() itself rather than
replacing the calls to it.

Judging from what hpage_nr_pages() is meant to compute, I thought that its
definition could be moved outside of CONFIG_TRANSPARENT_HUGEPAGE. Both THP
and HugeTLBfs pages can then be handled correctly, because compound_order()
determines the order by checking whether the page is a head page.

Also, I would like to use compound_order() inside hpage_nr_pages(), but
since huge_mm.h is included before mm.h, where compound_order() is defined,
I moved hpage_nr_pages() to mm.h.

Instead of patch 3/7, does the following patch implement what you intended?

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a8a1262..1186ab7 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -204,12 +204,6 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
else
return NULL;
}
-static inline int hpage_nr_pages(struct page *page)
-{
- if (unlikely(PageTransHuge(page)))
- return HPAGE_PMD_NR;
- return 1;
-}

struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, int flags);
@@ -254,8 +248,6 @@ static inline bool thp_migration_supported(void)
#define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; })
#define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })

-#define hpage_nr_pages(x) 1
-
static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
{
return false;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ac1f06..082f2ee 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -673,6 +673,12 @@ static inline unsigned int compound_order(struct page *page)
return page[1].compound_order;
}

+static inline int hpage_nr_pages(struct page *page)
+{
+ VM_BUG_ON_PAGE(PageTail(page), page);
+ return (1 << compound_order(page));
+}
+
static inline void set_compound_order(struct page *page, unsigned int order)
{
page[1].compound_order = order;
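
Note that compound_order() returns 0 unless the page is a head page, so
hpage_nr_pages() still returns 1 for base pages; the VM_BUG_ON_PAGE is
there to catch callers passing a tail page, for which page[1] is not
meaningful.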

--
Thanks,
Tsukada


2018-05-21 14:52:56

by Punit Agrawal

Subject: Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

Hi Tsukada,

I was staring at memcg code to better understand your changes and had
the below thought.

TSUKADA Koutaro <[email protected]> writes:

[...]

> In this patch-set, introduce the charge_surplus_huge_pages(boolean) to
> struct hstate. If it is true, it charges to the memory cgroup to which the
> task that obtained surplus hugepages belongs. If it is false, do nothing as
> before, and the default value is false. The charge_surplus_huge_pages can
> be controlled procfs or sysfs interfaces.

Instead of tying the surplus huge page charging control per-hstate,
could the control be made per-memcg?

This can be done by introducing a per-memory controller file in sysfs
(memory.charge_surplus_hugepages?) that indicates whether surplus
hugepages are to be charged to the controller and forms part of the
total limit. IIUC, the limit already accounts for page and swap cache
pages.

This would allow the control to be enabled per-cgroup and also keep the
userspace control interface in one place.

As said earlier, I'm not familiar with memcg so the above might not be
feasible, but I think it'll lead to a more coherent user interface.
Hopefully, more knowledgeable folks on the thread can chime in.

Thanks,
Punit

> Since THP is very effective in environments with kernel page size of 4KB,
> such as x86, there is no reason to positively use HugeTLBfs, so I think
> that there is no situation to enable charge_surplus_huge_pages. However, in
> some distributions such as arm64, the page size of the kernel is 64KB, and
> the size of THP is too huge as 512MB, making it difficult to use. HugeTLBfs
> may support multiple huge page sizes, and in such a special environment
> there is a desire to use HugeTLBfs.
>
> The patch set is for 4.17.0-rc3+. I don't know whether patch-set are
> acceptable or not, so I just done a simple test.
>
> Thanks,
> Tsukada
>
> TSUKADA Koutaro (7):
> hugetlb: introduce charge_surplus_huge_pages to struct hstate
> hugetlb: supports migrate charging for surplus hugepages
> memcg: use compound_order rather than hpage_nr_pages
> mm, sysctl: make charging surplus hugepages controllable
> hugetlb: add charge_surplus_hugepages attribute
> Documentation, hugetlb: describe about charge_surplus_hugepages
> memcg: supports movement of surplus hugepages statistics
>
> Documentation/vm/hugetlbpage.txt | 6 +
> include/linux/hugetlb.h | 4 +
> kernel/sysctl.c | 7 +
> mm/hugetlb.c | 148 +++++++++++++++++++++++++++++++++++++++
> mm/memcontrol.c | 109 +++++++++++++++++++++++++++-
> 5 files changed, 269 insertions(+), 5 deletions(-)

2018-05-21 14:54:46

by Punit Agrawal

Subject: Re: [PATCH v2 3/7] memcg: use compound_order rather than hpage_nr_pages

TSUKADA Koutaro <[email protected]> writes:

> On 2018/05/19 2:51, Punit Agrawal wrote:
>> Punit Agrawal <[email protected]> writes:
>>
>>> Tsukada-san,
>>>
>>> I am not familiar with memcg so can't comment about whether the patchset
>>> is the right way to solve the problem outlined in the cover letter but
>>> had a couple of comments about this patch.
>>>
>>> TSUKADA Koutaro <[email protected]> writes:
>>>
>>>> The current memcg implementation assumes that the compound page is THP.
>>>> In order to be able to charge surplus hugepage, we use compound_order.
>>>>
>>>> Signed-off-by: TSUKADA Koutaro <[email protected]>
>>>
>>> Please move this before Patch 1/7. This is to prevent wrong accounting
>>> of pages to memcg for size != PMD_SIZE.
>>
>> I just noticed that the default state is off so the change isn't enabled
>> until the sysfs node is exposed in the next patch. Please ignore this
>> comment.
>>
>> One below still applies.
>>
>>>
>>>> ---
>>>> memcontrol.c | 10 +++++-----
>>>> 1 file changed, 5 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>> index 2bd3df3..a8f1ff8 100644
>>>> --- a/mm/memcontrol.c
>>>> +++ b/mm/memcontrol.c
>>>> @@ -4483,7 +4483,7 @@ static int mem_cgroup_move_account(struct page *page,
>>>> struct mem_cgroup *to)
>>>> {
>>>> unsigned long flags;
>>>> - unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
>>>> + unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;
>>>
>>> Instead of replacing calls to hpage_nr_pages(), is it possible to modify
>>> it to do the calculation?
>
> Thank you for reviewing my code, and please just call me Tsukada.
>
> I think it is possible to modify hpage_nr_pages() itself rather than
> replacing the calls to it.
>
> Judging from what hpage_nr_pages() is meant to compute, I thought that
> its definition could be moved outside of CONFIG_TRANSPARENT_HUGEPAGE.
> Both THP and HugeTLBfs pages can then be handled correctly, because
> compound_order() determines the order by checking whether the page is a
> head page.
>
> Also, I would like to use compound_order() inside hpage_nr_pages(), but
> since huge_mm.h is included before mm.h, where compound_order() is
> defined, I moved hpage_nr_pages() to mm.h.
>
> Instead of patch 3/7, does the following patch implement what you
> intended?
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index a8a1262..1186ab7 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -204,12 +204,6 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
> else
> return NULL;
> }
> -static inline int hpage_nr_pages(struct page *page)
> -{
> - if (unlikely(PageTransHuge(page)))
> - return HPAGE_PMD_NR;
> - return 1;
> -}
>
> struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
> pmd_t *pmd, int flags);
> @@ -254,8 +248,6 @@ static inline bool thp_migration_supported(void)
> #define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; })
> #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
>
> -#define hpage_nr_pages(x) 1
> -
> static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
> {
> return false;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1ac1f06..082f2ee 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -673,6 +673,12 @@ static inline unsigned int compound_order(struct page *page)
> return page[1].compound_order;
> }
>
> +static inline int hpage_nr_pages(struct page *page)
> +{
> + VM_BUG_ON_PAGE(PageTail(page), page);
> + return (1 << compound_order(page));
> +}
> +
> static inline void set_compound_order(struct page *page, unsigned int order)
> {
> page[1].compound_order = order;

That looks a lot better. Thanks for giving it a go.


2018-05-21 18:09:00

by Mike Kravetz

Subject: Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

On 05/17/2018 09:27 PM, TSUKADA Koutaro wrote:
> Thanks to Mike Kravetz for comment on the previous version patch.
>
> The purpose of this patch-set is to make it possible to control whether or
> not to charge surplus hugetlb pages obtained by overcommitting to memory
> cgroup. In the future, I am trying to accomplish limiting the memory usage
> of applications that use both normal pages and hugetlb pages by the memory
> cgroup(not use the hugetlb cgroup).
>
> Applications that use shared libraries like libhugetlbfs.so use both normal
> pages and hugetlb pages, but we do not know how much to use each. Please
> suppose you want to manage the memory usage of such applications by cgroup
> How do you set the memory cgroup and hugetlb cgroup limit when you want to
> limit memory usage to 10GB?
>
> If you set a limit of 10GB for each, the user can use a total of 20GB of
> memory and can not limit it well. Since it is difficult to estimate the
> ratio used by user of normal pages and hugetlb pages, setting limits of 2GB
> to memory cgroup and 8GB to hugetlb cgroup is not very good idea. In such a
> case, I thought that by using my patch-set, we could manage resources just
> by setting 10GB as the limit of memory cgroup (there is no limit to hugetlb
> cgroup).
>
> In this patch-set, introduce the charge_surplus_huge_pages(boolean) to
> struct hstate. If it is true, it charges to the memory cgroup to which the
> task that obtained surplus hugepages belongs. If it is false, do nothing as
> before, and the default value is false. The charge_surplus_huge_pages can
> be controlled procfs or sysfs interfaces.
>
> Since THP is very effective in environments with kernel page size of 4KB,
> such as x86, there is no reason to positively use HugeTLBfs, so I think
> that there is no situation to enable charge_surplus_huge_pages. However, in
> some distributions such as arm64, the page size of the kernel is 64KB, and
> the size of THP is too huge as 512MB, making it difficult to use. HugeTLBfs
> may support multiple huge page sizes, and in such a special environment
> there is a desire to use HugeTLBfs.

One of the basic questions/concerns I have is accounting for surplus huge
pages in the default memory resource controller. The existing hugetlb
resource controller already takes hugetlbfs huge pages into account,
including surplus pages. This series would allow surplus pages to be
accounted for in the default memory controller, the hugetlb controller,
or both.

I understand that current mechanisms do not meet the needs of the above
use case. The question is whether this is an appropriate way to approach
the issue. My cgroup experience and knowledge is extremely limited, but
it does not appear that any other resource can be controlled by multiple
controllers. Therefore, I am concerned that this may be going against
basic cgroup design philosophy.

It would be good to get comments from people more cgroup knowledgeable,
and especially from those involved in the decision to do separate hugetlb
control.

--
Mike Kravetz

>
> The patch set is for 4.17.0-rc3+. I don't know whether patch-set are
> acceptable or not, so I just done a simple test.
>
> Thanks,
> Tsukada
>
> TSUKADA Koutaro (7):
> hugetlb: introduce charge_surplus_huge_pages to struct hstate
> hugetlb: supports migrate charging for surplus hugepages
> memcg: use compound_order rather than hpage_nr_pages
> mm, sysctl: make charging surplus hugepages controllable
> hugetlb: add charge_surplus_hugepages attribute
> Documentation, hugetlb: describe about charge_surplus_hugepages
> memcg: supports movement of surplus hugepages statistics
>
> Documentation/vm/hugetlbpage.txt | 6 +
> include/linux/hugetlb.h | 4 +
> kernel/sysctl.c | 7 +
> mm/hugetlb.c | 148 +++++++++++++++++++++++++++++++++++++++
> mm/memcontrol.c | 109 +++++++++++++++++++++++++++-
> 5 files changed, 269 insertions(+), 5 deletions(-)
>

2018-05-22 12:57:28

by TSUKADA Koutaro

Subject: Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

Hi Punit,

On 2018/05/21 23:52, Punit Agrawal wrote:
> Hi Tsukada,
>
> I was staring at memcg code to better understand your changes and had
> the below thought.
>
> TSUKADA Koutaro <[email protected]> writes:
>
> [...]
>
>> In this patch-set, introduce the charge_surplus_huge_pages(boolean) to
>> struct hstate. If it is true, it charges to the memory cgroup to which the
>> task that obtained surplus hugepages belongs. If it is false, do nothing as
>> before, and the default value is false. The charge_surplus_huge_pages can
>> be controlled procfs or sysfs interfaces.
>
> Instead of tying the surplus huge page charging control per-hstate,
> could the control be made per-memcg?
>
> This can be done by introducing a per-memory controller file in sysfs
> (memory.charge_surplus_hugepages?) that indicates whether surplus
> hugepages are to be charged to the controller and forms part of the
> total limit. IIUC, the limit already accounts for page and swap cache
> pages.
>
> This would allow the control to be enabled per-cgroup and also keep the
> userspace control interface in one place.
>
> As said earlier, I'm not familiar with memcg so the above might not be a
> feasible but think it'll lead to a more coherent user
> interface. Hopefully, more knowledgeable folks on the thread can chime
> in.
>

Thank you for the good advice.
As you mentioned, it is better to be able to control this per-memcg. After
organizing my thoughts, I will develop the next version of the patch-set to
solve these issues and try again.

Thanks,
Tsukada

> Thanks,
> Punit
>
>> Since THP is very effective in environments with kernel page size of 4KB,
>> such as x86, there is no reason to positively use HugeTLBfs, so I think
>> that there is no situation to enable charge_surplus_huge_pages. However, in
>> some distributions such as arm64, the page size of the kernel is 64KB, and
>> the size of THP is too huge as 512MB, making it difficult to use. HugeTLBfs
>> may support multiple huge page sizes, and in such a special environment
>> there is a desire to use HugeTLBfs.
>>
>> The patch set is for 4.17.0-rc3+. I don't know whether patch-set are
>> acceptable or not, so I just done a simple test.
>>
>> Thanks,
>> Tsukada
>>
>> TSUKADA Koutaro (7):
>> hugetlb: introduce charge_surplus_huge_pages to struct hstate
>> hugetlb: supports migrate charging for surplus hugepages
>> memcg: use compound_order rather than hpage_nr_pages
>> mm, sysctl: make charging surplus hugepages controllable
>> hugetlb: add charge_surplus_hugepages attribute
>> Documentation, hugetlb: describe about charge_surplus_hugepages
>> memcg: supports movement of surplus hugepages statistics
>>
>> Documentation/vm/hugetlbpage.txt | 6 +
>> include/linux/hugetlb.h | 4 +
>> kernel/sysctl.c | 7 +
>> mm/hugetlb.c | 148 +++++++++++++++++++++++++++++++++++++++
>> mm/memcontrol.c | 109 +++++++++++++++++++++++++++-
>> 5 files changed, 269 insertions(+), 5 deletions(-)



2018-05-22 13:07:00

by TSUKADA Koutaro

Subject: Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

On 2018/05/22 3:07, Mike Kravetz wrote:
> On 05/17/2018 09:27 PM, TSUKADA Koutaro wrote:
>> Thanks to Mike Kravetz for comment on the previous version patch.
>>
>> The purpose of this patch-set is to make it possible to control whether or
>> not to charge surplus hugetlb pages obtained by overcommitting to memory
>> cgroup. In the future, I am trying to accomplish limiting the memory usage
>> of applications that use both normal pages and hugetlb pages by the memory
>> cgroup(not use the hugetlb cgroup).
>>
>> Applications that use shared libraries like libhugetlbfs.so use both normal
>> pages and hugetlb pages, but we do not know how much to use each. Please
>> suppose you want to manage the memory usage of such applications by cgroup
>> How do you set the memory cgroup and hugetlb cgroup limit when you want to
>> limit memory usage to 10GB?
>>
>> If you set a limit of 10GB for each, the user can use a total of 20GB of
>> memory and can not limit it well. Since it is difficult to estimate the
>> ratio used by user of normal pages and hugetlb pages, setting limits of 2GB
>> to memory cgroup and 8GB to hugetlb cgroup is not very good idea. In such a
>> case, I thought that by using my patch-set, we could manage resources just
>> by setting 10GB as the limit of memory cgroup (there is no limit to hugetlb
>> cgroup).
>>
>> In this patch-set, introduce the charge_surplus_huge_pages(boolean) to
>> struct hstate. If it is true, it charges to the memory cgroup to which the
>> task that obtained surplus hugepages belongs. If it is false, do nothing as
>> before, and the default value is false. The charge_surplus_huge_pages can
>> be controlled procfs or sysfs interfaces.
>>
>> Since THP is very effective in environments with kernel page size of 4KB,
>> such as x86, there is no reason to positively use HugeTLBfs, so I think
>> that there is no situation to enable charge_surplus_huge_pages. However, in
>> some distributions such as arm64, the page size of the kernel is 64KB, and
>> the size of THP is too huge as 512MB, making it difficult to use. HugeTLBfs
>> may support multiple huge page sizes, and in such a special environment
>> there is a desire to use HugeTLBfs.
>
> One of the basic questions/concerns I have is accounting for surplus huge
> pages in the default memory resource controller. The existing huegtlb
> resource controller already takes hugetlbfs huge pages into account,
> including surplus pages. This series would allow surplus pages to be
> accounted for in the default memory controller, or the hugetlb controller
> or both.
>
> I understand that current mechanisms do not meet the needs of the above
> use case. The question is whether this is an appropriate way to approach
> the issue. My cgroup experience and knowledge is extremely limited, but
> it does not appear that any other resource can be controlled by multiple
> controllers. Therefore, I am concerned that this may be going against
> basic cgroup design philosophy.

Thank you for your feedback.
That makes sense: surplus hugepages would be charged to both the memcg and
the hugetlb cgroup, which may be contrary to the cgroup design philosophy.

Based on the above advice, I have considered the following improvement.
What do you think about it?

The 'charge_surplus_hugepages' option of the v2 patch-set was a switch for
"whether to charge the memcg in addition to the hugetlb cgroup", but it
will be abolished. Instead, it becomes an option to "charge only the memcg
instead of the hugetlb cgroup". This is called 'surplus_charge_to_memcg'.

The surplus_charge_to_memcg option is created per hugetlb cgroup.
If it is false (the default), the charge destination for each page type is
the same as in the current kernel. If it becomes true, the hugetlb cgroup
stops accounting for surplus hugepages, and the memcg accounts for them
instead.

A table showing which cgroups are charged:

page types          | current  v2(off)  v2(on)  v3(off)  v3(on)
--------------------+-------------------------------------------
normal + THP        |    m        m        m        m       m
hugetlb(persistent) |    h        h        h        h       h
hugetlb(surplus)    |    h        h       m+h       h       m
--------------------+-------------------------------------------

v2: charge_surplus_hugepages option
v3: next version, surplus_charge_to_memcg option
m: memory cgroup
h: hugetlb cgroup

>
> It would be good to get comments from people more cgroup knowledgeable,
> and especially from those involved in the decision to do separate hugetlb
> control.
>

I looked through the commit log of mm/hugetlb_cgroup.c, but it does not
seem to have specifically considered surplus hugepages. Later, I will send
a mail to the hugetlb cgroup's committer to ask about the intended charging
semantics for surplus hugepages.

--
Thanks,
Tsukada


2018-05-22 13:52:28

by Michal Hocko

Subject: Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

On Fri 18-05-18 13:27:27, TSUKADA Koutaro wrote:
> Thanks to Mike Kravetz for comment on the previous version patch.

I am sorry that I didn't join the discussion for the previous version
but time just didn't allow that. So sorry if I am repeating something
already sorted out.

> The purpose of this patch-set is to make it possible to control whether or
> not to charge surplus hugetlb pages obtained by overcommitting to memory
> cgroup. In the future, I am trying to accomplish limiting the memory usage
> of applications that use both normal pages and hugetlb pages by the memory
> cgroup(not use the hugetlb cgroup).

There was a deliberate decision to keep hugetlb and "normal" memory
cgroup controllers separate. Mostly because hugetlb memory is an
artificial memory subsystem on its own and it doesn't fit into the rest
of memcg accounted memory very well. I believe we want to keep that
status quo.

> Applications that use shared libraries like libhugetlbfs.so use both normal
> pages and hugetlb pages, but we do not know how much to use each. Please
> suppose you want to manage the memory usage of such applications by cgroup
> How do you set the memory cgroup and hugetlb cgroup limit when you want to
> limit memory usage to 10GB?

Well, such a usecase requires an explicit configuration already, either
by using special wrappers or by modifying the code. So I would argue that
you have quite a good knowledge of the setup. If you need greater
flexibility then just do not use hugetlb at all and rely on THP.
[...]

> In this patch-set, introduce the charge_surplus_huge_pages(boolean) to
> struct hstate. If it is true, it charges to the memory cgroup to which the
> task that obtained surplus hugepages belongs. If it is false, do nothing as
> before, and the default value is false. The charge_surplus_huge_pages can
> be controlled procfs or sysfs interfaces.

I do not really think this is a good idea. We really do not want to make
the current hugetlb code more complex than it is already. The current
hugetlb cgroup controller is simple and works at least somehow. I would
not add more on top unless there is a _really_ strong usecase behind.
Please make sure to describe such a usecase in details before we even
start considering the code.

> Since THP is very effective in environments with kernel page size of 4KB,
> such as x86, there is no reason to positively use HugeTLBfs, so I think
> that there is no situation to enable charge_surplus_huge_pages. However, in
> some distributions such as arm64, the page size of the kernel is 64KB, and
> the size of THP is too huge as 512MB, making it difficult to use. HugeTLBfs
> may support multiple huge page sizes, and in such a special environment
> there is a desire to use HugeTLBfs.

Well, then I would argue that you shouldn't use 64kB pages for your
setup or allow THP for smaller sizes. Really hugetlb pages are by no
means a substitute here.
--
Michal Hocko
SUSE Labs

2018-05-22 18:54:36

by Michal Hocko

Subject: Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

On Tue 22-05-18 22:04:23, TSUKADA Koutaro wrote:
> On 2018/05/22 3:07, Mike Kravetz wrote:
> > On 05/17/2018 09:27 PM, TSUKADA Koutaro wrote:
> >> Thanks to Mike Kravetz for comment on the previous version patch.
> >>
> >> The purpose of this patch-set is to make it possible to control whether or
> >> not to charge surplus hugetlb pages obtained by overcommitting to memory
> >> cgroup. In the future, I am trying to accomplish limiting the memory usage
> >> of applications that use both normal pages and hugetlb pages by the memory
> >> cgroup(not use the hugetlb cgroup).
> >>
> >> Applications that use shared libraries like libhugetlbfs.so use both normal
> >> pages and hugetlb pages, but we do not know how much to use each. Please
> >> suppose you want to manage the memory usage of such applications by cgroup
> >> How do you set the memory cgroup and hugetlb cgroup limit when you want to
> >> limit memory usage to 10GB?
> >>
> >> If you set a limit of 10GB for each, the user can use a total of 20GB of
> >> memory and can not limit it well. Since it is difficult to estimate the
> >> ratio used by user of normal pages and hugetlb pages, setting limits of 2GB
> >> to memory cgroup and 8GB to hugetlb cgroup is not very good idea. In such a
> >> case, I thought that by using my patch-set, we could manage resources just
> >> by setting 10GB as the limit of memory cgroup (there is no limit to hugetlb
> >> cgroup).
> >>
> >> In this patch-set, introduce the charge_surplus_huge_pages(boolean) to
> >> struct hstate. If it is true, it charges to the memory cgroup to which the
> >> task that obtained surplus hugepages belongs. If it is false, do nothing as
> >> before, and the default value is false. The charge_surplus_huge_pages can
> >> be controlled procfs or sysfs interfaces.
> >>
> >> Since THP is very effective in environments with kernel page size of 4KB,
> >> such as x86, there is no reason to positively use HugeTLBfs, so I think
> >> that there is no situation to enable charge_surplus_huge_pages. However, in
> >> some distributions such as arm64, the page size of the kernel is 64KB, and
> >> the size of THP is too huge as 512MB, making it difficult to use. HugeTLBfs
> >> may support multiple huge page sizes, and in such a special environment
> >> there is a desire to use HugeTLBfs.
> >
> > One of the basic questions/concerns I have is accounting for surplus huge
> > pages in the default memory resource controller. The existing hugetlb
> > resource controller already takes hugetlbfs huge pages into account,
> > including surplus pages. This series would allow surplus pages to be
> > accounted for in the default memory controller, or the hugetlb controller
> > or both.
> >
> > I understand that current mechanisms do not meet the needs of the above
> > use case. The question is whether this is an appropriate way to approach
> > the issue.

I do share your view Mike!

> > My cgroup experience and knowledge is extremely limited, but
> > it does not appear that any other resource can be controlled by multiple
> > controllers. Therefore, I am concerned that this may be going against
> > basic cgroup design philosophy.
>
> Thank you for your feedback.
> That makes sense, surplus hugepages are charged to both memcg and hugetlb
> cgroup, which may be contrary to cgroup design philosophy.
>
> Based on the above advice, I have considered the following improvements,
> what do you think about?
>
> The 'charge_surplus_hugepages' of v2 patch-set was an option to switch
> "whether to charge memcg in addition to hugetlb cgroup", but it will be
> abolished. Instead, change to "switch only to memcg instead of hugetlb
> cgroup" option. This is called 'surplus_charge_to_memcg'.

This all looks so hackish and ad-hoc that I would be tempted to give it
an outright nack, but let's hear more about why we need this fiddling at
all. I've asked in the other email so I guess I will get an answer there,
but let me just emphasize again that I absolutely detest the possibility
of putting hugetlb pages into the memcg mix. They just do not belong there.
Try to look at the previous discussions of why it was decided to have
separate hugetlb accounting at all.

I am also quite confused why you keep distinguishing surplus hugetlb
pages from regular preallocated ones. Being a surplus page is an
implementation detail that we use for an internal accounting rather than
something to exhibit to the userspace even more than we do currently.
Just look at what should/would happen when you need to adjust accounting -
e.g. due to a pool resize. Are you going to uncharge those surplus pages
from memcg to reflect their persistence?
--
Michal Hocko
SUSE Labs

2018-05-22 20:31:35

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

On 05/22/2018 06:04 AM, TSUKADA Koutaro wrote:
>
> I stared at the commit log of mm/hugetlb_cgroup.c, but it did not seem to
> have given special consideration to surplus hugepages. Later, I will send
> a mail to the hugetlb cgroup's committer to ask about the surplus
> hugepage charging specification.
>

I went back and looked at surplus huge page allocation. Previously, I made
a statement that the hugetlb controller accounts for surplus huge pages.
Turns out that may not be 100% correct.

Thanks to Michal, all surplus huge page allocation is performed via the
alloc_surplus_huge_page() routine. This will ultimately call into the
buddy allocator without any cgroup charges. Calls to alloc_surplus_huge_page
are made from:
- alloc_huge_page() when allocating a huge page to a mapping/file. In this
case, appropriate calls to the hugetlb controller are in place. So, any
limits are enforced here.
- gather_surplus_pages() when allocating and setting aside 'reserved' huge
pages. No accounting is performed here. Do note that in this case the
allocated huge pages are not assigned to the mapping/file. Even though
'reserved', they are deposited into the global pool and also counted as
'free'. When these reserved pages are ultimately used to populate a
file/mapping, the code path goes through alloc_huge_page() where appropriate
calls to the hugetlb controller are in place.

So, the bottom line is that surplus huge pages are not accounted for when
they are allocated as 'reserves'. It is not until these reserves are actually
used that accounting limits are checked. This 'seems' to align with general
allocation of huge pages within the pool. No accounting is done until they
are actually allocated to a mapping/file.
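To make the two call paths concrete, here is a heavily condensed sketch
of the 4.17-era code (not the literal source; locking, error handling
and the reservation arithmetic are omitted):

/* Path 1: allocating a page to a mapping/file (limits enforced). */
struct page *alloc_huge_page(struct vm_area_struct *vma,
			     unsigned long addr, int avoid_reserve)
{
	struct hstate *h = hstate_vma(vma);
	struct hugetlb_cgroup *h_cg;
	struct page *page;

	/* hugetlb controller limits are checked here */
	hugetlb_cgroup_charge_cgroup(hstate_index(h),
				     pages_per_huge_page(h), &h_cg);
	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, 0);
	if (!page)
		/* calls alloc_surplus_huge_page() internally */
		page = alloc_buddy_huge_page_with_mpol(h, vma, addr);
	hugetlb_cgroup_commit_charge(hstate_index(h),
				     pages_per_huge_page(h), h_cg, page);
	return page;
}

/* Path 2: setting aside reserves (no cgroup accounting at all). */
static int gather_surplus_pages(struct hstate *h, int delta)
{
	int i;

	for (i = 0; i < delta; i++) {
		struct page *page;

		page = alloc_surplus_huge_page(h, htlb_alloc_mask(h),
					       NUMA_NO_NODE, NULL);
		/* deposited into the global pool and counted as 'free' */
		enqueue_huge_page(h, page);
	}
	return 0;
}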

--
Mike Kravetz

2018-05-24 04:29:04

by TSUKADA Koutaro

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

On 2018/05/22 22:51, Michal Hocko wrote:
> On Fri 18-05-18 13:27:27, TSUKADA Koutaro wrote:
>> The purpose of this patch-set is to make it possible to control whether or
>> not surplus hugetlb pages obtained by overcommitting are charged to the
>> memory cgroup. In the future, I am trying to limit the memory usage of
>> applications that use both normal pages and hugetlb pages with the memory
>> cgroup (not the hugetlb cgroup).
>
> There was a deliberate decision to keep hugetlb and "normal" memory
> cgroup controllers separate. Mostly because hugetlb memory is an
> artificial memory subsystem on its own and it doesn't fit into the rest
> of memcg accounted memory very well. I believe we want to keep that
> status quo.
>
>> Applications that use shared libraries like libhugetlbfs.so use both normal
>> pages and hugetlb pages, but we do not know how much of each they will use.
>> Suppose you want to manage the memory usage of such applications by cgroup.
>> How do you set the memory cgroup and hugetlb cgroup limits when you want to
>> limit memory usage to 10GB?
>
> Well, such a usecase requires an explicit configuration already, either
> by using special wrappers or by modifying the code. So I would argue that
> you have quite a good knowledge of the setup. If you need greater
> flexibility then just do not use hugetlb at all and rely on THP.
> [...]
>
>> In this patch-set, I introduce charge_surplus_huge_pages (boolean) in
>> struct hstate. If it is true, surplus huge pages are charged to the
>> memory cgroup to which the task that obtained them belongs. If it is
>> false, nothing is done, as before; the default value is false.
>> charge_surplus_huge_pages can be controlled via procfs or sysfs interfaces.
>
> I do not really think this is a good idea. We really do not want to make
> the current hugetlb code more complex than it is already. The current
> hugetlb cgroup controller is simple and works at least somehow. I would
> not add more on top unless there is a _really_ strong usecase behind it.
> Please make sure to describe such a usecase in detail before we even
> start considering the code.

Thank you for your time.

I do not know if it is really a strong use case, but I will explain my
motive in detail. English is not my native language, so please pardon
my poor English.

I am one of the developers of software that manages the resources used
by user jobs on HPC clusters running Linux. The resource is mainly
memory. An HPC cluster may be shared and used by multiple people.
Therefore, the memory used by each user must be strictly controlled;
otherwise a user's job may run away, and not only will it hamper the
other users, it will crash the entire system in OOM.

Some users of HPC are very nervous about performance. Jobs are executed
while synchronizing with MPI communication using multiple compute nodes.
Since CPU wait time will occur when synchronizing, they want to minimize
the variation in execution time at each node to reduce waiting times as
much as possible. We call this variation noise.

THP does not guarantee the use of huge pages and may use normal pages
instead. This mechanism is one cause of variation (noise).

Users who know about this mechanism will be hesitant to use THP. However,
they also know the performance benefit of the huge page TLB hit rate, so
huge pages remain attractive. It seems natural that these users are
interested in HugeTLBfs; I do not know at all whether it is the right
approach or not.

At the very least, our HPC system pursues high versatility, and we have
to consider whether we can provide HugeTLBfs if users want to use it.

In order to use HugeTLBfs we need to create a persistent pool, but in
our use case of shared nodes, it would be impossible to create, delete,
or resize the pool.

One of the answers I have reached is to use HugeTLBfs by overcommitting
without creating a pool (this is the surplus hugepage).
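(Concretely, enabling this mode only requires raising the overcommit
limit. The proc path below is the real interface; the value 1024 and the
tiny helper program are just an illustration:)

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/nr_overcommit_hugepages", "w");

	if (!f) {
		perror("nr_overcommit_hugepages");
		return 1;
	}
	/* allow up to 1024 surplus huge pages of the default size */
	fprintf(f, "1024\n");
	fclose(f);
	return 0;
}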

A surplus hugepage is a hugetlb page, but I think that consuming the
buddy pool is at least a decisive difference from hugetlb pages of the
persistent pool. If nr_overcommit_hugepages is assumed to be infinite,
allocating pages for surplus hugepages from the buddy pool is effectively
unlimited even if it is limited by memcg. In extreme cases, overcommitment
will allow users to exhaust the entire memory of the system. Of course,
this can be prevented by the hugetlb cgroup, but even if we set limits
for the memcg and hugetlb cgroup respectively, as I asked in the first
mail (set limit to 10GB), the control will not work.

I thought I could charge surplus hugepages to memcg, but maybe I did not
have enough knowledge about memcg. I would like to reply to another mail
with the details.

>> Since THP is very effective in environments with a kernel page size of
>> 4KB, such as x86, there is no reason to actively use HugeTLBfs there, so
>> I think there is no situation where charge_surplus_huge_pages would be
>> enabled. However, in some distributions on arm64 the kernel page size is
>> 64KB, and a THP is then as huge as 512MB, making it difficult to use.
>> HugeTLBfs may support multiple huge page sizes, and in such a special
>> environment there is a desire to use HugeTLBfs.
>
> Well, then I would argue that you shouldn't use 64kB pages for your
> setup or allow THP for smaller sizes. Really hugetlb pages are by no
> means a substitute here.
>

Actually, I myself am opposed to the 64KB page size, but a proposal to
change the page size is expected to be dismissed.

--
Thanks,
Tsukada


2018-05-24 04:40:45

by TSUKADA Koutaro

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

On 2018/05/23 3:54, Michal Hocko wrote:
> On Tue 22-05-18 22:04:23, TSUKADA Koutaro wrote:
>> On 2018/05/22 3:07, Mike Kravetz wrote:
>>> On 05/17/2018 09:27 PM, TSUKADA Koutaro wrote:
>>>> Thanks to Mike Kravetz for comment on the previous version patch.
>>>>
>>>> The purpose of this patch-set is to make it possible to control whether or
>>>> not surplus hugetlb pages obtained by overcommitting are charged to the
>>>> memory cgroup. In the future, I am trying to limit the memory usage of
>>>> applications that use both normal pages and hugetlb pages with the memory
>>>> cgroup (not the hugetlb cgroup).
>>>>
>>>> Applications that use shared libraries like libhugetlbfs.so use both normal
>>>> pages and hugetlb pages, but we do not know how much of each they will use.
>>>> Suppose you want to manage the memory usage of such applications by cgroup.
>>>> How do you set the memory cgroup and hugetlb cgroup limits when you want to
>>>> limit memory usage to 10GB?
>>>>
>>>> If you set a limit of 10GB for each, the user can use a total of 20GB of
>>>> memory, so the limit does not work well. Since it is difficult to
>>>> estimate the ratio of normal pages to hugetlb pages a user will consume,
>>>> setting limits of 2GB on the memory cgroup and 8GB on the hugetlb cgroup
>>>> is not a very good idea. In such a case, I thought that by using my
>>>> patch-set we could manage resources just by setting 10GB as the limit of
>>>> the memory cgroup (with no limit on the hugetlb cgroup).
>>>>
>>>> In this patch-set, I introduce charge_surplus_huge_pages (boolean) in
>>>> struct hstate. If it is true, surplus huge pages are charged to the
>>>> memory cgroup to which the task that obtained them belongs. If it is
>>>> false, nothing is done, as before; the default value is false.
>>>> charge_surplus_huge_pages can be controlled via procfs or sysfs
>>>> interfaces.
>>>>
>>>> Since THP is very effective in environments with a kernel page size of
>>>> 4KB, such as x86, there is no reason to actively use HugeTLBfs there, so
>>>> I think there is no situation where charge_surplus_huge_pages would be
>>>> enabled. However, in some distributions on arm64 the kernel page size is
>>>> 64KB, and a THP is then as huge as 512MB, making it difficult to use.
>>>> HugeTLBfs may support multiple huge page sizes, and in such a special
>>>> environment there is a desire to use HugeTLBfs.
>>>
>>> One of the basic questions/concerns I have is accounting for surplus huge
>>> pages in the default memory resource controller. The existing hugetlb
>>> resource controller already takes hugetlbfs huge pages into account,
>>> including surplus pages. This series would allow surplus pages to be
>>> accounted for in the default memory controller, or the hugetlb controller
>>> or both.
>>>
>>> I understand that current mechanisms do not meet the needs of the above
>>> use case. The question is whether this is an appropriate way to approach
>>> the issue.
>
> I do share your view Mike!
>
>>> My cgroup experience and knowledge is extremely limited, but
>>> it does not appear that any other resource can be controlled by multiple
>>> controllers. Therefore, I am concerned that this may be going against
>>> basic cgroup design philosophy.
>>
>> Thank you for your feedback.
>> That makes sense: surplus hugepages are charged to both the memcg and the
>> hugetlb cgroup, which may be contrary to cgroup design philosophy.
>>
>> Based on the above advice, I have considered the following improvement;
>> what do you think about it?
>>
>> The 'charge_surplus_hugepages' of the v2 patch-set was an option to switch
>> "whether to charge memcg in addition to hugetlb cgroup", but it will be
>> abolished. Instead, it will become a "charge only memcg instead of hugetlb
>> cgroup" option. This is called 'surplus_charge_to_memcg'.
>
> This all looks so hackish and ad-hoc that I would be tempted to give it
> an outright nack, but let's hear more about why we need this fiddling at
> all. I've asked in another email so I guess I will get an answer there,
> but let me just emphasize again that I absolutely detest the possibility
> of putting hugetlb pages into the memcg mix. They just do not belong
> there. Try to look at the previous discussions on why it was decided to
> have separate hugetlb pages at all.
>
> I am also quite confused why you keep distinguishing surplus hugetlb
> pages from regular preallocated ones. Being a surplus page is an
> implementation detail that we use for an internal accounting rather than
> something to exhibit to the userspace even more than we do currently.

I apologize for having confused you.

The hugetlb pages obtained from the pool do not waste the buddy pool. On
the other hand, surplus hugetlb pages waste the buddy pool. Due to this
difference in property, I thought it could be distinguished.

Although my memcg knowledge is extremely limited, memcg accounts for
various kinds of pages obtained from the buddy pool by the tasks
belonging to it. I would like to argue that surplus hugepages are special
in that they are obtained from the buddy pool, and that charging them to
memcg should therefore be specially permitted.

It may seem very strange to charge a hugetlb page to memcg, but
essentially it only charges the usage of the compound page obtained from
the buddy pool; even if that page is used as a hugetlb page after that,
memcg is not interested in that.

I apologize if my way of thinking is wrong. It would be greatly
appreciated if you could explain why we cannot charge surplus hugepages
to memcg.

> Just look at what should/would happen when you need to adjust accounting -
> e.g. due to a pool resize. Are you going to uncharge those surplus pages
> from memcg to reflect their persistence?
>

Sorry, I could not understand the intention of this question. When
resizing the pool, I think that the number of surplus hugepages in use
does not change. Could you explain what you are concerned about?

--
Thanks,
Tsukada



2018-05-24 08:27:32

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

On Thu 24-05-18 13:39:59, TSUKADA Koutaro wrote:
> On 2018/05/23 3:54, Michal Hocko wrote:
[...]
> > I am also quite confused why you keep distinguishing surplus hugetlb
> > pages from regular preallocated ones. Being a surplus page is an
> > implementation detail that we use for an internal accounting rather than
> > something to exhibit to the userspace even more than we do currently.
>
> I apologize for having confused you.
>
> The hugetlb pages obtained from the pool do not waste the buddy pool.

Because they have already been allocated from the buddy allocator, the
end result is much the same.

> On
> the other hand, surplus hugetlb pages waste the buddy pool. Due to this
> difference in property, I thought it could be distinguished.

But this is simply not correct. Surplus pages are fluid. If you increase
the hugetlb pool size, they will become regular persistent hugetlb pages.

> Although my memcg knowledge is extremely limited, memcg accounts for
> various kinds of pages obtained from the buddy pool by the tasks
> belonging to it. I would like to argue that surplus hugepages are special
> in that they are obtained from the buddy pool, and that charging them to
> memcg should therefore be specially permitted.

Not really. Memcg accounts primarily for reclaimable memory. We do
account for some non-reclaimable slabs, but their lifetime should be at
least bound to a process lifetime. Otherwise the memcg OOM killer
behavior is not guaranteed to unclutter the situation. Hugetlb pages are
simply persistent. Well, to be completely honest, tmpfs pages have a
similar problem, but lacking swap space for them is kind of a
configuration bug.

> It may seem very strange to charge a hugetlb page to memcg, but
> essentially it only charges the usage of the compound page obtained from
> the buddy pool; even if that page is used as a hugetlb page after that,
> memcg is not interested in that.

Ohh, it is very much interested. The primary goal of memcg is to enforce
the limit. How are you going to do that in the absence of reclaimable
memory? And quite a lot of it, because hugetlb pages usually consume a
lot of memory.

> I apologize if my way of thinking is wrong. It would be greatly
> appreciated if you could explain why we cannot charge surplus hugepages
> to memcg.
>
> > Just look at what should/would happen when you need to adjust accounting -
> > e.g. due to a pool resize. Are you going to uncharge those surplus pages
> > from memcg to reflect their persistence?
> >
>
> Sorry, I could not understand the intention of this question. When
> resizing the pool, I think that the number of surplus hugepages in use
> does not change. Could you explain what you are concerned about?

It does change when you change the hugetlb pool size or migrate pages
between per-NUMA pools (have a look at adjust_pool_surplus).
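For reference, a condensed version of adjust_pool_surplus (from
mm/hugetlb.c around v4.17; trimmed, so details may differ) shows that
only counters move - no page is allocated, freed, or (un)charged:

static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
			       int delta)
{
	int nr_nodes, node;

	VM_BUG_ON(delta != -1 && delta != 1);

	if (delta < 0) {
		/* pool grows: a surplus page becomes a persistent one */
		for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed)
			if (h->surplus_huge_pages_node[node])
				goto found;
	} else {
		/* pool shrinks: a persistent page is marked surplus */
		for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed)
			if (h->surplus_huge_pages_node[node] <
					h->nr_huge_pages_node[node])
				goto found;
	}
	return 0;
found:
	h->surplus_huge_pages += delta;
	h->surplus_huge_pages_node[node] += delta;
	return 1;
}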
--
Michal Hocko
SUSE Labs

2018-05-24 08:30:16

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

On Thu 24-05-18 13:26:12, TSUKADA Koutaro wrote:
[...]
> I do not know if it is really a strong use case, but I will explain my
> motive in detail. English is not my native language, so please pardon
> my poor English.
>
> I am one of the developers of software that manages the resources used
> by user jobs on HPC clusters running Linux. The resource is mainly
> memory. An HPC cluster may be shared and used by multiple people.
> Therefore, the memory used by each user must be strictly controlled;
> otherwise a user's job may run away, and not only will it hamper the
> other users, it will crash the entire system in OOM.
>
> Some users of HPC are very nervous about performance. Jobs are executed
> while synchronizing with MPI communication using multiple compute nodes.
> Since CPU wait time will occur when synchronizing, they want to minimize
> the variation in execution time at each node to reduce waiting times as
> much as possible. We call this variation noise.
>
> THP does not guarantee the use of huge pages and may use normal pages
> instead. This mechanism is one cause of variation (noise).
>
> Users who know about this mechanism will be hesitant to use THP. However,
> they also know the performance benefit of the huge page TLB hit rate, so
> huge pages remain attractive. It seems natural that these users are
> interested in HugeTLBfs; I do not know at all whether it is the right
> approach or not.

Sure, asking for a guarantee makes hugetlb pages attractive. But nothing
is really for free, especially any resource _guarantee_, and you usually
have to pay an additional configuration price.

> At the very least, our HPC system pursues high versatility, and we have
> to consider whether we can provide HugeTLBfs if users want to use it.
>
> In order to use HugeTLBfs we need to create a persistent pool, but in
> our use case of shared nodes, it would be impossible to create, delete,
> or resize the pool.

Why? I can see this would be quite a PITA but not really impossible.

> One of the answers I have reached is to use HugeTLBfs by overcommitting
> without creating a pool (this is the surplus hugepage).
>
> A surplus hugepage is a hugetlb page, but I think that consuming the
> buddy pool is at least a decisive difference from hugetlb pages of the
> persistent pool. If nr_overcommit_hugepages is assumed to be infinite,
> allocating pages for surplus hugepages from the buddy pool is effectively
> unlimited even if it is limited by memcg.

Not really, you can specify how much you can overcommit hugetlb pages.

> In extreme cases, overcommitment will allow users to exhaust the entire
> memory of the system. Of course, this can be prevented by the hugetlb
> cgroup, but even if we set limits for the memcg and hugetlb cgroup
> respectively, as I asked in the first mail (set limit to 10GB), the
> control will not work.
--
Michal Hocko
SUSE Labs

2018-05-24 22:56:45

by TSUKADA Koutaro

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

On 2018/05/24 17:20, Michal Hocko wrote:
> On Thu 24-05-18 13:39:59, TSUKADA Koutaro wrote:
>> On 2018/05/23 3:54, Michal Hocko wrote:
> [...]
>>> I am also quite confused why you keep distinguishing surplus hugetlb
>>> pages from regular preallocated ones. Being a surplus page is an
>>> implementation detail that we use for an internal accounting rather than
>>> something to exhibit to the userspace even more than we do currently.
>>
>> I apologize for having confused you.
>>
>> The hugetlb pages obtained from the pool do not waste the buddy pool.
>
> Because they have already been allocated from the buddy allocator, the
> end result is much the same.
>
>> On
>> the other hand, surplus hugetlb pages waste the buddy pool. Due to this
>> difference in property, I thought it could be distinguished.
>
> But this is simply not correct. Surplus pages are fluid. If you increase
> the hugetlb pool size, they will become regular persistent hugetlb pages.

I really cannot understand what's wrong with this. That page is obviously
released before being added to the persistent pool, and at that time it
is uncharged from the memcg to which the task belongs (this assumes my
patch-set). After that, the same page obtained from the pool is not a
surplus hugepage, so it will not be charged to memcg again.

>> Although my memcg knowledge is extremely limited, memcg accounts for
>> various kinds of pages obtained from the buddy pool by the tasks
>> belonging to it. I would like to argue that surplus hugepages are special
>> in that they are obtained from the buddy pool, and that charging them to
>> memcg should therefore be specially permitted.
>
> Not really. Memcg accounts primarily for reclaimable memory. We do
> account for some non-reclaimable slabs, but their lifetime should be at
> least bound to a process lifetime. Otherwise the memcg OOM killer
> behavior is not guaranteed to unclutter the situation. Hugetlb pages are
> simply persistent. Well, to be completely honest, tmpfs pages have a
> similar problem, but lacking swap space for them is kind of a
> configuration bug.

Absolutely, you are saying the right thing, but, for example, can
mlock(2)ed pages be swapped out by reclaim? (What is the difference
between mlock(2)ed pages and hugetlb pages?)

>> It may seem very strange to charge a hugetlb page to memcg, but
>> essentially it only charges the usage of the compound page obtained from
>> the buddy pool; even if that page is used as a hugetlb page after that,
>> memcg is not interested in that.
>
> Ohh, it is very much interested. The primary goal of memcg is to enforce
> the limit. How are you going to do that in the absence of reclaimable
> memory? And quite a lot of it, because hugetlb pages usually consume a
> lot of memory.

Simply kill any of the tasks belonging to that memcg. Maybe no one wants
reclaim at the time surplus hugepages are charged.

[...]
>> Sorry, I could not understand the intention of this question. When
>> resizing the pool, I think that the number of surplus hugepages in use
>> does not change. Could you explain what you are concerned about?
>
> It does change when you change the hugetlb pool size or migrate pages
> between per-NUMA pools (have a look at adjust_pool_surplus).

As far as I looked, what kind of fatal problem is caused by charging
surplus hugepages to memcg, when it just manipulates counters of
statistical information?

--
Thanks,
Tsukada


2018-05-25 02:11:15

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

On Thu 24-05-18 21:58:49, TSUKADA Koutaro wrote:
> On 2018/05/24 17:20, Michal Hocko wrote:
> > On Thu 24-05-18 13:39:59, TSUKADA Koutaro wrote:
> >> On 2018/05/23 3:54, Michal Hocko wrote:
> > [...]
> >>> I am also quite confused why you keep distinguishing surplus hugetlb
> >>> pages from regular preallocated ones. Being a surplus page is an
> >>> implementation detail that we use for an internal accounting rather than
> >>> something to exhibit to the userspace even more than we do currently.
> >>
> >> I apologize for having confused you.
> >>
> >> The hugetlb pages obtained from the pool do not waste the buddy pool.
> >
> > Because they have already been allocated from the buddy allocator, the
> > end result is much the same.
> >
> >> On
> >> the other hand, surplus hugetlb pages waste the buddy pool. Due to this
> >> difference in property, I thought it could be distinguished.
> >
> > But this is simply not correct. Surplus pages are fluid. If you increase
> > the hugetlb pool size, they will become regular persistent hugetlb pages.
>
> I really cannot understand what's wrong with this. That page is obviously
> released before being added to the persistent pool, and at that time it
> is uncharged from the memcg to which the task belongs (this assumes my
> patch-set). After that, the same page obtained from the pool is not a
> surplus hugepage, so it will not be charged to memcg again.

I do not see anything like that. adjust_pool_surplus is simply an
accounting thing. At least the last time I've checked. Maybe your
patchset handles that?

> >> Although my memcg knowledge is extremely limited, memcg accounts for
> >> various kinds of pages obtained from the buddy pool by the tasks
> >> belonging to it. I would like to argue that surplus hugepages are special
> >> in that they are obtained from the buddy pool, and that charging them to
> >> memcg should therefore be specially permitted.
> >
> > Not really. Memcg accounts primarily for reclaimable memory. We do
> > account for some non-reclaimable slabs, but their lifetime should be at
> > least bound to a process lifetime. Otherwise the memcg OOM killer
> > behavior is not guaranteed to unclutter the situation. Hugetlb pages are
> > simply persistent. Well, to be completely honest, tmpfs pages have a
> > similar problem, but lacking swap space for them is kind of a
> > configuration bug.
>
> Absolutely, you are saying the right thing, but, for example, can
> mlock(2)ed pages be swapped out by reclaim? (What is the difference
> between mlock(2)ed pages and hugetlb pages?)

No, mlocked pages cannot be reclaimed, and that is why we restrict them
to a relatively small amount.

> >> It may seem very strange to charge a hugetlb page to memcg, but
> >> essentially it only charges the usage of the compound page obtained from
> >> the buddy pool; even if that page is used as a hugetlb page after that,
> >> memcg is not interested in that.
> >
> > Ohh, it is very much interested. The primary goal of memcg is to enforce
> > the limit. How are you going to do that in the absence of reclaimable
> > memory? And quite a lot of it, because hugetlb pages usually consume a
> > lot of memory.
>
> Simply kill any of the tasks belonging to that memcg. Maybe no one wants
> reclaim at the time surplus hugepages are charged.

But that will not release the hugetlb memory, will it?

> [...]
> >> Sorry, I could not understand the intention of this question. When
> >> resizing the pool, I think that the number of surplus hugepages in use
> >> does not change. Could you explain what you are concerned about?
> >
> > It does change when you change the hugetlb pool size or migrate pages
> > between per-NUMA pools (have a look at adjust_pool_surplus).
>
> As far as I looked, what kind of fatal problem is caused by charging
> surplus hugepages to memcg, when it just manipulates counters of
> statistical information?

Fatal? Not sure. It simply tries to add alien memory to the memcg
concept, so I would presume unexpected behavior (e.g. not being able to
reclaim the memcg, over-reclaim, thrashing, etc.).
--
Michal Hocko
SUSE Labs

2018-05-25 02:35:38

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

On 05/23/2018 09:26 PM, TSUKADA Koutaro wrote:
>
> I do not know if it is really a strong use case, but I will explain my
> motive in detail. English is not my native language, so please pardon
> my poor English.
>
> I am one of the developers of software that manages the resources used
> by user jobs on HPC clusters running Linux. The resource is mainly
> memory. An HPC cluster may be shared and used by multiple people.
> Therefore, the memory used by each user must be strictly controlled;
> otherwise a user's job may run away, and not only will it hamper the
> other users, it will crash the entire system in OOM.
>
> Some users of HPC are very nervous about performance. Jobs are executed
> while synchronizing with MPI communication using multiple compute nodes.
> Since CPU wait time will occur when synchronizing, they want to minimize
> the variation in execution time at each node to reduce waiting times as
> much as possible. We call this variation noise.
>
> THP does not guarantee the use of huge pages and may use normal pages
> instead.

Note. You do not want to use THP because "THP does not guarantee".

> This mechanism is one cause of variation (noise).
>
> Users who know about this mechanism will be hesitant to use THP. However,
> they also know the performance benefit of the huge page TLB hit rate, so
> huge pages remain attractive. It seems natural that these users are
> interested in HugeTLBfs; I do not know at all whether it is the right
> approach or not.
>
> At the very least, our HPC system pursues high versatility, and we have
> to consider whether we can provide HugeTLBfs if users want to use it.
>
> In order to use HugeTLBfs we need to create a persistent pool, but in
> our use case of shared nodes, it would be impossible to create, delete,
> or resize the pool.
>
> One of the answers I have reached is to use HugeTLBfs by overcommitting
> without creating a pool (this is the surplus hugepage).

Using hugetlbfs overcommit also does not provide a guarantee. Without
doing much research, I would say the failure rate for obtaining a huge
page via THP and hugetlbfs overcommit is about the same. The most
difficult issue in both cases will be obtaining a "huge page" number of
pages from the buddy allocator.

I really do not think hugetlbfs overcommit will provide any benefit over
THP for your use case. Also, new user space code is required to "fall back"
to normal pages in the case of hugetlbfs page allocation failure. This
is not needed in the THP case.
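A minimal sketch of such a fallback (illustrative only: the helper below
is hypothetical, the length must be a multiple of the huge page size,
and the MAP_HUGETLB attempt will only succeed if huge pages are actually
available, e.g. with vm.nr_overcommit_hugepages raised):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* len must be a multiple of the default huge page size */
void *alloc_buf(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p != MAP_FAILED)
		return p;		/* got hugetlb pages */

	/* hugetlb allocation failed; fall back to normal pages */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}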
--
Mike Kravetz

2018-05-25 02:50:10

by TSUKADA Koutaro

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

On 2018/05/25 2:45, Mike Kravetz wrote:
[...]
>> THP does not guarantee the use of huge pages and may use normal pages
>> instead.
>
> Note. You do not want to use THP because "THP does not guarantee".

[...]
>> One of the answers I have reached is to use HugeTLBfs by overcommitting
>> without creating a pool (this is the surplus hugepage).
>
> Using hugetlbfs overcommit also does not provide a guarantee. Without
> doing much research, I would say the failure rate for obtaining a huge
> page via THP and hugetlbfs overcommit is about the same. The most
> difficult issue in both cases will be obtaining a "huge page" number of
> pages from the buddy allocator.

Yes. If multiple hugetlb page sizes are not supported, as on x86, the
number of pages making up a THP and a hugetlb page is the same, so the
failure rate of obtaining a compound page is the same, as you said.

> I really do not think hugetlbfs overcommit will provide any benefit over
> THP for your use case.

I think that what you say is absolutely right.

> Also, new user space code is required to "fall back"
> to normal pages in the case of hugetlbfs page allocation failure. This
> is not needed in the THP case.

I understand the superiority of THP, but there are cases where khugepaged
occupies the CPU due to page fragmentation. If a persistent pool is set
up once instead of overcommitting, I think that hugetlb can be superior,
for example with memory allocation performance exceeding THP. I will try
to find a good way to use hugetlb pages.

I sincerely thank you for your help.

--
Thanks,
Tsukada


2018-05-25 02:51:13

by TSUKADA Koutaro

[permalink] [raw]
Subject: Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

On 2018/05/24 22:24, Michal Hocko wrote:
[...]
> I do not see anything like that. adjust_pool_surplus is simply an
> accounting thing. At least the last time I've checked. Maybe your
> patchset handles that?

As you said, my patch did not consider such handling when manipulating
the pool. And even if that handling were done well, it would not be a
valid reason to charge surplus hugepages to memcg.

[...]
>> Absolutely, you are saying the right thing, but, for example, can
>> mlock(2)ed pages be swapped out by reclaim? (What is the difference
>> between mlock(2)ed pages and hugetlb pages?)
>
> No, mlocked pages cannot be reclaimed, and that is why we restrict them
> to a relatively small amount.

I understood the concept of memcg.

[...]
> Fatal? Not sure. It simply tries to add alien memory to the memcg
> concept, so I would presume unexpected behavior (e.g. not being able to
> reclaim the memcg, over-reclaim, thrashing, etc.).

As you said, it must be alien. Thanks to the discussion so far, I have
understood that my solution is inappropriate. I will look for another
way.

Thank you for your kind explanation.

--
Thanks,
Tsukada