2023-10-03 00:18:59

by Nhat Pham

Subject: [PATCH v3 0/3] hugetlb memcg accounting

Changelog:
v3:
* Add a prep patch at the start of the series to extend the memory
controller interface with new helper functions for hugetlb
accounting.
* Do not account hugetlb memory for the memory controller in cgroup v1
(patch 2) (suggested by Johannes Weiner).
* Change the gfp flag passed to mem cgroup charging (patch 2)
(suggested by Michal Hocko).
* Add caveats to cgroup admin guide and commit changelog
(patch 2) (suggested by Michal Hocko).
v2:
* Add a cgroup mount option to enable/disable the new hugetlb memcg
accounting behavior (patch 1) (suggested by Johannes Weiner).
* Add a couple more ksft_print_msg() calls on error to aid debugging
when the selftest fails (patch 2).

Currently, hugetlb memory usage is not accounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.

For instance, here is one of our use cases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each
container may or may not use up to 3 gigantic pages, depending on the
workload within it. The rest is anon, cache, slab, etc. We can set the
hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness.
But it is very difficult to configure memory.max to keep overall
consumption, including anon, cache, slab, etc., fair.

What we have had to resort to is constantly polling hugetlb usage and
readjusting memory.max. A similar procedure is applied to other memory
limits (memory.low, for example). However, this is rather cumbersome
and error-prone. Furthermore, when there is a delay in correcting the
memory limits (for example, when hugetlb usage changes between
consecutive runs of the userspace agent), the system could be left in
an over- or underprotected state.

This patch series rectifies this issue by charging the memcg when the
hugetlb folio is allocated, and uncharging when the folio is freed. In
addition, a new selftest is added to demonstrate and verify this new
behavior.
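
For illustration, here is a minimal userspace sketch of the kind of
behavior the new selftest exercises. It is not part of the series; the
cgroup path and the 2MB default hugepage size are assumptions, the
hugetlb pool must already be populated, and the process must be running
inside that cgroup on a hierarchy mounted with memory_hugetlb_accounting:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_SZ (2UL << 20)	/* assumes the default 2MB hugepage size */

/* Hypothetical cgroup path; adjust to the actual test cgroup. */
static long read_memory_current(void)
{
	FILE *f = fopen("/sys/fs/cgroup/test/memory.current", "r");
	long val = -1;

	if (f) {
		if (fscanf(f, "%ld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

int main(void)
{
	long before = read_memory_current();
	char *buf = mmap(NULL, HUGE_SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* The memcg charge happens at fault time, not at mmap time. */
	memset(buf, 0, HUGE_SZ);
	printf("memory.current grew by roughly %ld bytes\n",
	       read_memory_current() - before);
	munmap(buf, HUGE_SZ);
	return 0;
}

With the series applied and the mount option set, the reported delta
should include the 2MB hugetlb page; without the option, it should not.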

Nhat Pham (3):
memcontrol: add helpers for hugetlb memcg accounting
hugetlb: memcg: account hugetlb-backed memory in memory controller
selftests: add a selftest to verify hugetlb usage in memcg

Documentation/admin-guide/cgroup-v2.rst | 29 +++
MAINTAINERS | 2 +
include/linux/cgroup-defs.h | 5 +
include/linux/memcontrol.h | 30 +++
kernel/cgroup/cgroup.c | 15 +-
mm/hugetlb.c | 35 ++-
mm/memcontrol.c | 94 ++++++-
tools/testing/selftests/cgroup/.gitignore | 1 +
tools/testing/selftests/cgroup/Makefile | 2 +
.../selftests/cgroup/test_hugetlb_memcg.c | 234 ++++++++++++++++++
10 files changed, 427 insertions(+), 20 deletions(-)
create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c

--
2.34.1


2023-10-03 00:19:11

by Nhat Pham

Subject: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller

Currently, hugetlb memory usage is not accounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.

For instance, here is one of our use cases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each
container may or may not use up to 3 gigantic pages, depending on the
workload within it. The rest is anon, cache, slab, etc. We can set the
hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness.
But it is very difficult to configure memory.max to keep overall
consumption, including anon, cache, slab, etc., fair.

What we have had to resort to is constantly polling hugetlb usage and
readjusting memory.max. A similar procedure is applied to other memory
limits (memory.low, for example). However, this is rather cumbersome
and error-prone. Furthermore, when there is a delay in correcting the
memory limits (for example, when hugetlb usage changes between
consecutive runs of the userspace agent), the system could be left in
an over- or underprotected state.

This patch rectifies this issue by charging the memcg when the hugetlb
folio is utilized, and uncharging when the folio is freed (analogous to
the hugetlb controller). Note that we do not charge when the folio is
allocated to the hugetlb pool, because at this point it is not owned by
any memcg.

Some caveats to consider:
* This feature is only available on cgroup v2.
* There is no hugetlb pool management involved in the memory
controller. As stated above, hugetlb folios are only charged towards
the memory controller when they are used. Host overcommit management
has to consider this when configuring hard limits.
* Failure to charge towards the memcg results in SIGBUS. This could
happen even if the hugetlb pool still has pages available (but the
cgroup limit is hit and the reclaim attempt fails).
* When this feature is enabled, hugetlb pages contribute to memory
reclaim protection. Tuning of the low and min limits must take
hugetlb memory into account.
* Hugetlb pages utilized while this option is not selected will not
be tracked by the memory controller (even if cgroup v2 is remounted
later on); a minimal opt-in sketch follows below.
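
A minimal sketch of the opt-in (not part of this patch): it assumes
cgroup v2 is mounted at /sys/fs/cgroup and is simply the mount(2)
form of remounting the hierarchy with the new option:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* Remount the cgroup2 hierarchy with hugetlb accounting enabled. */
	if (mount(NULL, "/sys/fs/cgroup", NULL, MS_REMOUNT,
		  "memory_hugetlb_accounting")) {
		perror("remount /sys/fs/cgroup");
		return 1;
	}
	return 0;
}

Note that the remount applies the full option set, so any other cgroup2
mount options in use (memory_recursiveprot, etc.) would need to be
listed in the option string as well.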

Signed-off-by: Nhat Pham <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 29 ++++++++++++++++++++
include/linux/cgroup-defs.h | 5 ++++
include/linux/memcontrol.h | 9 +++++++
kernel/cgroup/cgroup.c | 15 ++++++++++-
mm/hugetlb.c | 35 ++++++++++++++++++++-----
mm/memcontrol.c | 35 +++++++++++++++++++++++++
6 files changed, 120 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 622a7f28db1f..606b2e0eac4b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -210,6 +210,35 @@ cgroup v2 currently supports the following mount options.
relying on the original semantics (e.g. specifying bogusly
high 'bypass' protection values at higher tree levels).

+ memory_hugetlb_accounting
+ Count HugeTLB memory usage towards the cgroup's overall
+ memory usage for the memory controller (for the purpose of
+ statistics reporting and memory protection). This is a new
+ behavior that could regress existing setups, so it must be
+ explicitly opted in with this mount option.
+
+ A few caveats to keep in mind:
+
+ * There is no HugeTLB pool management involved in the memory
+ controller. The pre-allocated pool does not belong to anyone.
+ Specifically, when a new HugeTLB folio is allocated to
+ the pool, it is not accounted for from the perspective of the
+ memory controller. It is only charged to a cgroup when it is
+ actually used (e.g. at page fault time). Host memory
+ overcommit management has to consider this when configuring
+ hard limits. In general, HugeTLB pool management should be
+ done via other mechanisms (such as the HugeTLB controller).
+ * Failure to charge a HugeTLB folio to the memory controller
+ results in SIGBUS. This could happen even if the HugeTLB pool
+ still has pages available (but the cgroup limit is hit and
+ reclaim attempt fails).
+ * Charging HugeTLB memory towards the memory controller affects
+ memory protection and reclaim dynamics. Any userspace tuning
+ (e.g. of the low and min limits) needs to take this into account.
+ * HugeTLB pages utilized while this option is not selected
+ will not be tracked by the memory controller (even if cgroup
+ v2 is remounted later on).
+

Organizing Processes and Threads
--------------------------------
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index f1b3151ac30b..8641f4320c98 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -115,6 +115,11 @@ enum {
* Enable recursive subtree protection
*/
CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18),
+
+ /*
+ * Enable hugetlb accounting for the memory controller.
+ */
+ CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19),
};

/* cftype->flags */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 42bf7e9b1a2f..a827e2129790 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -679,6 +679,9 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
return __mem_cgroup_charge(folio, mm, gfp);
}

+int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
+ long nr_pages);
+
int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
gfp_t gfp, swp_entry_t entry);
void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
@@ -1262,6 +1265,12 @@ static inline int mem_cgroup_charge(struct folio *folio,
return 0;
}

+static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
+ gfp_t gfp, long nr_pages)
+{
+ return 0;
+}
+
static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
{
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1fb7f562289d..f11488b18ceb 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1902,6 +1902,7 @@ enum cgroup2_param {
Opt_favordynmods,
Opt_memory_localevents,
Opt_memory_recursiveprot,
+ Opt_memory_hugetlb_accounting,
nr__cgroup2_params
};

@@ -1910,6 +1911,7 @@ static const struct fs_parameter_spec cgroup2_fs_parameters[] = {
fsparam_flag("favordynmods", Opt_favordynmods),
fsparam_flag("memory_localevents", Opt_memory_localevents),
fsparam_flag("memory_recursiveprot", Opt_memory_recursiveprot),
+ fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting),
{}
};

@@ -1936,6 +1938,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param
case Opt_memory_recursiveprot:
ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
return 0;
+ case Opt_memory_hugetlb_accounting:
+ ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
+ return 0;
}
return -EINVAL;
}
@@ -1960,6 +1965,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT;
+
+ if (root_flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
+ cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
+ else
+ cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
}
}

@@ -1973,6 +1983,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
seq_puts(seq, ",memory_localevents");
if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
seq_puts(seq, ",memory_recursiveprot");
+ if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
+ seq_puts(seq, ",memory_hugetlb_accounting");
return 0;
}

@@ -7050,7 +7062,8 @@ static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
"nsdelegate\n"
"favordynmods\n"
"memory_localevents\n"
- "memory_recursiveprot\n");
+ "memory_recursiveprot\n"
+ "memory_hugetlb_accounting\n");
}
static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index de220e3ff8be..74472e911b0a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio)
pages_per_huge_page(h), folio);
hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
pages_per_huge_page(h), folio);
+ mem_cgroup_uncharge(folio);
if (restore_reserve)
h->resv_huge_pages++;

@@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
struct hugepage_subpool *spool = subpool_vma(vma);
struct hstate *h = hstate_vma(vma);
struct folio *folio;
- long map_chg, map_commit;
+ long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
long gbl_chg;
- int ret, idx;
+ int memcg_charge_ret, ret, idx;
struct hugetlb_cgroup *h_cg = NULL;
+ struct mem_cgroup *memcg;
bool deferred_reserve;
+ gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
+
+ memcg = get_mem_cgroup_from_current();
+ memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
+ if (memcg_charge_ret == -ENOMEM) {
+ mem_cgroup_put(memcg);
+ return ERR_PTR(-ENOMEM);
+ }

idx = hstate_index(h);
/*
@@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
* code of zero indicates a reservation exists (no change).
*/
map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
- if (map_chg < 0)
+ if (map_chg < 0) {
+ if (!memcg_charge_ret)
+ mem_cgroup_cancel_charge(memcg, nr_pages);
+ mem_cgroup_put(memcg);
return ERR_PTR(-ENOMEM);
+ }

/*
* Processes that did not create the mapping will have no
@@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
*/
if (map_chg || avoid_reserve) {
gbl_chg = hugepage_subpool_get_pages(spool, 1);
- if (gbl_chg < 0) {
- vma_end_reservation(h, vma, addr);
- return ERR_PTR(-ENOSPC);
- }
+ if (gbl_chg < 0)
+ goto out_end_reservation;

/*
* Even though there was no reservation in the region/reserve
@@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
pages_per_huge_page(h), folio);
}
+
+ if (!memcg_charge_ret)
+ mem_cgroup_commit_charge(folio, memcg);
+ mem_cgroup_put(memcg);
+
return folio;

out_uncharge_cgroup:
@@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
out_subpool_put:
if (map_chg || avoid_reserve)
hugepage_subpool_put_pages(spool, 1);
+out_end_reservation:
vma_end_reservation(h, vma, addr);
+ if (!memcg_charge_ret)
+ mem_cgroup_cancel_charge(memcg, nr_pages);
+ mem_cgroup_put(memcg);
return ERR_PTR(-ENOSPC);
}

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0219befeae38..6660684f6f97 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7085,6 +7085,41 @@ int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp)
return ret;
}

+/**
+ * mem_cgroup_hugetlb_try_charge - try to charge the memcg for a hugetlb folio
+ * @memcg: memcg to charge.
+ * @gfp: reclaim mode.
+ * @nr_pages: number of pages to charge.
+ *
+ * This function is called when allocating a huge page folio to determine if
+ * the memcg has the capacity for it. It does not commit the charge yet,
+ * as the hugetlb folio itself has not been obtained from the hugetlb pool.
+ *
+ * Once we have obtained the hugetlb folio, we can call
+ * mem_cgroup_commit_charge() to commit the charge. If we fail to obtain the
+ * folio, we should instead call mem_cgroup_cancel_charge() to undo the effect
+ * of try_charge().
+ *
+ * Returns 0 on success. Otherwise, an error code is returned.
+ */
+int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
+ long nr_pages)
+{
+ /*
+ * If hugetlb memcg charging is not enabled, do not fail hugetlb allocation,
+ * but do not attempt to commit charge later (or cancel on error) either.
+ */
+ if (mem_cgroup_disabled() || !memcg ||
+ !cgroup_subsys_on_dfl(memory_cgrp_subsys) ||
+ !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING))
+ return -EOPNOTSUPP;
+
+ if (try_charge(memcg, gfp, nr_pages))
+ return -ENOMEM;
+
+ return 0;
+}
+
/**
* mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
* @folio: folio to charge.
--
2.34.1

2023-10-03 00:19:21

by Nhat Pham

Subject: [PATCH v3 1/3] memcontrol: add helpers for hugetlb memcg accounting

This patch exposes charge committing and cancelling as parts of the
memory controller interface. These functionalities are useful when the
try_charge() and commit_charge() stages have to be separated by other
actions in between (which can fail). One such example is the new hugetlb
accounting behavior in the following patch.

The patch also adds a helper function to obtain a reference to the
current task's memcg.
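
For reference, the intended calling pattern looks roughly like the
condensed sketch below. It mirrors how the hugetlb patch in this series
uses these helpers (mem_cgroup_hugetlb_try_charge() is added there, and
gfp/nr_pages are as in that caller); placeholder_alloc_step() stands in
for whatever fallible work separates the try and commit stages:

	struct mem_cgroup *memcg = get_mem_cgroup_from_current();
	int charged = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
	struct folio *folio;

	if (charged == -ENOMEM) {
		mem_cgroup_put(memcg);
		return ERR_PTR(-ENOMEM);
	}

	folio = placeholder_alloc_step();	/* may fail */
	if (!folio) {
		if (!charged)	/* undo only if we actually charged */
			mem_cgroup_cancel_charge(memcg, nr_pages);
		mem_cgroup_put(memcg);
		return ERR_PTR(-ENOSPC);
	}

	if (!charged)		/* bind the pending charge to the folio */
		mem_cgroup_commit_charge(folio, memcg);
	mem_cgroup_put(memcg);
	return folio;

When no fallible step separates the two stages, the existing
mem_cgroup_charge() remains the simpler interface.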

Signed-off-by: Nhat Pham <[email protected]>
---
include/linux/memcontrol.h | 21 ++++++++++++++
mm/memcontrol.c | 59 ++++++++++++++++++++++++++++++--------
2 files changed, 68 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e0cfab58ab71..42bf7e9b1a2f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -653,6 +653,8 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
page_counter_read(&memcg->memory);
}

+void mem_cgroup_commit_charge(struct folio *folio, struct mem_cgroup *memcg);
+
int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp);

/**
@@ -704,6 +706,8 @@ static inline void mem_cgroup_uncharge_list(struct list_head *page_list)
__mem_cgroup_uncharge_list(page_list);
}

+void mem_cgroup_cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages);
+
void mem_cgroup_migrate(struct folio *old, struct folio *new);

/**
@@ -760,6 +764,8 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);

+struct mem_cgroup *get_mem_cgroup_from_current(void);
+
struct lruvec *folio_lruvec_lock(struct folio *folio);
struct lruvec *folio_lruvec_lock_irq(struct folio *folio);
struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
@@ -1245,6 +1251,11 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
return false;
}

+static inline void mem_cgroup_commit_charge(struct folio *folio,
+ struct mem_cgroup *memcg)
+{
+}
+
static inline int mem_cgroup_charge(struct folio *folio,
struct mm_struct *mm, gfp_t gfp)
{
@@ -1269,6 +1280,11 @@ static inline void mem_cgroup_uncharge_list(struct list_head *page_list)
{
}

+static inline void mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
+ unsigned int nr_pages)
+{
+}
+
static inline void mem_cgroup_migrate(struct folio *old, struct folio *new)
{
}
@@ -1306,6 +1322,11 @@ static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
return NULL;
}

+static inline struct mem_cgroup *get_mem_cgroup_from_current(void)
+{
+ return NULL;
+}
+
static inline
struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d1a322a75172..0219befeae38 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1086,6 +1086,27 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
}
EXPORT_SYMBOL(get_mem_cgroup_from_mm);

+/**
+ * get_mem_cgroup_from_current - Obtain a reference on current task's memcg.
+ */
+struct mem_cgroup *get_mem_cgroup_from_current(void)
+{
+ struct mem_cgroup *memcg;
+
+ if (mem_cgroup_disabled())
+ return NULL;
+
+again:
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(current);
+ if (!css_tryget(&memcg->css)) {
+ rcu_read_unlock();
+ goto again;
+ }
+ rcu_read_unlock();
+ return memcg;
+}
+
static __always_inline bool memcg_kmem_bypass(void)
{
/* Allow remote memcg charging from any context. */
@@ -2873,7 +2894,12 @@ static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
return try_charge_memcg(memcg, gfp_mask, nr_pages);
}

-static inline void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
+/**
+ * mem_cgroup_cancel_charge() - cancel an uncommitted try_charge() call.
+ * @memcg: memcg previously charged.
+ * @nr_pages: number of pages previously charged.
+ */
+void mem_cgroup_cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
{
if (mem_cgroup_is_root(memcg))
return;
@@ -2898,6 +2924,22 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
folio->memcg_data = (unsigned long)memcg;
}

+/**
+ * mem_cgroup_commit_charge - commit a previously successful try_charge().
+ * @folio: folio to commit the charge to.
+ * @memcg: memcg previously charged.
+ */
+void mem_cgroup_commit_charge(struct folio *folio, struct mem_cgroup *memcg)
+{
+ css_get(&memcg->css);
+ commit_charge(folio, memcg);
+
+ local_irq_disable();
+ mem_cgroup_charge_statistics(memcg, folio_nr_pages(folio));
+ memcg_check_events(memcg, folio_nid(folio));
+ local_irq_enable();
+}
+
#ifdef CONFIG_MEMCG_KMEM
/*
* The allocated objcg pointers array is not accounted directly.
@@ -6105,7 +6147,7 @@ static void __mem_cgroup_clear_mc(void)

/* we must uncharge all the leftover precharges from mc.to */
if (mc.precharge) {
- cancel_charge(mc.to, mc.precharge);
+ mem_cgroup_cancel_charge(mc.to, mc.precharge);
mc.precharge = 0;
}
/*
@@ -6113,7 +6155,7 @@ static void __mem_cgroup_clear_mc(void)
* we must uncharge here.
*/
if (mc.moved_charge) {
- cancel_charge(mc.from, mc.moved_charge);
+ mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
mc.moved_charge = 0;
}
/* we must fixup refcnts and charges */
@@ -7020,20 +7062,13 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
gfp_t gfp)
{
- long nr_pages = folio_nr_pages(folio);
int ret;

- ret = try_charge(memcg, gfp, nr_pages);
+ ret = try_charge(memcg, gfp, folio_nr_pages(folio));
if (ret)
goto out;

- css_get(&memcg->css);
- commit_charge(folio, memcg);
-
- local_irq_disable();
- mem_cgroup_charge_statistics(memcg, nr_pages);
- memcg_check_events(memcg, folio_nid(folio));
- local_irq_enable();
+ mem_cgroup_commit_charge(folio, memcg);
out:
return ret;
}
--
2.34.1

2023-10-03 00:26:32

by Nhat Pham

Subject: Re: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller

On Mon, Oct 2, 2023 at 5:18 PM Nhat Pham <[email protected]> wrote:
>
> Currently, hugetlb memory usage is not acounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
>
> For instance, here is one of our usecases: suppose there are two 32G
> containers. The machine is booted with hugetlb_cma=6G, and each
> container may or may not use up to 3 gigantic page, depending on the
> workload within it. The rest is anon, cache, slab, etc. We can set the
> hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness.
> But it is very difficult to configure memory.max to keep overall
> consumption, including anon, cache, slab etc. fair.
>
> What we have had to resort to is to constantly poll hugetlb usage and
> readjust memory.max. Similar procedure is done to other memory limits
> (memory.low for e.g). However, this is rather cumbersome and buggy.
> Furthermore, when there is a delay in memory limits correction, (for e.g
> when hugetlb usage changes within consecutive runs of the userspace
> agent), the system could be in an over/underprotected state.
>
> This patch rectifies this issue by charging the memcg when the hugetlb
> folio is utilized, and uncharging when the folio is freed (analogous to
> the hugetlb controller). Note that we do not charge when the folio is
> allocated to the hugetlb pool, because at this point it is not owned by
> any memcg.
>
> Some caveats to consider:
> * This feature is only available on cgroup v2.
> * There is no hugetlb pool management involved in the memory
> controller. As stated above, hugetlb folios are only charged towards
> the memory controller when it is used. Host overcommit management
> has to consider it when configuring hard limits.
> * Failure to charge towards the memcg results in SIGBUS. This could
> happen even if the hugetlb pool still has pages (but the cgroup
> limit is hit and reclaim attempt fails).
> * When this feature is enabled, hugetlb pages contribute to memory
> reclaim protection. low, min limits tuning must take into account
> hugetlb memory.
> * Hugetlb pages utilized while this option is not selected will not
> be tracked by the memory controller (even if cgroup v2 is remounted
> later on).
(special thanks to Michal Hocko, who pointed most of these out).

2023-10-03 11:50:34

by Michal Hocko

Subject: Re: [PATCH v3 1/3] memcontrol: add helpers for hugetlb memcg accounting

On Mon 02-10-23 17:18:26, Nhat Pham wrote:
> This patch exposes charge committing and cancelling as parts of the
> memory controller interface. These functionalities are useful when the
> try_charge() and commit_charge() stages have to be separated by other
> actions in between (which can fail). One such example is the new hugetlb
> accounting behavior in the following patch.
>
> The patch also adds a helper function to obtain a reference to the
> current task's memcg.
>
> Signed-off-by: Nhat Pham <[email protected]>

OK, makes sense.
Acked-by: Michal Hocko <[email protected]>

Usually we prefer to have newly introduced functions along with their
users but I do understand that this split just makes it easier to review
the main patch for the feature.


--
Michal Hocko
SUSE Labs

2023-10-03 12:47:39

by Johannes Weiner

Subject: Re: [PATCH v3 1/3] memcontrol: add helpers for hugetlb memcg accounting

On Mon, Oct 02, 2023 at 05:18:26PM -0700, Nhat Pham wrote:
> This patch exposes charge committing and cancelling as parts of the
> memory controller interface. These functionalities are useful when the
> try_charge() and commit_charge() stages have to be separated by other
> actions in between (which can fail). One such example is the new hugetlb
> accounting behavior in the following patch.
>
> The patch also adds a helper function to obtain a reference to the
> current task's memcg.
>
> Signed-off-by: Nhat Pham <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2023-10-03 12:54:52

by Johannes Weiner

Subject: Re: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller

On Mon, Oct 02, 2023 at 05:18:27PM -0700, Nhat Pham wrote:
> Currently, hugetlb memory usage is not acounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
>
> For instance, here is one of our usecases: suppose there are two 32G
> containers. The machine is booted with hugetlb_cma=6G, and each
> container may or may not use up to 3 gigantic page, depending on the
> workload within it. The rest is anon, cache, slab, etc. We can set the
> hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness.
> But it is very difficult to configure memory.max to keep overall
> consumption, including anon, cache, slab etc. fair.
>
> What we have had to resort to is to constantly poll hugetlb usage and
> readjust memory.max. Similar procedure is done to other memory limits
> (memory.low for e.g). However, this is rather cumbersome and buggy.
> Furthermore, when there is a delay in memory limits correction, (for e.g
> when hugetlb usage changes within consecutive runs of the userspace
> agent), the system could be in an over/underprotected state.
>
> This patch rectifies this issue by charging the memcg when the hugetlb
> folio is utilized, and uncharging when the folio is freed (analogous to
> the hugetlb controller). Note that we do not charge when the folio is
> allocated to the hugetlb pool, because at this point it is not owned by
> any memcg.
>
> Some caveats to consider:
> * This feature is only available on cgroup v2.
> * There is no hugetlb pool management involved in the memory
> controller. As stated above, hugetlb folios are only charged towards
> the memory controller when it is used. Host overcommit management
> has to consider it when configuring hard limits.
> * Failure to charge towards the memcg results in SIGBUS. This could
> happen even if the hugetlb pool still has pages (but the cgroup
> limit is hit and reclaim attempt fails).
> * When this feature is enabled, hugetlb pages contribute to memory
> reclaim protection. low, min limits tuning must take into account
> hugetlb memory.
> * Hugetlb pages utilized while this option is not selected will not
> be tracked by the memory controller (even if cgroup v2 is remounted
> later on).
>
> Signed-off-by: Nhat Pham <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2023-10-03 12:59:05

by Michal Hocko

Subject: Re: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller

On Mon 02-10-23 17:18:27, Nhat Pham wrote:
> Currently, hugetlb memory usage is not acounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
>
> For instance, here is one of our usecases: suppose there are two 32G
> containers. The machine is booted with hugetlb_cma=6G, and each
> container may or may not use up to 3 gigantic page, depending on the
> workload within it. The rest is anon, cache, slab, etc. We can set the
> hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness.
> But it is very difficult to configure memory.max to keep overall
> consumption, including anon, cache, slab etc. fair.
>
> What we have had to resort to is to constantly poll hugetlb usage and
> readjust memory.max. Similar procedure is done to other memory limits
> (memory.low for e.g). However, this is rather cumbersome and buggy.

Could you expand some more on how this _helps_ memory.low? The
hugetlb memory is not reclaimable so whatever portion of its memcg
consumption will be "protected from the reclaim". Consider this
              parent
             /      \
            A        B
      low=50%        low=0
  current=40%        current=60%

We have external memory pressure, and the reclaim should prefer B, as A
is under its low limit, correct? But now consider that the predominant
consumption of B is hugetlb, which means the memory reclaim cannot
do much for B, and so A's protection might be breached.

As an admin (or a tool) you need to know about hugetlb as a potential
contributor to this behavior (sure, mlocked memory would behave the same,
but mlock rarely consumes huge amounts of memory in my experience).
Without the accounting there might not be any external pressure in the
first place.

All that being said, I do not see how adding hugetlb into accounting
makes low, min limits management any easier.

--
Michal Hocko
SUSE Labs

2023-10-03 16:01:32

by Johannes Weiner

Subject: Re: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller

On Tue, Oct 03, 2023 at 02:58:58PM +0200, Michal Hocko wrote:
> On Mon 02-10-23 17:18:27, Nhat Pham wrote:
> > Currently, hugetlb memory usage is not acounted for in the memory
> > controller, which could lead to memory overprotection for cgroups with
> > hugetlb-backed memory. This has been observed in our production system.
> >
> > For instance, here is one of our usecases: suppose there are two 32G
> > containers. The machine is booted with hugetlb_cma=6G, and each
> > container may or may not use up to 3 gigantic page, depending on the
> > workload within it. The rest is anon, cache, slab, etc. We can set the
> > hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness.
> > But it is very difficult to configure memory.max to keep overall
> > consumption, including anon, cache, slab etc. fair.
> >
> > What we have had to resort to is to constantly poll hugetlb usage and
> > readjust memory.max. Similar procedure is done to other memory limits
> > (memory.low for e.g). However, this is rather cumbersome and buggy.
>
> Could you expand some more on how this _helps_ memory.low? The
> hugetlb memory is not reclaimable so whatever portion of its memcg
> consumption will be "protected from the reclaim". Consider this
>               parent
>              /      \
>             A        B
>       low=50%        low=0
>   current=40%        current=60%
>
> We have an external memory pressure and the reclaim should prefer B as A
> is under its low limit, correct? But now consider that the predominant
> consumption of B is hugetlb which would mean the memory reclaim cannot
> do much for B and so the A's protection might be breached.
>
> As an admin (or a tool) you need to know about hugetlb as a potential
> contributor to this behavior (sure mlocked memory would behave the same
> but mlock rarely consumes huge amount of memory in my experience).
> Without the accounting there might not be any external pressure in the
> first place.
>
> All that being said, I do not see how adding hugetlb into accounting
> makes low, min limits management any easier.

It's important to differentiate the cgroup usecases. One is of course
the cloud/virtual server scenario, where you set the hard limits to
whatever the customer paid for, and don't know and don't care about
the workload running inside. In that case, memory.low and overcommit
aren't really safe to begin with due to unknown unreclaimable mem.

The other common usecase is the datacenter where you run your own
applications. You understand their workingset and requirements, and
configure and overcommit the containers in a way where jobs always
meet their SLAs. E.g. if multiple containers spike, memory.low is set
such that interactive workloads are prioritized over batch jobs, and
both have priority over routine system management tasks.

This is arguably the only case where it's safe to use memory.low. You
have to know what's reclaimable and what isn't, otherwise you cannot
know that memory.low will even do anything, and isolation breaks down.
So we already have that knowledge: mlocked sections, how much anon is
without swap space, and how much memory must not be reclaimed (even if
it is reclaimable) for the workload to meet its SLAs. Hugetlb doesn't
really complicate this equation - we already have to consider it
unreclaimable workingset from an overcommit POV on those hosts.

The reason this patch helps in this scenario is that the service teams
are usually different from the containers/infra team. The service
understands their workload and declares its workingset. But it's the
infra team running the containers that currently has to go and find
out if they're using hugetlb and tweak the cgroups. Bugs and
untimeliness in the tweaking have caused multiple production incidents
already. And both teams are regularly confused when there are large
parts of the workload that don't show up in memory.current which both
sides monitor. Keep in mind that these systems are already pretty
complex, with multiple overcommitted containers and system-level
activity. The current hugetlb quirk can heavily distort what a given
container is doing on the host.

With this patch, the service can declare its workingset, the container
team can configure the container, and memory.current makes sense to
everybody. The workload parameters are pretty reliable, but if the
service team gets it wrong and we underprotect the workload, and/or
its unreclaimable memory exceeds what was declared, the infra team
gets alarms on elevated LOW breaching events and investigates if it's
an infra problem or a service spec problem that needs escalation.

So the case you describe above only happens when mistakes are made,
and we detect and rectify them. In the common case, hugetlb is part of
the recognized workingset, and we configure memory.low to cut off only
known optional and reclaimable memory under pressure.

2023-10-03 17:14:32

by Mike Kravetz

Subject: Re: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller

On 10/02/23 17:18, Nhat Pham wrote:
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index de220e3ff8be..74472e911b0a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio)
> pages_per_huge_page(h), folio);
> hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> pages_per_huge_page(h), folio);
> + mem_cgroup_uncharge(folio);
> if (restore_reserve)
> h->resv_huge_pages++;
>
> @@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> struct hugepage_subpool *spool = subpool_vma(vma);
> struct hstate *h = hstate_vma(vma);
> struct folio *folio;
> - long map_chg, map_commit;
> + long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
> long gbl_chg;
> - int ret, idx;
> + int memcg_charge_ret, ret, idx;
> struct hugetlb_cgroup *h_cg = NULL;
> + struct mem_cgroup *memcg;
> bool deferred_reserve;
> + gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
> +
> + memcg = get_mem_cgroup_from_current();
> + memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
> + if (memcg_charge_ret == -ENOMEM) {
> + mem_cgroup_put(memcg);
> + return ERR_PTR(-ENOMEM);
> + }
>
> idx = hstate_index(h);
> /*
> @@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> * code of zero indicates a reservation exists (no change).
> */
> map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
> - if (map_chg < 0)
> + if (map_chg < 0) {
> + if (!memcg_charge_ret)
> + mem_cgroup_cancel_charge(memcg, nr_pages);
> + mem_cgroup_put(memcg);
> return ERR_PTR(-ENOMEM);
> + }
>
> /*
> * Processes that did not create the mapping will have no
> @@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> */
> if (map_chg || avoid_reserve) {
> gbl_chg = hugepage_subpool_get_pages(spool, 1);
> - if (gbl_chg < 0) {
> - vma_end_reservation(h, vma, addr);
> - return ERR_PTR(-ENOSPC);
> - }
> + if (gbl_chg < 0)
> + goto out_end_reservation;
>
> /*
> * Even though there was no reservation in the region/reserve
> @@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> pages_per_huge_page(h), folio);
> }
> +
> + if (!memcg_charge_ret)
> + mem_cgroup_commit_charge(folio, memcg);
> + mem_cgroup_put(memcg);
> +
> return folio;
>
> out_uncharge_cgroup:
> @@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> out_subpool_put:
> if (map_chg || avoid_reserve)
> hugepage_subpool_put_pages(spool, 1);
> +out_end_reservation:
> vma_end_reservation(h, vma, addr);
> + if (!memcg_charge_ret)
> + mem_cgroup_cancel_charge(memcg, nr_pages);
> + mem_cgroup_put(memcg);
> return ERR_PTR(-ENOSPC);
> }
>

IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in
free_huge_folio. During migration, huge pages are allocated via
alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio. So, there is no
charging for the migration target page and we uncharge the source page.
It looks like there will be no charge for the huge page after migration?

If my analysis above is correct, then we may need to be careful about
this accounting. We may not want both source and target pages to be
charged at the same time.
--
Mike Kravetz

2023-10-03 18:02:05

by Nhat Pham

Subject: Re: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller

On Tue, Oct 3, 2023 at 10:13 AM Mike Kravetz <[email protected]> wrote:
>
> IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in
> free_huge_folio. During migration, huge pages are allocated via
> alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio. So, there is no
> charging for the migration target page and we uncharge the source page.
> It looks like there will be no charge for the huge page after migration?
>

Ah I see! This is a bit subtle indeed.

For the hugetlb controller, it looks like they update the cgroup info
inside move_hugetlb_state(), which calls hugetlb_cgroup_migrate()
to transfer the hugetlb cgroup info to the destination folio.

Perhaps we can do something analogous here.

> If my analysis above is correct, then we may need to be careful about
> this accounting. We may not want both source and target pages to be
> charged at the same time.

We can create a variant of mem_cgroup_migrate that does not double
charge, but instead just copy the mem_cgroup information to the new
folio, and then clear that info from the old folio. That way the memory
usage counters are untouched.

Somebody with more expertise on migration should fact check me
of course :)
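
To make the idea concrete, here is a rough, untested sketch of the kind of
charge-neutral transfer I have in mind (the helper name is made up, and this
glosses over locking and stats -- it only illustrates the data movement):

/* Untested sketch only -- not actual kernel code. */
static void memcg_transfer_on_migrate(struct folio *old, struct folio *new)
{
        struct mem_cgroup *memcg;

        if (mem_cgroup_disabled())
                return;

        memcg = folio_memcg(old);
        if (!memcg)
                return;

        /*
         * Move the memcg binding (and the css reference it pins) from the
         * source folio to the target folio. No page counters are touched,
         * so memory usage stays constant across the migration.
         */
        new->memcg_data = old->memcg_data;
        old->memcg_data = 0;
}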

> --
> Mike Kravetz

2023-10-03 18:39:41

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller

On Tue, Oct 03, 2023 at 11:01:24AM -0700, Nhat Pham wrote:
> On Tue, Oct 3, 2023 at 10:13 AM Mike Kravetz <[email protected]> wrote:
> >
> > On 10/02/23 17:18, Nhat Pham wrote:
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index de220e3ff8be..74472e911b0a 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio)
> > > pages_per_huge_page(h), folio);
> > > hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> > > pages_per_huge_page(h), folio);
> > > + mem_cgroup_uncharge(folio);
> > > if (restore_reserve)
> > > h->resv_huge_pages++;
> > >
> > > @@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > > struct hugepage_subpool *spool = subpool_vma(vma);
> > > struct hstate *h = hstate_vma(vma);
> > > struct folio *folio;
> > > - long map_chg, map_commit;
> > > + long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
> > > long gbl_chg;
> > > - int ret, idx;
> > > + int memcg_charge_ret, ret, idx;
> > > struct hugetlb_cgroup *h_cg = NULL;
> > > + struct mem_cgroup *memcg;
> > > bool deferred_reserve;
> > > + gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
> > > +
> > > + memcg = get_mem_cgroup_from_current();
> > > + memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
> > > + if (memcg_charge_ret == -ENOMEM) {
> > > + mem_cgroup_put(memcg);
> > > + return ERR_PTR(-ENOMEM);
> > > + }
> > >
> > > idx = hstate_index(h);
> > > /*
> > > @@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > > * code of zero indicates a reservation exists (no change).
> > > */
> > > map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
> > > - if (map_chg < 0)
> > > + if (map_chg < 0) {
> > > + if (!memcg_charge_ret)
> > > + mem_cgroup_cancel_charge(memcg, nr_pages);
> > > + mem_cgroup_put(memcg);
> > > return ERR_PTR(-ENOMEM);
> > > + }
> > >
> > > /*
> > > * Processes that did not create the mapping will have no
> > > @@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > > */
> > > if (map_chg || avoid_reserve) {
> > > gbl_chg = hugepage_subpool_get_pages(spool, 1);
> > > - if (gbl_chg < 0) {
> > > - vma_end_reservation(h, vma, addr);
> > > - return ERR_PTR(-ENOSPC);
> > > - }
> > > + if (gbl_chg < 0)
> > > + goto out_end_reservation;
> > >
> > > /*
> > > * Even though there was no reservation in the region/reserve
> > > @@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > > hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> > > pages_per_huge_page(h), folio);
> > > }
> > > +
> > > + if (!memcg_charge_ret)
> > > + mem_cgroup_commit_charge(folio, memcg);
> > > + mem_cgroup_put(memcg);
> > > +
> > > return folio;
> > >
> > > out_uncharge_cgroup:
> > > @@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > > out_subpool_put:
> > > if (map_chg || avoid_reserve)
> > > hugepage_subpool_put_pages(spool, 1);
> > > +out_end_reservation:
> > > vma_end_reservation(h, vma, addr);
> > > + if (!memcg_charge_ret)
> > > + mem_cgroup_cancel_charge(memcg, nr_pages);
> > > + mem_cgroup_put(memcg);
> > > return ERR_PTR(-ENOSPC);
> > > }
> > >
> >
> > IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in
> > free_huge_folio. During migration, huge pages are allocated via
> > alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio. So, there is no
> > charging for the migration target page and we uncharge the source page.
> > It looks like there will be no charge for the huge page after migration?
> >
>
> Ah I see! This is a bit subtle indeed.
>
> For the hugetlb controller, it looks like they update the cgroup info
> inside move_hugetlb_state(), which calls hugetlb_cgroup_migrate()
> to transfer the hugetlb cgroup info to the destination folio.
>
> Perhaps we can do something analogous here.
>
> > If my analysis above is correct, then we may need to be careful about
> > this accounting. We may not want both source and target pages to be
> > charged at the same time.
>
> We can create a variant of mem_cgroup_migrate that does not double
> charge, but instead just copy the mem_cgroup information to the new
> folio, and then clear that info from the old folio. That way the memory
> usage counters are untouched.
>
> Somebody with more expertise on migration should fact check me
> of course :)

The only reason mem_cgroup_migrate() double charges right now is
because it's used by replace_page_cache_folio(). In that context, the
isolation of the old page isn't quite as thorough as with migration,
so it cannot transfer and uncharge directly. This goes back a long
time: 0a31bc97c80c3fa87b32c091d9a930ac19cd0c40

If you rename the current implementation to mem_cgroup_replace_page()
for that one caller, you can add a mem_cgroup_migrate() variant which
is charge neutral and clears old->memcg_data. This can be used for
regular and hugetlb page migration. Something like this (totally
untested):

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a4d3282493b6..17ec45bf3653 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7226,29 +7226,14 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
if (mem_cgroup_disabled())
return;

- /* Page cache replacement: new folio already charged? */
- if (folio_memcg(new))
- return;
-
memcg = folio_memcg(old);
VM_WARN_ON_ONCE_FOLIO(!memcg, old);
if (!memcg)
return;

- /* Force-charge the new page. The old one will be freed soon */
- if (!mem_cgroup_is_root(memcg)) {
- page_counter_charge(&memcg->memory, nr_pages);
- if (do_memsw_account())
- page_counter_charge(&memcg->memsw, nr_pages);
- }
-
- css_get(&memcg->css);
+ /* Transfer the charge and the css ref */
commit_charge(new, memcg);
-
- local_irq_save(flags);
- mem_cgroup_charge_statistics(memcg, nr_pages);
- memcg_check_events(memcg, folio_nid(new));
- local_irq_restore(flags);
+ old->memcg_data = 0;
}

DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);

2023-10-03 22:10:08

by Nhat Pham

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller

On Tue, Oct 3, 2023 at 11:39 AM Johannes Weiner <[email protected]> wrote:
>
> On Tue, Oct 03, 2023 at 11:01:24AM -0700, Nhat Pham wrote:
> > On Tue, Oct 3, 2023 at 10:13 AM Mike Kravetz <[email protected]> wrote:
> > >
> > > On 10/02/23 17:18, Nhat Pham wrote:
> > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > index de220e3ff8be..74472e911b0a 100644
> > > > --- a/mm/hugetlb.c
> > > > +++ b/mm/hugetlb.c
> > > > @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio)
> > > > pages_per_huge_page(h), folio);
> > > > hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> > > > pages_per_huge_page(h), folio);
> > > > + mem_cgroup_uncharge(folio);
> > > > if (restore_reserve)
> > > > h->resv_huge_pages++;
> > > >
> > > > @@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > > > struct hugepage_subpool *spool = subpool_vma(vma);
> > > > struct hstate *h = hstate_vma(vma);
> > > > struct folio *folio;
> > > > - long map_chg, map_commit;
> > > > + long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
> > > > long gbl_chg;
> > > > - int ret, idx;
> > > > + int memcg_charge_ret, ret, idx;
> > > > struct hugetlb_cgroup *h_cg = NULL;
> > > > + struct mem_cgroup *memcg;
> > > > bool deferred_reserve;
> > > > + gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
> > > > +
> > > > + memcg = get_mem_cgroup_from_current();
> > > > + memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
> > > > + if (memcg_charge_ret == -ENOMEM) {
> > > > + mem_cgroup_put(memcg);
> > > > + return ERR_PTR(-ENOMEM);
> > > > + }
> > > >
> > > > idx = hstate_index(h);
> > > > /*
> > > > @@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > > > * code of zero indicates a reservation exists (no change).
> > > > */
> > > > map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
> > > > - if (map_chg < 0)
> > > > + if (map_chg < 0) {
> > > > + if (!memcg_charge_ret)
> > > > + mem_cgroup_cancel_charge(memcg, nr_pages);
> > > > + mem_cgroup_put(memcg);
> > > > return ERR_PTR(-ENOMEM);
> > > > + }
> > > >
> > > > /*
> > > > * Processes that did not create the mapping will have no
> > > > @@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > > > */
> > > > if (map_chg || avoid_reserve) {
> > > > gbl_chg = hugepage_subpool_get_pages(spool, 1);
> > > > - if (gbl_chg < 0) {
> > > > - vma_end_reservation(h, vma, addr);
> > > > - return ERR_PTR(-ENOSPC);
> > > > - }
> > > > + if (gbl_chg < 0)
> > > > + goto out_end_reservation;
> > > >
> > > > /*
> > > > * Even though there was no reservation in the region/reserve
> > > > @@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > > > hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> > > > pages_per_huge_page(h), folio);
> > > > }
> > > > +
> > > > + if (!memcg_charge_ret)
> > > > + mem_cgroup_commit_charge(folio, memcg);
> > > > + mem_cgroup_put(memcg);
> > > > +
> > > > return folio;
> > > >
> > > > out_uncharge_cgroup:
> > > > @@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > > > out_subpool_put:
> > > > if (map_chg || avoid_reserve)
> > > > hugepage_subpool_put_pages(spool, 1);
> > > > +out_end_reservation:
> > > > vma_end_reservation(h, vma, addr);
> > > > + if (!memcg_charge_ret)
> > > > + mem_cgroup_cancel_charge(memcg, nr_pages);
> > > > + mem_cgroup_put(memcg);
> > > > return ERR_PTR(-ENOSPC);
> > > > }
> > > >
> > >
> > > IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in
> > > free_huge_folio. During migration, huge pages are allocated via
> > > alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio. So, there is no
> > > charging for the migration target page and we uncharge the source page.
> > > It looks like there will be no charge for the huge page after migration?
> > >
> >
> > Ah I see! This is a bit subtle indeed.
> >
> > For the hugetlb controller, it looks like they update the cgroup info
> > inside move_hugetlb_state(), which calls hugetlb_cgroup_migrate()
> > to transfer the hugetlb cgroup info to the destination folio.
> >
> > Perhaps we can do something analogous here.
> >
> > > If my analysis above is correct, then we may need to be careful about
> > > this accounting. We may not want both source and target pages to be
> > > charged at the same time.
> >
> > We can create a variant of mem_cgroup_migrate that does not double
> > charge, but instead just copy the mem_cgroup information to the new
> > folio, and then clear that info from the old folio. That way the memory
> > usage counters are untouched.
> >
> > Somebody with more expertise on migration should fact check me
> > of course :)
>
> The only reason mem_cgroup_migrate() double charges right now is
> because it's used by replace_page_cache_folio(). In that context, the
> isolation of the old page isn't quite as thorough as with migration,
> so it cannot transfer and uncharge directly. This goes back a long
> time: 0a31bc97c80c3fa87b32c091d9a930ac19cd0c40
>
> If you rename the current implementation to mem_cgroup_replace_page()
> for that one caller, you can add a mem_cgroup_migrate() variant which
> is charge neutral and clears old->memcg_data. This can be used for
> regular and hugetlb page migration. Something like this (totally
> untested):
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a4d3282493b6..17ec45bf3653 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7226,29 +7226,14 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
> if (mem_cgroup_disabled())
> return;
>
> - /* Page cache replacement: new folio already charged? */
> - if (folio_memcg(new))
> - return;
> -
> memcg = folio_memcg(old);
> VM_WARN_ON_ONCE_FOLIO(!memcg, old);
> if (!memcg)
> return;
>
> - /* Force-charge the new page. The old one will be freed soon */
> - if (!mem_cgroup_is_root(memcg)) {
> - page_counter_charge(&memcg->memory, nr_pages);
> - if (do_memsw_account())
> - page_counter_charge(&memcg->memsw, nr_pages);
> - }
> -
> - css_get(&memcg->css);
> + /* Transfer the charge and the css ref */
> commit_charge(new, memcg);
> -
> - local_irq_save(flags);
> - mem_cgroup_charge_statistics(memcg, nr_pages);
> - memcg_check_events(memcg, folio_nid(new));
> - local_irq_restore(flags);
> + old->memcg_data = 0;
> }
>
> DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
>

Ah, I like this. Will send a fixlet based on this :)
I was scratching my head trying to figure out why we were
doing the double charging in the first place. Thanks for the context,
Johannes!

2023-10-03 22:43:22

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller

On 10/03/23 15:09, Nhat Pham wrote:
> On Tue, Oct 3, 2023 at 11:39 AM Johannes Weiner <[email protected]> wrote:
> > On Tue, Oct 03, 2023 at 11:01:24AM -0700, Nhat Pham wrote:
> > > On Tue, Oct 3, 2023 at 10:13 AM Mike Kravetz <[email protected]> wrote:
> > > > On 10/02/23 17:18, Nhat Pham wrote:
> > > >
> > > > IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in
> > > > free_huge_folio. During migration, huge pages are allocated via
> > > > alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio. So, there is no
> > > > charging for the migration target page and we uncharge the source page.
> > > > It looks like there will be no charge for the huge page after migration?
> > > >
> > >
> > > Ah I see! This is a bit subtle indeed.
> > >
> > > For the hugetlb controller, it looks like they update the cgroup info
> > > inside move_hugetlb_state(), which calls hugetlb_cgroup_migrate()
> > > to transfer the hugetlb cgroup info to the destination folio.
> > >
> > > Perhaps we can do something analogous here.
> > >
> > > > If my analysis above is correct, then we may need to be careful about
> > > > this accounting. We may not want both source and target pages to be
> > > > charged at the same time.
> > >
> > > We can create a variant of mem_cgroup_migrate that does not double
> > > charge, but instead just copy the mem_cgroup information to the new
> > > folio, and then clear that info from the old folio. That way the memory
> > > usage counters are untouched.
> > >
> > > Somebody with more expertise on migration should fact check me
> > > of course :)
> >
> > The only reason mem_cgroup_migrate() double charges right now is
> > because it's used by replace_page_cache_folio(). In that context, the
> > isolation of the old page isn't quite as thorough as with migration,
> > so it cannot transfer and uncharge directly. This goes back a long
> > time: 0a31bc97c80c3fa87b32c091d9a930ac19cd0c40
> >
> > If you rename the current implementation to mem_cgroup_replace_page()
> > for that one caller, you can add a mem_cgroup_migrate() variant which
> > is charge neutral and clears old->memcg_data. This can be used for
> > regular and hugetlb page migration. Something like this (totally
> > untested):
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index a4d3282493b6..17ec45bf3653 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -7226,29 +7226,14 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
> > if (mem_cgroup_disabled())
> > return;
> >
> > - /* Page cache replacement: new folio already charged? */
> > - if (folio_memcg(new))
> > - return;
> > -
> > memcg = folio_memcg(old);
> > VM_WARN_ON_ONCE_FOLIO(!memcg, old);
> > if (!memcg)
> > return;
> >
> > - /* Force-charge the new page. The old one will be freed soon */
> > - if (!mem_cgroup_is_root(memcg)) {
> > - page_counter_charge(&memcg->memory, nr_pages);
> > - if (do_memsw_account())
> > - page_counter_charge(&memcg->memsw, nr_pages);
> > - }
> > -
> > - css_get(&memcg->css);
> > + /* Transfer the charge and the css ref */
> > commit_charge(new, memcg);
> > -
> > - local_irq_save(flags);
> > - mem_cgroup_charge_statistics(memcg, nr_pages);
> > - memcg_check_events(memcg, folio_nid(new));
> > - local_irq_restore(flags);
> > + old->memcg_data = 0;
> > }
> >
> > DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
> >
>
> Ah, I like this. Will send a fixlet based on this :)
> I was scratching my head trying to figure out why we were
> doing the double charging in the first place. Thanks for the context,
> Johannes!

Be sure to check for code similar to this in folio_migrate_flags:

void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
{
...
if (!folio_test_hugetlb(folio))
mem_cgroup_migrate(folio, newfolio);
}

There are many places where hugetlb is special cased.
--
Mike Kravetz

2023-10-03 23:14:39

by Nhat Pham

[permalink] [raw]
Subject: [PATCH] memcontrol: only transfer the memcg data for migration

For most migration use cases, only transfer the memcg data from the old
folio to the new folio, and clear the old folio's memcg data. No
charging and uncharging will be done. These use cases include the new
hugetlb memcg accounting behavior (which was not previously handled).

This shaves off some work on the migration path, and avoids the
temporary double charging of a folio during its migration.

The only exception is replace_page_cache_folio(), which will use the old
mem_cgroup_migrate() (now renamed to mem_cgroup_replace_folio). In that
context, the isolation of the old page isn't quite as thorough as with
migration, so we cannot use our new implementation directly.

This patch is the result of the following discussion on the new hugetlb
memcg accounting behavior:

https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/

Reported-by: Mike Kravetz <[email protected]>
Closes: https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/
Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Nhat Pham <[email protected]>
---
include/linux/memcontrol.h | 7 ++++++
mm/filemap.c | 2 +-
mm/memcontrol.c | 45 +++++++++++++++++++++++++++++++++++---
mm/migrate.c | 3 +--
4 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a827e2129790..e3eaa123256b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -711,6 +711,8 @@ static inline void mem_cgroup_uncharge_list(struct list_head *page_list)

void mem_cgroup_cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages);

+void mem_cgroup_replace_folio(struct folio *old, struct folio *new);
+
void mem_cgroup_migrate(struct folio *old, struct folio *new);

/**
@@ -1294,6 +1296,11 @@ static inline void mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
{
}

+static inline void mem_cgroup_replace_folio(struct folio *old,
+ struct folio *new)
+{
+}
+
static inline void mem_cgroup_migrate(struct folio *old, struct folio *new)
{
}
diff --git a/mm/filemap.c b/mm/filemap.c
index 9481ffaf24e6..673745219c82 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -819,7 +819,7 @@ void replace_page_cache_folio(struct folio *old, struct folio *new)
new->mapping = mapping;
new->index = offset;

- mem_cgroup_migrate(old, new);
+ mem_cgroup_replace_folio(old, new);

xas_lock_irq(&xas);
xas_store(&xas, new);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6660684f6f97..cbaa26605b3d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7316,16 +7316,17 @@ void __mem_cgroup_uncharge_list(struct list_head *page_list)
}

/**
- * mem_cgroup_migrate - Charge a folio's replacement.
+ * mem_cgroup_replace_folio - Charge a folio's replacement.
* @old: Currently circulating folio.
* @new: Replacement folio.
*
* Charge @new as a replacement folio for @old. @old will
- * be uncharged upon free.
+ * be uncharged upon free. This is only used by the page cache
+ * (in replace_page_cache_folio()).
*
* Both folios must be locked, @new->mapping must be set up.
*/
-void mem_cgroup_migrate(struct folio *old, struct folio *new)
+void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
{
struct mem_cgroup *memcg;
long nr_pages = folio_nr_pages(new);
@@ -7364,6 +7365,44 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
local_irq_restore(flags);
}

+/**
+ * mem_cgroup_migrate - Transfer the memcg data from the old to the new folio.
+ * @old: Currently circulating folio.
+ * @new: Replacement folio.
+ *
+ * Transfer the memcg data from the old folio to the new folio for migration.
+ * The old folio's data info will be cleared. Note that the memory counters
+ * will remain unchanged throughout the process.
+ *
+ * Both folios must be locked, @new->mapping must be set up.
+ */
+void mem_cgroup_migrate(struct folio *old, struct folio *new)
+{
+ struct mem_cgroup *memcg;
+
+ VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
+ VM_BUG_ON_FOLIO(!folio_test_locked(new), new);
+ VM_BUG_ON_FOLIO(folio_test_anon(old) != folio_test_anon(new), new);
+ VM_BUG_ON_FOLIO(folio_nr_pages(old) != folio_nr_pages(new), new);
+
+ if (mem_cgroup_disabled())
+ return;
+
+ memcg = folio_memcg(old);
+ /*
+ * Note that it is normal to see !memcg for a hugetlb folio.
+ * It could have been allocated when memory_hugetlb_accounting was not
+ * selected, for e.g.
+ */
+ VM_WARN_ON_ONCE_FOLIO(!memcg, old);
+ if (!memcg)
+ return;
+
+ /* Transfer the charge and the css ref */
+ commit_charge(new, memcg);
+ old->memcg_data = 0;
+}
+
DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
EXPORT_SYMBOL(memcg_sockets_enabled_key);

diff --git a/mm/migrate.c b/mm/migrate.c
index 7d1804c4a5d9..6034c7ed1d65 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -633,8 +633,7 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio)

folio_copy_owner(newfolio, folio);

- if (!folio_test_hugetlb(folio))
- mem_cgroup_migrate(folio, newfolio);
+ mem_cgroup_migrate(folio, newfolio);
}
EXPORT_SYMBOL(folio_migrate_flags);

--
2.34.1

2023-10-03 23:23:15

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH] memcontrol: only transfer the memcg data for migration

On Tue, Oct 3, 2023 at 4:14 PM Nhat Pham <[email protected]> wrote:
>
> For most migration use cases, only transfer the memcg data from the old
> folio to the new folio, and clear the old folio's memcg data. No
> charging and uncharging will be done. These use cases include the new
> hugetlb memcg accounting behavior (which was not previously handled).
>
> This shaves off some work on the migration path, and avoids the
> temporary double charging of a folio during its migration.
>
> The only exception is replace_page_cache_folio(), which will use the old
> mem_cgroup_migrate() (now renamed to mem_cgroup_replace_folio). In that
> context, the isolation of the old page isn't quite as thorough as with
> migration, so we cannot use our new implementation directly.
>
> This patch is the result of the following discussion on the new hugetlb
> memcg accounting behavior:
>
> https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/
>
> Reported-by: Mike Kravetz <[email protected]>
> Closes: https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/
> Suggested-by: Johannes Weiner <[email protected]>
> Signed-off-by: Nhat Pham <[email protected]>

Does this patch fit before or after your series? In both cases I think
there might be a problem for bisectability.

> ---
> include/linux/memcontrol.h | 7 ++++++
> mm/filemap.c | 2 +-
> mm/memcontrol.c | 45 +++++++++++++++++++++++++++++++++++---
> mm/migrate.c | 3 +--
> 4 files changed, 51 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index a827e2129790..e3eaa123256b 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -711,6 +711,8 @@ static inline void mem_cgroup_uncharge_list(struct list_head *page_list)
>
> void mem_cgroup_cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages);
>
> +void mem_cgroup_replace_folio(struct folio *old, struct folio *new);
> +
> void mem_cgroup_migrate(struct folio *old, struct folio *new);
>
> /**
> @@ -1294,6 +1296,11 @@ static inline void mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
> {
> }
>
> +static inline void mem_cgroup_replace_folio(struct folio *old,
> + struct folio *new)
> +{
> +}
> +
> static inline void mem_cgroup_migrate(struct folio *old, struct folio *new)
> {
> }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 9481ffaf24e6..673745219c82 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -819,7 +819,7 @@ void replace_page_cache_folio(struct folio *old, struct folio *new)
> new->mapping = mapping;
> new->index = offset;
>
> - mem_cgroup_migrate(old, new);
> + mem_cgroup_replace_folio(old, new);
>
> xas_lock_irq(&xas);
> xas_store(&xas, new);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6660684f6f97..cbaa26605b3d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7316,16 +7316,17 @@ void __mem_cgroup_uncharge_list(struct list_head *page_list)
> }
>
> /**
> - * mem_cgroup_migrate - Charge a folio's replacement.
> + * mem_cgroup_replace_folio - Charge a folio's replacement.
> * @old: Currently circulating folio.
> * @new: Replacement folio.
> *
> * Charge @new as a replacement folio for @old. @old will
> - * be uncharged upon free.
> + * be uncharged upon free. This is only used by the page cache
> + * (in replace_page_cache_folio()).
> *
> * Both folios must be locked, @new->mapping must be set up.
> */
> -void mem_cgroup_migrate(struct folio *old, struct folio *new)
> +void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
> {
> struct mem_cgroup *memcg;
> long nr_pages = folio_nr_pages(new);
> @@ -7364,6 +7365,44 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
> local_irq_restore(flags);
> }
>
> +/**
> + * mem_cgroup_migrate - Transfer the memcg data from the old to the new folio.
> + * @old: Currently circulating folio.
> + * @new: Replacement folio.
> + *
> + * Transfer the memcg data from the old folio to the new folio for migration.
> + * The old folio's data info will be cleared. Note that the memory counters
> + * will remain unchanged throughout the process.
> + *
> + * Both folios must be locked, @new->mapping must be set up.
> + */
> +void mem_cgroup_migrate(struct folio *old, struct folio *new)
> +{
> + struct mem_cgroup *memcg;
> +
> + VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
> + VM_BUG_ON_FOLIO(!folio_test_locked(new), new);
> + VM_BUG_ON_FOLIO(folio_test_anon(old) != folio_test_anon(new), new);
> + VM_BUG_ON_FOLIO(folio_nr_pages(old) != folio_nr_pages(new), new);
> +
> + if (mem_cgroup_disabled())
> + return;
> +
> + memcg = folio_memcg(old);
> + /*
> + * Note that it is normal to see !memcg for a hugetlb folio.
> + * It could have been allocated when memory_hugetlb_accounting was not
> + * selected, for e.g.
> + */
> + VM_WARN_ON_ONCE_FOLIO(!memcg, old);
> + if (!memcg)
> + return;
> +
> + /* Transfer the charge and the css ref */
> + commit_charge(new, memcg);
> + old->memcg_data = 0;
> +}
> +
> DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
> EXPORT_SYMBOL(memcg_sockets_enabled_key);
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 7d1804c4a5d9..6034c7ed1d65 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -633,8 +633,7 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
>
> folio_copy_owner(newfolio, folio);
>
> - if (!folio_test_hugetlb(folio))
> - mem_cgroup_migrate(folio, newfolio);
> + mem_cgroup_migrate(folio, newfolio);
> }
> EXPORT_SYMBOL(folio_migrate_flags);
>
> --
> 2.34.1
>

2023-10-03 23:26:47

by Nhat Pham

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller

On Tue, Oct 3, 2023 at 3:42 PM Mike Kravetz <[email protected]> wrote:
>
> On 10/03/23 15:09, Nhat Pham wrote:
> > On Tue, Oct 3, 2023 at 11:39 AM Johannes Weiner <[email protected]> wrote:
> > > On Tue, Oct 03, 2023 at 11:01:24AM -0700, Nhat Pham wrote:
> > > > On Tue, Oct 3, 2023 at 10:13 AM Mike Kravetz <[email protected]> wrote:
> > > > > On 10/02/23 17:18, Nhat Pham wrote:
> > > > >
> > > > > IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in
> > > > > free_huge_folio. During migration, huge pages are allocated via
> > > > > alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio. So, there is no
> > > > > charging for the migration target page and we uncharge the source page.
> > > > > It looks like there will be no charge for the huge page after migration?
> > > > >
> > > >
> > > > Ah I see! This is a bit subtle indeed.
> > > >
> > > > For the hugetlb controller, it looks like they update the cgroup info
> > > > inside move_hugetlb_state(), which calls hugetlb_cgroup_migrate()
> > > > to transfer the hugetlb cgroup info to the destination folio.
> > > >
> > > > Perhaps we can do something analogous here.
> > > >
> > > > > If my analysis above is correct, then we may need to be careful about
> > > > > this accounting. We may not want both source and target pages to be
> > > > > charged at the same time.
> > > >
> > > > We can create a variant of mem_cgroup_migrate that does not double
> > > > charge, but instead just copy the mem_cgroup information to the new
> > > > folio, and then clear that info from the old folio. That way the memory
> > > > usage counters are untouched.
> > > >
> > > > Somebody with more expertise on migration should fact check me
> > > > of course :)
> > >
> > > The only reason mem_cgroup_migrate() double charges right now is
> > > because it's used by replace_page_cache_folio(). In that context, the
> > > isolation of the old page isn't quite as thorough as with migration,
> > > so it cannot transfer and uncharge directly. This goes back a long
> > > time: 0a31bc97c80c3fa87b32c091d9a930ac19cd0c40
> > >
> > > If you rename the current implementation to mem_cgroup_replace_page()
> > > for that one caller, you can add a mem_cgroup_migrate() variant which
> > > is charge neutral and clears old->memcg_data. This can be used for
> > > regular and hugetlb page migration. Something like this (totally
> > > untested):
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index a4d3282493b6..17ec45bf3653 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -7226,29 +7226,14 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
> > > if (mem_cgroup_disabled())
> > > return;
> > >
> > > - /* Page cache replacement: new folio already charged? */
> > > - if (folio_memcg(new))
> > > - return;
> > > -
> > > memcg = folio_memcg(old);
> > > VM_WARN_ON_ONCE_FOLIO(!memcg, old);
> > > if (!memcg)
> > > return;
> > >
> > > - /* Force-charge the new page. The old one will be freed soon */
> > > - if (!mem_cgroup_is_root(memcg)) {
> > > - page_counter_charge(&memcg->memory, nr_pages);
> > > - if (do_memsw_account())
> > > - page_counter_charge(&memcg->memsw, nr_pages);
> > > - }
> > > -
> > > - css_get(&memcg->css);
> > > + /* Transfer the charge and the css ref */
> > > commit_charge(new, memcg);
> > > -
> > > - local_irq_save(flags);
> > > - mem_cgroup_charge_statistics(memcg, nr_pages);
> > > - memcg_check_events(memcg, folio_nid(new));
> > > - local_irq_restore(flags);
> > > + old->memcg_data = 0;
> > > }
> > >
> > > DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
> > >
> >
> > Ah, I like this. Will send a fixlet based on this :)
> > I was scratching my head trying to figure out why we were
> > doing the double charging in the first place. Thanks for the context,
> > Johannes!
>
> Be sure to check for code similar to this in folio_migrate_flags:
>
> void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
> {
> ...
> if (!folio_test_hugetlb(folio))
> mem_cgroup_migrate(folio, newfolio);
> }
>
> There are many places where hugetlb is special cased.

Yeah, makes sense. I'm actually going to take advantage of this and remove
the folio_test_hugetlb() check here, so that the memcg metadata is migrated
for hugetlb folios as well. See the new patch I just sent out.

> --
> Mike Kravetz

2023-10-03 23:31:58

by Nhat Pham

[permalink] [raw]
Subject: Re: [PATCH] memcontrol: only transfer the memcg data for migration

On Tue, Oct 3, 2023 at 4:22 PM Yosry Ahmed <[email protected]> wrote:
>
> On Tue, Oct 3, 2023 at 4:14 PM Nhat Pham <[email protected]> wrote:
> >
> > For most migration use cases, only transfer the memcg data from the old
> > folio to the new folio, and clear the old folio's memcg data. No
> > charging and uncharging will be done. These use cases include the new
> > hugetlb memcg accounting behavior (which was not previously handled).
> >
> > This shaves off some work on the migration path, and avoids the
> > temporary double charging of a folio during its migration.
> >
> > The only exception is replace_page_cache_folio(), which will use the old
> > mem_cgroup_migrate() (now renamed to mem_cgroup_replace_folio). In that
> > context, the isolation of the old page isn't quite as thorough as with
> > migration, so we cannot use our new implementation directly.
> >
> > This patch is the result of the following discussion on the new hugetlb
> > memcg accounting behavior:
> >
> > https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/
> >
> > Reported-by: Mike Kravetz <[email protected]>
> > Closes: https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/
> > Suggested-by: Johannes Weiner <[email protected]>
> > Signed-off-by: Nhat Pham <[email protected]>
>
> Does this patch fit before or after your series? In both cases I think
> there might be a problem for bisectability.

Hmm my intention for this patch is as a fixlet.
(i.e it should be eventually squashed to the second patch of that series).
I just include the extra context on the fixlet for review purposes.

My apologies - should have been much clearer.
(Perhaps I should just send out v4 at this point?)

>
> > ---
> > include/linux/memcontrol.h | 7 ++++++
> > mm/filemap.c | 2 +-
> > mm/memcontrol.c | 45 +++++++++++++++++++++++++++++++++++---
> > mm/migrate.c | 3 +--
> > 4 files changed, 51 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index a827e2129790..e3eaa123256b 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -711,6 +711,8 @@ static inline void mem_cgroup_uncharge_list(struct list_head *page_list)
> >
> > void mem_cgroup_cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages);
> >
> > +void mem_cgroup_replace_folio(struct folio *old, struct folio *new);
> > +
> > void mem_cgroup_migrate(struct folio *old, struct folio *new);
> >
> > /**
> > @@ -1294,6 +1296,11 @@ static inline void mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
> > {
> > }
> >
> > +static inline void mem_cgroup_replace_folio(struct folio *old,
> > + struct folio *new)
> > +{
> > +}
> > +
> > static inline void mem_cgroup_migrate(struct folio *old, struct folio *new)
> > {
> > }
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 9481ffaf24e6..673745219c82 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -819,7 +819,7 @@ void replace_page_cache_folio(struct folio *old, struct folio *new)
> > new->mapping = mapping;
> > new->index = offset;
> >
> > - mem_cgroup_migrate(old, new);
> > + mem_cgroup_replace_folio(old, new);
> >
> > xas_lock_irq(&xas);
> > xas_store(&xas, new);
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 6660684f6f97..cbaa26605b3d 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -7316,16 +7316,17 @@ void __mem_cgroup_uncharge_list(struct list_head *page_list)
> > }
> >
> > /**
> > - * mem_cgroup_migrate - Charge a folio's replacement.
> > + * mem_cgroup_replace_folio - Charge a folio's replacement.
> > * @old: Currently circulating folio.
> > * @new: Replacement folio.
> > *
> > * Charge @new as a replacement folio for @old. @old will
> > - * be uncharged upon free.
> > + * be uncharged upon free. This is only used by the page cache
> > + * (in replace_page_cache_folio()).
> > *
> > * Both folios must be locked, @new->mapping must be set up.
> > */
> > -void mem_cgroup_migrate(struct folio *old, struct folio *new)
> > +void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
> > {
> > struct mem_cgroup *memcg;
> > long nr_pages = folio_nr_pages(new);
> > @@ -7364,6 +7365,44 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
> > local_irq_restore(flags);
> > }
> >
> > +/**
> > + * mem_cgroup_migrate - Transfer the memcg data from the old to the new folio.
> > + * @old: Currently circulating folio.
> > + * @new: Replacement folio.
> > + *
> > + * Transfer the memcg data from the old folio to the new folio for migration.
> > + * The old folio's data info will be cleared. Note that the memory counters
> > + * will remain unchanged throughout the process.
> > + *
> > + * Both folios must be locked, @new->mapping must be set up.
> > + */
> > +void mem_cgroup_migrate(struct folio *old, struct folio *new)
> > +{
> > + struct mem_cgroup *memcg;
> > +
> > + VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
> > + VM_BUG_ON_FOLIO(!folio_test_locked(new), new);
> > + VM_BUG_ON_FOLIO(folio_test_anon(old) != folio_test_anon(new), new);
> > + VM_BUG_ON_FOLIO(folio_nr_pages(old) != folio_nr_pages(new), new);
> > +
> > + if (mem_cgroup_disabled())
> > + return;
> > +
> > + memcg = folio_memcg(old);
> > + /*
> > + * Note that it is normal to see !memcg for a hugetlb folio.
> > + * It could have been allocated when memory_hugetlb_accounting was not
> > + * selected, for e.g.
> > + */
> > + VM_WARN_ON_ONCE_FOLIO(!memcg, old);
> > + if (!memcg)
> > + return;
> > +
> > + /* Transfer the charge and the css ref */
> > + commit_charge(new, memcg);
> > + old->memcg_data = 0;
> > +}
> > +
> > DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
> > EXPORT_SYMBOL(memcg_sockets_enabled_key);
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 7d1804c4a5d9..6034c7ed1d65 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -633,8 +633,7 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
> >
> > folio_copy_owner(newfolio, folio);
> >
> > - if (!folio_test_hugetlb(folio))
> > - mem_cgroup_migrate(folio, newfolio);
> > + mem_cgroup_migrate(folio, newfolio);
> > }
> > EXPORT_SYMBOL(folio_migrate_flags);
> >
> > --
> > 2.34.1
> >

2023-10-03 23:55:21

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH] memcontrol: only transfer the memcg data for migration

On Tue, Oct 3, 2023 at 4:31 PM Nhat Pham <[email protected]> wrote:
>
> On Tue, Oct 3, 2023 at 4:22 PM Yosry Ahmed <[email protected]> wrote:
> >
> > On Tue, Oct 3, 2023 at 4:14 PM Nhat Pham <[email protected]> wrote:
> > >
> > > For most migration use cases, only transfer the memcg data from the old
> > > folio to the new folio, and clear the old folio's memcg data. No
> > > charging and uncharging will be done. These use cases include the new
> > > hugetlb memcg accounting behavior (which was not previously handled).
> > >
> > > This shaves off some work on the migration path, and avoids the
> > > temporary double charging of a folio during its migration.
> > >
> > > The only exception is replace_page_cache_folio(), which will use the old
> > > mem_cgroup_migrate() (now renamed to mem_cgroup_replace_folio). In that
> > > context, the isolation of the old page isn't quite as thorough as with
> > > migration, so we cannot use our new implementation directly.
> > >
> > > This patch is the result of the following discussion on the new hugetlb
> > > memcg accounting behavior:
> > >
> > > https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/
> > >
> > > Reported-by: Mike Kravetz <[email protected]>
> > > Closes: https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/
> > > Suggested-by: Johannes Weiner <[email protected]>
> > > Signed-off-by: Nhat Pham <[email protected]>
> >
> > Does this patch fit before or after your series? In both cases I think
> > there might be a problem for bisectability.
>
> Hmm my intention for this patch is as a fixlet.
> (i.e it should be eventually squashed to the second patch of that series).
> I just include the extra context on the fixlet for review purposes.
>
> My apologies - should have been much clearer.
> (Perhaps I should just send out v4 at this point?)
>

It's really up to Andrew, just make it clear what the intention is.

Thanks!

2023-10-04 00:02:53

by Nhat Pham

[permalink] [raw]
Subject: Re: [PATCH] memcontrol: only transfer the memcg data for migration

On Tue, Oct 3, 2023 at 4:54 PM Yosry Ahmed <[email protected]> wrote:
>
> On Tue, Oct 3, 2023 at 4:31 PM Nhat Pham <[email protected]> wrote:
> >
> > On Tue, Oct 3, 2023 at 4:22 PM Yosry Ahmed <[email protected]> wrote:
> > >
> > > On Tue, Oct 3, 2023 at 4:14 PM Nhat Pham <[email protected]> wrote:
> > > >
> > > > For most migration use cases, only transfer the memcg data from the old
> > > > folio to the new folio, and clear the old folio's memcg data. No
> > > > charging and uncharging will be done. These use cases include the new
> > > > hugetlb memcg accounting behavior (which was not previously handled).
> > > >
> > > > This shaves off some work on the migration path, and avoids the
> > > > temporary double charging of a folio during its migration.
> > > >
> > > > The only exception is replace_page_cache_folio(), which will use the old
> > > > mem_cgroup_migrate() (now renamed to mem_cgroup_replace_folio). In that
> > > > context, the isolation of the old page isn't quite as thorough as with
> > > > migration, so we cannot use our new implementation directly.
> > > >
> > > > This patch is the result of the following discussion on the new hugetlb
> > > > memcg accounting behavior:
> > > >
> > > > https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/
> > > >
> > > > Reported-by: Mike Kravetz <[email protected]>
> > > > Closes: https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/
> > > > Suggested-by: Johannes Weiner <[email protected]>
> > > > Signed-off-by: Nhat Pham <[email protected]>
> > >
> > > Does this patch fit before or after your series? In both cases I think
> > > there might be a problem for bisectability.
> >
> > Hmm my intention for this patch is as a fixlet.
> > (i.e it should be eventually squashed to the second patch of that series).
> > I just include the extra context on the fixlet for review purposes.
> >
> > My apologies - should have been much clearer.
> > (Perhaps I should just send out v4 at this point?)
> >
>
> It's really up to Andrew, just make it clear what the intention is.

Thanks for reminding me! That was my oversight.

>
> Thanks!

2023-10-04 00:03:04

by Nhat Pham

[permalink] [raw]
Subject: Re: [PATCH] memcontrol: only transfer the memcg data for migration

On Tue, Oct 3, 2023 at 4:14 PM Nhat Pham <[email protected]> wrote:
>
> For most migration use cases, only transfer the memcg data from the old
> folio to the new folio, and clear the old folio's memcg data. No
> charging and uncharging will be done. These use cases include the new
> hugetlb memcg accounting behavior (which was not previously handled).
>
> This shaves off some work on the migration path, and avoids the
> temporary double charging of a folio during its migration.
>
> The only exception is replace_page_cache_folio(), which will use the old
> mem_cgroup_migrate() (now renamed to mem_cgroup_replace_folio). In that
> context, the isolation of the old page isn't quite as thorough as with
> migration, so we cannot use our new implementation directly.
>
> This patch is the result of the following discussion on the new hugetlb
> memcg accounting behavior:
>
> https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/

For Andrew:
This fixes the following patch (which has not been merged to stable):

https://lore.kernel.org/lkml/[email protected]/

and should ideally be squashed to it. My apologies for not making it clear
in the changelog. Let me know if you'd like to see a new version instead
of this fixlet.

>
> Reported-by: Mike Kravetz <[email protected]>
> Closes: https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/
> Suggested-by: Johannes Weiner <[email protected]>
> Signed-off-by: Nhat Pham <[email protected]>
> ---
> include/linux/memcontrol.h | 7 ++++++
> mm/filemap.c | 2 +-
> mm/memcontrol.c | 45 +++++++++++++++++++++++++++++++++++---
> mm/migrate.c | 3 +--
> 4 files changed, 51 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index a827e2129790..e3eaa123256b 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -711,6 +711,8 @@ static inline void mem_cgroup_uncharge_list(struct list_head *page_list)
>
> void mem_cgroup_cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages);
>
> +void mem_cgroup_replace_folio(struct folio *old, struct folio *new);
> +
> void mem_cgroup_migrate(struct folio *old, struct folio *new);
>
> /**
> @@ -1294,6 +1296,11 @@ static inline void mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
> {
> }
>
> +static inline void mem_cgroup_replace_folio(struct folio *old,
> + struct folio *new)
> +{
> +}
> +
> static inline void mem_cgroup_migrate(struct folio *old, struct folio *new)
> {
> }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 9481ffaf24e6..673745219c82 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -819,7 +819,7 @@ void replace_page_cache_folio(struct folio *old, struct folio *new)
> new->mapping = mapping;
> new->index = offset;
>
> - mem_cgroup_migrate(old, new);
> + mem_cgroup_replace_folio(old, new);
>
> xas_lock_irq(&xas);
> xas_store(&xas, new);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6660684f6f97..cbaa26605b3d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7316,16 +7316,17 @@ void __mem_cgroup_uncharge_list(struct list_head *page_list)
> }
>
> /**
> - * mem_cgroup_migrate - Charge a folio's replacement.
> + * mem_cgroup_replace_folio - Charge a folio's replacement.
> * @old: Currently circulating folio.
> * @new: Replacement folio.
> *
> * Charge @new as a replacement folio for @old. @old will
> - * be uncharged upon free.
> + * be uncharged upon free. This is only used by the page cache
> + * (in replace_page_cache_folio()).
> *
> * Both folios must be locked, @new->mapping must be set up.
> */
> -void mem_cgroup_migrate(struct folio *old, struct folio *new)
> +void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
> {
> struct mem_cgroup *memcg;
> long nr_pages = folio_nr_pages(new);
> @@ -7364,6 +7365,44 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
> local_irq_restore(flags);
> }
>
> +/**
> + * mem_cgroup_migrate - Transfer the memcg data from the old to the new folio.
> + * @old: Currently circulating folio.
> + * @new: Replacement folio.
> + *
> + * Transfer the memcg data from the old folio to the new folio for migration.
> + * The old folio's data info will be cleared. Note that the memory counters
> + * will remain unchanged throughout the process.
> + *
> + * Both folios must be locked, @new->mapping must be set up.
> + */
> +void mem_cgroup_migrate(struct folio *old, struct folio *new)
> +{
> + struct mem_cgroup *memcg;
> +
> + VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
> + VM_BUG_ON_FOLIO(!folio_test_locked(new), new);
> + VM_BUG_ON_FOLIO(folio_test_anon(old) != folio_test_anon(new), new);
> + VM_BUG_ON_FOLIO(folio_nr_pages(old) != folio_nr_pages(new), new);
> +
> + if (mem_cgroup_disabled())
> + return;
> +
> + memcg = folio_memcg(old);
> + /*
> + * Note that it is normal to see !memcg for a hugetlb folio.
> + * It could have been allocated when memory_hugetlb_accounting was not
> + * selected, for e.g.
> + */
> + VM_WARN_ON_ONCE_FOLIO(!memcg, old);
> + if (!memcg)
> + return;
> +
> + /* Transfer the charge and the css ref */
> + commit_charge(new, memcg);
> + old->memcg_data = 0;
> +}
> +
> DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
> EXPORT_SYMBOL(memcg_sockets_enabled_key);
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 7d1804c4a5d9..6034c7ed1d65 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -633,8 +633,7 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
>
> folio_copy_owner(newfolio, folio);
>
> - if (!folio_test_hugetlb(folio))
> - mem_cgroup_migrate(folio, newfolio);
> + mem_cgroup_migrate(folio, newfolio);
> }
> EXPORT_SYMBOL(folio_migrate_flags);
>
> --
> 2.34.1
>

2023-10-04 14:17:30

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] memcontrol: only transfer the memcg data for migration

On Tue, Oct 03, 2023 at 04:14:22PM -0700, Nhat Pham wrote:
> For most migration use cases, only transfer the memcg data from the old
> folio to the new folio, and clear the old folio's memcg data. No
> charging and uncharging will be done. These use cases include the new
> hugetlb memcg accounting behavior (which was not previously handled).
>
> This shaves off some work on the migration path, and avoids the
> temporary double charging of a folio during its migration.
>
> The only exception is replace_page_cache_folio(), which will use the old
> mem_cgroup_migrate() (now renamed to mem_cgroup_replace_folio). In that
> context, the isolation of the old page isn't quite as thorough as with
> migration, so we cannot use our new implementation directly.
>
> This patch is the result of the following discussion on the new hugetlb
> memcg accounting behavior:
>
> https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/
>
> Reported-by: Mike Kravetz <[email protected]>
> Closes: https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/
> Suggested-by: Johannes Weiner <[email protected]>
> Signed-off-by: Nhat Pham <[email protected]>

For squashing, the patch title should be:

hugetlb: memcg: account hugetlb-backed memory in memory controller fix

However, I think this should actually be split out. It changes how all
pages are cgroup-migrated, which is a bit too large of a side effect
for the hugetlb accounting patch itself. Especially because the
reasoning outlined above will get lost once this fixup is folded.

IOW, send one prep patch, to go before the series, which splits out
mem_cgroup_replace_folio() and does the mem_cgroup_migrate() optimization,
with the above explanation.

Then send a fixlet for the hugetlb accounting patch that removes the
!hugetlb-conditional for the mem_cgroup_migrate() call.

If you're clear in the queueing instructions for both patches, Andrew
can probably do it in-place without having to resend everything :)

> +void mem_cgroup_migrate(struct folio *old, struct folio *new)
> +{
> + struct mem_cgroup *memcg;
> +
> + VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
> + VM_BUG_ON_FOLIO(!folio_test_locked(new), new);
> + VM_BUG_ON_FOLIO(folio_test_anon(old) != folio_test_anon(new), new);
> + VM_BUG_ON_FOLIO(folio_nr_pages(old) != folio_nr_pages(new), new);
> +
> + if (mem_cgroup_disabled())
> + return;
> +
> + memcg = folio_memcg(old);
> + /*
> + * Note that it is normal to see !memcg for a hugetlb folio.
> + * It could have been allocated when memory_hugetlb_accounting was not
> + * selected, for e.g.

Is that sentence truncated?

> + */
> + VM_WARN_ON_ONCE_FOLIO(!memcg, old);
> + if (!memcg)
> + return;

If this is expected to happen, it shouldn't warn:

VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(old) && !memcg, old);
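
In context, the check in the new mem_cgroup_migrate() would then read
something like this (sketch only, combining the suggestion above with the
hunk quoted from the patch):

        memcg = folio_memcg(old);
        /*
         * A !memcg hugetlb folio is legitimate: it may have been allocated
         * while memory_hugetlb_accounting was disabled.
         */
        VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(old) && !memcg, old);
        if (!memcg)
                return;

        /* Transfer the charge and the css ref */
        commit_charge(new, memcg);
        old->memcg_data = 0;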

2023-10-04 19:46:08

by Nhat Pham

[permalink] [raw]
Subject: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller (fix)

Ensure hugetlb folio migration also transfers the memcg metadata.

This fixlet should be squashed to the following patch:
https://lore.kernel.org/lkml/[email protected]/
hugetlb: memcg: account hugetlb-backed memory in memory controller

Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Nhat Pham <[email protected]>
---
mm/migrate.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 7d1804c4a5d9..6034c7ed1d65 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -633,8 +633,7 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio)

folio_copy_owner(newfolio, folio);

- if (!folio_test_hugetlb(folio))
- mem_cgroup_migrate(folio, newfolio);
+ mem_cgroup_migrate(folio, newfolio);
}
EXPORT_SYMBOL(folio_migrate_flags);

--
2.34.1

2023-10-06 17:26:31

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller (fix)

On Wed, 4 Oct 2023 12:45:06 -0700 Nhat Pham <[email protected]> wrote:

> Ensure hugetlb folio migration also transfers the memcg metadata.
>
> This fixlet should be squashed to the following patch:
> https://lore.kernel.org/lkml/[email protected]/
> hugetlb: memcg: account hugetlb-backed memory in memory controller
>

I've rather lost track of what's going on here. I'll drop the series
"hugetlb memcg accounting" v3. Please resend everything and let's try
again.

2023-10-06 18:24:13

by Nhat Pham

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller (fix)

On Fri, Oct 6, 2023 at 10:25 AM Andrew Morton <[email protected]> wrote:
>
> On Wed, 4 Oct 2023 12:45:06 -0700 Nhat Pham <[email protected]> wrote:
>
> > Ensure hugetlb folio migration also transfers the memcg metadata.
> >
> > This fixlet should be squashed to the following patch:
> > https://lore.kernel.org/lkml/[email protected]/
> > hugetlb: memcg: account hugetlb-backed memory in memory controller
> >
>
> I've rather lost track of what's going on here. I'll drop the series
> "hugetlb memcg accounting" v3. Please resend everything and let's try
> again.
>

Sure thing! I'll send a v4 with everything cleaned up + squashed
in a moment. Apologies for the confusion.