2020-01-15 01:28:26

by Mina Almasry

Subject: [PATCH v10 1/8] hugetlb_cgroup: Add hugetlb_cgroup reservation counter

These counters will track hugetlb reservations rather than hugetlb
memory faulted in. This patch only adds the counter; the following
patches add the charging and uncharging of the counter.

This is patch 1 of an 8-patch series.

Problem:
Currently, tasks attempting to reserve more hugetlb memory than is available
get a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
However, if a task attempts to reserve more hugetlb memory than its
hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
but will SIGBUS the task when it attempts to fault the memory in.

We have users hitting their hugetlb_cgroup limits, and so we have been
looking at this failure mode. We would like to improve this behavior such
that users violating the hugetlb_cgroup limits get an error at mmap/shmget
time, rather than getting SIGBUS'd when they try to fault the excess memory
in. This gives the user an opportunity to fall back more gracefully to
non-hugetlbfs memory, for example.

The underlying problem is that today's hugetlb_cgroup accounting happens
at hugetlb memory *fault* time, rather than at *reservation* time.
Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
the offending task gets SIGBUS'd.
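
To illustrate the current behavior, here is a minimal userspace sketch
(not part of this series; it assumes the task runs in a hugetlb_cgroup
whose 2MB limit is below the mapped size, and that enough free huge
pages exist):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t size = 20UL * 1024 * 1024; /* ten 2MB huge pages */
	char *p;

	/* Succeeds today even when size exceeds the hugetlb_cgroup
	 * limit, because the limit is only enforced at fault time.
	 */
	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Faulting the memory in charges the cgroup; once the charge
	 * exceeds limit_in_bytes, the task is SIGBUS'd here.
	 */
	memset(p, 0, size);
	printf("faulted %zu bytes\n", size);
	return 0;
}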

Proposed Solution:
A new page counter named
'hugetlb.xMB.resv.[limit|usage|max_usage]_in_bytes'. This counter has
slightly different semantics than
'hugetlb.xMB.[limit|usage|max_usage]_in_bytes':

- While usage_in_bytes tracks all *faulted* hugetlb memory,
resv.usage_in_bytes tracks all *reserved* hugetlb memory and
hugetlb memory faulted in without a prior reservation.

- If a task attempts to reserve more memory than limit_in_bytes allows,
the kernel will allow it to do so. But if a task attempts to reserve
more memory than resv.limit_in_bytes allows, the kernel will fail this
reservation.

This proposal is implemented in this patch series, with tests to verify
functionality and show the usage.
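
For example, once the charging patches later in this series are applied,
a setup like the following fails at mmap() time instead of SIGBUS'ing at
fault time (a sketch, assuming a cgroup-v1 hierarchy mounted at
/dev/cgroup/memory with the calling task already in a cgroup named
'test', similar to the selftests added at the end of the series):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const char *limit =
		"/dev/cgroup/memory/test/hugetlb.2MB.resv.limit_in_bytes";
	const char *val = "10485760"; /* 10MB */
	int fd = open(limit, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0) {
		perror("set resv limit");
		return 1;
	}
	close(fd);

	/* 20MB of reservations against a 10MB resv limit: the
	 * reservation charge fails, so mmap() itself fails.
	 */
	if (mmap(NULL, 20UL * 1024 * 1024, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0) ==
	    MAP_FAILED) {
		perror("mmap (expected to fail)");
		return 0;
	}
	return 1;
}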

Alternatives considered:
1. A new cgroup, instead of only a new page_counter attached to
the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
duplication with hugetlb_cgroup. Keeping hugetlb related page counters under
hugetlb_cgroup seemed cleaner as well.

2. Instead of adding a new counter, we considered adding a sysctl that modifies
the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at
reservation time rather than fault time. Adding a new page_counter seems
better, as userspace could, if it wants, choose to enforce different cgroups
differently: one via limit_in_bytes, and another via
resv.limit_in_bytes. This could be very useful if you're
transitioning how hugetlb memory is partitioned on your system one
cgroup at a time, for example. Also, someone may find a use for both
limit_in_bytes and resv.limit_in_bytes concurrently, and this
approach gives them the option to do so.

Testing:
- The tests added in this series pass.
- Used libhugetlbfs for regression testing.

[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html

Signed-off-by: Mina Almasry <[email protected]>

---
Changes in v10:
- Renamed reservation_* to resv.*

---
include/linux/hugetlb.h | 4 +-
mm/hugetlb_cgroup.c | 115 +++++++++++++++++++++++++++++++++++-----
2 files changed, 104 insertions(+), 15 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1e897e4168ac1..dea6143aa0685 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -432,8 +432,8 @@ struct hstate {
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
#ifdef CONFIG_CGROUP_HUGETLB
/* cgroup control files */
- struct cftype cgroup_files_dfl[5];
- struct cftype cgroup_files_legacy[5];
+ struct cftype cgroup_files_dfl[7];
+ struct cftype cgroup_files_legacy[9];
#endif
char name[HSTATE_NAME_LEN];
};
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index e434b05416c68..209f9b9604d34 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -36,6 +36,11 @@ struct hugetlb_cgroup {
*/
struct page_counter hugepage[HUGE_MAX_HSTATE];

+ /*
+ * the counter to account for hugetlb reservations.
+ */
+ struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
+
atomic_long_t events[HUGE_MAX_HSTATE][HUGETLB_NR_MEMORY_EVENTS];
atomic_long_t events_local[HUGE_MAX_HSTATE][HUGETLB_NR_MEMORY_EVENTS];

@@ -55,6 +60,14 @@ struct hugetlb_cgroup {

static struct hugetlb_cgroup *root_h_cgroup __read_mostly;

+static inline struct page_counter *
+hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg, int idx, bool reserved)
+{
+ if (reserved)
+ return &h_cg->reserved_hugepage[idx];
+ return &h_cg->hugepage[idx];
+}
+
static inline
struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
{
@@ -295,28 +308,42 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,

enum {
RES_USAGE,
+ RES_RESERVATION_USAGE,
RES_LIMIT,
+ RES_RESERVATION_LIMIT,
RES_MAX_USAGE,
+ RES_RESERVATION_MAX_USAGE,
RES_FAILCNT,
+ RES_RESERVATION_FAILCNT,
};

static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
struct page_counter *counter;
+ struct page_counter *reserved_counter;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);

counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
+ reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];

switch (MEMFILE_ATTR(cft->private)) {
case RES_USAGE:
return (u64)page_counter_read(counter) * PAGE_SIZE;
+ case RES_RESERVATION_USAGE:
+ return (u64)page_counter_read(reserved_counter) * PAGE_SIZE;
case RES_LIMIT:
return (u64)counter->max * PAGE_SIZE;
+ case RES_RESERVATION_LIMIT:
+ return (u64)reserved_counter->max * PAGE_SIZE;
case RES_MAX_USAGE:
return (u64)counter->watermark * PAGE_SIZE;
+ case RES_RESERVATION_MAX_USAGE:
+ return (u64)reserved_counter->watermark * PAGE_SIZE;
case RES_FAILCNT:
return counter->failcnt;
+ case RES_RESERVATION_FAILCNT:
+ return reserved_counter->failcnt;
default:
BUG();
}
@@ -338,10 +365,16 @@ static int hugetlb_cgroup_read_u64_max(struct seq_file *seq, void *v)
1 << huge_page_order(&hstates[idx]));

switch (MEMFILE_ATTR(cft->private)) {
+ case RES_RESERVATION_USAGE:
+ counter = &h_cg->reserved_hugepage[idx];
+ /* Fall through. */
case RES_USAGE:
val = (u64)page_counter_read(counter);
seq_printf(seq, "%llu\n", val * PAGE_SIZE);
break;
+ case RES_RESERVATION_LIMIT:
+ counter = &h_cg->reserved_hugepage[idx];
+ /* Fall through. */
case RES_LIMIT:
val = (u64)counter->max;
if (val == limit)
@@ -365,6 +398,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
int ret, idx;
unsigned long nr_pages;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
+ bool reserved = false;

if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
return -EINVAL;
@@ -378,9 +412,14 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));

switch (MEMFILE_ATTR(of_cft(of)->private)) {
+ case RES_RESERVATION_LIMIT:
+ reserved = true;
+ /* Fall through. */
case RES_LIMIT:
mutex_lock(&hugetlb_limit_mutex);
- ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages);
+ ret = page_counter_set_max(hugetlb_cgroup_get_counter(h_cg, idx,
+ reserved),
+ nr_pages);
mutex_unlock(&hugetlb_limit_mutex);
break;
default:
@@ -406,18 +445,26 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
int ret = 0;
- struct page_counter *counter;
+ struct page_counter *counter, *reserved_counter;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));

counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
+ reserved_counter =
+ &h_cg->reserved_hugepage[MEMFILE_IDX(of_cft(of)->private)];

switch (MEMFILE_ATTR(of_cft(of)->private)) {
case RES_MAX_USAGE:
page_counter_reset_watermark(counter);
break;
+ case RES_RESERVATION_MAX_USAGE:
+ page_counter_reset_watermark(reserved_counter);
+ break;
case RES_FAILCNT:
counter->failcnt = 0;
break;
+ case RES_RESERVATION_FAILCNT:
+ reserved_counter->failcnt = 0;
+ break;
default:
ret = -EINVAL;
break;
@@ -472,7 +519,7 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
struct hstate *h = &hstates[idx];

/* format the size */
- mem_fmt(buf, 32, huge_page_size(h));
+ mem_fmt(buf, sizeof(buf), huge_page_size(h));

/* Add the limit file */
cft = &h->cgroup_files_dfl[0];
@@ -482,15 +529,30 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
cft->write = hugetlb_cgroup_write_dfl;
cft->flags = CFTYPE_NOT_ON_ROOT;

- /* Add the current usage file */
+ /* Add the reservation limit file */
cft = &h->cgroup_files_dfl[1];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.resv.max", buf);
+ cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_LIMIT);
+ cft->seq_show = hugetlb_cgroup_read_u64_max;
+ cft->write = hugetlb_cgroup_write_dfl;
+ cft->flags = CFTYPE_NOT_ON_ROOT;
+
+ /* Add the current usage file */
+ cft = &h->cgroup_files_dfl[2];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.current", buf);
cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
cft->seq_show = hugetlb_cgroup_read_u64_max;
cft->flags = CFTYPE_NOT_ON_ROOT;

+ /* Add the current reservation usage file */
+ cft = &h->cgroup_files_dfl[3];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.resv.current", buf);
+ cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_USAGE);
+ cft->seq_show = hugetlb_cgroup_read_u64_max;
+ cft->flags = CFTYPE_NOT_ON_ROOT;
+
/* Add the events file */
- cft = &h->cgroup_files_dfl[2];
+ cft = &h->cgroup_files_dfl[4];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events", buf);
cft->private = MEMFILE_PRIVATE(idx, 0);
cft->seq_show = hugetlb_events_show;
@@ -498,7 +560,7 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
cft->flags = CFTYPE_NOT_ON_ROOT;

/* Add the events.local file */
- cft = &h->cgroup_files_dfl[3];
+ cft = &h->cgroup_files_dfl[5];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events.local", buf);
cft->private = MEMFILE_PRIVATE(idx, 0);
cft->seq_show = hugetlb_events_local_show;
@@ -507,7 +569,7 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
cft->flags = CFTYPE_NOT_ON_ROOT;

/* NULL terminate the last cft */
- cft = &h->cgroup_files_dfl[4];
+ cft = &h->cgroup_files_dfl[6];
memset(cft, 0, sizeof(*cft));

WARN_ON(cgroup_add_dfl_cftypes(&hugetlb_cgrp_subsys,
@@ -521,7 +583,7 @@ static void __init __hugetlb_cgroup_file_legacy_init(int idx)
struct hstate *h = &hstates[idx];

/* format the size */
- mem_fmt(buf, 32, huge_page_size(h));
+ mem_fmt(buf, sizeof(buf), huge_page_size(h));

/* Add the limit file */
cft = &h->cgroup_files_legacy[0];
@@ -530,28 +592,55 @@ static void __init __hugetlb_cgroup_file_legacy_init(int idx)
cft->read_u64 = hugetlb_cgroup_read_u64;
cft->write = hugetlb_cgroup_write_legacy;

- /* Add the usage file */
+ /* Add the reservation limit file */
cft = &h->cgroup_files_legacy[1];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.resv.limit_in_bytes", buf);
+ cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_LIMIT);
+ cft->read_u64 = hugetlb_cgroup_read_u64;
+ cft->write = hugetlb_cgroup_write_legacy;
+
+ /* Add the usage file */
+ cft = &h->cgroup_files_legacy[2];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
cft->read_u64 = hugetlb_cgroup_read_u64;

+ /* Add the reservation usage file */
+ cft = &h->cgroup_files_legacy[3];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.resv.usage_in_bytes", buf);
+ cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_USAGE);
+ cft->read_u64 = hugetlb_cgroup_read_u64;
+
/* Add the MAX usage file */
- cft = &h->cgroup_files_legacy[2];
+ cft = &h->cgroup_files_legacy[4];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
cft->write = hugetlb_cgroup_reset;
cft->read_u64 = hugetlb_cgroup_read_u64;

+ /* Add the MAX reservation usage file */
+ cft = &h->cgroup_files_legacy[5];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.resv.max_usage_in_bytes", buf);
+ cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_MAX_USAGE);
+ cft->write = hugetlb_cgroup_reset;
+ cft->read_u64 = hugetlb_cgroup_read_u64;
+
/* Add the failcntfile */
- cft = &h->cgroup_files_legacy[3];
+ cft = &h->cgroup_files_legacy[6];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
- cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
+ cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
+ cft->write = hugetlb_cgroup_reset;
+ cft->read_u64 = hugetlb_cgroup_read_u64;
+
+ /* Add the reservation failcntfile */
+ cft = &h->cgroup_files_legacy[7];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.resv.failcnt", buf);
+ cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_FAILCNT);
cft->write = hugetlb_cgroup_reset;
cft->read_u64 = hugetlb_cgroup_read_u64;

/* NULL terminate the last cft */
- cft = &h->cgroup_files_legacy[4];
+ cft = &h->cgroup_files_legacy[8];
memset(cft, 0, sizeof(*cft));

WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
--
2.25.0.rc1.283.g88dfdc4193-goog


2020-01-15 01:28:30

by Mina Almasry

Subject: [PATCH v10 4/8] hugetlb: disable region_add file_region coalescing

A follow-up patch in this series adds hugetlb cgroup uncharge info to the
file_region entries in resv->regions. The cgroup uncharge info may
differ for different regions, so they can no longer be coalesced at
region_add time. So, disable region coalescing in region_add in this
patch.

Behavior change:

Say a resv_map exists like this [0->1], [2->3], and [5->6].

Then a region_chg/region_add call comes in with f=0, t=5.

Old code would generate resv->regions: [0->5], [5->6].
New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
[5->6].

Special care needs to be taken to handle the resv->adds_in_progress
variable correctly. In the past, only 1 region would be added for every
region_chg and region_add call. But now, each call may add multiple
regions, so we can no longer increment adds_in_progress by 1 in region_chg,
or decrement adds_in_progress by 1 after region_add or region_abort. Instead,
region_chg calls add_reservation_in_range() to count the number of regions
needed, allocates them, and passes that count to region_add and
region_abort so they can decrement adds_in_progress correctly.

We've also modified the assumption that region_add after region_chg
never fails. region_chg now pre-allocates at least 1 region for
region_add. If region_add needs more regions than region_chg has
allocated for it, then it may fail.
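
For reference, the new calling convention looks roughly like this (a
sketch using the signatures introduced in this patch; should_commit()
is a hypothetical stand-in for the caller-side decision between
committing and aborting):

/* Sketch only: f and t delimit the range in huge pages. */
static long reserve_range(struct resv_map *resv, long f, long t)
{
	long regions_needed, chg;

	chg = region_chg(resv, f, t, &regions_needed);
	if (chg < 0)
		return chg;	/* region_chg failed: nothing to abort */

	if (!should_commit()) {
		/* Give back the adds_in_progress accounting. */
		region_abort(resv, f, t, regions_needed);
		return 0;
	}

	/* May return -ENOMEM if more file_region entries turn out to
	 * be needed than region_chg pre-allocated (e.g. due to races
	 * with other region_add/region_del calls).
	 */
	return region_add(resv, f, t, regions_needed);
}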

Signed-off-by: Mina Almasry <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>

---

Changes in v9:
- Added clarifications in the comments and addressed minor issues from
code review.
Changes in v7:
- region_chg no longer allocates (t-f) / 2 file_region entries.
Changes in v6:
- Fix bug in number of region_caches allocated by region_chg

---
mm/hugetlb.c | 327 +++++++++++++++++++++++++++++++++++----------------
1 file changed, 223 insertions(+), 104 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f1b63946ee95c..de0028e9a8630 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -245,110 +245,179 @@ struct file_region {
long to;
};

+/* Helper that removes a struct file_region from the resv_map cache and returns
+ * it for use.
+ */
+static struct file_region *
+get_file_region_entry_from_cache(struct resv_map *resv, long from, long to)
+{
+ struct file_region *nrg = NULL;
+
+ VM_BUG_ON(resv->region_cache_count <= 0);
+
+ resv->region_cache_count--;
+ nrg = list_first_entry(&resv->region_cache, struct file_region, link);
+ VM_BUG_ON(!nrg);
+ list_del(&nrg->link);
+
+ nrg->from = from;
+ nrg->to = to;
+
+ return nrg;
+}
+
/* Must be called with resv->lock held. Calling this with count_only == true
* will count the number of pages to be added but will not modify the linked
- * list.
+ * list. If regions_needed != NULL and count_only == true, then regions_needed
+ * will indicate the number of file_regions needed in the cache to add the
+ * regions for this range.
*/
static long add_reservation_in_range(struct resv_map *resv, long f, long t,
- bool count_only)
+ long *regions_needed, bool count_only)
{
- long chg = 0;
+ long add = 0;
struct list_head *head = &resv->regions;
+ long last_accounted_offset = f;
struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;

- /* Locate the region we are before or in. */
- list_for_each_entry(rg, head, link)
- if (f <= rg->to)
- break;
-
- /* Round our left edge to the current segment if it encloses us. */
- if (f > rg->from)
- f = rg->from;
+ if (regions_needed)
+ *regions_needed = 0;

- chg = t - f;
+ /* In this loop, we essentially handle an entry for the range
+ * [last_accounted_offset, rg->from), at every iteration, with some
+ * bounds checking.
+ */
+ list_for_each_entry_safe(rg, trg, head, link) {
+ /* Skip irrelevant regions that start before our range. */
+ if (rg->from < f) {
+ /* If this region ends after the last accounted offset,
+ * then we need to update last_accounted_offset.
+ */
+ if (rg->to > last_accounted_offset)
+ last_accounted_offset = rg->to;
+ continue;
+ }

- /* Check for and consume any regions we now overlap with. */
- nrg = rg;
- list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
- if (&rg->link == head)
- break;
+ /* When we find a region that starts beyond our range, we've
+ * finished.
+ */
if (rg->from > t)
break;

- /* We overlap with this area, if it extends further than
- * us then we must extend ourselves. Account for its
- * existing reservation.
+ /* Add an entry for last_accounted_offset -> rg->from, and
+ * update last_accounted_offset.
*/
- if (rg->to > t) {
- chg += rg->to - t;
- t = rg->to;
+ if (rg->from > last_accounted_offset) {
+ add += rg->from - last_accounted_offset;
+ if (!count_only) {
+ nrg = get_file_region_entry_from_cache(
+ resv, last_accounted_offset, rg->from);
+ list_add(&nrg->link, rg->link.prev);
+ } else if (regions_needed)
+ *regions_needed += 1;
}
- chg -= rg->to - rg->from;

- if (!count_only && rg != nrg) {
- list_del(&rg->link);
- kfree(rg);
- }
+ last_accounted_offset = rg->to;
}

- if (!count_only) {
- nrg->from = f;
- nrg->to = t;
+ /* Handle the case where our range extends beyond
+ * last_accounted_offset.
+ */
+ if (last_accounted_offset < t) {
+ add += t - last_accounted_offset;
+ if (!count_only) {
+ nrg = get_file_region_entry_from_cache(
+ resv, last_accounted_offset, t);
+ list_add(&nrg->link, rg->link.prev);
+ } else if (regions_needed)
+ *regions_needed += 1;
}

- return chg;
+ return add;
}

/*
* Add the huge page range represented by [f, t) to the reserve
- * map. Existing regions will be expanded to accommodate the specified
- * range, or a region will be taken from the cache. Sufficient regions
- * must exist in the cache due to the previous call to region_chg with
- * the same range.
+ * map. Regions will be taken from the cache to fill in this range.
+ * Sufficient regions should exist in the cache due to the previous
+ * call to region_chg with the same range, but in some cases the cache will not
+ * have sufficient entries due to races with other code doing region_add or
+ * region_del. The extra needed entries will be allocated.
*
- * Return the number of new huge pages added to the map. This
- * number is greater than or equal to zero.
+ * regions_needed is the out value provided by a previous call to region_chg.
+ *
+ * Return the number of new huge pages added to the map. This number is greater
+ * than or equal to zero. If file_region entries needed to be allocated for
+ * this operation and we were not able to allocate, it returns -ENOMEM.
+ * region_add of regions of length 1 never allocate file_regions and cannot
+ * fail; region_chg will always allocate at least 1 entry and a region_add for
+ * 1 page will only require at most 1 entry.
*/
-static long region_add(struct resv_map *resv, long f, long t)
+static long region_add(struct resv_map *resv, long f, long t,
+ long in_regions_needed)
{
- struct list_head *head = &resv->regions;
- struct file_region *rg, *nrg;
- long add = 0;
+ long add = 0, actual_regions_needed = 0, i = 0;
+ struct file_region *trg = NULL, *rg = NULL;
+ struct list_head allocated_regions;
+
+ INIT_LIST_HEAD(&allocated_regions);

spin_lock(&resv->lock);
- /* Locate the region we are either in or before. */
- list_for_each_entry(rg, head, link)
- if (f <= rg->to)
- break;
+retry:
+
+ /* Count how many regions are actually needed to execute this add. */
+ add_reservation_in_range(resv, f, t, &actual_regions_needed, true);

/*
- * If no region exists which can be expanded to include the
- * specified range, pull a region descriptor from the cache
- * and use it for this range.
+ * Check for sufficient descriptors in the cache to accommodate
+ * this add operation. Note that actual_regions_needed may be greater
+ * than in_regions_needed. In this case, we need to make sure that we
+ * allocate extra entries, such that we have enough for all the
+ * existing adds_in_progress, plus the excess needed for this
+ * operation.
*/
- if (&rg->link == head || t < rg->from) {
- VM_BUG_ON(resv->region_cache_count <= 0);
+ if (resv->region_cache_count <
+ resv->adds_in_progress +
+ (actual_regions_needed - in_regions_needed)) {
+ /* region_add operation of range 1 should never need to
+ * allocate file_region entries.
+ */
+ VM_BUG_ON(t - f <= 1);

- resv->region_cache_count--;
- nrg = list_first_entry(&resv->region_cache, struct file_region,
- link);
- list_del(&nrg->link);
+ /* Must drop lock to allocate a new descriptor. */
+ spin_unlock(&resv->lock);
+ for (i = 0; i < (actual_regions_needed - in_regions_needed);
+ i++) {
+ trg = kmalloc(sizeof(*trg), GFP_KERNEL);
+ if (!trg)
+ goto out_of_memory;
+ list_add(&trg->link, &allocated_regions);
+ }
+ spin_lock(&resv->lock);

- nrg->from = f;
- nrg->to = t;
- list_add(&nrg->link, rg->link.prev);
+ list_for_each_entry_safe(rg, trg, &allocated_regions, link) {
+ list_del(&rg->link);
+ list_add(&rg->link, &resv->region_cache);
+ resv->region_cache_count++;
+ }

- add += t - f;
- goto out_locked;
+ goto retry;
}

- add = add_reservation_in_range(resv, f, t, false);
+ add = add_reservation_in_range(resv, f, t, NULL, false);
+
+ resv->adds_in_progress -= in_regions_needed;

-out_locked:
- resv->adds_in_progress--;
spin_unlock(&resv->lock);
VM_BUG_ON(add < 0);
return add;
+
+out_of_memory:
+ list_for_each_entry_safe(rg, trg, &allocated_regions, link) {
+ list_del(&rg->link);
+ kfree(rg);
+ }
+ return -ENOMEM;
}

/*
@@ -358,49 +427,79 @@ static long region_add(struct resv_map *resv, long f, long t)
* call to region_add that will actually modify the reserve
* map to add the specified range [f, t). region_chg does
* not change the number of huge pages represented by the
- * map. A new file_region structure is added to the cache
- * as a placeholder, so that the subsequent region_add
- * call will have all the regions it needs and will not fail.
+ * map. A number of new file_region structures is added to the cache as a
+ * placeholder, for the subsequent region_add call to use. At least 1
+ * file_region structure is added.
+ *
+ * out_regions_needed is the number of regions added to the
+ * resv->adds_in_progress. This value needs to be provided to a follow up call
+ * to region_add or region_abort for proper accounting.
*
* Returns the number of huge pages that need to be added to the existing
* reservation map for the range [f, t). This number is greater or equal to
* zero. -ENOMEM is returned if a new file_region structure or cache entry
* is needed and can not be allocated.
*/
-static long region_chg(struct resv_map *resv, long f, long t)
+static long region_chg(struct resv_map *resv, long f, long t,
+ long *out_regions_needed)
{
- long chg = 0;
+ struct file_region *trg = NULL, *rg = NULL;
+ long chg = 0, i = 0, to_allocate = 0;
+ struct list_head allocated_regions;
+
+ INIT_LIST_HEAD(&allocated_regions);

spin_lock(&resv->lock);
-retry_locked:
- resv->adds_in_progress++;
+
+ /* Count how many hugepages in this range are NOT represented. */
+ chg = add_reservation_in_range(resv, f, t, out_regions_needed, true);
+
+ if (*out_regions_needed == 0)
+ *out_regions_needed = 1;
+
+ resv->adds_in_progress += *out_regions_needed;

/*
* Check for sufficient descriptors in the cache to accommodate
* the number of in progress add operations.
*/
- if (resv->adds_in_progress > resv->region_cache_count) {
- struct file_region *trg;
-
- VM_BUG_ON(resv->adds_in_progress - resv->region_cache_count > 1);
- /* Must drop lock to allocate a new descriptor. */
- resv->adds_in_progress--;
+ while (resv->region_cache_count < resv->adds_in_progress) {
+ to_allocate = resv->adds_in_progress - resv->region_cache_count;
+
+ /* Must drop lock to allocate a new descriptor. Note that even
+ * though we drop the lock here, we do not make another call to
+ * add_reservation_in_range after re-acquiring the lock.
+ * Essentially this branch makes sure that we have enough
+ * descriptors in the cache as suggested by the first call to
+ * add_reservation_in_range. If more regions turn out to be
+ * required, region_add will deal with it.
+ */
spin_unlock(&resv->lock);
-
- trg = kmalloc(sizeof(*trg), GFP_KERNEL);
- if (!trg)
- return -ENOMEM;
+ for (i = 0; i < to_allocate; i++) {
+ trg = kmalloc(sizeof(*trg), GFP_KERNEL);
+ if (!trg)
+ goto out_of_memory;
+ list_add(&trg->link, &allocated_regions);
+ }

spin_lock(&resv->lock);
- list_add(&trg->link, &resv->region_cache);
- resv->region_cache_count++;
- goto retry_locked;
- }

- chg = add_reservation_in_range(resv, f, t, true);
+ list_for_each_entry_safe(rg, trg, &allocated_regions, link) {
+ list_del(&rg->link);
+ list_add(&rg->link, &resv->region_cache);
+ resv->region_cache_count++;
+ }
+ }

spin_unlock(&resv->lock);
return chg;
+
+out_of_memory:
+ list_for_each_entry_safe(rg, trg, &allocated_regions, link) {
+ list_del(&rg->link);
+ kfree(rg);
+ }
+ return -ENOMEM;
}

/*
@@ -408,17 +507,20 @@ static long region_chg(struct resv_map *resv, long f, long t)
* of the resv_map keeps track of the operations in progress between
* calls to region_chg and region_add. Operations are sometimes
* aborted after the call to region_chg. In such cases, region_abort
- * is called to decrement the adds_in_progress counter.
+ * is called to decrement the adds_in_progress counter. regions_needed
+ * is the value returned by the region_chg call; it is used to decrement
+ * the adds_in_progress counter.
*
* NOTE: The range arguments [f, t) are not needed or used in this
* routine. They are kept to make reading the calling code easier as
* arguments will match the associated region_chg call.
*/
-static void region_abort(struct resv_map *resv, long f, long t)
+static void region_abort(struct resv_map *resv, long f, long t,
+ long regions_needed)
{
spin_lock(&resv->lock);
VM_BUG_ON(!resv->region_cache_count);
- resv->adds_in_progress--;
+ resv->adds_in_progress -= regions_needed;
spin_unlock(&resv->lock);
}

@@ -1883,6 +1985,7 @@ static long __vma_reservation_common(struct hstate *h,
struct resv_map *resv;
pgoff_t idx;
long ret;
+ long dummy_out_regions_needed;

resv = vma_resv_map(vma);
if (!resv)
@@ -1891,20 +1994,29 @@ static long __vma_reservation_common(struct hstate *h,
idx = vma_hugecache_offset(h, vma, addr);
switch (mode) {
case VMA_NEEDS_RESV:
- ret = region_chg(resv, idx, idx + 1);
+ ret = region_chg(resv, idx, idx + 1, &dummy_out_regions_needed);
+ /* We assume that vma_reservation_* routines always operate on
+ * 1 page, and that adding to resv map a 1 page entry can only
+ * ever require 1 region.
+ */
+ VM_BUG_ON(dummy_out_regions_needed != 1);
break;
case VMA_COMMIT_RESV:
- ret = region_add(resv, idx, idx + 1);
+ ret = region_add(resv, idx, idx + 1, 1);
+ /* region_add calls of range 1 should never fail. */
+ VM_BUG_ON(ret < 0);
break;
case VMA_END_RESV:
- region_abort(resv, idx, idx + 1);
+ region_abort(resv, idx, idx + 1, 1);
ret = 0;
break;
case VMA_ADD_RESV:
- if (vma->vm_flags & VM_MAYSHARE)
- ret = region_add(resv, idx, idx + 1);
- else {
- region_abort(resv, idx, idx + 1);
+ if (vma->vm_flags & VM_MAYSHARE) {
+ ret = region_add(resv, idx, idx + 1, 1);
+ /* region_add calls of range 1 should never fail. */
+ VM_BUG_ON(ret < 0);
+ } else {
+ region_abort(resv, idx, idx + 1, 1);
ret = region_del(resv, idx, idx + 1);
}
break;
@@ -4563,12 +4675,12 @@ int hugetlb_reserve_pages(struct inode *inode,
struct vm_area_struct *vma,
vm_flags_t vm_flags)
{
- long ret, chg;
+ long ret, chg, add = -1;
struct hstate *h = hstate_inode(inode);
struct hugepage_subpool *spool = subpool_inode(inode);
struct resv_map *resv_map;
struct hugetlb_cgroup *h_cg;
- long gbl_reserve;
+ long gbl_reserve, regions_needed = 0;

/* This should never happen */
if (from > to) {
@@ -4598,7 +4710,7 @@ int hugetlb_reserve_pages(struct inode *inode,
*/
resv_map = inode_resv_map(inode);

- chg = region_chg(resv_map, from, to);
+ chg = region_chg(resv_map, from, to, &regions_needed);

} else {
/* Private mapping. */
@@ -4668,9 +4780,14 @@ int hugetlb_reserve_pages(struct inode *inode,
* else has to be done for private mappings here
*/
if (!vma || vma->vm_flags & VM_MAYSHARE) {
- long add = region_add(resv_map, from, to);
-
- if (unlikely(chg > add)) {
+ add = region_add(resv_map, from, to, regions_needed);
+
+ if (unlikely(add < 0)) {
+ hugetlb_acct_memory(h, -gbl_reserve);
+ /* put back original number of pages, chg */
+ (void)hugepage_subpool_put_pages(spool, chg);
+ goto out_err;
+ } else if (unlikely(chg > add)) {
/*
* pages in this range were added to the reserve
* map between region_chg and region_add. This
@@ -4688,9 +4805,11 @@ int hugetlb_reserve_pages(struct inode *inode,
return 0;
out_err:
if (!vma || vma->vm_flags & VM_MAYSHARE)
- /* Don't call region_abort if region_chg failed */
- if (chg >= 0)
- region_abort(resv_map, from, to);
+ /* Only call region_abort if the region_chg succeeded but the
+ * region_add failed or didn't run.
+ */
+ if (chg >= 0 && add < 0)
+ region_abort(resv_map, from, to, regions_needed);
if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
kref_put(&resv_map->refs, resv_map_release);
return ret;
--
2.25.0.rc1.283.g88dfdc4193-goog

2020-01-15 01:28:37

by Mina Almasry

Subject: [PATCH v10 6/8] hugetlb_cgroup: support noreserve mappings

Support MAP_NORESERVE accounting as part of the new counter.

For each hugepage allocation, at allocation time we check if there is
a reservation for this allocation or not. If there is a reservation for
this allocation, then this allocation was charged at reservation time,
and we don't re-account it. If there is no reservation for this
allocation, we charge the appropriate hugetlb_cgroup.
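
For instance (a minimal sketch, assuming free 2MB huge pages are
available), a MAP_NORESERVE mapping takes this path; nothing is charged
at mmap() time, and the first fault charges the cgroup:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/* MAP_NORESERVE: no reservation is created, so nothing is
	 * charged to the hugetlb_cgroup at mmap() time.
	 */
	char *p = mmap(NULL, 2UL * 1024 * 1024, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
			       MAP_NORESERVE, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* The fault allocates a huge page with no reservation to
	 * consume, so alloc_huge_page() charges both the usage and
	 * the reservation counter here.
	 */
	p[0] = 1;
	return 0;
}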

The hugetlb_cgroup to uncharge for this allocation is stored in
page[3].private. We use new APIs added in an earlier patch to set this
pointer.

Signed-off-by: Mina Almasry <[email protected]>

---

Changes in v10:
- Refactored deferred_reserve check.

---
mm/hugetlb.c | 28 +++++++++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9bcfc12c5d214..d3f107626f927 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1374,6 +1374,9 @@ static void __free_huge_page(struct page *page)
clear_page_huge_active(page);
hugetlb_cgroup_uncharge_page(hstate_index(h), pages_per_huge_page(h),
page, false);
+ hugetlb_cgroup_uncharge_page(hstate_index(h), pages_per_huge_page(h),
+ page, true);
+
if (restore_reserve)
h->resv_huge_pages++;

@@ -2207,6 +2210,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
long gbl_chg;
int ret, idx;
struct hugetlb_cgroup *h_cg;
+ bool deferred_reserve;

idx = hstate_index(h);
/*
@@ -2244,10 +2248,20 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
gbl_chg = 1;
}

+ /* If this allocation is not consuming a reservation, charge it now.
+ */
+ deferred_reserve = map_chg || avoid_reserve || !vma_resv_map(vma);
+ if (deferred_reserve) {
+ ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h),
+ &h_cg, true);
+ if (ret)
+ goto out_subpool_put;
+ }
+
ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg,
false);
if (ret)
- goto out_subpool_put;
+ goto out_uncharge_cgroup_reservation;

spin_lock(&hugetlb_lock);
/*
@@ -2271,6 +2285,14 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
}
hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page,
false);
+ /* If allocation is not consuming a reservation, also store the
+ * hugetlb_cgroup pointer on the page.
+ */
+ if (deferred_reserve) {
+ hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg,
+ page, true);
+ }
+
spin_unlock(&hugetlb_lock);

set_page_private(page, (unsigned long)spool);
@@ -2296,6 +2318,10 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
out_uncharge_cgroup:
hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg,
false);
+out_uncharge_cgroup_reservation:
+ if (deferred_reserve)
+ hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h),
+ h_cg, true);
out_subpool_put:
if (map_chg || avoid_reserve)
hugepage_subpool_put_pages(spool, 1);
--
2.25.0.rc1.283.g88dfdc4193-goog

2020-01-15 01:28:44

by Mina Almasry

Subject: [PATCH v10 7/8] hugetlb_cgroup: Add hugetlb_cgroup reservation tests

The tests use both shared and private mapped hugetlb memory, and
monitor the hugetlb usage counter as well as the hugetlb reservation
counter. They test different configurations, such as hugetlb memory
used via hugetlbfs, MAP_HUGETLB, or shmget/shmat, with and without
MAP_POPULATE.

Also add a test for hugetlb reservation reparenting, since this is
a subtle issue.

Signed-off-by: Mina Almasry <[email protected]>

---

Changes in v10:
- Updated tests to resv.* name changes.
Changes in v9:
- Added tests for hugetlb reparenting.
- Make tests explicitly support cgroup v1 and v2 via script argument.
Changes in v6:
- Updates tests for cgroups-v2 and NORESERVE allocations.

---
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 1 +
.../selftests/vm/charge_reserved_hugetlb.sh | 557 ++++++++++++++++++
.../selftests/vm/hugetlb_reparenting_test.sh | 235 ++++++++
.../selftests/vm/write_hugetlb_memory.sh | 23 +
.../testing/selftests/vm/write_to_hugetlbfs.c | 261 ++++++++
6 files changed, 1078 insertions(+)
create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
create mode 100755 tools/testing/selftests/vm/hugetlb_reparenting_test.sh
create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 31b3c98b6d34d..d3bed9407773c 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -14,3 +14,4 @@ virtual_address_range
gup_benchmark
va_128TBswitch
map_fixed_noreplace
+write_to_hugetlbfs
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 7f9a8a8c31da9..662bb95e84c5d 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -22,6 +22,7 @@ TEST_GEN_FILES += userfaultfd
ifneq (,$(filter $(ARCH),arm64 ia64 mips64 parisc64 ppc64 riscv64 s390x sh64 sparc64 x86_64))
TEST_GEN_FILES += va_128TBswitch
TEST_GEN_FILES += virtual_address_range
+TEST_GEN_FILES += write_to_hugetlbfs
endif

TEST_PROGS := run_vmtests
diff --git a/tools/testing/selftests/vm/charge_reserved_hugetlb.sh b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
new file mode 100755
index 0000000000000..8f2413352514b
--- /dev/null
+++ b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
@@ -0,0 +1,557 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+if [[ $(id -u) -ne 0 ]]; then
+ echo "This test must be run as root. Skipping..."
+ exit 0
+fi
+
+fault_limit_file=limit_in_bytes
+reservation_limit_file=resv.limit_in_bytes
+fault_usage_file=usage_in_bytes
+reservation_usage_file=resv.usage_in_bytes
+
+if [[ "$1" == "-cgroup-v2" ]]; then
+ cgroup2=1
+ fault_limit_file=max
+ reservation_limit_file=resv.max
+ fault_usage_file=current
+ reservation_usage_file=resv.current
+fi
+
+cgroup_path=/dev/cgroup/memory
+if [[ ! -e $cgroup_path ]]; then
+ mkdir -p $cgroup_path
+ if [[ $cgroup2 ]]; then
+ mount -t cgroup2 none $cgroup_path
+ else
+ mount -t cgroup memory,hugetlb $cgroup_path
+ fi
+fi
+
+if [[ $cgroup2 ]]; then
+ echo "+hugetlb" >/dev/cgroup/memory/cgroup.subtree_control
+fi
+
+cleanup() {
+ if [[ $cgroup2 ]]; then
+ echo $$ >$cgroup_path/cgroup.procs
+ else
+ echo $$ >$cgroup_path/tasks
+ fi
+
+ if [[ -e /mnt/huge ]]; then
+ rm -rf /mnt/huge/*
+ umount /mnt/huge || echo error
+ rmdir /mnt/huge
+ fi
+ if [[ -e $cgroup_path/hugetlb_cgroup_test ]]; then
+ rmdir $cgroup_path/hugetlb_cgroup_test
+ fi
+ if [[ -e $cgroup_path/hugetlb_cgroup_test1 ]]; then
+ rmdir $cgroup_path/hugetlb_cgroup_test1
+ fi
+ if [[ -e $cgroup_path/hugetlb_cgroup_test2 ]]; then
+ rmdir $cgroup_path/hugetlb_cgroup_test2
+ fi
+ echo 0 >/proc/sys/vm/nr_hugepages
+ echo CLEANUP DONE
+}
+
+function expect_equal() {
+ local expected="$1"
+ local actual="$2"
+ local error="$3"
+
+ if [[ "$expected" != "$actual" ]]; then
+ echo "expected ($expected) != actual ($actual): $3"
+ cleanup
+ exit 1
+ fi
+}
+
+function setup_cgroup() {
+ local name="$1"
+ local cgroup_limit="$2"
+ local reservation_limit="$3"
+
+ mkdir $cgroup_path/$name
+
+ echo writing cgroup limit: "$cgroup_limit"
+ echo "$cgroup_limit" >$cgroup_path/$name/hugetlb.2MB.$fault_limit_file
+
+ echo writing reservation limit: "$reservation_limit"
+ echo "$reservation_limit" > \
+ $cgroup_path/$name/hugetlb.2MB.$reservation_limit_file
+
+ if [ -e "$cgroup_path/$name/cpuset.cpus" ]; then
+ echo 0 >$cgroup_path/$name/cpuset.cpus
+ fi
+ if [ -e "$cgroup_path/$name/cpuset.mems" ]; then
+ echo 0 >$cgroup_path/$name/cpuset.mems
+ fi
+}
+
+function wait_for_hugetlb_memory_to_get_depleted() {
+ local cgroup="$1"
+ local path="/dev/cgroup/memory/$cgroup/hugetlb.2MB.$reservation_usage_file"
+ # Wait for hugetlbfs memory to get depleted.
+ while [ $(cat $path) != 0 ]; do
+ echo Waiting for hugetlb memory to get depleted.
+ cat $path
+ sleep 0.5
+ done
+}
+
+function wait_for_hugetlb_memory_to_get_reserved() {
+ local cgroup="$1"
+ local size="$2"
+
+ local path="/dev/cgroup/memory/$cgroup/hugetlb.2MB.$reservation_usage_file"
+ # Wait for hugetlbfs memory to get written.
+ while [ $(cat $path) != $size ]; do
+ echo Waiting for hugetlb memory reservation to reach size $size.
+ cat $path
+ sleep 0.5
+ done
+}
+
+function wait_for_hugetlb_memory_to_get_written() {
+ local cgroup="$1"
+ local size="$2"
+
+ local path="/dev/cgroup/memory/$cgroup/hugetlb.2MB.$fault_usage_file"
+ # Wait for hugetlbfs memory to get written.
+ while [ $(cat $path) != $size ]; do
+ echo Waiting for hugetlb memory to reach size $size.
+ cat $path
+ sleep 0.5
+ done
+}
+
+function write_hugetlbfs_and_get_usage() {
+ local cgroup="$1"
+ local size="$2"
+ local populate="$3"
+ local write="$4"
+ local path="$5"
+ local method="$6"
+ local private="$7"
+ local expect_failure="$8"
+ local reserve="$9"
+
+ # Function return values.
+ reservation_failed=0
+ oom_killed=0
+ hugetlb_difference=0
+ reserved_difference=0
+
+ local hugetlb_usage=$cgroup_path/$cgroup/hugetlb.2MB.$fault_usage_file
+ local reserved_usage=$cgroup_path/$cgroup/hugetlb.2MB.$reservation_usage_file
+
+ local hugetlb_before=$(cat $hugetlb_usage)
+ local reserved_before=$(cat $reserved_usage)
+
+ echo
+ echo Starting:
+ echo hugetlb_usage="$hugetlb_before"
+ echo reserved_usage="$reserved_before"
+ echo expect_failure is "$expect_failure"
+
+ set +e
+ if [[ "$method" == "1" ]] || [[ "$method" == 2 ]] ||
+ [[ "$private" == "-r" ]] && [[ "$expect_failure" != 1 ]]; then
+
+ bash write_hugetlb_memory.sh "$size" "$populate" "$write" \
+ "$cgroup" "$path" "$method" "$private" "-l" "$reserve" &
+
+ local write_result=$?
+
+ if [[ "$reserve" != "-n" ]]; then
+ wait_for_hugetlb_memory_to_get_reserved "$cgroup" "$size"
+ elif [[ "$populate" == "-o" ]] || [[ "$write" == "-w" ]]; then
+ wait_for_hugetlb_memory_to_get_written "$cgroup" "$size"
+ else
+ # This case doesn't produce visible effects, but we still have
+ # to wait for the async process to start and execute...
+ sleep 0.5
+ fi
+
+ echo write_result is $write_result
+ else
+ bash write_hugetlb_memory.sh "$size" "$populate" "$write" \
+ "$cgroup" "$path" "$method" "$private" "$reserve"
+ local write_result=$?
+
+ if [[ "$reserve" != "-n" ]]; then
+ wait_for_hugetlb_memory_to_get_reserved "$cgroup" "$size"
+ fi
+ fi
+ set -e
+
+ if [[ "$write_result" == 1 ]]; then
+ reservation_failed=1
+ fi
+
+ # On linus/master, the above process gets SIGBUS'd on oomkill, with
+ # return code 135. On earlier kernels, it gets actual oomkill, with return
+ # code 137, so just check for both conditions in case we're testing
+ # against an earlier kernel.
+ if [[ "$write_result" == 135 ]] || [[ "$write_result" == 137 ]]; then
+ oom_killed=1
+ fi
+
+ local hugetlb_after=$(cat $hugetlb_usage)
+ local reserved_after=$(cat $reserved_usage)
+
+ echo After write:
+ echo hugetlb_usage="$hugetlb_after"
+ echo reserved_usage="$reserved_after"
+
+ hugetlb_difference=$(($hugetlb_after - $hugetlb_before))
+ reserved_difference=$(($reserved_after - $reserved_before))
+}
+
+function cleanup_hugetlb_memory() {
+ set +e
+ local cgroup="$1"
+ if [[ "$(pgrep write_to_hugetlbfs)" != "" ]]; then
+ echo killing write_to_hugetlbfs
+ killall -2 write_to_hugetlbfs
+ wait_for_hugetlb_memory_to_get_depleted $cgroup
+ fi
+ set -e
+
+ if [[ -e /mnt/huge ]]; then
+ rm -rf /mnt/huge/*
+ umount /mnt/huge
+ rmdir /mnt/huge
+ fi
+}
+
+function run_test() {
+ local size="$1"
+ local populate="$2"
+ local write="$3"
+ local cgroup_limit="$4"
+ local reservation_limit="$5"
+ local nr_hugepages="$6"
+ local method="$7"
+ local private="$8"
+ local expect_failure="$9"
+ local reserve="${10}"
+
+ # Function return values.
+ hugetlb_difference=0
+ reserved_difference=0
+ reservation_failed=0
+ oom_killed=0
+
+ echo nr hugepages = "$nr_hugepages"
+ echo "$nr_hugepages" >/proc/sys/vm/nr_hugepages
+
+ setup_cgroup "hugetlb_cgroup_test" "$cgroup_limit" "$reservation_limit"
+
+ mkdir -p /mnt/huge
+ mount -t hugetlbfs \
+ -o pagesize=2M,size=256M none /mnt/huge
+
+ write_hugetlbfs_and_get_usage "hugetlb_cgroup_test" "$size" "$populate" \
+ "$write" "/mnt/huge/test" "$method" "$private" "$expect_failure" \
+ "$reserve"
+
+ cleanup_hugetlb_memory "hugetlb_cgroup_test"
+
+ local final_hugetlb=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.2MB.$fault_usage_file)
+ local final_reservation=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.2MB.$reservation_usage_file)
+
+ echo $hugetlb_difference
+ echo $reserved_difference
+ expect_equal "0" "$final_hugetlb" "final hugetlb is not zero"
+ expect_equal "0" "$final_reservation" "final reservation is not zero"
+}
+
+function run_multiple_cgroup_test() {
+ local size1="$1"
+ local populate1="$2"
+ local write1="$3"
+ local cgroup_limit1="$4"
+ local reservation_limit1="$5"
+
+ local size2="$6"
+ local populate2="$7"
+ local write2="$8"
+ local cgroup_limit2="$9"
+ local reservation_limit2="${10}"
+
+ local nr_hugepages="${11}"
+ local method="${12}"
+ local private="${13}"
+ local expect_failure="${14}"
+ local reserve="${15}"
+
+ # Function return values.
+ hugetlb_difference1=0
+ reserved_difference1=0
+ reservation_failed1=0
+ oom_killed1=0
+
+ hugetlb_difference2=0
+ reserved_difference2=0
+ reservation_failed2=0
+ oom_killed2=0
+
+ echo nr hugepages = "$nr_hugepages"
+ echo "$nr_hugepages" >/proc/sys/vm/nr_hugepages
+
+ setup_cgroup "hugetlb_cgroup_test1" "$cgroup_limit1" "$reservation_limit1"
+ setup_cgroup "hugetlb_cgroup_test2" "$cgroup_limit2" "$reservation_limit2"
+
+ mkdir -p /mnt/huge
+ mount -t hugetlbfs \
+ -o pagesize=2M,size=256M none /mnt/huge
+
+ write_hugetlbfs_and_get_usage "hugetlb_cgroup_test1" "$size1" \
+ "$populate1" "$write1" "/mnt/huge/test1" "$method" "$private" \
+ "$expect_failure" "$reserve"
+
+ hugetlb_difference1=$hugetlb_difference
+ reserved_difference1=$reserved_difference
+ reservation_failed1=$reservation_failed
+ oom_killed1=$oom_killed
+
+ local cgroup1_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.2MB.$fault_usage_file
+ local cgroup1_reservation_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.2MB.$reservation_usage_file
+ local cgroup2_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.2MB.$fault_usage_file
+ local cgroup2_reservation_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.2MB.$reservation_usage_file
+
+ local usage_before_second_write=$(cat $cgroup1_hugetlb_usage)
+ local reservation_usage_before_second_write=$(cat \
+ $cgroup1_reservation_usage)
+
+ write_hugetlbfs_and_get_usage "hugetlb_cgroup_test2" "$size2" \
+ "$populate2" "$write2" "/mnt/huge/test2" "$method" "$private" \
+ "$expect_failure" "$reserve"
+
+ hugetlb_difference2=$hugetlb_difference
+ reserved_difference2=$reserved_difference
+ reservation_failed2=$reservation_failed
+ oom_killed2=$oom_killed
+
+ expect_equal "$usage_before_second_write" \
+ "$(cat $cgroup1_hugetlb_usage)" "Usage changed."
+ expect_equal "$reservation_usage_before_second_write" \
+ "$(cat $cgroup1_reservation_usage)" "Reservation usage changed."
+
+ cleanup_hugetlb_memory
+
+ local final_hugetlb=$(cat $cgroup1_hugetlb_usage)
+ local final_reservation=$(cat $cgroup1_reservation_usage)
+
+ expect_equal "0" "$final_hugetlb" \
+ "hugetlbt_cgroup_test1 final hugetlb is not zero"
+ expect_equal "0" "$final_reservation" \
+ "hugetlbt_cgroup_test1 final reservation is not zero"
+
+ local final_hugetlb=$(cat $cgroup2_hugetlb_usage)
+ local final_reservation=$(cat $cgroup2_reservation_usage)
+
+ expect_equal "0" "$final_hugetlb" \
+ "hugetlb_cgroup_test2 final hugetlb is not zero"
+ expect_equal "0" "$final_reservation" \
+ "hugetlb_cgroup_test2 final reservation is not zero"
+}
+
+cleanup
+
+for populate in "" "-o"; do
+ for method in 0 1 2; do
+ for private in "" "-r"; do
+ for reserve in "" "-n"; do
+
+ # Skip mmap(MAP_HUGETLB | MAP_SHARED). Doesn't seem to be supported.
+ if [[ "$method" == 1 ]] && [[ "$private" == "" ]]; then
+ continue
+ fi
+
+ # Skip populated shmem tests. Doesn't seem to be supported.
+ if [[ "$method" == 2"" ]] && [[ "$populate" == "-o" ]]; then
+ continue
+ fi
+
+ if [[ "$method" == 2"" ]] && [[ "$reserve" == "-n" ]]; then
+ continue
+ fi
+
+ cleanup
+ echo
+ echo
+ echo
+ echo Test normal case.
+ echo private=$private, populate=$populate, method=$method, reserve=$reserve
+ run_test $((10 * 1024 * 1024)) "$populate" "" $((20 * 1024 * 1024)) \
+ $((20 * 1024 * 1024)) 10 "$method" "$private" "0" "$reserve"
+
+ echo Memory charged to hugetlb=$hugetlb_difference
+ echo Memory charged to reservation=$reserved_difference
+
+ if [[ "$populate" == "-o" ]]; then
+ expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference" \
+ "Reserved memory charged to hugetlb cgroup."
+ else
+ expect_equal "0" "$hugetlb_difference" \
+ "Reserved memory charged to hugetlb cgroup."
+ fi
+
+ if [[ "$reserve" != "-n" ]] || [[ "$populate" == "-o" ]]; then
+ expect_equal "$((10 * 1024 * 1024))" "$reserved_difference" \
+ "Reserved memory not charged to reservation usage."
+ else
+ expect_equal "0" "$reserved_difference" \
+ "Reserved memory not charged to reservation usage."
+ fi
+
+ echo 'PASS'
+
+ cleanup
+ echo
+ echo
+ echo
+ echo Test normal case with write.
+ echo private=$private, populate=$populate, method=$method, reserve=$reserve
+ run_test $((10 * 1024 * 1024)) "$populate" '-w' $((20 * 1024 * 1024)) \
+ $((20 * 1024 * 1024)) 10 "$method" "$private" "0" "$reserve"
+
+ echo Memory charged to hugetlb=$hugetlb_difference
+ echo Memory charged to reservation=$reserved_difference
+
+ expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference" \
+ "Reserved memory charged to hugetlb cgroup."
+
+ expect_equal "$((10 * 1024 * 1024))" "$reserved_difference" \
+ "Reserved memory not charged to reservation usage."
+
+ echo 'PASS'
+
+ cleanup
+ echo
+ echo
+ echo
+ echo Test more than reservation case.
+ echo private=$private, populate=$populate, method=$method, reserve=$reserve
+
+ if [ "$reserve" != "-n" ]; then
+ run_test "$((10 * 1024 * 1024))" "$populate" '' "$((20 * 1024 * 1024))" \
+ "$((5 * 1024 * 1024))" "10" "$method" "$private" "1" "$reserve"
+
+ expect_equal "1" "$reservation_failed" "Reservation succeeded."
+ fi
+
+ echo 'PASS'
+
+ cleanup
+
+ echo
+ echo
+ echo
+ echo Test more than cgroup limit case.
+ echo private=$private, populate=$populate, method=$method, reserve=$reserve
+
+ # Not sure if shm memory can be cleaned up when the process gets sigbus'd.
+ if [[ "$method" != 2 ]]; then
+ run_test $((10 * 1024 * 1024)) "$populate" "-w" $((5 * 1024 * 1024)) \
+ $((20 * 1024 * 1024)) 10 "$method" "$private" "1" "$reserve"
+
+ expect_equal "1" "$oom_killed" "Not oom killed."
+ fi
+ echo 'PASS'
+
+ cleanup
+
+ echo
+ echo
+ echo
+ echo Test normal case, multiple cgroups.
+ echo private=$private, populate=$populate, method=$method, reserve=$reserve
+ run_multiple_cgroup_test "$((6 * 1024 * 1024))" "$populate" "" \
+ "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "$((10 * 1024 * 1024))" \
+ "$populate" "" "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "10" \
+ "$method" "$private" "0" "$reserve"
+
+ echo Memory charged to hugetlb1=$hugetlb_difference1
+ echo Memory charged to reservation1=$reserved_difference1
+ echo Memory charged to hugetlb2=$hugetlb_difference2
+ echo Memory charged to reservation2=$reserved_difference2
+
+ if [[ "$reserve" != "-n" ]] || [[ "$populate" == "-o" ]]; then
+ expect_equal "$((6 * 1024 * 1024))" "$reserved_difference1" \
+ "Incorrect reservations charged to cgroup 1."
+
+ expect_equal "$((10 * 1024 * 1024))" "$reserved_difference2" \
+ "Incorrect reservation charged to cgroup 2."
+
+ else
+ expect_equal "0" "$reserved_difference1" \
+ "Incorrect reservations charged to cgroup 1."
+
+ expect_equal "0" "$reserved_difference2" \
+ "Incorrect reservation charged to cgroup 2."
+ fi
+
+ if [[ "$populate" == "-o" ]]; then
+ expect_equal "$((6 * 1024 * 1024))" "$hugetlb_difference1" \
+ "Incorrect hugetlb charged to cgroup 1."
+
+ expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference2" \
+ "Incorrect hugetlb charged to cgroup 2."
+
+ else
+ expect_equal "0" "$hugetlb_difference1" \
+ "Incorrect hugetlb charged to cgroup 1."
+
+ expect_equal "0" "$hugetlb_difference2" \
+ "Incorrect hugetlb charged to cgroup 2."
+ fi
+ echo 'PASS'
+
+ cleanup
+ echo
+ echo
+ echo
+ echo Test normal case with write, multiple cgroups.
+ echo private=$private, populate=$populate, method=$method, reserve=$reserve
+ run_multiple_cgroup_test "$((6 * 1024 * 1024))" "$populate" "-w" \
+ "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "$((10 * 1024 * 1024))" \
+ "$populate" "-w" "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "10" \
+ "$method" "$private" "0" "$reserve"
+
+ echo Memory charged to hugetlb1=$hugetlb_difference1
+ echo Memory charged to reservation1=$reserved_difference1
+ echo Memory charged to hugetlb2=$hugetlb_difference2
+ echo Memory charged to reservation2=$reserved_difference2
+
+ expect_equal "$((6 * 1024 * 1024))" "$hugetlb_difference1" \
+ "Incorrect hugetlb charged to cgroup 1."
+
+ expect_equal "$((6 * 1024 * 1024))" "$reserved_difference1" \
+ "Incorrect reservation charged to cgroup 1."
+
+ expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference2" \
+ "Incorrect hugetlb charged to cgroup 2."
+
+ expect_equal "$((10 * 1024 * 1024))" "$reserved_difference2" \
+ "Incorrected reservation charged to cgroup 2."
+ echo 'PASS'
+
+ cleanup
+
+ done # reserve
+ done # private
+ done # populate
+done # method
+
+umount $cgroup_path
+rmdir $cgroup_path
diff --git a/tools/testing/selftests/vm/hugetlb_reparenting_test.sh b/tools/testing/selftests/vm/hugetlb_reparenting_test.sh
new file mode 100755
index 0000000000000..2be672c2b311a
--- /dev/null
+++ b/tools/testing/selftests/vm/hugetlb_reparenting_test.sh
@@ -0,0 +1,235 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+if [[ $(id -u) -ne 0 ]]; then
+ echo "This test must be run as root. Skipping..."
+ exit 0
+fi
+
+usage_file=usage_in_bytes
+
+if [[ "$1" == "-cgroup-v2" ]]; then
+ cgroup2=1
+ usage_file=current
+fi
+
+CGROUP_ROOT='/dev/cgroup/memory'
+MNT='/mnt/huge/'
+
+if [[ ! -e $CGROUP_ROOT ]]; then
+ mkdir -p $CGROUP_ROOT
+ if [[ $cgroup2 ]]; then
+ mount -t cgroup2 none $CGROUP_ROOT
+ sleep 1
+ echo "+hugetlb +memory" >$CGROUP_ROOT/cgroup.subtree_control
+ else
+ mount -t cgroup memory,hugetlb $CGROUP_ROOT
+ fi
+fi
+
+function cleanup() {
+ echo cleanup
+ set +e
+ rm -rf "$MNT"/* 2>/dev/null
+ umount "$MNT" 2>/dev/null
+ rmdir "$MNT" 2>/dev/null
+ rmdir "$CGROUP_ROOT"/a/b 2>/dev/null
+ rmdir "$CGROUP_ROOT"/a 2>/dev/null
+ rmdir "$CGROUP_ROOT"/test1 2>/dev/null
+ echo 0 >/proc/sys/vm/nr_hugepages
+ set -e
+}
+
+function assert_state() {
+ local expected_a="$1"
+ local expected_a_hugetlb="$2"
+ local expected_b=""
+ local expected_b_hugetlb=""
+
+ if [ ! -z ${3:-} ] && [ ! -z ${4:-} ]; then
+ expected_b="$3"
+ expected_b_hugetlb="$4"
+ fi
+ local tolerance=$((5 * 1024 * 1024))
+
+ local actual_a
+ actual_a="$(cat "$CGROUP_ROOT"/a/memory.$usage_file)"
+ if [[ $actual_a -lt $(($expected_a - $tolerance)) ]] ||
+ [[ $actual_a -gt $(($expected_a + $tolerance)) ]]; then
+ echo actual a = $((${actual_a%% *} / 1024 / 1024)) MB
+ echo expected a = $((${expected_a%% *} / 1024 / 1024)) MB
+ echo fail
+
+ cleanup
+ exit 1
+ fi
+
+ local actual_a_hugetlb
+ actual_a_hugetlb="$(cat "$CGROUP_ROOT"/a/hugetlb.2MB.$usage_file)"
+ if [[ $actual_a_hugetlb -lt $(($expected_a_hugetlb - $tolerance)) ]] ||
+ [[ $actual_a_hugetlb -gt $(($expected_a_hugetlb + $tolerance)) ]]; then
+ echo actual a hugetlb = $((${actual_a_hugetlb%% *} / 1024 / 1024)) MB
+ echo expected a hugetlb = $((${expected_a_hugetlb%% *} / 1024 / 1024)) MB
+ echo fail
+
+ cleanup
+ exit 1
+ fi
+
+ if [[ -z "$expected_b" || -z "$expected_b_hugetlb" ]]; then
+ return
+ fi
+
+ local actual_b
+ actual_b="$(cat "$CGROUP_ROOT"/a/b/memory.$usage_file)"
+ if [[ $actual_b -lt $(($expected_b - $tolerance)) ]] ||
+ [[ $actual_b -gt $(($expected_b + $tolerance)) ]]; then
+ echo actual b = $((${actual_b%% *} / 1024 / 1024)) MB
+ echo expected b = $((${expected_b%% *} / 1024 / 1024)) MB
+ echo fail
+
+ cleanup
+ exit 1
+ fi
+
+ local actual_b_hugetlb
+ actual_b_hugetlb="$(cat "$CGROUP_ROOT"/a/b/hugetlb.2MB.$usage_file)"
+ if [[ $actual_b_hugetlb -lt $(($expected_b_hugetlb - $tolerance)) ]] ||
+ [[ $actual_b_hugetlb -gt $(($expected_b_hugetlb + $tolerance)) ]]; then
+ echo actual b hugetlb = $((${actual_b_hugetlb%% *} / 1024 / 1024)) MB
+ echo expected b hugetlb = $((${expected_b_hugetlb%% *} / 1024 / 1024)) MB
+ echo fail
+
+ cleanup
+ exit 1
+ fi
+}
+
+function setup() {
+ echo 100 >/proc/sys/vm/nr_hugepages
+ mkdir "$CGROUP_ROOT"/a
+ sleep 1
+ if [[ $cgroup2 ]]; then
+ echo "+hugetlb +memory" >$CGROUP_ROOT/a/cgroup.subtree_control
+ else
+ echo 0 >$CGROUP_ROOT/a/cpuset.mems
+ echo 0 >$CGROUP_ROOT/a/cpuset.cpus
+ fi
+
+ mkdir "$CGROUP_ROOT"/a/b
+
+ if [[ ! $cgroup2 ]]; then
+ echo 0 >$CGROUP_ROOT/a/b/cpuset.mems
+ echo 0 >$CGROUP_ROOT/a/b/cpuset.cpus
+ fi
+
+ mkdir -p "$MNT"
+ mount -t hugetlbfs none "$MNT"
+}
+
+write_hugetlbfs() {
+ local cgroup="$1"
+ local path="$2"
+ local size="$3"
+
+ if [[ $cgroup2 ]]; then
+ echo $$ >$CGROUP_ROOT/$cgroup/cgroup.procs
+ else
+ echo 0 >$CGROUP_ROOT/$cgroup/cpuset.mems
+ echo 0 >$CGROUP_ROOT/$cgroup/cpuset.cpus
+ echo $$ >"$CGROUP_ROOT/$cgroup/tasks"
+ fi
+ ./write_to_hugetlbfs -p "$path" -s "$size" -m 0 -o
+ if [[ $cgroup2 ]]; then
+ echo $$ >$CGROUP_ROOT/cgroup.procs
+ else
+ echo $$ >"$CGROUP_ROOT/tasks"
+ fi
+ echo
+}
+
+set -e
+
+size=$((2 * 1024 * 1024 * 25)) # 50MB = 25 * 2MB hugepages.
+
+cleanup
+
+echo
+echo
+echo Test charge, rmdir, uncharge
+setup
+echo mkdir
+mkdir $CGROUP_ROOT/test1
+
+echo write
+write_hugetlbfs test1 "$MNT"/test $size
+
+echo rmdir
+rmdir $CGROUP_ROOT/test1
+mkdir $CGROUP_ROOT/test1
+
+echo uncharge
+rm -rf /mnt/huge/*
+
+cleanup
+
+echo done
+echo
+echo
+if [[ ! $cgroup2 ]]; then
+ echo "Test parent and child hugetlb usage"
+ setup
+
+ echo write
+ write_hugetlbfs a "$MNT"/test $size
+
+ echo Assert memory charged correctly for parent use.
+ assert_state 0 $size 0 0
+
+ write_hugetlbfs a/b "$MNT"/test2 $size
+
+ echo Assert memory charged correctly for child use.
+ assert_state 0 $(($size * 2)) 0 $size
+
+ rmdir "$CGROUP_ROOT"/a/b
+ sleep 5
+ echo Assert memory reparent correctly.
+ assert_state 0 $(($size * 2))
+
+ rm -rf "$MNT"/*
+ umount "$MNT"
+ echo Assert memory uncharged correctly.
+ assert_state 0 0
+
+ cleanup
+fi
+
+echo
+echo
+echo "Test child only hugetlb usage"
+echo setup
+setup
+
+echo write
+write_hugetlbfs a/b "$MNT"/test2 $size
+
+echo Assert memory charged correctly for child only use.
+assert_state 0 $size 0 $size
+
+rmdir "$CGROUP_ROOT"/a/b
+echo Assert memory reparented correctly.
+assert_state 0 $size
+
+rm -rf "$MNT"/*
+umount "$MNT"
+echo Assert memory uncharged correctly.
+assert_state 0 0
+
+cleanup
+
+echo ALL PASS
+
+umount $CGROUP_ROOT
+rm -rf $CGROUP_ROOT
diff --git a/tools/testing/selftests/vm/write_hugetlb_memory.sh b/tools/testing/selftests/vm/write_hugetlb_memory.sh
new file mode 100644
index 0000000000000..d3d0d108924d4
--- /dev/null
+++ b/tools/testing/selftests/vm/write_hugetlb_memory.sh
@@ -0,0 +1,23 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+size=$1
+populate=$2
+write=$3
+cgroup=$4
+path=$5
+method=$6
+private=$7
+want_sleep=$8
+reserve=$9
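+
+# The flag arguments map onto write_to_hugetlbfs options, with an empty
+# string skipping the option: populate is "-o", write is "-w", private
+# is "-r", want_sleep is "-l", and reserve is "-n" (MAP_NORESERVE).
+# size is in bytes; method is 0=hugetlbfs, 1=mmap(MAP_HUGETLB), 2=shm.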
+
+echo "Putting task in cgroup '$cgroup'"
+echo $$ > /dev/cgroup/memory/"$cgroup"/cgroup.procs
+
+echo "Method is $method"
+
+set +e
+./write_to_hugetlbfs -p "$path" -s "$size" "$write" "$populate" -m "$method" \
+ "$private" "$want_sleep" "$reserve"
diff --git a/tools/testing/selftests/vm/write_to_hugetlbfs.c b/tools/testing/selftests/vm/write_to_hugetlbfs.c
new file mode 100644
index 0000000000000..85811c3384a10
--- /dev/null
+++ b/tools/testing/selftests/vm/write_to_hugetlbfs.c
@@ -0,0 +1,261 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This program reserves and uses hugetlb memory, supporting a bunch of
+ * scenarios needed by the charged_reserved_hugetlb.sh test.
+ */
+
+#include <err.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/shm.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+
+/* Global definitions. */
+enum method {
+ HUGETLBFS,
+ MMAP_MAP_HUGETLB,
+ SHM,
+ MAX_METHOD
+};
+
+
+/* Global variables. */
+static const char *self;
+static char *shmaddr;
+static int shmid;
+
+/*
+ * Show usage and exit.
+ */
+static void exit_usage(void)
+{
+ printf("Usage: %s -p <path to hugetlbfs file> -s <size to map> "
+ "[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l] [-r] "
+ "[-o] [-w] [-n]\n",
+ self);
+ exit(EXIT_FAILURE);
+}
+
+void sig_handler(int signo)
+{
+ printf("Received %d.\n", signo);
+ if (signo == SIGINT) {
+ printf("Deleting the memory\n");
+ if (shmdt((const void *)shmaddr) != 0) {
+ perror("Detach failure");
+ shmctl(shmid, IPC_RMID, NULL);
+ exit(4);
+ }
+
+ shmctl(shmid, IPC_RMID, NULL);
+ printf("Done deleting the memory\n");
+ }
+ exit(2);
+}
+
+int main(int argc, char **argv)
+{
+ int fd = 0;
+ int key = 0;
+ int *ptr = NULL;
+ int c = 0;
+ int size = 0;
+ char path[256] = "";
+ enum method method = MAX_METHOD;
+ int want_sleep = 0, private = 0;
+ int populate = 0;
+ int write = 0;
+ int reserve = 1;
+
+ unsigned long i;
+
+ if (signal(SIGINT, sig_handler) == SIG_ERR)
+ err(1, "\ncan't catch SIGINT\n");
+
+ /* Parse command-line arguments. */
+ setvbuf(stdout, NULL, _IONBF, 0);
+ self = argv[0];
+
+ while ((c = getopt(argc, argv, "s:p:m:owlrn")) != -1) {
+ switch (c) {
+ case 's':
+ size = atoi(optarg);
+ break;
+ case 'p':
+ strncpy(path, optarg, sizeof(path) - 1);
+ break;
+ case 'm':
+ if (atoi(optarg) >= MAX_METHOD) {
+ errno = EINVAL;
+ perror("Invalid -m.");
+ exit_usage();
+ }
+ method = atoi(optarg);
+ break;
+ case 'o':
+ populate = 1;
+ break;
+ case 'w':
+ write = 1;
+ break;
+ case 'l':
+ want_sleep = 1;
+ break;
+ case 'r':
+ private = 1;
+ break;
+ case 'n':
+ reserve = 0;
+ break;
+ default:
+ errno = EINVAL;
+ perror("Invalid arg");
+ exit_usage();
+ }
+ }
+
+ if (path[0] != '\0') {
+ printf("Writing to this path: %s\n", path);
+ } else {
+ errno = EINVAL;
+ perror("path not found");
+ exit_usage();
+ }
+
+ if (size != 0) {
+ printf("Writing this size: %d\n", size);
+ } else {
+ errno = EINVAL;
+ perror("size not found");
+ exit_usage();
+ }
+
+ if (!populate)
+ printf("Not populating.\n");
+ else
+ printf("Populating.\n");
+
+ if (!write)
+ printf("Not writing to memory.\n");
+
+ if (method == MAX_METHOD) {
+ errno = EINVAL;
+ perror("-m Invalid");
+ exit_usage();
+ } else
+ printf("Using method=%d\n", method);
+
+ if (!private)
+ printf("Shared mapping.\n");
+ else
+ printf("Private mapping.\n");
+
+ if (!reserve)
+ printf("NO_RESERVE mapping.\n");
+ else
+ printf("RESERVE mapping.\n");
+
+ switch (method) {
+ case HUGETLBFS:
+ printf("Allocating using HUGETLBFS.\n");
+ fd = open(path, O_CREAT | O_RDWR, 0777);
+ if (fd == -1)
+ err(1, "Failed to open file.");
+
+ ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+ (private ? MAP_PRIVATE : MAP_SHARED) |
+ (populate ? MAP_POPULATE : 0) |
+ (reserve ? 0 : MAP_NORESERVE),
+ fd, 0);
+
+ if (ptr == MAP_FAILED) {
+ close(fd);
+ err(1, "Error mapping the file");
+ }
+ break;
+ case MMAP_MAP_HUGETLB:
+ printf("Allocating using MAP_HUGETLB.\n");
+ ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+ (private ? (MAP_PRIVATE | MAP_ANONYMOUS) :
+ MAP_SHARED) |
+ MAP_HUGETLB | (populate ? MAP_POPULATE : 0) |
+ (reserve ? 0 : MAP_NORESERVE),
+ -1, 0);
+
+ if (ptr == MAP_FAILED)
+ err(1, "mmap");
+
+ printf("Returned address is %p\n", ptr);
+ break;
+ case SHM:
+ printf("Allocating using SHM.\n");
+ shmid = shmget(key, size,
+ SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
+ if (shmid < 0) {
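+ /* Key may already be in use; retry once with the next key. */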
+ shmid = shmget(++key, size,
+ SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
+ if (shmid < 0)
+ err(1, "shmget");
+ }
+ printf("shmid: 0x%x, shmget key:%d\n", shmid, key);
+
+ shmaddr = shmat(shmid, NULL, 0);
+ if (shmaddr == (char *)-1) {
+ perror("Shared memory attach failure");
+ shmctl(shmid, IPC_RMID, NULL);
+ exit(2);
+ }
+ printf("shmaddr: %p\n", shmaddr);
+
+ break;
+ default:
+ errno = EINVAL;
+ err(1, "Invalid method.");
+ }
+
+ if (write) {
+ printf("Writing to memory.\n");
+ if (method != SHM) {
+ memset(ptr, 1, size);
+ } else {
+ printf("Starting the writes:\n");
+ for (i = 0; i < size; i++) {
+ shmaddr[i] = (char)(i);
+ if (!(i % (1024 * 1024)))
+ printf(".");
+ }
+ printf("\n");
+
+ printf("Starting the Check...");
+ for (i = 0; i < size; i++)
+ if (shmaddr[i] != (char)i) {
+ printf("\nIndex %lu mismatched\n", i);
+ exit(3);
+ }
+ printf("Done.\n");
+ }
+ }
+
+ if (want_sleep) {
+ /* Signal to caller that we're done. */
+ printf("DONE\n");
+
+ /* Hold memory until external kill signal is delivered. */
+ while (1)
+ sleep(100);
+ }
+
+ /* Only the HUGETLBFS method opened a file descriptor. */
+ if (method == HUGETLBFS)
+ close(fd);
+
+ return 0;
+}
--
2.25.0.rc1.283.g88dfdc4193-goog

2020-01-15 01:28:58

by Mina Almasry

[permalink] [raw]
Subject: [PATCH v10 5/8] hugetlb_cgroup: add accounting for shared mappings

For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
in the resv_map entries, in file_region->reservation_counter.

After a call to region_chg, we charge the appropriate hugetlb_cgroup, and if
successful, we pass on the hugetlb_cgroup info to a follow-up region_add call.
When a file_region entry is added to the resv_map via region_add, we put the
pointer to that cgroup in file_region->reservation_counter. If charging doesn't
succeed, we report the error to the caller, so that the kernel fails the
reservation.

On region_del, which is when the hugetlb memory is unreserved, we also uncharge
the file_region->reservation_counter.
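
Roughly, the shared-mapping reservation path then looks like this (a
simplified sketch using the signatures from this series, with locking
and error cleanup elided, not the verbatim kernel code):

	chg = region_chg(resv_map, from, to, &regions_needed);

	/* Charge the whole range against the cgroup's reservation counter. */
	if (hugetlb_cgroup_charge_cgroup(hstate_index(h),
					 chg * pages_per_huge_page(h),
					 &h_cg, true))
		return -ENOMEM; /* over the resv limit: fail at mmap time */

	/* region_add() records h_cg in each new file_region, so that a
	 * later region_del() can uncharge rg->reservation_counter.
	 */
	add = region_add(h, h_cg, resv_map, from, to, regions_needed);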

Signed-off-by: Mina Almasry <[email protected]>

---

Changes in v10:
- Deleted duplicated code snippet.
Changes in v9:
- Updated for hugetlb reservation reparenting.

---
mm/hugetlb.c | 156 ++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 124 insertions(+), 32 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index de0028e9a8630..9bcfc12c5d214 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -243,6 +243,16 @@ struct file_region {
struct list_head link;
long from;
long to;
+#ifdef CONFIG_CGROUP_HUGETLB
+ /*
+ * On shared mappings, each reserved region appears as a struct
+ * file_region in resv_map. These fields hold the info needed to
+ * uncharge each reservation.
+ */
+ struct page_counter *reservation_counter;
+ unsigned long pages_per_hpage;
+ struct cgroup_subsys_state *css;
+#endif
};

/* Helper that removes a struct file_region from the resv_map cache and returns
@@ -266,6 +276,25 @@ get_file_region_entry_from_cache(struct resv_map *resv, long from, long to)
return nrg;
}

+/* Helper that records hugetlb_cgroup uncharge info. */
+static void record_hugetlb_cgroup_uncharge_info(struct hugetlb_cgroup *h_cg,
+ struct file_region *nrg,
+ struct hstate *h)
+{
+#ifdef CONFIG_CGROUP_HUGETLB
+ if (h_cg) {
+ nrg->reservation_counter =
+ &h_cg->reserved_hugepage[hstate_index(h)];
+ nrg->pages_per_hpage = pages_per_huge_page(h);
+ nrg->css = &h_cg->css;
+ } else {
+ nrg->reservation_counter = NULL;
+ nrg->pages_per_hpage = 0;
+ nrg->css = NULL;
+ }
+#endif
+}
+
/* Must be called with resv->lock held. Calling this with count_only == true
* will count the number of pages to be added but will not modify the linked
* list. If regions_needed != NULL and count_only == true, then regions_needed
@@ -273,7 +302,9 @@ get_file_region_entry_from_cache(struct resv_map *resv, long from, long to)
* add the regions for this range.
*/
static long add_reservation_in_range(struct resv_map *resv, long f, long t,
- long *regions_needed, bool count_only)
+ struct hugetlb_cgroup *h_cg,
+ struct hstate *h, long *regions_needed,
+ bool count_only)
{
long add = 0;
struct list_head *head = &resv->regions;
@@ -312,6 +343,8 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
if (!count_only) {
nrg = get_file_region_entry_from_cache(
resv, last_accounted_offset, rg->from);
+ record_hugetlb_cgroup_uncharge_info(h_cg, nrg,
+ h);
list_add(&nrg->link, rg->link.prev);
} else if (regions_needed)
*regions_needed += 1;
@@ -328,11 +361,13 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
if (!count_only) {
nrg = get_file_region_entry_from_cache(
resv, last_accounted_offset, t);
+ record_hugetlb_cgroup_uncharge_info(h_cg, nrg, h);
list_add(&nrg->link, rg->link.prev);
} else if (regions_needed)
*regions_needed += 1;
}

+ VM_BUG_ON(add < 0);
return add;
}

@@ -353,7 +388,8 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
* fail; region_chg will always allocate at least 1 entry and a region_add for
* 1 page will only require at most 1 entry.
*/
-static long region_add(struct resv_map *resv, long f, long t,
+static long region_add(struct hstate *h, struct hugetlb_cgroup *h_cg,
+ struct resv_map *resv, long f, long t,
long in_regions_needed)
{
long add = 0, actual_regions_needed = 0, i = 0;
@@ -366,7 +402,8 @@ static long region_add(struct resv_map *resv, long f, long t,
retry:

/* Count how many regions are actually needed to execute this add. */
- add_reservation_in_range(resv, f, t, &actual_regions_needed, true);
+ add_reservation_in_range(resv, f, t, NULL, NULL, &actual_regions_needed,
+ true);

/*
* Check for sufficient descriptors in the cache to accommodate
@@ -404,7 +441,7 @@ static long region_add(struct resv_map *resv, long f, long t,
goto retry;
}

- add = add_reservation_in_range(resv, f, t, NULL, false);
+ add = add_reservation_in_range(resv, f, t, h_cg, h, NULL, false);

resv->adds_in_progress -= in_regions_needed;

@@ -452,7 +489,8 @@ static long region_chg(struct resv_map *resv, long f, long t,
spin_lock(&resv->lock);

/* Count how many hugepages in this range are NOT represented. */
- chg = add_reservation_in_range(resv, f, t, out_regions_needed, true);
+ chg = add_reservation_in_range(resv, f, t, NULL, NULL,
+ out_regions_needed, true);

if (*out_regions_needed == 0)
*out_regions_needed = 1;
@@ -524,6 +562,29 @@ static void region_abort(struct resv_map *resv, long f, long t,
spin_unlock(&resv->lock);
}

+static void uncharge_cgroup_if_shared_mapping(struct resv_map *resv,
+ struct file_region *rg,
+ unsigned long nr_pages)
+{
+#ifdef CONFIG_CGROUP_HUGETLB
+ /*
+ * If resv->reservation_counter is NULL, then this is either a shared
+ * reservation, or cgroup charging is disabled on this resv_map.
+ *
+ * If the cgroup charging is disabled, then rg->reservation_counter is
+ * NULL and the uncharge counter call is a no-op. If the mapping is
+ * shared then the reserved memory is tracked in the file_region
+ * entries inside of resv_map. So we need to uncharge the memory here.
+ */
+ if (rg->reservation_counter && rg->pages_per_hpage && nr_pages > 0 &&
+ !resv->reservation_counter) {
+ hugetlb_cgroup_uncharge_counter(rg->reservation_counter,
+ nr_pages * rg->pages_per_hpage,
+ rg->css);
+ }
+#endif
+}
+
/*
* Delete the specified range [f, t) from the reserve map. If the
* t parameter is LONG_MAX, this indicates that ALL regions after f
@@ -588,11 +649,22 @@ static long region_del(struct resv_map *resv, long f, long t)
/* New entry for end of split region */
nrg->from = t;
nrg->to = rg->to;
+
+#ifdef CONFIG_CGROUP_HUGETLB
+ nrg->reservation_counter = rg->reservation_counter;
+ nrg->pages_per_hpage = rg->pages_per_hpage;
+ nrg->css = rg->css;
+ css_get(rg->css);
+#endif
+
INIT_LIST_HEAD(&nrg->link);

/* Original entry is trimmed */
rg->to = f;

+ /* The pages in [f, t) were removed from the middle of this region. */
+ uncharge_cgroup_if_shared_mapping(resv, rg, t - f);
+
list_add(&nrg->link, &rg->link);
nrg = NULL;
break;
@@ -600,6 +672,8 @@ static long region_del(struct resv_map *resv, long f, long t)

if (f <= rg->from && t >= rg->to) { /* Remove entire region */
del += rg->to - rg->from;
+ uncharge_cgroup_if_shared_mapping(resv, rg,
+ rg->to - rg->from);
list_del(&rg->link);
kfree(rg);
continue;
@@ -608,14 +682,20 @@ static long region_del(struct resv_map *resv, long f, long t)
if (f <= rg->from) { /* Trim beginning of region */
+ /* Uncharge before rg->from is advanced past the trimmed pages. */
+ uncharge_cgroup_if_shared_mapping(resv, rg,
+ t - rg->from);
+
del += t - rg->from;
rg->from = t;
} else { /* Trim end of region */
+ /* Uncharge before rg->to is pulled back over the trimmed pages. */
+ uncharge_cgroup_if_shared_mapping(resv, rg, rg->to - f);
+
del += rg->to - f;
rg->to = f;
}
}

spin_unlock(&resv->lock);
kfree(nrg);
+
return del;
}

@@ -2002,7 +2082,7 @@ static long __vma_reservation_common(struct hstate *h,
VM_BUG_ON(dummy_out_regions_needed != 1);
break;
case VMA_COMMIT_RESV:
- ret = region_add(resv, idx, idx + 1, 1);
+ ret = region_add(NULL, NULL, resv, idx, idx + 1, 1);
/* region_add calls of range 1 should never fail. */
VM_BUG_ON(ret < 0);
break;
@@ -2012,7 +2092,7 @@ static long __vma_reservation_common(struct hstate *h,
break;
case VMA_ADD_RESV:
if (vma->vm_flags & VM_MAYSHARE) {
- ret = region_add(resv, idx, idx + 1, 1);
+ ret = region_add(NULL, NULL, resv, idx, idx + 1, 1);
/* region_add calls of range 1 should never fail. */
VM_BUG_ON(ret < 0);
} else {
@@ -4679,7 +4759,7 @@ int hugetlb_reserve_pages(struct inode *inode,
struct hstate *h = hstate_inode(inode);
struct hugepage_subpool *spool = subpool_inode(inode);
struct resv_map *resv_map;
- struct hugetlb_cgroup *h_cg;
+ struct hugetlb_cgroup *h_cg = NULL;
long gbl_reserve, regions_needed = 0;

/* This should never happen */
@@ -4720,23 +4800,6 @@ int hugetlb_reserve_pages(struct inode *inode,

chg = to - from;

- if (hugetlb_cgroup_charge_cgroup(hstate_index(h),
- chg * pages_per_huge_page(h),
- &h_cg, true)) {
- kref_put(&resv_map->refs, resv_map_release);
- return -ENOMEM;
- }
-
-#ifdef CONFIG_CGROUP_HUGETLB
- /*
- * Since this branch handles private mappings, we attach the
- * counter to uncharge for this reservation off resv_map.
- */
- resv_map->reservation_counter =
- &h_cg->reserved_hugepage[hstate_index(h)];
- resv_map->pages_per_hpage = pages_per_huge_page(h);
-#endif
-
set_vma_resv_map(vma, resv_map);
set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
}
@@ -4746,6 +4809,26 @@ int hugetlb_reserve_pages(struct inode *inode,
goto out_err;
}

+ ret = hugetlb_cgroup_charge_cgroup(
+ hstate_index(h), chg * pages_per_huge_page(h), &h_cg, true);
+
+ if (ret < 0) {
+ ret = -ENOMEM;
+ goto out_err;
+ }
+
+#ifdef CONFIG_CGROUP_HUGETLB
+ if (vma && !(vma->vm_flags & VM_MAYSHARE) && h_cg) {
+ /* For private mappings, the hugetlb_cgroup uncharge info hangs
+ * off the resv_map.
+ */
+ resv_map->reservation_counter =
+ &h_cg->reserved_hugepage[hstate_index(h)];
+ resv_map->pages_per_hpage = pages_per_huge_page(h);
+ resv_map->css = &h_cg->css;
+ }
+#endif
+
/*
* There must be enough pages in the subpool for the mapping. If
* the subpool has a minimum size, there may be some global
@@ -4754,7 +4837,7 @@ int hugetlb_reserve_pages(struct inode *inode,
gbl_reserve = hugepage_subpool_get_pages(spool, chg);
if (gbl_reserve < 0) {
ret = -ENOSPC;
- goto out_err;
+ goto out_uncharge_cgroup;
}

/*
@@ -4763,9 +4846,7 @@ int hugetlb_reserve_pages(struct inode *inode,
*/
ret = hugetlb_acct_memory(h, gbl_reserve);
if (ret < 0) {
- /* put back original number of pages, chg */
- (void)hugepage_subpool_put_pages(spool, chg);
- goto out_err;
+ goto out_put_pages;
}

/*
@@ -4780,7 +4861,7 @@ int hugetlb_reserve_pages(struct inode *inode,
* else has to be done for private mappings here
*/
if (!vma || vma->vm_flags & VM_MAYSHARE) {
- add = region_add(resv_map, from, to, regions_needed);
+ add = region_add(h, h_cg, resv_map, from, to, regions_needed);

if (unlikely(add < 0)) {
hugetlb_acct_memory(h, -gbl_reserve);
@@ -4797,12 +4878,23 @@ int hugetlb_reserve_pages(struct inode *inode,
*/
long rsv_adjust;

- rsv_adjust = hugepage_subpool_put_pages(spool,
- chg - add);
+ hugetlb_cgroup_uncharge_cgroup(
+ hstate_index(h),
+ (chg - add) * pages_per_huge_page(h), h_cg,
+ true);
+
+ rsv_adjust =
+ hugepage_subpool_put_pages(spool, chg - add);
hugetlb_acct_memory(h, -rsv_adjust);
}
}
return 0;
+out_put_pages:
+ /* put back original number of pages, chg */
+ (void)hugepage_subpool_put_pages(spool, chg);
+out_uncharge_cgroup:
+ hugetlb_cgroup_uncharge_cgroup(
+ hstate_index(h), chg * pages_per_huge_page(h), h_cg, true);
out_err:
if (!vma || vma->vm_flags & VM_MAYSHARE)
/* Only call region_abort if the region_chg succeeded but the
--
2.25.0.rc1.283.g88dfdc4193-goog

2020-01-16 23:27:14

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v10 1/8] hugetlb_cgroup: Add hugetlb_cgroup reservation counter

On 1/14/20 5:26 PM, Mina Almasry wrote:
> These counters will track hugetlb reservations rather than hugetlb
> memory faulted in. This patch only adds the counter, following patches
> add the charging and uncharging of the counter.
>
> This is patch 1 of an 8 patch series.
>
> Problem:
> Currently tasks attempting to reserve more hugetlb memory than is available get
> a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
> However, if a task attempts to reserve hugetlb memory only more than its

*reword*
However, if a task attempts to reserve more hugetlb memory than its

> hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
> but will SIGBUS the task when it attempts to fault the memory in.

*reword*
but will SIGBUS the task when it attempts to fault in the excess memory.

>
> We have users hitting their hugetlb_cgroup limits and thus we've been
> looking at this failure mode. We'd like to improve this behavior such that users
> violating the hugetlb_cgroup limits get an error on mmap/shmget time, rather
> than getting SIGBUS'd when they try to fault the excess memory in. This
> gives the user an opportunity to fallback more gracefully to
> non-hugetlbfs memory for example.
>
> The underlying problem is that today's hugetlb_cgroup accounting happens
> at hugetlb memory *fault* time, rather than at *reservation* time.
> Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
> the offending task gets SIGBUS'd.
>
> Proposed Solution:
> A new page counter named
> 'hugetlb.xMB.reservation_[limit|usage|max_usage]_in_bytes'. This counter has
> slightly different semantics than

You changed the name to 'hugetlb.xMB.resv_[limit|usage|max_usage]_in_bytes'
in the code, but left this description.

Also, David suggested 'rsvd' as the abbreviation to use here. I would also
prefer that name to be consistent with other hugetlb interfaces.

> 'hugetlb.xMB.[limit|usage|max_usage]_in_bytes':
>
> - While usage_in_bytes tracks all *faulted* hugetlb memory,
> reservation_usage_in_bytes tracks all *reserved* hugetlb memory and
> hugetlb memory faulted in without a prior reservation.
>
> - If a task attempts to reserve more memory than limit_in_bytes allows,
> the kernel will allow it to do so. But if a task attempts to reserve
> more memory than reservation_limit_in_bytes, the kernel will fail this
> reservation.
>
> This proposal is implemented in this patch series, with tests to verify
> functionality and show the usage.
>
> Alternatives considered:
> 1. A new cgroup, instead of only a new page_counter attached to
> the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
> duplication with hugetlb_cgroup. Keeping hugetlb related page counters under
> hugetlb_cgroup seemed cleaner as well.
>
> 2. Instead of adding a new counter, we considered adding a sysctl that modifies
> the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at
> reservation time rather than fault time. Adding a new page_counter seems
> better as userspace could, if it wants, choose to enforce different cgroups
> differently: one via limit_in_bytes, and another via
> reservation_limit_in_bytes. This could be very useful if you're
> transitioning how hugetlb memory is partitioned on your system one
> cgroup at a time, for example. Also, someone may find usage for both
> limit_in_bytes and reservation_limit_in_bytes concurrently, and this
> approach gives them the option to do so.
>
> Testing:
> - Added tests passing.
> - Used libhugetlbfs for regression testing.
>
> [1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html
>
> Signed-off-by: Mina Almasry <[email protected]>
>
> ---
> Changes in v10:
> - Renamed reservation_* to resv.*
>
> ---
> include/linux/hugetlb.h | 4 +-
> mm/hugetlb_cgroup.c | 115 +++++++++++++++++++++++++++++++++++-----
> 2 files changed, 104 insertions(+), 15 deletions(-)

The code looks fine to me.

With the commit message and naming updates, I will add a Reviewed-by:

Please do wait a few/several days before sending a revised edition to
make sure we get all feedback. I really would like to get comments from
people more familiar with cgroups.

--
Mike Kravetz

2020-01-21 18:53:23

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v10 4/8] hugetlb: disable region_add file_region coalescing

On 1/14/20 5:26 PM, Mina Almasry wrote:
> A follow-up patch in this series adds hugetlb cgroup uncharge info to the
> file_region entries in resv->regions. The cgroup uncharge info may
> differ for different regions, so they can no longer be coalesced at
> region_add time. So, disable region coalescing in region_add in this
> patch.
>
> Behavior change:
>
> Say a resv_map exists like this [0->1], [2->3], and [5->6].
>
> Then a region_chg/add call comes in region_chg/add(f=0, t=5).
>
> Old code would generate resv->regions: [0->5], [5->6].
> New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
> [5->6].
>
> Special care needs to be taken to handle the resv->adds_in_progress
> variable correctly. In the past, only 1 region would be added for every
> region_chg and region_add call. But now, each call may add multiple
> regions, so we can no longer increment adds_in_progress by 1 in region_chg,
> or decrement adds_in_progress by 1 after region_add or region_abort. Instead,
> region_chg calls add_reservation_in_range() to count the number of regions
> needed and allocates those, and that info is passed to region_add and
> region_abort to decrement adds_in_progress correctly.
>
> We've also modified the assumption that region_add after region_chg
> never fails. region_chg now pre-allocates at least 1 region for
> region_add. If region_add needs more regions than region_chg has
> allocated for it, then it may fail.

Some time back we briefly discussed an optimization to coalesce file
region entries if they were from the same cgroup. At the time, the
thought was that such an optimization could wait. For large mappings,
known users will reserve the entire area. Smaller mappings such as
those in the commit log are not the common case and are mentioned mostly
to illustrate what the code must handle.

However, I just remembered that for private mappings file region entries
are allocated at page fault time: one per page. Since we are no longer
coalescing, there will be one file region struct for each page in a
private mapping. Is that correct?

I honestly do not know how common private mappings are today. But,
this would cause excessive overhead for any large private mapping:
back of the envelope, with the three new fields struct file_region is
on the order of 64 bytes, so a 10GB private mapping of 2MB pages would
leave roughly 5120 file_region entries (a few hundred KB) in the
reserve map where coalescing used to leave one.

--
Mike Kravetz

2020-01-23 09:59:40

by Sandipan Das

[permalink] [raw]
Subject: Re: [PATCH v10 7/8] hugetlb_cgroup: Add hugetlb_cgroup reservation tests

Hi,

On 15/01/20 6:56 am, Mina Almasry wrote:
> The tests use both shared and private mapped hugetlb memory, and
> monitors the hugetlb usage counter as well as the hugetlb reservation
> counter. They test different configurations such as hugetlb memory usage
> via hugetlbfs, or MAP_HUGETLB, or shmget/shmat, and with and without
> MAP_POPULATE.
>
> Also add test for hugetlb reservation reparenting, since this is
> a subtle issue.
>
> Signed-off-by: Mina Almasry <[email protected]>
>

For powerpc64, either 16MB/16GB or 2MB/1GB huge pages are supported depending
on the MMU type (Hash or Radix). I was just running these tests on a powerpc64
system with Hash MMU and ran into problems because the tests assume that the
hugepage size is always 2MB. Can you determine the huge page size at runtime?


- Sandipan

2020-01-23 20:23:56

by Mina Almasry

[permalink] [raw]
Subject: Re: [PATCH v10 7/8] hugetlb_cgroup: Add hugetlb_cgroup reservation tests

On Thu, Jan 23, 2020 at 1:15 AM Sandipan Das <[email protected]> wrote:
>
> For powerpc64, either 16MB/16GB or 2MB/1GB huge pages are supported depending
> on the MMU type (Hash or Radix). I was just running these tests on a powerpc64
> system with Hash MMU and ran into problems because the tests assume that the
> hugepage size is always 2MB. Can you determine the huge page size at runtime?
>

Absolutely. Let me try to reproduce this failure and it should be
fixed in the next patchset version.

2020-01-29 21:02:16

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v10 7/8] hugetlb_cgroup: Add hugetlb_cgroup reservation tests

On Thu, 23 Jan 2020, Sandipan Das wrote:

> > The tests use both shared and private mapped hugetlb memory, and
> > monitors the hugetlb usage counter as well as the hugetlb reservation
> > counter. They test different configurations such as hugetlb memory usage
> > via hugetlbfs, or MAP_HUGETLB, or shmget/shmat, and with and without
> > MAP_POPULATE.
> >
> > Also add test for hugetlb reservation reparenting, since this is
> > a subtle issue.
> >
> > Signed-off-by: Mina Almasry <[email protected]>
> >
>
> For powerpc64, either 16MB/16GB or 2MB/1GB huge pages are supported depending
> on the MMU type (Hash or Radix). I was just running these tests on a powerpc64
> system with Hash MMU and ran into problems because the tests assume that the
> hugepage size is always 2MB. Can you determine the huge page size at runtime?
>

I assume this is only testing failures of the tools/testing/selftests
additions that hardcode 2MB paths and not a kernel problem? In other
words, you can still boot, reserve, alloc, and free hugetlb pages on ppc
after this patchset without using the selftests?

2020-01-29 21:10:44

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v10 1/8] hugetlb_cgroup: Add hugetlb_cgroup reservation counter

On Tue, 14 Jan 2020, Mina Almasry wrote:

> These counters will track hugetlb reservations rather than hugetlb
> memory faulted in. This patch only adds the counter, following patches
> add the charging and uncharging of the counter.
>
> This is patch 1 of an 8 patch series.
>
> Problem:
> Currently tasks attempting to reserve more hugetlb memory than is available get
> a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
> However, if a task attempts to reserve hugetlb memory only more than its
> hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
> but will SIGBUS the task when it attempts to fault the memory in.
>
> We have users hitting their hugetlb_cgroup limits and thus we've been
> looking at this failure mode. We'd like to improve this behavior such that users
> violating the hugetlb_cgroup limits get an error on mmap/shmget time, rather
> than getting SIGBUS'd when they try to fault the excess memory in. This
> gives the user an opportunity to fallback more gracefully to
> non-hugetlbfs memory for example.
>
> The underlying problem is that today's hugetlb_cgroup accounting happens
> at hugetlb memory *fault* time, rather than at *reservation* time.
> Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
> the offending task gets SIGBUS'd.
>
> Proposed Solution:
> A new page counter named
> 'hugetlb.xMB.reservation_[limit|usage|max_usage]_in_bytes'. This counter has
> slightly different semantics than
> 'hugetlb.xMB.[limit|usage|max_usage]_in_bytes':
>

Changelog looks like it needs to be updated with the new resv naming.

> - While usage_in_bytes tracks all *faulted* hugetlb memory,
> reservation_usage_in_bytes tracks all *reserved* hugetlb memory and
> hugetlb memory faulted in without a prior reservation.
>
> - If a task attempts to reserve more memory than limit_in_bytes allows,
> the kernel will allow it to do so. But if a task attempts to reserve
> more memory than reservation_limit_in_bytes, the kernel will fail this
> reservation.
>
> This proposal is implemented in this patch series, with tests to verify
> functionality and show the usage.
>
> Alternatives considered:
> 1. A new cgroup, instead of only a new page_counter attached to
> the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
> duplication with hugetlb_cgroup. Keeping hugetlb related page counters under
> hugetlb_cgroup seemed cleaner as well.
>
> 2. Instead of adding a new counter, we considered adding a sysctl that modifies
> the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at
> reservation time rather than fault time. Adding a new page_counter seems
> better as userspace could, if it wants, choose to enforce different cgroups
> differently: one via limit_in_bytes, and another via
> reservation_limit_in_bytes. This could be very useful if you're
> transitioning how hugetlb memory is partitioned on your system one
> cgroup at a time, for example. Also, someone may find usage for both
> limit_in_bytes and reservation_limit_in_bytes concurrently, and this
> approach gives them the option to do so.
>
> Testing:
> - Added tests passing.
> - Used libhugetlbfs for regression testing.
>
> [1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html
>
> Signed-off-by: Mina Almasry <[email protected]>
>
> ---
> Changes in v10:
> - Renamed reservation_* to resv.*
>
> ---
> include/linux/hugetlb.h | 4 +-
> mm/hugetlb_cgroup.c | 115 +++++++++++++++++++++++++++++++++++-----
> 2 files changed, 104 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1e897e4168ac1..dea6143aa0685 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -432,8 +432,8 @@ struct hstate {
> unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> #ifdef CONFIG_CGROUP_HUGETLB
> /* cgroup control files */
> - struct cftype cgroup_files_dfl[5];
> - struct cftype cgroup_files_legacy[5];
> + struct cftype cgroup_files_dfl[7];
> + struct cftype cgroup_files_legacy[9];
> #endif
> char name[HSTATE_NAME_LEN];
> };
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index e434b05416c68..209f9b9604d34 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -36,6 +36,11 @@ struct hugetlb_cgroup {
> */
> struct page_counter hugepage[HUGE_MAX_HSTATE];
>
> + /*
> + * the counter to account for hugepage reservations from hugetlb.
> + */
> + struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
> +
> atomic_long_t events[HUGE_MAX_HSTATE][HUGETLB_NR_MEMORY_EVENTS];
> atomic_long_t events_local[HUGE_MAX_HSTATE][HUGETLB_NR_MEMORY_EVENTS];
>
> @@ -55,6 +60,14 @@ struct hugetlb_cgroup {
>
> static struct hugetlb_cgroup *root_h_cgroup __read_mostly;
>
> +static inline struct page_counter *
> +hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg, int idx, bool reserved)
> +{
> + if (reserved)
> + return &h_cg->reserved_hugepage[idx];
> + return &h_cg->hugepage[idx];
> +}
> +
> static inline
> struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
> {

Small nit: hugetlb_cgroup_get_counter(), to me, implies incrementing a
reference count, perhaps a better name would be in order. No strong
preference.

> @@ -295,28 +308,42 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
>
> enum {
> RES_USAGE,
> + RES_RESERVATION_USAGE,
> RES_LIMIT,
> + RES_RESERVATION_LIMIT,
> RES_MAX_USAGE,
> + RES_RESERVATION_MAX_USAGE,
> RES_FAILCNT,
> + RES_RESERVATION_FAILCNT,
> };
>
> static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
> struct cftype *cft)
> {
> struct page_counter *counter;
> + struct page_counter *reserved_counter;
> struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
>
> counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
> + reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];
>
> switch (MEMFILE_ATTR(cft->private)) {
> case RES_USAGE:
> return (u64)page_counter_read(counter) * PAGE_SIZE;
> + case RES_RESERVATION_USAGE:
> + return (u64)page_counter_read(reserved_counter) * PAGE_SIZE;
> case RES_LIMIT:
> return (u64)counter->max * PAGE_SIZE;
> + case RES_RESERVATION_LIMIT:
> + return (u64)reserved_counter->max * PAGE_SIZE;
> case RES_MAX_USAGE:
> return (u64)counter->watermark * PAGE_SIZE;
> + case RES_RESERVATION_MAX_USAGE:
> + return (u64)reserved_counter->watermark * PAGE_SIZE;
> case RES_FAILCNT:
> return counter->failcnt;
> + case RES_RESERVATION_FAILCNT:
> + return reserved_counter->failcnt;
> default:
> BUG();
> }
> @@ -338,10 +365,16 @@ static int hugetlb_cgroup_read_u64_max(struct seq_file *seq, void *v)
> 1 << huge_page_order(&hstates[idx]));
>
> switch (MEMFILE_ATTR(cft->private)) {
> + case RES_RESERVATION_USAGE:
> + counter = &h_cg->reserved_hugepage[idx];
> + /* Fall through. */
> case RES_USAGE:
> val = (u64)page_counter_read(counter);
> seq_printf(seq, "%llu\n", val * PAGE_SIZE);
> break;
> + case RES_RESERVATION_LIMIT:
> + counter = &h_cg->reserved_hugepage[idx];
> + /* Fall through. */
> case RES_LIMIT:
> val = (u64)counter->max;
> if (val == limit)
> @@ -365,6 +398,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
> int ret, idx;
> unsigned long nr_pages;
> struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
> + bool reserved = false;
>
> if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
> return -EINVAL;
> @@ -378,9 +412,14 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
> nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));
>
> switch (MEMFILE_ATTR(of_cft(of)->private)) {
> + case RES_RESERVATION_LIMIT:
> + reserved = true;
> + /* Fall through. */
> case RES_LIMIT:
> mutex_lock(&hugetlb_limit_mutex);
> - ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages);
> + ret = page_counter_set_max(hugetlb_cgroup_get_counter(h_cg, idx,
> + reserved),
> + nr_pages);
> mutex_unlock(&hugetlb_limit_mutex);
> break;
> default:
> @@ -406,18 +445,26 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
> char *buf, size_t nbytes, loff_t off)
> {
> int ret = 0;
> - struct page_counter *counter;
> + struct page_counter *counter, *reserved_counter;
> struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
>
> counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
> + reserved_counter =
> + &h_cg->reserved_hugepage[MEMFILE_IDX(of_cft(of)->private)];
>
> switch (MEMFILE_ATTR(of_cft(of)->private)) {
> case RES_MAX_USAGE:
> page_counter_reset_watermark(counter);
> break;
> + case RES_RESERVATION_MAX_USAGE:
> + page_counter_reset_watermark(reserved_counter);
> + break;
> case RES_FAILCNT:
> counter->failcnt = 0;
> break;
> + case RES_RESERVATION_FAILCNT:
> + reserved_counter->failcnt = 0;
> + break;
> default:
> ret = -EINVAL;
> break;
> @@ -472,7 +519,7 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
> struct hstate *h = &hstates[idx];
>
> /* format the size */
> - mem_fmt(buf, 32, huge_page_size(h));
> + mem_fmt(buf, sizeof(buf), huge_page_size(h));
>
> /* Add the limit file */
> cft = &h->cgroup_files_dfl[0];
> @@ -482,15 +529,30 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
> cft->write = hugetlb_cgroup_write_dfl;
> cft->flags = CFTYPE_NOT_ON_ROOT;
>
> - /* Add the current usage file */
> + /* Add the reservation limit file */
> cft = &h->cgroup_files_dfl[1];
> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.resv.max", buf);
> + cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_LIMIT);
> + cft->seq_show = hugetlb_cgroup_read_u64_max;
> + cft->write = hugetlb_cgroup_write_dfl;
> + cft->flags = CFTYPE_NOT_ON_ROOT;
> +
> + /* Add the current usage file */
> + cft = &h->cgroup_files_dfl[2];
> snprintf(cft->name, MAX_CFTYPE_NAME, "%s.current", buf);
> cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
> cft->seq_show = hugetlb_cgroup_read_u64_max;
> cft->flags = CFTYPE_NOT_ON_ROOT;
>
> + /* Add the current reservation usage file */
> + cft = &h->cgroup_files_dfl[3];
> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.resv.current", buf);
> + cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_USAGE);
> + cft->seq_show = hugetlb_cgroup_read_u64_max;
> + cft->flags = CFTYPE_NOT_ON_ROOT;
> +
> /* Add the events file */
> - cft = &h->cgroup_files_dfl[2];
> + cft = &h->cgroup_files_dfl[4];
> snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events", buf);
> cft->private = MEMFILE_PRIVATE(idx, 0);
> cft->seq_show = hugetlb_events_show;

Any cleanup to __hugetlb_cgroup_file_dfl_init() and
__hugetlb_cgroup_file_legacy_init() that is possible would be great in a
follow-up patch :)

Other than that, this looks very straight forward.

Acked-by: David Rientjes <[email protected]>

2020-01-30 06:15:24

by Sandipan Das

[permalink] [raw]
Subject: Re: [PATCH v10 7/8] hugetlb_cgroup: Add hugetlb_cgroup reservation tests

Hi David,

On 30/01/20 2:30 am, David Rientjes wrote:
> On Thu, 23 Jan 2020, Sandipan Das wrote:
>
>> For powerpc64, either 16MB/16GB or 2MB/1GB huge pages are supported depending
>> on the MMU type (Hash or Radix). I was just running these tests on a powerpc64
>> system with Hash MMU and ran into problems because the tests assume that the
>> hugepage size is always 2MB. Can you determine the huge page size at runtime?
>>
>
> I assume this is only testing failures of the tools/testing/selftests
> additions that hardcode 2MB paths and not a kernel problem? In other
> words, you can still boot, reserve, alloc, and free hugetlb pages on ppc
> after this patchset without using the selftests?
>

Yes, it's just the hardcoded paths. I didn't run into any kernel problems.

- Sandipan

2020-02-03 23:17:55

by Mina Almasry

[permalink] [raw]
Subject: Re: [PATCH v10 1/8] hugetlb_cgroup: Add hugetlb_cgroup reservation counter

> > Proposed Solution:
> > A new page counter named
> > 'hugetlb.xMB.reservation_[limit|usage|max_usage]_in_bytes'. This counter has
> > slightly different semantics than
> > 'hugetlb.xMB.[limit|usage|max_usage]_in_bytes':
> >
>
> Changelog looks like it needs to be updated with the new resv naming.
>

I updated to the rsvd naming, which you suggested earlier and Mike
agreed was better.

> > +
> > static inline
> > struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
> > {
>
> Small nit: hugetlb_cgroup_get_counter(), to me, implies incrementing a
> reference count, perhaps a better name would be in order. No strong
> preference.
>

Changed.

> > /* Add the events file */
> > - cft = &h->cgroup_files_dfl[2];
> > + cft = &h->cgroup_files_dfl[4];
> > snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events", buf);
> > cft->private = MEMFILE_PRIVATE(idx, 0);
> > cft->seq_show = hugetlb_events_show;
>
> Any cleanup to __hugetlb_cgroup_file_dfl_init() and
> __hugetlb_cgroup_file_legacy_init() that is possible would be great in a
> follow-up patch :)
>

Will do!

> Other than that, this looks very straight forward.
>
> Acked-by: David Rientjes <[email protected]>

2020-02-03 23:20:12

by Mina Almasry

[permalink] [raw]
Subject: Re: [PATCH v10 7/8] hugetlb_cgroup: Add hugetlb_cgroup reservation tests

On Wed, Jan 29, 2020 at 10:11 PM Sandipan Das <[email protected]> wrote:
>
> Hi David,
>
> On 30/01/20 2:30 am, David Rientjes wrote:
> > On Thu, 23 Jan 2020, Sandipan Das wrote:
> >
> >> For powerpc64, either 16MB/16GB or 2MB/1GB huge pages are supported depending
> >> on the MMU type (Hash or Radix). I was just running these tests on a powerpc64
> >> system with Hash MMU and ran into problems because the tests assume that the
> >> hugepage size is always 2MB. Can you determine the huge page size at runtime?
> >>
> >
> > I assume this is only testing failures of the tools/testing/selftests
> > additions that hardcode 2MB paths and not a kernel problem? In other
> > words, you can still boot, reserve, alloc, and free hugetlb pages on ppc
> > after this patchset without using the selftests?
> >
>
> Yes, it's just the hardcoded paths. I didn't run into any kernel problems.
>

Sandipan, I updated the tests to not assume 2MB page size, but I'm
having trouble getting a setup with a non-2MB default size to test
with. I'm uploading v11 of the series shortly, please let me know if
the problem persists.
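
For reference, here's a minimal sketch of the approach (detecting the
default hugepage size by parsing /proc/meminfo; the helper name below is
just illustrative, and the exact code in v11 may differ):

	#include <stdio.h>

	/* Returns the default hugepage size in bytes, or -1 on error. */
	static long default_hugepage_size(void)
	{
		char line[256];
		long kb = -1;
		FILE *f = fopen("/proc/meminfo", "r");

		if (!f)
			return -1;
		while (fgets(line, sizeof(line), f))
			if (sscanf(line, "Hugepagesize: %ld kB", &kb) == 1)
				break;
		fclose(f);
		return kb > 0 ? kb * 1024 : -1;
	}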

> - Sandipan
>