2021-04-06 08:27:52

by Tim Chen

[permalink] [raw]
Subject: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

Traditionally, all memory is DRAM. Some DRAM might be closer/faster than
others NUMA-wise, but a byte of media has about the same cost whether it
is close or far. But with new memory tiers such as Persistent Memory
(PMEM), there is a choice between fast/expensive DRAM and slow/cheap
PMEM.

The fast/expensive memory lives in the top tier of the memory hierarchy.

Previously, the patchset
[PATCH 00/10] [v7] Migrate Pages in lieu of discard
https://lore.kernel.org/linux-mm/[email protected]/
provides a mechanism to demote cold pages from DRAM nodes into PMEM.

And the patchset
[PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system
https://lore.kernel.org/linux-mm/[email protected]/
provides a mechanism to promote hot pages in PMEM to the DRAM node
leveraging autonuma.

The two patchsets together keep the hot pages in DRAM and colder pages
in PMEM.

To make fine grain cgroup based management of the precious top tier
DRAM memory possible, this patchset adds a few new features:
1. Provides memory monitors on the amount of top tier memory used per cgroup
and by the system as a whole.
2. Applies soft limits on the top tier memory each cgroup uses
3. Enables kswapd to demote top tier pages from cgroups with excess top
tier memory usage.

This allows us to provision different amounts of top tier memory to each
cgroup according to the cgroup's latency needs.
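As an illustrative sketch (not the kernel implementation, and with hypothetical names), the per-cgroup decision this series builds on reduces to computing the excess of a cgroup's top tier usage over its soft limit; cgroups with a positive excess become demotion candidates:

```c
#include <assert.h>

/* Illustrative sketch only: how far a cgroup's top tier usage
 * exceeds its soft limit, in pages. A cgroup with a positive
 * excess is a candidate for having pages demoted to a lower
 * tier. Names are hypothetical, not the kernel's. */
static unsigned long toptier_excess(unsigned long nr_toptier_pages,
				    unsigned long soft_limit_pages)
{
	if (nr_toptier_pages > soft_limit_pages)
		return nr_toptier_pages - soft_limit_pages;
	return 0;	/* at or under the limit: nothing to demote */
}
```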

The patchset is based on the cgroup v1 interface. One shortcoming of the
v1 interface is that the limit on a cgroup is a soft limit, so a cgroup
can exceed the limit quite a bit before page demotion reins it in.

We are also working on a cgroup v2 control interface that will have a
max limit on the top tier memory per cgroup, but that requires much
additional logic to fall back and allocate from non top tier memory when
a cgroup reaches its maximum limit. This simpler cgroup v1
implementation, with all its warts, is used to illustrate the concept of
cgroup based top tier memory management and serves as a starting point
for discussion.

The soft limit and soft reclaim logic in this patchset will be similar
to what we would do for a cgroup v2 interface when the high watermark
for top tier usage in a cgroup is reached.

This patchset is applied on top of
[PATCH 00/10] [v7] Migrate Pages in lieu of discard
and
[PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system

It is part of a larger patchset. You can play with the complete set of patches
using the tree:
https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/log/?h=tiering-0.71

Tim Chen (11):
mm: Define top tier memory node mask
mm: Add soft memory limit for mem cgroup
mm: Account the top tier memory usage per cgroup
mm: Report top tier memory usage in sysfs
mm: Add soft_limit_top_tier tree for mem cgroup
mm: Handle top tier memory in cgroup soft limit memory tree utilities
mm: Account the total top tier memory in use
mm: Add toptier option for mem_cgroup_soft_limit_reclaim()
mm: Use kswapd to demote pages when toptier memory is tight
mm: Set toptier_scale_factor via sysctl
mm: Wakeup kswapd if toptier memory need soft reclaim

Documentation/admin-guide/sysctl/vm.rst | 12 +
drivers/base/node.c | 2 +
include/linux/memcontrol.h | 20 +-
include/linux/mm.h | 4 +
include/linux/mmzone.h | 7 +
include/linux/nodemask.h | 1 +
include/linux/vmstat.h | 18 ++
kernel/sysctl.c | 10 +
mm/memcontrol.c | 303 +++++++++++++++++++-----
mm/memory_hotplug.c | 3 +
mm/migrate.c | 1 +
mm/page_alloc.c | 36 ++-
mm/vmscan.c | 73 +++++-
mm/vmstat.c | 22 +-
14 files changed, 444 insertions(+), 68 deletions(-)

--
2.20.1


2021-04-06 08:27:56

by Tim Chen

Subject: [RFC PATCH v1 05/11] mm: Add soft_limit_top_tier tree for mem cgroup

Define a per node soft_limit_top_tier red black tree that sorts and
tracks the cgroups by each cgroup's excess over its toptier soft limit.
A cgroup is added to the tree if it has exceeded its top tier soft limit
and it has used pages on the node.
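A hedged sketch of the membership rule described above (hypothetical helper names, not the patch's code): a cgroup belongs on a node's toptier tree only when it is over its toptier soft limit and it actually has pages on that node:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the insertion condition for the per-node
 * soft_limit_toptier tree; names are illustrative only. */
static bool toptier_tree_candidate(unsigned long toptier_usage,
				   unsigned long toptier_soft_limit,
				   unsigned long pages_on_node)
{
	bool over_limit = toptier_usage > toptier_soft_limit;

	/* must both exceed the limit and have pages on this node */
	return over_limit && pages_on_node > 0;
}
```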

Signed-off-by: Tim Chen <[email protected]>
---
mm/memcontrol.c | 68 +++++++++++++++++++++++++++++++++++++------------
1 file changed, 52 insertions(+), 16 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 68590f46fa76..90a78ff3fca8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -122,6 +122,7 @@ struct mem_cgroup_tree {
};

static struct mem_cgroup_tree soft_limit_tree __read_mostly;
+static struct mem_cgroup_tree soft_limit_toptier_tree __read_mostly;

/* for OOM */
struct mem_cgroup_eventfd_list {
@@ -590,17 +591,27 @@ mem_cgroup_page_nodeinfo(struct mem_cgroup *memcg, struct page *page)
}

static struct mem_cgroup_tree_per_node *
-soft_limit_tree_node(int nid)
-{
- return soft_limit_tree.rb_tree_per_node[nid];
+soft_limit_tree_node(int nid, enum node_states type)
+{
+ switch (type) {
+ case N_MEMORY:
+ return soft_limit_tree.rb_tree_per_node[nid];
+ case N_TOPTIER:
+ if (node_state(nid, N_TOPTIER))
+ return soft_limit_toptier_tree.rb_tree_per_node[nid];
+ else
+ return NULL;
+ default:
+ return NULL;
+ }
}

static struct mem_cgroup_tree_per_node *
-soft_limit_tree_from_page(struct page *page)
+soft_limit_tree_from_page(struct page *page, enum node_states type)
{
int nid = page_to_nid(page);

- return soft_limit_tree.rb_tree_per_node[nid];
+ return soft_limit_tree_node(nid, type);
}

static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
@@ -661,12 +672,24 @@ static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
spin_unlock_irqrestore(&mctz->lock, flags);
}

-static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
+static unsigned long soft_limit_excess(struct mem_cgroup *memcg, enum node_states type)
{
- unsigned long nr_pages = page_counter_read(&memcg->memory);
- unsigned long soft_limit = READ_ONCE(memcg->soft_limit);
+ unsigned long nr_pages;
+ unsigned long soft_limit;
unsigned long excess = 0;

+ switch (type) {
+ case N_MEMORY:
+ nr_pages = page_counter_read(&memcg->memory);
+ soft_limit = READ_ONCE(memcg->soft_limit);
+ break;
+ case N_TOPTIER:
+ nr_pages = page_counter_read(&memcg->toptier);
+ soft_limit = READ_ONCE(memcg->toptier_soft_limit);
+ break;
+ default:
+ return 0;
+ }
if (nr_pages > soft_limit)
excess = nr_pages - soft_limit;

@@ -679,7 +702,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
struct mem_cgroup_per_node *mz;
struct mem_cgroup_tree_per_node *mctz;

- mctz = soft_limit_tree_from_page(page);
+ mctz = soft_limit_tree_from_page(page, N_MEMORY);
if (!mctz)
return;
/*
@@ -688,7 +711,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
*/
for (; memcg; memcg = parent_mem_cgroup(memcg)) {
mz = mem_cgroup_page_nodeinfo(memcg, page);
- excess = soft_limit_excess(memcg);
+ excess = soft_limit_excess(memcg, N_MEMORY);
/*
* We have to update the tree if mz is on RB-tree or
* mem is over its softlimit.
@@ -718,7 +741,7 @@ static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)

for_each_node(nid) {
mz = mem_cgroup_nodeinfo(memcg, nid);
- mctz = soft_limit_tree_node(nid);
+ mctz = soft_limit_tree_node(nid, N_MEMORY);
if (mctz)
mem_cgroup_remove_exceeded(mz, mctz);
}
@@ -742,7 +765,7 @@ __mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
* position in the tree.
*/
__mem_cgroup_remove_exceeded(mz, mctz);
- if (!soft_limit_excess(mz->memcg) ||
+ if (!soft_limit_excess(mz->memcg, N_MEMORY) ||
!css_tryget(&mz->memcg->css))
goto retry;
done:
@@ -1805,7 +1828,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
.pgdat = pgdat,
};

- excess = soft_limit_excess(root_memcg);
+ excess = soft_limit_excess(root_memcg, N_MEMORY);

while (1) {
victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
@@ -1834,7 +1857,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
total += mem_cgroup_shrink_node(victim, gfp_mask, false,
pgdat, &nr_scanned);
*total_scanned += nr_scanned;
- if (!soft_limit_excess(root_memcg))
+ if (!soft_limit_excess(root_memcg, N_MEMORY))
break;
}
mem_cgroup_iter_break(root_memcg, victim);
@@ -3457,7 +3480,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
if (order > 0)
return 0;

- mctz = soft_limit_tree_node(pgdat->node_id);
+ mctz = soft_limit_tree_node(pgdat->node_id, N_MEMORY);

/*
* Do not even bother to check the largest node if the root
@@ -3513,7 +3536,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
if (!reclaimed)
next_mz = __mem_cgroup_largest_soft_limit_node(mctz);

- excess = soft_limit_excess(mz->memcg);
+ excess = soft_limit_excess(mz->memcg, N_MEMORY);
/*
* One school of thought says that we should not add
* back the node to the tree if reclaim returns 0.
@@ -7189,6 +7212,19 @@ static int __init mem_cgroup_init(void)
rtpn->rb_rightmost = NULL;
spin_lock_init(&rtpn->lock);
soft_limit_tree.rb_tree_per_node[node] = rtpn;
+
+ if (!node_state(node, N_TOPTIER)) {
+ soft_limit_toptier_tree.rb_tree_per_node[node] = NULL;
+ continue;
+ }
+
+ rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
+ node_online(node) ? node : NUMA_NO_NODE);
+
+ rtpn->rb_root = RB_ROOT;
+ rtpn->rb_rightmost = NULL;
+ spin_lock_init(&rtpn->lock);
+ soft_limit_toptier_tree.rb_tree_per_node[node] = rtpn;
}

return 0;
--
2.20.1

2021-04-06 08:27:57

by Tim Chen

Subject: [RFC PATCH v1 06/11] mm: Handle top tier memory in cgroup soft limit memory tree utilities

Update the utility functions __mem_cgroup_insert_exceeded() and
__mem_cgroup_remove_exceeded() to allow addition and removal of cgroups
from the new red black tree that tracks the cgroups that exceed their
toptier memory limits.

Also update the function mem_cgroup_largest_soft_limit_node() to allow
returning the cgroup that has the largest excess usage of toptier
memory.
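The ordering invariant these utilities rely on can be sketched outside the kernel (a plain unbalanced BST stands in for the kernel's rb-tree; names are illustrative): nodes are keyed by usage-in-excess, so the rightmost node is always the cgroup with the largest excess.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal sketch of the soft limit tree's ordering invariant.
 * Not kernel code: an unbalanced BST keyed by excess pages. */
struct excess_node {
	unsigned long excess;
	struct excess_node *left, *right;
};

static struct excess_node *insert_excess(struct excess_node *root,
					 struct excess_node *nn)
{
	if (!root)
		return nn;
	if (nn->excess < root->excess)
		root->left = insert_excess(root->left, nn);
	else
		root->right = insert_excess(root->right, nn);
	return root;
}

/* Walk right children: the rightmost node has the largest key. */
static struct excess_node *largest_excess(struct excess_node *root)
{
	while (root && root->right)
		root = root->right;
	return root;
}
```

The kernel caches this rightmost node (rb_rightmost) so reclaim can pick the worst offender without walking the tree each time.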

Signed-off-by: Tim Chen <[email protected]>
---
include/linux/memcontrol.h | 9 +++
mm/memcontrol.c | 152 +++++++++++++++++++++++++++----------
2 files changed, 122 insertions(+), 39 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 609d8590950c..0ed8ddfd5436 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -124,6 +124,15 @@ struct mem_cgroup_per_node {
unsigned long usage_in_excess;/* Set to the value by which */
/* the soft limit is exceeded*/
bool on_tree;
+
+ struct rb_node toptier_tree_node; /* RB tree node */
+ unsigned long toptier_usage_in_excess; /* Set to the value by which */
+ /* the soft limit is exceeded*/
+ bool on_toptier_tree;
+
+ bool congested; /* memcg has many dirty pages */
+ /* backed by a congested BDI */
+
struct mem_cgroup *memcg; /* Back pointer, we cannot */
/* use container_of */
};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 90a78ff3fca8..8a7648b79635 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -616,24 +616,44 @@ soft_limit_tree_from_page(struct page *page, enum node_states type)

static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
struct mem_cgroup_tree_per_node *mctz,
- unsigned long new_usage_in_excess)
+ unsigned long new_usage_in_excess,
+ enum node_states type)
{
struct rb_node **p = &mctz->rb_root.rb_node;
- struct rb_node *parent = NULL;
+ struct rb_node *parent = NULL, *mz_tree_node;
struct mem_cgroup_per_node *mz_node;
- bool rightmost = true;
+ bool rightmost = true, *mz_on_tree;
+ unsigned long usage_in_excess, *mz_usage_in_excess;

- if (mz->on_tree)
+ if (type == N_TOPTIER) {
+ mz_usage_in_excess = &mz->toptier_usage_in_excess;
+ mz_tree_node = &mz->toptier_tree_node;
+ mz_on_tree = &mz->on_toptier_tree;
+ } else {
+ mz_usage_in_excess = &mz->usage_in_excess;
+ mz_tree_node = &mz->tree_node;
+ mz_on_tree = &mz->on_tree;
+ }
+
+ if (*mz_on_tree)
return;

- mz->usage_in_excess = new_usage_in_excess;
- if (!mz->usage_in_excess)
+ if (!new_usage_in_excess)
return;
+
while (*p) {
parent = *p;
- mz_node = rb_entry(parent, struct mem_cgroup_per_node,
+ if (type == N_TOPTIER) {
+ mz_node = rb_entry(parent, struct mem_cgroup_per_node,
+ toptier_tree_node);
+ usage_in_excess = mz_node->toptier_usage_in_excess;
+ } else {
+ mz_node = rb_entry(parent, struct mem_cgroup_per_node,
tree_node);
- if (mz->usage_in_excess < mz_node->usage_in_excess) {
+ usage_in_excess = mz_node->usage_in_excess;
+ }
+
+ if (new_usage_in_excess < usage_in_excess) {
p = &(*p)->rb_left;
rightmost = false;
} else {
@@ -642,33 +662,47 @@ static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
}

if (rightmost)
- mctz->rb_rightmost = &mz->tree_node;
+ mctz->rb_rightmost = mz_tree_node;

- rb_link_node(&mz->tree_node, parent, p);
- rb_insert_color(&mz->tree_node, &mctz->rb_root);
- mz->on_tree = true;
+ rb_link_node(mz_tree_node, parent, p);
+ rb_insert_color(mz_tree_node, &mctz->rb_root);
+ *mz_usage_in_excess = new_usage_in_excess;
+ *mz_on_tree = true;
}

static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
- struct mem_cgroup_tree_per_node *mctz)
+ struct mem_cgroup_tree_per_node *mctz,
+ enum node_states type)
{
- if (!mz->on_tree)
+ bool *mz_on_tree;
+ struct rb_node *mz_tree_node;
+
+ if (type == N_TOPTIER) {
+ mz_tree_node = &mz->toptier_tree_node;
+ mz_on_tree = &mz->on_toptier_tree;
+ } else {
+ mz_tree_node = &mz->tree_node;
+ mz_on_tree = &mz->on_tree;
+ }
+
+ if (!(*mz_on_tree))
return;

- if (&mz->tree_node == mctz->rb_rightmost)
- mctz->rb_rightmost = rb_prev(&mz->tree_node);
+ if (mz_tree_node == mctz->rb_rightmost)
+ mctz->rb_rightmost = rb_prev(mz_tree_node);

- rb_erase(&mz->tree_node, &mctz->rb_root);
- mz->on_tree = false;
+ rb_erase(mz_tree_node, &mctz->rb_root);
+ *mz_on_tree = false;
}

static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
- struct mem_cgroup_tree_per_node *mctz)
+ struct mem_cgroup_tree_per_node *mctz,
+ enum node_states type)
{
unsigned long flags;

spin_lock_irqsave(&mctz->lock, flags);
- __mem_cgroup_remove_exceeded(mz, mctz);
+ __mem_cgroup_remove_exceeded(mz, mctz, type);
spin_unlock_irqrestore(&mctz->lock, flags);
}

@@ -696,13 +730,18 @@ static unsigned long soft_limit_excess(struct mem_cgroup *memcg, enum node_state
return excess;
}

-static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
+static void mem_cgroup_update_tree(struct mem_cgroup *bottom_memcg, struct page *page)
{
unsigned long excess;
struct mem_cgroup_per_node *mz;
struct mem_cgroup_tree_per_node *mctz;
+ enum node_states type = N_MEMORY;
+ struct mem_cgroup *memcg;
+
+repeat_toptier:
+ memcg = bottom_memcg;
+ mctz = soft_limit_tree_from_page(page, type);

- mctz = soft_limit_tree_from_page(page, N_MEMORY);
if (!mctz)
return;
/*
@@ -710,27 +749,37 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
* because their event counter is not touched.
*/
for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+ bool on_tree;
+
mz = mem_cgroup_page_nodeinfo(memcg, page);
- excess = soft_limit_excess(memcg, N_MEMORY);
+ excess = soft_limit_excess(memcg, type);
+
+ on_tree = (type == N_MEMORY) ? mz->on_tree: mz->on_toptier_tree;
/*
* We have to update the tree if mz is on RB-tree or
* mem is over its softlimit.
*/
- if (excess || mz->on_tree) {
+ if (excess || on_tree) {
unsigned long flags;

spin_lock_irqsave(&mctz->lock, flags);
/* if on-tree, remove it */
- if (mz->on_tree)
- __mem_cgroup_remove_exceeded(mz, mctz);
+ if (on_tree)
+ __mem_cgroup_remove_exceeded(mz, mctz, type);
+
/*
* Insert again. mz->usage_in_excess will be updated.
* If excess is 0, no tree ops.
*/
- __mem_cgroup_insert_exceeded(mz, mctz, excess);
+ __mem_cgroup_insert_exceeded(mz, mctz, excess, type);
+
spin_unlock_irqrestore(&mctz->lock, flags);
}
}
+ if (type == N_MEMORY) {
+ type = N_TOPTIER;
+ goto repeat_toptier;
+ }
}

static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
@@ -743,12 +792,16 @@ static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
mz = mem_cgroup_nodeinfo(memcg, nid);
mctz = soft_limit_tree_node(nid, N_MEMORY);
if (mctz)
- mem_cgroup_remove_exceeded(mz, mctz);
+ mem_cgroup_remove_exceeded(mz, mctz, N_MEMORY);
+ mctz = soft_limit_tree_node(nid, N_TOPTIER);
+ if (mctz)
+ mem_cgroup_remove_exceeded(mz, mctz, N_TOPTIER);
}
}

static struct mem_cgroup_per_node *
-__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
+__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz,
+ enum node_states type)
{
struct mem_cgroup_per_node *mz;

@@ -757,15 +810,19 @@ __mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
if (!mctz->rb_rightmost)
goto done; /* Nothing to reclaim from */

- mz = rb_entry(mctz->rb_rightmost,
+ if (type == N_TOPTIER)
+ mz = rb_entry(mctz->rb_rightmost,
+ struct mem_cgroup_per_node, toptier_tree_node);
+ else
+ mz = rb_entry(mctz->rb_rightmost,
struct mem_cgroup_per_node, tree_node);
/*
* Remove the node now but someone else can add it back,
* we will to add it back at the end of reclaim to its correct
* position in the tree.
*/
- __mem_cgroup_remove_exceeded(mz, mctz);
- if (!soft_limit_excess(mz->memcg, N_MEMORY) ||
+ __mem_cgroup_remove_exceeded(mz, mctz, type);
+ if (!soft_limit_excess(mz->memcg, type) ||
!css_tryget(&mz->memcg->css))
goto retry;
done:
@@ -773,12 +830,13 @@ __mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
}

static struct mem_cgroup_per_node *
-mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
+mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz,
+ enum node_states type)
{
struct mem_cgroup_per_node *mz;

spin_lock_irq(&mctz->lock);
- mz = __mem_cgroup_largest_soft_limit_node(mctz);
+ mz = __mem_cgroup_largest_soft_limit_node(mctz, type);
spin_unlock_irq(&mctz->lock);
return mz;
}
@@ -3472,7 +3530,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
struct mem_cgroup_per_node *mz, *next_mz = NULL;
unsigned long reclaimed;
int loop = 0;
- struct mem_cgroup_tree_per_node *mctz;
+ struct mem_cgroup_tree_per_node *mctz, *mctz_sibling;
unsigned long excess;
unsigned long nr_scanned;
int migration_nid;
@@ -3481,6 +3539,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
return 0;

mctz = soft_limit_tree_node(pgdat->node_id, N_MEMORY);
+ mctz_sibling = soft_limit_tree_node(pgdat->node_id, N_TOPTIER);

/*
* Do not even bother to check the largest node if the root
@@ -3516,7 +3575,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
if (next_mz)
mz = next_mz;
else
- mz = mem_cgroup_largest_soft_limit_node(mctz);
+ mz = mem_cgroup_largest_soft_limit_node(mctz, N_MEMORY);
if (!mz)
break;

@@ -3526,7 +3585,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
nr_reclaimed += reclaimed;
*total_scanned += nr_scanned;
spin_lock_irq(&mctz->lock);
- __mem_cgroup_remove_exceeded(mz, mctz);
+ __mem_cgroup_remove_exceeded(mz, mctz, N_MEMORY);

/*
* If we failed to reclaim anything from this memory cgroup
@@ -3534,7 +3593,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
*/
next_mz = NULL;
if (!reclaimed)
- next_mz = __mem_cgroup_largest_soft_limit_node(mctz);
+ next_mz =
+ __mem_cgroup_largest_soft_limit_node(mctz, N_MEMORY);

excess = soft_limit_excess(mz->memcg, N_MEMORY);
/*
@@ -3546,8 +3606,20 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
* term TODO.
*/
/* If excess == 0, no tree ops */
- __mem_cgroup_insert_exceeded(mz, mctz, excess);
+ __mem_cgroup_insert_exceeded(mz, mctz, excess, N_MEMORY);
spin_unlock_irq(&mctz->lock);
+
+ /* update both affected N_MEMORY and N_TOPTIER trees */
+ if (mctz_sibling) {
+ spin_lock_irq(&mctz_sibling->lock);
+ __mem_cgroup_remove_exceeded(mz, mctz_sibling,
+ N_TOPTIER);
+ excess = soft_limit_excess(mz->memcg, N_TOPTIER);
+ __mem_cgroup_insert_exceeded(mz, mctz, excess,
+ N_TOPTIER);
+ spin_unlock_irq(&mctz_sibling->lock);
+ }
+
css_put(&mz->memcg->css);
loop++;
/*
@@ -5312,6 +5384,8 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
lruvec_init(&pn->lruvec);
pn->usage_in_excess = 0;
pn->on_tree = false;
+ pn->toptier_usage_in_excess = 0;
+ pn->on_toptier_tree = false;
pn->memcg = memcg;

memcg->nodeinfo[node] = pn;
--
2.20.1

2021-04-06 08:28:00

by Tim Chen

Subject: [RFC PATCH v1 07/11] mm: Account the total top tier memory in use

Track the global top tier memory usage stats. They are used as the
basis for deciding when to start demoting pages from memory cgroups that
have exceeded their soft limit. We start reclaiming top tier memory when
free top tier memory is low.
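The accounting rule in this patch can be reduced to a small sketch (plain longs stand in for the kernel's atomic_long_t; names are illustrative): every zone stat update is folded into the global counter, and updates on top tier nodes are additionally folded into the global top tier counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the dual accounting done by zone_page_state_add():
 * all updates hit the global counter, only top tier nodes also
 * hit the top tier counter. Illustrative, not kernel code. */
static long vm_stat_total;
static long vm_stat_toptier;

static void zone_stat_add(long delta, bool node_is_toptier)
{
	vm_stat_total += delta;
	if (node_is_toptier)
		vm_stat_toptier += delta;
}
```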

Signed-off-by: Tim Chen <[email protected]>
---
include/linux/vmstat.h | 18 ++++++++++++++++++
mm/vmstat.c | 20 +++++++++++++++++---
2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index e1a4fa9abb3a..a3ad5a937fd8 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -139,6 +139,7 @@ static inline void vm_events_fold_cpu(int cpu)
* Zone and node-based page accounting with per cpu differentials.
*/
extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
+extern atomic_long_t vm_toptier_zone_stat[NR_VM_ZONE_STAT_ITEMS];
extern atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];

@@ -175,6 +176,8 @@ static inline void zone_page_state_add(long x, struct zone *zone,
{
atomic_long_add(x, &zone->vm_stat[item]);
atomic_long_add(x, &vm_zone_stat[item]);
+ if (node_state(zone->zone_pgdat->node_id, N_TOPTIER))
+ atomic_long_add(x, &vm_toptier_zone_stat[item]);
}

static inline void node_page_state_add(long x, struct pglist_data *pgdat,
@@ -212,6 +215,17 @@ static inline unsigned long global_node_page_state(enum node_stat_item item)
return global_node_page_state_pages(item);
}

+static inline unsigned long global_toptier_zone_page_state(enum zone_stat_item item)
+{
+ long x = atomic_long_read(&vm_toptier_zone_stat[item]);
+
+#ifdef CONFIG_SMP
+ if (x < 0)
+ x = 0;
+#endif
+ return x;
+}
+
static inline unsigned long zone_page_state(struct zone *zone,
enum zone_stat_item item)
{
@@ -325,6 +339,8 @@ static inline void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
atomic_long_inc(&zone->vm_stat[item]);
atomic_long_inc(&vm_zone_stat[item]);
+ if (node_state(zone->zone_pgdat->node_id, N_TOPTIER))
+ atomic_long_inc(&vm_toptier_zone_stat[item]);
}

static inline void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
@@ -337,6 +353,8 @@ static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
{
atomic_long_dec(&zone->vm_stat[item]);
atomic_long_dec(&vm_zone_stat[item]);
+ if (node_state(zone->zone_pgdat->node_id, N_TOPTIER))
+ atomic_long_dec(&vm_toptier_zone_stat[item]);
}

static inline void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f299d2e89acb..b59efbcaef4e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -161,9 +161,11 @@ void vm_events_fold_cpu(int cpu)
* vm_stat contains the global counters
*/
atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
+atomic_long_t vm_toptier_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS] __cacheline_aligned_in_smp;
atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS] __cacheline_aligned_in_smp;
EXPORT_SYMBOL(vm_zone_stat);
+EXPORT_SYMBOL(vm_toptier_zone_stat);
EXPORT_SYMBOL(vm_numa_stat);
EXPORT_SYMBOL(vm_node_stat);

@@ -695,7 +697,7 @@ EXPORT_SYMBOL(dec_node_page_state);
* Returns the number of counters updated.
*/
#ifdef CONFIG_NUMA
-static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff)
+static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff, int *toptier_diff)
{
int i;
int changes = 0;
@@ -717,6 +719,11 @@ static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff)
atomic_long_add(node_diff[i], &vm_node_stat[i]);
changes++;
}
+
+ for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
+ if (toptier_diff[i]) {
+ atomic_long_add(toptier_diff[i], &vm_toptier_zone_stat[i]);
+ }
return changes;
}
#else
@@ -762,6 +769,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
struct zone *zone;
int i;
int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+ int global_toptier_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
#ifdef CONFIG_NUMA
int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, };
#endif
@@ -779,6 +787,9 @@ static int refresh_cpu_vm_stats(bool do_pagesets)

atomic_long_add(v, &zone->vm_stat[i]);
global_zone_diff[i] += v;
+ if (node_state(zone->zone_pgdat->node_id, N_TOPTIER)) {
+ global_toptier_diff[i] +=v;
+ }
#ifdef CONFIG_NUMA
/* 3 seconds idle till flush */
__this_cpu_write(p->expire, 3);
@@ -846,7 +857,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)

#ifdef CONFIG_NUMA
changes += fold_diff(global_zone_diff, global_numa_diff,
- global_node_diff);
+ global_node_diff, global_toptier_diff);
#else
changes += fold_diff(global_zone_diff, global_node_diff);
#endif
@@ -868,6 +879,7 @@ void cpu_vm_stats_fold(int cpu)
int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, };
#endif
int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
+ int global_toptier_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };

for_each_populated_zone(zone) {
struct per_cpu_pageset *p;
@@ -910,11 +922,13 @@ void cpu_vm_stats_fold(int cpu)
p->vm_node_stat_diff[i] = 0;
atomic_long_add(v, &pgdat->vm_stat[i]);
global_node_diff[i] += v;
+ if (node_state(pgdat->node_id, N_TOPTIER))
+ global_toptier_diff[i] +=v;
}
}

#ifdef CONFIG_NUMA
- fold_diff(global_zone_diff, global_numa_diff, global_node_diff);
+ fold_diff(global_zone_diff, global_numa_diff, global_node_diff, global_toptier_diff);
#else
fold_diff(global_zone_diff, global_node_diff);
#endif
--
2.20.1

2021-04-06 08:28:20

by Tim Chen

Subject: [RFC PATCH v1 11/11] mm: Wakeup kswapd if toptier memory need soft reclaim

Detect during page allocation whether free toptier memory is low.
If so, wake up kswapd to reclaim memory from those mem cgroups
that have exceeded their limit.
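The wakeup condition can be sketched as follows (hypothetical helper names; the real check in pgdat_toptier_balanced() works on per-zone watermarks): a node's top tier memory counts as balanced while free pages stay above the toptier watermark, and kswapd is woken when it is not.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the kswapd wakeup decision added in this patch;
 * illustrative only, not the patch's code. */
static bool toptier_balanced(unsigned long free_pages,
			     unsigned long toptier_wmark)
{
	return free_pages > toptier_wmark;
}

static bool should_wake_kswapd(unsigned long free_pages,
			       unsigned long toptier_wmark)
{
	/* wake kswapd to demote pages when below the watermark */
	return !toptier_balanced(free_pages, toptier_wmark);
}
```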

Signed-off-by: Tim Chen <[email protected]>
---
include/linux/mmzone.h | 3 +++
mm/page_alloc.c | 2 ++
mm/vmscan.c | 2 +-
3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 789319dffe1c..3603948e95cc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -886,6 +886,8 @@ bool zone_watermark_ok(struct zone *z, unsigned int order,
unsigned int alloc_flags);
bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
unsigned long mark, int highest_zoneidx);
+bool pgdat_toptier_balanced(pg_data_t *pgdat, int order, int classzone_idx);
+
/*
* Memory initialization context, use to differentiate memory added by
* the platform statically or via memory hotplug interface.
@@ -1466,5 +1468,6 @@ void sparse_init(void);
#endif

#endif /* !__GENERATING_BOUNDS.H */
+
#endif /* !__ASSEMBLY__ */
#endif /* _LINUX_MMZONE_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91212a837d8e..ca8aa789a967 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3519,6 +3519,8 @@ struct page *rmqueue(struct zone *preferred_zone,
if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) {
clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
wakeup_kswapd(zone, 0, 0, zone_idx(zone));
+ } else if (!pgdat_toptier_balanced(zone->zone_pgdat, order, zone_idx(zone))) {
+ wakeup_kswapd(zone, 0, 0, zone_idx(zone));
}

VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 270880c8baef..8fe709e3f5e4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3625,7 +3625,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
return false;
}

-static bool pgdat_toptier_balanced(pg_data_t *pgdat, int order, int classzone_idx)
+bool pgdat_toptier_balanced(pg_data_t *pgdat, int order, int classzone_idx)
{
int i;
unsigned long mark;
--
2.20.1

2021-04-06 08:28:34

by Tim Chen

Subject: [RFC PATCH v1 10/11] mm: Set toptier_scale_factor via sysctl

Update the toptier_scale_factor via sysctl. This variable determines
when kswapd wakes up to reclaim toptier memory from those mem cgroups
exceeding their toptier memory limit.
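Assuming toptier_scale_factor follows the same convention as the existing watermark_scale_factor (a fraction of a zone's managed pages in units of 1/10000), the toptier watermark recomputed by setup_per_zone_wmarks() would be derived roughly like this sketch (not the patch's code):

```c
#include <assert.h>

/* Hedged sketch: toptier watermark as a 1/10000 fraction of
 * managed pages, mirroring watermark_scale_factor semantics. */
static unsigned long toptier_wmark_pages(unsigned long managed_pages,
					 int toptier_scale_factor)
{
	return (managed_pages * toptier_scale_factor) / 10000;
}
```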

Signed-off-by: Tim Chen <[email protected]>
---
include/linux/mm.h | 4 ++++
include/linux/mmzone.h | 2 ++
kernel/sysctl.c | 10 ++++++++++
mm/page_alloc.c | 15 +++++++++++++++
mm/vmstat.c | 2 ++
5 files changed, 33 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a43429d51fc0..af39e221d0f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3179,6 +3179,10 @@ static inline bool debug_guardpage_enabled(void) { return false; }
static inline bool page_is_guard(struct page *page) { return false; }
#endif /* CONFIG_DEBUG_PAGEALLOC */

+#ifdef CONFIG_MIGRATION
+extern int toptier_scale_factor;
+#endif
+
#if MAX_NUMNODES > 1
void __init setup_nr_node_ids(void);
#else
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4ee0073d255f..789319dffe1c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1003,6 +1003,8 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *, int, void *, size_t *,
loff_t *);
int watermark_scale_factor_sysctl_handler(struct ctl_table *, int, void *,
size_t *, loff_t *);
+int toptier_scale_factor_sysctl_handler(struct ctl_table *, int,
+ void __user *, size_t *, loff_t *);
extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int, void *,
size_t *, loff_t *);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 57f89fe1b0f2..e97c974f37b7 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -112,6 +112,7 @@ static int sixty = 60;
#endif

static int __maybe_unused neg_one = -1;
+static int __maybe_unused one = 1;
static int __maybe_unused two = 2;
static int __maybe_unused three = 3;
static int __maybe_unused four = 4;
@@ -2956,6 +2957,15 @@ static struct ctl_table vm_table[] = {
.extra1 = SYSCTL_ONE,
.extra2 = &one_thousand,
},
+ {
+ .procname = "toptier_scale_factor",
+ .data = &toptier_scale_factor,
+ .maxlen = sizeof(toptier_scale_factor),
+ .mode = 0644,
+ .proc_handler = toptier_scale_factor_sysctl_handler,
+ .extra1 = &one,
+ .extra2 = &ten_thousand,
+ },
{
.procname = "percpu_pagelist_fraction",
.data = &percpu_pagelist_fraction,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 20f3caee60f3..91212a837d8e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8094,6 +8094,21 @@ int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
return 0;
}

+int toptier_scale_factor_sysctl_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ int rc;
+
+ rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ if (rc)
+ return rc;
+
+ if (write)
+ setup_per_zone_wmarks();
+
+ return 0;
+}
+
#ifdef CONFIG_NUMA
static void setup_min_unmapped_ratio(void)
{
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b59efbcaef4e..c581753cf076 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1658,6 +1658,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
"\n min %lu"
"\n low %lu"
"\n high %lu"
+ "\n toptier %lu"
"\n spanned %lu"
"\n present %lu"
"\n managed %lu",
@@ -1665,6 +1666,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
min_wmark_pages(zone),
low_wmark_pages(zone),
high_wmark_pages(zone),
+ toptier_wmark_pages(zone),
zone->spanned_pages,
zone->present_pages,
zone_managed_pages(zone));
--
2.20.1

2021-04-06 18:22:44

by Michal Hocko

Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

On Mon 05-04-21 10:08:24, Tim Chen wrote:
[...]
> To make fine grain cgroup based management of the precious top tier
> DRAM memory possible, this patchset adds a few new features:
> 1. Provides memory monitors on the amount of top tier memory used per cgroup
> and by the system as a whole.
> 2. Applies soft limits on the top tier memory each cgroup uses
> 3. Enables kswapd to demote top tier pages from cgroup with excess top
> tier memory usages.

Could you be more specific on how this interface is supposed to be used?

> This allows us to provision different amount of top tier memory to each
> cgroup according to the cgroup's latency need.
>
> The patchset is based on cgroup v1 interface. One shortcoming of the v1
> interface is the limit on the cgroup is a soft limit, so a cgroup can
> exceed the limit quite a bit before reclaim before page demotion reins
> it in.

I have to say that I dislike abusing soft limit reclaim for this. In the
past we have learned that the existing implementation is unfixable and
changing the existing semantics is impossible due to backward compatibility.
So I would really prefer that the soft limit be laid to rest rather than
see new potential use cases.

I haven't really looked into the details of this patchset but from a cursory
look it seems like you are actually introducing NUMA-aware limits into
memcg that would control consumption from some nodes differently than
other nodes. This would be a rather alien concept to the existing memcg
infrastructure IMO. It looks like it is blurring the border between the
memcg and cpuset controllers.

You also seem to be basing the interface on the very specific usecase.
Can we expect that there will be many different tiers requiring their
own balancing?

--
Michal Hocko
SUSE Labs

2021-04-07 22:46:45

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory



On 4/6/21 2:08 AM, Michal Hocko wrote:
> On Mon 05-04-21 10:08:24, Tim Chen wrote:
> [...]
>> To make fine grain cgroup based management of the precious top tier
>> DRAM memory possible, this patchset adds a few new features:
>> 1. Provides memory monitors on the amount of top tier memory used per cgroup
>> and by the system as a whole.
>> 2. Applies soft limits on the top tier memory each cgroup uses
>> 3. Enables kswapd to demote top tier pages from cgroup with excess top
>> tier memory usages.
>

Michal,

Thanks for giving your feedback. Much appreciated.

> Could you be more specific on how this interface is supposed to be used?

We created a README section on the cgroup control part of this patchset:
https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.71&id=20f20be02671384470c7cd8f66b56a9061a4071f
to illustrate how this interface should be used.

The top tier memory used is reported in

memory.toptier_usage_in_bytes

The amount of top tier memory usable by each cgroup without
triggering page reclaim is controlled by the

memory.toptier_soft_limit_in_bytes

knob for each cgroup.

We anticipate that for cgroup v2, we will have

memory_toptier.max (max allowed top tier memory)
memory_toptier.high (aggressive page demotion from top tier memory)
memory_toptier.min (do not demote pages from top tier memory below this threshold)

This is analogous to the existing memory.max, memory.high and memory.min
controls.
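The knob names above come from the patchset's README; the cgroup mount point and helper names below are illustrative only. A minimal user-space sketch of driving the v1 interface:

```python
# Illustrative helpers for the cgroup v1 knobs described above. The file
# names (memory.toptier_usage_in_bytes, memory.toptier_soft_limit_in_bytes)
# come from the patchset; the cgroup directory is a placeholder for
# whatever hierarchy your system mounts.

from pathlib import Path

def read_toptier_usage(cgroup_dir: str) -> int:
    """Read a cgroup's current top tier memory usage in bytes."""
    return int((Path(cgroup_dir) / "memory.toptier_usage_in_bytes").read_text())

def set_toptier_soft_limit(cgroup_dir: str, limit_bytes: int) -> None:
    """Set the soft limit above which kswapd demotes this cgroup's pages."""
    path = Path(cgroup_dir) / "memory.toptier_soft_limit_in_bytes"
    path.write_text(str(limit_bytes))

# Typical use against a real hierarchy (path is an example only):
#   set_toptier_soft_limit("/sys/fs/cgroup/memory/myjob", 2 * 1024**3)
#   print(read_toptier_usage("/sys/fs/cgroup/memory/myjob"))
```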

>
>> This allows us to provision different amount of top tier memory to each
>> cgroup according to the cgroup's latency need.
>>
>> The patchset is based on cgroup v1 interface. One shortcoming of the v1
>> interface is the limit on the cgroup is a soft limit, so a cgroup can
>> exceed the limit quite a bit before reclaim before page demotion reins
>> it in.
>
> I have to say that I dislike abusing soft limit reclaim for this. In the
> past we have learned that the existing implementation is unfixable and
> changing the existing semantic impossible due to backward compatibility.
> So I would really prefer the soft limit just find its rest rather than
> see new potential usecases.

Do you think we can reuse some of the existing soft reclaim machinery
for the v2 interface?

More particularly, can we treat memory_toptier.high in cgroup v2 as a soft limit?
We would sort the mem cgroups by how much each exceeds memory_toptier.high and
go after the cgroup with the largest excess first for page demotion.
I would appreciate it if you could shed some insight on what could go wrong
with such an approach.
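The policy Tim proposes here can be sketched in a few lines. This is purely illustrative pseudologic (the names and data layout are made up), not code from the patchset:

```python
# Sketch of the proposed demotion policy: sort memcgs by how far they
# exceed their (hypothetical) memory_toptier.high and demote from the
# largest offender first.

def demotion_order(memcgs):
    """memcgs maps name -> (toptier_usage, toptier_high), both in pages.

    Returns the cgroups that are over their limit, largest excess first.
    """
    over = {name: usage - high
            for name, (usage, high) in memcgs.items()
            if usage > high}
    return sorted(over, key=over.get, reverse=True)

# Example: B has the largest excess, so it is demoted first; C is under
# its limit and is left alone.
print(demotion_order({"A": (150, 100), "B": (400, 100), "C": (50, 100)}))
# -> ['B', 'A']
```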

>
> I haven't really looked into details of this patchset but from a cursory
> look it seems like you are actually introducing a NUMA aware limits into
> memcg that would control consumption from some nodes differently than
> other nodes. This would be rather alien concept to the existing memcg
> infrastructure IMO. It looks like it is fusing borders between memcg and
> cputset controllers.

I want to make sure I understand what you mean by NUMA-aware limits.
Yes, the patch set does treat the NUMA nodes differently.
We are putting constraints on the "top tier" RAM nodes vs the lower
tier PMEM nodes. Is this what you mean? I can see it has
some flavor of the cpuset controller. In this case, it doesn't explicitly
mark a node as allowed or forbidden as cpuset does, but puts some constraints
on the usage of a group of nodes.

Do you have suggestions for an alternative controller for allocating tiered memory resources?


>
> You also seem to be basing the interface on the very specific usecase.
> Can we expect that there will be many different tiers requiring their
> own balancing?
>

You mean more than two tiers of memory? We did think a bit about systems
that have something like high bandwidth memory that's faster than DRAM.
Our thought is that allocating and freeing that memory will require
explicit assignment (it is not used by default), so it will be outside the
realm of auto balancing. So at this point, we think two tiers will be good.

Tim

2021-04-08 11:56:21

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

On Wed 07-04-21 15:33:26, Tim Chen wrote:
>
>
> On 4/6/21 2:08 AM, Michal Hocko wrote:
> > On Mon 05-04-21 10:08:24, Tim Chen wrote:
> > [...]
> >> To make fine grain cgroup based management of the precious top tier
> >> DRAM memory possible, this patchset adds a few new features:
> >> 1. Provides memory monitors on the amount of top tier memory used per cgroup
> >> and by the system as a whole.
> >> 2. Applies soft limits on the top tier memory each cgroup uses
> >> 3. Enables kswapd to demote top tier pages from cgroup with excess top
> >> tier memory usages.
> >
>
> Michal,
>
> Thanks for giving your feedback. Much appreciated.
>
> > Could you be more specific on how this interface is supposed to be used?
>
> We created a README section on the cgroup control part of this patchset:
> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.71&id=20f20be02671384470c7cd8f66b56a9061a4071f
> to illustrate how this interface should be used.

I have to confess I didn't get to look at demotion patches yet.

> The top tier memory used is reported in
>
> memory.toptier_usage_in_bytes
>
> The amount of top tier memory usable by each cgroup without
> triggering page reclaim is controlled by the
>
> memory.toptier_soft_limit_in_bytes

Are you trying to say that the soft limit acts as some sort of guarantee?
Does that mean that if the memcg is under memory pressure, top tier
memory is opted out from any reclaim if the usage is not in excess?

From your previous email it sounds more like the limit is evaluated under
global memory pressure to balance specific memcgs which are in
excess when trying to reclaim/demote from a toptier NUMA node.

Soft limit reclaim has several problems. Those are historical and
therefore the behavior cannot be changed. E.g. it goes after the memcg
with the biggest excess (with priority 0 - aka a potential full LRU scan)
and then continues with normal reclaim. This can be really disruptive to
the top user.

So you can likely define a more sane semantic. E.g. push back memcgs
proportionally to their excess, but then we have two different soft limit
behaviors, which is bad as well. I am not really sure there is a sensible
way out of (ab)using the soft limit here.
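The "push back proportionally to excess" alternative Michal mentions can be sketched as follows. This is entirely illustrative (it is neither existing kernel behavior nor code from the patchset):

```python
# Sketch of proportional pushback: instead of fully scanning the biggest
# offender, split a demotion budget across over-limit memcgs in
# proportion to each one's excess over its limit.

def proportional_demotion_targets(memcgs, pages_to_demote):
    """memcgs maps name -> (usage, limit) in pages.

    Returns how many pages to demote from each over-limit memcg.
    Remainder pages from integer division are dropped for simplicity.
    """
    excess = {n: u - l for n, (u, l) in memcgs.items() if u > l}
    total = sum(excess.values())
    if total == 0:
        return {}
    return {n: pages_to_demote * e // total for n, e in excess.items()}

# A has excess 50, B has 300 (total 350), so a 70-page budget splits as
# 70*50//350 = 10 for A and 70*300//350 = 60 for B.
print(proportional_demotion_targets({"A": (150, 100), "B": (400, 100)}, 70))
# -> {'A': 10, 'B': 60}
```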

Also I am not really sure how this is going to be used in practice.
There is no soft limit by default. So opting in would effectively
discriminate against those memcgs. There has been a similar problem with
the soft limit we have in general. Is this really what you are looking for?
What would be a typical use case?

[...]
> >> The patchset is based on cgroup v1 interface. One shortcoming of the v1
> >> interface is the limit on the cgroup is a soft limit, so a cgroup can
> >> exceed the limit quite a bit before reclaim before page demotion reins
> >> it in.
> >
> > I have to say that I dislike abusing soft limit reclaim for this. In the
> > past we have learned that the existing implementation is unfixable and
> > changing the existing semantic impossible due to backward compatibility.
> > So I would really prefer the soft limit just find its rest rather than
> > see new potential usecases.
>
> Do you think we can reuse some of the existing soft reclaim machinery
> for the v2 interface?
>
> More particularly, can we treat memory_toptier.high in cgroup v2 as a soft limit?

No, you should follow the existing limit semantics. The high limit acts as
an allocation throttling interface.

> We sort how much each mem cgroup exceeds memory_toptier.high and
> go after the cgroup that have largest excess first for page demotion.
> Will appreciate if you can shed some insights on what could go wrong
> with such an approach.

This cannot work as a throttling interface.

> > I haven't really looked into details of this patchset but from a cursory
> > look it seems like you are actually introducing a NUMA aware limits into
> > memcg that would control consumption from some nodes differently than
> > other nodes. This would be rather alien concept to the existing memcg
> > infrastructure IMO. It looks like it is fusing borders between memcg and
> > cputset controllers.
>
> Want to make sure I understand what you mean by NUMA aware limits.
> Yes, in the patch set, it does treat the NUMA nodes differently.
> We are putting constraint on the "top tier" RAM nodes vs the lower
> tier PMEM nodes. Is this what you mean?

What I am trying to say (and I have brought that up when demotion has been
discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
The specific technology shouldn't be imprinted into the interface.
Fundamentally you are trying to balance memory among NUMA nodes as we do
not have any other abstraction to use. So rather than talking about top,
secondary, nth tier we have different NUMA nodes with different
characteristics and you want to express your "priorities" for them.

> I can see it does has
> some flavor of cpuset controller. In this case, it doesn't explicitly
> set a node as allowed or forbidden as in cpuset, but put some constraints
> on the usage of a group of nodes.
>
> Do you have suggestions on alternative controller for allocating tiered memory resource?

I am not really sure what would be the best interface, to be honest.
Maybe we want to carve this into memcg in some form of node priorities
for reclaim. None of the existing limits is NUMA aware so far. Maybe
we want a way to say "hammer this node more than others if there is
memory pressure". Not sure that would help your particular use case though.

> > You also seem to be basing the interface on the very specific usecase.
> > Can we expect that there will be many different tiers requiring their
> > own balancing?
> >
>
> You mean more than two tiers of memory? We did think a bit about system
> that has stuff like high bandwidth memory that's faster than DRAM.
> Our thought is usage and freeing of those memory will require
> explicit assignment (not used by default), so will be outside the
> realm of auto balancing. So at this point, we think two tiers will be good.

Please keep in mind that once there is an interface it will be
impossible to change in the future. So do not bind yourself to the 2
tier setups that you have in hand right now.

--
Michal Hocko
SUSE Labs

2021-04-08 17:20:27

by Shakeel Butt

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

Hi Tim,

On Mon, Apr 5, 2021 at 11:08 AM Tim Chen <[email protected]> wrote:
>
> Traditionally, all memory is DRAM. Some DRAM might be closer/faster than
> others NUMA wise, but a byte of media has about the same cost whether it
> is close or far. But, with new memory tiers such as Persistent Memory
> (PMEM). there is a choice between fast/expensive DRAM and slow/cheap
> PMEM.
>
> The fast/expensive memory lives in the top tier of the memory hierachy.
>
> Previously, the patchset
> [PATCH 00/10] [v7] Migrate Pages in lieu of discard
> https://lore.kernel.org/linux-mm/[email protected]/
> provides a mechanism to demote cold pages from DRAM node into PMEM.
>
> And the patchset
> [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system
> https://lore.kernel.org/linux-mm/[email protected]/
> provides a mechanism to promote hot pages in PMEM to the DRAM node
> leveraging autonuma.
>
> The two patchsets together keep the hot pages in DRAM and colder pages
> in PMEM.

Thanks for working on this as this is becoming more and more important
particularly in the data centers where memory is a big portion of the
cost.

I see you have responded to Michal and I will add my more specific
response there. Here I wanted to give my high level concern regarding
using v1's soft limit like semantics for top tier memory.

This patch series aims to distribute/partition top tier memory between
jobs of different priorities. We want high priority jobs to have
preferential access to the top tier memory and we don't want low
priority jobs to hog the top tier memory.

Using v1's soft limit like behavior can potentially cause high
priority jobs to stall to make enough space on top tier memory on
their allocation path and I think this patchset is aiming to reduce
that impact by making kswapd do that work. However I think the more
concerning issue is the low priority job hogging the top tier memory.

The possible ways the low priority job can hog the top tier memory are
by allocating non-movable memory or by mlocking the memory. (Oh there
is also pinning the memory but I don't know if there is a user api to
pin memory?) For the mlocked memory, you need to either modify the
reclaim code or use a different mechanism for demoting cold memory.

Basically I am saying we should put the upfront control (limit) on the
usage of top tier memory by the jobs.

2021-04-08 18:03:11

by Yang Shi

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt <[email protected]> wrote:
>
> Hi Tim,
>
> On Mon, Apr 5, 2021 at 11:08 AM Tim Chen <[email protected]> wrote:
> >
> > Traditionally, all memory is DRAM. Some DRAM might be closer/faster than
> > others NUMA wise, but a byte of media has about the same cost whether it
> > is close or far. But, with new memory tiers such as Persistent Memory
> > (PMEM). there is a choice between fast/expensive DRAM and slow/cheap
> > PMEM.
> >
> > The fast/expensive memory lives in the top tier of the memory hierachy.
> >
> > Previously, the patchset
> > [PATCH 00/10] [v7] Migrate Pages in lieu of discard
> > https://lore.kernel.org/linux-mm/[email protected]/
> > provides a mechanism to demote cold pages from DRAM node into PMEM.
> >
> > And the patchset
> > [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system
> > https://lore.kernel.org/linux-mm/[email protected]/
> > provides a mechanism to promote hot pages in PMEM to the DRAM node
> > leveraging autonuma.
> >
> > The two patchsets together keep the hot pages in DRAM and colder pages
> > in PMEM.
>
> Thanks for working on this as this is becoming more and more important
> particularly in the data centers where memory is a big portion of the
> cost.
>
> I see you have responded to Michal and I will add my more specific
> response there. Here I wanted to give my high level concern regarding
> using v1's soft limit like semantics for top tier memory.
>
> This patch series aims to distribute/partition top tier memory between
> jobs of different priorities. We want high priority jobs to have
> preferential access to the top tier memory and we don't want low
> priority jobs to hog the top tier memory.
>
> Using v1's soft limit like behavior can potentially cause high
> priority jobs to stall to make enough space on top tier memory on
> their allocation path and I think this patchset is aiming to reduce
> that impact by making kswapd do that work. However I think the more
> concerning issue is the low priority job hogging the top tier memory.
>
> The possible ways the low priority job can hog the top tier memory are
> by allocating non-movable memory or by mlocking the memory. (Oh there
> is also pinning the memory but I don't know if there is a user api to
> pin memory?) For the mlocked memory, you need to either modify the
> reclaim code or use a different mechanism for demoting cold memory.

Do you mean a long term pin? RDMA should be able to simply pin the
memory for weeks. A lot of transient pins come from Direct I/O. Those
should be of less concern.

The low priority jobs should be able to be restricted by cpuset, for
example, just keep them on second tier memory nodes. Then all the
above problems are gone.
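Yang's cpuset suggestion amounts to writing only the second tier node IDs into a low priority cgroup's cpuset.mems. A small sketch of that idea; the node numbering and tier classification here are illustrative (on a real system they come from the firmware's description of the memory topology):

```python
# Sketch: confine low priority jobs to second tier (e.g. PMEM) nodes by
# computing the node list to write into a cgroup's cpuset.mems file.

def cpuset_mems_value(all_nodes, toptier_nodes, allow_toptier):
    """Return the node list string for a cgroup's cpuset.mems."""
    allowed = all_nodes if allow_toptier else [n for n in all_nodes
                                               if n not in toptier_nodes]
    return ",".join(str(n) for n in allowed)

# Two DRAM nodes (0, 1) and two PMEM nodes (2, 3): a low priority job
# would get "2,3" written into its cpuset.mems, roughly equivalent to
#   echo 2,3 > /sys/fs/cgroup/cpuset/lowprio/cpuset.mems
print(cpuset_mems_value([0, 1, 2, 3], {0, 1}, allow_toptier=False))
# -> 2,3
```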

>
> Basically I am saying we should put the upfront control (limit) on the
> usage of top tier memory by the jobs.

This sounds similar to what I talked about at LSFMM 2019
(https://lwn.net/Articles/787418/). We used to have some potential
use cases which divided the DRAM:PMEM ratio for different jobs or memcgs
when I was with Alibaba.

In the first place I thought about a per NUMA node limit, but it was
very hard for users to configure correctly unless you know exactly
your memory usage and hot/cold memory distribution.

I'm wondering, just off the top of my head, if we could extend the
semantic of low and min limit. For example, just redefine low and min
to "the limit on top tier memory". Then we could have low priority
jobs have 0 low/min limit.

>

2021-04-08 20:30:49

by Shakeel Butt

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <[email protected]> wrote:
>
> On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt <[email protected]> wrote:
> >
> > Hi Tim,
> >
> > On Mon, Apr 5, 2021 at 11:08 AM Tim Chen <[email protected]> wrote:
> > >
> > > Traditionally, all memory is DRAM. Some DRAM might be closer/faster than
> > > others NUMA wise, but a byte of media has about the same cost whether it
> > > is close or far. But, with new memory tiers such as Persistent Memory
> > > (PMEM). there is a choice between fast/expensive DRAM and slow/cheap
> > > PMEM.
> > >
> > > The fast/expensive memory lives in the top tier of the memory hierachy.
> > >
> > > Previously, the patchset
> > > [PATCH 00/10] [v7] Migrate Pages in lieu of discard
> > > https://lore.kernel.org/linux-mm/[email protected]/
> > > provides a mechanism to demote cold pages from DRAM node into PMEM.
> > >
> > > And the patchset
> > > [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system
> > > https://lore.kernel.org/linux-mm/[email protected]/
> > > provides a mechanism to promote hot pages in PMEM to the DRAM node
> > > leveraging autonuma.
> > >
> > > The two patchsets together keep the hot pages in DRAM and colder pages
> > > in PMEM.
> >
> > Thanks for working on this as this is becoming more and more important
> > particularly in the data centers where memory is a big portion of the
> > cost.
> >
> > I see you have responded to Michal and I will add my more specific
> > response there. Here I wanted to give my high level concern regarding
> > using v1's soft limit like semantics for top tier memory.
> >
> > This patch series aims to distribute/partition top tier memory between
> > jobs of different priorities. We want high priority jobs to have
> > preferential access to the top tier memory and we don't want low
> > priority jobs to hog the top tier memory.
> >
> > Using v1's soft limit like behavior can potentially cause high
> > priority jobs to stall to make enough space on top tier memory on
> > their allocation path and I think this patchset is aiming to reduce
> > that impact by making kswapd do that work. However I think the more
> > concerning issue is the low priority job hogging the top tier memory.
> >
> > The possible ways the low priority job can hog the top tier memory are
> > by allocating non-movable memory or by mlocking the memory. (Oh there
> > is also pinning the memory but I don't know if there is a user api to
> > pin memory?) For the mlocked memory, you need to either modify the
> > reclaim code or use a different mechanism for demoting cold memory.
>
> Do you mean long term pin? RDMA should be able to simply pin the
> memory for weeks. A lot of transient pins come from Direct I/O. They
> should be less concerned.
>
> The low priority jobs should be able to be restricted by cpuset, for
> example, just keep them on second tier memory nodes. Then all the
> above problems are gone.
>

Yes that's an extreme way to overcome the issue but we can do something
less extreme by just (hard) limiting the top tier usage of low priority
jobs.

> >
> > Basically I am saying we should put the upfront control (limit) on the
> > usage of top tier memory by the jobs.
>
> This sounds similar to what I talked about in LSFMM 2019
> (https://lwn.net/Articles/787418/). We used to have some potential
> usecase which divides DRAM:PMEM ratio for different jobs or memcgs
> when I was with Alibaba.
>
> In the first place I thought about per NUMA node limit, but it was
> very hard to configure it correctly for users unless you know exactly
> about your memory usage and hot/cold memory distribution.
>
> I'm wondering, just off the top of my head, if we could extend the
> semantic of low and min limit. For example, just redefine low and min
> to "the limit on top tier memory". Then we could have low priority
> jobs have 0 low/min limit.
>

The low and min limits have semantics similar to the v1's soft limit
for this situation i.e. letting the low priority job occupy top tier
memory and depending on reclaim to take back the excess top tier
memory use of such jobs.

I have some thoughts on NUMA node limits which I will share in the other thread.

2021-04-08 20:51:34

by Yang Shi

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

On Thu, Apr 8, 2021 at 1:29 PM Shakeel Butt <[email protected]> wrote:
>
> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <[email protected]> wrote:
> >
> > On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt <[email protected]> wrote:
> > >
> > > Hi Tim,
> > >
> > > On Mon, Apr 5, 2021 at 11:08 AM Tim Chen <[email protected]> wrote:
> > > >
> > > > Traditionally, all memory is DRAM. Some DRAM might be closer/faster than
> > > > others NUMA wise, but a byte of media has about the same cost whether it
> > > > is close or far. But, with new memory tiers such as Persistent Memory
> > > > (PMEM). there is a choice between fast/expensive DRAM and slow/cheap
> > > > PMEM.
> > > >
> > > > The fast/expensive memory lives in the top tier of the memory hierachy.
> > > >
> > > > Previously, the patchset
> > > > [PATCH 00/10] [v7] Migrate Pages in lieu of discard
> > > > https://lore.kernel.org/linux-mm/[email protected]/
> > > > provides a mechanism to demote cold pages from DRAM node into PMEM.
> > > >
> > > > And the patchset
> > > > [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system
> > > > https://lore.kernel.org/linux-mm/[email protected]/
> > > > provides a mechanism to promote hot pages in PMEM to the DRAM node
> > > > leveraging autonuma.
> > > >
> > > > The two patchsets together keep the hot pages in DRAM and colder pages
> > > > in PMEM.
> > >
> > > Thanks for working on this as this is becoming more and more important
> > > particularly in the data centers where memory is a big portion of the
> > > cost.
> > >
> > > I see you have responded to Michal and I will add my more specific
> > > response there. Here I wanted to give my high level concern regarding
> > > using v1's soft limit like semantics for top tier memory.
> > >
> > > This patch series aims to distribute/partition top tier memory between
> > > jobs of different priorities. We want high priority jobs to have
> > > preferential access to the top tier memory and we don't want low
> > > priority jobs to hog the top tier memory.
> > >
> > > Using v1's soft limit like behavior can potentially cause high
> > > priority jobs to stall to make enough space on top tier memory on
> > > their allocation path and I think this patchset is aiming to reduce
> > > that impact by making kswapd do that work. However I think the more
> > > concerning issue is the low priority job hogging the top tier memory.
> > >
> > > The possible ways the low priority job can hog the top tier memory are
> > > by allocating non-movable memory or by mlocking the memory. (Oh there
> > > is also pinning the memory but I don't know if there is a user api to
> > > pin memory?) For the mlocked memory, you need to either modify the
> > > reclaim code or use a different mechanism for demoting cold memory.
> >
> > Do you mean long term pin? RDMA should be able to simply pin the
> > memory for weeks. A lot of transient pins come from Direct I/O. They
> > should be less concerned.
> >
> > The low priority jobs should be able to be restricted by cpuset, for
> > example, just keep them on second tier memory nodes. Then all the
> > above problems are gone.
> >
>
> Yes that's an extreme way to overcome the issue but we can do less
> extreme by just (hard) limiting the top tier usage of low priority
> jobs.
>
> > >
> > > Basically I am saying we should put the upfront control (limit) on the
> > > usage of top tier memory by the jobs.
> >
> > This sounds similar to what I talked about in LSFMM 2019
> > (https://lwn.net/Articles/787418/). We used to have some potential
> > usecase which divides DRAM:PMEM ratio for different jobs or memcgs
> > when I was with Alibaba.
> >
> > In the first place I thought about per NUMA node limit, but it was
> > very hard to configure it correctly for users unless you know exactly
> > about your memory usage and hot/cold memory distribution.
> >
> > I'm wondering, just off the top of my head, if we could extend the
> > semantic of low and min limit. For example, just redefine low and min
> > to "the limit on top tier memory". Then we could have low priority
> > jobs have 0 low/min limit.
> >
>
> The low and min limits have semantics similar to the v1's soft limit
> for this situation i.e. letting the low priority job occupy top tier
> memory and depending on reclaim to take back the excess top tier
> memory use of such jobs.

I don't get why low priority jobs can *not* use top tier memory at all. I
can see that it may incur latency overhead for high priority jobs. If it
is not allowed, it could be restricted by cpuset without introducing
any new interfaces.

I suppose memory utilization could be maximized by allowing all
jobs to allocate memory from all applicable nodes, then letting the
reclaimer (or something new if needed) migrate the memory to the proper
nodes over time. We could achieve some kind of balance between memory
utilization and resource isolation.

>
> I have some thoughts on NUMA node limits which I will share in the other thread.

Look forward to reading it.

2021-04-09 03:00:41

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

Yang Shi <[email protected]> writes:

> On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt <[email protected]> wrote:
>>
>> Hi Tim,
>>
>> On Mon, Apr 5, 2021 at 11:08 AM Tim Chen <[email protected]> wrote:
>> >
>> > Traditionally, all memory is DRAM. Some DRAM might be closer/faster than
>> > others NUMA wise, but a byte of media has about the same cost whether it
>> > is close or far. But, with new memory tiers such as Persistent Memory
>> > (PMEM). there is a choice between fast/expensive DRAM and slow/cheap
>> > PMEM.
>> >
>> > The fast/expensive memory lives in the top tier of the memory hierachy.
>> >
>> > Previously, the patchset
>> > [PATCH 00/10] [v7] Migrate Pages in lieu of discard
>> > https://lore.kernel.org/linux-mm/[email protected]/
>> > provides a mechanism to demote cold pages from DRAM node into PMEM.
>> >
>> > And the patchset
>> > [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system
>> > https://lore.kernel.org/linux-mm/[email protected]/
>> > provides a mechanism to promote hot pages in PMEM to the DRAM node
>> > leveraging autonuma.
>> >
>> > The two patchsets together keep the hot pages in DRAM and colder pages
>> > in PMEM.
>>
>> Thanks for working on this as this is becoming more and more important
>> particularly in the data centers where memory is a big portion of the
>> cost.
>>
>> I see you have responded to Michal and I will add my more specific
>> response there. Here I wanted to give my high level concern regarding
>> using v1's soft limit like semantics for top tier memory.
>>
>> This patch series aims to distribute/partition top tier memory between
>> jobs of different priorities. We want high priority jobs to have
>> preferential access to the top tier memory and we don't want low
>> priority jobs to hog the top tier memory.
>>
>> Using v1's soft limit like behavior can potentially cause high
>> priority jobs to stall to make enough space on top tier memory on
>> their allocation path and I think this patchset is aiming to reduce
>> that impact by making kswapd do that work. However I think the more
>> concerning issue is the low priority job hogging the top tier memory.
>>
>> The possible ways the low priority job can hog the top tier memory are
>> by allocating non-movable memory or by mlocking the memory. (Oh there
>> is also pinning the memory but I don't know if there is a user api to
>> pin memory?) For the mlocked memory, you need to either modify the
>> reclaim code or use a different mechanism for demoting cold memory.
>
> Do you mean long term pin? RDMA should be able to simply pin the
> memory for weeks. A lot of transient pins come from Direct I/O. They
> should be less concerned.
>
> The low priority jobs should be able to be restricted by cpuset, for
> example, just keep them on second tier memory nodes. Then all the
> above problems are gone.

To optimize the page placement of a process between DRAM and PMEM, we
want to place the hot pages in DRAM and the cold pages in PMEM. But the
memory access pattern changes over time, so we need to migrate pages
between DRAM and PMEM to adapt to the changes.

To avoid hot pages always being pinned in PMEM, one way is to online
the PMEM as movable zones. If so, and if the low priority jobs are
restricted by cpuset to allocate from PMEM only, we may fail to run
quite a few workloads, as discussed in the following thread,
https://lore.kernel.org/linux-mm/[email protected]/

>>
>> Basically I am saying we should put the upfront control (limit) on the
>> usage of top tier memory by the jobs.
>
> This sounds similar to what I talked about in LSFMM 2019
> (https://lwn.net/Articles/787418/). We used to have some potential
> usecase which divides DRAM:PMEM ratio for different jobs or memcgs
> when I was with Alibaba.
>
> In the first place I thought about per NUMA node limit, but it was
> very hard to configure it correctly for users unless you know exactly
> about your memory usage and hot/cold memory distribution.
>
> I'm wondering, just off the top of my head, if we could extend the
> semantic of low and min limit. For example, just redefine low and min
> to "the limit on top tier memory". Then we could have low priority
> jobs have 0 low/min limit.

Per my understanding, memory.low/min are for memory protection
rather than memory limiting; memory.high is for memory limiting.
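To make that distinction concrete, here is a small Python sketch, a deliberately simplified model (not kernel code) of how protection (memory.low) and limiting (memory.high) act differently on a cgroup's usage during reclaim:

```python
def reclaim_target(usage, low, high):
    """Simplified model of cgroup v2 semantics (illustration only).

    - memory.low (and, more strictly, memory.min) protects usage up
      to `low` from reclaim under normal pressure.
    - memory.high limits usage: anything above `high` is reclaimed
      aggressively and the cgroup is throttled.
    Returns the number of bytes that are candidates for reclaim.
    """
    if usage > high:
        # Limiting: push usage back down toward the high boundary.
        return usage - high
    if usage <= low:
        # Protection: nothing reclaimed from this cgroup
        # under normal pressure.
        return 0
    # Between low and high: only the unprotected excess is a
    # candidate for reclaim.
    return usage - low
```

So a cgroup below its protection contributes nothing to reclaim, while one above its limit must be pushed back regardless of pressure elsewhere.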

Best Regards,
Huang, Ying

2021-04-09 07:26:57

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

On Thu 08-04-21 13:29:08, Shakeel Butt wrote:
> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <[email protected]> wrote:
[...]
> > The low priority jobs should be able to be restricted by cpuset, for
> > example, just keep them on second tier memory nodes. Then all the
> > above problems are gone.

Yes, if the aim is to isolate some users from certain numa node then
cpuset is a good fit but as Shakeel says this is very likely not what
this work is aiming for.

> Yes that's an extreme way to overcome the issue but we can do less
> extreme by just (hard) limiting the top tier usage of low priority
> jobs.

Per numa node high/hard limit would help with a more fine grained control.
The configuration would be tricky though. All low priority memcgs would
have to be carefully configured to leave enough for your important
processes. That includes also memory which is not accounted to any
memcg.
The behavior of those limits would be quite tricky for OOM situations
as well due to a lack of NUMA aware oom killer.
--
Michal Hocko
SUSE Labs

2021-04-09 20:51:59

by Yang Shi

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

On Thu, Apr 8, 2021 at 7:58 PM Huang, Ying <[email protected]> wrote:
>
> Yang Shi <[email protected]> writes:
>
> > On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt <[email protected]> wrote:
> >>
> >> Hi Tim,
> >>
> >> On Mon, Apr 5, 2021 at 11:08 AM Tim Chen <[email protected]> wrote:
> >> >
> >> > Traditionally, all memory is DRAM. Some DRAM might be closer/faster than
> >> > others NUMA wise, but a byte of media has about the same cost whether it
> >> > is close or far. But, with new memory tiers such as Persistent Memory
> >> > (PMEM). there is a choice between fast/expensive DRAM and slow/cheap
> >> > PMEM.
> >> >
> >> > The fast/expensive memory lives in the top tier of the memory hierachy.
> >> >
> >> > Previously, the patchset
> >> > [PATCH 00/10] [v7] Migrate Pages in lieu of discard
> >> > https://lore.kernel.org/linux-mm/[email protected]/
> >> > provides a mechanism to demote cold pages from DRAM node into PMEM.
> >> >
> >> > And the patchset
> >> > [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system
> >> > https://lore.kernel.org/linux-mm/[email protected]/
> >> > provides a mechanism to promote hot pages in PMEM to the DRAM node
> >> > leveraging autonuma.
> >> >
> >> > The two patchsets together keep the hot pages in DRAM and colder pages
> >> > in PMEM.
> >>
> >> Thanks for working on this as this is becoming more and more important
> >> particularly in the data centers where memory is a big portion of the
> >> cost.
> >>
> >> I see you have responded to Michal and I will add my more specific
> >> response there. Here I wanted to give my high level concern regarding
> >> using v1's soft limit like semantics for top tier memory.
> >>
> >> This patch series aims to distribute/partition top tier memory between
> >> jobs of different priorities. We want high priority jobs to have
> >> preferential access to the top tier memory and we don't want low
> >> priority jobs to hog the top tier memory.
> >>
> >> Using v1's soft limit like behavior can potentially cause high
> >> priority jobs to stall to make enough space on top tier memory on
> >> their allocation path and I think this patchset is aiming to reduce
> >> that impact by making kswapd do that work. However I think the more
> >> concerning issue is the low priority job hogging the top tier memory.
> >>
> >> The possible ways the low priority job can hog the top tier memory are
> >> by allocating non-movable memory or by mlocking the memory. (Oh there
> >> is also pinning the memory but I don't know if there is a user api to
> >> pin memory?) For the mlocked memory, you need to either modify the
> >> reclaim code or use a different mechanism for demoting cold memory.
> >
> > Do you mean a long term pin? RDMA should be able to simply pin the
> > memory for weeks. A lot of transient pins come from Direct I/O; those
> > should be less of a concern.
> >
> > The low priority jobs should be able to be restricted by cpuset, for
> > example, just keep them on second tier memory nodes. Then all the
> > above problems are gone.
>
> To optimize the page placement of a process between DRAM and PMEM, we
> want to place the hot pages in DRAM and the cold pages in PMEM. But the
> memory access pattern changes over time, so we need to migrate pages
> between DRAM and PMEM to adapt to those changes.
>
> To avoid hot pages being pinned in PMEM forever, one way is to online
> the PMEM as movable zones. If so, and if the low priority jobs are
> restricted by cpuset to allocate from PMEM only, we may fail to run
> quite a few workloads, as discussed in the following thread,
>
> https://lore.kernel.org/linux-mm/[email protected]/

Thanks for sharing the thread. It seems the configuration of movable
zone + node bind is not supported very well, or needs to evolve to
support new use cases.

>
> >>
> >> Basically I am saying we should put the upfront control (limit) on the
> >> usage of top tier memory by the jobs.
> >
> > This sounds similar to what I talked about in LSFMM 2019
> > (https://lwn.net/Articles/787418/). We used to have some potential
> > usecase which divides DRAM:PMEM ratio for different jobs or memcgs
> > when I was with Alibaba.
> >
> > In the first place I thought about per NUMA node limit, but it was
> > very hard to configure it correctly for users unless you know exactly
> > about your memory usage and hot/cold memory distribution.
> >
> > I'm wondering, just off the top of my head, if we could extend the
> > semantic of low and min limit. For example, just redefine low and min
> > to "the limit on top tier memory". Then we could have low priority
> > jobs have 0 low/min limit.
>
> Per my understanding, memory.low/min are for memory protection
> rather than memory limiting; memory.high is for memory limiting.

Yes, it is not a limit. I just misused the term; I actually do mean
protection but typed "limit". Sorry for the confusion.

>
> Best Regards,
> Huang, Ying

2021-04-09 23:27:40

by Tim Chen

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory


On 4/8/21 4:52 AM, Michal Hocko wrote:

>> The top tier memory used is reported in
>>
>> memory.toptier_usage_in_bytes
>>
>> The amount of top tier memory usable by each cgroup without
>> triggering page reclaim is controlled by the
>>
>> memory.toptier_soft_limit_in_bytes
>

Michal,

Thanks for your comments. I would like to take a step back and
look at the eventual goal we envision: a mechanism to partition the
tiered memory between the cgroups.

A typical use case may be a system with two sets of tasks.
One set is very latency sensitive and we desire instantaneous
response from it. The other set runs batch jobs
where latency and performance are not critical. In this case,
we want to carve out enough top tier memory such that the working set
of the latency sensitive tasks can fit entirely in the top tier memory.
The rest of the top tier memory can be assigned to the background tasks.

To achieve such cgroup based tiered memory management, we probably want
something like the following.

For generality, let's say that there are N tiers of memory t_0, t_1 ... t_N-1,
where tier t_0 sits at the top and demotes to the lower tiers.
We envision for this top tier memory t0 the following knobs and counters
in the cgroup memory controller

memory_t0.current Current usage of tier 0 memory by the cgroup.

memory_t0.min If tier 0 memory used by the cgroup falls below this low
boundary, the memory will not be subjected to demotion
to lower tiers to free up memory at tier 0.

memory_t0.low Above this boundary, the tier 0 memory will be subjected
to demotion. The demotion pressure will be proportional
to the overage.

memory_t0.high If tier 0 memory used by the cgroup exceeds this high
boundary, allocation of tier 0 memory by the cgroup will
be throttled. The tier 0 memory used by this cgroup
will also be subjected to heavy demotion.

memory_t0.max This will be a hard usage limit of tier 0 memory on the cgroup.

If needed, memory_t[12...].current/min/low/high for additional tiers can be added.
This follows closely with the design of the general memory controller interface.
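As a sketch of how the semantics above might fit together (the memory_t0.* knobs are a proposal in this thread, not an existing interface, and the threshold values below are made up):

```python
def tier0_pressure(current, t0_min, t0_low, t0_high, t0_max):
    """Decision sketch for the proposed memory_t0.* knobs.

    The knob names and behavior model the proposal in this thread;
    none of this is an existing kernel interface.
    Returns (action, demotion_pressure) for a cgroup's tier 0 usage.
    """
    if current >= t0_max:
        # memory_t0.max: hard usage limit on tier 0 memory.
        return ("deny", 1.0)
    if current >= t0_high:
        # memory_t0.high: throttle allocation, heavy demotion.
        return ("throttle", 1.0)
    if current > t0_low:
        # memory_t0.low: demotion pressure proportional to the overage.
        return ("demote", (current - t0_low) / (t0_high - t0_low))
    if current >= t0_min:
        return ("ok", 0.0)
    # memory_t0.min: usage below this is never demoted.
    return ("protected", 0.0)
```

For example, a cgroup halfway between its low and high boundaries would see demotion pressure of 0.5 under this model.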

Does such an interface look sane and acceptable to everyone?

The patch set I posted is meant to be a straw man cgroup v1 implementation,
and I readily admit that it falls short of the eventual functionality
we want to achieve. It is meant to solicit feedback from everyone on how
tiered memory management should work.

> Are you trying to say that soft limit acts as some sort of guarantee?

No, the soft limit does not offer a guarantee. It only serves to keep the usage
of the top tier memory in the vicinity of the soft limit.

> Does that mean that if the memcg is under memory pressure top tier
> memory is opted out from any reclaim if the usage is not in excess?

In the prototype implementation, regular memory reclaim is still in effect
if we are under heavy memory pressure.

>
> From your previous email it sounds more like the limit is evaluated on
> the global memory pressure to balance specific memcgs which are in
> excess when trying to reclaim/demote a toptier numa node.

On a top tier node, if the free memory on the node falls below a percentage, then
we will start to reclaim/demote from the node.
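A minimal sketch of that trigger, assuming a hypothetical watermark percentage (the actual threshold used by the prototype is not specified here):

```python
def should_demote(free_bytes, total_bytes, wmark_pct=10):
    """Start reclaiming/demoting from a top tier node when its free
    memory falls below a percentage watermark. The 10% default is
    illustrative only; the prototype's real threshold may differ."""
    # Compare with integer math to avoid float rounding on large sizes.
    return free_bytes * 100 < total_bytes * wmark_pct
```

kswapd-style background work would then demote cold pages from the node until free memory rises back above the watermark.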

>
> Soft limit reclaim has several problems. Those are historical and
> therefore the behavior cannot be changed. E.g. go after the biggest
> excessed memcg (with priority 0 - aka potential full LRU scan) and then
> continue with a normal reclaim. This can be really disruptive to the top
> user.

Thanks for pointing out these problems with soft limit explicitly.

>
> So you can likely define a more sane semantic. E.g. push back memcgs
> proportional to their excess but then we have two different soft limits
> behavior which is bad as well. I am not really sure there is a sensible
> way out by (ab)using soft limit here.
>
> Also I am not really sure how this is going to be used in practice.
> There is no soft limit by default. So opting in would effectively
> discriminate those memcgs. There has been a similar problem with the
> soft limit we have in general. Is this really what you are looking for?
> What would be a typical usecase?

>> Want to make sure I understand what you mean by NUMA aware limits.
>> Yes, in the patch set, it does treat the NUMA nodes differently.
>> We are putting constraint on the "top tier" RAM nodes vs the lower
>> tier PMEM nodes. Is this what you mean?
>
> What I am trying to say (and I have brought that up when demotion has been
> discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
> The specific technology shouldn't be imprinted into the interface.
> Fundamentally you are trying to balance memory among NUMA nodes as we do
> not have other abstraction to use. So rather than talking about top,
> secondary, nth tier we have different NUMA nodes with different
> characteristics and you want to express your "priorities" for them.

With node priorities, how would the system reserve enough
high performance memory for those performance critical task cgroups?

By priority, do you mean the order of allocation of nodes for a cgroup?
Or do you mean that all similarly performing memory nodes will be grouped in
the same priority?

Tim

2021-04-12 14:04:40

by Shakeel Butt

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

On Thu, Apr 8, 2021 at 1:50 PM Yang Shi <[email protected]> wrote:
>
[...]

> >
> > The low and min limits have semantics similar to the v1's soft limit
> > for this situation i.e. letting the low priority job occupy top tier
> > memory and depending on reclaim to take back the excess top tier
> > memory use of such jobs.
>
> I don't get why low priority jobs can *not* use top tier memory?

I am saying low priority jobs can use top tier memory. The only
difference is whether to limit them upfront (using limits) or to reclaim
from them later (using min/low/soft-limit).

> I can
> think of it may incur latency overhead for high priority jobs. If it
> is not allowed, it could be restricted by cpuset without introducing
> in any new interfaces.
>
> I suppose memory utilization could be maximized by allowing all
> jobs to allocate memory from all applicable nodes, then letting the
> reclaimer (or something new if needed)

Most probably something new as we do want to consider unevictable
memory as well.

> do the job to migrate the memory to proper
> nodes by time. We could achieve some kind of balance between memory
> utilization and resource isolation.
>

The tradeoff between utilization and isolation should be decided by the user/admin.

2021-04-12 14:05:45

by Shakeel Butt

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

On Thu, Apr 8, 2021 at 4:52 AM Michal Hocko <[email protected]> wrote:
>
[...]
>
> What I am trying to say (and I have brought that up when demotion has been
> discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
> The specific technology shouldn't be imprinted into the interface.
> Fundamentally you are trying to balance memory among NUMA nodes as we do
> not have other abstraction to use. So rather than talking about top,
> secondary, nth tier we have different NUMA nodes with different
> characteristics and you want to express your "priorities" for them.
>

I am also inclined towards the NUMA based approach. It makes the solution
more general, and even existing systems with multiple NUMA nodes and
DRAM can take advantage of it (if it makes sense).

2021-04-13 06:55:43

by Shakeel Butt

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

On Fri, Apr 9, 2021 at 4:26 PM Tim Chen <[email protected]> wrote:
>
>
> On 4/8/21 4:52 AM, Michal Hocko wrote:
>
> >> The top tier memory used is reported in
> >>
> >> memory.toptier_usage_in_bytes
> >>
> >> The amount of top tier memory usable by each cgroup without
> >> triggering page reclaim is controlled by the
> >>
> >> memory.toptier_soft_limit_in_bytes
> >
>
> Michal,
>
> Thanks for your comments. I would like to take a step back and
> look at the eventual goal we envision: a mechanism to partition the
> tiered memory between the cgroups.
>
> A typical use case may be a system with two sets of tasks.
> One set is very latency sensitive and we desire instantaneous
> response from it. The other set runs batch jobs
> where latency and performance are not critical. In this case,
> we want to carve out enough top tier memory such that the working set
> of the latency sensitive tasks can fit entirely in the top tier memory.
> The rest of the top tier memory can be assigned to the background tasks.
>
> To achieve such cgroup based tiered memory management, we probably want
> something like the following.
>
> For generalization let's say that there are N tiers of memory t_0, t_1 ... t_N-1,
> where tier t_0 sits at the top and demotes to the lower tier.
> We envision for this top tier memory t0 the following knobs and counters
> in the cgroup memory controller
>
> memory_t0.current Current usage of tier 0 memory by the cgroup.
>
> memory_t0.min If tier 0 memory used by the cgroup falls below this low
> boundary, the memory will not be subjected to demotion
> to lower tiers to free up memory at tier 0.
>
> memory_t0.low Above this boundary, the tier 0 memory will be subjected
> to demotion. The demotion pressure will be proportional
> to the overage.
>
> memory_t0.high If tier 0 memory used by the cgroup exceeds this high
> boundary, allocation of tier 0 memory by the cgroup will
> be throttled. The tier 0 memory used by this cgroup
> will also be subjected to heavy demotion.
>
> memory_t0.max This will be a hard usage limit of tier 0 memory on the cgroup.
>
> If needed, memory_t[12...].current/min/low/high for additional tiers can be added.
> This follows closely with the design of the general memory controller interface.
>
> Does such an interface look sane and acceptable to everyone?
>

I have a couple of questions. Let's suppose we have a two socket
system. Node 0 (DRAM+CPUs), Node 1 (DRAM+CPUs), Node 2 (PMEM on socket
0 along with Node 0) and Node 3 (PMEM on socket 1 along with Node 1).
Based on the tier definition of this patch series, tier_0: {node_0,
node_1} and tier_1: {node_2, node_3}.
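One way to see the issue is to order memory nodes by access cost from each CPU node using a SLIT-style distance matrix; the distances below are invented purely for illustration:

```python
# Hypothetical SLIT-style distances for the two-socket example:
# node 0/1 = DRAM+CPUs, node 2/3 = PMEM on socket 0/1.
# The numbers are made up for illustration.
distance = {
    0: {0: 10, 1: 21, 2: 17, 3: 28},
    1: {0: 21, 1: 10, 2: 28, 3: 17},
}

def demotion_order(cpu_node):
    """Order memory nodes by access cost as seen from cpu_node.

    With these (assumed) distances, remote PMEM is more expensive
    than remote DRAM, so from node 0 the effective order is
    local DRAM, local PMEM, remote DRAM, remote PMEM -- which a
    static tier_0 = {0, 1} / tier_1 = {2, 3} partition cannot express.
    """
    return sorted(distance[cpu_node], key=lambda n: distance[cpu_node][n])
```

Under this matrix, `demotion_order(0)` is `[0, 2, 1, 3]` and `demotion_order(1)` is `[1, 3, 0, 2]`: node_3 really is a "third tier" from node_0's perspective, and vice versa.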

My questions are:

1) Can we assume that the cost of access within a tier will always be
less than the cost of access across tiers? (node_0 <-> node_1 vs
node_0 <-> node_2)
2) If yes to (1), is that assumption future proof? Will the future
systems with DRAM over CXL support have the same characteristics?
3) Will the cost of access from tier_0 to tier_1 be uniform? (node_0
<-> node_2 vs node_0 <-> node_3). For jobs running on node_0, node_3
might be third tier and similarly for jobs running on node_1, node_2
might be third tier.

The reason I am asking these questions is that statically
partitioning memory nodes into tiers will inherently bake platform
specific assumptions into the user API.

Assumptions like:
1) Access within tier is always cheaper than across tier.
2) Access from tier_i to tier_i+1 has uniform cost.

The reason I am more inclined towards NUMA centric control is
that we don't have to make these assumptions, though the usability
will be more difficult. Greg (CCed) has some ideas on making it better,
and we will share our proposal after polishing it a bit more.

2021-04-13 13:10:55

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

Tim Chen <[email protected]> writes:

> On 4/8/21 4:52 AM, Michal Hocko wrote:
>
>>> The top tier memory used is reported in
>>>
>>> memory.toptier_usage_in_bytes
>>>
>>> The amount of top tier memory usable by each cgroup without
>>> triggering page reclaim is controlled by the
>>>
>>> memory.toptier_soft_limit_in_bytes
>>
>
> Michal,
>
> Thanks for your comments. I would like to take a step back and
> look at the eventual goal we envision: a mechanism to partition the
> tiered memory between the cgroups.
>
> A typical use case may be a system with two sets of tasks.
> One set is very latency sensitive and we desire instantaneous
> response from it. The other set runs batch jobs
> where latency and performance are not critical. In this case,
> we want to carve out enough top tier memory such that the working set
> of the latency sensitive tasks can fit entirely in the top tier memory.
> The rest of the top tier memory can be assigned to the background tasks.
>
> To achieve such cgroup based tiered memory management, we probably want
> something like the following.
>
> For generalization let's say that there are N tiers of memory t_0, t_1 ... t_N-1,
> where tier t_0 sits at the top and demotes to the lower tier.
> We envision for this top tier memory t0 the following knobs and counters
> in the cgroup memory controller
>
> memory_t0.current Current usage of tier 0 memory by the cgroup.
>
> memory_t0.min If tier 0 memory used by the cgroup falls below this low
> boundary, the memory will not be subjected to demotion
> to lower tiers to free up memory at tier 0.
>
> memory_t0.low Above this boundary, the tier 0 memory will be subjected
> to demotion. The demotion pressure will be proportional
> to the overage.
>
> memory_t0.high If tier 0 memory used by the cgroup exceeds this high
> boundary, allocation of tier 0 memory by the cgroup will
> be throttled. The tier 0 memory used by this cgroup
> will also be subjected to heavy demotion.

I think we don't really need throttling here, because we can fall back
to allocating memory from t1. That will not cause something like I/O
device bandwidth saturation.
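A toy sketch of that fallback policy (the function and knob names here are hypothetical, chosen only to mirror the proposed memory_t0.max):

```python
def alloc_tier(t0_usage, t0_max, request):
    """Instead of throttling a cgroup at its tier 0 limit, satisfy the
    allocation from tier 1. Names and semantics are a sketch of the
    suggestion in this thread, not an existing interface."""
    if t0_usage + request <= t0_max:
        return "t0"  # fast DRAM tier has headroom under the cap
    return "t1"      # fall back to the PMEM tier; no throttling needed
```

Since t1 is ordinary byte-addressable memory rather than an I/O device, the fallback does not risk the bandwidth-saturation problem that throttling is meant to solve elsewhere.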

Best Regards,
Huang, Ying

> memory_t0.max This will be a hard usage limit of tier 0 memory on the cgroup.
>
> If needed, memory_t[12...].current/min/low/high for additional tiers can be added.
> This follows closely with the design of the general memory controller interface.
>
> Does such an interface look sane and acceptable to everyone?
>
> The patch set I posted is meant to be a straw man cgroup v1 implementation,
> and I readily admit that it falls short of the eventual functionality
> we want to achieve. It is meant to solicit feedback from everyone on how
> tiered memory management should work.
>
>> Are you trying to say that soft limit acts as some sort of guarantee?
>
> No, the soft limit does not offer a guarantee. It only serves to keep the usage
> of the top tier memory in the vicinity of the soft limit.
>
>> Does that mean that if the memcg is under memory pressure top tier
>> memory is opted out from any reclaim if the usage is not in excess?
>
> In the prototype implementation, regular memory reclaim is still in effect
> if we are under heavy memory pressure.
>
>>
>> From your previous email it sounds more like the limit is evaluated on
>> the global memory pressure to balance specific memcgs which are in
>> excess when trying to reclaim/demote a toptier numa node.
>
> On a top tier node, if the free memory on the node falls below a percentage, then
> we will start to reclaim/demote from the node.
>
>>
>> Soft limit reclaim has several problems. Those are historical and
>> therefore the behavior cannot be changed. E.g. go after the biggest
>> excessed memcg (with priority 0 - aka potential full LRU scan) and then
>> continue with a normal reclaim. This can be really disruptive to the top
>> user.
>
> Thanks for pointing out these problems with soft limit explicitly.
>
>>
>> So you can likely define a more sane semantic. E.g. push back memcgs
>> proportional to their excess but then we have two different soft limits
>> behavior which is bad as well. I am not really sure there is a sensible
>> way out by (ab)using soft limit here.
>>
>> Also I am not really sure how this is going to be used in practice.
>> There is no soft limit by default. So opting in would effectively
>> discriminate those memcgs. There has been a similar problem with the
>> soft limit we have in general. Is this really what you are looking for?
>> What would be a typical usecase?
>
>>> Want to make sure I understand what you mean by NUMA aware limits.
>>> Yes, in the patch set, it does treat the NUMA nodes differently.
>>> We are putting constraint on the "top tier" RAM nodes vs the lower
>>> tier PMEM nodes. Is this what you mean?
>>
>> What I am trying to say (and I have brought that up when demotion has been
>> discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
>> The specific technology shouldn't be imprinted into the interface.
>> Fundamentally you are trying to balance memory among NUMA nodes as we do
>> not have other abstraction to use. So rather than talking about top,
>> secondary, nth tier we have different NUMA nodes with different
>> characteristics and you want to express your "priorities" for them.
>
> With node priorities, how would the system reserve enough
> high performance memory for those performance critical task cgroups?
>
> By priority, do you mean the order of allocation of nodes for a cgroup?
> Or do you mean that all similarly performing memory nodes will be grouped in
> the same priority?
>
> Tim

2021-04-13 14:06:19

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

On Fri 09-04-21 16:26:53, Tim Chen wrote:
>
> On 4/8/21 4:52 AM, Michal Hocko wrote:
>
> >> The top tier memory used is reported in
> >>
> >> memory.toptier_usage_in_bytes
> >>
> >> The amount of top tier memory usable by each cgroup without
> >> triggering page reclaim is controlled by the
> >>
> >> memory.toptier_soft_limit_in_bytes
> >
>
> Michal,
>
> Thanks for your comments. I would like to take a step back and
> look at the eventual goal we envision: a mechanism to partition the
> tiered memory between the cgroups.

OK, this is a good mission statement to start with. I would expect a follow-up
saying what kind of granularity of control you want to achieve here.
Presumably you want more than all or nothing, because that is what
cpusets can be used for.

> A typical use case may be a system with two sets of tasks.
> One set is very latency sensitive and we desire instantaneous
> response from it. The other set runs batch jobs
> where latency and performance are not critical. In this case,
> we want to carve out enough top tier memory such that the working set
> of the latency sensitive tasks can fit entirely in the top tier memory.
> The rest of the top tier memory can be assigned to the background tasks.

While from a very high level this makes sense, I would be interested in
more details. Your highly latency sensitive applications very likely
want to be bound to a high performance node, right? Can they tolerate
memory reclaim? Can they consume more memory than the node size? What do
you expect to happen then?

> To achieve such cgroup based tiered memory management, we probably want
> something like the following.
>
> For generalization let's say that there are N tiers of memory t_0, t_1 ... t_N-1,
> where tier t_0 sits at the top and demotes to the lower tier.

How is each tier defined? Is it an admin-defined set of NUMA nodes, or
is it platform specific?

[...]

> Does such an interface look sane and acceptable to everyone?

Let's talk more about usecases first before we even start talking about
the interface or which controller is the best fit for implementing it.

> The patch set I posted is meant to be a straw man cgroup v1 implementation
> and I readily admits that it falls short of the eventual functionality
> we want to achieve. It is meant to solicit feedback from everyone on how the tiered
> memory management should work.

OK, fair enough. Let me then just state that I strongly believe that
a soft limit based approach is a dead end, and it would be better to focus
on the actual usecases and try to understand what you want to achieve
first.

[...]

> > What I am trying to say (and I have brought that up when demotion has been
> > discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
> > The specific technology shouldn't be imprinted into the interface.
> > Fundamentally you are trying to balance memory among NUMA nodes as we do
> > not have other abstraction to use. So rather than talking about top,
> > secondary, nth tier we have different NUMA nodes with different
> > characteristics and you want to express your "priorities" for them.
>
> With node priorities, how would the system reserve enough
> high performance memory for those performance critical task cgroups?
>
> By priority, do you mean the order of allocation of nodes for a cgroup?
> Or do you mean that all similarly performing memory nodes will be grouped in
> the same priority?

I have to say I do not yet have a clear idea of what those priorities
would look like. I just wanted to outline that the usecases you are
interested in likely want to implement some form of (application
transparent) control of memory distribution over several nodes. There
is a long way to go to land on something more specific, I am afraid.
--
Michal Hocko
SUSE Labs