2021-08-03 06:00:37

by Feng Tang

[permalink] [raw]
Subject: [PATCH v7 0/5] Introduce multi-preference mempolicy

This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.
This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2)
interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a
preference for nodes which will fulfil memory allocation requests. Unlike the
MPOL_PREFERRED mode, it takes a set of nodes. Like the MPOL_BIND interface, it
works over a set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or
invoke the OOM killer if those preferred nodes are not available.

Along with these patches are patches for libnuma, numactl, numademo, and memhog.
They still need some polish, but can be found here:
https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many
It allows new usage: `numactl -P 0,3,4`

The goal of the new mode is to enable some use-cases when using tiered memory
usage models which I've lovingly named.
1a. The Hare - The interconnect is fast enough to meet bandwidth and latency
requirements allowing preference to be given to all nodes with "fast" memory.
1b. The Indiscriminate Hare - An application knows it wants fast memory (or
perhaps slow memory), but doesn't care which node it runs on. The application
can prefer a set of nodes and then xpu bind to the local node (cpu, accelerator,
etc). This reverses the nodes are chosen today where the kernel attempts to use
local memory to the CPU whenever possible. This will attempt to use the local
accelerator to the memory.
2. The Tortoise - The administrator (or the application itself) is aware it only
needs slow memory, and so can prefer that.

Much of this is almost achievable with the bind interface, but the bind
interface suffers from an inability to fallback to another set of nodes if
binding fails to all nodes in the nodemask.

Like MPOL_BIND a nodemask is given. Inherently this removes ordering from the
preference.

> /* Set first two nodes as preferred in an 8 node system. */
> const unsigned long nodes = 0x3
> set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);

> /* Mimic interleave policy, but have fallback *.
> const unsigned long nodes = 0xaa
> set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);

Some internal discussion took place around the interface. There are two
alternatives which we have discussed, plus one I stuck in:
1. Ordered list of nodes. Currently it's believed that the added complexity is
nod needed for expected usecases.
2. A flag for bind to allow falling back to other nodes. This confuses the
notion of binding and is less flexible than the current solution.
3. Create flags or new modes that helps with some ordering. This offers both a
friendlier API as well as a solution for more customized usage. It's unknown
if it's worth the complexity to support this. Here is sample code for how
this might work:

> // Prefer specific nodes for some something wacky
> set_mempolicy(MPOL_PREFER_MANY, 0x17c, 1024);
>
> // Default
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> // which is the same as
> set_mempolicy(MPOL_DEFAULT, NULL, 0);
>
> // The Hare
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
>
> // The Tortoise
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
>
> // Prefer the fast memory of the first two sockets
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
>

In v1, Andi Kleen brought up reusing MPOL_PREFERRED as the mode for the API.
There wasn't consensus around this, so I've left the existing API as it was. I'm
open to more feedback here, but my slight preference is to use a new API as it
ensures if people are using it, they are entirely aware of what they're doing
and not accidentally misusing the old interface. (In a similar way to how
MPOL_LOCAL was introduced).

In v1, Michal also brought up renaming this MPOL_PREFERRED_MASK. I'm equally
fine with that change, but I hadn't heard much emphatic support for one way or
another, so I've left that too.

- Ben/Dave/Feng

---
Changelog:

Sice v6:
* merge the 2/6, 3/6 patch into one (Michal Hocko)
* change the policy_node and policy_mask handling (Michal Hocko)
* refine the kernel doc for 'prefer-many' policy (Michal Hocko)

Since v5:
* Rebased against 5.14-rc1.

Since v4:
* Rebased on latest -mm tree (v5.13-rc), whose mempolicy code has
been refactored much since v4 submission
* add a dedicated alloc_page_preferred_many() (Michal Hocko)
* refactor and add fix to hugetlb supporting code (Michal Hocko)

Since v3:
* Rebased against v5.12-rc2
* Drop the v3/0013 patch of creating NO_SLOWPATH gfp_mask bit
* Skip direct reclaim for the first allocation try for
MPOL_PREFERRED_MANY, which makes its semantics close to
existing MPOL_PREFFERRED policy

Since v2:
* Rebased against v5.11
* Fix a stack overflow related panic, and a kernel warning (Feng)
* Some code clearup (Feng)
* One RFC patch to speedup mem alloc in some case (Feng)

Since v1:
* Dropped patch to replace numa_node_id in some places (mhocko)
* Dropped all the page allocation patches in favor of new mechanism to
use fallbacks. (mhocko)
* Dropped the special snowflake preferred node algorithm (bwidawsk)
* If the preferred node fails, ALL nodes are rechecked instead of just
the non-preferred nodes.


Ben Widawsky (2):
mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
mm/mempolicy: Advertise new MPOL_PREFERRED_MANY

Dave Hansen (1):
mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes

Feng Tang (2):
mm/memplicy: add page allocation function for MPOL_PREFERRED_MANY
policy
mm/mempolicy: unify the create() func for bind/interleave/prefer-many
policies

.../admin-guide/mm/numa_memory_policy.rst | 15 +++-
include/uapi/linux/mempolicy.h | 1 +
mm/hugetlb.c | 27 ++++++
mm/mempolicy.c | 98 +++++++++++++++++-----
4 files changed, 114 insertions(+), 27 deletions(-)

--
2.14.1



2021-08-03 06:00:47

by Feng Tang

[permalink] [raw]
Subject: [PATCH v7 3/5] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY

From: Ben Widawsky <[email protected]>

Implement the missing huge page allocation functionality while obeying
the preferred node semantics. This is similar to the implementation
for general page allocation, as it uses a fallback mechanism to try
multiple preferred nodes first, and then all other nodes.

[akpm: fix compling issue when merging with other hugetlb patch]
[Thanks to 0day bot for catching the missing #ifdef CONFIG_NUMA issue]
Link: https://lore.kernel.org/r/[email protected]
Suggested-by: Michal Hocko <[email protected]>
Signed-off-by: Ben Widawsky <[email protected]>
Co-developed-by: Feng Tang <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
mm/hugetlb.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 95714fb28150..9279f6d478d9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1166,7 +1166,20 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,

gfp_mask = htlb_alloc_mask(h);
nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
+#ifdef CONFIG_NUMA
+ if (mpol->mode == MPOL_PREFERRED_MANY) {
+ page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
+ if (page)
+ goto check_reserve;
+ /* Fallback to all nodes */
+ nodemask = NULL;
+ }
+#endif
page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
+
+#ifdef CONFIG_NUMA
+check_reserve:
+#endif
if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
SetHPageRestoreReserve(page);
h->resv_huge_pages--;
@@ -2147,6 +2160,21 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
nodemask_t *nodemask;

nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
+#ifdef CONFIG_NUMA
+ if (mpol->mode == MPOL_PREFERRED_MANY) {
+ gfp_t gfp = gfp_mask | __GFP_NOWARN;
+
+ gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
+ page = alloc_surplus_huge_page(h, gfp, nid, nodemask, false);
+ if (page) {
+ mpol_cond_put(mpol);
+ return page;
+ }
+
+ /* Fallback to all nodes */
+ nodemask = NULL;
+ }
+#endif
page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask, false);
mpol_cond_put(mpol);

--
2.14.1


2021-08-03 06:01:13

by Feng Tang

[permalink] [raw]
Subject: [PATCH v7 2/5] mm/memplicy: add page allocation function for MPOL_PREFERRED_MANY policy

The semantics of MPOL_PREFERRED_MANY is similar to MPOL_PREFERRED,
that it will first try to allocate memory from the preferred node(s),
and fallback to all nodes in system when first try fails.

Add a dedicated function alloc_pages_preferred_many() for it just
like for 'interleave' policy, which will be used by 2 general
memoory allocation APIs: alloc_pages() and alloc_pages_vma()

Link: https://lore.kernel.org/r/[email protected]
Suggested-by: Michal Hocko <[email protected]>
Originally-by: Ben Widawsky <[email protected]>
Co-developed-by: Ben Widawsky <[email protected]>
Signed-off-by: Ben Widawsky <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
mm/mempolicy.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 72f7ff760989..a00bb1c48a15 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2166,6 +2166,27 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
return page;
}

+static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
+ int nid, struct mempolicy *pol)
+{
+ struct page *page;
+ gfp_t preferred_gfp;
+
+ /*
+ * This is a two pass approach. The first pass will only try the
+ * preferred nodes but skip the direct reclaim and allow the
+ * allocation to fail, while the second pass will try all the
+ * nodes in system.
+ */
+ preferred_gfp = gfp | __GFP_NOWARN;
+ preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
+ page = __alloc_pages(preferred_gfp, order, nid, &pol->nodes);
+ if (!page)
+ page = __alloc_pages(gfp, order, numa_node_id(), NULL);
+
+ return page;
+}
+
/**
* alloc_pages_vma - Allocate a page for a VMA.
* @gfp: GFP flags.
@@ -2201,6 +2222,12 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
goto out;
}

+ if (pol->mode == MPOL_PREFERRED_MANY) {
+ page = alloc_pages_preferred_many(gfp, order, node, pol);
+ mpol_cond_put(pol);
+ goto out;
+ }
+
if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
int hpage_node = node;

@@ -2278,6 +2305,9 @@ struct page *alloc_pages(gfp_t gfp, unsigned order)
*/
if (pol->mode == MPOL_INTERLEAVE)
page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
+ else if (pol->mode == MPOL_PREFERRED_MANY)
+ page = alloc_pages_preferred_many(gfp, order,
+ numa_node_id(), pol);
else
page = __alloc_pages(gfp, order,
policy_node(gfp, pol, numa_node_id()),
--
2.14.1


2021-08-03 06:01:21

by Feng Tang

[permalink] [raw]
Subject: [PATCH v7 5/5] mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies

As they all do the same thing: sanity check and save nodemask info, create
one mpol_new_nodemask() to reduce redundancy.

Signed-off-by: Feng Tang <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/mempolicy.c | 24 ++++--------------------
1 file changed, 4 insertions(+), 20 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e437fe96acd0..14710960d1d4 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -192,7 +192,7 @@ static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
nodes_onto(*ret, tmp, *rel);
}

-static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes)
+static int mpol_new_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
{
if (nodes_empty(*nodes))
return -EINVAL;
@@ -210,22 +210,6 @@ static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
return 0;
}

-static int mpol_new_preferred_many(struct mempolicy *pol, const nodemask_t *nodes)
-{
- if (nodes_empty(*nodes))
- return -EINVAL;
- pol->nodes = *nodes;
- return 0;
-}
-
-static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
-{
- if (nodes_empty(*nodes))
- return -EINVAL;
- pol->nodes = *nodes;
- return 0;
-}
-
/*
* mpol_set_nodemask is called after mpol_new() to set up the nodemask, if
* any, for the new policy. mpol_new() has already validated the nodes
@@ -405,7 +389,7 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
.rebind = mpol_rebind_default,
},
[MPOL_INTERLEAVE] = {
- .create = mpol_new_interleave,
+ .create = mpol_new_nodemask,
.rebind = mpol_rebind_nodemask,
},
[MPOL_PREFERRED] = {
@@ -413,14 +397,14 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
.rebind = mpol_rebind_preferred,
},
[MPOL_BIND] = {
- .create = mpol_new_bind,
+ .create = mpol_new_nodemask,
.rebind = mpol_rebind_nodemask,
},
[MPOL_LOCAL] = {
.rebind = mpol_rebind_default,
},
[MPOL_PREFERRED_MANY] = {
- .create = mpol_new_preferred_many,
+ .create = mpol_new_nodemask,
.rebind = mpol_rebind_preferred,
},
};
--
2.14.1


2021-08-03 06:01:34

by Feng Tang

[permalink] [raw]
Subject: [PATCH v7 1/5] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes

From: Dave Hansen <[email protected]>

The NUMA APIs currently allow passing in a "preferred node" as a
single bit set in a nodemask. If more than one bit it set, bits
after the first are ignored.

This single node is generally OK for location-based NUMA where
memory being allocated will eventually be operated on by a single
CPU. However, in systems with multiple memory types, folks want
to target a *type* of memory instead of a location. For instance,
someone might want some high-bandwidth memory but do not care about
the CPU next to which it is allocated. Or, they want a cheap,
high capacity allocation and want to target all NUMA nodes which
have persistent memory in volatile mode. In both of these cases,
the application wants to target a *set* of nodes, but does not
want strict MPOL_BIND behavior as that could lead to OOM killer or
SIGSEGV.

So add MPOL_PREFERRED_MANY policy to support the multiple preferred
nodes requirement. This is not a pie-in-the-sky dream for an API.
This was a response to a specific ask of more than one group at Intel.
Specifically:

1. There are existing libraries that target memory types such as
https://github.com/memkind/memkind. These are known to suffer
from SIGSEGV's when memory is low on targeted memory "kinds" that
span more than one node. The MCDRAM on a Xeon Phi in "Cluster on
Die" mode is an example of this.
2. Volatile-use persistent memory users want to have a memory policy
which is targeted at either "cheap and slow" (PMEM) or "expensive and
fast" (DRAM). However, they do not want to experience allocation
failures when the targeted type is unavailable.
3. Allocate-then-run. Generally, we let the process scheduler decide
on which physical CPU to run a task. That location provides a
default allocation policy, and memory availability is not generally
considered when placing tasks. For situations where memory is
valuable and constrained, some users want to allocate memory first,
*then* allocate close compute resources to the allocation. This is
the reverse of the normal (CPU) model. Accelerators such as GPUs
that operate on core-mm-managed memory are interested in this model.

A check is added in sanitize_mpol_flags() to not permit 'prefer_many'
policy to be used for now, and will be removed in later patch after all
implementations for 'prefer_many' are ready, as suggested by Michal Hocko.

[Michal Hocko: suggest to refine policy_node/policy_nodemask handling]
Link: https://lore.kernel.org/r/[email protected]
Co-developed-by: Ben Widawsky <[email protected]>
Signed-off-by: Ben Widawsky <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
include/uapi/linux/mempolicy.h | 1 +
mm/mempolicy.c | 73 ++++++++++++++++++++++++++++++++++--------
2 files changed, 60 insertions(+), 14 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 19a00bc7fe86..046d0ccba4cd 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -22,6 +22,7 @@ enum {
MPOL_BIND,
MPOL_INTERLEAVE,
MPOL_LOCAL,
+ MPOL_PREFERRED_MANY,
MPOL_MAX, /* always last member of enum */
};

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e675bfb856da..72f7ff760989 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -31,6 +31,9 @@
* but useful to set in a VMA when you have a non default
* process policy.
*
+ * preferred many Try a set of nodes first before normal fallback. This is
+ * similar to preferred without the special case.
+ *
* default Allocate on the local node first, or when on a VMA
* use the process policy. This is what Linux always did
* in a NUMA aware kernel and still does by, ahem, default.
@@ -207,6 +210,14 @@ static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
return 0;
}

+static int mpol_new_preferred_many(struct mempolicy *pol, const nodemask_t *nodes)
+{
+ if (nodes_empty(*nodes))
+ return -EINVAL;
+ pol->nodes = *nodes;
+ return 0;
+}
+
static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
{
if (nodes_empty(*nodes))
@@ -408,6 +419,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
[MPOL_LOCAL] = {
.rebind = mpol_rebind_default,
},
+ [MPOL_PREFERRED_MANY] = {
+ .create = mpol_new_preferred_many,
+ .rebind = mpol_rebind_preferred,
+ },
};

static int migrate_page_add(struct page *page, struct list_head *pagelist,
@@ -900,6 +915,7 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
case MPOL_BIND:
case MPOL_INTERLEAVE:
case MPOL_PREFERRED:
+ case MPOL_PREFERRED_MANY:
*nodes = p->nodes;
break;
case MPOL_LOCAL:
@@ -1446,7 +1462,13 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
{
*flags = *mode & MPOL_MODE_FLAGS;
*mode &= ~MPOL_MODE_FLAGS;
- if ((unsigned int)(*mode) >= MPOL_MAX)
+
+ /*
+ * The check should be 'mode >= MPOL_MAX', but as 'prefer_many'
+ * is not fully implemented, don't permit it to be used for now,
+ * and the logic will be restored in following patch
+ */
+ if ((unsigned int)(*mode) >= MPOL_PREFERRED_MANY)
return -EINVAL;
if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
return -EINVAL;
@@ -1875,16 +1897,27 @@ static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
*/
nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
{
+ int mode = policy->mode;
+
/* Lower zones don't get a nodemask applied for MPOL_BIND */
- if (unlikely(policy->mode == MPOL_BIND) &&
- apply_policy_zone(policy, gfp_zone(gfp)) &&
- cpuset_nodemask_valid_mems_allowed(&policy->nodes))
+ if (unlikely(mode == MPOL_BIND) &&
+ apply_policy_zone(policy, gfp_zone(gfp)) &&
+ cpuset_nodemask_valid_mems_allowed(&policy->nodes))
+ return &policy->nodes;
+
+ if (mode == MPOL_PREFERRED_MANY)
return &policy->nodes;

return NULL;
}

-/* Return the node id preferred by the given mempolicy, or the given id */
+/*
+ * Return the preferred node id for 'prefer' mempolicy, and return
+ * the given id for all other policies.
+ *
+ * policy_node() is always coupled with policy_nodemask(), which
+ * secures the nodemask limit for 'bind' and 'prefer-many' policy.
+ */
static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
{
if (policy->mode == MPOL_PREFERRED) {
@@ -1936,7 +1969,9 @@ unsigned int mempolicy_slab_node(void)
case MPOL_INTERLEAVE:
return interleave_nodes(policy);

- case MPOL_BIND: {
+ case MPOL_BIND:
+ case MPOL_PREFERRED_MANY:
+ {
struct zoneref *z;

/*
@@ -2008,12 +2043,12 @@ static inline unsigned interleave_nid(struct mempolicy *pol,
* @addr: address in @vma for shared policy lookup and interleave policy
* @gfp_flags: for requested zone
* @mpol: pointer to mempolicy pointer for reference counted mempolicy
- * @nodemask: pointer to nodemask pointer for MPOL_BIND nodemask
+ * @nodemask: pointer to nodemask pointer for 'bind' and 'prefer-many' policy
*
* Returns a nid suitable for a huge page allocation and a pointer
* to the struct mempolicy for conditional unref after allocation.
- * If the effective policy is 'BIND, returns a pointer to the mempolicy's
- * @nodemask for filtering the zonelist.
+ * If the effective policy is 'bind' or 'prefer-many', returns a pointer
+ * to the mempolicy's @nodemask for filtering the zonelist.
*
* Must be protected by read_mems_allowed_begin()
*/
@@ -2021,16 +2056,18 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
struct mempolicy **mpol, nodemask_t **nodemask)
{
int nid;
+ int mode;

*mpol = get_vma_policy(vma, addr);
- *nodemask = NULL; /* assume !MPOL_BIND */
+ *nodemask = NULL;
+ mode = (*mpol)->mode;

- if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) {
+ if (unlikely(mode == MPOL_INTERLEAVE)) {
nid = interleave_nid(*mpol, vma, addr,
huge_page_shift(hstate_vma(vma)));
} else {
nid = policy_node(gfp_flags, *mpol, numa_node_id());
- if ((*mpol)->mode == MPOL_BIND)
+ if (mode == MPOL_BIND || mode == MPOL_PREFERRED_MANY)
*nodemask = &(*mpol)->nodes;
}
return nid;
@@ -2063,6 +2100,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
mempolicy = current->mempolicy;
switch (mempolicy->mode) {
case MPOL_PREFERRED:
+ case MPOL_PREFERRED_MANY:
case MPOL_BIND:
case MPOL_INTERLEAVE:
*mask = mempolicy->nodes;
@@ -2173,7 +2211,7 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
* node and don't fall back to other nodes, as the cost of
* remote accesses would likely offset THP benefits.
*
- * If the policy is interleave, or does not allow the current
+ * If the policy is interleave or does not allow the current
* node in its nodemask, we allocate the standard way.
*/
if (pol->mode == MPOL_PREFERRED)
@@ -2311,6 +2349,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
case MPOL_BIND:
case MPOL_INTERLEAVE:
case MPOL_PREFERRED:
+ case MPOL_PREFERRED_MANY:
return !!nodes_equal(a->nodes, b->nodes);
case MPOL_LOCAL:
return true;
@@ -2451,6 +2490,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
break;

case MPOL_PREFERRED:
+ if (node_isset(curnid, pol->nodes))
+ goto out;
polnid = first_node(pol->nodes);
break;

@@ -2465,9 +2506,10 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
break;
goto out;
}
+ fallthrough;

+ case MPOL_PREFERRED_MANY:
/*
- * allows binding to multiple nodes.
* use current page if in policy nodemask,
* else select nearest allowed node, if any.
* If no allowed nodes, use current [!misplaced].
@@ -2829,6 +2871,7 @@ static const char * const policy_modes[] =
[MPOL_BIND] = "bind",
[MPOL_INTERLEAVE] = "interleave",
[MPOL_LOCAL] = "local",
+ [MPOL_PREFERRED_MANY] = "prefer (many)",
};


@@ -2907,6 +2950,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
if (!nodelist)
err = 0;
goto out;
+ case MPOL_PREFERRED_MANY:
case MPOL_BIND:
/*
* Insist on a nodelist
@@ -2993,6 +3037,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
case MPOL_LOCAL:
break;
case MPOL_PREFERRED:
+ case MPOL_PREFERRED_MANY:
case MPOL_BIND:
case MPOL_INTERLEAVE:
nodes = pol->nodes;
--
2.14.1


2021-08-03 06:02:05

by Feng Tang

[permalink] [raw]
Subject: [PATCH v7 4/5] mm/mempolicy: Advertise new MPOL_PREFERRED_MANY

From: Ben Widawsky <[email protected]>

Adds a new mode to the existing mempolicy modes, MPOL_PREFERRED_MANY.

MPOL_PREFERRED_MANY will be adequately documented in the internal
admin-guide with this patch. Eventually, the man pages for mbind(2),
get_mempolicy(2), set_mempolicy(2) and numactl(8) will also have text
about this mode. Those shall contain the canonical reference.

NUMA systems continue to become more prevalent. New technologies like
PMEM make finer grain control over memory access patterns increasingly
desirable. MPOL_PREFERRED_MANY allows userspace to specify a set of
nodes that will be tried first when performing allocations. If those
allocations fail, all remaining nodes will be tried. It's a straight
forward API which solves many of the presumptive needs of system
administrators wanting to optimize workloads on such machines. The mode
will work either per VMA, or per thread.

[Michal Hocko: refine kernel doc for MPOL_PREFERRED_MANY]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ben Widawsky <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
Documentation/admin-guide/mm/numa_memory_policy.rst | 15 +++++++++++----
mm/mempolicy.c | 7 +------
2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 067a90a1499c..64fd0ba0d057 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -245,6 +245,13 @@ MPOL_INTERLEAVED
address range or file. During system boot up, the temporary
interleaved system default policy works in this mode.

+MPOL_PREFERRED_MANY
+ This mode specifices that the allocation should be preferrably
+ satisfied from the nodemask specified in the policy. If there is
+ a memory pressure on all nodes in the nodemask, the allocation
+ can fall back to all existing numa nodes. This is effectively
+ MPOL_PREFERRED allowed for a mask rather than a single node.
+
NUMA memory policy supports the following optional mode flags:

MPOL_F_STATIC_NODES
@@ -253,10 +260,10 @@ MPOL_F_STATIC_NODES
nodes changes after the memory policy has been defined.

Without this flag, any time a mempolicy is rebound because of a
- change in the set of allowed nodes, the node (Preferred) or
- nodemask (Bind, Interleave) is remapped to the new set of
- allowed nodes. This may result in nodes being used that were
- previously undesired.
+ change in the set of allowed nodes, the preferred nodemask (Preferred
+ Many), preferred node (Preferred) or nodemask (Bind, Interleave) is
+ remapped to the new set of allowed nodes. This may result in nodes
+ being used that were previously undesired.

With this flag, if the user-specified nodes overlap with the
nodes allowed by the task's cpuset, then the memory policy is
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a00bb1c48a15..e437fe96acd0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1463,12 +1463,7 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
*flags = *mode & MPOL_MODE_FLAGS;
*mode &= ~MPOL_MODE_FLAGS;

- /*
- * The check should be 'mode >= MPOL_MAX', but as 'prefer_many'
- * is not fully implemented, don't permit it to be used for now,
- * and the logic will be restored in following patch
- */
- if ((unsigned int)(*mode) >= MPOL_PREFERRED_MANY)
+ if ((unsigned int)(*mode) >= MPOL_MAX)
return -EINVAL;
if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
return -EINVAL;
--
2.14.1


2021-08-06 18:32:05

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 1/5] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes

On Tue 03-08-21 13:59:18, Feng Tang wrote:
[...]
> @@ -1936,7 +1969,9 @@ unsigned int mempolicy_slab_node(void)
> case MPOL_INTERLEAVE:
> return interleave_nodes(policy);
>
> - case MPOL_BIND: {
> + case MPOL_BIND:
> + case MPOL_PREFERRED_MANY:
> + {
> struct zoneref *z;

I guess this is ok for now but it would be great if slab maintainers
could have a look here. I suspect this will need some more changes. E.g.
I find it highly suspicious how fallback_alloc uses mempolicy_slab_node.
Let's say that the local node is not a part of the nodemask.
mempolicy_slab_node will switch to the first node in the order list
which is ok but fallback_alloc then iterates over the whole zonelist
without any policy node mask constrains. get_any_partial looks very
similar.


--
Michal Hocko
SUSE Labs

2021-08-06 18:32:23

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 1/5] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes

On Tue 03-08-21 13:59:18, Feng Tang wrote:
> From: Dave Hansen <[email protected]>
>
> The NUMA APIs currently allow passing in a "preferred node" as a
> single bit set in a nodemask. If more than one bit it set, bits
> after the first are ignored.
>
> This single node is generally OK for location-based NUMA where
> memory being allocated will eventually be operated on by a single
> CPU. However, in systems with multiple memory types, folks want
> to target a *type* of memory instead of a location. For instance,
> someone might want some high-bandwidth memory but do not care about
> the CPU next to which it is allocated. Or, they want a cheap,
> high capacity allocation and want to target all NUMA nodes which
> have persistent memory in volatile mode. In both of these cases,
> the application wants to target a *set* of nodes, but does not
> want strict MPOL_BIND behavior as that could lead to OOM killer or
> SIGSEGV.
>
> So add MPOL_PREFERRED_MANY policy to support the multiple preferred
> nodes requirement. This is not a pie-in-the-sky dream for an API.
> This was a response to a specific ask of more than one group at Intel.
> Specifically:
>
> 1. There are existing libraries that target memory types such as
> https://github.com/memkind/memkind. These are known to suffer
> from SIGSEGV's when memory is low on targeted memory "kinds" that
> span more than one node. The MCDRAM on a Xeon Phi in "Cluster on
> Die" mode is an example of this.
> 2. Volatile-use persistent memory users want to have a memory policy
> which is targeted at either "cheap and slow" (PMEM) or "expensive and
> fast" (DRAM). However, they do not want to experience allocation
> failures when the targeted type is unavailable.
> 3. Allocate-then-run. Generally, we let the process scheduler decide
> on which physical CPU to run a task. That location provides a
> default allocation policy, and memory availability is not generally
> considered when placing tasks. For situations where memory is
> valuable and constrained, some users want to allocate memory first,
> *then* allocate close compute resources to the allocation. This is
> the reverse of the normal (CPU) model. Accelerators such as GPUs
> that operate on core-mm-managed memory are interested in this model.
>
> A check is added in sanitize_mpol_flags() to not permit 'prefer_many'
> policy to be used for now, and will be removed in later patch after all
> implementations for 'prefer_many' are ready, as suggested by Michal Hocko.
>
> [Michal Hocko: suggest to refine policy_node/policy_nodemask handling]
> Link: https://lore.kernel.org/r/[email protected]
> Co-developed-by: Ben Widawsky <[email protected]>
> Signed-off-by: Ben Widawsky <[email protected]>
> Signed-off-by: Dave Hansen <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>

Acked-by: Michal Hocko <[email protected]>
Thanks!
--
Michal Hocko
SUSE Labs

2021-08-06 18:32:46

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 2/5] mm/memplicy: add page allocation function for MPOL_PREFERRED_MANY policy

On Tue 03-08-21 13:59:19, Feng Tang wrote:
> The semantics of MPOL_PREFERRED_MANY is similar to MPOL_PREFERRED,
> that it will first try to allocate memory from the preferred node(s),
> and fallback to all nodes in system when first try fails.
>
> Add a dedicated function alloc_pages_preferred_many() for it just
> like for 'interleave' policy, which will be used by 2 general
> memoory allocation APIs: alloc_pages() and alloc_pages_vma()
>
> Link: https://lore.kernel.org/r/[email protected]
> Suggested-by: Michal Hocko <[email protected]>
> Originally-by: Ben Widawsky <[email protected]>
> Co-developed-by: Ben Widawsky <[email protected]>
> Signed-off-by: Ben Widawsky <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>

Acked-by: Michal Hocko <[email protected]>
Thanks!

> ---
> mm/mempolicy.c | 30 ++++++++++++++++++++++++++++++
> 1 file changed, 30 insertions(+)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 72f7ff760989..a00bb1c48a15 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2166,6 +2166,27 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
> return page;
> }
>
> +static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
> + int nid, struct mempolicy *pol)
> +{
> + struct page *page;
> + gfp_t preferred_gfp;
> +
> + /*
> + * This is a two pass approach. The first pass will only try the
> + * preferred nodes but skip the direct reclaim and allow the
> + * allocation to fail, while the second pass will try all the
> + * nodes in system.
> + */
> + preferred_gfp = gfp | __GFP_NOWARN;
> + preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
> + page = __alloc_pages(preferred_gfp, order, nid, &pol->nodes);
> + if (!page)
> + page = __alloc_pages(gfp, order, numa_node_id(), NULL);
> +
> + return page;
> +}
> +
> /**
> * alloc_pages_vma - Allocate a page for a VMA.
> * @gfp: GFP flags.
> @@ -2201,6 +2222,12 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> goto out;
> }
>
> + if (pol->mode == MPOL_PREFERRED_MANY) {
> + page = alloc_pages_preferred_many(gfp, order, node, pol);
> + mpol_cond_put(pol);
> + goto out;
> + }
> +
> if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
> int hpage_node = node;
>
> @@ -2278,6 +2305,9 @@ struct page *alloc_pages(gfp_t gfp, unsigned order)
> */
> if (pol->mode == MPOL_INTERLEAVE)
> page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> + else if (pol->mode == MPOL_PREFERRED_MANY)
> + page = alloc_pages_preferred_many(gfp, order,
> + numa_node_id(), pol);
> else
> page = __alloc_pages(gfp, order,
> policy_node(gfp, pol, numa_node_id()),
> --
> 2.14.1

--
Michal Hocko
SUSE Labs

2021-08-06 18:33:13

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 3/5] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY

On Tue 03-08-21 13:59:20, Feng Tang wrote:
> From: Ben Widawsky <[email protected]>
>
> Implement the missing huge page allocation functionality while obeying
> the preferred node semantics. This is similar to the implementation
> for general page allocation, as it uses a fallback mechanism to try
> multiple preferred nodes first, and then all other nodes.
>
> [akpm: fix compling issue when merging with other hugetlb patch]
> [Thanks to 0day bot for catching the missing #ifdef CONFIG_NUMA issue]
> Link: https://lore.kernel.org/r/[email protected]
> Suggested-by: Michal Hocko <[email protected]>
> Signed-off-by: Ben Widawsky <[email protected]>
> Co-developed-by: Feng Tang <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>

ifdefery is just ugly as hell. One way to get rid of that would be to
provide a mpol_is_preferred_many() wrapper and hide the CONFIG_NUMA in
mempolicy.h. I haven't checked but this might help to remove some other
ifdefery as well.

I especially dislike the label hidden in the ifdef. You can get rid of
that by checking the page for NULL.

> ---
> mm/hugetlb.c | 28 ++++++++++++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 95714fb28150..9279f6d478d9 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1166,7 +1166,20 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
>
> gfp_mask = htlb_alloc_mask(h);
> nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
> +#ifdef CONFIG_NUMA
> + if (mpol->mode == MPOL_PREFERRED_MANY) {
> + page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
> + if (page)
> + goto check_reserve;
> + /* Fallback to all nodes */
> + nodemask = NULL;
> + }
> +#endif
> page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
> +
> +#ifdef CONFIG_NUMA
> +check_reserve:
> +#endif
> if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
> SetHPageRestoreReserve(page);
> h->resv_huge_pages--;
> @@ -2147,6 +2160,21 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
> nodemask_t *nodemask;
>
> nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
> +#ifdef CONFIG_NUMA
> + if (mpol->mode == MPOL_PREFERRED_MANY) {
> + gfp_t gfp = gfp_mask | __GFP_NOWARN;
> +
> + gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
> + page = alloc_surplus_huge_page(h, gfp, nid, nodemask, false);
> + if (page) {
> + mpol_cond_put(mpol);
> + return page;
> + }
> +
> + /* Fallback to all nodes */
> + nodemask = NULL;
> + }
> +#endif
> page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask, false);
> mpol_cond_put(mpol);
>
> --
> 2.14.1

--
Michal Hocko
SUSE Labs

2021-08-09 02:49:46

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v7 3/5] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY

Hi Michal,

Thanks for the review and ACKs to 1/5 and 2/5 patches.

On Fri, Aug 06, 2021 at 03:35:48PM +0200, Michal Hocko wrote:
> On Tue 03-08-21 13:59:20, Feng Tang wrote:
> > From: Ben Widawsky <[email protected]>
> >
> > Implement the missing huge page allocation functionality while obeying
> > the preferred node semantics. This is similar to the implementation
> > for general page allocation, as it uses a fallback mechanism to try
> > multiple preferred nodes first, and then all other nodes.
> >
> > [akpm: fix compling issue when merging with other hugetlb patch]
> > [Thanks to 0day bot for catching the missing #ifdef CONFIG_NUMA issue]
> > Link: https://lore.kernel.org/r/[email protected]
> > Suggested-by: Michal Hocko <[email protected]>
> > Signed-off-by: Ben Widawsky <[email protected]>
> > Co-developed-by: Feng Tang <[email protected]>
> > Signed-off-by: Feng Tang <[email protected]>
>
> ifdefery is just ugly as hell. One way to get rid of that would be to
> provide a mpol_is_preferred_many() wrapper and hide the CONFIG_NUMA in
> mempolicy.h. I haven't checked but this might help to remove some other
> ifdefery as well.
>
> I especially dislike the label hidden in the ifdef. You can get rid of
> that by checking the page for NULL.

Yes, the 'ifdef's were annoying to me too, and thanks for the suggestions.
Following is the revised patch upon the suggestion.

Thanks,
Feng

-------8<---------------------

From fc30718c40f02ba5ea73456af49173e66b5032c1 Mon Sep 17 00:00:00 2001
From: Ben Widawsky <[email protected]>
Date: Thu, 5 Aug 2021 23:01:11 -0400
Subject: [PATCH] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY

Implement the missing huge page allocation functionality while obeying the
preferred node semantics. This is similar to the implementation for
general page allocation, as it uses a fallback mechanism to try multiple
preferred nodes first, and then all other nodes.

To avoid adding too many "#ifdef CONFIG_NUMA" check, add a helper function
in mempolicy.h to check whether a mempolicy is MPOL_PREFERRED_MANY.

[akpm: fix compling issue when merging with other hugetlb patch]
[Thanks to 0day bot for catching the !CONFIG_NUMA compiling issue]
[Michal Hocko: suggest to remove the #ifdef CONFIG_NUMA check]
Link: https://lore.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Michal Hocko <[email protected]>
Signed-off-by: Ben Widawsky <[email protected]>
Co-developed-by: Feng Tang <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
--
include/linux/mempolicy.h | 12 ++++++++++++
mm/hugetlb.c | 28 ++++++++++++++++++++++++----
2 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 0117e1e..60d5e6c 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -187,6 +187,12 @@ extern void mpol_put_task_policy(struct task_struct *);

extern bool numa_demotion_enabled;

+static inline bool mpol_is_preferred_many(struct mempolicy *pol)
+{
+ return (pol->mode == MPOL_PREFERRED_MANY);
+}
+
+
#else

struct mempolicy {};
@@ -297,5 +303,11 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
}

#define numa_demotion_enabled false
+
+static inline bool mpol_is_preferred_many(struct mempolicy *pol)
+{
+ return false;
+}
+
#endif /* CONFIG_NUMA */
#endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 95714fb..75ea8bc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1145,7 +1145,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
unsigned long address, int avoid_reserve,
long chg)
{
- struct page *page;
+ struct page *page = NULL;
struct mempolicy *mpol;
gfp_t gfp_mask;
nodemask_t *nodemask;
@@ -1166,7 +1166,17 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,

gfp_mask = htlb_alloc_mask(h);
nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
- page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
+
+ if (mpol_is_preferred_many(mpol)) {
+ page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
+
+ /* Fallback to all nodes if page==NULL */
+ nodemask = NULL;
+ }
+
+ if (!page)
+ page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
+
if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
SetHPageRestoreReserve(page);
h->resv_huge_pages--;
@@ -2147,9 +2157,19 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
nodemask_t *nodemask;

nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
- page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask, false);
- mpol_cond_put(mpol);
+ if (mpol_is_preferred_many(mpol)) {
+ gfp_t gfp = gfp_mask | __GFP_NOWARN;

+ gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
+ page = alloc_surplus_huge_page(h, gfp, nid, nodemask, false);
+
+ /* Fallback to all nodes if page==NULL */
+ nodemask = NULL;
+ }
+
+ if (!page)
+ page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask, false);
+ mpol_cond_put(mpol);
return page;
}

--
2.7.4


2021-08-09 08:43:24

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 3/5] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY

On Mon 09-08-21 10:44:30, Feng Tang wrote:
> Hi Michal,
>
> Thanks for the review and ACKs to 1/5 and 2/5 patches.
>
> On Fri, Aug 06, 2021 at 03:35:48PM +0200, Michal Hocko wrote:
> > On Tue 03-08-21 13:59:20, Feng Tang wrote:
> > > From: Ben Widawsky <[email protected]>
> > >
> > > Implement the missing huge page allocation functionality while obeying
> > > the preferred node semantics. This is similar to the implementation
> > > for general page allocation, as it uses a fallback mechanism to try
> > > multiple preferred nodes first, and then all other nodes.
> > >
> > > [akpm: fix compling issue when merging with other hugetlb patch]
> > > [Thanks to 0day bot for catching the missing #ifdef CONFIG_NUMA issue]
> > > Link: https://lore.kernel.org/r/[email protected]
> > > Suggested-by: Michal Hocko <[email protected]>
> > > Signed-off-by: Ben Widawsky <[email protected]>
> > > Co-developed-by: Feng Tang <[email protected]>
> > > Signed-off-by: Feng Tang <[email protected]>
> >
> > ifdefery is just ugly as hell. One way to get rid of that would be to
> > provide a mpol_is_preferred_many() wrapper and hide the CONFIG_NUMA in
> > mempolicy.h. I haven't checked but this might help to remove some other
> > ifdefery as well.
> >
> > I especially dislike the label hidden in the ifdef. You can get rid of
> > that by checking the page for NULL.
>
> Yes, the 'ifdef's were annoying to me too, and thanks for the suggestions.
> Following is the revised patch upon the suggestion.
>
> Thanks,
> Feng
>
> -------8<---------------------
>
> >From fc30718c40f02ba5ea73456af49173e66b5032c1 Mon Sep 17 00:00:00 2001
> From: Ben Widawsky <[email protected]>
> Date: Thu, 5 Aug 2021 23:01:11 -0400
> Subject: [PATCH] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
>
> Implement the missing huge page allocation functionality while obeying the
> preferred node semantics. This is similar to the implementation for
> general page allocation, as it uses a fallback mechanism to try multiple
> preferred nodes first, and then all other nodes.
>
> To avoid adding too many "#ifdef CONFIG_NUMA" check, add a helper function
> in mempolicy.h to check whether a mempolicy is MPOL_PREFERRED_MANY.
>
> [akpm: fix compling issue when merging with other hugetlb patch]
> [Thanks to 0day bot for catching the !CONFIG_NUMA compiling issue]
> [Michal Hocko: suggest to remove the #ifdef CONFIG_NUMA check]
> Link: https://lore.kernel.org/r/[email protected]
> Link: https://lkml.kernel.org/r/[email protected]
> Suggested-by: Michal Hocko <[email protected]>
> Signed-off-by: Ben Widawsky <[email protected]>
> Co-developed-by: Feng Tang <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>

Yeah. This looks much better. Thanks!
Acked-by: Michal Hocko <[email protected]>
Do you think you can provide same helpers for other policies as well?
Maybe we can get rid of some other ifdefery as well.

Thanks!
--
Michal Hocko
SUSE Labs

2021-08-09 12:45:07

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v7 3/5] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY

On Mon, Aug 09, 2021 at 10:41:40AM +0200, Michal Hocko wrote:
[snip]
> > >From fc30718c40f02ba5ea73456af49173e66b5032c1 Mon Sep 17 00:00:00 2001
> > From: Ben Widawsky <[email protected]>
> > Date: Thu, 5 Aug 2021 23:01:11 -0400
> > Subject: [PATCH] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
> >
> > Implement the missing huge page allocation functionality while obeying the
> > preferred node semantics. This is similar to the implementation for
> > general page allocation, as it uses a fallback mechanism to try multiple
> > preferred nodes first, and then all other nodes.
> >
> > To avoid adding too many "#ifdef CONFIG_NUMA" check, add a helper function
> > in mempolicy.h to check whether a mempolicy is MPOL_PREFERRED_MANY.
> >
> > [akpm: fix compling issue when merging with other hugetlb patch]
> > [Thanks to 0day bot for catching the !CONFIG_NUMA compiling issue]
> > [Michal Hocko: suggest to remove the #ifdef CONFIG_NUMA check]
> > Link: https://lore.kernel.org/r/[email protected]
> > Link: https://lkml.kernel.org/r/[email protected]
> > Suggested-by: Michal Hocko <[email protected]>
> > Signed-off-by: Ben Widawsky <[email protected]>
> > Co-developed-by: Feng Tang <[email protected]>
> > Signed-off-by: Feng Tang <[email protected]>
>
> Yeah. This looks much better. Thanks!
> Acked-by: Michal Hocko <[email protected]>

Thank you!

> Do you think you can provide same helpers for other policies as well?
> Maybe we can get rid of some other ifdefery as well.

Sure. I can make separate patch(es) for that.

And you mean helper like mpol_is_bind/default/local/preferred?

I just run 'git-grep MPOL', and for places using "mode == MPOL_XXX",
mostly they are in mempolicy.[ch], the only another place is in
shmem.c, do we need to create all the helpers for it and the
potential future users?

Thanks,
Feng

2021-08-09 13:21:55

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 3/5] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY

On Mon 09-08-21 20:37:47, Feng Tang wrote:
> On Mon, Aug 09, 2021 at 10:41:40AM +0200, Michal Hocko wrote:
> [snip]
> > > >From fc30718c40f02ba5ea73456af49173e66b5032c1 Mon Sep 17 00:00:00 2001
> > > From: Ben Widawsky <[email protected]>
> > > Date: Thu, 5 Aug 2021 23:01:11 -0400
> > > Subject: [PATCH] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
> > >
> > > Implement the missing huge page allocation functionality while obeying the
> > > preferred node semantics. This is similar to the implementation for
> > > general page allocation, as it uses a fallback mechanism to try multiple
> > > preferred nodes first, and then all other nodes.
> > >
> > > To avoid adding too many "#ifdef CONFIG_NUMA" check, add a helper function
> > > in mempolicy.h to check whether a mempolicy is MPOL_PREFERRED_MANY.
> > >
> > > [akpm: fix compling issue when merging with other hugetlb patch]
> > > [Thanks to 0day bot for catching the !CONFIG_NUMA compiling issue]
> > > [Michal Hocko: suggest to remove the #ifdef CONFIG_NUMA check]
> > > Link: https://lore.kernel.org/r/[email protected]
> > > Link: https://lkml.kernel.org/r/[email protected]
> > > Suggested-by: Michal Hocko <[email protected]>
> > > Signed-off-by: Ben Widawsky <[email protected]>
> > > Co-developed-by: Feng Tang <[email protected]>
> > > Signed-off-by: Feng Tang <[email protected]>
> >
> > Yeah. This looks much better. Thanks!
> > Acked-by: Michal Hocko <[email protected]>
>
> Thank you!
>
> > Do you think you can provide same helpers for other policies as well?
> > Maybe we can get rid of some other ifdefery as well.
>
> Sure. I can make separate patch(es) for that.
>
> And you mean helper like mpol_is_bind/default/local/preferred?
>
> I just run 'git-grep MPOL', and for places using "mode == MPOL_XXX",
> mostly they are in mempolicy.[ch], the only another place is in
> shmem.c, do we need to create all the helpers for it and the
> potential future users?

I would just go with those instances which need to ifdef for NUMA.
Thanks!
--
Michal Hocko
SUSE Labs

2021-08-10 09:58:06

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v7 3/5] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY

On Mon, Aug 09, 2021 at 03:19:32PM +0200, Michal Hocko wrote:
[snip]
> > > Do you think you can provide same helpers for other policies as well?
> > > Maybe we can get rid of some other ifdefery as well.
> >
> > Sure. I can make separate patch(es) for that.
> >
> > And you mean helper like mpol_is_bind/default/local/preferred?
> >
> > I just run 'git-grep MPOL', and for places using "mode == MPOL_XXX",
> > mostly they are in mempolicy.[ch], the only another place is in
> > shmem.c, do we need to create all the helpers for it and the
> > potential future users?
>
> I would just go with those instances which need to ifdef for NUMA.
> Thanks!

Yes, following is a patch to remove one CONFIG_NUMA check, though
an bolder idea to extend the patch by removing the CONFIG_TMPFS
check in the same line.

Thanks,
Feng

---------8<---------------------------------

From 1a5858721ac8ce99c27c13d310bba2983dc73d97 Mon Sep 17 00:00:00 2001
From: Feng Tang <[email protected]>
Date: Tue, 10 Aug 2021 17:00:59 +0800
Subject: [PATCH] mm: shmem: avoid open coded check for mempolicy's mode

Add a mempolicy helper to do the check, which can also remove
a CONFIG_NUMA option check.

Suggested-by: Michal Hocko <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
include/linux/mempolicy.h | 14 ++++++++++++++
mm/shmem.c | 8 ++++----
2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 60d5e6c3340c..8fc518ad4f3c 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -192,6 +192,10 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
return (pol->mode == MPOL_PREFERRED_MANY);
}

+static inline bool mpol_is_default(struct mempolicy *pol)
+{
+ return (pol->mode == MPOL_DEFAULT);
+}

#else

@@ -287,6 +291,10 @@ static inline int mpol_parse_str(char *str, struct mempolicy **mpol)
}
#endif

+static inline void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
+{
+}
+
static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
unsigned long address)
{
@@ -309,5 +317,11 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
return false;
}

+static inline bool mpol_is_default(struct mempolicy *pol)
+{
+ return false;
+}
+
+
#endif /* CONFIG_NUMA */
#endif
diff --git a/mm/shmem.c b/mm/shmem.c
index 96f05f6af8bb..26b195209ef7 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1437,12 +1437,12 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
return 0;
}

-#if defined(CONFIG_NUMA) && defined(CONFIG_TMPFS)
+#ifdef CONFIG_TMPFS
static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
{
char buffer[64];

- if (!mpol || mpol->mode == MPOL_DEFAULT)
+ if (!mpol || mpol_is_default(mpol))
return; /* show nothing */

mpol_to_str(buffer, sizeof(buffer), mpol);
@@ -1461,7 +1461,7 @@ static struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
}
return mpol;
}
-#else /* !CONFIG_NUMA || !CONFIG_TMPFS */
+#else /* !CONFIG_TMPFS */
static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
{
}
@@ -1469,7 +1469,7 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
{
return NULL;
}
-#endif /* CONFIG_NUMA && CONFIG_TMPFS */
+#endif /* CONFIG_TMPFS */
#ifndef CONFIG_NUMA
#define vm_policy vm_private_data
#endif
--
2.14.1


2021-08-10 20:07:58

by Nathan Chancellor

[permalink] [raw]
Subject: [PATCH] mm/hugetlb: Initialize page to NULL in alloc_buddy_huge_page_with_mpol()

Clang warns:

mm/hugetlb.c:2162:6: warning: variable 'page' is used uninitialized
whenever 'if' condition is false [-Wsometimes-uninitialized]
if (mpol_is_preferred_many(mpol)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/hugetlb.c:2172:7: note: uninitialized use occurs here
if (!page)
^~~~
mm/hugetlb.c:2162:2: note: remove the 'if' if its condition is always
true
if (mpol_is_preferred_many(mpol)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/hugetlb.c:2155:19: note: initialize the variable 'page' to silence
this warning
struct page *page;
^
= NULL
1 warning generated.

Initialize page to NULL like in dequeue_huge_page_vma() so that page is
not used uninitialized.

Signed-off-by: Nathan Chancellor <[email protected]>
---
mm/hugetlb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3d9cd2722ea5..604e2d6bd506 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2152,7 +2152,7 @@ static
struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr)
{
- struct page *page;
+ struct page *page = NULL;
struct mempolicy *mpol;
gfp_t gfp_mask = htlb_alloc_mask(h);
int nid;

base-commit: 18f73b217b4633e27a61832e1485ce927a8ee5c1
--
2.33.0.rc1

2021-08-10 21:36:35

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v7 3/5] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY

On Tue, 10 Aug 2021, Feng Tang wrote:
> On Mon, Aug 09, 2021 at 03:19:32PM +0200, Michal Hocko wrote:
> [snip]
> > > > Do you think you can provide same helpers for other policies as well?
> > > > Maybe we can get rid of some other ifdefery as well.
> > >
> > > Sure. I can make separate patch(es) for that.
> > >
> > > And you mean helper like mpol_is_bind/default/local/preferred?
> > >
> > > I just run 'git-grep MPOL', and for places using "mode == MPOL_XXX",
> > > mostly they are in mempolicy.[ch], the only another place is in
> > > shmem.c, do we need to create all the helpers for it and the
> > > potential future users?
> >
> > I would just go with those instances which need to ifdef for NUMA.
> > Thanks!
>
> Yes, following is a patch to remove one CONFIG_NUMA check, though
> an bolder idea to extend the patch by removing the CONFIG_TMPFS
> check in the same line.
>
> Thanks,
> Feng
>
> ---------8<---------------------------------
>
> From 1a5858721ac8ce99c27c13d310bba2983dc73d97 Mon Sep 17 00:00:00 2001
> From: Feng Tang <[email protected]>
> Date: Tue, 10 Aug 2021 17:00:59 +0800
> Subject: [PATCH] mm: shmem: avoid open coded check for mempolicy's mode
>
> Add a mempolicy helper to do the check, which can also remove
> a CONFIG_NUMA option check.
>
> Suggested-by: Michal Hocko <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>

No thanks: this is not an improvement.

The "#if defined(CONFIG_NUMA) && defined(CONFIG_TMPFS)" is there to
eliminate dead code that would not be automatically eliminated by the
optimizer: it's not there just to avoid MPOL_DEFAULT, and it's there
to cover shmem_get_sbmpol() along with shmem_show_mpol().

I know we tend to avoid #ifdefs in .c files, and that's good; and
I know you could find other code in mm/shmem.c which might also be
#ifdef'ed to eliminate other dead code in other configs; but unless
there's a new drive to purge our .c source of all #ifdefs,
please just leave this as is.

Thanks,
Hugh

> ---
> include/linux/mempolicy.h | 14 ++++++++++++++
> mm/shmem.c | 8 ++++----
> 2 files changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 60d5e6c3340c..8fc518ad4f3c 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -192,6 +192,10 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
> return (pol->mode == MPOL_PREFERRED_MANY);
> }
>
> +static inline bool mpol_is_default(struct mempolicy *pol)
> +{
> + return (pol->mode == MPOL_DEFAULT);
> +}
>
> #else
>
> @@ -287,6 +291,10 @@ static inline int mpol_parse_str(char *str, struct mempolicy **mpol)
> }
> #endif
>
> +static inline void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
> +{
> +}
> +
> static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
> unsigned long address)
> {
> @@ -309,5 +317,11 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
> return false;
> }
>
> +static inline bool mpol_is_default(struct mempolicy *pol)
> +{
> + return false;
> +}
> +
> +
> #endif /* CONFIG_NUMA */
> #endif
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 96f05f6af8bb..26b195209ef7 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1437,12 +1437,12 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
> return 0;
> }
>
> -#if defined(CONFIG_NUMA) && defined(CONFIG_TMPFS)
> +#ifdef CONFIG_TMPFS
> static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
> {
> char buffer[64];
>
> - if (!mpol || mpol->mode == MPOL_DEFAULT)
> + if (!mpol || mpol_is_default(mpol))
> return; /* show nothing */
>
> mpol_to_str(buffer, sizeof(buffer), mpol);
> @@ -1461,7 +1461,7 @@ static struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
> }
> return mpol;
> }
> -#else /* !CONFIG_NUMA || !CONFIG_TMPFS */
> +#else /* !CONFIG_TMPFS */
> static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
> {
> }
> @@ -1469,7 +1469,7 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
> {
> return NULL;
> }
> -#endif /* CONFIG_NUMA && CONFIG_TMPFS */
> +#endif /* CONFIG_TMPFS */
> #ifndef CONFIG_NUMA
> #define vm_policy vm_private_data
> #endif
> --
> 2.14.1

2021-08-11 01:23:40

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH] mm/hugetlb: Initialize page to NULL in alloc_buddy_huge_page_with_mpol()

On Tue, Aug 10, 2021 at 01:06:32PM -0700, Nathan Chancellor wrote:
> Clang warns:
>
> mm/hugetlb.c:2162:6: warning: variable 'page' is used uninitialized
> whenever 'if' condition is false [-Wsometimes-uninitialized]
> if (mpol_is_preferred_many(mpol)) {
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
> mm/hugetlb.c:2172:7: note: uninitialized use occurs here
> if (!page)
> ^~~~
> mm/hugetlb.c:2162:2: note: remove the 'if' if its condition is always
> true
> if (mpol_is_preferred_many(mpol)) {
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> mm/hugetlb.c:2155:19: note: initialize the variable 'page' to silence
> this warning
> struct page *page;
> ^
> = NULL
> 1 warning generated.
>
> Initialize page to NULL like in dequeue_huge_page_vma() so that page is
> not used uninitialized.
>
> Signed-off-by: Nathan Chancellor <[email protected]>

Thanks for the catch!

In my original patch, I initialized it in one function, but overlooked
this one.

Thanks,
Feng

2021-08-11 01:40:00

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v7 3/5] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY

Hi Huge,

On Tue, Aug 10, 2021 at 02:35:05PM -0700, Hugh Dickins wrote:
> On Tue, 10 Aug 2021, Feng Tang wrote:
> > On Mon, Aug 09, 2021 at 03:19:32PM +0200, Michal Hocko wrote:
> > [snip]
> > > > > Do you think you can provide same helpers for other policies as well?
> > > > > Maybe we can get rid of some other ifdefery as well.
> > > >
> > > > Sure. I can make separate patch(es) for that.
> > > >
> > > > And you mean helper like mpol_is_bind/default/local/preferred?
> > > >
> > > > I just run 'git-grep MPOL', and for places using "mode == MPOL_XXX",
> > > > mostly they are in mempolicy.[ch], the only another place is in
> > > > shmem.c, do we need to create all the helpers for it and the
> > > > potential future users?
> > >
> > > I would just go with those instances which need to ifdef for NUMA.
> > > Thanks!
> >
> > Yes, following is a patch to remove one CONFIG_NUMA check, though
> > an bolder idea to extend the patch by removing the CONFIG_TMPFS
> > check in the same line.
> >
> > Thanks,
> > Feng
> >
> > ---------8<---------------------------------
> >
> > From 1a5858721ac8ce99c27c13d310bba2983dc73d97 Mon Sep 17 00:00:00 2001
> > From: Feng Tang <[email protected]>
> > Date: Tue, 10 Aug 2021 17:00:59 +0800
> > Subject: [PATCH] mm: shmem: avoid open coded check for mempolicy's mode
> >
> > Add a mempolicy helper to do the check, which can also remove
> > a CONFIG_NUMA option check.
> >
> > Suggested-by: Michal Hocko <[email protected]>
> > Signed-off-by: Feng Tang <[email protected]>
>
> No thanks: this is not an improvement.
>
> The "#if defined(CONFIG_NUMA) && defined(CONFIG_TMPFS)" is there to
> eliminate dead code that would not be automatically eliminated by the
> optimizer: it's not there just to avoid MPOL_DEFAULT, and it's there
> to cover shmem_get_sbmpol() along with shmem_show_mpol().

Thanks for the explaination! I did some tests that in !NUMA case, the
'sbinfo->mpol' is always NULL (I could be wrong) which makes the
2 functions almost non-ops.

> I know we tend to avoid #ifdefs in .c files, and that's good; and
> I know you could find other code in mm/shmem.c which might also be
> #ifdef'ed to eliminate other dead code in other configs; but unless
> there's a new drive to purge our .c source of all #ifdefs,
> please just leave this as is.

Ok, and sorry for the noise.

Thanks,
Feng

> Thanks,
> Hugh
>

2021-12-01 03:09:26

by Gang Li

[permalink] [raw]
Subject: Re: [PATCH v7 0/5] Introduce multi-preference mempolicy

Hi Feng:

I am a little confused:
We have `MPOL_PREFERRED`, why you introduce `MPOL_PREFERRED_MANY`
instead of making `MPOL_PREFERRED` support multiple preferred nodes?

--
Thanks
Gang Li


2021-12-01 05:33:29

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v7 0/5] Introduce multi-preference mempolicy

Hi Gang,

On Wed, Dec 01, 2021 at 11:09:18AM +0800, Gang Li wrote:
> Hi Feng:
>
> I am a little confused:
> We have `MPOL_PREFERRED`, why you introduce `MPOL_PREFERRED_MANY` instead of
> making `MPOL_PREFERRED` support multiple preferred nodes?

Cc Ben and Dave.

Actually in the end of this cover letter, there is some explaination
about it, qutoed here:

"
In v1, Andi Kleen brought up reusing MPOL_PREFERRED as the mode for the API.
There wasn't consensus around this, so I've left the existing API as it was. I'm
open to more feedback here, but my slight preference is to use a new API as it
ensures if people are using it, they are entirely aware of what they're doing
and not accidentally misusing the old interface. (In a similar way to how
MPOL_LOCAL was introduced).

In v1, Michal also brought up renaming this MPOL_PREFERRED_MASK. I'm equally
fine with that change, but I hadn't heard much emphatic support for one way or
another, so I've left that too.
"

Ben made this as he initiated the patchset, and I agree this can keep
the API consistent for user . Also at that time, there was another
factor that policy MPOL_PREFERRED and MPOL_LOCAL were coupled tightly
together.

Thanks,
Feng