This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.
This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2)
interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a
preference for nodes which will fulfil memory allocation requests. Unlike the
MPOL_PREFERRED mode, it takes a set of nodes. Like the MPOL_BIND interface, it
works over a set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or
invoke the OOM killer if those preferred nodes are not available.
Along with these patches are patches for libnuma, numactl, numademo, and memhog.
They still need some polish, but can be found here:
https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many
It allows new usage: `numactl -P 0,3,4`
The goal of the new mode is to enable some use cases for tiered memory
systems, which I've lovingly named:
1a. The Hare - The interconnect is fast enough to meet bandwidth and latency
requirements allowing preference to be given to all nodes with "fast" memory.
1b. The Indiscriminate Hare - An application knows it wants fast memory (or
perhaps slow memory), but doesn't care which node it runs on. The application
can prefer a set of nodes and then bind its xpu (cpu, accelerator, etc.) to the
local node. This reverses how nodes are chosen today, where the kernel attempts
to use memory local to the CPU whenever possible; instead, the xpu local to the
memory is used.
2. The Tortoise - The administrator (or the application itself) is aware it only
needs slow memory, and so can prefer that.
Much of this is almost achievable with the bind interface, but the bind
interface suffers from an inability to fall back to another set of nodes if
binding fails for all nodes in the nodemask.
Like MPOL_BIND, a nodemask is given. Inherently, this removes ordering from the
preference.
> /* Set the first two nodes as preferred in an 8-node system. */
> const unsigned long nodes = 0x3;
> set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);
> /* Mimic interleave policy, but have fallback. */
> const unsigned long nodes = 0xaa;
> set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);
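The same mode applies per-VMA through mbind(2) as well. Here is a minimal,
untested sketch (not part of these patches); it assumes the
MPOL_PREFERRED_MANY value from the uapi patch in this series and uses the
libnuma <numaif.h> wrappers (link with -lnuma):
> #include <numaif.h>
> #include <sys/mman.h>
> #include <stdio.h>
>
> #ifndef MPOL_PREFERRED_MANY
> #define MPOL_PREFERRED_MANY 5	/* value from the uapi patch below */
> #endif
>
> int main(void)
> {
> 	size_t len = 1UL << 21;
> 	/* Prefer nodes 0 and 1 for this mapping; fall back elsewhere. */
> 	unsigned long nodes = 0x3;
> 	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
> 			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>
> 	if (buf == MAP_FAILED)
> 		return 1;
> 	if (mbind(buf, len, MPOL_PREFERRED_MANY, &nodes, 8, 0))
> 		perror("mbind");
> 	return 0;
> }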
Some internal discussion took place around the interface. There are two
alternatives which we have discussed, plus one I stuck in:
1. Ordered list of nodes. Currently it's believed that the added complexity is
not needed for the expected use cases.
2. A flag for bind to allow falling back to other nodes. This confuses the
notion of binding and is less flexible than the current solution.
3. Create flags or new modes that help with some ordering. This offers both a
friendlier API as well as a solution for more customized usage. It's unknown
whether it's worth the complexity to support this. Here is sample code for how
this might work:
> // Prefer specific nodes for something wacky
> set_mempolicy(MPOL_PREFERRED_MANY, 0x17c, 1024);
>
> // Default
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> // which is the same as
> set_mempolicy(MPOL_DEFAULT, NULL, 0);
>
> // The Hare
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
>
> // The Tortoise
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
>
> // Prefer the fast memory of the first two sockets
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
>
In v1, Andi Kleen brought up reusing MPOL_PREFERRED as the mode for the API.
There wasn't consensus around this, so I've left the existing API as it was. I'm
open to more feedback here, but my slight preference is to use a new API as it
ensures if people are using it, they are entirely aware of what they're doing
and not accidentally misusing the old interface. (In a similar way to how
MPOL_LOCAL was introduced).
In v1, Michal also brought up renaming this to MPOL_PREFERRED_MASK. I'm equally
fine with that change, but I haven't heard emphatic support one way or the
other, so I've left the name as is.
Changelog:
Since v3:
* Rebased against v5.12-rc2
* Drop the v3/0013 patch of creating NO_SLOWPATH gfp_mask bit
* Skip direct reclaim for the first allocation try for
MPOL_PREFERRED_MANY, which makes its semantics close to the
existing MPOL_PREFERRED policy
Since v2:
* Rebased against v5.11
* Fix a stack overflow related panic, and a kernel warning (Feng)
* Some code cleanup (Feng)
* One RFC patch to speed up memory allocation in some cases (Feng)
Since v1:
* Dropped patch to replace numa_node_id in some places (mhocko)
* Dropped all the page allocation patches in favor of new mechanism to
use fallbacks. (mhocko)
* Dropped the special snowflake preferred node algorithm (bwidawsk)
* If the preferred node fails, ALL nodes are rechecked instead of just
the non-preferred nodes.
v4 Summary:
1: Random fix I found along the way
2-5: Represent node preference as a mask internally
6-7: Treat many preferred like bind
8-11: Handle page allocation for the new policy
12: Enable the uapi
13: unify 2 functions
Ben Widawsky (8):
mm/mempolicy: Add comment for missing LOCAL
mm/mempolicy: kill v.preferred_nodes
mm/mempolicy: handle MPOL_PREFERRED_MANY like BIND
mm/mempolicy: Create a page allocator for policy
mm/mempolicy: Thread allocation for many preferred
mm/mempolicy: VMA allocation for many preferred
mm/mempolicy: huge-page allocation for many preferred
mm/mempolicy: Advertise new MPOL_PREFERRED_MANY
Dave Hansen (4):
mm/mempolicy: convert single preferred_node to full nodemask
mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
mm/mempolicy: allow preferred code to take a nodemask
mm/mempolicy: refactor rebind code for PREFERRED_MANY
Feng Tang (1):
mm/mempolicy: unify mpol_new_preferred() and
mpol_new_preferred_many()
.../admin-guide/mm/numa_memory_policy.rst | 22 +-
include/linux/mempolicy.h | 6 +-
include/uapi/linux/mempolicy.h | 6 +-
mm/hugetlb.c | 26 +-
mm/mempolicy.c | 272 ++++++++++++++-------
5 files changed, 225 insertions(+), 107 deletions(-)
--
2.7.4
From: Dave Hansen <[email protected]>
Create a helper function (mpol_new_preferred_many()) which is usable
both by the old, single-node MPOL_PREFERRED and the new
MPOL_PREFERRED_MANY.
Enforce the old single-node MPOL_PREFERRED behavior in the "new"
version of mpol_new_preferred() which calls mpol_new_preferred_many().
v3:
* fix a stack overflow caused by empty nodemask (Feng)
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Ben Widawsky <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
mm/mempolicy.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1228d8e..6fb2cab 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -203,17 +203,34 @@ static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes)
return 0;
}
-static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
+static int mpol_new_preferred_many(struct mempolicy *pol,
+ const nodemask_t *nodes)
{
if (!nodes)
pol->flags |= MPOL_F_LOCAL; /* local allocation */
else if (nodes_empty(*nodes))
return -EINVAL; /* no allowed nodes */
else
- pol->v.preferred_nodes = nodemask_of_node(first_node(*nodes));
+ pol->v.preferred_nodes = *nodes;
return 0;
}
+static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
+{
+ if (nodes) {
+ /* MPOL_PREFERRED can only take a single node: */
+ nodemask_t tmp;
+
+ if (nodes_empty(*nodes))
+ return -EINVAL;
+
+ tmp = nodemask_of_node(first_node(*nodes));
+ return mpol_new_preferred_many(pol, &tmp);
+ }
+
+ return mpol_new_preferred_many(pol, NULL);
+}
+
static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
{
if (nodes_empty(*nodes))
--
2.7.4
From: Dave Hansen <[email protected]>
The NUMA APIs currently allow passing in a "preferred node" as a
single bit set in a nodemask. If more than one bit is set, bits
after the first are ignored. Internally, this is implemented as
a single integer: mempolicy->preferred_node.
This single node is generally OK for location-based NUMA where
memory being allocated will eventually be operated on by a single
CPU. However, in systems with multiple memory types, folks want
to target a *type* of memory instead of a location. For instance,
someone might want some high-bandwidth memory but does not care about
the CPU next to which it is allocated. Or, they might want a cheap,
high-capacity allocation and want to target all NUMA nodes which
have persistent memory in volatile mode. In both of these cases,
the application wants to target a *set* of nodes, but does not
want strict MPOL_BIND behavior as that could lead to OOM killer or
SIGSEGV.
To get that behavior, an MPOL_PREFERRED-style mode is desirable, but one
that honors multiple nodes set in the nodemask.
The first step in that direction is to be able to internally store
multiple preferred nodes, which is implemented in this patch.
This should not introduce any functional changes; it just switches the
internal representation of mempolicy->preferred_node from an
integer to a nodemask called 'mempolicy->preferred_nodes'.
This is not a pie-in-the-sky dream for an API. This was a response to a
specific ask of more than one group at Intel. Specifically:
1. There are existing libraries that target memory types such as
https://github.com/memkind/memkind. These are known to suffer
from SIGSEGVs when memory is low on targeted memory "kinds" that
span more than one node. The MCDRAM on a Xeon Phi in "Cluster on
Die" mode is an example of this.
2. Volatile-use persistent memory users want to have a memory policy
which is targeted at either "cheap and slow" (PMEM) or "expensive and
fast" (DRAM). However, they do not want to experience allocation
failures when the targeted type is unavailable.
3. Allocate-then-run. Generally, we let the process scheduler decide
on which physical CPU to run a task. That location provides a
default allocation policy, and memory availability is not generally
considered when placing tasks. For situations where memory is
valuable and constrained, some users want to allocate memory first,
*then* allocate close compute resources to the allocation. This is
the reverse of the normal (CPU) model. Accelerators such as GPUs
that operate on core-mm-managed memory are interested in this model; a
rough userspace sketch follows.
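[ A hypothetical sketch of the allocate-then-run flow above, not part of
this patch: it assumes the MPOL_PREFERRED_MANY value from the uapi patch
in this series and uses only existing libnuma/get_mempolicy interfaces. ]
#include <numa.h>
#include <numaif.h>
#include <stdlib.h>
#include <string.h>

#ifndef MPOL_PREFERRED_MANY
#define MPOL_PREFERRED_MANY 5	/* value from the uapi patch in this series */
#endif

static void alloc_then_run(size_t len)
{
	unsigned long fast_nodes = 0x3;	/* nodes 0-1 hold the "fast" memory */
	void *buf;
	int node = -1;

	if (numa_available() < 0)
		return;

	/* Prefer the fast nodes, but allow fallback anywhere. */
	set_mempolicy(MPOL_PREFERRED_MANY, &fast_nodes, 8);

	buf = malloc(len);
	memset(buf, 0, len);		/* fault the pages under the policy */

	/* Ask which node actually backs the first page... */
	get_mempolicy(&node, NULL, 0, buf, MPOL_F_NODE | MPOL_F_ADDR);
	/* ...and move the compute next to the memory, not vice versa. */
	if (node >= 0)
		numa_run_on_node(node);
}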
v2:
Fix spelling errors in commit message. (Ben)
clang-format. (Ben)
Integrated bit from another patch. (Ben)
Update the docs to reflect the internal data structure change (Ben)
Don't advertise MPOL_PREFERRED_MANY in UAPI until we can handle it (Ben)
Added more to the commit message (Dave)
Link: https://lore.kernel.org/r/[email protected]
Co-developed-by: Ben Widawsky <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Ben Widawsky <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
.../admin-guide/mm/numa_memory_policy.rst | 6 ++--
include/linux/mempolicy.h | 4 +--
mm/mempolicy.c | 40 ++++++++++++----------
3 files changed, 27 insertions(+), 23 deletions(-)
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 067a90a..1ad020c 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -205,9 +205,9 @@ MPOL_PREFERRED
of increasing distance from the preferred node based on
information provided by the platform firmware.
- Internally, the Preferred policy uses a single node--the
- preferred_node member of struct mempolicy. When the internal
- mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
+ Internally, the Preferred policy uses a nodemask--the
+ preferred_nodes member of struct mempolicy. When the internal
+ mode flag MPOL_F_LOCAL is set, the preferred_nodes are ignored
and the policy is interpreted as local allocation. "Local"
allocation policy can be viewed as a Preferred policy that
starts at the node containing the cpu where the allocation
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5f1c74d..23ee105 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -47,8 +47,8 @@ struct mempolicy {
unsigned short mode; /* See MPOL_* above */
unsigned short flags; /* See set_mempolicy() MPOL_F_* above */
union {
- short preferred_node; /* preferred */
- nodemask_t nodes; /* interleave/bind */
+ nodemask_t preferred_nodes; /* preferred */
+ nodemask_t nodes; /* interleave/bind */
/* undefined for default */
} v;
union {
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4193566..2b1e0e4 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -205,7 +205,7 @@ static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
else if (nodes_empty(*nodes))
return -EINVAL; /* no allowed nodes */
else
- pol->v.preferred_node = first_node(*nodes);
+ pol->v.preferred_nodes = nodemask_of_node(first_node(*nodes));
return 0;
}
@@ -345,22 +345,26 @@ static void mpol_rebind_preferred(struct mempolicy *pol,
const nodemask_t *nodes)
{
nodemask_t tmp;
+ nodemask_t preferred_node;
+
+ /* MPOL_PREFERRED uses only the first node in the mask */
+ preferred_node = nodemask_of_node(first_node(*nodes));
if (pol->flags & MPOL_F_STATIC_NODES) {
int node = first_node(pol->w.user_nodemask);
if (node_isset(node, *nodes)) {
- pol->v.preferred_node = node;
+ pol->v.preferred_nodes = nodemask_of_node(node);
pol->flags &= ~MPOL_F_LOCAL;
} else
pol->flags |= MPOL_F_LOCAL;
} else if (pol->flags & MPOL_F_RELATIVE_NODES) {
mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
- pol->v.preferred_node = first_node(tmp);
+ pol->v.preferred_nodes = tmp;
} else if (!(pol->flags & MPOL_F_LOCAL)) {
- pol->v.preferred_node = node_remap(pol->v.preferred_node,
- pol->w.cpuset_mems_allowed,
- *nodes);
+ nodes_remap(tmp, pol->v.preferred_nodes,
+ pol->w.cpuset_mems_allowed, preferred_node);
+ pol->v.preferred_nodes = tmp;
pol->w.cpuset_mems_allowed = *nodes;
}
}
@@ -922,7 +926,7 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
break;
case MPOL_PREFERRED:
if (!(p->flags & MPOL_F_LOCAL))
- node_set(p->v.preferred_node, *nodes);
+ *nodes = p->v.preferred_nodes;
/* else return empty node mask for local allocation */
break;
default:
@@ -1891,9 +1895,9 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
/* Return the node id preferred by the given mempolicy, or the given id */
static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
{
- if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
- nd = policy->v.preferred_node;
- else {
+ if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL)) {
+ nd = first_node(policy->v.preferred_nodes);
+ } else {
/*
* __GFP_THISNODE shouldn't even be used with the bind policy
* because we might easily break the expectation to stay on the
@@ -1938,7 +1942,7 @@ unsigned int mempolicy_slab_node(void)
/*
* handled MPOL_F_LOCAL above
*/
- return policy->v.preferred_node;
+ return first_node(policy->v.preferred_nodes);
case MPOL_INTERLEAVE:
return interleave_nodes(policy);
@@ -2072,7 +2076,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
if (mempolicy->flags & MPOL_F_LOCAL)
nid = numa_node_id();
else
- nid = mempolicy->v.preferred_node;
+ nid = first_node(mempolicy->v.preferred_nodes);
init_nodemask_of_node(mask, nid);
break;
@@ -2210,7 +2214,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
* node in its nodemask, we allocate the standard way.
*/
if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
- hpage_node = pol->v.preferred_node;
+ hpage_node = first_node(pol->v.preferred_nodes);
nmask = policy_nodemask(gfp, pol);
if (!nmask || node_isset(hpage_node, *nmask)) {
@@ -2349,7 +2353,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
/* a's ->flags is the same as b's */
if (a->flags & MPOL_F_LOCAL)
return true;
- return a->v.preferred_node == b->v.preferred_node;
+ return nodes_equal(a->v.preferred_nodes, b->v.preferred_nodes);
default:
BUG();
return false;
@@ -2493,7 +2497,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
if (pol->flags & MPOL_F_LOCAL)
polnid = numa_node_id();
else
- polnid = pol->v.preferred_node;
+ polnid = first_node(pol->v.preferred_nodes);
break;
case MPOL_BIND:
@@ -2816,7 +2820,7 @@ void __init numa_policy_init(void)
.refcnt = ATOMIC_INIT(1),
.mode = MPOL_PREFERRED,
.flags = MPOL_F_MOF | MPOL_F_MORON,
- .v = { .preferred_node = nid, },
+ .v = { .preferred_nodes = nodemask_of_node(nid), },
};
}
@@ -2982,7 +2986,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
if (mode != MPOL_PREFERRED)
new->v.nodes = nodes;
else if (nodelist)
- new->v.preferred_node = first_node(nodes);
+ new->v.preferred_nodes = nodemask_of_node(first_node(nodes));
else
new->flags |= MPOL_F_LOCAL;
@@ -3035,7 +3039,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
if (flags & MPOL_F_LOCAL)
mode = MPOL_LOCAL;
else
- node_set(pol->v.preferred_node, nodes);
+ nodes_or(nodes, nodes, pol->v.preferred_nodes);
break;
case MPOL_BIND:
case MPOL_INTERLEAVE:
--
2.7.4
From: Ben Widawsky <[email protected]>
Begin the real plumbing for handling this new policy. Now that the
internal representation for preferred nodes and bound nodes is the same,
and we can envision how multiple preferred nodes will behave, there are
obvious places where we can simply reuse the bind behavior.
In v1 of this series, the moral equivalent was "mm: Finish handling
MPOL_PREFERRED_MANY". Like that patch, this attempts to implement the
easiest spots for the new policy. Unlike that patch, this one simply
reuses the BIND behavior.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ben Widawsky <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
mm/mempolicy.c | 22 +++++++---------------
1 file changed, 7 insertions(+), 15 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index eba207e..d945f29 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -963,8 +963,6 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
switch (p->mode) {
case MPOL_BIND:
case MPOL_INTERLEAVE:
- *nodes = p->nodes;
- break;
case MPOL_PREFERRED_MANY:
*nodes = p->nodes;
break;
@@ -1928,7 +1926,8 @@ static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
{
/* Lower zones don't get a nodemask applied for MPOL_BIND */
- if (unlikely(policy->mode == MPOL_BIND) &&
+ if (unlikely(policy->mode == MPOL_BIND ||
+ policy->mode == MPOL_PREFERRED_MANY) &&
apply_policy_zone(policy, gfp_zone(gfp)) &&
cpuset_nodemask_valid_mems_allowed(&policy->nodes))
return &policy->nodes;
@@ -1984,7 +1983,6 @@ unsigned int mempolicy_slab_node(void)
return node;
switch (policy->mode) {
- case MPOL_PREFERRED_MANY:
case MPOL_PREFERRED:
/*
* handled MPOL_F_LOCAL above
@@ -1994,6 +1992,7 @@ unsigned int mempolicy_slab_node(void)
case MPOL_INTERLEAVE:
return interleave_nodes(policy);
+ case MPOL_PREFERRED_MANY:
case MPOL_BIND: {
struct zoneref *z;
@@ -2119,9 +2118,6 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
task_lock(current);
mempolicy = current->mempolicy;
switch (mempolicy->mode) {
- case MPOL_PREFERRED_MANY:
- *mask = mempolicy->nodes;
- break;
case MPOL_PREFERRED:
if (mempolicy->flags & MPOL_F_LOCAL)
nid = numa_node_id();
@@ -2132,6 +2128,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
case MPOL_BIND:
case MPOL_INTERLEAVE:
+ case MPOL_PREFERRED_MANY:
*mask = mempolicy->nodes;
break;
@@ -2175,12 +2172,11 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
* Thus, it's possible for tsk to have allocated memory from
* nodes in mask.
*/
- break;
- case MPOL_PREFERRED_MANY:
ret = nodes_intersects(mempolicy->nodes, *mask);
break;
case MPOL_BIND:
case MPOL_INTERLEAVE:
+ case MPOL_PREFERRED_MANY:
ret = nodes_intersects(mempolicy->nodes, *mask);
break;
default:
@@ -2404,7 +2400,6 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
switch (a->mode) {
case MPOL_BIND:
case MPOL_INTERLEAVE:
- return !!nodes_equal(a->nodes, b->nodes);
case MPOL_PREFERRED_MANY:
return !!nodes_equal(a->nodes, b->nodes);
case MPOL_PREFERRED:
@@ -2558,6 +2553,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
polnid = first_node(pol->nodes);
break;
+ case MPOL_PREFERRED_MANY:
case MPOL_BIND:
/* Optimize placement among multiple nodes via NUMA balancing */
if (pol->flags & MPOL_F_MORON) {
@@ -2580,8 +2576,6 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
polnid = zone_to_nid(z->zone);
break;
- /* case MPOL_PREFERRED_MANY: */
-
default:
BUG();
}
@@ -3094,15 +3088,13 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
switch (mode) {
case MPOL_DEFAULT:
break;
- case MPOL_PREFERRED_MANY:
- WARN_ON(flags & MPOL_F_LOCAL);
- fallthrough;
case MPOL_PREFERRED:
if (flags & MPOL_F_LOCAL)
mode = MPOL_LOCAL;
else
nodes_or(nodes, nodes, pol->nodes);
break;
+ case MPOL_PREFERRED_MANY:
case MPOL_BIND:
case MPOL_INTERLEAVE:
nodes = pol->nodes;
--
2.7.4
From: Ben Widawsky <[email protected]>
Now that preferred_nodes is just a mask, and policies are mutually
exclusive, there is no reason to have a separate mask.
This patch is optional. It definitely helps clean up code in future
patches, but there is no functional difference to leaving it with the
previous name. I do believe it helps demonstrate the exclusivity of the
fields.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ben Widawsky <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
include/linux/mempolicy.h | 6 +--
mm/mempolicy.c | 114 ++++++++++++++++++++++------------------------
2 files changed, 56 insertions(+), 64 deletions(-)
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 23ee105..ec811c3 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -46,11 +46,7 @@ struct mempolicy {
atomic_t refcnt;
unsigned short mode; /* See MPOL_* above */
unsigned short flags; /* See set_mempolicy() MPOL_F_* above */
- union {
- nodemask_t preferred_nodes; /* preferred */
- nodemask_t nodes; /* interleave/bind */
- /* undefined for default */
- } v;
+ nodemask_t nodes; /* interleave/bind/many */
union {
nodemask_t cpuset_mems_allowed; /* relative to these nodes */
nodemask_t user_nodemask; /* nodemask passed by user */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index fbfa3ce..eba207e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -199,7 +199,7 @@ static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes)
{
if (nodes_empty(*nodes))
return -EINVAL;
- pol->v.nodes = *nodes;
+ pol->nodes = *nodes;
return 0;
}
@@ -211,7 +211,7 @@ static int mpol_new_preferred_many(struct mempolicy *pol,
else if (nodes_empty(*nodes))
return -EINVAL; /* no allowed nodes */
else
- pol->v.preferred_nodes = *nodes;
+ pol->nodes = *nodes;
return 0;
}
@@ -235,7 +235,7 @@ static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
{
if (nodes_empty(*nodes))
return -EINVAL;
- pol->v.nodes = *nodes;
+ pol->nodes = *nodes;
return 0;
}
@@ -352,15 +352,15 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
else if (pol->flags & MPOL_F_RELATIVE_NODES)
mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
else {
- nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed,
- *nodes);
+ nodes_remap(tmp, pol->nodes, pol->w.cpuset_mems_allowed,
+ *nodes);
pol->w.cpuset_mems_allowed = *nodes;
}
if (nodes_empty(tmp))
tmp = *nodes;
- pol->v.nodes = tmp;
+ pol->nodes = tmp;
}
static void mpol_rebind_preferred_common(struct mempolicy *pol,
@@ -373,17 +373,17 @@ static void mpol_rebind_preferred_common(struct mempolicy *pol,
int node = first_node(pol->w.user_nodemask);
if (node_isset(node, *nodes)) {
- pol->v.preferred_nodes = nodemask_of_node(node);
+ pol->nodes = nodemask_of_node(node);
pol->flags &= ~MPOL_F_LOCAL;
} else
pol->flags |= MPOL_F_LOCAL;
} else if (pol->flags & MPOL_F_RELATIVE_NODES) {
mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
- pol->v.preferred_nodes = tmp;
+ pol->nodes = tmp;
} else if (!(pol->flags & MPOL_F_LOCAL)) {
- nodes_remap(tmp, pol->v.preferred_nodes,
- pol->w.cpuset_mems_allowed, *preferred_nodes);
- pol->v.preferred_nodes = tmp;
+ nodes_remap(tmp, pol->nodes, pol->w.cpuset_mems_allowed,
+ *preferred_nodes);
+ pol->nodes = tmp;
pol->w.cpuset_mems_allowed = *nodes;
}
}
@@ -963,14 +963,14 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
switch (p->mode) {
case MPOL_BIND:
case MPOL_INTERLEAVE:
- *nodes = p->v.nodes;
+ *nodes = p->nodes;
break;
case MPOL_PREFERRED_MANY:
- *nodes = p->v.preferred_nodes;
+ *nodes = p->nodes;
break;
case MPOL_PREFERRED:
if (!(p->flags & MPOL_F_LOCAL))
- *nodes = p->v.preferred_nodes;
+ *nodes = p->nodes;
/* else return empty node mask for local allocation */
break;
default:
@@ -1056,7 +1056,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
*policy = err;
} else if (pol == current->mempolicy &&
pol->mode == MPOL_INTERLEAVE) {
- *policy = next_node_in(current->il_prev, pol->v.nodes);
+ *policy = next_node_in(current->il_prev, pol->nodes);
} else {
err = -EINVAL;
goto out;
@@ -1908,14 +1908,14 @@ static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
BUG_ON(dynamic_policy_zone == ZONE_MOVABLE);
/*
- * if policy->v.nodes has movable memory only,
+ * if policy->nodes has movable memory only,
* we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only.
*
- * policy->v.nodes is intersect with node_states[N_MEMORY].
+ * policy->nodes is intersect with node_states[N_MEMORY].
* so if the following test faile, it implies
- * policy->v.nodes has movable memory only.
+ * policy->nodes has movable memory only.
*/
- if (!nodes_intersects(policy->v.nodes, node_states[N_HIGH_MEMORY]))
+ if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY]))
dynamic_policy_zone = ZONE_MOVABLE;
return zone >= dynamic_policy_zone;
@@ -1929,9 +1929,9 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
{
/* Lower zones don't get a nodemask applied for MPOL_BIND */
if (unlikely(policy->mode == MPOL_BIND) &&
- apply_policy_zone(policy, gfp_zone(gfp)) &&
- cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
- return &policy->v.nodes;
+ apply_policy_zone(policy, gfp_zone(gfp)) &&
+ cpuset_nodemask_valid_mems_allowed(&policy->nodes))
+ return &policy->nodes;
return NULL;
}
@@ -1942,7 +1942,7 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
if ((policy->mode == MPOL_PREFERRED ||
policy->mode == MPOL_PREFERRED_MANY) &&
!(policy->flags & MPOL_F_LOCAL)) {
- nd = first_node(policy->v.preferred_nodes);
+ nd = first_node(policy->nodes);
} else {
/*
* __GFP_THISNODE shouldn't even be used with the bind policy
@@ -1961,7 +1961,7 @@ static unsigned interleave_nodes(struct mempolicy *policy)
unsigned next;
struct task_struct *me = current;
- next = next_node_in(me->il_prev, policy->v.nodes);
+ next = next_node_in(me->il_prev, policy->nodes);
if (next < MAX_NUMNODES)
me->il_prev = next;
return next;
@@ -1989,7 +1989,7 @@ unsigned int mempolicy_slab_node(void)
/*
* handled MPOL_F_LOCAL above
*/
- return first_node(policy->v.preferred_nodes);
+ return first_node(policy->nodes);
case MPOL_INTERLEAVE:
return interleave_nodes(policy);
@@ -2005,7 +2005,7 @@ unsigned int mempolicy_slab_node(void)
enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
zonelist = &NODE_DATA(node)->node_zonelists[ZONELIST_FALLBACK];
z = first_zones_zonelist(zonelist, highest_zoneidx,
- &policy->v.nodes);
+ &policy->nodes);
return z->zone ? zone_to_nid(z->zone) : node;
}
@@ -2016,12 +2016,12 @@ unsigned int mempolicy_slab_node(void)
/*
* Do static interleaving for a VMA with known offset @n. Returns the n'th
- * node in pol->v.nodes (starting from n=0), wrapping around if n exceeds the
+ * node in pol->nodes (starting from n=0), wrapping around if n exceeds the
* number of present nodes.
*/
static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
{
- unsigned nnodes = nodes_weight(pol->v.nodes);
+ unsigned nnodes = nodes_weight(pol->nodes);
unsigned target;
int i;
int nid;
@@ -2029,9 +2029,9 @@ static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
if (!nnodes)
return numa_node_id();
target = (unsigned int)n % nnodes;
- nid = first_node(pol->v.nodes);
+ nid = first_node(pol->nodes);
for (i = 0; i < target; i++)
- nid = next_node(nid, pol->v.nodes);
+ nid = next_node(nid, pol->nodes);
return nid;
}
@@ -2087,7 +2087,7 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
} else {
nid = policy_node(gfp_flags, *mpol, numa_node_id());
if ((*mpol)->mode == MPOL_BIND)
- *nodemask = &(*mpol)->v.nodes;
+ *nodemask = &(*mpol)->nodes;
}
return nid;
}
@@ -2120,19 +2120,19 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
mempolicy = current->mempolicy;
switch (mempolicy->mode) {
case MPOL_PREFERRED_MANY:
- *mask = mempolicy->v.preferred_nodes;
+ *mask = mempolicy->nodes;
break;
case MPOL_PREFERRED:
if (mempolicy->flags & MPOL_F_LOCAL)
nid = numa_node_id();
else
- nid = first_node(mempolicy->v.preferred_nodes);
+ nid = first_node(mempolicy->nodes);
init_nodemask_of_node(mask, nid);
break;
case MPOL_BIND:
case MPOL_INTERLEAVE:
- *mask = mempolicy->v.nodes;
+ *mask = mempolicy->nodes;
break;
default:
@@ -2177,11 +2177,11 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
*/
break;
case MPOL_PREFERRED_MANY:
- ret = nodes_intersects(mempolicy->v.preferred_nodes, *mask);
+ ret = nodes_intersects(mempolicy->nodes, *mask);
break;
case MPOL_BIND:
case MPOL_INTERLEAVE:
- ret = nodes_intersects(mempolicy->v.nodes, *mask);
+ ret = nodes_intersects(mempolicy->nodes, *mask);
break;
default:
BUG();
@@ -2270,7 +2270,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
if ((pol->mode == MPOL_PREFERRED ||
pol->mode == MPOL_PREFERRED_MANY) &&
!(pol->flags & MPOL_F_LOCAL))
- hpage_node = first_node(pol->v.preferred_nodes);
+ hpage_node = first_node(pol->nodes);
nmask = policy_nodemask(gfp, pol);
if (!nmask || node_isset(hpage_node, *nmask)) {
@@ -2404,15 +2404,14 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
switch (a->mode) {
case MPOL_BIND:
case MPOL_INTERLEAVE:
- return !!nodes_equal(a->v.nodes, b->v.nodes);
+ return !!nodes_equal(a->nodes, b->nodes);
case MPOL_PREFERRED_MANY:
- return !!nodes_equal(a->v.preferred_nodes,
- b->v.preferred_nodes);
+ return !!nodes_equal(a->nodes, b->nodes);
case MPOL_PREFERRED:
/* a's ->flags is the same as b's */
if (a->flags & MPOL_F_LOCAL)
return true;
- return nodes_equal(a->v.preferred_nodes, b->v.preferred_nodes);
+ return nodes_equal(a->nodes, b->nodes);
default:
BUG();
return false;
@@ -2556,13 +2555,13 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
if (pol->flags & MPOL_F_LOCAL)
polnid = numa_node_id();
else
- polnid = first_node(pol->v.preferred_nodes);
+ polnid = first_node(pol->nodes);
break;
case MPOL_BIND:
/* Optimize placement among multiple nodes via NUMA balancing */
if (pol->flags & MPOL_F_MORON) {
- if (node_isset(thisnid, pol->v.nodes))
+ if (node_isset(thisnid, pol->nodes))
break;
goto out;
}
@@ -2573,12 +2572,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
* else select nearest allowed node, if any.
* If no allowed nodes, use current [!misplaced].
*/
- if (node_isset(curnid, pol->v.nodes))
+ if (node_isset(curnid, pol->nodes))
goto out;
- z = first_zones_zonelist(
- node_zonelist(numa_node_id(), GFP_HIGHUSER),
- gfp_zone(GFP_HIGHUSER),
- &pol->v.nodes);
+ z = first_zones_zonelist(node_zonelist(numa_node_id(),
+ GFP_HIGHUSER),
+ gfp_zone(GFP_HIGHUSER), &pol->nodes);
polnid = zone_to_nid(z->zone);
break;
@@ -2779,11 +2777,9 @@ int mpol_set_shared_policy(struct shared_policy *info,
struct sp_node *new = NULL;
unsigned long sz = vma_pages(vma);
- pr_debug("set_shared_policy %lx sz %lu %d %d %lx\n",
- vma->vm_pgoff,
- sz, npol ? npol->mode : -1,
- npol ? npol->flags : -1,
- npol ? nodes_addr(npol->v.nodes)[0] : NUMA_NO_NODE);
+ pr_debug("set_shared_policy %lx sz %lu %d %d %lx\n", vma->vm_pgoff, sz,
+ npol ? npol->mode : -1, npol ? npol->flags : -1,
+ npol ? nodes_addr(npol->nodes)[0] : NUMA_NO_NODE);
if (npol) {
new = sp_alloc(vma->vm_pgoff, vma->vm_pgoff + sz, npol);
@@ -2877,11 +2873,11 @@ void __init numa_policy_init(void)
0, SLAB_PANIC, NULL);
for_each_node(nid) {
- preferred_node_policy[nid] = (struct mempolicy) {
+ preferred_node_policy[nid] = (struct mempolicy){
.refcnt = ATOMIC_INIT(1),
.mode = MPOL_PREFERRED,
.flags = MPOL_F_MOF | MPOL_F_MORON,
- .v = { .preferred_nodes = nodemask_of_node(nid), },
+ .nodes = nodemask_of_node(nid),
};
}
@@ -3047,9 +3043,9 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
* for /proc/mounts, /proc/pid/mounts and /proc/pid/mountinfo.
*/
if (mode != MPOL_PREFERRED)
- new->v.nodes = nodes;
+ new->nodes = nodes;
else if (nodelist)
- new->v.preferred_nodes = nodemask_of_node(first_node(nodes));
+ new->nodes = nodemask_of_node(first_node(nodes));
else
new->flags |= MPOL_F_LOCAL;
@@ -3105,11 +3101,11 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
if (flags & MPOL_F_LOCAL)
mode = MPOL_LOCAL;
else
- nodes_or(nodes, nodes, pol->v.preferred_nodes);
+ nodes_or(nodes, nodes, pol->nodes);
break;
case MPOL_BIND:
case MPOL_INTERLEAVE:
- nodes = pol->v.nodes;
+ nodes = pol->nodes;
break;
default:
WARN_ON_ONCE(1);
--
2.7.4
From: Ben Widawsky <[email protected]>
In order to support MPOL_PREFERRED_MANY as the mode used by
set_mempolicy(2), alloc_pages_current() needs to support it. This patch
does that by using the new helper function to allocate properly based on
policy.
All the actual machinery to make this work was part of
("mm/mempolicy: Create a page allocator for policy")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ben Widawsky <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
mm/mempolicy.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d21105b..a92efe7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2357,7 +2357,7 @@ EXPORT_SYMBOL(alloc_pages_vma);
struct page *alloc_pages_current(gfp_t gfp, unsigned order)
{
struct mempolicy *pol = &default_policy;
- struct page *page;
+ int nid = NUMA_NO_NODE;
if (!in_interrupt() && !(gfp & __GFP_THISNODE))
pol = get_task_policy(current);
@@ -2367,14 +2367,9 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
* nor system default_policy
*/
if (pol->mode == MPOL_INTERLEAVE)
- page = alloc_pages_policy(pol, gfp, order,
- interleave_nodes(pol));
- else
- page = __alloc_pages_nodemask(gfp, order,
- policy_node(gfp, pol, numa_node_id()),
- policy_nodemask(gfp, pol));
+ nid = interleave_nodes(pol);
- return page;
+ return alloc_pages_policy(pol, gfp, order, nid);
}
EXPORT_SYMBOL(alloc_pages_current);
--
2.7.4
From: Ben Widawsky <[email protected]>
Add a helper function which takes care of handling multiple preferred
nodes. It will be called by future patches that need to handle this,
specifically VMA-based page allocation and task-based page allocation.
Huge pages don't quite fit the same pattern because they use different
underlying page allocation functions. This consumes the previous
interleave-policy-specific allocation function to make a one-stop shop
for policy-based allocation.
With this, MPOL_PREFERRED_MANY's semantics are more like MPOL_PREFERRED
in that it will first try the preferred node(s), and fall back to all
other nodes when the first try fails. Thanks to Michal Hocko for
suggestions on this.
For now, only the interleave policy uses the new helper, so there should
be no functional change yet. However, if bisection points to issues in
the next few commits, it was likely the fault of this patch.
Similar functionality is offered via policy_node() and
policy_nodemask(). By themselves, however, neither can achieve this
style of fallback across a set of nodes.
[ Feng: for the first try, add NOWARN flag, and skip the direct reclaim
to speedup allocation in some case ]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ben Widawsky <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
mm/mempolicy.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 52 insertions(+), 13 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d945f29..d21105b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2187,22 +2187,60 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
return ret;
}
-/* Allocate a page in interleaved policy.
- Own path because it needs to do special accounting. */
-static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
- unsigned nid)
+/* Handle page allocation for all but interleaved policies */
+static struct page *alloc_pages_policy(struct mempolicy *pol, gfp_t gfp,
+ unsigned int order, int preferred_nid)
{
struct page *page;
+ gfp_t gfp_mask = gfp;
- page = __alloc_pages(gfp, order, nid);
- /* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */
- if (!static_branch_likely(&vm_numa_stat_key))
+ if (pol->mode == MPOL_INTERLEAVE) {
+ page = __alloc_pages(gfp, order, preferred_nid);
+ /* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */
+ if (!static_branch_likely(&vm_numa_stat_key))
+ return page;
+ if (page && page_to_nid(page) == preferred_nid) {
+ preempt_disable();
+ __inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
+ preempt_enable();
+ }
return page;
- if (page && page_to_nid(page) == nid) {
- preempt_disable();
- __inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
- preempt_enable();
}
+
+ VM_BUG_ON(preferred_nid != NUMA_NO_NODE);
+
+ preferred_nid = numa_node_id();
+
+ /*
+ * There is a two pass approach implemented here for
+ * MPOL_PREFERRED_MANY. In the first pass we try the preferred nodes
+ * but allow the allocation to fail. The below table explains how
+ * this is achieved.
+ *
+ * | Policy | preferred nid | nodemask |
+ * |-------------------------------|---------------|------------|
+ * | MPOL_DEFAULT | local | NULL |
+ * | MPOL_PREFERRED | best | NULL |
+ * | MPOL_INTERLEAVE | ERR | ERR |
+ * | MPOL_BIND | local | pol->nodes |
+ * | MPOL_PREFERRED_MANY | best | pol->nodes |
+ * | MPOL_PREFERRED_MANY (round 2) | local | NULL |
+ * +-------------------------------+---------------+------------+
+ */
+ if (pol->mode == MPOL_PREFERRED_MANY) {
+ gfp_mask |= __GFP_NOWARN;
+
+ /* Skip direct reclaim, as there will be a second try */
+ gfp_mask &= ~__GFP_DIRECT_RECLAIM;
+ }
+
+ page = __alloc_pages_nodemask(gfp_mask, order,
+ policy_node(gfp, pol, preferred_nid),
+ policy_nodemask(gfp, pol));
+
+ if (unlikely(!page && pol->mode == MPOL_PREFERRED_MANY))
+ page = __alloc_pages_nodemask(gfp, order, preferred_nid, NULL);
+
return page;
}
@@ -2244,8 +2282,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
unsigned nid;
nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
+ page = alloc_pages_policy(pol, gfp, order, nid);
mpol_cond_put(pol);
- page = alloc_page_interleave(gfp, order, nid);
goto out;
}
@@ -2329,7 +2367,8 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
* nor system default_policy
*/
if (pol->mode == MPOL_INTERLEAVE)
- page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
+ page = alloc_pages_policy(pol, gfp, order,
+ interleave_nodes(pol));
else
page = __alloc_pages_nodemask(gfp, order,
policy_node(gfp, pol, numa_node_id()),
--
2.7.4
From: Ben Widawsky <[email protected]>
Adds a new mode to the existing mempolicy modes, MPOL_PREFERRED_MANY.
MPOL_PREFERRED_MANY will be adequately documented in the internal
admin-guide with this patch. Eventually, the man pages for mbind(2),
get_mempolicy(2), set_mempolicy(2) and numactl(8) will also have text
about this mode. Those shall contain the canonical reference.
NUMA systems continue to become more prevalent. New technologies like
PMEM make finer-grained control over memory access patterns increasingly
desirable. MPOL_PREFERRED_MANY allows userspace to specify a set of
nodes that will be tried first when performing allocations. If those
allocations fail, all remaining nodes will be tried. It's a
straightforward API which solves many of the presumptive needs of system
administrators wanting to optimize workloads on such machines. The mode
works either per VMA or per thread.
Generally speaking, this is similar to the way MPOL_BIND works, except
the user will only get a SIGSEGV if all nodes in the system are unable
to satisfy the allocation request.
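As a hypothetical usage sketch (not part of this patch), a task could opt
into the new mode and gracefully degrade on older kernels, which reject the
unknown mode with EINVAL; the MPOL_PREFERRED_MANY value below is taken from
the uapi change in this patch:
#include <numaif.h>
#include <errno.h>

#ifndef MPOL_PREFERRED_MANY
#define MPOL_PREFERRED_MANY 5	/* from the uapi change in this patch */
#endif

static int prefer_nodes(const unsigned long *nodes, unsigned long maxnode)
{
	/* New mode: try the whole set first, then everything else. */
	if (!set_mempolicy(MPOL_PREFERRED_MANY, nodes, maxnode))
		return 0;
	if (errno != EINVAL)
		return -1;
	/*
	 * Older kernel: fall back to MPOL_PREFERRED, which only honors
	 * the first node in the mask.
	 */
	return set_mempolicy(MPOL_PREFERRED, nodes, maxnode);
}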
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ben Widawsky <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
Documentation/admin-guide/mm/numa_memory_policy.rst | 16 ++++++++++++----
include/uapi/linux/mempolicy.h | 6 +++---
mm/hugetlb.c | 4 ++--
mm/mempolicy.c | 14 ++++++--------
4 files changed, 23 insertions(+), 17 deletions(-)
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 1ad020c..fcdaf97 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -245,6 +245,14 @@ MPOL_INTERLEAVED
address range or file. During system boot up, the temporary
interleaved system default policy works in this mode.
+MPOL_PREFERRED_MANY
+ This mode specifies that the allocation should be attempted from the
+ nodemask specified in the policy. If that allocation fails, the kernel
+ will search other nodes, in order of increasing distance from the first
+ set bit in the nodemask based on information provided by the platform
+ firmware. It is similar to MPOL_PREFERRED with the main exception that
+ it is an error to have an empty nodemask.
+
NUMA memory policy supports the following optional mode flags:
MPOL_F_STATIC_NODES
@@ -253,10 +261,10 @@ MPOL_F_STATIC_NODES
nodes changes after the memory policy has been defined.
Without this flag, any time a mempolicy is rebound because of a
- change in the set of allowed nodes, the node (Preferred) or
- nodemask (Bind, Interleave) is remapped to the new set of
- allowed nodes. This may result in nodes being used that were
- previously undesired.
+ change in the set of allowed nodes, the preferred nodemask (Preferred
+ Many), preferred node (Preferred) or nodemask (Bind, Interleave) is
+ remapped to the new set of allowed nodes. This may result in nodes
+ being used that were previously undesired.
With this flag, if the user-specified nodes overlap with the
nodes allowed by the task's cpuset, then the memory policy is
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 8948467..3dddd1e 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -16,13 +16,13 @@
*/
/* Policies */
-enum {
- MPOL_DEFAULT,
+enum { MPOL_DEFAULT,
MPOL_PREFERRED,
MPOL_BIND,
MPOL_INTERLEAVE,
MPOL_LOCAL,
- MPOL_MAX, /* always last member of enum */
+ MPOL_PREFERRED_MANY,
+ MPOL_MAX, /* always last member of enum */
};
/* Flags for set_mempolicy */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9dfbfa3..03ec958 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1126,7 +1126,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
gfp_mask = htlb_alloc_mask(h);
nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
- if (mpol->mode != MPOL_BIND && nodemask) { /* AKA MPOL_PREFERRED_MANY */
+ if (mpol->mode == MPOL_PREFERRED_MANY) {
gfp_t gfp_mask1 = gfp_mask | __GFP_NOWARN;
gfp_mask1 &= ~__GFP_DIRECT_RECLAIM;
@@ -1893,7 +1893,7 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
nodemask_t *nodemask;
nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
- if (mpol->mode != MPOL_BIND && nodemask) { /* AKA MPOL_PREFERRED_MANY */
+ if (mpol->mode == MPOL_PREFERRED_MANY) {
gfp_t gfp_mask1 = gfp_mask | __GFP_NOWARN;
gfp_mask1 &= ~__GFP_DIRECT_RECLAIM;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 40d32cb..18aa7dc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -108,8 +108,6 @@
#include "internal.h"
-#define MPOL_PREFERRED_MANY MPOL_MAX
-
/* Internal flags */
#define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0) /* Skip checks for continuous vmas */
#define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1) /* Invert check for nodemask */
@@ -180,7 +178,7 @@ struct mempolicy *get_task_policy(struct task_struct *p)
static const struct mempolicy_operations {
int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
void (*rebind)(struct mempolicy *pol, const nodemask_t *nodes);
-} mpol_ops[MPOL_MAX + 1];
+} mpol_ops[MPOL_MAX];
static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
{
@@ -389,8 +387,8 @@ static void mpol_rebind_preferred_common(struct mempolicy *pol,
}
/* MPOL_PREFERRED_MANY allows multiple nodes to be set in 'nodes' */
-static void __maybe_unused mpol_rebind_preferred_many(struct mempolicy *pol,
- const nodemask_t *nodes)
+static void mpol_rebind_preferred_many(struct mempolicy *pol,
+ const nodemask_t *nodes)
{
mpol_rebind_preferred_common(pol, nodes, nodes);
}
@@ -452,7 +450,7 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
mmap_write_unlock(mm);
}
-static const struct mempolicy_operations mpol_ops[MPOL_MAX + 1] = {
+static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
[MPOL_DEFAULT] = {
.rebind = mpol_rebind_default,
},
@@ -470,8 +468,8 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX + 1] = {
},
/* [MPOL_LOCAL] - see mpol_new() */
[MPOL_PREFERRED_MANY] = {
- .create = NULL,
- .rebind = NULL,
+ .create = mpol_new_preferred_many,
+ .rebind = mpol_rebind_preferred_many,
},
};
--
2.7.4
From: Ben Widawsky <[email protected]>
Implement the missing huge page allocation functionality while obeying
the preferred node semantics.
This uses a fallback mechanism to try multiple preferred nodes first,
and then all other nodes. It cannot use the helper function introduced
earlier because huge page allocation already has its own helpers, and
consolidating them would have taken more code and effort than it was
worth.
The weirdness is that MPOL_PREFERRED_MANY can't be used by name yet
because it is part of the UAPI we haven't yet exposed. Instead of making
that define global, it is simply changed with the UAPI patch.
[ feng: add NOWARN flag, and skip the direct reclaim to speedup allocation
in some case ]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ben Widawsky <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
mm/hugetlb.c | 26 +++++++++++++++++++++++---
mm/mempolicy.c | 3 ++-
2 files changed, 25 insertions(+), 4 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8fb42c6..9dfbfa3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1105,7 +1105,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
unsigned long address, int avoid_reserve,
long chg)
{
- struct page *page;
+ struct page *page = NULL;
struct mempolicy *mpol;
gfp_t gfp_mask;
nodemask_t *nodemask;
@@ -1126,7 +1126,17 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
gfp_mask = htlb_alloc_mask(h);
nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
- page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
+ if (mpol->mode != MPOL_BIND && nodemask) { /* AKA MPOL_PREFERRED_MANY */
+ gfp_t gfp_mask1 = gfp_mask | __GFP_NOWARN;
+
+ gfp_mask1 &= ~__GFP_DIRECT_RECLAIM;
+ page = dequeue_huge_page_nodemask(h,
+ gfp_mask1, nid, nodemask);
+ if (!page)
+ page = dequeue_huge_page_nodemask(h, gfp_mask, nid, NULL);
+ } else {
+ page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
+ }
if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
SetHPageRestoreReserve(page);
h->resv_huge_pages--;
@@ -1883,7 +1893,17 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
nodemask_t *nodemask;
nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
- page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
+ if (mpol->mode != MPOL_BIND && nodemask) { /* AKA MPOL_PREFERRED_MANY */
+ gfp_t gfp_mask1 = gfp_mask | __GFP_NOWARN;
+
+ gfp_mask1 &= ~__GFP_DIRECT_RECLAIM;
+ page = alloc_surplus_huge_page(h,
+ gfp_mask1, nid, nodemask);
+ if (!page)
+ page = alloc_surplus_huge_page(h, gfp_mask, nid, NULL);
+ } else {
+ page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
+ }
mpol_cond_put(mpol);
return page;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8fe76a7..40d32cb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2085,7 +2085,8 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
huge_page_shift(hstate_vma(vma)));
} else {
nid = policy_node(gfp_flags, *mpol, numa_node_id());
- if ((*mpol)->mode == MPOL_BIND)
+ if ((*mpol)->mode == MPOL_BIND ||
+ (*mpol)->mode == MPOL_PREFERRED_MANY)
*nodemask = &(*mpol)->nodes;
}
return nid;
--
2.7.4
From: Dave Hansen <[email protected]>
MPOL_PREFERRED honors only a single node set in the nodemask. Add the
bare define for a new mode which will allow more than one.
The patch does all the plumbing without actually adding the new policy
type.
v2:
Plumb most MPOL_PREFERRED_MANY without exposing UAPI (Ben)
Fixes for checkpatch (Ben)
Link: https://lore.kernel.org/r/[email protected]
Co-developed-by: Ben Widawsky <[email protected]>
Signed-off-by: Ben Widawsky <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
mm/mempolicy.c | 46 ++++++++++++++++++++++++++++++++++++++++------
1 file changed, 40 insertions(+), 6 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2b1e0e4..1228d8e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -31,6 +31,9 @@
* but useful to set in a VMA when you have a non default
* process policy.
*
+ * preferred many Try a set of nodes first before normal fallback. This is
+ * similar to preferred without the special case.
+ *
* default Allocate on the local node first, or when on a VMA
* use the process policy. This is what Linux always did
* in a NUMA aware kernel and still does by, ahem, default.
@@ -105,6 +108,8 @@
#include "internal.h"
+#define MPOL_PREFERRED_MANY MPOL_MAX
+
/* Internal flags */
#define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0) /* Skip checks for continuous vmas */
#define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1) /* Invert check for nodemask */
@@ -175,7 +180,7 @@ struct mempolicy *get_task_policy(struct task_struct *p)
static const struct mempolicy_operations {
int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
void (*rebind)(struct mempolicy *pol, const nodemask_t *nodes);
-} mpol_ops[MPOL_MAX];
+} mpol_ops[MPOL_MAX + 1];
static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
{
@@ -415,7 +420,7 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
mmap_write_unlock(mm);
}
-static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
+static const struct mempolicy_operations mpol_ops[MPOL_MAX + 1] = {
[MPOL_DEFAULT] = {
.rebind = mpol_rebind_default,
},
@@ -432,6 +437,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
.rebind = mpol_rebind_nodemask,
},
/* [MPOL_LOCAL] - see mpol_new() */
+ [MPOL_PREFERRED_MANY] = {
+ .create = NULL,
+ .rebind = NULL,
+ },
};
static int migrate_page_add(struct page *page, struct list_head *pagelist,
@@ -924,6 +933,9 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
case MPOL_INTERLEAVE:
*nodes = p->v.nodes;
break;
+ case MPOL_PREFERRED_MANY:
+ *nodes = p->v.preferred_nodes;
+ break;
case MPOL_PREFERRED:
if (!(p->flags & MPOL_F_LOCAL))
*nodes = p->v.preferred_nodes;
@@ -1895,7 +1907,9 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
/* Return the node id preferred by the given mempolicy, or the given id */
static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
{
- if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL)) {
+ if ((policy->mode == MPOL_PREFERRED ||
+ policy->mode == MPOL_PREFERRED_MANY) &&
+ !(policy->flags & MPOL_F_LOCAL)) {
nd = first_node(policy->v.preferred_nodes);
} else {
/*
@@ -1938,6 +1952,7 @@ unsigned int mempolicy_slab_node(void)
return node;
switch (policy->mode) {
+ case MPOL_PREFERRED_MANY:
case MPOL_PREFERRED:
/*
* handled MPOL_F_LOCAL above
@@ -2072,6 +2087,9 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
task_lock(current);
mempolicy = current->mempolicy;
switch (mempolicy->mode) {
+ case MPOL_PREFERRED_MANY:
+ *mask = mempolicy->v.preferred_nodes;
+ break;
case MPOL_PREFERRED:
if (mempolicy->flags & MPOL_F_LOCAL)
nid = numa_node_id();
@@ -2126,6 +2144,9 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
* nodes in mask.
*/
break;
+ case MPOL_PREFERRED_MANY:
+ ret = nodes_intersects(mempolicy->v.preferred_nodes, *mask);
+ break;
case MPOL_BIND:
case MPOL_INTERLEAVE:
ret = nodes_intersects(mempolicy->v.nodes, *mask);
@@ -2210,10 +2231,13 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
* node and don't fall back to other nodes, as the cost of
* remote accesses would likely offset THP benefits.
*
- * If the policy is interleave, or does not allow the current
- * node in its nodemask, we allocate the standard way.
+ * If the policy is interleave or multiple preferred nodes, or
+ * does not allow the current node in its nodemask, we allocate
+ * the standard way.
*/
- if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
+ if ((pol->mode == MPOL_PREFERRED ||
+ pol->mode == MPOL_PREFERRED_MANY) &&
+ !(pol->flags & MPOL_F_LOCAL))
hpage_node = first_node(pol->v.preferred_nodes);
nmask = policy_nodemask(gfp, pol);
@@ -2349,6 +2373,9 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
case MPOL_BIND:
case MPOL_INTERLEAVE:
return !!nodes_equal(a->v.nodes, b->v.nodes);
+ case MPOL_PREFERRED_MANY:
+ return !!nodes_equal(a->v.preferred_nodes,
+ b->v.preferred_nodes);
case MPOL_PREFERRED:
/* a's ->flags is the same as b's */
if (a->flags & MPOL_F_LOCAL)
@@ -2523,6 +2550,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
polnid = zone_to_nid(z->zone);
break;
+ /* case MPOL_PREFERRED_MANY: */
+
default:
BUG();
}
@@ -2874,6 +2903,7 @@ static const char * const policy_modes[] =
[MPOL_BIND] = "bind",
[MPOL_INTERLEAVE] = "interleave",
[MPOL_LOCAL] = "local",
+ [MPOL_PREFERRED_MANY] = "prefer (many)",
};
@@ -2953,6 +2983,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
if (!nodelist)
err = 0;
goto out;
+ case MPOL_PREFERRED_MANY:
case MPOL_BIND:
/*
* Insist on a nodelist
@@ -3035,6 +3066,9 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
switch (mode) {
case MPOL_DEFAULT:
break;
+ case MPOL_PREFERRED_MANY:
+ WARN_ON(flags & MPOL_F_LOCAL);
+ fallthrough;
case MPOL_PREFERRED:
if (flags & MPOL_F_LOCAL)
mode = MPOL_LOCAL;
--
2.7.4
From: Dave Hansen <[email protected]>
Again, this extracts the "only one node must be set" behavior of
MPOL_PREFERRED. It retains virtually all of the existing code so it can
be used by MPOL_PREFERRED_MANY as well.
v2:
Fixed typos in commit message. (Ben)
Merged bits from other patches. (Ben)
annotate mpol_rebind_preferred_many as unused (Ben)
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Ben Widawsky <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
mm/mempolicy.c | 29 ++++++++++++++++++++++-------
1 file changed, 22 insertions(+), 7 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 6fb2cab..fbfa3ce 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -363,14 +363,11 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
pol->v.nodes = tmp;
}
-static void mpol_rebind_preferred(struct mempolicy *pol,
- const nodemask_t *nodes)
+static void mpol_rebind_preferred_common(struct mempolicy *pol,
+ const nodemask_t *preferred_nodes,
+ const nodemask_t *nodes)
{
nodemask_t tmp;
- nodemask_t preferred_node;
-
- /* MPOL_PREFERRED uses only the first node in the mask */
- preferred_node = nodemask_of_node(first_node(*nodes));
if (pol->flags & MPOL_F_STATIC_NODES) {
int node = first_node(pol->w.user_nodemask);
@@ -385,12 +382,30 @@ static void mpol_rebind_preferred(struct mempolicy *pol,
pol->v.preferred_nodes = tmp;
} else if (!(pol->flags & MPOL_F_LOCAL)) {
nodes_remap(tmp, pol->v.preferred_nodes,
- pol->w.cpuset_mems_allowed, preferred_node);
+ pol->w.cpuset_mems_allowed, *preferred_nodes);
pol->v.preferred_nodes = tmp;
pol->w.cpuset_mems_allowed = *nodes;
}
}
+/* MPOL_PREFERRED_MANY allows multiple nodes to be set in 'nodes' */
+static void __maybe_unused mpol_rebind_preferred_many(struct mempolicy *pol,
+ const nodemask_t *nodes)
+{
+ mpol_rebind_preferred_common(pol, nodes, nodes);
+}
+
+static void mpol_rebind_preferred(struct mempolicy *pol,
+ const nodemask_t *nodes)
+{
+ nodemask_t preferred_node;
+
+ /* MPOL_PREFERRED uses only the first node in 'nodes' */
+ preferred_node = nodemask_of_node(first_node(*nodes));
+
+ mpol_rebind_preferred_common(pol, &preferred_node, nodes);
+}
+
/*
* mpol_rebind_policy - Migrate a policy to a different set of nodes
*
--
2.7.4
From: Ben Widawsky <[email protected]>
This patch implements MPOL_PREFERRED_MANY for alloc_pages_vma(). Like
alloc_pages_current(), alloc_pages_vma() needs to support policy based
decisions if they've been configured via mbind(2).
The temporary "hack" of treating MPOL_PREFERRED and MPOL_PREFERRED_MANY
the same can now be removed with this, too.
All the actual machinery to make this work was part of
("mm/mempolicy: Create a page allocator for policy")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ben Widawsky <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
mm/mempolicy.c | 29 +++++++++++++++++++++--------
1 file changed, 21 insertions(+), 8 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a92efe7..8fe76a7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2273,8 +2273,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
{
struct mempolicy *pol;
struct page *page;
- int preferred_nid;
- nodemask_t *nmask;
pol = get_vma_policy(vma, addr);
@@ -2288,6 +2286,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
}
if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
+ nodemask_t *nmask;
int hpage_node = node;
/*
@@ -2301,10 +2300,26 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
* does not allow the current node in its nodemask, we allocate
* the standard way.
*/
- if ((pol->mode == MPOL_PREFERRED ||
- pol->mode == MPOL_PREFERRED_MANY) &&
- !(pol->flags & MPOL_F_LOCAL))
+ if (pol->mode == MPOL_PREFERRED || !(pol->flags & MPOL_F_LOCAL)) {
hpage_node = first_node(pol->nodes);
+ } else if (pol->mode == MPOL_PREFERRED_MANY) {
+ struct zoneref *z;
+
+ /*
+ * In this policy, with direct reclaim, the normal
+ * policy based allocation will do the right thing - try
+ * twice using the preferred nodes first, and all nodes
+ * second.
+ */
+ if (gfp & __GFP_DIRECT_RECLAIM) {
+ page = alloc_pages_policy(pol, gfp, order, NUMA_NO_NODE);
+ goto out;
+ }
+
+ z = first_zones_zonelist(node_zonelist(numa_node_id(), GFP_HIGHUSER),
+ gfp_zone(GFP_HIGHUSER), &pol->nodes);
+ hpage_node = zone_to_nid(z->zone);
+ }
nmask = policy_nodemask(gfp, pol);
if (!nmask || node_isset(hpage_node, *nmask)) {
@@ -2330,9 +2345,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
}
}
- nmask = policy_nodemask(gfp, pol);
- preferred_nid = policy_node(gfp, pol, node);
- page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
+ page = alloc_pages_policy(pol, gfp, order, NUMA_NO_NODE);
mpol_cond_put(pol);
out:
return page;
--
2.7.4
Unify mpol_new_preferred() and mpol_new_preferred_many() to reduce code duplication.
Signed-off-by: Feng Tang <[email protected]>
---
mm/mempolicy.c | 25 +++++++------------------
1 file changed, 7 insertions(+), 18 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 18aa7dc..ee99ecc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -201,32 +201,21 @@ static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes)
return 0;
}
-static int mpol_new_preferred_many(struct mempolicy *pol,
+/* cover both MPOL_PREFERRED and MPOL_PREFERRED_MANY */
+static int mpol_new_preferred(struct mempolicy *pol,
const nodemask_t *nodes)
{
if (!nodes)
pol->flags |= MPOL_F_LOCAL; /* local allocation */
else if (nodes_empty(*nodes))
return -EINVAL; /* no allowed nodes */
- else
- pol->nodes = *nodes;
- return 0;
-}
-
-static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
-{
- if (nodes) {
+ else {
/* MPOL_PREFERRED can only take a single node: */
- nodemask_t tmp;
+ nodemask_t tmp = nodemask_of_node(first_node(*nodes));
- if (nodes_empty(*nodes))
- return -EINVAL;
-
- tmp = nodemask_of_node(first_node(*nodes));
- return mpol_new_preferred_many(pol, &tmp);
+ pol->nodes = (pol->mode == MPOL_PREFERRED) ? tmp : *nodes;
}
-
- return mpol_new_preferred_many(pol, NULL);
+ return 0;
}
static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
@@ -468,7 +457,7 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
},
/* [MPOL_LOCAL] - see mpol_new() */
[MPOL_PREFERRED_MANY] = {
- .create = mpol_new_preferred_many,
+ .create = mpol_new_preferred,
.rebind = mpol_rebind_preferred_many,
},
};
--
2.7.4
Hi Feng,
I love your patch! Yet something to improve:
[auto build test ERROR on linux/master]
[also build test ERROR on linus/master v5.12-rc3]
[cannot apply to next-20210316]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Feng-Tang/Introduced-multi-preference-mempolicy/20210317-114204
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git a74e6a014c9d4d4161061f770c9b4f98372ac778
config: s390-randconfig-r022-20210317 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 8ef111222a3dd12a9175f69c3bff598c46e8bdf7)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install s390 cross compiling tool for clang build
# apt-get install binutils-s390x-linux-gnu
# https://github.com/0day-ci/linux/commit/3bfe0c833846b79ae8bbfff60906c9d7c244c3b8
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Feng-Tang/Introduced-multi-preference-mempolicy/20210317-114204
git checkout 3bfe0c833846b79ae8bbfff60906c9d7c244c3b8
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=s390
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All errors (new ones prefixed by >>):
>> mm/hugetlb.c:1129:12: error: no member named 'mode' in 'struct mempolicy'
if (mpol->mode != MPOL_BIND && nodemask) { /* AKA MPOL_PREFERRED_MANY */
~~~~ ^
include/linux/compiler.h:56:47: note: expanded from macro 'if'
#define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
^~~~
include/linux/compiler.h:58:52: note: expanded from macro '__trace_if_var'
#define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
^~~~
>> mm/hugetlb.c:1129:12: error: no member named 'mode' in 'struct mempolicy'
if (mpol->mode != MPOL_BIND && nodemask) { /* AKA MPOL_PREFERRED_MANY */
~~~~ ^
include/linux/compiler.h:56:47: note: expanded from macro 'if'
#define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
^~~~
include/linux/compiler.h:58:61: note: expanded from macro '__trace_if_var'
#define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
^~~~
>> mm/hugetlb.c:1129:12: error: no member named 'mode' in 'struct mempolicy'
if (mpol->mode != MPOL_BIND && nodemask) { /* AKA MPOL_PREFERRED_MANY */
~~~~ ^
include/linux/compiler.h:56:47: note: expanded from macro 'if'
#define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
^~~~
include/linux/compiler.h:58:86: note: expanded from macro '__trace_if_var'
#define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
^~~~
include/linux/compiler.h:69:3: note: expanded from macro '__trace_if_value'
(cond) ? \
^~~~
mm/hugetlb.c:1896:12: error: no member named 'mode' in 'struct mempolicy'
if (mpol->mode != MPOL_BIND && nodemask) { /* AKA MPOL_PREFERRED_MANY */
~~~~ ^
include/linux/compiler.h:56:47: note: expanded from macro 'if'
#define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
^~~~
include/linux/compiler.h:58:52: note: expanded from macro '__trace_if_var'
#define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
^~~~
mm/hugetlb.c:1896:12: error: no member named 'mode' in 'struct mempolicy'
if (mpol->mode != MPOL_BIND && nodemask) { /* AKA MPOL_PREFERRED_MANY */
~~~~ ^
include/linux/compiler.h:56:47: note: expanded from macro 'if'
#define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
^~~~
include/linux/compiler.h:58:61: note: expanded from macro '__trace_if_var'
#define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
^~~~
mm/hugetlb.c:1896:12: error: no member named 'mode' in 'struct mempolicy'
if (mpol->mode != MPOL_BIND && nodemask) { /* AKA MPOL_PREFERRED_MANY */
~~~~ ^
include/linux/compiler.h:56:47: note: expanded from macro 'if'
#define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
^~~~
include/linux/compiler.h:58:86: note: expanded from macro '__trace_if_var'
#define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
^~~~
include/linux/compiler.h:69:3: note: expanded from macro '__trace_if_value'
(cond) ? \
^~~~
20 warnings and 6 errors generated.
vim +1129 mm/hugetlb.c
1102
1103 static struct page *dequeue_huge_page_vma(struct hstate *h,
1104 struct vm_area_struct *vma,
1105 unsigned long address, int avoid_reserve,
1106 long chg)
1107 {
1108 struct page *page = NULL;
1109 struct mempolicy *mpol;
1110 gfp_t gfp_mask;
1111 nodemask_t *nodemask;
1112 int nid;
1113
1114 /*
1115 * A child process with MAP_PRIVATE mappings created by their parent
1116 * have no page reserves. This check ensures that reservations are
1117 * not "stolen". The child may still get SIGKILLed
1118 */
1119 if (!vma_has_reserves(vma, chg) &&
1120 h->free_huge_pages - h->resv_huge_pages == 0)
1121 goto err;
1122
1123 /* If reserves cannot be used, ensure enough pages are in the pool */
1124 if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
1125 goto err;
1126
1127 gfp_mask = htlb_alloc_mask(h);
1128 nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
> 1129 if (mpol->mode != MPOL_BIND && nodemask) { /* AKA MPOL_PREFERRED_MANY */
1130 gfp_t gfp_mask1 = gfp_mask | __GFP_NOWARN;
1131
1132 gfp_mask1 &= ~__GFP_DIRECT_RECLAIM;
1133 page = dequeue_huge_page_nodemask(h,
1134 gfp_mask1, nid, nodemask);
1135 if (!page)
1136 page = dequeue_huge_page_nodemask(h, gfp_mask, nid, NULL);
1137 } else {
1138 page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
1139 }
1140 if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
1141 SetHPageRestoreReserve(page);
1142 h->resv_huge_pages--;
1143 }
1144
1145 mpol_cond_put(mpol);
1146 return page;
1147
1148 err:
1149 return NULL;
1150 }
1151
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
[Cc linux-api]
On Wed 17-03-21 11:39:57, Feng Tang wrote:
> This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.
> This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2)
> interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a
> preference for nodes which will fulfil memory allocation requests. Unlike the
> MPOL_PREFERRED mode, it takes a set of nodes. Like the MPOL_BIND interface, it
> works over a set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or
> invoke the OOM killer if those preferred nodes are not available.
>
> Along with these patches are patches for libnuma, numactl, numademo, and memhog.
> They still need some polish, but can be found here:
> https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many
> It allows new usage: `numactl -P 0,3,4`
>
> The goal of the new mode is to enable some use-cases when using tiered memory
> usage models which I've lovingly named.
> 1a. The Hare - The interconnect is fast enough to meet bandwidth and latency
> requirements allowing preference to be given to all nodes with "fast" memory.
> 1b. The Indiscriminate Hare - An application knows it wants fast memory (or
> perhaps slow memory), but doesn't care which node it runs on. The application
> can prefer a set of nodes and then xpu bind to the local node (cpu, accelerator,
> etc). This reverses the nodes are chosen today where the kernel attempts to use
> local memory to the CPU whenever possible. This will attempt to use the local
> accelerator to the memory.
> 2. The Tortoise - The administrator (or the application itself) is aware it only
> needs slow memory, and so can prefer that.
>
> Much of this is almost achievable with the bind interface, but the bind
> interface suffers from an inability to fallback to another set of nodes if
> binding fails to all nodes in the nodemask.
>
> Like MPOL_BIND a nodemask is given. Inherently this removes ordering from the
> preference.
>
> > /* Set first two nodes as preferred in an 8 node system. */
> > const unsigned long nodes = 0x3
> > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
>
> > /* Mimic interleave policy, but have fallback *.
> > const unsigned long nodes = 0xaa
> > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
>
> Some internal discussion took place around the interface. There are two
> alternatives which we have discussed, plus one I stuck in:
> 1. Ordered list of nodes. Currently it's believed that the added complexity is
> nod needed for expected usecases.
> 2. A flag for bind to allow falling back to other nodes. This confuses the
> notion of binding and is less flexible than the current solution.
> 3. Create flags or new modes that helps with some ordering. This offers both a
> friendlier API as well as a solution for more customized usage. It's unknown
> if it's worth the complexity to support this. Here is sample code for how
> this might work:
>
> > // Prefer specific nodes for some something wacky
> > set_mempolicy(MPOL_PREFER_MANY, 0x17c, 1024);
> >
> > // Default
> > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> > // which is the same as
> > set_mempolicy(MPOL_DEFAULT, NULL, 0);
> >
> > // The Hare
> > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
> >
> > // The Tortoise
> > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
> >
> > // Prefer the fast memory of the first two sockets
> > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
> >
>
> In v1, Andi Kleen brought up reusing MPOL_PREFERRED as the mode for the API.
> There wasn't consensus around this, so I've left the existing API as it was. I'm
> open to more feedback here, but my slight preference is to use a new API as it
> ensures if people are using it, they are entirely aware of what they're doing
> and not accidentally misusing the old interface. (In a similar way to how
> MPOL_LOCAL was introduced).
>
> In v1, Michal also brought up renaming this MPOL_PREFERRED_MASK. I'm equally
> fine with that change, but I hadn't heard much emphatic support for one way or
> another, so I've left that too.
>
> Changelog:
>
> Since v3:
> * Rebased against v5.12-rc2
> * Drop the v3/0013 patch of creating NO_SLOWPATH gfp_mask bit
> * Skip direct reclaim for the first allocation try for
> MPOL_PREFERRED_MANY, which makes its semantics close to
> existing MPOL_PREFERRED policy
>
> Since v2:
> * Rebased against v5.11
> * Fix a stack overflow related panic, and a kernel warning (Feng)
> * Some code clearup (Feng)
> * One RFC patch to speedup mem alloc in some case (Feng)
>
> Since v1:
> * Dropped patch to replace numa_node_id in some places (mhocko)
> * Dropped all the page allocation patches in favor of new mechanism to
> use fallbacks. (mhocko)
> * Dropped the special snowflake preferred node algorithm (bwidawsk)
> * If the preferred node fails, ALL nodes are rechecked instead of just
> the non-preferred nodes.
>
> v4 Summary:
> 1: Random fix I found along the way
> 2-5: Represent node preference as a mask internally
> 6-7: Treat many preferred like bind
> 8-11: Handle page allocation for the new policy
> 12: Enable the uapi
> 13: Unify 2 functions
>
> Ben Widawsky (8):
> mm/mempolicy: Add comment for missing LOCAL
> mm/mempolicy: kill v.preferred_nodes
> mm/mempolicy: handle MPOL_PREFERRED_MANY like BIND
> mm/mempolicy: Create a page allocator for policy
> mm/mempolicy: Thread allocation for many preferred
> mm/mempolicy: VMA allocation for many preferred
> mm/mempolicy: huge-page allocation for many preferred
> mm/mempolicy: Advertise new MPOL_PREFERRED_MANY
>
> Dave Hansen (4):
> mm/mempolicy: convert single preferred_node to full nodemask
> mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
> mm/mempolicy: allow preferred code to take a nodemask
> mm/mempolicy: refactor rebind code for PREFERRED_MANY
>
> Feng Tang (1):
> mm/mempolicy: unify mpol_new_preferred() and
> mpol_new_preferred_many()
>
> .../admin-guide/mm/numa_memory_policy.rst | 22 +-
> include/linux/mempolicy.h | 6 +-
> include/uapi/linux/mempolicy.h | 6 +-
> mm/hugetlb.c | 26 +-
> mm/mempolicy.c | 272 ++++++++++++++-------
> 5 files changed, 225 insertions(+), 107 deletions(-)
>
> --
> 2.7.4
--
Michal Hocko
SUSE Labs
On Wed 17-03-21 11:40:00, Feng Tang wrote:
> From: Dave Hansen <[email protected]>
>
> MPOL_PREFERRED honors only a single node set in the nodemask. Add the
> bare define for a new mode which will allow more than one.
>
> The patch does all the plumbing without actually adding the new policy
> type.
>
> v2:
> Plumb most MPOL_PREFERRED_MANY without exposing UAPI (Ben)
> Fixes for checkpatch (Ben)
>
> Link: https://lore.kernel.org/r/[email protected]
> Co-developed-by: Ben Widawsky <[email protected]>
> Signed-off-by: Ben Widawsky <[email protected]>
> Signed-off-by: Dave Hansen <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>
> ---
> mm/mempolicy.c | 46 ++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 40 insertions(+), 6 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 2b1e0e4..1228d8e 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -31,6 +31,9 @@
> * but useful to set in a VMA when you have a non default
> * process policy.
> *
> + * preferred many Try a set of nodes first before normal fallback. This is
> + * similar to preferred without the special case.
> + *
> * default Allocate on the local node first, or when on a VMA
> * use the process policy. This is what Linux always did
> * in a NUMA aware kernel and still does by, ahem, default.
> @@ -105,6 +108,8 @@
>
> #include "internal.h"
>
> +#define MPOL_PREFERRED_MANY MPOL_MAX
> +
> /* Internal flags */
> #define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0) /* Skip checks for continuous vmas */
> #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1) /* Invert check for nodemask */
> @@ -175,7 +180,7 @@ struct mempolicy *get_task_policy(struct task_struct *p)
> static const struct mempolicy_operations {
> int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
> void (*rebind)(struct mempolicy *pol, const nodemask_t *nodes);
> -} mpol_ops[MPOL_MAX];
> +} mpol_ops[MPOL_MAX + 1];
>
> static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
> {
> @@ -415,7 +420,7 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
> mmap_write_unlock(mm);
> }
>
> -static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
> +static const struct mempolicy_operations mpol_ops[MPOL_MAX + 1] = {
> [MPOL_DEFAULT] = {
> .rebind = mpol_rebind_default,
> },
> @@ -432,6 +437,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
> .rebind = mpol_rebind_nodemask,
> },
> /* [MPOL_LOCAL] - see mpol_new() */
> + [MPOL_PREFERRED_MANY] = {
> + .create = NULL,
> + .rebind = NULL,
> + },
> };
I do get that you wanted to keep MPOL_PREFERRED_MANY inaccessible from
userspace, but wouldn't it be much easier to simply check in the two
syscall entries rather than playing these MAX+1 games, which make the
review more complicated than necessary?
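For illustration, a check along these lines in kernel_set_mempolicy() and
kernel_mbind() would be enough until the UAPI patch (rough, untested sketch;
it assumes MPOL_PREFERRED_MANY already sits in the uapi mode enum):

	if (mode >= MPOL_MAX)
		return -EINVAL;
	/* not wired up for userspace until the final patch of the series */
	if (mode == MPOL_PREFERRED_MANY)
		return -EINVAL;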
>
> static int migrate_page_add(struct page *page, struct list_head *pagelist,
> @@ -924,6 +933,9 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
> case MPOL_INTERLEAVE:
> *nodes = p->v.nodes;
> break;
> + case MPOL_PREFERRED_MANY:
> + *nodes = p->v.preferred_nodes;
> + break;
> case MPOL_PREFERRED:
> if (!(p->flags & MPOL_F_LOCAL))
> *nodes = p->v.preferred_nodes;
Why do those two do slightly different things? Is this because, unlike
MPOL_PREFERRED, it can never have MPOL_F_LOCAL set? If that is the
case I would still stick the two together and use the same code for
both to make the code easier to follow. Now that both use the same
nodemask it should really be just about syscall input sanitization and
keeping the original behavior for MPOL_PREFERRED.
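E.g. a single shared branch (untested sketch; it assumes MPOL_PREFERRED_MANY
never has MPOL_F_LOCAL set, so the check is simply a no-op for it):

	case MPOL_PREFERRED:
	case MPOL_PREFERRED_MANY:
		if (!(p->flags & MPOL_F_LOCAL))
			*nodes = p->v.preferred_nodes;
		break;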
[...]
> @@ -2072,6 +2087,9 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
> task_lock(current);
> mempolicy = current->mempolicy;
> switch (mempolicy->mode) {
> + case MPOL_PREFERRED_MANY:
> + *mask = mempolicy->v.preferred_nodes;
> + break;
> case MPOL_PREFERRED:
> if (mempolicy->flags & MPOL_F_LOCAL)
> nid = numa_node_id();
Same here
> @@ -2126,6 +2144,9 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> * nodes in mask.
> */
> break;
> + case MPOL_PREFERRED_MANY:
> + ret = nodes_intersects(mempolicy->v.preferred_nodes, *mask);
> + break;
I do not think this is a correct behavior. A preferred policy, whether it
is a single node or a nodemask, is a hint, not a requirement. So we
should always treat it as intersecting. I do understand that the naming
can be confusing, because an intersect operation should indeed check
nodemasks, but this is yet another trap of the mempolicy code. It is
only used for the OOM selection.
Btw. the code is wrong for INTERLEAVE as well, because it uses the
interleaving node only as a hint. It is not bound by the interleave
nodemask. Sigh...
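In other words, for OOM selection the preferred modes could simply share the
existing MPOL_PREFERRED branch (sketch only, leaving ret at its default of
true):

	case MPOL_PREFERRED:
	case MPOL_PREFERRED_MANY:
		/*
		 * Only a preference; the task may have fallen back to and
		 * allocated from any node, so treat it as intersecting.
		 */
		break;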
> case MPOL_BIND:
> case MPOL_INTERLEAVE:
> ret = nodes_intersects(mempolicy->v.nodes, *mask);
[...]
> @@ -2349,6 +2373,9 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
> case MPOL_BIND:
> case MPOL_INTERLEAVE:
> return !!nodes_equal(a->v.nodes, b->v.nodes);
> + case MPOL_PREFERRED_MANY:
> + return !!nodes_equal(a->v.preferred_nodes,
> + b->v.preferred_nodes);
Again different from MPOL_PREFERRED...
> case MPOL_PREFERRED:
> /* a's ->flags is the same as b's */
> if (a->flags & MPOL_F_LOCAL)
> @@ -2523,6 +2550,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
> polnid = zone_to_nid(z->zone);
> break;
>
> + /* case MPOL_PREFERRED_MANY: */
> +
I hope a follow-up patch will make this not panic, but as you are
already plumbing everything in, it should really be as simple as a
node_isset check.
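Roughly, mirroring the existing MPOL_BIND handling (untested sketch; names
follow mpol_misplaced() and the preferred_nodes field from this series):

	case MPOL_PREFERRED_MANY:
		/* page already sits on one of the preferred nodes */
		if (node_isset(curnid, pol->v.preferred_nodes))
			goto out;
		polnid = first_node(pol->v.preferred_nodes);
		break;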
> default:
> BUG();
Besides that, this should really go!
> @@ -3035,6 +3066,9 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
> switch (mode) {
> case MPOL_DEFAULT:
> break;
> + case MPOL_PREFERRED_MANY:
> + WARN_ON(flags & MPOL_F_LOCAL);
Why WARN_ON here?
> + fallthrough;
> case MPOL_PREFERRED:
> if (flags & MPOL_F_LOCAL)
> mode = MPOL_LOCAL;
> --
> 2.7.4
--
Michal Hocko
SUSE Labs
On Wed 17-03-21 11:40:01, Feng Tang wrote:
> From: Dave Hansen <[email protected]>
>
> Create a helper function (mpol_new_preferred_many()) which is usable
> both by the old, single-node MPOL_PREFERRED and the new
> MPOL_PREFERRED_MANY.
>
> Enforce the old single-node MPOL_PREFERRED behavior in the "new"
> version of mpol_new_preferred() which calls mpol_new_preferred_many().
>
> v3:
> * fix a stack overflow caused by empty nodemask (Feng)
>
> Link: https://lore.kernel.org/r/[email protected]
> Signed-off-by: Dave Hansen <[email protected]>
> Signed-off-by: Ben Widawsky <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>
> ---
> mm/mempolicy.c | 21 +++++++++++++++++++--
> 1 file changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 1228d8e..6fb2cab 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -203,17 +203,34 @@ static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes)
> return 0;
> }
>
> -static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
> +static int mpol_new_preferred_many(struct mempolicy *pol,
> + const nodemask_t *nodes)
> {
> if (!nodes)
> pol->flags |= MPOL_F_LOCAL; /* local allocation */
Now you have confused me. I thought that MPOL_PREFERRED_MANY for NULL
nodemask will be disallowed as it is effectively MPOL_PREFERRED aka
MPOL_F_LOCAL. Or do I misread the code?
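If NULL really is meant to be disallowed for the new mode, the create hook
could just reject it outright, e.g. (sketch only):

	static int mpol_new_preferred_many(struct mempolicy *pol,
					   const nodemask_t *nodes)
	{
		/* an absent or empty preferred set makes no sense here */
		if (!nodes || nodes_empty(*nodes))
			return -EINVAL;
		pol->v.preferred_nodes = *nodes;
		return 0;
	}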
> else if (nodes_empty(*nodes))
> return -EINVAL; /* no allowed nodes */
> else
> - pol->v.preferred_nodes = nodemask_of_node(first_node(*nodes));
> + pol->v.preferred_nodes = *nodes;
> return 0;
> }
>
> +static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
> +{
> + if (nodes) {
> + /* MPOL_PREFERRED can only take a single node: */
> + nodemask_t tmp;
> +
> + if (nodes_empty(*nodes))
> + return -EINVAL;
> +
> + tmp = nodemask_of_node(first_node(*nodes));
> + return mpol_new_preferred_many(pol, &tmp);
> + }
> +
> + return mpol_new_preferred_many(pol, NULL);
> +}
> +
> static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
> {
> if (nodes_empty(*nodes))
> --
> 2.7.4
--
Michal Hocko
SUSE Labs
On Wed 17-03-21 11:40:04, Feng Tang wrote:
> From: Ben Widawsky <[email protected]>
>
> Begin the real plumbing for handling this new policy. Now that the
> internal representation for preferred nodes and bound nodes is the same,
> and we can envision what multiple preferred nodes will behave like,
> there are obvious places where we can simply reuse the bind behavior.
>
> In v1 of this series, the moral equivalent was:
> "mm: Finish handling MPOL_PREFERRED_MANY". Like that, this attempts to
> implement the easiest spots for the new policy. Unlike that, this just
> reuses BIND.
No, this is a big step back. I think we really want to treat this as
PREFERRED. It doesn't have much to do with the BIND semantic at all.
At this stage there should be 2 things remaining - syscall plumbing and
the 2-pass allocation request (optimistic allocation restricted to the
preferred nodes, then a fallback to all nodes).
--
Michal Hocko
SUSE Labs
Please use hugetlb prefix to make it explicit that this is hugetlb
related.
On Wed 17-03-21 11:40:08, Feng Tang wrote:
> From: Ben Widawsky <[email protected]>
>
> Implement the missing huge page allocation functionality while obeying
> the preferred node semantics.
>
> This uses a fallback mechanism to try multiple preferred nodes first,
> and then all other nodes. It cannot use the helper function that was
> introduced earlier, because huge page allocation already has its own
> helpers and consolidating them would have been more code and effort.
>
> The weirdness is MPOL_PREFERRED_MANY can't be called yet because it is
> part of the UAPI we haven't yet exposed. Instead of making that define
> global, it's simply changed with the UAPI patch.
>
> [ feng: add NOWARN flag, and skip the direct reclaim to speed up
> allocation in some cases ]
>
> Link: https://lore.kernel.org/r/[email protected]
> Signed-off-by: Ben Widawsky <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>
> ---
> mm/hugetlb.c | 26 +++++++++++++++++++++++---
> mm/mempolicy.c | 3 ++-
> 2 files changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8fb42c6..9dfbfa3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1105,7 +1105,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> unsigned long address, int avoid_reserve,
> long chg)
> {
> - struct page *page;
> + struct page *page = NULL;
> struct mempolicy *mpol;
> gfp_t gfp_mask;
> nodemask_t *nodemask;
> @@ -1126,7 +1126,17 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
>
> gfp_mask = htlb_alloc_mask(h);
> nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
> - page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
> + if (mpol->mode != MPOL_BIND && nodemask) { /* AKA MPOL_PREFERRED_MANY */
Please use MPOL_PREFERRED_MANY explicitly here.
> + gfp_t gfp_mask1 = gfp_mask | __GFP_NOWARN;
> +
> + gfp_mask1 &= ~__GFP_DIRECT_RECLAIM;
> + page = dequeue_huge_page_nodemask(h,
> + gfp_mask1, nid, nodemask);
> + if (!page)
> + page = dequeue_huge_page_nodemask(h, gfp_mask, nid, NULL);
> + } else {
> + page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
> + }
> if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
> SetHPageRestoreReserve(page);
> h->resv_huge_pages--;
__GFP_DIRECT_RECLAIM handling is not needed here. dequeue_huge_page_nodemask
only uses the gfp mask to get the zone and cpuset constraints. So the above
should have simply been
if (mpol->mode == MPOL_PREFERRED_MANY) {
page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
if (page)
goto got_page;
/* fallback to all nodes */
nodemask = NULL;
}
page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
got_page:
if (page ...)
> @@ -1883,7 +1893,17 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
> nodemask_t *nodemask;
>
> nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
> - page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
> + if (mpol->mode != MPOL_BIND && nodemask) { /* AKA MPOL_PREFERRED_MANY */
> + gfp_t gfp_mask1 = gfp_mask | __GFP_NOWARN;
> +
> + gfp_mask1 &= ~__GFP_DIRECT_RECLAIM;
> + page = alloc_surplus_huge_page(h,
> + gfp_mask1, nid, nodemask);
> + if (!page)
> +			page = alloc_surplus_huge_page(h, gfp_mask, nid, NULL);
> + } else {
> + page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
> + }
And here similar
if (mpol->mode == MPOL_PREFERRED_MANY) {
page = alloc_surplus_huge_page(h, (gfp_mask | __GFP_NOWARN) & ~(__GFP_DIRECT_RECLAIM), nodemask);
if (page)
goto got_page;
/* fallback to all nodes */
nodemask = NULL;
}
page = alloc_surplus_huge_page(h, gfp_mask, nodemask);
got_page:
> mpol_cond_put(mpol);
You can have a dedicated gfp mask here if you prefer of course, but
calling out MPOL_PREFERRED_MANY explicitly will make the code easier to
read.
> return page;
--
Michal Hocko
SUSE Labs
On Wed 17-03-21 11:39:59, Feng Tang wrote:
> From: Dave Hansen <[email protected]>
>
> The NUMA APIs currently allow passing in a "preferred node" as a
> single bit set in a nodemask. If more than one bit it set, bits
> after the first are ignored. Internally, this is implemented as
> a single integer: mempolicy->preferred_node.
>
> This single node is generally OK for location-based NUMA where
> memory being allocated will eventually be operated on by a single
> CPU. However, in systems with multiple memory types, folks want
> to target a *type* of memory instead of a location. For instance,
> someone might want some high-bandwidth memory but not care about
> the CPU next to which it is allocated. Or, they want a cheap,
> high capacity allocation and want to target all NUMA nodes which
> have persistent memory in volatile mode. In both of these cases,
> the application wants to target a *set* of nodes, but does not
> want strict MPOL_BIND behavior as that could lead to OOM killer or
> SIGSEGV.
>
> To get that behavior, a MPOL_PREFERRED mode is desirable, but one
> that honors multiple nodes to be set in the nodemask.
>
> The first step in that direction is to be able to internally store
> multiple preferred nodes, which is implemented in this patch.
>
> This should not have any functional changes and just switches the
> internal representation of mempolicy->preferred_node from an
> integer to a nodemask called 'mempolicy->preferred_nodes'.
>
> This is not a pie-in-the-sky dream for an API. This was a response to a
> specific ask of more than one group at Intel. Specifically:
>
> 1. There are existing libraries that target memory types such as
> https://github.com/memkind/memkind. These are known to suffer
> from SIGSEGV's when memory is low on targeted memory "kinds" that
> span more than one node. The MCDRAM on a Xeon Phi in "Cluster on
> Die" mode is an example of this.
> 2. Volatile-use persistent memory users want to have a memory policy
> which is targeted at either "cheap and slow" (PMEM) or "expensive and
> fast" (DRAM). However, they do not want to experience allocation
> failures when the targeted type is unavailable.
> 3. Allocate-then-run. Generally, we let the process scheduler decide
> on which physical CPU to run a task. That location provides a
> default allocation policy, and memory availability is not generally
> considered when placing tasks. For situations where memory is
> valuable and constrained, some users want to allocate memory first,
> *then* allocate close compute resources to the allocation. This is
> the reverse of the normal (CPU) model. Accelerators such as GPUs
> that operate on core-mm-managed memory are interested in this model.
This is a very useful background for the feature. The changelog for the
specific patch is rather modest and it would help to add more details
about the change. The mempolicy code is a maze and it is quite easy to
get lost there. I hope we are not going to miss something just by hunting
preferred_node usage...
[...]
> @@ -345,22 +345,26 @@ static void mpol_rebind_preferred(struct mempolicy *pol,
> const nodemask_t *nodes)
> {
> nodemask_t tmp;
> + nodemask_t preferred_node;
This is rather harsh. Some distribution kernels use a high NODES_SHIFT
(SLES has 10 for x86) so this will consume an additional 1K on the stack.
Unless I am missing something this shouldn't be called in deep call
chains, but still.
> +
> + /* MPOL_PREFERRED uses only the first node in the mask */
> + preferred_node = nodemask_of_node(first_node(*nodes));
>
> if (pol->flags & MPOL_F_STATIC_NODES) {
> int node = first_node(pol->w.user_nodemask);
>
> if (node_isset(node, *nodes)) {
> - pol->v.preferred_node = node;
> + pol->v.preferred_nodes = nodemask_of_node(node);
> pol->flags &= ~MPOL_F_LOCAL;
> } else
> pol->flags |= MPOL_F_LOCAL;
> } else if (pol->flags & MPOL_F_RELATIVE_NODES) {
> mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
> - pol->v.preferred_node = first_node(tmp);
> + pol->v.preferred_nodes = tmp;
> } else if (!(pol->flags & MPOL_F_LOCAL)) {
> - pol->v.preferred_node = node_remap(pol->v.preferred_node,
> - pol->w.cpuset_mems_allowed,
> - *nodes);
> + nodes_remap(tmp, pol->v.preferred_nodes,
> + pol->w.cpuset_mems_allowed, preferred_node);
> + pol->v.preferred_nodes = tmp;
> pol->w.cpuset_mems_allowed = *nodes;
> }
I have to say that I really disliked the original code (because it
fiddles with user-provided input behind the back), but I got lost here
completely. What the heck is going on?
a) why do we even care about remapping a hint which is overridden by the
cpuset at the page allocator level, and b) why do we need to allocate _two_
potentially large temporary bitmaps for that here?
I haven't spotted anything unexpected in the rest.
--
Michal Hocko
SUSE Labs
On Wed 17-03-21 11:40:02, Feng Tang wrote:
> From: Dave Hansen <[email protected]>
>
> Again, this extracts the "only one node must be set" behavior of
> MPOL_PREFERRED. It retains virtually all of the existing code so it can
> be used by MPOL_PREFERRED_MANY as well.
>
> v2:
> Fixed typos in commit message. (Ben)
> Merged bits from other patches. (Ben)
> annotate mpol_rebind_preferred_many as unused (Ben)
I am giving up on the rebinding code for now until we clarify that in my
earlier email.
--
Michal Hocko
SUSE Labs
On Wed 17-03-21 11:40:03, Feng Tang wrote:
> From: Ben Widawsky <[email protected]>
>
> Now that preferred_nodes is just a mask, and policies are mutually
> exclusive, there is no reason to have a separate mask.
>
> This patch is optional. It definitely helps clean up code in future
> patches, but there is no functional difference to leaving it with the
> previous name. I do believe it helps demonstrate the exclusivity of the
> fields.
Yeah, let's just do it after the whole thing is merged. The separation
helps a bit to review the code at this stage because it is so much
easier to grep for preferred_nodes than nodes.
--
Michal Hocko
SUSE Labs
On Wed 17-03-21 11:40:05, Feng Tang wrote:
> From: Ben Widawsky <[email protected]>
>
> Add a helper function which takes care of handling multiple preferred
> nodes. It will be called by future patches that need to handle this,
> specifically VMA based page allocation, and task based page allocation.
> Huge pages don't quite fit the same pattern because they use different
> underlying page allocation functions. This consumes the previous
> interleave policy specific allocation function to make a one stop shop
> for policy based allocation.
>
> With this, MPOL_PREFERRED_MANY's semantics are more like MPOL_PREFERRED's,
> in that it will first try the preferred node/nodes, and fall back to all
> other nodes when the first try fails. Thanks to Michal Hocko for suggestions
> on this.
>
> For now, only the interleave policy will use it, so there should be no
> functional change yet. However, if bisection points to issues in the
> next few commits, this patch was likely at fault.
I am not sure this is helping much. Let's see in later patches but I
would keep them separate and rather create a dedicated function for the
new policy allocation mode.
> Similar functionality is offered via policy_node() and
> policy_nodemask(). By themselves however, neither can achieve this
> fallback style of sets of nodes.
>
> [ Feng: for the first try, add NOWARN flag, and skip the direct reclaim
> to speed up allocation in some cases ]
>
> Link: https://lore.kernel.org/r/[email protected]
> Signed-off-by: Ben Widawsky <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>
> ---
> mm/mempolicy.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++------------
> 1 file changed, 52 insertions(+), 13 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index d945f29..d21105b 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2187,22 +2187,60 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> return ret;
> }
>
> -/* Allocate a page in interleaved policy.
> - Own path because it needs to do special accounting. */
> -static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
> - unsigned nid)
> +/* Handle page allocation for all but interleaved policies */
> +static struct page *alloc_pages_policy(struct mempolicy *pol, gfp_t gfp,
> + unsigned int order, int preferred_nid)
> {
> struct page *page;
> + gfp_t gfp_mask = gfp;
>
> - page = __alloc_pages(gfp, order, nid);
> - /* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */
> - if (!static_branch_likely(&vm_numa_stat_key))
> + if (pol->mode == MPOL_INTERLEAVE) {
> + page = __alloc_pages(gfp, order, preferred_nid);
> + /* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */
> + if (!static_branch_likely(&vm_numa_stat_key))
> + return page;
> + if (page && page_to_nid(page) == preferred_nid) {
> + preempt_disable();
> + __inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
> + preempt_enable();
> + }
> return page;
> - if (page && page_to_nid(page) == nid) {
> - preempt_disable();
> - __inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
> - preempt_enable();
> }
> +
> + VM_BUG_ON(preferred_nid != NUMA_NO_NODE);
> +
> + preferred_nid = numa_node_id();
> +
> + /*
> + * There is a two pass approach implemented here for
> + * MPOL_PREFERRED_MANY. In the first pass we try the preferred nodes
> + * but allow the allocation to fail. The below table explains how
> + * this is achieved.
> + *
> + * | Policy | preferred nid | nodemask |
> + * |-------------------------------|---------------|------------|
> + * | MPOL_DEFAULT | local | NULL |
> + * | MPOL_PREFERRED | best | NULL |
> + * | MPOL_INTERLEAVE | ERR | ERR |
> + * | MPOL_BIND | local | pol->nodes |
> + * | MPOL_PREFERRED_MANY | best | pol->nodes |
> + * | MPOL_PREFERRED_MANY (round 2) | local | NULL |
> + * +-------------------------------+---------------+------------+
> + */
> + if (pol->mode == MPOL_PREFERRED_MANY) {
> + gfp_mask |= __GFP_NOWARN;
> +
> + /* Skip direct reclaim, as there will be a second try */
> + gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> + }
> +
> + page = __alloc_pages_nodemask(gfp_mask, order,
> + policy_node(gfp, pol, preferred_nid),
> + policy_nodemask(gfp, pol));
> +
> + if (unlikely(!page && pol->mode == MPOL_PREFERRED_MANY))
> + page = __alloc_pages_nodemask(gfp, order, preferred_nid, NULL);
> +
> return page;
> }
>
> @@ -2244,8 +2282,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> unsigned nid;
>
> nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
> + page = alloc_pages_policy(pol, gfp, order, nid);
> mpol_cond_put(pol);
> - page = alloc_page_interleave(gfp, order, nid);
> goto out;
> }
>
> @@ -2329,7 +2367,8 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
> * nor system default_policy
> */
> if (pol->mode == MPOL_INTERLEAVE)
> - page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> + page = alloc_pages_policy(pol, gfp, order,
> + interleave_nodes(pol));
> else
> page = __alloc_pages_nodemask(gfp, order,
> policy_node(gfp, pol, numa_node_id()),
> --
> 2.7.4
--
Michal Hocko
SUSE Labs
On Wed 17-03-21 11:40:07, Feng Tang wrote:
[...]
> @@ -2301,10 +2300,26 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> * does not allow the current node in its nodemask, we allocate
> * the standard way.
> */
> - if ((pol->mode == MPOL_PREFERRED ||
> - pol->mode == MPOL_PREFERRED_MANY) &&
> - !(pol->flags & MPOL_F_LOCAL))
> + if (pol->mode == MPOL_PREFERRED || !(pol->flags & MPOL_F_LOCAL)) {
> hpage_node = first_node(pol->nodes);
> + } else if (pol->mode == MPOL_PREFERRED_MANY) {
> + struct zoneref *z;
> +
> + /*
> + * In this policy, with direct reclaim, the normal
> + * policy based allocation will do the right thing - try
> + * twice using the preferred nodes first, and all nodes
> + * second.
> + */
> + if (gfp & __GFP_DIRECT_RECLAIM) {
> + page = alloc_pages_policy(pol, gfp, order, NUMA_NO_NODE);
> + goto out;
> + }
> +
> + z = first_zones_zonelist(node_zonelist(numa_node_id(), GFP_HIGHUSER),
> + gfp_zone(GFP_HIGHUSER), &pol->nodes);
> + hpage_node = zone_to_nid(z->zone);
> + }
>
> nmask = policy_nodemask(gfp, pol);
> if (!nmask || node_isset(hpage_node, *nmask)) {
> @@ -2330,9 +2345,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> }
> }
>
> - nmask = policy_nodemask(gfp, pol);
> - preferred_nid = policy_node(gfp, pol, node);
> - page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
> + page = alloc_pages_policy(pol, gfp, order, NUMA_NO_NODE);
> mpol_cond_put(pol);
> out:
> return page;
OK, it took me a while to grasp this, but the code is a mess I have to
say. Not that it was an act of beauty before, but this just makes it much
harder to follow. And alloc_pages_policy doesn't really help here. I
would have expected that a dedicated alloc_pages_preferred and a
general fallback to __alloc_pages_nodemask would have been much easier
to follow.
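Something like the following shape, say (untested sketch; it just reuses the
two-pass idea from this patch with the 5.12-era __alloc_pages_nodemask()):

	static struct page *alloc_pages_preferred(struct mempolicy *pol, gfp_t gfp,
						  unsigned int order, int nid)
	{
		struct page *page;

		/* first pass: preferred nodes only, no reclaim, no warnings */
		page = __alloc_pages_nodemask((gfp | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM,
					      order, nid, &pol->nodes);
		if (!page)
			/* second pass: anywhere the task is allowed to allocate */
			page = __alloc_pages_nodemask(gfp, order, nid, NULL);

		return page;
	}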
--
Michal Hocko
SUSE Labs
Hi Michal,
Many thanks for reviewing the whole patchset! We will work through the comments.
On Wed, Apr 14, 2021 at 03:25:34PM +0200, Michal Hocko wrote:
> Please use hugetlb prefix to make it explicit that this is hugetlb
> related.
>
> On Wed 17-03-21 11:40:08, Feng Tang wrote:
> > From: Ben Widawsky <[email protected]>
> >
> > Implement the missing huge page allocation functionality while obeying
> > the preferred node semantics.
> >
> > This uses a fallback mechanism to try multiple preferred nodes first,
> > and then all other nodes. It cannot use the helper function that was
> > introduced earlier, because huge page allocation already has its own
> > helpers and consolidating them would have been more code and effort.
> >
> > The weirdness is MPOL_PREFERRED_MANY can't be called yet because it is
> > part of the UAPI we haven't yet exposed. Instead of making that define
> > global, it's simply changed with the UAPI patch.
> >
> > [ feng: add NOWARN flag, and skip the direct reclaim to speed up
> > allocation in some cases ]
> >
> > Link: https://lore.kernel.org/r/[email protected]
> > Signed-off-by: Ben Widawsky <[email protected]>
> > Signed-off-by: Feng Tang <[email protected]>
> > ---
> > mm/hugetlb.c | 26 +++++++++++++++++++++++---
> > mm/mempolicy.c | 3 ++-
> > 2 files changed, 25 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 8fb42c6..9dfbfa3 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1105,7 +1105,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> > unsigned long address, int avoid_reserve,
> > long chg)
> > {
> > - struct page *page;
> > + struct page *page = NULL;
> > struct mempolicy *mpol;
> > gfp_t gfp_mask;
> > nodemask_t *nodemask;
> > @@ -1126,7 +1126,17 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> >
> > gfp_mask = htlb_alloc_mask(h);
> > nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
> > - page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
> > + if (mpol->mode != MPOL_BIND && nodemask) { /* AKA MPOL_PREFERRED_MANY */
>
> Please use MPOL_PREFERRED_MANY explicitly here.
>
> > + gfp_t gfp_mask1 = gfp_mask | __GFP_NOWARN;
> > +
> > + gfp_mask1 &= ~__GFP_DIRECT_RECLAIM;
> > + page = dequeue_huge_page_nodemask(h,
> > + gfp_mask1, nid, nodemask);
> > + if (!page)
> > + page = dequeue_huge_page_nodemask(h, gfp_mask, nid, NULL);
> > + } else {
> > + page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
> > + }
> > if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
> > SetHPageRestoreReserve(page);
> > h->resv_huge_pages--;
>
> __GFP_DIRECT_RECLAIM handling is not needed here. dequeue_huge_page_nodemask
> only uses the gfp mask to get the zone and cpuset constraints. So the above
> should have simply been
> if (mpol->mode == MPOL_PREFERRED_MANY) {
> page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
> if (page)
> goto got_page;
> /* fallback to all nodes */
> nodemask = NULL;
> }
> page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
> got_page:
> if (page ...)
You are right, no need to change the gfp_mask here.
> > @@ -1883,7 +1893,17 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
> > nodemask_t *nodemask;
> >
> > nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
> > - page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
> > + if (mpol->mode != MPOL_BIND && nodemask) { /* AKA MPOL_PREFERRED_MANY */
> > + gfp_t gfp_mask1 = gfp_mask | __GFP_NOWARN;
> > +
> > + gfp_mask1 &= ~__GFP_DIRECT_RECLAIM;
> > + page = alloc_surplus_huge_page(h,
> > + gfp_mask1, nid, nodemask);
> > + if (!page)
> > +			page = alloc_surplus_huge_page(h, gfp_mask, nid, NULL);
> > + } else {
> > + page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
> > + }
>
> And here similar
> if (mpol->mode == MPOL_PREFERRED_MANY) {
> page = alloc_surplus_huge_page(h, (gfp_mask | __GFP_NOWARN) & ~(__GFP_DIRECT_RECLAIM), nodemask);
> if (page)
> goto got_page;
> /* fallback to all nodes */
> nodemask = NULL;
> }
> page = alloc_surplus_huge_page(h, gfp_mask, nodemask);
> got_page:
> > mpol_cond_put(mpol);
>
> You can have a dedicated gfp mask here if you prefer of course, but
> calling out MPOL_PREFERRED_MANY explicitly will make the code easier to
> read.
Will follow. The "if (mpol->mode != MPOL_BIND && nodemask) {
/* AKA MPOL_PREFERRED_MANY */" check and the "MPOL_MAX + 1" trick will be
replaced in the 12/13 patch.
Thanks,
Feng
> > return page;
> --
> Michal Hocko
> SUSE Labs
On Wed, Apr 14, 2021 at 03:08:19PM +0200, Michal Hocko wrote:
> On Wed 17-03-21 11:40:05, Feng Tang wrote:
> > From: Ben Widawsky <[email protected]>
> >
> > Add a helper function which takes care of handling multiple preferred
> > nodes. It will be called by future patches that need to handle this,
> > specifically VMA based page allocation, and task based page allocation.
> > Huge pages don't quite fit the same pattern because they use different
> > underlying page allocation functions. This consumes the previous
> > interleave policy specific allocation function to make a one stop shop
> > for policy based allocation.
> >
> > With this, MPOL_PREFERRED_MANY's semantics are more like MPOL_PREFERRED's,
> > in that it will first try the preferred node/nodes, and fall back to all
> > other nodes when the first try fails. Thanks to Michal Hocko for suggestions
> > on this.
> >
> > For now, only the interleave policy will use it, so there should be no
> > functional change yet. However, if bisection points to issues in the
> > next few commits, this patch was likely at fault.
>
> I am not sure this is helping much. Let's see in later patches but I
> would keep them separate and rather create a dedicated function for the
> new policy allocation mode.
Thanks for the suggestion, we will rethink the implementation.
- Feng
> > Similar functionality is offered via policy_node() and
> > policy_nodemask(). By themselves however, neither can achieve this
> > fallback style of sets of nodes.
> >
> > [ Feng: for the first try, add NOWARN flag, and skip the direct reclaim
> > to speed up allocation in some cases ]
> >
> > Link: https://lore.kernel.org/r/[email protected]
> > Signed-off-by: Ben Widawsky <[email protected]>
> > Signed-off-by: Feng Tang <[email protected]>
> > ---
> > mm/mempolicy.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++------------
> > 1 file changed, 52 insertions(+), 13 deletions(-)
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index d945f29..d21105b 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2187,22 +2187,60 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> > return ret;
> > }
> >
> > -/* Allocate a page in interleaved policy.
> > - Own path because it needs to do special accounting. */
> > -static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
> > - unsigned nid)
> > +/* Handle page allocation for all but interleaved policies */
> > +static struct page *alloc_pages_policy(struct mempolicy *pol, gfp_t gfp,
> > + unsigned int order, int preferred_nid)
> > {
> > struct page *page;
> > + gfp_t gfp_mask = gfp;
> >
> > - page = __alloc_pages(gfp, order, nid);
> > - /* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */
> > - if (!static_branch_likely(&vm_numa_stat_key))
> > + if (pol->mode == MPOL_INTERLEAVE) {
> > + page = __alloc_pages(gfp, order, preferred_nid);
> > + /* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */
> > + if (!static_branch_likely(&vm_numa_stat_key))
> > + return page;
> > + if (page && page_to_nid(page) == preferred_nid) {
> > + preempt_disable();
> > + __inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
> > + preempt_enable();
> > + }
> > return page;
> > - if (page && page_to_nid(page) == nid) {
> > - preempt_disable();
> > - __inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT);
> > - preempt_enable();
> > }
> > +
> > + VM_BUG_ON(preferred_nid != NUMA_NO_NODE);
> > +
> > + preferred_nid = numa_node_id();
> > +
> > + /*
> > + * There is a two pass approach implemented here for
> > + * MPOL_PREFERRED_MANY. In the first pass we try the preferred nodes
> > + * but allow the allocation to fail. The below table explains how
> > + * this is achieved.
> > + *
> > + * | Policy | preferred nid | nodemask |
> > + * |-------------------------------|---------------|------------|
> > + * | MPOL_DEFAULT | local | NULL |
> > + * | MPOL_PREFERRED | best | NULL |
> > + * | MPOL_INTERLEAVE | ERR | ERR |
> > + * | MPOL_BIND | local | pol->nodes |
> > + * | MPOL_PREFERRED_MANY | best | pol->nodes |
> > + * | MPOL_PREFERRED_MANY (round 2) | local | NULL |
> > + * +-------------------------------+---------------+------------+
> > + */
> > + if (pol->mode == MPOL_PREFERRED_MANY) {
> > + gfp_mask |= __GFP_NOWARN;
> > +
> > + /* Skip direct reclaim, as there will be a second try */
> > + gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> > + }
> > +
> > + page = __alloc_pages_nodemask(gfp_mask, order,
> > + policy_node(gfp, pol, preferred_nid),
> > + policy_nodemask(gfp, pol));
> > +
> > + if (unlikely(!page && pol->mode == MPOL_PREFERRED_MANY))
> > + page = __alloc_pages_nodemask(gfp, order, preferred_nid, NULL);
> > +
> > return page;
> > }
> >
> > @@ -2244,8 +2282,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> > unsigned nid;
> >
> > nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
> > + page = alloc_pages_policy(pol, gfp, order, nid);
> > mpol_cond_put(pol);
> > - page = alloc_page_interleave(gfp, order, nid);
> > goto out;
> > }
> >
> > @@ -2329,7 +2367,8 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
> > * nor system default_policy
> > */
> > if (pol->mode == MPOL_INTERLEAVE)
> > - page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> > + page = alloc_pages_policy(pol, gfp, order,
> > + interleave_nodes(pol));
> > else
> > page = __alloc_pages_nodemask(gfp, order,
> > policy_node(gfp, pol, numa_node_id()),
> > --
> > 2.7.4
>
> --
> Michal Hocko
> SUSE Labs
On Wed, Apr 14, 2021 at 02:55:39PM +0200, Michal Hocko wrote:
> On Wed 17-03-21 11:40:01, Feng Tang wrote:
> > From: Dave Hansen <[email protected]>
> >
> > Create a helper function (mpol_new_preferred_many()) which is usable
> > both by the old, single-node MPOL_PREFERRED and the new
> > MPOL_PREFERRED_MANY.
> >
> > Enforce the old single-node MPOL_PREFERRED behavior in the "new"
> > version of mpol_new_preferred() which calls mpol_new_preferred_many().
> >
> > v3:
> > * fix a stack overflow caused by empty nodemask (Feng)
> >
> > Link: https://lore.kernel.org/r/[email protected]
> > Signed-off-by: Dave Hansen <[email protected]>
> > Signed-off-by: Ben Widawsky <[email protected]>
> > Signed-off-by: Feng Tang <[email protected]>
> > ---
> > mm/mempolicy.c | 21 +++++++++++++++++++--
> > 1 file changed, 19 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 1228d8e..6fb2cab 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -203,17 +203,34 @@ static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes)
> > return 0;
> > }
> >
> > -static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
> > +static int mpol_new_preferred_many(struct mempolicy *pol,
> > + const nodemask_t *nodes)
> > {
> > if (!nodes)
> > pol->flags |= MPOL_F_LOCAL; /* local allocation */
>
> Now you have confused me. I thought that MPOL_PREFERRED_MANY for NULL
> nodemask will be disallowed as it is effectively MPOL_PREFERRED aka
> MPOL_F_LOCAL. Or do I misread the code?
I think you are right; with the current code, 'nodes' can't be NULL for
MPOL_PREFERRED_MANY. We'll revisit this.
And I have to admit that I am confused by the current logic for MPOL_PREFERRED,
where the nodemask parameter alternates between raw user input, an empty
nodemask, and NULL.
Maybe the following patch can make it clearer, as it doesn't play the
NULL-nodemask trick?
---
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index be160d4..9cabfca 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -200,12 +200,9 @@ static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes)
static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
{
- if (!nodes)
- pol->flags |= MPOL_F_LOCAL; /* local allocation */
- else if (nodes_empty(*nodes))
+ if (nodes_empty(*nodes))
return -EINVAL; /* no allowed nodes */
- else
- pol->v.preferred_node = first_node(*nodes);
+ pol->v.preferred_node = first_node(*nodes);
return 0;
}
@@ -239,9 +236,11 @@ static int mpol_set_nodemask(struct mempolicy *pol,
cpuset_current_mems_allowed, node_states[N_MEMORY]);
VM_BUG_ON(!nodes);
- if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
- nodes = NULL; /* explicit local allocation */
- else {
+ if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes)) {
+ /* explicit local allocation */
+ pol->flags |= MPOL_F_LOCAL;
+ return 0;
+ } else {
if (pol->flags & MPOL_F_RELATIVE_NODES)
mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
else
@@ -254,10 +253,7 @@ static int mpol_set_nodemask(struct mempolicy *pol,
cpuset_current_mems_allowed;
}
- if (nodes)
- ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
- else
- ret = mpol_ops[pol->mode].create(pol, NULL);
+ ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
return ret;
}
Thanks,
Feng
On Wed, Apr 14, 2021 at 02:50:53PM +0200, Michal Hocko wrote:
> On Wed 17-03-21 11:40:00, Feng Tang wrote:
> > From: Dave Hansen <[email protected]>
> >
> > MPOL_PREFERRED honors only a single node set in the nodemask. Add the
> > bare define for a new mode which will allow more than one.
> >
> > The patch does all the plumbing without actually adding the new policy
> > type.
> >
> > v2:
> > Plumb most MPOL_PREFERRED_MANY without exposing UAPI (Ben)
> > Fixes for checkpatch (Ben)
> >
> > Link: https://lore.kernel.org/r/[email protected]
> > Co-developed-by: Ben Widawsky <[email protected]>
> > Signed-off-by: Ben Widawsky <[email protected]>
> > Signed-off-by: Dave Hansen <[email protected]>
> > Signed-off-by: Feng Tang <[email protected]>
> > ---
> > mm/mempolicy.c | 46 ++++++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 40 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 2b1e0e4..1228d8e 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -31,6 +31,9 @@
> > * but useful to set in a VMA when you have a non default
> > * process policy.
> > *
> > + * preferred many Try a set of nodes first before normal fallback. This is
> > + * similar to preferred without the special case.
> > + *
> > * default Allocate on the local node first, or when on a VMA
> > * use the process policy. This is what Linux always did
> > * in a NUMA aware kernel and still does by, ahem, default.
> > @@ -105,6 +108,8 @@
> >
> > #include "internal.h"
> >
> > +#define MPOL_PREFERRED_MANY MPOL_MAX
> > +
> > /* Internal flags */
> > #define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0) /* Skip checks for continuous vmas */
> > #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1) /* Invert check for nodemask */
> > @@ -175,7 +180,7 @@ struct mempolicy *get_task_policy(struct task_struct *p)
> > static const struct mempolicy_operations {
> > int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
> > void (*rebind)(struct mempolicy *pol, const nodemask_t *nodes);
> > -} mpol_ops[MPOL_MAX];
> > +} mpol_ops[MPOL_MAX + 1];
> >
> > static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
> > {
> > @@ -415,7 +420,7 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
> > mmap_write_unlock(mm);
> > }
> >
> > -static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
> > +static const struct mempolicy_operations mpol_ops[MPOL_MAX + 1] = {
> > [MPOL_DEFAULT] = {
> > .rebind = mpol_rebind_default,
> > },
> > @@ -432,6 +437,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
> > .rebind = mpol_rebind_nodemask,
> > },
> > /* [MPOL_LOCAL] - see mpol_new() */
> > + [MPOL_PREFERRED_MANY] = {
> > + .create = NULL,
> > + .rebind = NULL,
> > + },
> > };
>
> I do get that you wanted to keep MPOL_PREFERRED_MANY inaccessible to
> userspace, but wouldn't it be much easier to simply check in the two
> syscall entries rather than playing these MAX+1 games, which make the
> review more complicated than necessary?
I will check this approach; currently the user input parameter
handling is quite complex.
Also, the sanity checks in kernel_mbind() and kernel_set_mempolicy()
are almost identical and could be unified.
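A shared check could look roughly like this (the helper name
sanitize_mpol_flags() is illustrative, not existing code in this series):

static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
{
	/* Split the user-supplied mode word into mode and mode flags */
	*flags = *mode & MPOL_MODE_FLAGS;
	*mode &= ~MPOL_MODE_FLAGS;

	if ((unsigned int)(*mode) >= MPOL_MAX)
		return -EINVAL;
	/* STATIC and RELATIVE node flags are mutually exclusive */
	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
		return -EINVAL;
	return 0;
}

Both kernel_mbind() and kernel_set_mempolicy() could then start by calling
it (copying the syscall's mode argument into a local int where the types
differ) instead of open-coding the same mode/flags validation twice.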
> >
> > static int migrate_page_add(struct page *page, struct list_head *pagelist,
> > @@ -924,6 +933,9 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
> > case MPOL_INTERLEAVE:
> > *nodes = p->v.nodes;
> > break;
> > + case MPOL_PREFERRED_MANY:
> > + *nodes = p->v.preferred_nodes;
> > + break;
> > case MPOL_PREFERRED:
> > if (!(p->flags & MPOL_F_LOCAL))
> > *nodes = p->v.preferred_nodes;
>
> Why do those two do a slightly different thing? Is this because unlike
> MPOL_PREFERRED it can never have MPOL_F_LOCAL cleared? If that is the
> case I would still stick the two together and use the same code for
> both to make the code easier to follow. Now that both use the same
> nodemask it should really be just about syscall input sanitization and
> keeping the original behavior for MPOL_PREFERRED.
>
> [...]
Our intention is to make MPOL_PREFERRED_MANY similar to
MPOL_PREFERRED, except that it prefers multiple nodes. We will try to
achieve this in the next version.
Also, for MPOL_LOCAL and MPOL_PREFERRED, the current code turns
'MPOL_LOCAL' into 'MPOL_PREFERRED' with MPOL_F_LOCAL set.
I don't understand why it isn't done the other way around, turning
MPOL_PREFERRED with an empty nodemask into MPOL_LOCAL, which looks
more logical.
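In other words, something roughly like this in mpol_new() (a sketch of the
direction only; the 'local' patch later in this thread takes essentially
this approach):

	if (mode == MPOL_PREFERRED && nodes_empty(*nodes)) {
		/* node-remapping flags make no sense without nodes */
		if (flags & (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES))
			return ERR_PTR(-EINVAL);

		/* 'prefer' with an empty nodemask is just 'local' */
		mode = MPOL_LOCAL;
	}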
Thanks,
Feng
> > @@ -2072,6 +2087,9 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
> > task_lock(current);
> > mempolicy = current->mempolicy;
> > switch (mempolicy->mode) {
> > + case MPOL_PREFERRED_MANY:
> > + *mask = mempolicy->v.preferred_nodes;
> > + break;
> > case MPOL_PREFERRED:
> > if (mempolicy->flags & MPOL_F_LOCAL)
> > nid = numa_node_id();
>
> Same here
mempolicy: kill MPOL_F_LOCAL bit
Now the only remaining case of an actual 'local' policy faked by a
'prefer' policy plus the MPOL_F_LOCAL bit is:
A valid 'prefer' policy with a valid 'preferred' node goes through a
'rebind' to a nodemask which doesn't contain the 'preferred' node;
allocation is then handled with the 'local' policy.
Add a new 'MPOL_F_LOCAL_TEMP' bit for this case, and kill the
MPOL_F_LOCAL bit, which simplifies the code considerably.
Signed-off-by: Feng Tang <[email protected]>
---
include/uapi/linux/mempolicy.h | 1 +
mm/mempolicy.c | 80 +++++++++++++++++++++++-------------------
2 files changed, 45 insertions(+), 36 deletions(-)
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 4832fd0..2f71177 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -63,6 +63,7 @@ enum {
#define MPOL_F_LOCAL (1 << 1) /* preferred local allocation */
#define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */
#define MPOL_F_MORON (1 << 4) /* Migrate On protnone Reference On Node */
+#define MPOL_F_LOCAL_TEMP (1 << 5) /* MPOL_PREFERRED policy temporarily change to MPOL_LOCAL */
/*
* These bit locations are exposed in the vm.zone_reclaim_mode sysctl
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2f20f079..9cdbb78 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -332,6 +332,22 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
pol->v.nodes = tmp;
}
+static void mpol_rebind_local(struct mempolicy *pol,
+ const nodemask_t *nodes)
+{
+ if (unlikely(pol->flags & MPOL_F_STATIC_NODES)) {
+ int node = first_node(pol->w.user_nodemask);
+
+ BUG_ON(!(pol->flags & MPOL_F_LOCAL_TEMP));
+
+ if (node_isset(node, *nodes)) {
+ pol->v.preferred_node = node;
+ pol->mode = MPOL_PREFERRED;
+ pol->flags &= ~MPOL_F_LOCAL_TEMP;
+ }
+ }
+}
+
static void mpol_rebind_preferred(struct mempolicy *pol,
const nodemask_t *nodes)
{
@@ -342,13 +358,19 @@ static void mpol_rebind_preferred(struct mempolicy *pol,
if (node_isset(node, *nodes)) {
pol->v.preferred_node = node;
- pol->flags &= ~MPOL_F_LOCAL;
- } else
- pol->flags |= MPOL_F_LOCAL;
+ } else {
+ /*
+ * If there is no valid node, change the mode to
+ * MPOL_LOCAL, which will be restored back when
+ * next rebind() see a valid node.
+ */
+ pol->mode = MPOL_LOCAL;
+ pol->flags |= MPOL_F_LOCAL_TEMP;
+ }
} else if (pol->flags & MPOL_F_RELATIVE_NODES) {
mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
pol->v.preferred_node = first_node(tmp);
- } else if (!(pol->flags & MPOL_F_LOCAL)) {
+ } else {
pol->v.preferred_node = node_remap(pol->v.preferred_node,
pol->w.cpuset_mems_allowed,
*nodes);
@@ -367,7 +389,7 @@ static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask)
{
if (!pol)
return;
- if (!mpol_store_user_nodemask(pol) && !(pol->flags & MPOL_F_LOCAL) &&
+ if (!mpol_store_user_nodemask(pol) &&
nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
return;
@@ -419,7 +441,7 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
.rebind = mpol_rebind_nodemask,
},
[MPOL_LOCAL] = {
- .rebind = mpol_rebind_default,
+ .rebind = mpol_rebind_local,
},
};
@@ -913,10 +935,12 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
case MPOL_INTERLEAVE:
*nodes = p->v.nodes;
break;
+ case MPOL_LOCAL:
+ /* return empty node mask for local allocation */
+ break;
+
case MPOL_PREFERRED:
- if (!(p->flags & MPOL_F_LOCAL))
- node_set(p->v.preferred_node, *nodes);
- /* else return empty node mask for local allocation */
+ node_set(p->v.preferred_node, *nodes);
break;
default:
BUG();
@@ -1888,9 +1912,9 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
/* Return the node id preferred by the given mempolicy, or the given id */
static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
{
- if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
+ if (policy->mode == MPOL_PREFERRED) {
nd = policy->v.preferred_node;
- else {
+ } else {
/*
* __GFP_THISNODE shouldn't even be used with the bind policy
* because we might easily break the expectation to stay on the
@@ -1927,14 +1951,11 @@ unsigned int mempolicy_slab_node(void)
return node;
policy = current->mempolicy;
- if (!policy || policy->flags & MPOL_F_LOCAL)
+ if (!policy)
return node;
switch (policy->mode) {
case MPOL_PREFERRED:
- /*
- * handled MPOL_F_LOCAL above
- */
return policy->v.preferred_node;
case MPOL_INTERLEAVE:
@@ -2068,16 +2089,13 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
mempolicy = current->mempolicy;
switch (mempolicy->mode) {
case MPOL_PREFERRED:
- if (mempolicy->flags & MPOL_F_LOCAL)
- nid = numa_node_id();
- else
- nid = mempolicy->v.preferred_node;
+ nid = mempolicy->v.preferred_node;
init_nodemask_of_node(mask, nid);
break;
case MPOL_BIND:
case MPOL_INTERLEAVE:
- *mask = mempolicy->v.nodes;
+ *mask = mempolicy->v.nodes;
break;
case MPOL_LOCAL:
@@ -2119,8 +2137,9 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
switch (mempolicy->mode) {
case MPOL_PREFERRED:
+ case MPOL_LOCAL:
/*
- * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to
+ * MPOL_PREFERRED and MPOL_LOCAL are only preferred nodes to
* allocate from, they may fallback to other nodes when oom.
* Thus, it's possible for tsk to have allocated memory from
* nodes in mask.
@@ -2205,7 +2224,7 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
* If the policy is interleave, or does not allow the current
* node in its nodemask, we allocate the standard way.
*/
- if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
+ if (pol->mode == MPOL_PREFERRED)
hpage_node = pol->v.preferred_node;
nmask = policy_nodemask(gfp, pol);
@@ -2341,9 +2360,6 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
case MPOL_INTERLEAVE:
return !!nodes_equal(a->v.nodes, b->v.nodes);
case MPOL_PREFERRED:
- /* a's ->flags is the same as b's */
- if (a->flags & MPOL_F_LOCAL)
- return true;
return a->v.preferred_node == b->v.preferred_node;
case MPOL_LOCAL:
return true;
@@ -2484,10 +2500,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
break;
case MPOL_PREFERRED:
- if (pol->flags & MPOL_F_LOCAL)
- polnid = numa_node_id();
- else
- polnid = pol->v.preferred_node;
+ polnid = pol->v.preferred_node;
break;
case MPOL_LOCAL:
@@ -2858,9 +2871,6 @@ void numa_default_policy(void)
* Parse and format mempolicy from/to strings
*/
-/*
- * "local" is implemented internally by MPOL_PREFERRED with MPOL_F_LOCAL flag.
- */
static const char * const policy_modes[] =
{
[MPOL_DEFAULT] = "default",
@@ -3027,12 +3037,10 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
switch (mode) {
case MPOL_DEFAULT:
+ case MPOL_LOCAL:
break;
case MPOL_PREFERRED:
- if (flags & MPOL_F_LOCAL)
- mode = MPOL_LOCAL;
- else
- node_set(pol->v.preferred_node, nodes);
+ node_set(pol->v.preferred_node, nodes);
break;
case MPOL_BIND:
case MPOL_INTERLEAVE:
--
2.7.4
mempolicy: don't handle MPOL_LOCAL as a fake MPOL_PREFERRED policy
MPOL_LOCAL is defined as a real policy mode, but it is still handled
internally as a fake MPOL_PREFERRED policy with the MPOL_F_LOCAL flag
bit set, and many places have to distinguish the real 'prefer' policy
from the 'local' one, which is quite confusing.
In the current code, there are four cases where MPOL_LOCAL is used:
* the user specifies the 'local' policy
* the user specifies the 'prefer' policy, but with an empty nodemask
* the system 'default' policy is used
* a 'prefer' policy with a valid 'preferred' node and the
  MPOL_F_STATIC_NODES flag set goes through a 'rebind' to a nodemask
  which doesn't contain the 'preferred' node; it then gets the
  MPOL_F_LOCAL bit and behaves as the 'local' policy. If it is later
  'rebind'-ed again with a valid nodemask, the policy is restored back
  to 'prefer'.
So for the first three cases, we make 'local' a real policy instead of
a fake 'prefer' one; this reduces confusion and makes it easier to
integrate our new 'prefer-many' policy.
The next, optional patch will kill the MPOL_F_LOCAL bit.
Signed-off-by: Feng Tang <[email protected]>
---
mm/mempolicy.c | 60 ++++++++++++++++++++++++++++++++--------------------------
1 file changed, 33 insertions(+), 27 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d79fa29..2f20f079 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -121,8 +121,7 @@ enum zone_type policy_zone = 0;
*/
static struct mempolicy default_policy = {
.refcnt = ATOMIC_INIT(1), /* never free it */
- .mode = MPOL_PREFERRED,
- .flags = MPOL_F_LOCAL,
+ .mode = MPOL_LOCAL,
};
static struct mempolicy preferred_node_policy[MAX_NUMNODES];
@@ -200,12 +199,9 @@ static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes)
static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
{
- if (!nodes)
- pol->flags |= MPOL_F_LOCAL; /* local allocation */
- else if (nodes_empty(*nodes))
- return -EINVAL; /* no allowed nodes */
- else
- pol->v.preferred_node = first_node(*nodes);
+ if (nodes_empty(*nodes))
+ return -EINVAL;
+ pol->v.preferred_node = first_node(*nodes);
return 0;
}
@@ -239,25 +235,19 @@ static int mpol_set_nodemask(struct mempolicy *pol,
cpuset_current_mems_allowed, node_states[N_MEMORY]);
VM_BUG_ON(!nodes);
- if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
- nodes = NULL; /* explicit local allocation */
- else {
- if (pol->flags & MPOL_F_RELATIVE_NODES)
- mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
- else
- nodes_and(nsc->mask2, *nodes, nsc->mask1);
- if (mpol_store_user_nodemask(pol))
- pol->w.user_nodemask = *nodes;
- else
- pol->w.cpuset_mems_allowed =
- cpuset_current_mems_allowed;
- }
+ if (pol->flags & MPOL_F_RELATIVE_NODES)
+ mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
+ else
+ nodes_and(nsc->mask2, *nodes, nsc->mask1);
- if (nodes)
- ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
+ if (mpol_store_user_nodemask(pol))
+ pol->w.user_nodemask = *nodes;
else
- ret = mpol_ops[pol->mode].create(pol, NULL);
+ pol->w.cpuset_mems_allowed =
+ cpuset_current_mems_allowed;
+
+ ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
return ret;
}
@@ -290,13 +280,14 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
if (((flags & MPOL_F_STATIC_NODES) ||
(flags & MPOL_F_RELATIVE_NODES)))
return ERR_PTR(-EINVAL);
+
+ mode = MPOL_LOCAL;
}
} else if (mode == MPOL_LOCAL) {
if (!nodes_empty(*nodes) ||
(flags & MPOL_F_STATIC_NODES) ||
(flags & MPOL_F_RELATIVE_NODES))
return ERR_PTR(-EINVAL);
- mode = MPOL_PREFERRED;
} else if (nodes_empty(*nodes))
return ERR_PTR(-EINVAL);
policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
@@ -427,6 +418,9 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
.create = mpol_new_bind,
.rebind = mpol_rebind_nodemask,
},
+ [MPOL_LOCAL] = {
+ .rebind = mpol_rebind_default,
+ },
};
static int migrate_page_add(struct page *page, struct list_head *pagelist,
@@ -1960,6 +1954,8 @@ unsigned int mempolicy_slab_node(void)
&policy->v.nodes);
return z->zone ? zone_to_nid(z->zone) : node;
}
+ case MPOL_LOCAL:
+ return node;
default:
BUG();
@@ -2084,6 +2080,11 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
*mask = mempolicy->v.nodes;
break;
+ case MPOL_LOCAL:
+ nid = numa_node_id();
+ init_nodemask_of_node(mask, nid);
+ break;
+
default:
BUG();
}
@@ -2344,6 +2345,8 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
if (a->flags & MPOL_F_LOCAL)
return true;
return a->v.preferred_node == b->v.preferred_node;
+ case MPOL_LOCAL:
+ return true;
default:
BUG();
return false;
@@ -2487,6 +2490,10 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
polnid = pol->v.preferred_node;
break;
+ case MPOL_LOCAL:
+ polnid = numa_node_id();
+ break;
+
case MPOL_BIND:
/* Optimize placement among multiple nodes via NUMA balancing */
if (pol->flags & MPOL_F_MORON) {
@@ -2931,7 +2938,6 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
*/
if (nodelist)
goto out;
- mode = MPOL_PREFERRED;
break;
case MPOL_DEFAULT:
/*
@@ -2975,7 +2981,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
else if (nodelist)
new->v.preferred_node = first_node(nodes);
else
- new->flags |= MPOL_F_LOCAL;
+ new->mode = MPOL_LOCAL;
/*
* Save nodes for contextualization: this will be used to "clone"
--
2.7.4
On 5/13/2021 12:25 AM, Feng Tang wrote:
> mempolicy: kill MPOL_F_LOCAL bit
>
> Now the only remaining case of an actual 'local' policy faked by a
> 'prefer' policy plus the MPOL_F_LOCAL bit is:
>
> A valid 'prefer' policy with a valid 'preferred' node goes through a
> 'rebind' to a nodemask which doesn't contain the 'preferred' node;
> allocation is then handled with the 'local' policy.
>
> Add a new 'MPOL_F_LOCAL_TEMP' bit for this case, and kill the
> MPOL_F_LOCAL bit, which simplifies the code considerably.
Reviewed-by: Andi Kleen <[email protected]>
-Andi