From: Ben Widawsky <[email protected]>
Add a new mode to the existing mempolicy modes, MPOL_PREFERRED_MANY.
MPOL_PREFERRED_MANY is documented in the in-tree admin-guide by this
patch. Eventually, the man pages for mbind(2), get_mempolicy(2),
set_mempolicy(2) and numactl(8) will also describe this mode, and those
will be the canonical reference.
NUMA systems continue to become more prevalent. New technologies like
PMEM make finer-grained control over memory access patterns increasingly
desirable. MPOL_PREFERRED_MANY allows userspace to specify a set of
nodes that will be tried first when performing allocations. If those
allocations fail, all remaining nodes will be tried. It's a
straightforward API which solves many of the presumptive needs of system
administrators wanting to optimize workloads on such machines. The mode
will work either per VMA or per thread.
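For illustration, here is a minimal userspace sketch of selecting this
mode for the calling thread via set_mempolicy(2). It is only a sketch:
MPOL_PREFERRED_MANY is defined by hand on the assumption that libc
headers do not yet carry it (the value matches the uapi enum added by
this series), and the node numbers are examples only.

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	#ifndef MPOL_PREFERRED_MANY
	#define MPOL_PREFERRED_MANY 5	/* assumed uapi value from this series */
	#endif

	int main(void)
	{
		/* Prefer nodes 0 and 2; all other nodes are fallback only. */
		unsigned long nodemask = (1UL << 0) | (1UL << 2);

		if (syscall(SYS_set_mempolicy, MPOL_PREFERRED_MANY,
			    &nodemask, 8 * sizeof(nodemask))) {
			perror("set_mempolicy");
			return 1;
		}
		/* Allocations by this thread now try nodes 0 and 2 first. */
		return 0;
	}

The same policy can be applied to an individual VMA with mbind(2).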
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Ben Widawsky <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
Documentation/admin-guide/mm/numa_memory_policy.rst | 16 ++++++++++++----
mm/mempolicy.c | 7 +------
2 files changed, 13 insertions(+), 10 deletions(-)
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 067a90a1499c..cd653561e531 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -245,6 +245,14 @@ MPOL_INTERLEAVED
address range or file. During system boot up, the temporary
interleaved system default policy works in this mode.
+MPOL_PREFERRED_MANY
+ This mode specifies that the allocation should be attempted from the
+ nodemask specified in the policy. If that allocation fails, the kernel
+ will search other nodes, in order of increasing distance from the first
+ set bit in the nodemask based on information provided by the platform
+ firmware. It is similar to MPOL_PREFERRED with the main exception that
+ it is an error to have an empty nodemask.
+
NUMA memory policy supports the following optional mode flags:
MPOL_F_STATIC_NODES
@@ -253,10 +261,10 @@ MPOL_F_STATIC_NODES
nodes changes after the memory policy has been defined.
Without this flag, any time a mempolicy is rebound because of a
- change in the set of allowed nodes, the node (Preferred) or
- nodemask (Bind, Interleave) is remapped to the new set of
- allowed nodes. This may result in nodes being used that were
- previously undesired.
+ change in the set of allowed nodes, the preferred nodemask (Preferred
+ Many), preferred node (Preferred) or nodemask (Bind, Interleave) is
+ remapped to the new set of allowed nodes. This may result in nodes
+ being used that were previously undesired.
With this flag, if the user-specified nodes overlap with the
nodes allowed by the task's cpuset, then the memory policy is
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 93f8789758a7..d90247d6a71b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1463,12 +1463,7 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
*flags = *mode & MPOL_MODE_FLAGS;
*mode &= ~MPOL_MODE_FLAGS;
- /*
- * The check should be 'mode >= MPOL_MAX', but as 'prefer_many'
- * is not fully implemented, don't permit it to be used for now,
- * and the logic will be restored in following patch
- */
- if ((unsigned int)(*mode) >= MPOL_PREFERRED_MANY)
+ if ((unsigned int)(*mode) >= MPOL_MAX)
return -EINVAL;
if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
return -EINVAL;
--
2.7.4
On Mon 12-07-21 16:09:33, Feng Tang wrote:
> [...]
> +MPOL_PREFERRED_MANY
> + This mode specifies that the allocation should be attempted from the
> + nodemask specified in the policy. If that allocation fails, the kernel
> + will search other nodes, in order of increasing distance from the first
> + set bit in the nodemask based on information provided by the platform
> + firmware. It is similar to MPOL_PREFERRED with the main exception that
> + it is an error to have an empty nodemask.
I believe the target audience of this document is users rather than
kernel developers, and for those the wording might be rather cryptic. I
would rephrase it like this:
	This mode specifies that the allocation should be preferably
	satisfied from the nodemask specified in the policy. If there is
	memory pressure on all nodes in the nodemask the allocation
	can fall back to all existing NUMA nodes. This is effectively
	MPOL_PREFERRED allowing a mask rather than a single node.
With that or something similar, feel free to add
Acked-by: Michal Hocko <[email protected]>
--
Michal Hocko
SUSE Labs
On Wed, Jul 28, 2021 at 02:47:23PM +0200, Michal Hocko wrote:
> On Mon 12-07-21 16:09:33, Feng Tang wrote:
> > [...]
>
> I believe the target audience of this document is users rather than
> kernel developers, and for those the wording might be rather cryptic. I
> would rephrase it like this:
>	This mode specifies that the allocation should be preferably
>	satisfied from the nodemask specified in the policy. If there is
>	memory pressure on all nodes in the nodemask the allocation
>	can fall back to all existing NUMA nodes. This is effectively
>	MPOL_PREFERRED allowing a mask rather than a single node.
>
> With that or something similar, feel free to add
> Acked-by: Michal Hocko <[email protected]>
Thanks!
Will revise the text as suggested.
- Feng
> --
> Michal Hocko
> SUSE Labs