2023-11-15 05:59:19

by Huang, Ying

Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control

Gregory Price <[email protected]> writes:

> On Tue, Nov 14, 2023 at 06:01:13PM +0100, Michal Hocko wrote:
>> On Tue 14-11-23 10:50:51, Gregory Price wrote:
>> > On Tue, Nov 14, 2023 at 10:43:13AM +0100, Michal Hocko wrote:
>> [...]
>> > > That being said, I still believe that a cgroup based interface is a much
>> > > better choice over a global one. Cpusets seem to be a good fit as the
>> > > controller does control memory placement wrt NUMA interfaces.
>> >
>> > I think cpusets is a non-starter due to the global spinlock required when
>> > reading information from it:
>> >
>> > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L391
>>
>> Right, our current cpuset implementation indeed requires callback lock
>> from the page allocator. But that is an implementation detail. I do not
>> remember bug reports about the lock being a bottleneck though. If
>> anything, cpuset lock optimizations would be a win also for users who do
>> not want to use the weighted interleave interface.
>
> Definitely agree, but that's a rather large increase of scope :[
>
> We could consider a push-model similar to how cpuset nodemasks are
> pushed down to mempolicies, rather than a pull-model of having
> mempolicy read directly from cpusets, at least until cpusets lock
> optimization is undertaken.
>
> This pattern looks like a wart to me, which is why I avoided it, but the
> locking implications on the pull-model make me sad.
>
> Would like to point out that Tejun pushed back on implementing weights
> in cgroups (regardless of subcomponent), so I think we need to come
> to a consensus on where this data should live in a "more global"
> context (cpusets, memcg, nodes, etc) before I go mucking around
> further.
>
> So far we have:
> * mempolicy: updating weights is a very complicated undertaking,
> and there is no (good) way to do this from outside the task.
> It would be better to have coarser-grained control.
>
> A new syscall is likely needed to add/set weights in the
> per-task mempolicy, or bite the bullet on set_mempolicy2
> and make the syscall extensible for the future.
>
> * memtiers: tier=node when devices are already interleaved or when all
> devices are different, so why add yet another layer of
> complexity if other constructs already exist. Additionally,
> you lose task-placement-relative weighting (or it becomes
> very complex to implement).

Because we usually have multiple nodes in one mem-tier, I still think a
mem-tier-based interface is simpler than a node-based one. But it seems more
complex to introduce mem-tiers into mempolicy, especially if we have
per-task weights. So I am fine with going with a node-based interface.

> * cgroups: "this doesn't involve dynamic resource accounting /
> enforcement at all" and "these aren't resource
> allocations, it's unclear what the hierarchical
> relationship mean".
>
> * node: too global, explore smaller scope first then expand.

Why is it too global? I understand that it doesn't cover all possible
use cases (although I don't know whether these use cases are practical
or not). But it can provide a reasonable default per-node weight based
on available node performance information (such as HMAT, CDAT, etc.).
And quite a few workloads can just use it. I think this is a useful
feature.

> For now I think there is consensus that mempolicy should have weights
> per-task regardless of how the more-global mechanism is defined, so I'll
> go ahead and put up another RFC for some options on that in the next
> week or so.
>
> The limitation of the first pass will be that only the task itself is
> capable of re-weighting should cpusets.mems or the nodemask change.

--
Best Regards,
Huang, Ying


2023-12-04 03:45:18

by Gregory Price

Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control

On Wed, Nov 15, 2023 at 01:56:53PM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
>
> Because we usually have multiple nodes in one mem-tier, I still think a
> mem-tier-based interface is simpler than a node-based one. But it seems more
> complex to introduce mem-tiers into mempolicy, especially if we have
> per-task weights. So I am fine with going with a node-based interface.
>
> > * cgroups: "this doesn't involve dynamic resource accounting /
> > enforcement at all" and "these aren't resource
> > allocations, it's unclear what the hierarchical
> > relationship mean".
> >
> > * node: too global, explore smaller scope first then expand.
>
> Why is it too global? I understand that it doesn't cover all possible
> use cases (although I don't know whether these use cases are practical
> or not). But it can provide a reasonable default per-node weight based
> on available node performance information (such as HMAT, CDAT, etc.).
> And quite a few workloads can just use it. I think this is a useful
> feature.
>

Have been sharing notes with more folks. Michal thinks a global set of
weights is unintuitive and not useful, and would prefer to see the
per-task weights first.

Though this may have been in response to adding it as an attribute of
nodes directly.

Another proposal here suggested adding a new sysfs setting
https://github.com/skhynix/linux/commit/61d2fcc7a880185df186fa2544edcd2f8785952a

$ tree /sys/kernel/mm/interleave_weight/
/sys/kernel/mm/interleave_weight/
├── enabled [1]
├── possible [2]
└── node
    ├── node0
    │   └── interleave_weight [3]
    └── node1
        └── interleave_weight [3]

(this could be changed to /sys/kernel/mm/mempolicy/...)

I think the internal representation of this can be simplified greatly,
over what the patch provides now, but maybe this solves the "it doesn't
belong in these other components" issue.

Answer: Simply leave it as a static global kobject in mempolicy, which
also deals with many of the issues regarding race conditions.

If a user provides weights, use those. If they do not, use globals.
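
To sketch what I have in mind (the names iw_table, the single node0
attribute, and the /sys/kernel/mm/mempolicy location are placeholders for
illustration, not the actual patch):

/*
 * Hypothetical sketch (not the actual patch): a global weight table
 * owned by mempolicy, exposed under /sys/kernel/mm/mempolicy/.
 */
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/mm.h>
#include <linux/nodemask.h>
#include <linux/sysfs.h>

static u8 iw_table[MAX_NUMNODES];	/* global per-node weights */

/*
 * Weight lookup: task-local weights (if the task set any) win,
 * otherwise fall back to the global table; 1 means plain round-robin.
 */
static u8 __maybe_unused interleave_weight(const u8 *task_weights, int nid)
{
	if (task_weights && task_weights[nid])
		return task_weights[nid];
	return iw_table[nid] ? iw_table[nid] : 1;
}

/* One attribute shown for brevity; real code would create one per node. */
static ssize_t node0_show(struct kobject *kobj, struct kobj_attribute *attr,
			  char *buf)
{
	return sysfs_emit(buf, "%u\n", READ_ONCE(iw_table[0]));
}

static ssize_t node0_store(struct kobject *kobj, struct kobj_attribute *attr,
			   const char *buf, size_t count)
{
	u8 weight;

	if (kstrtou8(buf, 0, &weight))
		return -EINVAL;
	WRITE_ONCE(iw_table[0], weight);
	return count;
}

static struct kobj_attribute node0_attr =
	__ATTR(node0, 0644, node0_show, node0_store);

static int __init iw_sysfs_init(void)
{
	struct kobject *kobj = kobject_create_and_add("mempolicy", mm_kobj);

	if (!kobj)
		return -ENOMEM;
	return sysfs_create_file(kobj, &node0_attr.attr);
}
late_initcall(iw_sysfs_init);

Since the table is static and never freed, readers need no reference
counting; READ_ONCE/WRITE_ONCE on the weight bytes is enough.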

On a cpuset rebind event (container migration, mems_allowed changes),
manually set weights would have to remain, so in a bad case, the
weights would be very out of line with the real distribution of memory.

Example: if your nodemask is (0,1,2) and a migration changes it to
(3,4,5), then unfortunately your weights will likely revert to [1,1,1]

If set with global weights, they could automatically adjust. It
would not be perfect, but it would be better than the potential worst
case above. If that same migration occurs, the next allocation would
simply use whatever the target node weights are in the global config.

So if globally you have weights [3,2,1,1,2,3], and you move from
nodemask (0,1,2) to (3,4,5), your weights change from [3,2,1] to
[1,2,3]. If the structure is built as a matrix of (cpu_node,mem_nodes),
then you can also optimize based on the node the task is running on.
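
A rough sketch of that lookup (names hypothetical), where the row is
picked by the node the task happens to be running on:

/*
 * Hypothetical (cpu_node, mem_node) weight matrix: the row is the node
 * the task is currently running on, the column is the allocation target.
 */
#include <linux/compiler.h>
#include <linux/nodemask.h>
#include <linux/topology.h>

static u8 iw_matrix[MAX_NUMNODES][MAX_NUMNODES];

static u8 interleave_weight_for(int mem_node)
{
	int cpu_node = numa_node_id();	/* node of the CPU we run on */
	u8 w = READ_ONCE(iw_matrix[cpu_node][mem_node]);

	return w ? w : 1;		/* 0 means "round-robin default" */
}

After a cpuset rebind or container migration nothing needs to be
rewritten; the next allocation simply indexes a different row.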

That feels very intuitive, deals with many race condition issues, and
the global setting can actually be implemented without the need for
set_mempolicy2 at all - which is certainly a bonus.

Would love more thoughts here. Will have a new RFC with set_mempolicy2,
mbind2, and MPOL_WEIGHTED_INTERLEAVE soon that demonstrates the above.

Regards
~Gregory

2023-12-04 08:21:28

by Huang, Ying

Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control

Gregory Price <[email protected]> writes:

> On Wed, Nov 15, 2023 at 01:56:53PM +0800, Huang, Ying wrote:
>> Gregory Price <[email protected]> writes:
>>
>> Because we usually have multiple nodes in one mem-tier, I still think a
>> mem-tier-based interface is simpler than a node-based one. But it seems more
>> complex to introduce mem-tiers into mempolicy, especially if we have
>> per-task weights. So I am fine with going with a node-based interface.
>>
>> > * cgroups: "this doesn't involve dynamic resource accounting /
>> > enforcement at all" and "these aren't resource
>> > allocations, it's unclear what the hierarchical
>> > relationship mean".
>> >
>> > * node: too global, explore smaller scope first then expand.
>>
>> Why is it too global? I understand that it doesn't cover all possible
>> use cases (although I don't know whether these use cases are practical
>> or not). But it can provide a reasonable default per-node weight based
>> on available node performance information (such as HMAT, CDAT, etc.).
>> And quite a few workloads can just use it. I think this is a useful
>> feature.
>>
>
> Have been sharing notes with more folks. Michal thinks a global set of
> weights is unintuitive and not useful, and would prefer to see the
> per-task weights first.
>
> Though this may have been in response to adding it as an attribute of
> nodes directly.
>
> Another proposal here suggested adding a new sysfs setting
> https://github.com/skhynix/linux/commit/61d2fcc7a880185df186fa2544edcd2f8785952a
>
> $ tree /sys/kernel/mm/interleave_weight/
> /sys/kernel/mm/interleave_weight/
> ├── enabled [1]
> ├── possible [2]
> └── node
>     ├── node0
>     │   └── interleave_weight [3]
>     └── node1
>         └── interleave_weight [3]
>
> (this could be changed to /sys/kernel/mm/mempolicy/...)
>
> I think the internal representation of this can be simplified greatly,
> over what the patch provides now, but maybe this solves the "it doesn't
> belong in these other components" issue.
>
> Answer: Simply leave it as a static global kobject in mempolicy, which
> also deals with many of the issues regarding race conditions.

Personally I would prefer to add interleave weight as an attribute of
nodes, but I understand that some people think it's not appropriate to
place anything node-specific there. So some place under /sys/kernel/mm
sounds reasonable too.

> If a user provides weights, use those. If they do not, use globals.

Yes. That is the target use case.

> On a cpuset rebind event (container migration, mems_allowed changes),
> manually set weights would have to remain, so in a bad case, the
> weights would be very out of line with the real distribution of memory.
>
> Example: if your nodemask is (0,1,2) and a migration changes it to
> (3,4,5), then unfortunately your weights will likely revert to [1,1,1]
>
> If set with global weights, they could automatically adjust. It
> would not be perfect, but it would be better than the potential worst
> case above. If that same migration occurs, the next allocation would
> simply use whatever the target node weights are in the global config.
>
> So if globally you have weights [3,2,1,1,2,3], and you move from
> nodemask (0,1,2) to (3,4,5), your weights change from [3,2,1] to
> [1,2,3].

That is nice. And I prefer to emphasize the simple use case: users
don't always need to specify interleave weights. Just use the
MPOL_WEIGHTED_INTERLEAVE policy, and the system will provide reasonable
default weights.

> If the structure is built as a matrix of (cpu_node,mem_nodes),
> then you can also optimize based on the node the task is running on.

The matrix stuff makes the situation complex. If people do need
something like that, they can just use set_mempolicy2() with user
specified weights. I still believe that "make simple stuff simple, and
complex stuff possible".

> That feels very intuitive, deals with many race condition issues, and
> the global setting can actually be implemented without the need for
> set_mempolicy2 at all - which is certainly a bonus.
>
> Would love more thoughts here. Will have a new RFC with set_mempolicy2,
> mbind2, and MPOL_WEIGHTED_INTERLEAVE soon that demonstrates the above.

Thanks for doing all these!

--
Best Regards,
Huang, Ying

2023-12-04 13:51:04

by Gregory Price

Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control

On Mon, Dec 04, 2023 at 04:19:02PM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
>
> > If the structure is built as a matrix of (cpu_node,mem_nodes),
> > then you can also optimize based on the node the task is running on.
>
> The matrix stuff makes the situation complex. If people do need
> something like that, they can just use set_mempolicy2() with user
> specified weights. I still believe that "make simple stuff simple, and
> complex stuff possible".
>

I don't think it's particularly complex, since we already have a
distance matrix for numa nodes:

available: 2 nodes (0-1)
... snip ...
node distances:
node   0   1
  0:  10  21
  1:  21  10

This would follow the same thing, just adjustable for bandwidth.
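
For reference, that distance matrix is already queryable from userspace
via libnuma, and a bandwidth-derived weight matrix would be indexed the
same way. This just dumps the existing distances (build with -lnuma), it
does not implement any weighting:

/* Print the existing NUMA distance matrix; a (src,dst) weight matrix
 * would be indexed exactly the same way. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
	if (numa_available() < 0)
		return 1;

	int max = numa_max_node();

	for (int src = 0; src <= max; src++) {
		for (int dst = 0; dst <= max; dst++)
			printf("%4d", numa_distance(src, dst));
		printf("\n");
	}
	return 0;
}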

I personally find the (src,dst) matrix very important for flexibility.

But if there is particular pushback against it, having a one dimensional
array is better than not having it, so I will take what I can get.

> > That feels very intuitive, deals with many race condition issues, and
> > the global setting can actually be implemented without the need for
> > set_mempolicy2 at all - which is certainly a bonus.
> >
> > Would love more thoughts here. Will have a new RFC with set_mempolicy2,
> > mbind2, and MPOL_WEIGHTED_INTERLEAVE soon that demonstrates the above.
>
> Thanks for doing all these!
>

Someone's got to :]

~Gregory

2023-12-05 09:04:26

by Huang, Ying

Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control

Gregory Price <[email protected]> writes:

> On Mon, Dec 04, 2023 at 04:19:02PM +0800, Huang, Ying wrote:
>> Gregory Price <[email protected]> writes:
>>
>> > If the structure is built as a matrix of (cpu_node,mem_nodes),
>> > then you can also optimize based on the node the task is running on.
>>
>> The matrix stuff makes the situation complex. If people do need
>> something like that, they can just use set_mempolicy2() with user
>> specified weights. I still believe that "make simple stuff simple, and
>> complex stuff possible".
>>
>
> I don't think it's particularly complex, since we already have a
> distance matrix for numa nodes:
>
> available: 2 nodes (0-1)
> ... snip ...
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
>
> This would follow the same thing, just adjustable for bandwidth.

We add complexity based on requirements, not because something similar
already exists.

> I personally find the (src,dst) matrix very important for flexibility.

With set_mempolicy2(), I think we have the needed flexibility for
users who need the complexity.

> But if there is particular pushback against it, having a one dimensional
> array is better than not having it, so I will take what I can get.

TBH, I don't think that we really need that. Especially given we will
have set_mempolicy2().

>> > That feels very intuitive, deals with many race condition issues, and
>> > the global setting can actually be implemented without the need for
>> > set_mempolicy2 at all - which is certainly a bonus.
>> >
>> > Would love more thoughts here. Will have a new RFC with set_mempolicy2,
>> > mbind2, and MPOL_WEIGHTED_INTERLEAVE soon that demonstrates the above.
>>
>> Thanks for doing all these!
>>
>
> Someone's got to :]
>

--
Best Regards,
Huang, Ying

2023-12-05 14:48:16

by Gregory Price

Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control

On Tue, Dec 05, 2023 at 05:01:51PM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
>
> > On Mon, Dec 04, 2023 at 04:19:02PM +0800, Huang, Ying wrote:
> >> Gregory Price <[email protected]> writes:
> >>
> >> > If the structure is built as a matrix of (cpu_node,mem_nodes),
> >> > then you can also optimize based on the node the task is running on.
> >>
> >> The matrix stuff makes the situation complex. If people do need
> >> something like that, they can just use set_mempolicy2() with user
> >> specified weights. I still believe that "make simple stuff simple, and
> >> complex stuff possible".
> >>
> >
> > I don't think it's particularly complex, since we already have a
> > distance matrix for numa nodes:
> >
> > available: 2 nodes (0-1)
> > ... snip ...
> > node distances:
> > node   0   1
> >   0:  10  21
> >   1:  21  10
> >
> > This would follow the same thing, just adjustable for bandwidth.
>
> We add complexity based on requirements, not because something similar
> already exists.
>
> > I personally find the (src,dst) matrix very important for flexibility.
>
> With set_mempolicy2(), I think we have the needed flexibility for
> users who need the complexity.
>
> > But if there is particular pushback against it, having a one dimensional
> > array is better than not having it, so I will take what I can get.
>
> TBH, I don't think that we really need that. Especially given we will
> have set_mempolicy2().
>

From a complexity standpoint, it is exactly as complex as the hardware
configuration itself: each socket has a different view of the memory
topology. If you have a non-homogeneous memory configuration (e.g. a
different number of CXL expanders on one socket than the other), a flat
array of weights has no way of capturing this hardware configuration.

That makes the feature significantly less useful. In fact, it makes the
feature equivalent to set_mempolicy2 - except that weights could be
changed at runtime from outside a process.


A matrix resolves one very specific use case: task migration


set_mempolicy2 is not sufficient to solve this. There is presently no
way for an external task to change the mempolicy of an existing task.
That means a task must become "migration aware" to use weighting in the
context of containers where migrations are likely.

Two things to consider: A task...
a) has no way of knowing a migration occurred
b) may not have visibility of numa nodes outside its cpusets prior to
a migration - making it unlikely/not possible for it to set
weights correctly in the event a migration occurs.

If a server with 2 sockets is set up non-homogeneously (a different number
of CXL memory expanders on each socket), then the effective bandwidth
distribution between sockets will be different.

If a container is migrated between sockets in this situation, then tasks
with manually set weights, or with global weights kept as a single array,
will have poor memory distributions relative to the new view of the system.

Requiring the global settings to be an array basically requires global
weights to be sub-optimal for any use case that is not explicitly a
single workload that consumes all the cores on the system.

If the system provides a matrix, then the global settings can be optimal
and re-weighting in response to migration happens cleanly and transparently.

~Gregory

2023-12-06 00:52:45

by Huang, Ying

Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control

Gregory Price <[email protected]> writes:

> On Tue, Dec 05, 2023 at 05:01:51PM +0800, Huang, Ying wrote:
>> Gregory Price <[email protected]> writes:
>>
>> > On Mon, Dec 04, 2023 at 04:19:02PM +0800, Huang, Ying wrote:
>> >> Gregory Price <[email protected]> writes:
>> >>
>> >> > If the structure is built as a matrix of (cpu_node,mem_nodes),
>> >> > then you can also optimize based on the node the task is running on.
>> >>
>> >> The matrix stuff makes the situation complex. If people do need
>> >> something like that, they can just use set_mempolicy2() with user
>> >> specified weights. I still believe that "make simple stuff simple, and
>> >> complex stuff possible".
>> >>
>> >
>> > I don't think it's particularly complex, since we already have a
>> > distance matrix for numa nodes:
>> >
>> > available: 2 nodes (0-1)
>> > ... snip ...
>> > node distances:
>> > node   0   1
>> >   0:  10  21
>> >   1:  21  10
>> >
>> > This would follow the same thing, just adjustable for bandwidth.
>>
>> We add complexity based on requirements, not because something similar
>> already exists.
>>
>> > I personally find the (src,dst) matrix very important for flexibility.
>>
>> With set_mempolicy2(), I think we have the needed flexibility for
>> users who need the complexity.
>>
>> > But if there is particular pushback against it, having a one dimensional
>> > array is better than not having it, so I will take what I can get.
>>
>> TBH, I don't think that we really need that. Especially given we will
>> have set_mempolicy2().
>>
>
> From a complexity standpoint, it is exactly as complex as the hardware
> configuration itself: each socket has a different view of the memory
> topology. If you have a non-homogeneous memory configuration (e.g. a
> different number of CXL expanders on one socket than the other), a flat
> array of weights has no way of capturing this hardware configuration.

One important task of software is to hide the complexity of the hardware
from the users, or at least to provide that option. It should only add
complexity based on real requirements.

> That makes the feature significantly less useful. In fact, it makes the
> feature equivalent to set_mempolicy2 - except that weights could be
> changed at runtime from outside a process.
>
>
> A matrix resolves one very specific use case: task migration
>
>
> set_mempolicy2 is not sufficient to solve this. There is presently no
> way for an external task to change the mempolicy of an existing task.
> That means a task must become "migration aware" to use weighting in the
> context of containers where migrations are likely.
>
> Two things to consider: A task...
> a) has no way of knowing a migration occurred
> b) may not have visibility of numa nodes outside its cpusets prior to
> a migration - making it unlikely/not possible for it to set
> weights correctly in the event a migration occurs.
>
> If a server with 2 sockets is set up non-homogeneously (a different number
> of CXL memory expanders on each socket), then the effective bandwidth
> distribution between sockets will be different.
>
> If a container is migrated between sockets in this situation, then tasks
> with manually set weights, or with global weights kept as a single array,
> will have poor memory distributions relative to the new view of the system.
>
> Requiring the global settings to be an array basically requires global
> weights to be sub-optimal for any use case that is not explicitly a
> single workload that consumes all the cores on the system.
>
> If the system provides a matrix, then the global settings can be optimal
> and re-weighting in response to migration happens cleanly and transparently.

For these complex requirements, we will have process_set_mempolicy2().
I think that it's even more flexible than the global matrix.

--
Best Regards,
Huang, Ying

2023-12-06 02:02:28

by Gregory Price

Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control

On Wed, Dec 06, 2023 at 08:50:23AM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
> >
> > From a complexity standpoint, it is exactly as complex as the hardware
> > configuration itself: each socket has a different view of the memory
> > topology. If you have a non-homogeneous memory configuration (e.g. a
> > different number of CXL expanders on one socket than the other), a flat
> > array of weights has no way of capturing this hardware configuration.
>
> One important task of software is to hide the complexity of the hardware
> from the users, or at least to provide that option. It should only add
> complexity based on real requirements.
>

The global weights are intended to help administrators hide that
complexity from actual end-users.

The administrator of a system should already be aware of the hardware
configuration; however, to hide this complexity, a system service can
be made that auto-configures these weights at system bring-up and on
memory-device hotplug, simplifying and hiding the complexity even further.

A system service can use ACPI HMAT (Heterogeneous Memory Attribute
Table) information to automatically set the global weight information at
boot time and/or on hotplug. Such extensions have already been proposed
in prior RFCs and on the CXL mailing list.
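
As a rough illustration (not a real daemon), such a service could read
the HMAT-derived bandwidth attributes that already exist under
/sys/devices/system/node/ (see Documentation/admin-guide/mm/numaperf.rst)
and write bandwidth-proportional weights into the proposed
interleave_weight sysfs files from the patch linked earlier; the node
count and the scaling here are placeholders:

/*
 * Sketch only: derive interleave weights from HMAT-populated node
 * bandwidth attributes. The destination path is the *proposed*
 * interleave_weight interface, not an existing kernel ABI.
 */
#include <stdio.h>

#define NR_NODES 4	/* assumed node count for this example */

static long read_bw(int node)
{
	char path[128];
	long bw = 0;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/access0/initiators/read_bandwidth",
		 node);
	f = fopen(path, "r");
	if (!f)
		return 0;
	if (fscanf(f, "%ld", &bw) != 1)
		bw = 0;
	fclose(f);
	return bw;
}

int main(void)
{
	long bw[NR_NODES], min = 0;

	for (int n = 0; n < NR_NODES; n++) {
		bw[n] = read_bw(n);
		if (bw[n] && (!min || bw[n] < min))
			min = bw[n];
	}

	for (int n = 0; n < NR_NODES; n++) {
		char path[128];
		long weight = min ? (bw[n] + min / 2) / min : 1;
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/kernel/mm/interleave_weight/node/node%d/interleave_weight",
			 n);
		f = fopen(path, "w");
		if (!f)
			continue;
		fprintf(f, "%ld\n", weight ? weight : 1);
		fclose(f);
	}
	return 0;
}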



To break this down a little more explicitly into 6 example use-cases,
let's consider the potential ways in which weighted interleave may be
set via set_mempolicy() or set_mempolicy2().

1. Actual end-user software calls it directly (or through libnuma)
   a) it can call set_mempolicy() without task-weights and accept the
      administrator-configured global weights
   b) it can call set_mempolicy2() with task-weights and use task-local
      defined weighting (see the sketch after this list)
2. Actual end-user uses `numactl -w[weights] --interleave ...`
   a) if weights are not defined, use global weights
   b) if weights are defined, use task-local weights
3. Administrator / Orchestrator opts user-software into weighted
   interleave by wrapping their software in `numactl -w --interleave`
   a) if weights are not defined, use global weights
   b) if weights are defined, use task-local weights
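
To make case 1 concrete: a task can opt itself into (unweighted)
interleave today with the existing set_mempolicy(); set_mempolicy2() and
MPOL_WEIGHTED_INTERLEAVE are still proposals, so this sketch only shows
the baseline call:

/* Case 1a today: interleave across nodes 0-3 with the existing
 * set_mempolicy(). With the proposed MPOL_WEIGHTED_INTERLEAVE the same
 * call would instead follow the global (or task-local) weights. */
#include <numaif.h>
#include <stdio.h>

int main(void)
{
	unsigned long nodemask = 0xf;	/* nodes 0-3 */

	/* maxnode: bits in nodemask (+1 is a common safe convention) */
	if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
			  8 * sizeof(nodemask) + 1)) {
		perror("set_mempolicy");
		return 1;
	}
	/* Anonymous allocations made after this point are interleaved. */
	return 0;
}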

The most common use case is likely to be (3a) - an administrator opting
a user-workload into weighted-interleave via `numactl -w --interleave`
or an orchestrator such as kubernetes doing something similar on
pod/container dispatching.



In all cases where the user does not define weights, they are trusting
the administrator-set (or system-daemon-set) weights to provide the optimal
distribution, removing the complexity of understanding the hardware
environment from the end-user.



In all cases where the user does define weights, they are accepting the
complexity of understanding the hardware environment.



On the topic of the ACTUAL complexity of system hardware that is being
hidden, we must consider a non-homogeneous bandwidth environment. The
simplest form is an off-the-shelf Intel 2-socket server with CXL memory
expanders.

Let's consider a 2-socket system with the following configuration:

DRAM on Socket0: 300GB/s local DRAM bandwidth (node 0)
DRAM on Socket1: 300GB/s local DRAM bandwidth (node 1)
CXL on Socket0: 128GB/s bandwidth (node 2)
CXL on Socket1: 128GB/s bandwidth (node 3)

A single linear array of weights is not sufficient to capture the
complexities of bandwidth distributions on this system, because
of the presence of a UPI link between socket0 and socket1, which
changes the bandwidth distribution depending on where a task runs.

For example, 3 UPI links provide 62.4GB/s full-duplex.

From the perspective of socket 0, the following is true:

Bandwidth to Socket0 DRAM: 300GB/s (node 0)
Bandwidth to Socket0 CXL: 100GB/s (node 2)
Aggregate bandwidth to nodes (1,3): 62.4GB/s

From the perspective of socket 1, this changes to:
Bandwidth to Socket1 DRAM: 300GB/s (node 1)
Bandwidth to Socket1 CXL: 100GB/s (node 3)
Aggregate bandwidth to nodes (0,2): 62.4GB/s

With a single linear array of weights that applies to the entire system,
you cannot represent this configuration. And in fact, a single
configuration of weights will always provide a sub-optimal distribution.
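
To put rough numbers on it (scaled loosely from the bandwidths above, so
the exact values are illustrative only), a per-source-node table captures
both views at once, while no single row can:

/* Illustrative only: weights roughly proportional to the 300 / 62.4 /
 * 100 GB/s figures above, one row per source socket.
 *                          node0  node1  node2  node3 */
static const unsigned char iw_matrix[2][4] = {
	/* task on socket0 */ {  10,     2,     3,     2 },
	/* task on socket1 */ {   2,    10,     2,     3 },
};

A flat array has to pick one of these rows (or some average of them), and
is therefore wrong for tasks on at least one of the two sockets.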

Pursuing simplicity here defeats the entire goal of weighted interleave
in a heterogeneous environment.

>
> For these complex requirements, we will have process_set_mempolicy2().
> I think that it's even more flexible than the global matrix.
>

process_set_mempolicy2() has a *very* long road to exist. The problem of
mempolicy reference counting is non-trivial, and the plumbing requires
changes to no fewer than 4 subsystems.

Beyond that, the complexity of actually using process_set_mempolicy2()
is the same as in any situation where set_mempolicy2() is used with
task-local weights: the absolute highest.

The global weighting matrix actually hides this complexity entirely.

> --
> Best Regards,
> Huang, Ying