2022-09-05 11:33:32

by Michal Hocko

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] cgroup/cpuset: Add a new isolated mems.policy type.

On Mon 05-09-22 18:30:55, Zhongkun He wrote:
> Hi Michal, thanks for your reply.
>
> The current 'mempolicy' is hierarchically independent. The default value of
> the child is to inherit from the parent. The modification of the child
> policy will not be restricted by the parent.

This breaks cgroup fundamental property of hierarchical enforcement of
each property. And as such it is a no go.

> Of course, there are other options, such as requiring the child's policy mode
> to be the same as the parent's, with its nodes a subset of the parent's. But
> the interleave type would be complicated; that's why hierarchy independence is
> used. It would be better if you have other suggestions.

Honestly, I am not really sure cgroup cpusets are a great fit for this
usecase. It would probably be better to elaborate some more on what the
existing shortcomings are and what you would like to achieve. Just stating
that the syscall is a hard-to-use interface is not quite clear on its own.

Btw. have you noticed this question?

> > What is the hierarchical behavior of the policy? Say parent has a
> > stronger requirement (say bind) than a child (prefer)?
> > > How to use the mempolicy interface:
> > > echo prefer:2 > /sys/fs/cgroup/zz/cpuset.mems.policy
> > > echo bind:1-3 > /sys/fs/cgroup/zz/cpuset.mems.policy
> > > echo interleave:0,1,2,3 >/sys/fs/cgroup/zz/cpuset.mems.policy
> >
> > Am I just confused or did you really mean to combine all these
> > together?

--
Michal Hocko
SUSE Labs


2022-09-06 10:42:40

by Zhongkun He

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] cgroup/cpuset: Add a new isolated mems.policy type.

> On Mon 05-09-22 18:30:55, Zhongkun He wrote:
>> Hi Michal, thanks for your reply.
>>
>> The current 'mempolicy' is hierarchically independent. The default value of
>> the child is to inherit from the parent. The modification of the child
>> policy will not be restricted by the parent.
>
> This breaks cgroup fundamental property of hierarchical enforcement of
> each property. And as such it is a no go.
>
>> Of course, there are other options, such as the child's policy mode must be
>> the same as the parent's. node can be the subset of parent's, but the
>> interleave type will be complicated, that's why hierarchy independence is
>> used. It would be better if you have other suggestions?
>
> Honestly, I am not really sure cgroup cpusets is a great fit for this
> usecase. It would be probably better to elaborate some more what are the
> existing shortcomings and what you would like to achieve. Just stating
> the syscall is a hard to use interface is not quite clear on its own.
>
> Btw. have you noticed this question?
>
>>> What is the hierarchical behavior of the policy? Say parent has a
>>> stronger requirement (say bind) than a child (prefer)?
>>>> How to use the mempolicy interface:
>>>> echo prefer:2 > /sys/fs/cgroup/zz/cpuset.mems.policy
>>>> echo bind:1-3 > /sys/fs/cgroup/zz/cpuset.mems.policy
>>>> echo interleave:0,1,2,3 >/sys/fs/cgroup/zz/cpuset.mems.policy
>>>
>>> Am I just confused or did you really mean to combine all these
>>> together?
>

Hi Michal, thanks for your reply.

>>Say parent has a stronger requirement (say bind) than a child(prefer)?

Yes, combine all of these together. The parent's tasks will use 'bind' and
the child's will use 'prefer'. This is the current implementation, and we can
discuss and modify it together if there are other suggestions.

1: Existing shortcomings

In our use case, the application and the control plane are two separate
systems. When the application is created, it doesn't know how to use
memory, and it doesn't care. The control plane decides the memory
usage policy based on several factors (the attributes of the
application itself, its priority, the remaining resources of the
system). Currently, numactl is used to set the policy at program startup,
and child processes inherit the mempolicy. But we can't dynamically
adjust the memory policy: it will not change short of a restart.

2: Our goals

For the above reasons, we want to create a mempolicy at the cgroup
level. Processes under a cgroup usually have the same priority and
attributes, and we can dynamically adjust the memory allocation strategy
according to the remaining resources of the system. For example, a
low-priority cgroup uses the 'bind:2-3' policy while a high-priority
cgroup uses 'bind:0-1'. When resources are insufficient, the control
plane changes these to 'bind:3' and 'bind:0-2', and so on. Furthermore,
more mempolicies can be added later, such as allocating memory according
to node weight.

Thanks.

2022-09-06 13:22:49

by Michal Hocko

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] cgroup/cpuset: Add a new isolated mems.policy type.

On Tue 06-09-22 18:37:40, Zhongkun He wrote:
> > On Mon 05-09-22 18:30:55, Zhongkun He wrote:
> > > Hi Michal, thanks for your reply.
> > >
> > > The current 'mempolicy' is hierarchically independent. The default value of
> > > the child is to inherit from the parent. The modification of the child
> > > policy will not be restricted by the parent.
> >
> > This breaks cgroup fundamental property of hierarchical enforcement of
> > each property. And as such it is a no go.
> >
> > > Of course, there are other options, such as the child's policy mode must be
> > > the same as the parent's. node can be the subset of parent's, but the
> > > interleave type will be complicated, that's why hierarchy independence is
> > > used. It would be better if you have other suggestions?
> >
> > Honestly, I am not really sure cgroup cpusets is a great fit for this
> > usecase. It would be probably better to elaborate some more what are the
> > existing shortcomings and what you would like to achieve. Just stating
> > the syscall is a hard to use interface is not quite clear on its own.
> >
> > Btw. have you noticed this question?
> >
> > > > What is the hierarchical behavior of the policy? Say parent has a
> > > > stronger requirement (say bind) than a child (prefer)?
> > > > > How to use the mempolicy interface:
> > > > > echo prefer:2 > /sys/fs/cgroup/zz/cpuset.mems.policy
> > > > > echo bind:1-3 > /sys/fs/cgroup/zz/cpuset.mems.policy
> > > > > echo interleave:0,1,2,3 >/sys/fs/cgroup/zz/cpuset.mems.policy
> > > >
> > > > Am I just confused or did you really mean to combine all these
> > > > together?
> >
>
> Hi Michal, thanks for your reply.
>
> >>Say parent has a stronger requirement (say bind) than a child(prefer)?
>
> Yes, combine all these together.

What is the semantic of the resulting policy?

> The parent's task will use 'bind', the child's
> use 'prefer'. This is the current implementation, and we can discuss and
> modify it together if there are other suggestions.
>
> 1:Existing shortcomings
>
> In our use case, the application and the control plane are two separate
> systems. When the application is created, it doesn't know how to use memory,
> and it doesn't care. The control plane will decide the memory usage policy
> based on different reasons (the attributes of the application itself, the
> priority, the remaining resources of the system). Currently, numactl is used
> to set it at program startup, and the child process will inherit the
> mempolicy.

Yes this is common practice I have seen so far.

> But we can't dynamically adjust the memory policy: it will not
> change short of a restart.

Do you really need to change the policy itself or only the effective
nodemask? I mean what is your usecase to go from say mbind to preferred
policy? Do you need any other policy than bind and preferred?

> 2:Our goals
>
> For the above reasons, we want to create a mempolicy at the cgroup level.
> Usually processes under a cgroup have the same priority and attributes, and
> we can dynamically adjust the memory allocation strategy according to the
> remaining resources of the system. For example, a low-priority cgroup uses
> the 'bind:2-3' policy, and a high-priority cgroup uses bind:0-1. When
> resources are insufficient, it will be changed to bind:3, bind:0-2 by
> control plane, etc. Furthermore, more mempolicies can be extended, such as
> allocating memory according to node weight, etc.

Yes, I do understand that you want to change the node affinity and that
is already possible with the cpuset cgroup. The existing constraint is that
the policy is hardcoded to mbind IIRC. So you cannot really implement a dynamic
preferred policy, which would make some sense to me. The question is how
to implement that with a sensible semantic. It is hard to partition the
system into several cgroups if a subset is allowed to spill over to the others.
Say something like the following:

          root (nodes=0-3)
          /              \
     A (0, 1)        B (2, 3)

if both are MBIND then this makes sense because they are kinda isolated
(at least for user allocations) but if B is PREFERRED and therefore
allowed to use nodes 0 and 1 then it can deplete the memory from A and
therefore isolation doesn't work at all.

I can imagine that all cgroups would use the PREFERRED policy and then
nobody can expect anything and the configuration is mostly best effort.
But it feels like this is an abuse of the cgroup interface and a proper
syscall interface is likely due. Would it make more sense to add
pidfd_set_mempolicy and allow sufficiently privileged process to
manipulate default memory policy of a remote process?
--
Michal Hocko
SUSE Labs

2022-09-07 14:02:15

by Zhongkun He

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] cgroup/cpuset: Add a new isolated mems.policy type.

>> Hi Michal, thanks for your reply.
>>
>>>> Say parent has a stronger requirement (say bind) than a child(prefer)?
>>
>> Yes, combine all these together.
>
> What is the semantic of the resulting policy?
>
>> The parent's task will use 'bind', the child's
>> use 'prefer'. This is the current implementation, and we can discuss and
>> modify it together if there are other suggestions.
>>
>> 1:Existing shortcomings
>>
>> In our use case, the application and the control plane are two separate
>> systems. When the application is created, it doesn't know how to use memory,
>> and it doesn't care. The control plane will decide the memory usage policy
>> based on different reasons (the attributes of the application itself, the
>> priority, the remaining resources of the system). Currently, numactl is used
>> to set it at program startup, and the child process will inherit the
>> mempolicy.
>
> Yes this is common practice I have seen so far.
>
>> But we can't dynamically adjust the memory policy: it will not
>> change short of a restart.
>
> Do you really need to change the policy itself or only the effective
> nodemask? I mean what is your usecase to go from say mbind to preferred
> policy? Do you need any other policy than bind and preferred?
>
>> 2:Our goals
>>
>> For the above reasons, we want to create a mempolicy at the cgroup level.
>> Usually processes under a cgroup have the same priority and attributes, and
>> we can dynamically adjust the memory allocation strategy according to the
>> remaining resources of the system. For example, a low-priority cgroup uses
>> the 'bind:2-3' policy, and a high-priority cgroup uses bind:0-1. When
>> resources are insufficient, it will be changed to bind:3, bind:0-2 by
>> control plane, etc. Furthermore, more mempolicies can be extended, such as
>> allocating memory according to node weight, etc.
>
> Yes, I do understand that you want to change the node affinity and that
> is already possible with cpuset cgroup. The existing constrain is that
> the policy is hardcoded mbind IIRC. So you cannot really implement a dynamic
> preferred policy which would make some sense to me. The question is how
> to implement that with a sensible semantic. It is hard to partition the
> system into several cgroups if subset allows to spill over to others.
> Say something like the following:
>
>           root (nodes=0-3)
>           /              \
>      A (0, 1)        B (2, 3)
>
> if both are MBIND then this makes sense because they are kinda isolated
> (at least for user allocations) but if B is PREFERRED and therefore
> allowed to use nodes 0 and 1 then it can deplete the memory from A and
> therefore isolation doesn't work at all.
>
> I can imagine that the all cgroups would use PREFERRED policy and then
> nobody can expect anything and the configuration is mostly best effort.
> But it feels like this is an abuse of the cgroup interface and a proper
> syscall interface is likely due. Would it make more sense to add
> pidfd_set_mempolicy and allow sufficiently privileged process to
> manipulate default memory policy of a remote process?

Hi Michal, thanks for your reply.

> Do you really need to change the policy itself or only the effective
> nodemask? Do you need any other policy than bind and preferred?

Yes, we need to change the policy, not only its nodemask. We really want
the policy to be interleave, and to extend it to weight-interleave.
Say something like the following:

                     nodes   weight
interleave:          0-3     1:1:1:1   (default: one page per node in turn)
weight-interleave:   0-3     1:2:4:6   (allocate pages by user-set weight)

In the actual use case, the remaining resources of each node are
different, and plain interleave cannot maximize the use of resources.

Back to the previous question.
>The question is how to implement that with a sensible semantic.

Thanks for your analysis and suggestions. It is really difficult to add
a policy directly to cgroup while keeping hierarchical enforcement. It would
be a good idea to add pidfd_set_mempolicy.

Also, there is a new idea.
We can try to separate the elements of mempolicy and use them independently.
A mempolicy has two components:
nodes: which nodes to use (e.g. 0-3); we can use cpuset's
effective_mems directly.
mode: how to use them (bind, prefer, etc.); turn the mode into a
cpuset->flags bit, such as CS_INTERLEAVE.
task_struct->mems_allowed is equal to cpuset->effective_mems, which is
hierarchically enforced. CS_INTERLEAVE can also be propagated into tasks,
just like the other flags (e.g. CS_SPREAD_PAGE).
When a process needs to allocate memory, it can find the appropriate
node to allocate pages from according to the flag and mems_allowed.

thanks.



2022-09-08 07:57:09

by Michal Hocko

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] cgroup/cpuset: Add a new isolated mems.policy type.

On Wed 07-09-22 21:50:24, Zhongkun He wrote:
[...]
> > Do you really need to change the policy itself or only the effective
> > nodemask? Do you need any other policy than bind and preferred?
>
> Yes, we need to change the policy, not only its nodemask. We really want
> the policy to be interleave, and to extend it to weight-interleave.
> Say something like the following:
>
>                      nodes   weight
> interleave:          0-3     1:1:1:1   (default: one page per node in turn)
> weight-interleave:   0-3     1:2:4:6   (allocate pages by user-set weight)
> In the actual usecase, the remaining resources of each node are different,
> and the use of interleave cannot maximize the use of resources.

OK, this seems a separate topic. It would be good to start by proposing
that new policy in isolation with the semantic description.

> Back to the previous question.
> >The question is how to implement that with a sensible semantic.
>
> Thanks for your analysis and suggestions. It is really difficult to add
> a policy directly to cgroup while keeping hierarchical enforcement. It would
> be a good idea to add pidfd_set_mempolicy.

Are you going to pursue that path?

> Also, there is a new idea.
> We can try to separate the elements of mempolicy and use them independently.
> Mempolicy has two meanings:
> nodes: which nodes to use (e.g. 0-3); we can use cpuset's effective_mems
> directly.
> mode: how to use them (bind, prefer, etc.); turn the mode into a
> cpuset->flags bit, such as CS_INTERLEAVE.
> task_struct->mems_allowed is equal to cpuset->effective_mems, which is
> hierarchically enforced. CS_INTERLEAVE can also be propagated into tasks,
> just like the other flags (e.g. CS_SPREAD_PAGE).
> When a process needs to allocate memory, it can find the appropriate node to
> allocate pages from according to the flag and mems_allowed.

I am not sure I see the advantage, as the mode and nodes are always
closely coupled. You cannot really have one without the other.

--
Michal Hocko
SUSE Labs

2022-09-09 03:05:26

by Zhongkun He

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] cgroup/cpuset: Add a new isolated mems.policy type.

> On Wed 07-09-22 21:50:24, Zhongkun He wrote:
> [...]
>>> Do you really need to change the policy itself or only the effective
>>> nodemask? Do you need any other policy than bind and preferred?
>>
>> Yes, we need to change the policy, not only its nodemask. We really want
>> the policy to be interleave, and to extend it to weight-interleave.
>> Say something like the following:
>>
>>                      nodes   weight
>> interleave:          0-3     1:1:1:1   (default: one page per node in turn)
>> weight-interleave:   0-3     1:2:4:6   (allocate pages by user-set weight)
>> In the actual usecase, the remaining resources of each node are different,
>> and the use of interleave cannot maximize the use of resources.
>
> OK, this seems a separate topic. It would be good to start by proposing
> that new policy in isolation with the semantic description.
>
>> Back to the previous question.
>>> The question is how to implement that with a sensible semantic.
>>
>> Thanks for your analysis and suggestions. It is really difficult to add
>> a policy directly to cgroup while keeping hierarchical enforcement. It would
>> be a good idea to add pidfd_set_mempolicy.
>
> Are you going to pursue that path?
>
>> Also, there is a new idea.
>> We can try to separate the elements of mempolicy and use them independently.
>> Mempolicy has two meanings:
>> nodes: which nodes to use (e.g. 0-3); we can use cpuset's effective_mems
>> directly.
>> mode: how to use them (bind, prefer, etc.); turn the mode into a
>> cpuset->flags bit, such as CS_INTERLEAVE.
>> task_struct->mems_allowed is equal to cpuset->effective_mems, which is
>> hierarchically enforced. CS_INTERLEAVE can also be propagated into tasks,
>> just like the other flags (e.g. CS_SPREAD_PAGE).
>> When a process needs to allocate memory, it can find the appropriate node to
>> allocate pages from according to the flag and mems_allowed.
>
> I am not sure I see the advantage, as the mode and nodes are always
> closely coupled. You cannot really have one without the other.
>

Hi Michal, thanks for your suggestion and reply.

> Are you going to pursue that path?

Yes, I'll give it a try, as it makes sense to modify the policy dynamically.

Thanks.

2022-09-14 15:13:31

by Zhongkun He

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] cgroup/cpuset: Add a new isolated mems.policy type.

>>
>>> Back to the previous question.
>>>> The question is how to implement that with a sensible semantic.
>>>
>>> Thanks for your analysis and suggestions. It is really difficult to add
>>> a policy directly to cgroup while keeping hierarchical enforcement. It
>>> would be a good idea to add pidfd_set_mempolicy.
>>
>> Are you going to pursue that path?

> Hi Michal, thanks for your suggestion and reply.
>
> > Are you going to pursue that path?
>
> Yes, I'll give it a try, as it makes sense to modify the policy dynamically.
>
> Thanks.

Hi Michal, I have a question about pidfd_set_mempolicy; it would be
great if you have some suggestions.

The task_structs of a process and its threads are independent. If we change
the mempolicy of the process through pidfd_set_mempolicy, the mempolicy
of its threads will not change. Of course, users can set the mempolicy of
all threads by iterating through /proc/tgid/task.

The question is whether we should override the thread's mempolicy when
setting the process's mempolicy.

There are two options:
A: Change the process's mempolicy and apply that mempolicy to all its threads.
B: Only change the process's mempolicy in the kernel. The mempolicy of each
thread must then be modified by the user through pidfd_set_mempolicy from
userspace, if necessary.

Thanks.

2022-09-23 07:45:37

by Michal Hocko

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] cgroup/cpuset: Add a new isolated mems.policy type.

On Wed 14-09-22 23:10:47, Zhongkun He wrote:
> > >
> > > > Back to the previous question.
> > > > > The question is how to implement that with a sensible semantic.
> > > >
> > > > Thanks for your analysis and suggestions. It is really difficult to add
> > > > a policy directly to cgroup while keeping hierarchical enforcement. It
> > > > would be a good idea to add pidfd_set_mempolicy.
> > >
> > > Are you going to pursue that path?
>
> > Hi Michal, thanks for your suggestion and reply.
> >
> > > Are you going to pursue that path?
> >
> > Yes, I'll give it a try, as it makes sense to modify the policy dynamically.
> >
> > Thanks.
>
> Hi Michal, I have a question about pidfd_set_mempolicy; it would be great
> if you have some suggestions.
>
> The task_structs of a process and its threads are independent. If we change
> the mempolicy of the process through pidfd_set_mempolicy, the mempolicy of
> its threads will not change. Of course, users can set the mempolicy of all
> threads by iterating through /proc/tgid/task.
>
> The question is whether we should override the thread's mempolicy when
> setting the process's mempolicy.
>
> There are two options:
> A: Change the process's mempolicy and apply that mempolicy to all its threads.
> B: Only change the process's mempolicy in the kernel. The mempolicy of each
> thread must then be modified by the user through pidfd_set_mempolicy from
> userspace, if necessary.

set_mempolicy is a per-task_struct operation, and so the pidfd-based
API should be as well. If somebody requires a per-thread-group setting then
the whole group should be iterated. I do not think we have any
precedent where a pidfd operation on the thread group leader has side
effects on other threads as well.
--
Michal Hocko
SUSE Labs

2022-09-23 15:43:14

by Zhongkun He

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] cgroup/cpuset: Add a new isolated mems.policy type.

>
> set_mempolicy is a per-task_struct operation, and so the pidfd-based
> API should be as well. If somebody requires a per-thread-group setting then
> the whole group should be iterated. I do not think we have any
> precedent where a pidfd operation on the thread group leader has side
> effects on other threads as well.

Hi Michal,

I got it, thanks for your suggestions and reply.