2023-04-12 15:51:31

by Waiman Long

[permalink] [raw]
Subject: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition

This patch series introduces a new "isolcpus" partition type to the
existing list of {member, root, isolated} types. The primary reason
for adding this new "isolcpus" partition is to facilitate the
distribution of isolated CPUs down the cgroup v2 hierarchy.

The other non-member partition types have the limitation that their
parents have to be valid partitions too, which makes it hard to create
a partition a few layers down the hierarchy.

It is relatively rare to have applications that require creation of
a separate scheduling domain (root). However, it is more common to
have applications that require the use of isolated CPUs (isolated),
e.g. DPDK. One can use the "isolcpus" or "nohz_full" boot command line
options to get that statically. Of course, the "isolated" partition is another
way to achieve that dynamically.
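
As a concrete illustration of the static route (CPU numbers are examples
only), one would boot with kernel command line parameters like:

  isolcpus=2-5 nohz_full=2-5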

Modern container orchestration tools like Kubernetes use the cgroup
hierarchy to manage different containers. If a container needs to use
isolated CPUs, it is hard to get those with the existing set of cpuset
partition types. With this patch series, a new "isolcpus" partition
can be created to hold a set of isolated CPUs that can be pulled into
other "isolated" partitions.

The "isolcpus" partition is special that there can have at most one
instance of this in a system. It serves as a pool for isolated CPUs
and cannot hold tasks or sub-cpusets underneath it. It is also not
cpu-exclusive so that the isolated CPUs can be distributed down the
sibling hierarchies, though those isolated CPUs will not be usable
until the partition type becomes "isolated".

Once isolated CPUs are needed in a cgroup, the administrator can write
a list of isolated CPUs into its "cpuset.cpus" and change its partition
type to "isolated" to pull in those isolated CPUs from the "isolcpus"
partition and use them in that cgroup. That will make the distribution
of isolated CPUs to cgroups that need them much easier.
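
As a sketch of the intended flow under this proposal (cgroup paths and
CPU numbers below are illustrative only):

  # create the single "isolcpus" partition as the pool of isolated CPUs
  echo 8-9 > /sys/fs/cgroup/isolcpus/cpuset.cpus
  echo isolcpus > /sys/fs/cgroup/isolcpus/cpuset.cpus.partition
  # a container cgroup later pulls those CPUs out of the pool
  echo 8-9 > /sys/fs/cgroup/user.slice/user-x.slice/cpuset.cpus
  echo isolated > /sys/fs/cgroup/user.slice/user-x.slice/cpuset.cpus.partition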

In the future, we may be able to extend this special "isolcpus" partition
type to support other isolation attributes like those that can be
specified with the "isolcpus" boot command line and related options.

Waiman Long (5):
cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE
handling
cgroup/cpuset: Add a new "isolcpus" partition root state
cgroup/cpuset: Make isolated partition pull CPUs from isolcpus
partition
cgroup/cpuset: Documentation update for the new "isolcpus" partition
cgroup/cpuset: Extend test_cpuset_prs.sh to test isolcpus partition

Documentation/admin-guide/cgroup-v2.rst | 89 ++-
kernel/cgroup/cpuset.c | 548 +++++++++++++++---
.../selftests/cgroup/test_cpuset_prs.sh | 376 ++++++++----
3 files changed, 789 insertions(+), 224 deletions(-)

--
2.31.1


2023-04-12 19:46:33

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition

Hello, Waiman.

On Wed, Apr 12, 2023 at 11:37:53AM -0400, Waiman Long wrote:
> This patch series introduces a new "isolcpus" partition type to the
> existing list of {member, root, isolated} types. The primary reason
> for adding this new "isolcpus" partition is to facilitate the
> distribution of isolated CPUs down the cgroup v2 hierarchy.
>
> The other non-member partition types have the limitation that their
> parents have to be valid partitions too, which makes it hard to create
> a partition a few layers down the hierarchy.
>
> It is relatively rare to have applications that require creation of
> a separate scheduling domain (root). However, it is more common to
> have applications that require the use of isolated CPUs (isolated),
> e.g. DPDK. One can use the "isolcpus" or "nohz_full" boot command line
> options to get that statically. Of course, the "isolated" partition is another
> way to achieve that dynamically.
>
> Modern container orchestration tools like Kubernetes use the cgroup
> hierarchy to manage different containers. If a container needs to use
> isolated CPUs, it is hard to get those with the existing set of cpuset
> partition types. With this patch series, a new "isolcpus" partition
> can be created to hold a set of isolated CPUs that can be pulled into
> other "isolated" partitions.
>
> The "isolcpus" partition is special that there can have at most one
> instance of this in a system. It serves as a pool for isolated CPUs
> and cannot hold tasks or sub-cpusets underneath it. It is also not
> cpu-exclusive so that the isolated CPUs can be distributed down the
> sibling hierarchies, though those isolated CPUs will not be usable
> until the partition type becomes "isolated".
>
> Once isolated CPUs are needed in a cgroup, the administrator can write
> a list of isolated CPUs into its "cpuset.cpus" and change its partition
> type to "isolated" to pull in those isolated CPUs from the "isolcpus"
> partition and use them in that cgroup. That will make the distribution
> of isolated CPUs to cgroups that need them much easier.

I'm not sure about this. It feels really hacky in that it side-steps the
distribution hierarchy completely. I can imagine a non-isolated cpuset
wanting to allow isolated cpusets downstream but that should be done
hierarchically - e.g. by allowing a cgroup to express what isolated cpus are
allowed in the subtree. Also, can you give more details on the targeted use
cases?

Thanks.

--
tejun

2023-04-12 20:26:46

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition

Hello, Waiman.

On Wed, Apr 12, 2023 at 03:52:36PM -0400, Waiman Long wrote:
> There is still a distribution hierarchy as the list of isolated CPUs has
> to be distributed down to the target cgroup through the hierarchy. For
> example,
>
> cgroup root
>   +- isolcpus  (cpus 8,9; isolcpus)
>   +- user.slice (cpus 1-9; ecpus 1-7; member)
>      +- user-x.slice (cpus 8,9; ecpus 8,9; isolated)
>      +- user-y.slice (cpus 1,2; ecpus 1,2; member)
>
> OTOH, I do agree that this can be somewhat hacky. That is why I posted it
> as an RFC to solicit feedback.

Wouldn't it be possible to make it hierarchical by adding another cpumask to
cpuset which lists the cpus which are allowed in the hierarchy but not used
unless claimed by an isolated domain?

Thanks.

--
tejun

2023-04-12 20:34:52

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition

On 4/12/23 16:22, Tejun Heo wrote:
> Hello, Waiman.
>
> On Wed, Apr 12, 2023 at 03:52:36PM -0400, Waiman Long wrote:
>> There is still a distribution hierarchy as the list of isolated CPUs has
>> to be distributed down to the target cgroup through the hierarchy. For
>> example,
>>
>> cgroup root
>>   +- isolcpus  (cpus 8,9; isolcpus)
>>   +- user.slice (cpus 1-9; ecpus 1-7; member)
>>     +- user-x.slice (cpus 8,9; ecpus 8,9; isolated)
>>      +- user-y.slice (cpus 1,2; ecpus 1,2; member)
>>
>> OTOH, I do agree that this can be somewhat hacky. That is why I posted it
>> as an RFC to solicit feedback.
> Wouldn't it be possible to make it hierarchical by adding another cpumask to
> cpuset which lists the cpus which are allowed in the hierarchy but not used
> unless claimed by an isolated domain?

I think we can. You mean having a new "cpuset.cpus.isolated" cgroupfs
file. So there will be one in the root cgroup that defines all the
isolated CPUs one can have. It is then distributed down the hierarchy
and can be claimed only if a cgroup becomes an "isolated" partition.
There will be a slight change in the semantics of an "isolated"
partition, but I doubt there will be many users out there.
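
A rough sketch of that idea, with the caveat that "cpuset.cpus.isolated"
is only a proposal at this point and all names below are illustrative:

  # root cgroup defines the full set of isolated CPUs
  echo 8-9 > /sys/fs/cgroup/cpuset.cpus.isolated
  # the set is distributed down the hierarchy and claimed when a
  # cgroup switches to the "isolated" partition type
  echo 8-9 > /sys/fs/cgroup/user.slice/cpuset.cpus.isolated
  echo 8-9 > /sys/fs/cgroup/user.slice/user-x.slice/cpuset.cpus
  echo isolated > /sys/fs/cgroup/user.slice/user-x.slice/cpuset.cpus.partition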

If you are OK with this approach, I can modify my patch series to do that.

Cheers,
Longman

2023-04-13 00:04:50

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition

Hello,

On Wed, Apr 12, 2023 at 04:33:29PM -0400, Waiman Long wrote:
> I think we can. You mean having a new "cpuset.cpus.isolated" cgroupfs file.
> So there will be one in the root cgroup that defines all the isolated CPUs
> one can have. It is then distributed down the hierarchy and can be claimed
> only if a cgroup becomes an "isolated" partition. There will be a slight

Yeah, that seems a lot more congruent with the typical pattern.

> change in the semantics of an "isolated" partition, but I doubt there will
> be many users out there.

I haven't thought through it too hard but what prevents staying compatible
with the current behavior?

> If you are OK with this approach, I can modify my patch series to do that.

Thanks.

--
tejun

2023-04-13 00:45:06

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition


On 4/12/23 20:03, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 12, 2023 at 04:33:29PM -0400, Waiman Long wrote:
>> I think we can. You mean having a new "cpuset.cpus.isolated" cgroupfs file.
>> So there will be one in the root cgroup that defines all the isolated CPUs
>> one can have. It is then distributed down the hierarchy and can be claimed
>> only if a cgroup becomes an "isolated" partition. There will be a slight
> Yeah, that seems a lot more congruent with the typical pattern.
>
>> change in the semantics of an "isolated" partition, but I doubt there will
>> be many users out there.
> I haven't thought through it too hard but what prevents staying compatible
> with the current behavior?

It is possible to stay compatible with existing behavior. It is just
that a break from existing behavior would make the solution cleaner.

So the new behavior will be:

  If the "cpuset.cpus.isolated" isn't set, the existing rules applies.
If it is set, the new rule will be used.

Does that look reasonable to you?

Cheers,
Longman

2023-04-13 00:46:13

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition

Hello,

On Wed, Apr 12, 2023 at 08:26:03PM -0400, Waiman Long wrote:
> If the "cpuset.cpus.isolated" isn't set, the existing rules apply. If it
> is set, the new rules will be used.
>
> Does that look reasonable to you?

Sounds a bit contrived. Does it need to be something defined in the root
cgroup? The only thing that's needed is that a cgroup needs to claim CPUs
exclusively without using them, right? Let's say we add a new interface
file, say, cpuset.cpus.reserve which is always exclusive and can be consumed
by children whichever way they want, wouldn't that be sufficient? Then,
there would be nothing to describe in the root cgroup.

Thanks.

--
tejun

2023-04-13 01:19:26

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition

Hello, Waiman.

On Wed, Apr 12, 2023 at 08:55:55PM -0400, Waiman Long wrote:
> > Sounds a bit contrived. Does it need to be something defined in the root
> > cgroup?
>
> Yes, because we need to take away the isolated CPUs from the effective cpus
> of the root cgroup. So it needs to start from the root. That is also why we
> have the partition rule that the parent of a partition has to be a partition
> root itself. With the new scheme, we don't need a special cgroup to hold the

I'm following. The root is already a partition root and the cgroupfs control
knobs are owned by the parent, so the root cgroup would own the first level
cgroups' cpuset.cpus.reserve knobs. If the root cgroup wants to assign some
CPUs exclusively to a first level cgroup, it can then set that cgroup's
reserve knob accordingly (or maybe the better name is
cpuset.cpus.exclusive), which will take those CPUs out of the root cgroup's
partition and give them to the first level cgroup. The first level cgroup
then is free to do whatever with those CPUs that now belong exclusively to
the cgroup subtree.

> isolated CPUs. The new root cgroup file will be enough to inform the system
> what CPUs will have to be isolated.
>
> My current thinking is that the root's "cpuset.cpus.isolated" will start
> with whatever has been set in the "isolcpus" or "nohz_full" boot command
> line and can be extended from there but not shrunk below that as there can
> be additional isolation attributes with those isolated CPUs.

I'm not sure we wanna tie with those automatically. I think it'd be more
confusing than helpful.

Thanks.

--
tejun

2023-04-13 01:19:39

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition


On 4/12/23 20:33, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 12, 2023 at 08:26:03PM -0400, Waiman Long wrote:
>>   If the "cpuset.cpus.isolated" isn't set, the existing rules apply. If it
>> is set, the new rules will be used.
>>
>> Does that look reasonable to you?
> Sounds a bit contrived. Does it need to be something defined in the root
> cgroup?

Yes, because we need to take away the isolated CPUs from the effective
cpus of the root cgroup. So it needs to start from the root. That is
also why we have the partition rule that the parent of a partition has
to be a partition root itself. With the new scheme, we don't need a
special cgroup to hold the isolated CPUs. The new root cgroup file will
be enough to inform the system what CPUs will have to be isolated.

My current thinking is that the root's "cpuset.cpus.isolated" will start
with whatever has been set in the "isolcpus" or "nohz_full" boot
command line and can be extended from there but not shrunk below that as
there can be additional isolation attributes with those isolated CPUs.

Cheers,
Longman

> The only thing that's needed is that a cgroup needs to claim CPUs
> exclusively without using them, right? Let's say we add a new interface
> file, say, cpuset.cpus.reserve which is always exclusive and can be consumed
> by children whichever way they want, wouldn't that be sufficient? Then,
> there would be nothing to describe in the root cgroup.
>
> Thanks.
>

2023-04-13 02:00:54

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition

On 4/12/23 21:17, Tejun Heo wrote:
> Hello, Waiman.
>
> On Wed, Apr 12, 2023 at 08:55:55PM -0400, Waiman Long wrote:
>>> Sounds a bit contrived. Does it need to be something defined in the root
>>> cgroup?
>> Yes, because we need to take away the isolated CPUs from the effective cpus
>> of the root cgroup. So it needs to start from the root. That is also why we
>> have the partition rule that the parent of a partition has to be a partition
>> root itself. With the new scheme, we don't need a special cgroup to hold the
> I'm following. The root is already a partition root and the cgroupfs control
> knobs are owned by the parent, so the root cgroup would own the first level
> cgroups' cpuset.cpus.reserve knobs. If the root cgroup wants to assign some
> CPUs exclusively to a first level cgroup, it can then set that cgroup's
> reserve knob accordingly (or maybe the better name is
> cpuset.cpus.exclusive), which will take those CPUs out of the root cgroup's
> partition and give them to the first level cgroup. The first level cgroup
> then is free to do whatever with those CPUs that now belong exclusively to
> the cgroup subtree.

I am OK with the cpuset.cpus.reserve name, but not that much with the
cpuset.cpus.exclusive name as it can get confused with cgroup v1's
cpuset.cpu_exclusive. Of course, I prefer the cpuset.cpus.isolated name
a bit more. Once an isolated CPU gets used in an isolated partition, it
is exclusive and it can't be used in another isolated partition.

Since we will allow users to set cpuset.cpus.reserve to whatever value
they want, the distribution of isolated CPUs is only valid if the cpus
are present in its parent's cpuset.cpus.reserve and all the way up to
the root. It is a bit expensive, but it should be a relatively rare
operation.

>
>> isolated CPUs. The new root cgroup file will be enough to inform the system
>> what CPUs will have to be isolated.
>>
>> My current thinking is that the root's "cpuset.cpus.isolated" will start
>> with whatever has been set in the "isolcpus" or "nohz_full" boot command
>> line and can be extended from there but not shrunk below that as there can
>> be additional isolation attributes with those isolated CPUs.
> I'm not sure we wanna tie with those automatically. I think it'd be more
> confusing than helpful.

Yes, I am fine with taking this off for now.

Cheers,
Longman

2023-04-14 01:24:55

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition


On 4/12/23 21:55, Waiman Long wrote:
> On 4/12/23 21:17, Tejun Heo wrote:
>> Hello, Waiman.
>>
>> On Wed, Apr 12, 2023 at 08:55:55PM -0400, Waiman Long wrote:
>>>> Sounds a bit contrived. Does it need to be something defined in the
>>>> root
>>>> cgroup?
>>> Yes, because we need to take away the isolated CPUs from the
>>> effective cpus
>>> of the root cgroup. So it needs to start from the root. That is also
>>> why we
>>> have the partition rule that the parent of a partition has to be a
>>> partition
>>> root itself. With the new scheme, we don't need a special cgroup to
>>> hold the
>> I'm following. The root is already a partition root and the cgroupfs
>> control
>> knobs are owned by the parent, so the root cgroup would own the first
>> level
>> cgroups' cpuset.cpus.reserve knobs. If the root cgroup wants to
>> assign some
>> CPUs exclusively to a first level cgroup, it can then set that cgroup's
>> reserve knob accordingly (or maybe the better name is
>> cpuset.cpus.exclusive), which will take those CPUs out of the root
>> cgroup's
>> partition and give them to the first level cgroup. The first level
>> cgroup
>> then is free to do whatever with those CPUs that now belong
>> exclusively to
>> the cgroup subtree.
>
> I am OK with the cpuset.cpus.reserve name, but not that much with the
> cpuset.cpus.exclusive name as it can get confused with cgroup v1's
> cpuset.cpu_exclusive. Of course, I prefer the cpuset.cpus.isolated
> name a bit more. Once an isolated CPU gets used in an isolated
> partition, it is exclusive and it can't be used in another isolated
> partition.
>
> Since we will allow users to set cpuset.cpus.reserve to whatever value
> they want, the distribution of isolated CPUs is only valid if the cpus
> are present in its parent's cpuset.cpus.reserve and all the way up to
> the root. It is a bit expensive, but it should be a relatively rare
> operation.

I now have a slightly different idea of how to do that. We already have
an internal cpumask for partitioning - subparts_cpus. I am thinking
about exposing it as cpuset.cpus.reserve. The current way of creating
subpartitions will be called automatic reservation and require a direct
parent/child partition relationship. But as soon as a user writes
anything to it, it will break automatic reservation and require manual
reservation going forward.

In that way, we can keep the old behavior, but also support new use
cases. I am going to work on that.

Cheers,
Longman

2023-04-14 17:03:22

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition

On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote:
> I now have a slightly different idea of how to do that. We already have an
> internal cpumask for partitioning - subparts_cpus. I am thinking about
> exposing it as cpuset.cpus.reserve. The current way of creating
> subpartitions will be called automatic reservation and require a direct
> parent/child partition relationship. But as soon as a user writes anything to
> it, it will break automatic reservation and require manual reservation going
> forward.
>
> In that way, we can keep the old behavior, but also support new use cases. I
> am going to work on that.

I'm not sure I fully understand the proposed behavior but it does sound more
quirky.

Thanks.

--
tejun

2023-04-14 17:33:48

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition


On 4/14/23 12:54, Tejun Heo wrote:
> On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote:
>> I now have a slightly different idea of how to do that. We already have an
>> internal cpumask for partitioning - subparts_cpus. I am thinking about
>> exposing it as cpuset.cpus.reserve. The current way of creating
>> subpartitions will be called automatic reservation and require a direct
>> parent/child partition relationship. But as soon as a user writes anything to
>> it, it will break automatic reservation and require manual reservation going
>> forward.
>>
>> In that way, we can keep the old behavior, but also support new use cases. I
>> am going to work on that.
> I'm not sure I fully understand the proposed behavior but it does sound more
> quirky.

The idea is to use the existing subparts_cpus for cpu reservation
instead of adding a new cpumask for that purpose. The current way of
partition creation does cpu reservation (setting subparts_cpus)
automatically with the constraint that the parent of a partition must be
a partition root itself. One way to relax this constraint is to allow a
new manual reservation mode where users can set reserve cpus manually
and distribute them down the hierarchy before activating a partition to
use those cpus.

Now the question is how to enable this new manual reservation mode. One
way to do it is to enable it whenever the new cpuset.cpus.reserve file
is modified. Alternatively, we may enable it by a cgroupfs mount option
or a boot command line option.

Hope this clears up the confusion.

Cheers,
Longman

2023-04-14 17:35:24

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition

On Fri, Apr 14, 2023 at 01:29:25PM -0400, Waiman Long wrote:
>
> On 4/14/23 12:54, Tejun Heo wrote:
> > On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote:
> > > I now have a slightly different idea of how to do that. We already have an
> > > internal cpumask for partitioning - subparts_cpus. I am thinking about
> > > exposing it as cpuset.cpus.reserve. The current way of creating
> > > subpartitions will be called automatic reservation and require a direct
> > > parent/child partition relationship. But as soon as a user writes anything to
> > > it, it will break automatic reservation and require manual reservation going
> > > forward.
> > >
> > > In that way, we can keep the old behavior, but also support new use cases. I
> > > am going to work on that.
> > I'm not sure I fully understand the proposed behavior but it does sound more
> > quirky.
>
> The idea is to use the existing subparts_cpus for cpu reservation instead of
> adding a new cpumask for that purpose. The current way of partition creation
> does cpu reservation (setting subparts_cpus) automatically with the
> constraint that the parent of a partition must be a partition root itself.
> One way to relax this constraint is to allow a new manual reservation mode
> where users can set reserve cpus manually and distribute them down the
> hierarchy before activating a partition to use those cpus.
>
> Now the question is how to enable this new manual reservation mode. One way
> to do it is to enable it whenever the new cpuset.cpus.reserve file is
> modified. Alternatively, we may enable it by a cgroupfs mount option or a
> boot command line option.

It'd probably be best if we can keep the behavior within cgroupfs if
possible. Would you mind writing up the documentation section describing the
behavior beforehand? I think things would be clearer if we look at it from
the interface documentation side.

Thanks.

--
tejun

2023-04-14 17:44:14

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition

On 4/14/23 13:34, Tejun Heo wrote:
> On Fri, Apr 14, 2023 at 01:29:25PM -0400, Waiman Long wrote:
>> On 4/14/23 12:54, Tejun Heo wrote:
>>> On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote:
>>>> I now have a slightly different idea of how to do that. We already have an
>>>> internal cpumask for partitioning - subparts_cpus. I am thinking about
>>>> exposing it as cpuset.cpus.reserve. The current way of creating
>>>> subpartitions will be called automatic reservation and require a direct
>>>> parent/child partition relationship. But as soon as a user writes anything to
>>>> it, it will break automatic reservation and require manual reservation going
>>>> forward.
>>>>
>>>> In that way, we can keep the old behavior, but also support new use cases. I
>>>> am going to work on that.
>>> I'm not sure I fully understand the proposed behavior but it does sound more
>>> quirky.
>> The idea is to use the existing subparts_cpus for cpu reservation instead of
>> adding a new cpumask for that purpose. The current way of partition creation
>> does cpu reservation (setting subparts_cpus) automatically with the
>> constraint that the parent of a partition must be a partition root itself.
>> One way to relax this constraint is to allow a new manual reservation mode
>> where users can set reserve cpus manually and distribute them down the
>> hierarchy before activating a partition to use those cpus.
>>
>> Now the question is how to enable this new manual reservation mode. One way
>> to do it is to enable it whenever the new cpuset.cpus.reserve file is
>> modified. Alternatively, we may enable it by a cgroupfs mount option or a
>> boot command line option.
> It'd probably be best if we can keep the behavior within cgroupfs if
> possible. Would you mind writing up the documentation section describing the
> behavior beforehand? I think things would be clearer if we look at it from
> the interface documentation side.

Sure, will do that. I need some time and so it will be early next week.

Cheers,
Longman

2023-04-14 19:09:51

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition

On 4/14/23 13:38, Waiman Long wrote:
> On 4/14/23 13:34, Tejun Heo wrote:
>> On Fri, Apr 14, 2023 at 01:29:25PM -0400, Waiman Long wrote:
>>> On 4/14/23 12:54, Tejun Heo wrote:
>>>> On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote:
>>>>> I now have a slightly different idea of how to do that. We already
>>>>> have an
>>>>> internal cpumask for partitioning - subparts_cpus. I am thinking
>>>>> about
>>>>> exposing it as cpuset.cpus.reserve. The current way of creating
>>>>> subpartitions will be called automatic reservation and require a
>>>>> direct
>>>>> parent/child partition relationship. But as soon as a user writes
>>>>> anything to
>>>>> it, it will break automatic reservation and require manual
>>>>> reservation going
>>>>> forward.
>>>>>
>>>>> In that way, we can keep the old behavior, but also support new
>>>>> use cases. I
>>>>> am going to work on that.
>>>> I'm not sure I fully understand the proposed behavior but it does
>>>> sound more
>>>> quirky.
>>> The idea is to use the existing subparts_cpus for cpu reservation
>>> instead of
>>> adding a new cpumask for that purpose. The current way of partition
>>> creation
>>> does cpu reservation (setting subparts_cpus) automatically with the
>>> constraint that the parent of a partition must be a partition root
>>> itself.
>>> One way to relax this constraint is to allow a new manual
>>> reservation mode
>>> where users can set reserve cpus manually and distribute them down the
>>> hierarchy before activating a partition to use those cpus.
>>>
>>> Now the question is how to enable this new manual reservation mode.
>>> One way
>>> to do it is to enable it whenever the new cpuset.cpus.reserve file is
>>> modified. Alternatively, we may enable it by a cgroupfs mount option
>>> or a
>>> boot command line option.
>> It'd probably be best if we can keep the behavior within cgroupfs if
>> possible. Would you mind writing up the documentation section
>> describing the
>> behavior beforehand? I think things would be clearer if we look at it
>> from
>> the interface documentation side.
>
> Sure, will do that. I need some time and so it will be early next week.

Just kidding :-)

Below is a draft of the new cpuset.cpus.reserve cgroupfs file:

  cpuset.cpus.reserve
        A read-write multiple values file which exists on all
        cpuset-enabled cgroups.

        It lists the reserved CPUs to be used for the creation of
        child partitions.  See the section on "cpuset.cpus.partition"
        below for more information on cpuset partition.  These reserved
        CPUs should be a subset of "cpuset.cpus" and will be mutually
        exclusive of "cpuset.cpus.effective" when used since these
        reserved CPUs cannot be used by tasks in the current cgroup.

        There are two modes for partition CPU reservation -
        auto or manual.  The system starts up in auto mode where
        "cpuset.cpus.reserve" will be set automatically when valid
        child partitions are created and users don't need to touch the
        file at all.  This mode has the limitation that the parent of a
        partition must be a partition root itself.  So child partitions
        have to be created one-by-one from the cgroup root down.

        To enable the creation of a partition down in the hierarchy
        without the intermediate cgroups being partition roots, one
        has to turn on the manual reservation mode by writing directly
        to "cpuset.cpus.reserve" with a value different from its
        current value.  By distributing the reserve CPUs down the cgroup
        hierarchy to the parent of the target cgroup, this target cgroup
        can be switched to become a partition root if its "cpuset.cpus"
        is a subset of the set of valid reserve CPUs in its parent.  The
        set of valid reserve CPUs is the set that are present in all
        its ancestors' "cpuset.cpus.reserve" up to the cgroup root and
        which have not been allocated to another valid partition yet.

        Once manual reservation mode is enabled, a cgroup administrator
        must always set up "cpuset.cpus.reserve" files properly before
        a valid partition can be created.  So this mode has more
        administrative overhead but offers greater flexibility.
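
To make the manual mode concrete, here is a sketch under the draft
semantics above (cgroup names and CPU numbers are illustrative only):

  # writing the file switches reservation to manual mode
  echo 8-9 > /sys/fs/cgroup/cpuset.cpus.reserve
  # distribute the reserve CPUs down to the parent of the target cgroup
  echo 8-9 > /sys/fs/cgroup/user.slice/cpuset.cpus.reserve
  # the target can now become a partition; its cpuset.cpus must be a
  # subset of the valid reserve CPUs in its parent
  echo 8-9 > /sys/fs/cgroup/user.slice/user-x.slice/cpuset.cpus
  echo isolated > /sys/fs/cgroup/user.slice/user-x.slice/cpuset.cpus.partition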

Cheers,
Longman

2023-05-02 18:08:42

by Michal Koutný

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition

Hello.

The previous thread arrived incomplete to me, so I respond to the last
message only. Point me to a message URL if it was covered.

On Fri, Apr 14, 2023 at 03:06:27PM -0400, Waiman Long <[email protected]> wrote:
> Below is a draft of the new cpuset.cpus.reserve cgroupfs file:
>
>   cpuset.cpus.reserve
>         A read-write multiple values file which exists on all
>         cpuset-enabled cgroups.
>
>         It lists the reserved CPUs to be used for the creation of
>         child partitions.  See the section on "cpuset.cpus.partition"
>         below for more information on cpuset partition.  These reserved
>         CPUs should be a subset of "cpuset.cpus" and will be mutually
>         exclusive of "cpuset.cpus.effective" when used since these
>         reserved CPUs cannot be used by tasks in the current cgroup.
>
>         There are two modes for partition CPU reservation -
>         auto or manual.  The system starts up in auto mode where
>         "cpuset.cpus.reserve" will be set automatically when valid
>         child partitions are created and users don't need to touch the
>         file at all.  This mode has the limitation that the parent of a
>         partition must be a partition root itself.  So child partitions
>         have to be created one-by-one from the cgroup root down.
>
>         To enable the creation of a partition down in the hierarchy
>         without the intermediate cgroups being partition roots,

Why would this be needed? Owning a CPU (a resource) must logically be
passed all the way from root to the target cgroup, i.e. this is
expressed by valid partitioning down to given level.

> one
>         has to turn on the manual reservation mode by writing directly
>         to "cpuset.cpus.reserve" with a value different from its
>         current value.  By distributing the reserve CPUs down the cgroup
>         hierarchy to the parent of the target cgroup, this target cgroup
>         can be switched to become a partition root if its "cpuset.cpus"
>         is a subset of the set of valid reserve CPUs in its parent.

level n
`- level n+1
   cpuset.cpus            // these are actually configured by "owner" of level n
   cpuset.cpus.partition  // similarly here, level n decides if child is a partition

I.e. what would be level n/cpuset.cpus.reserve good for when it can
directly control level n+1/cpuset.cpus?

Thanks,
Michal



2023-05-02 21:39:06

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition

On 5/2/23 14:01, Michal Koutný wrote:
> Hello.
>
> The previous thread arrived incomplete to me, so I respond to the last
> message only. Point me to a message URL if it was covered.
>
> On Fri, Apr 14, 2023 at 03:06:27PM -0400, Waiman Long <[email protected]> wrote:
>> Below is a draft of the new cpuset.cpus.reserve cgroupfs file:
>>
>>   cpuset.cpus.reserve
>>         A read-write multiple values file which exists on all
>>         cpuset-enabled cgroups.
>>
>>         It lists the reserved CPUs to be used for the creation of
>>         child partitions.  See the section on "cpuset.cpus.partition"
>>         below for more information on cpuset partition.  These reserved
>>         CPUs should be a subset of "cpuset.cpus" and will be mutually
>>         exclusive of "cpuset.cpus.effective" when used since these
>>         reserved CPUs cannot be used by tasks in the current cgroup.
>>
>>         There are two modes for partition CPU reservation -
>>         auto or manual.  The system starts up in auto mode where
>>         "cpuset.cpus.reserve" will be set automatically when valid
>>         child partitions are created and users don't need to touch the
>>         file at all.  This mode has the limitation that the parent of a
>>         partition must be a partition root itself.  So child partitions
>>         have to be created one-by-one from the cgroup root down.
>>
>>         To enable the creation of a partition down in the hierarchy
>>         without the intermediate cgroups being partition roots,
> Why would this be needed? Owning a CPU (a resource) must logically be
> passed all the way from root to the target cgroup, i.e. this is
> expressed by valid partitioning down to given level.
>
>> one
>>         has to turn on the manual reservation mode by writing directly
>>         to "cpuset.cpus.reserve" with a value different from its
>>         current value.  By distributing the reserve CPUs down the cgroup
>>         hierarchy to the parent of the target cgroup, this target cgroup
>>         can be switched to become a partition root if its "cpuset.cpus"
>>         is a subset of the set of valid reserve CPUs in its parent.
> level n
> `- level n+1
> cpuset.cpus // these are actually configured by "owner" of level n
> cpuset.cpus.partition // similarly here, level n decides if child is a partition
>
> I.e. what would be level n/cpuset.cpus.reserve good for when it can
> directly control level n+1/cpuset.cpus?

In the new scheme, the available cpus are still directly passed down to
a descendant cgroup. However, isolated CPUs (or more generally CPUs
dedicated to a partition) have to be exclusive. So what the
cpuset.cpus.reserve does is to identify those exclusive CPUs that can be
excluded from the effective_cpus of the parent cgroups before they are
claimed by a child partition. Currently this is done automatically when
a child partition is created off a parent partition root. The new scheme
will break it into 2 separate steps without the requirement that the
parent of a partition has to be a partition root itself.

Cheers,
Longman


2023-05-02 22:36:43

by Michal Koutný

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition

On Tue, May 02, 2023 at 05:26:17PM -0400, Waiman Long <[email protected]> wrote:
> In the new scheme, the available cpus are still directly passed down to a
> descendant cgroup. However, isolated CPUs (or more generally CPUs dedicated
> to a partition) have to be exclusive. So what the cpuset.cpus.reserve does
> is to identify those exclusive CPUs that can be excluded from the
> effective_cpus of the parent cgroups before they are claimed by a child
> partition. Currently this is done automatically when a child partition is
> created off a parent partition root. The new scheme will break it into 2
> separate steps without the requirement that the parent of a partition has to
> be a partition root itself.

new scheme
1st step:
echo C >p/cpuset.cpus.reserve
# p/cpuset.cpus.effective == A-C (1)
2nd step (claim):
echo C' >p/c/cpuset.cpus # C'⊆C
echo root >p/c/cpuset.cpus.partition

current scheme
1st step (configure):
echo C >p/c/cpuset.cpus
2nd step (reserve & claim):
echo root >p/c/cpuset.cpus.partition
# p/cpuset.cpus.effective == A-C (2)

As long as p/c is unpopulated, (1) and (2) are equal situations.
Why is the (different) two step procedure needed?

Also the relaxation of the requirement of a parent being a partition
confuses me -- if the parent is not a partition, i.e. it has no
exclusive ownership of CPUs but can still "give" it to children -- is
the child partition meant to be exclusive? (IOW can parent siblings
reserve some of the same CPUs?)

Thanks,
Michal

2023-05-04 03:14:47

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition


On 5/2/23 18:27, Michal Koutný wrote:
> On Tue, May 02, 2023 at 05:26:17PM -0400, Waiman Long <[email protected]> wrote:
>> In the new scheme, the available cpus are still directly passed down to a
>> descendant cgroup. However, isolated CPUs (or more generally CPUs dedicated
>> to a partition) have to be exclusive. So what the cpuset.cpus.reserve does
>> is to identify those exclusive CPUs that can be excluded from the
>> effective_cpus of the parent cgroups before they are claimed by a child
>> partition. Currently this is done automatically when a child partition is
>> created off a parent partition root. The new scheme will break it into 2
>> separate steps without the requirement that the parent of a partition has to
>> be a partition root itself.
> new scheme
> 1st step:
> echo C >p/cpuset.cpus.reserve
> # p/cpuset.cpus.effective == A-C (1)
> 2nd step (claim):
> echo C' >p/c/cpuset.cpus # C'⊆C
> echo root >p/c/cpuset.cpus.partition

It is something like that. However, the current scheme of automatic
reservation is also supported, i.e. cpuset.cpus.reserve will be set
automatically when the child cgroup becomes a valid partition as long as
the cpuset.cpus.reserve file is not written to. This is for backward
compatibility.

Once it is written to, automatic mode will end and users have to
manually set it afterward.


>
> current scheme
> 1st step (configure):
> echo C >p/c/cpuset.cpus
> 2nd step (reserve & claim):
> echo root >p/c/cpuset.cpus.partition
> # p/cpuset.cpus.effective == A-C (2)
>
> As long as p/c is unpopulated, (1) and (2) are equal situations.
> Why is the (different) two step procedure needed?
>
> Also the relaxation of the requirement of a parent being a partition
> confuses me -- if the parent is not a partition, i.e. it has no
> exclusive ownership of CPUs but can still "give" it to children -- is
> the child partition meant to be exclusive? (IOW can parent siblings
> reserve some of the same CPUs?)

A valid partition root has exclusive ownership of its CPUs. That is a
rule that won't be changed. As a result, an incoming partition root
cannot claim CPUs that have been allocated to another partition. To
simplify things, transition to a valid partition root is not possible if
any of the CPUs in its cpuset.cpus are not in the cpuset.cpus.reserve of
its ancestor or have been allocated to another partition. The partition
root simply becomes invalid.

The parent can virtually give the reserved CPUs from the root down the
hierarchy and a child can claim them once it becomes a partition root.
In manual mode, we need to check all the way up the hierarchy to the
root to figure out what CPUs in cpuset.cpus.reserve are valid. It has
higher overhead, but enabling a partition is not a fast operation anyway.

Cheers,
Longman

2023-05-05 16:16:10

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition

On Wed, May 03, 2023 at 11:01:36PM -0400, Waiman Long wrote:
>
> On 5/2/23 18:27, Michal Koutný wrote:
> > On Tue, May 02, 2023 at 05:26:17PM -0400, Waiman Long <[email protected]> wrote:
> > > In the new scheme, the available cpus are still directly passed down to a
> > > descendant cgroup. However, isolated CPUs (or more generally CPUs dedicated
> > > to a partition) have to be exclusive. So what the cpuset.cpus.reserve does
> > > is to identify those exclusive CPUs that can be excluded from the
> > > effective_cpus of the parent cgroups before they are claimed by a child
> > > partition. Currently this is done automatically when a child partition is
> > > created off a parent partition root. The new scheme will break it into 2
> > > separate steps without the requirement that the parent of a partition has to
> > > be a partition root itself.
> > new scheme
> > 1st step:
> > echo C >p/cpuset.cpus.reserve
> > # p/cpuset.cpus.effective == A-C (1)
> > 2nd step (claim):
> > echo C' >p/c/cpuset.cpus # C'⊆C
> > echo root >p/c/cpuset.cpus.partition
>
> It is something like that. However, the current scheme of automatic
> reservation is also supported, i.e. cpuset.cpus.reserve will be set
> automatically when the child cgroup becomes a valid partition as long as the
> cpuset.cpus.reserve file is not written to. This is for backward
> compatibility.
>
> Once it is written to, automatic mode will end and users have to manually
> set it afterward.

I really don't like the implicit switching behavior. This is interface
behavior modifying internal state that userspace can't view or control
directly. Regardless of how the rest of the discussion develops, this part
should be improved (e.g. would it work to always try to auto-reserve if the
cpu isn't already reserved?).

Thanks.

--
tejun

2023-05-05 16:35:16

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition


On 5/5/23 12:03, Tejun Heo wrote:
> On Wed, May 03, 2023 at 11:01:36PM -0400, Waiman Long wrote:
>> On 5/2/23 18:27, Michal Koutný wrote:
>>> On Tue, May 02, 2023 at 05:26:17PM -0400, Waiman Long <[email protected]> wrote:
>>>> In the new scheme, the available cpus are still directly passed down to a
>>>> descendant cgroup. However, isolated CPUs (or more generally CPUs dedicated
>>>> to a partition) have to be exclusive. So what the cpuset.cpus.reserve does
>>>> is to identify those exclusive CPUs that can be excluded from the
>>>> effective_cpus of the parent cgroups before they are claimed by a child
>>>> partition. Currently this is done automatically when a child partition is
>>>> created off a parent partition root. The new scheme will break it into 2
>>>> separate steps without the requirement that the parent of a partition has to
>>>> be a partition root itself.
>>> new scheme
>>> 1st step:
>>> echo C >p/cpuset.cpus.reserve
>>> # p/cpuset.cpus.effective == A-C (1)
>>> 2nd step (claim):
>>> echo C' >p/c/cpuset.cpus # C'⊆C
>>> echo root >p/c/cpuset.cpus.partition
>> It is something like that. However, the current scheme of automatic
>> reservation is also supported, i.e. cpuset.cpus.reserve will be set
>> automatically when the child cgroup becomes a valid partition as long as the
>> cpuset.cpus.reserve file is not written to. This is for backward
>> compatibility.
>>
>> Once it is written to, automatic mode will end and users have to manually
>> set it afterward.
> I really don't like the implicit switching behavior. This is interface
> behavior modifying internal state that userspace can't view or control
> directly. Regardless of how the rest of the discussion develops, this part
> should be improved (e.g. would it work to always try to auto-reserve if the
> cpu isn't already reserved?).

After some more thought yesterday, I have a slight change in my design:
auto-reserve as it is now will stay for partitions that have a
partition root parent. For a remote partition that doesn't have a
partition root parent, its creation will require pre-allocating
additional CPUs into top_cpuset's cpuset.cpus.reserve first. So there
will be no change in behavior for existing use cases whether a remote
partition is created or not.

Cheers,
Longman

2023-05-08 01:13:01

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition

Hi,

The following is the proposed text for "cpuset.cpus.reserve" and
"cpuset.cpus.partition" of the new cpuset partition in
Documentation/admin-guide/cgroup-v2.rst.

  cpuset.cpus.reserve
    A read-write multiple values file which exists only on the root
    cgroup.

    It lists all the CPUs that are reserved for adjacent and remote
    partitions created in the system.  See the next section for
    more information on what adjacent and remote partitions are.

    Creation of an adjacent partition does not require touching this
    control file as CPU reservation will be done automatically.
    In order to create a remote partition, the CPUs needed by the
    remote partition have to be written to this file first.

    A "+" prefix can be used to indicate a list of additional
    CPUs that are to be added without disturbing the CPUs that are
    originally there.  For example, if its current value is "3-4",
    echoing ""+5" to it will change it to "3-5".

    Once a remote partition is destroyed, its CPUs have to be
    removed from this file or no other process can use them.  A "-"
    prefix can be used to remove a list of CPUs from it.  However,
    removing CPUs that are currently used in existing partitions
    may cause those partitions to become invalid.  A single "-"
    character without any number can be used to indicate removal
    of all the free CPUs not allocated to any partitions to avoid
    accidental partition invalidation.
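
    For illustration, a possible session using the syntax above (CPU
    numbers are examples only):

      echo 3-4 > cpuset.cpus.reserve   # reserve CPUs 3-4
      echo +5 > cpuset.cpus.reserve    # extend to "3-5"
      echo -5 > cpuset.cpus.reserve    # shrink back to "3-4"
      echo - > cpuset.cpus.reserve     # drop all free (unallocated) CPUs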

  cpuset.cpus.partition
    A read-write single value file which exists on non-root
    cpuset-enabled cgroups.  This flag is owned by the parent cgroup
    and is not delegatable.

    It accepts only the following input values when written to.

      ==========    =====================================
      "member"      Non-root member of a partition
      "root"        Partition root
      "isolated"    Partition root without load balancing
      ==========    =====================================

    A cpuset partition is a collection of cgroups with a partition
    root at the top of the hierarchy and its descendants except
    those that are separate partition roots themselves and their
    descendants.  A partition has exclusive access to the set of
    CPUs allocated to it.  Other cgroups outside of that partition
    cannot use any CPUs in that set.

    There are two types of partitions - adjacent and remote.  The
    parent of an adjacent partition must be a valid partition root.
    Partition roots of adjacent partitions are all clustered around
    the root cgroup.  Creation of an adjacent partition is done by
    writing the desired partition type into "cpuset.cpus.partition".

    A remote partition does not require a partition root parent.
    So a remote partition can be formed far from the root cgroup.
    However, its creation is a 2-step process.  The CPUs needed
    by a remote partition ("cpuset.cpus" of the partition root)
    have to be written into "cpuset.cpus.reserve" of the root
    cgroup first.  After that, "isolated" can be written into
    "cpuset.cpus.partition" of the partition root to form a remote
    isolated partition which is the only supported remote partition
    type for now.

    All remote partitions are terminal as adjacent partitions cannot
    be created underneath them.
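
    As a sketch under the semantics above (the cgroup path and CPU
    numbers are illustrative only), creating a remote isolated
    partition for a nested cgroup "a/b" would look like:

      # step 1: reserve the CPUs in the root cgroup
      echo 8-9 > /sys/fs/cgroup/cpuset.cpus.reserve
      # step 2: give the partition root those CPUs (they must also be
      # present in "cpuset.cpus" of all ancestors) and switch its type
      echo 8-9 > /sys/fs/cgroup/a/b/cpuset.cpus
      echo isolated > /sys/fs/cgroup/a/b/cpuset.cpus.partition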

    The root cgroup is always a partition root and its state cannot
    be changed.  All other non-root cgroups start out as "member".

    When set to "root", the current cgroup is the root of a new
    partition or scheduling domain.

    When set to "isolated", the CPUs in that partition will
    be in an isolated state without any load balancing from the
    scheduler.  Tasks placed in such a partition with multiple
    CPUs should be carefully distributed and bound to each of the
    individual CPUs for optimal performance.

    The value shown in "cpuset.cpus.effective" of a partition root is
    the CPUs that are dedicated to that partition and not available
    to cgroups outside of that partition.

    A partition root ("root" or "isolated") can be in one of the
    two possible states - valid or invalid.  An invalid partition
    root is in a degraded state where some state information may
    be retained, but behaves more like a "member".

    All possible state transitions among "member", "root" and
    "isolated" are allowed.

    On read, the "cpuset.cpus.partition" file can show the following
    values.

      ============================= =====================================
      "member"                      Non-root member of a partition
      "root"                        Partition root
      "isolated"                    Partition root without load balancing
      "root invalid (<reason>)"     Invalid partition root
      "isolated invalid (<reason>)" Invalid isolated partition root
      ============================= =====================================

    In the case of an invalid partition root, a descriptive string on
    why the partition is invalid is included within parentheses.

    For an adjacent partition root to be valid, the following
    conditions must be met.

    1) The "cpuset.cpus" is exclusive with its siblings , i.e. they
       are not shared by any of its siblings (exclusivity rule).
    2) The parent cgroup is a valid partition root.
    3) The "cpuset.cpus" is not empty and must contain at least
       one of the CPUs from parent's "cpuset.cpus", i.e. they overlap.
    4) The "cpuset.cpus.effective" cannot be empty unless there is
       no task associated with this partition.

    For a remote partition root to be valid, the following conditions
    must be met.

    1) The same exclusivity rule as adjacent partition root.
    2) The "cpuset.cpus" is not empty and all the CPUs must be
       present in "cpuset.cpus.reserve" of the root cgroup and none
       of them are allocated to another partition.
    3) The "cpuset.cpus" value must be present in all its ancestors
       to ensure proper hierarchical cpu distribution.

    External events like hotplug or changes to "cpuset.cpus" can
    cause a valid partition root to become invalid and vice versa.
    Note that a task cannot be moved to a cgroup with empty
    "cpuset.cpus.effective".

    For a valid partition root with the sibling cpu exclusivity
    rule enabled, changes made to "cpuset.cpus" that violate the
    exclusivity rule will invalidate the partition as well as its
    sibling partitions with conflicting cpuset.cpus values. So
    care must be taken when changing "cpuset.cpus".

    A valid non-root parent partition may distribute out all its CPUs
    to its child partitions when there is no task associated with it.

    Care must be taken when changing a valid partition root to
    "member" as all its child partitions, if present, will become
    invalid causing disruption to tasks running in those child
    partitions. These inactivated partitions could be recovered if
    their parent is switched back to a partition root with a proper
    set of "cpuset.cpus".

    Poll and inotify events are triggered whenever the state of
    "cpuset.cpus.partition" changes.  That includes changes caused
    by write to "cpuset.cpus.partition", cpu hotplug or other
    changes that modify the validity status of the partition.
    This will allow user space agents to monitor unexpected changes
    to "cpuset.cpus.partition" without the need to do continuous
    polling.
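
    As an example, a user space agent might watch for such events with
    something like the following sketch (assumes the inotify-tools
    package is available; the cgroup path is illustrative):

      inotifywait -m -e modify /sys/fs/cgroup/a/b/cpuset.cpus.partition |
      while read -r _; do
          cat /sys/fs/cgroup/a/b/cpuset.cpus.partition
      done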

Cheers,
Longman

2023-05-22 20:15:12

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition

Hello, Waiman.

On Sun, May 07, 2023 at 09:03:44PM -0400, Waiman Long wrote:
...
>   cpuset.cpus.reserve
>     A read-write multiple values file which exists only on the root
>     cgroup.
>
>     It lists all the CPUs that are reserved for adjacent and remote
>     partitions created in the system.  See the next section for
>     more information on what adjacent and remote partitions are.
>
>     Creation of an adjacent partition does not require touching this
>     control file as CPU reservation will be done automatically.
>     In order to create a remote partition, the CPUs needed by the
>     remote partition have to be written to this file first.
>
>     A "+" prefix can be used to indicate a list of additional
>     CPUs that are to be added without disturbing the CPUs that are
>     originally there.  For example, if its current value is "3-4",
>     echoing "+5" to it will change it to "3-5".
>
>     Once a remote partition is destroyed, its CPUs have to be
>     removed from this file or no other process can use them.  A "-"
>     prefix can be used to remove a list of CPUs from it.  However,
>     removing CPUs that are currently used in existing partitions
>     may cause those partitions to become invalid.  A single "-"
>     character without any number can be used to indicate removal
>     of all the free CPUs not allocated to any partitions to avoid
>     accidental partition invalidation.

Why is the syntax different from .cpus? Wouldn't it be better to keep them
the same?

>   cpuset.cpus.partition
>     A read-write single value file which exists on non-root
>     cpuset-enabled cgroups.  This flag is owned by the parent cgroup
>     and is not delegatable.
>
>     It accepts only the following input values when written to.
>
>       ==========    =====================================
>       "member"      Non-root member of a partition
>       "root"        Partition root
>       "isolated"    Partition root without load balancing
>       ==========    =====================================
>
>     A cpuset partition is a collection of cgroups with a partition
>     root at the top of the hierarchy and its descendants except
>     those that are separate partition roots themselves and their
>     descendants.  A partition has exclusive access to the set of
>     CPUs allocated to it.  Other cgroups outside of that partition
>     cannot use any CPUs in that set.
>
>     There are two types of partitions - adjacent and remote.  The
>     parent of an adjacent partition must be a valid partition root.
>     Partition roots of adjacent partitions are all clustered around
>     the root cgroup.  Creation of an adjacent partition is done by
>     writing the desired partition type into "cpuset.cpus.partition".
>
>     A remote partition does not require a partition root parent.
>     So a remote partition can be formed far from the root cgroup.
>     However, its creation is a 2-step process.  The CPUs needed
>     by a remote partition ("cpuset.cpus" of the partition root)
>     have to be written into "cpuset.cpus.reserve" of the root
>     cgroup first.  After that, "isolated" can be written into
>     "cpuset.cpus.partition" of the partition root to form a remote
>     isolated partition which is the only supported remote partition
>     type for now.
>
>     All remote partitions are terminal as adjacent partitions cannot
>     be created underneath them.

Can you elaborate this extra restriction a bit further?

In general, I think it'd be really helpful if the document explains the
reasoning behind the design decisions, i.e. what is reserving for? What
purpose does it serve that the regular isolated ones cannot? That'd help
clarify the design decisions.

Thanks.

--
tejun

2023-05-28 22:01:43

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" partition

On 5/22/23 15:49, Tejun Heo wrote:
> Hello, Waiman.

Sorry for the late reply as I had been off for almost 2 weeks due to PTO.


>
> On Sun, May 07, 2023 at 09:03:44PM -0400, Waiman Long wrote:
> ...
>>   cpuset.cpus.reserve
>>     A read-write multiple values file which exists only on the root
>>     cgroup.
>>
>>     It lists all the CPUs that are reserved for adjacent and remote
>>     partitions created in the system.  See the next section for
>>     more information on what an adjacent or remote partition is.
>>
>>     Creation of an adjacent partition does not require touching this
>>     control file as CPU reservation will be done automatically.
>>     In order to create a remote partition, the CPUs needed by the
>>     remote partition have to be written to this file first.
>>
>>     A "+" prefix can be used to indicate a list of additional
>>     CPUs that are to be added without disturbing the CPUs that are
>>     originally there.  For example, if its current value is "3-4",
>>     echoing ""+5" to it will change it to "3-5".
>>
>>     Once a remote partition is destroyed, its CPUs have to be
>>     removed from this file or no other process can use them.  A "-"
>>     prefix can be used to remove a list of CPUs from it.  However,
>>     removing CPUs that are currently used in existing partitions
>>     may cause those partitions to become invalid.  A single "-"
>>     character without any number can be used to indicate removal
>>     of all the free CPUs not allocated to any partitions to avoid
>>     accidental partition invalidation.
> Why is the syntax different from .cpus? Wouldn't it be better to keep them
> the same?

Unlike cpuset.cpus, cpuset.cpus.reserve is supposed to contain CPUs
that are used in multiple partitions. Also, automatic reservation of
adjacent partitions can happen in parallel. That is why I think it will
be safer if we allow incremental addition and removal of the reserve
CPUs to be used for remote partitions. I will include this reasoning in
the doc file.
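
As a concrete illustration, here is a minimal usage sketch of the
proposed incremental syntax (assuming these RFC patches are applied
and cgroup v2 is mounted at /sys/fs/cgroup):

    # All writes go to the root cgroup; cpuset.cpus.reserve only
    # exists there under this proposal.
    cd /sys/fs/cgroup

    echo "3-4" > cpuset.cpus.reserve   # reserve CPUs 3 and 4
    echo "+5"  > cpuset.cpus.reserve   # add CPU 5; value becomes "3-5"
    echo "-5"  > cpuset.cpus.reserve   # remove CPU 5 again
    echo "-"   > cpuset.cpus.reserve   # drop all free, unallocated CPUs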


>>   cpuset.cpus.partition
>>     A read-write single value file which exists on non-root
>>     cpuset-enabled cgroups.  This flag is owned by the parent cgroup
>>     and is not delegatable.
>>
>>     It accepts only the following input values when written to.
>>
>>       ==========    =====================================
>>       "member"    Non-root member of a partition
>>       "root"    Partition root
>>       "isolated"    Partition root without load balancing
>>       ==========    =====================================
>>
>>     A cpuset partition is a collection of cgroups with a partition
>>     root at the top of the hierarchy and its descendants except
>>     those that are separate partition roots themselves and their
>>     descendants.  A partition has exclusive access to the set of
>>     CPUs allocated to it.  Other cgroups outside of that partition
>>     cannot use any CPUs in that set.
>>
>>     There are two types of partitions - adjacent and remote.  The
>>     parent of an adjacent partition must be a valid partition root.
>>     Partition roots of adjacent partitions are all clustered around
>>     the root cgroup.  Creation of an adjacent partition is done by
>>     writing the desired partition type into "cpuset.cpus.partition".
>>
>>     A remote partition does not require a partition root parent.
>>     So a remote partition can be formed far from the root cgroup.
>>     However, its creation is a 2-step process.  The CPUs needed
>>     by a remote partition ("cpuset.cpus" of the partition root)
>>     have to be written into "cpuset.cpus.reserve" of the root
>>     cgroup first.  After that, "isolated" can be written into
>>     "cpuset.cpus.partition" of the partition root to form a remote
>>     isolated partition, which is the only supported remote partition
>>     type for now.
>>
>>     All remote partitions are terminal as adjacent partitions cannot
>>     be created underneath them.
> Can you elaborate this extra restriction a bit further?

Are you referring to the fact that only remote isolated partitions are
supported? I do not preclude the support of load-balanced remote
partitions. I keep it to isolated partitions for now for ease of
implementation, and I am not currently aware of a use case where such a
remote partition type is needed.

If you are talking about remote partitions being terminal, it is mainly
because it can be more tricky to support hierarchical adjacent
partitions underneath one, especially if it is not isolated. We can
certainly support that if a use case arises. I just don't want to
implement code that nobody is really going to use.

BTW, with the current way the remote partition is created, it is not
possible to have another remote partition underneath it.
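
To make the 2-step flow concrete, here is a minimal sketch of creating
a remote isolated partition under this proposal (the container path and
CPU numbers are made up for the example; cgroup v2 is assumed to be
mounted at /sys/fs/cgroup):

    # Step 1: reserve the CPUs in the root cgroup's reserve file.
    echo "+8-11" > /sys/fs/cgroup/cpuset.cpus.reserve

    # Step 2: give the same CPUs to the would-be partition root, then
    # switch it to "isolated" to pull them in as a remote partition.
    echo "8-11" > /sys/fs/cgroup/kube.slice/pod1/cpuset.cpus
    echo "isolated" > /sys/fs/cgroup/kube.slice/pod1/cpuset.cpus.partition

    # Confirm that the partition became valid.
    cat /sys/fs/cgroup/kube.slice/pod1/cpuset.cpus.partition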

>
> In general, I think it'd be really helpful if the document explains the
> reasoning behind the design decisions. ie. What is reserving for? What
> purpose does it serve that the regular isolated ones cannot? That'd help
> clarify the design decisions.

I understand your concern. If you think it is better to support both
types of remote partitions or hierarchical adjacent partitions
underneath it for symmetry purpose, I can certain do that. It just needs
to take a bit more time.

Cheers,
Longman


2023-06-05 18:32:02

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition

Hello, Waiman.

On Sun, May 28, 2023 at 05:18:50PM -0400, Waiman Long wrote:
> On 5/22/23 15:49, Tejun Heo wrote:
> Sorry for the late reply as I had been off for almost 2 weeks due to PTO.

And me too. Just moved.

> > Why is the syntax different from .cpus? Wouldn't it be better to keep them
> > the same?
>
> Unlike cpuset.cpus, cpuset.cpus.reserve is supposed to contain CPUs that
> are used in multiple partitions. Also, automatic reservation of adjacent
> partitions can happen in parallel. That is why I think it will be safer if

Ah, I see, this is because cpu.reserve is only in the root cgroup, so you
can't say that the knob is owned by the parent cgroup and thus access is
controlled that way.

...
> > >     There are two types of partitions - adjacent and remote.  The
> > >     parent of an adjacent partition must be a valid partition root.
> > >     Partition roots of adjacent partitions are all clustered around
> > >     the root cgroup.  Creation of an adjacent partition is done by
> > >     writing the desired partition type into "cpuset.cpus.partition".
> > >
> > >     A remote partition does not require a partition root parent.
> > >     So a remote partition can be formed far from the root cgroup.
> > >     However, its creation is a 2-step process.  The CPUs needed
> > >     by a remote partition ("cpuset.cpus" of the partition root)
> > >     have to be written into "cpuset.cpus.reserve" of the root
> > >     cgroup first.  After that, "isolated" can be written into
> > >     "cpuset.cpus.partition" of the partition root to form a remote
> > >     isolated partition, which is the only supported remote partition
> > >     type for now.
> > >
> > >     All remote partitions are terminal as adjacent partitions cannot
> > >     be created underneath them.
> >
> > Can you elaborate this extra restriction a bit further?
>
> Are you referring to the fact that only remote isolated partitions are
> supported? I do not preclude the support of load-balanced remote
> partitions. I keep it to isolated partitions for now for ease of
> implementation, and I am not currently aware of a use case where such a
> remote partition type is needed.
>
> If you are talking about remote partitions being terminal, it is mainly
> because it can be more tricky to support hierarchical adjacent partitions
> underneath one, especially if it is not isolated. We can certainly support
> that if a use case arises. I just don't want to implement code that nobody
> is really going to use.
>
> BTW, with the current way the remote partition is created, it is not
> possible to have another remote partition underneath it.

The fact that the control is spread across a root-only file and per-cgroup
file seems hacky to me. e.g. How would it interact with namespacing? Are
there reasons why this can't be properly hierarchical other than the amount
of work needed? For example:

cpuset.cpus.exclusive is a per-cgroup file and represents the mask of CPUs
that the cgroup holds exclusively. The mask is always a subset of
cpuset.cpus. The parent loses access to a CPU when the CPU is given to a
child by setting the CPU in the child's cpus.exclusive and the CPU can't
be given to more than one child. IOW, exclusive CPUs are available only to
the leaf cgroups that have them set in their .exclusive file.

When a cgroup is turned into a partition, its cpuset.cpus and
cpuset.cpus.exclusive should be the same. For backward compatibility, if
the cgroup's parent is already a partition, cpuset will automatically
attempt to add all cpus in cpuset.cpus into cpuset.cpus.exclusive.
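
For illustration, usage of such an interface might look like the
following sketch (the paths and CPU numbers are assumptions, and this
reflects the proposal above rather than an existing kernel ABI):

    # The parent grants exclusive ownership of CPUs 2-3 to a child.
    # Those CPUs then stop appearing in the cpus.effective of cgroups
    # that don't have them in their own cpus.exclusive.
    echo "2-3" > /sys/fs/cgroup/parent/child/cpuset.cpus
    echo "2-3" > /sys/fs/cgroup/parent/child/cpuset.cpus.exclusive

    # With cpuset.cpus == cpuset.cpus.exclusive, the child can now be
    # turned into a partition without a partition root parent.
    echo "isolated" > /sys/fs/cgroup/parent/child/cpuset.cpus.partition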

I could well be missing something important but I'd really like to see
something like the above where the reservation feature blends in with the
rest of cpuset.

Thanks.

--
tejun

2023-06-05 20:20:11

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition

On 6/5/23 14:03, Tejun Heo wrote:
> Hello, Waiman.
>
> On Sun, May 28, 2023 at 05:18:50PM -0400, Waiman Long wrote:
>> On 5/22/23 15:49, Tejun Heo wrote:
>> Sorry for the late reply as I had been off for almost 2 weeks due to PTO.
> And me too. Just moved.
>
>>> Why is the syntax different from .cpus? Wouldn't it be better to keep them
>>> the same?
>> Unlike cpuset.cpus, cpuset.cpus.reserve is supposed to contain CPUs that
>> are used in multiple partitions. Also, automatic reservation of adjacent
>> partitions can happen in parallel. That is why I think it will be safer if
> Ah, I see, this is because cpu.reserve is only in the root cgroup, so you
> can't say that the knob is owned by the parent cgroup and thus access is
> controlled that way.
>
> ...
>>>>     There are two types of partitions - adjacent and remote.  The
>>>>     parent of an adjacent partition must be a valid partition root.
>>>>     Partition roots of adjacent partitions are all clustered around
>>>>     the root cgroup.  Creation of an adjacent partition is done by
>>>>     writing the desired partition type into "cpuset.cpus.partition".
>>>>
>>>>     A remote partition does not require a partition root parent.
>>>>     So a remote partition can be formed far from the root cgroup.
>>>>     However, its creation is a 2-step process.  The CPUs needed
>>>>     by a remote partition ("cpuset.cpus" of the partition root)
>>>>     have to be written into "cpuset.cpus.reserve" of the root
>>>>     cgroup first.  After that, "isolated" can be written into
>>>>     "cpuset.cpus.partition" of the partition root to form a remote
>>>>     isolated partition, which is the only supported remote partition
>>>>     type for now.
>>>>
>>>>     All remote partitions are terminal as adjacent partitions cannot
>>>>     be created underneath them.
>>> Can you elaborate this extra restriction a bit further?
>> Are you referring to the fact that only remote isolated partitions are
>> supported? I do not preclude the support of load-balanced remote
>> partitions. I keep it to isolated partitions for now for ease of
>> implementation, and I am not currently aware of a use case where such a
>> remote partition type is needed.
>>
>> If you are talking about remote partitions being terminal, it is mainly
>> because it can be more tricky to support hierarchical adjacent partitions
>> underneath one, especially if it is not isolated. We can certainly support
>> that if a use case arises. I just don't want to implement code that nobody
>> is really going to use.
>>
>> BTW, with the current way the remote partition is created, it is not
>> possible to have another remote partition underneath it.
> The fact that the control is spread across a root-only file and per-cgroup
> file seems hacky to me. e.g. How would it interact with namespacing? Are
> there reasons why this can't be properly hierarchical other than the amount
> of work needed? For example:
>
> cpuset.cpus.exclusive is a per-cgroup file and represents the mask of CPUs
> that the cgroup holds exclusively. The mask is always a subset of
> cpuset.cpus. The parent loses access to a CPU when the CPU is given to a
> child by setting the CPU in the child's cpus.exclusive and the CPU can't
> be given to more than one child. IOW, exclusive CPUs are available only to
> the leaf cgroups that have them set in their .exclusive file.
>
> When a cgroup is turned into a partition, its cpuset.cpus and
> cpuset.cpus.exclusive should be the same. For backward compatibility, if
> the cgroup's parent is already a partition, cpuset will automatically
> attempt to add all cpus in cpuset.cpus into cpuset.cpus.exclusive.
>
> I could well be missing something important but I'd really like to see
> something like the above where the reservation feature blends in with the
> rest of cpuset.

It can certainly be made hierarchical as you suggest. It does increase
complexity from both the user and kernel points of view.

From the user point of view, there is one more knob to manage
hierarchically, which is not used that often.

From the kernel point of view, we may need to have one more cpumask per
cpuset, as the current subparts_cpus is used to track automatic
reservation. We need another cpumask to contain extra exclusive CPUs
not allocated through automatic reservation. You mention this new
control file as a list of exclusively owned CPUs for this cgroup.
Creating a partition is in fact allocating exclusive CPUs to a cgroup.
So it kind of overlaps with the cpuset.cpus.partition file. Can we fail
a write to cpuset.cpus.exclusive if those exclusive CPUs cannot be
granted, or will this exclusive list only be valid if a valid partition
can be formed? We need to properly manage the dependency between these
two control files.

Alternatively, I have no problem exposing cpuset.cpus.exclusive as a
read-only file. It is a bit problematic if we need to make it writable.

As for namespacing, you do raise a good point. I was thinking mostly
from a whole-system point of view, as the use case that I am aware of
does not need that. To allow delegation of exclusive CPUs to a child
cgroup, that cgroup has to be a partition root itself. One compromise
that I can think of is to allow automatic reservation only in such a
scenario. In that case, I need to support a remote load-balanced
partition as well as hierarchical sub-partitions underneath it. That
can be done with some extra code on top of the existing v2 patchset
without introducing too much complexity.

IOW, the use of remote partitions is only allowed at the whole-system
level where one has access to the cgroup root. Distribution of
exclusive CPUs within a container can only be done via adjacent
partitions with automatic reservation. Will that be a good enough
compromise from your point of view?

Cheers,
Longman


2023-06-05 20:50:26

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition

Hello,

On Mon, Jun 05, 2023 at 04:00:39PM -0400, Waiman Long wrote:
...
> > file seems hacky to me. e.g. How would it interact with namespacing? Are
> > there reasons why this can't be properly hierarchical other than the amount
> > of work needed? For example:
> >
> > cpuset.cpus.exclusive is a per-cgroup file and represents the mask of CPUs
> > that the cgroup holds exclusively. The mask is always a subset of
> > cpuset.cpus. The parent loses access to a CPU when the CPU is given to a
> > child by setting the CPU in the child's cpus.exclusive and the CPU can't
> > be given to more than one child. IOW, exclusive CPUs are available only to
> > the leaf cgroups that have them set in their .exclusive file.
> >
> > When a cgroup is turned into a partition, its cpuset.cpus and
> > cpuset.cpus.exclusive should be the same. For backward compatibility, if
> > the cgroup's parent is already a partition, cpuset will automatically
> > attempt to add all cpus in cpuset.cpus into cpuset.cpus.exclusive.
> >
> > I could well be missing something important but I'd really like to see
> > something like the above where the reservation feature blends in with the
> > rest of cpuset.
>
> It can certainly be made hierarchical as you suggest. It does increase
> complexity from both the user and kernel points of view.
>
> From the user point of view, there is one more knob to manage hierarchically,
> which is not used that often.

From user pov, this only affects them when they want to create partitions
down the tree, right?

> From the kernel point of view, we may need to have one more cpumask per
> cpuset as the current subparts_cpus is used to track automatic reservation.
> We need another cpumask to contain extra exclusive CPUs not allocated
> through automatic reservation. You mention this new control file as a
> list of exclusively owned CPUs for this cgroup. Creating a partition is
> in fact allocating exclusive CPUs to a cgroup. So it kind of overlaps
> with the cpuset.cpus.partition file. Can we fail a write to

Yes, it substitutes and expands on cpuset.cpus.partition behavior.

> cpuset.cpus.exclusive if those exclusive CPUs cannot be granted, or will
> this exclusive list only be valid if a valid partition can be formed? We
> need to properly manage the dependency between these two control files.

So, I think cpus.exclusive can become the sole mechanism to arbitrate
exclusive ownership of CPUs and .partition can depend on .exclusive.

> Alternatively, I have no problem exposing cpuset.cpus.exclusive as a
> read-only file. It is a bit problematic if we need to make it writable.

I don't follow. How would remote partitions work then?

> As for namespacing, you do raise a good point. I was thinking mostly from a
> whole-system point of view, as the use case that I am aware of does not need
> that. To allow delegation of exclusive CPUs to a child cgroup, that cgroup
> has to be a partition root itself. One compromise that I can think of is to
> allow automatic reservation only in such a scenario. In that case, I need
> to support a remote load-balanced partition as well as hierarchical
> sub-partitions underneath it. That can be done with some extra code on top
> of the existing v2 patchset without introducing too much complexity.
>
> IOW, the use of remote partitions is only allowed at the whole-system level
> where one has access to the cgroup root. Distribution of exclusive CPUs
> within a container can only be done via adjacent partitions with automatic
> reservation. Will that be a good enough compromise from your point of view?

It seems too twisted to me. I'd much prefer it to be better integrated with
the rest of cpuset.

Thanks.

--
tejun

2023-06-06 03:40:17

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition

On 6/5/23 16:27, Tejun Heo wrote:
> Hello,
>
> On Mon, Jun 05, 2023 at 04:00:39PM -0400, Waiman Long wrote:
> ...
>>> file seems hacky to me. e.g. How would it interact with namespacing? Are
>>> there reasons why this can't be properly hierarchical other than the amount
>>> of work needed? For example:
>>>
>>> cpuset.cpus.exclusive is a per-cgroup file and represents the mask of CPUs
>>> that the cgroup holds exclusively. The mask is always a subset of
>>> cpuset.cpus. The parent loses access to a CPU when the CPU is given to a
>>> child by setting the CPU in the child's cpus.exclusive and the CPU can't
>>> be given to more than one child. IOW, exclusive CPUs are available only to
>>> the leaf cgroups that have them set in their .exclusive file.
>>>
>>> When a cgroup is turned into a partition, its cpuset.cpus and
>>> cpuset.cpus.exclusive should be the same. For backward compatibility, if
>>> the cgroup's parent is already a partition, cpuset will automatically
>>> attempt to add all cpus in cpuset.cpus into cpuset.cpus.exclusive.
>>>
>>> I could well be missing something important but I'd really like to see
>>> something like the above where the reservation feature blends in with the
>>> rest of cpuset.
>> It can certainly be made hierarchical as you suggest. It does increase
>> complexity from both the user and kernel points of view.
>>
>> From the user point of view, there is one more knob to manage hierarchically,
>> which is not used that often.
> From user pov, this only affects them when they want to create partitions
> down the tree, right?
>
>> From the kernel point of view, we may need to have one more cpumask per
>> cpuset as the current subparts_cpus is used to track automatic reservation.
>> We need another cpumask to contain extra exclusive CPUs not allocated
>> through automatic reservation. You mention this new control file as a
>> list of exclusively owned CPUs for this cgroup. Creating a partition is
>> in fact allocating exclusive CPUs to a cgroup. So it kind of overlaps
>> with the cpuset.cpus.partition file. Can we fail a write to
> Yes, it substitutes and expands on cpuset.cpus.partition behavior.
>
>> cpuset.cpus.exclusive if those exclusive CPUs cannot be granted, or will
>> this exclusive list only be valid if a valid partition can be formed? We
>> need to properly manage the dependency between these two control files.
> So, I think cpus.exclusive can become the sole mechanism to arbitrate
> exclusive ownership of CPUs and .partition can depend on .exclusive.
>
>> Alternatively, I have no problem exposing cpuset.cpus.exclusive as a
>> read-only file. It is a bit problematic if we need to make it writable.
> I don't follow. How would remote partitions work then?

I had a different idea on the semantics of cpuset.cpus.exclusive at
the beginning. My original thinking was that it held the actual
exclusive CPUs that are allocated to the cgroup. Now if we treat this
as a hint of which exclusive CPUs should be used, it becomes valid
only if the cgroup can become a valid partition. I can see it as a
value that can be set hierarchically throughout the whole cpuset
hierarchy.

So a transition to a valid partition is possible iff

1) cpuset.cpus.exclusive is a subset of cpuset.cpus and is a subset of
cpuset.cpus.exclusive of all its ancestors.
2) If its parent is not a partition root, none of the CPUs in
cpuset.cpus.exclusive are currently allocated to other partitions. This
is the same remote partition concept as in my v2 patch. If its parent
is a partition root, part of its exclusive CPUs will be distributed to
this child partition, like the current behavior of cpuset partitions.
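
A hypothetical example of rule 1 (the cgroup names and CPU numbers are
made up for illustration, and A's cpuset.cpus is assumed to already
contain CPUs 4-7):

    # cpuset.cpus.exclusive may only shrink, or stay equal, down the tree.
    echo "4-7" > /sys/fs/cgroup/A/cpuset.cpus.exclusive
    echo "4-5" > /sys/fs/cgroup/A/B/cpuset.cpus
    echo "4-5" > /sys/fs/cgroup/A/B/cpuset.cpus.exclusive

    # B's exclusive CPUs (4-5) are a subset of its own cpuset.cpus and
    # of A's cpuset.cpus.exclusive, so B can become a valid partition.
    echo "isolated" > /sys/fs/cgroup/A/B/cpuset.cpus.partition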

I can rework my patch to adopt this model if it is what you have in mind.

Thanks,
Longman


2023-06-06 20:14:40

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition

Hello, Waiman.

On Mon, Jun 05, 2023 at 10:47:08PM -0400, Waiman Long wrote:
...
> I had a different idea on the semantics of cpuset.cpus.exclusive at the
> beginning. My original thinking was that it held the actual exclusive CPUs
> that are allocated to the cgroup. Now if we treat this as a hint of which
> exclusive CPUs should be used, it becomes valid only if the cgroup can

I wouldn't call it a hint. It's still hard allocation of the CPUs to the
cgroups that own them. Setting up a partition requires exclusive CPUs and
thus would depend on exclusive allocations set up accordingly.

> become a valid partition. I can see it as a value that can be hierarchically
> set throughout the whole cpuset hierarchy.
>
> So a transition to a valid partition is possible iff
>
> 1) cpuset.cpus.exclusive is a subset of cpuset.cpus and is a subset of
> cpuset.cpus.exclusive of all its ancestors.

Yes.

> 2) If its parent is not a partition root, none of the CPUs in
> cpuset.cpus.exclusive are currently allocated to other partitions. This is the

Not just that, the CPUs aren't available to cgroups which don't have them
set in the .exclusive file. IOW, if a CPU is in cpus.exclusive of some
cgroups, it shouldn't appear in cpus.effective of cgroups which don't have
the CPU in their cpus.exclusive.

So, .exclusive explicitly establishes exclusive ownership of CPUs and
partitions depend on that with an implicit "turn CPUs exclusive" behavior in
case the parent is a partition root for backward compatibility.

> same remote partition concept as in my v2 patch. If its parent is a partition
> root, part of its exclusive CPUs will be distributed to this child partition,
> like the current behavior of cpuset partitions.

Yes, similar in a sense. Please do away with the "once .reserve is used, the
behavior is switched" part. Instead, it can be sth like "if the parent is a
partition root, cpuset implicitly tries to set all CPUs in its cpus file in
its cpus.exclusive file" so that user-visible behavior stays unchanged
depending on past history.

Thanks.

--
tejun

2023-06-06 20:22:09

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition

On 6/6/23 15:58, Tejun Heo wrote:
> Hello, Waiman.
>
> On Mon, Jun 05, 2023 at 10:47:08PM -0400, Waiman Long wrote:
> ...
>> I had a different idea on the semantics of cpuset.cpus.exclusive at the
>> beginning. My original thinking was that it held the actual exclusive CPUs
>> that are allocated to the cgroup. Now if we treat this as a hint of which
>> exclusive CPUs should be used, it becomes valid only if the cgroup can
> I wouldn't call it a hint. It's still hard allocation of the CPUs to the
> cgroups that own them. Setting up a partition requires exclusive CPUs and
> thus would depend on exclusive allocations set up accordingly.
>
>> become a valid partition. I can see it as a value that can be hierarchically
>> set throughout the whole cpuset hierarchy.
>>
>> So a transition to a valid partition is possible iff
>>
>> 1) cpuset.cpus.exclusive is a subset of cpuset.cpus and is a subset of
>> cpuset.cpus.exclusive of all its ancestors.
> Yes.
>
>> 2) If its parent is not a partition root, none of the CPUs in
>> cpuset.cpus.exclusive are currently allocated to other partitions. This is the
> Not just that, the CPUs aren't available to cgroups which don't have them
> set in the .exclusive file. IOW, if a CPU is in cpus.exclusive of some
> cgroups, it shouldn't appear in cpus.effective of cgroups which don't have
> the CPU in their cpus.exclusive.
>
> So, .exclusive explicitly establishes exclusive ownership of CPUs and
> partitions depend on that with an implicit "turn CPUs exclusive" behavior in
> case the parent is a partition root for backward compatibility.
The current CPU exclusive behavior is limited to sibling cgroups only.
Because of the hierarchical nature of CPU distribution, the set of
exclusive CPUs has to appear in all of the cgroup's ancestors. When a
partition is enabled, we do a sibling exclusivity test at that point
to verify that it is exclusive. It looks like you want to do an
exclusivity test even when the partition isn't active. I can certainly
do that when the file is being updated. However, it will fail the
write if the exclusivity test fails, just like the v1
cpuset.cpus.exclusive flag, if you are OK with that.
>
>> same remote partition concept as in my v2 patch. If its parent is a partition
>> root, part of its exclusive CPUs will be distributed to this child partition,
>> like the current behavior of cpuset partitions.
> Yes, similar in a sense. Please do away with the "once .reserve is used, the
> behavior is switched" part.

That behavior is already gone in my v2 patch.

> Instead, it can be sth like "if the parent is a
> partition root, cpuset implicitly tries to set all CPUs in its cpus file in
> its cpus.exclusive file" so that user-visible behavior stays unchanged
> depending on past history.

If the parent is a partition root, auto reservation will be done and
cpus.exclusive will be set automatically, just like before. So existing
applications using partitions will not be affected.

Cheers,
Longman


2023-06-06 20:25:20

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] cgroup/cpuset: A new "isolcpus" paritition

Hello,

On Tue, Jun 06, 2023 at 04:11:02PM -0400, Waiman Long wrote:
...
> The current CPU exclusive behavior is limited to sibling cgroups only.
> Because of the hierarchical nature of CPU distribution, the set of exclusive
> CPUs has to appear in all of the cgroup's ancestors. When a partition is
> enabled, we do a sibling exclusivity test at that point to verify that it is
> exclusive. It looks like you want to do an exclusivity test even when the
> partition isn't active. I can certainly do that when the file is being
> updated. However, it will fail the write if the exclusivity test fails, just
> like the v1 cpuset.cpus.exclusive flag, if you are OK with that.

Yeah, doesn't look like there's a way around it if we want to make
.exclusive a feature which is useful on its own.

> > Instead, it can be sth like "if the parent is a
> > partition root, cpuset implicitly tries to set all CPUs in its cpus file in
> > its cpus.exclusive file" so that user-visible behavior stays unchanged
> > depending on past history.
>
> If the parent is a partition root, auto reservation will be done and
> cpus.exclusive will be set automatically, just like before. So existing
> applications using partitions will not be affected.

Sounds great.

Thanks.

--
tejun