2022-10-07 10:50:01

by Peter Newman

Subject: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Reinette, Fenghua,

I'd like to talk about the tasks file interface in CTRL_MON and MON
groups.

For some background, we are using the memory-bandwidth monitoring and
allocation features of resctrl to maintain QoS on external memory
bandwidth for latency-sensitive containers to help enable batch
containers to use up leftover CPU/memory resources on a machine. We
also monitor the external memory bandwidth usage of all hosted
containers to identify ones which are misusing their latency-sensitive
CoS assignment and downgrade them to the batch CoS.

The trouble is, container manager developers working with the tasks
interface have complained that it's not usable for them because it takes
many (or an unbounded number of) passes to move all tasks from a
container over, as the list is always changing.

Our solution for them is to remove the need for moving tasks between
CTRL_MON groups. Because we are mainly using MB throttling to implement
QoS, we only need two classes of service. Therefore we've modified
resctrl to reuse existing CLOSIDs for CTRL_MON groups with identical
configurations, allowing us to create a CTRL_MON group for every
container. Instead of moving the tasks over, we only need to update
their CTRL_MON group's schemata. Another benefit for us is that we do
not need to also move all of the tasks over to a new monitoring group in
the batch CTRL_MON group, and the usage counts remain intact.

The CLOSID management rules would roughly be:

1. If an update would cause a CTRL_MON group's config to match that of
an existing group, the CTRL_MON group's CLOSID should change to that
of the existing group, where the definition of "match" is: all
control values match in all domains for all resources, as well as
the cpu masks matching.

2. If an update to a CTRL_MON group sharing a CLOSID with another group
causes that group to no longer match any others, a new CLOSID must
be allocated.

3. An update to a CTRL_MON group using a non-shared CLOSID which
continues to not match any others follows the current resctrl
behavior.
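
To make these rules concrete, here is a minimal sketch of what the
re-match step could look like after a schemata write. The helper names
are hypothetical (find_config_match(), closid_get(), closid_put(),
closid_refcount()), closid_alloc() is resctrl's existing allocator, and
propagating the new CLOSID to the group's tasks and CPUs is omitted:

static int rematch_closid(struct rdtgroup *rdtgrp)
{
    /*
     * Hypothetical: find another group whose control values match
     * ours in all domains for all resources, cpu mask included.
     */
    struct rdtgroup *other = find_config_match(rdtgrp);

    if (other) {
        /* Rule 1: adopt the matching group's CLOSID. */
        closid_get(other->closid);
        closid_put(rdtgrp->closid);
        rdtgrp->closid = other->closid;
    } else if (closid_refcount(rdtgrp->closid) > 1) {
        /* Rule 2: no longer matching anyone, so stop sharing. */
        int closid = closid_alloc();

        if (closid < 0)
            return closid;  /* all CLOSIDs already in use */
        closid_put(rdtgrp->closid);
        rdtgrp->closid = closid;
    }
    /* Rule 3: a non-shared CLOSID that still matches nobody is kept. */
    return 0;
}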

Before I prepare any patches for review, I'm interested in any comments
or suggestions on the use case and solution.

Are there simpler strategies for reassigning a running container's tasks
to a different CTRL_MON group that we should be considering first?

Any concerns about the CLOSID-reusing behavior? The hope is existing
users who aren't creating identically-configured CTRL_MON groups would
be minimally impacted. Would it help if the proposed behavior were
opt-in at mount-time?

Thanks!
-Peter


2022-10-07 15:47:26

by Reinette Chatre

Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

+Tony

On 10/7/2022 3:39 AM, Peter Newman wrote:
> Hi Reinette, Fenghua,
>
> I'd like to talk about the tasks file interface in CTRL_MON and MON
> groups.
>
> For some background, we are using the memory-bandwidth monitoring and
> allocation features of resctrl to maintain QoS on external memory
> bandwidth for latency-sensitive containers to help enable batch
> containers to use up leftover CPU/memory resources on a machine. We
> also monitor the external memory bandwidth usage of all hosted
> containers to identify ones which are misusing their latency-sensitive
> CoS assignment and downgrade them to the batch CoS.
>
> The trouble is, container manager developers working with the tasks
> interface have complained that it's not usable for them because it takes
> many (or an unbounded number of) passes to move all tasks from a
> container over, as the list is always changing.
>
> Our solution for them is to remove the need for moving tasks between
> CTRL_MON groups. Because we are mainly using MB throttling to implement
> QoS, we only need two classes of service. Therefore we've modified
> resctrl to reuse existing CLOSIDs for CTRL_MON groups with identical
> configurations, allowing us to create a CTRL_MON group for every
> container. Instead of moving the tasks over, we only need to update
> their CTRL_MON group's schemata. Another benefit for us is that we do
> not need to also move all of the tasks over to a new monitoring group in
> the batch CTRL_MON group, and the usage counts remain intact.
>
> The CLOSID management rules would roughly be:
>
> 1. If an update would cause a CTRL_MON group's config to match that of
> an existing group, the CTRL_MON group's CLOSID should change to that
> of the existing group, where the definition of "match" is: all
> control values match in all domains for all resources, as well as
> the cpu masks matching.
>
> 2. If an update to a CTRL_MON group sharing a CLOSID with another group
> causes that group to no longer match any others, a new CLOSID must
> be allocated.
>
> 3. An update to a CTRL_MON group using a non-shared CLOSID which
> continues to not match any others follows the current resctrl
> behavior.
>
> Before I prepare any patches for review, I'm interested in any comments
> or suggestions on the use case and solution.
>
> Are there simpler strategies for reassigning a running container's tasks
> to a different CTRL_MON group that we should be considering first?
>
> Any concerns about the CLOSID-reusing behavior? The hope is existing
> users who aren't creating identically-configured CTRL_MON groups would
> be minimally impacted. Would it help if the proposed behavior were
> opt-in at mount-time?
>
> Thanks!
> -Peter

2022-10-07 16:37:59

by Fenghua Yu

Subject: RE: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi, Peter,

> On 10/7/2022 3:39 AM, Peter Newman wrote:
> > Hi Reinette, Fenghua,
> >
> > I'd like to talk about the tasks file interface in CTRL_MON and MON
> > groups.
> >
> > For some background, we are using the memory-bandwidth monitoring and
> > allocation features of resctrl to maintain QoS on external memory
> > bandwidth for latency-sensitive containers to help enable batch
> > containers to use up leftover CPU/memory resources on a machine. We
> > also monitor the external memory bandwidth usage of all hosted
> > containers to identify ones which are misusing their latency-sensitive
> > CoS assignment and downgrade them to the batch CoS.
> >
> > The trouble is, container manager developers working with the tasks
> > interface have complained that it's not usable for them because it
> > takes many (or an unbounded number of) passes to move all tasks from a
> > container over, as the list is always changing.

Are the "all tasks" children of the container process? Is it possible to move the
parent container process and all of its child tasks to a different group in one shot
instead of one by one?

> >
> > Our solution for them is to remove the need for moving tasks between
> > CTRL_MON groups. Because we are mainly using MB throttling to
> > implement QoS, we only need two classes of service. Therefore we've
> > modified resctrl to reuse existing CLOSIDs for CTRL_MON groups with
> > identical configurations, allowing us to create a CTRL_MON group for
> > every container. Instead of moving the tasks over, we only need to
> > update their CTRL_MON group's schemata. Another benefit for us is that
> > we do not need to also move all of the tasks over to a new monitoring
> > group in the batch CTRL_MON group, and the usage counts remain intact.
> >
> > The CLOSID management rules would roughly be:
> >
> > 1. If an update would cause a CTRL_MON group's config to match that of
> > an existing group, the CTRL_MON group's CLOSID should change to that
> > of the existing group, where the definition of "match" is: all
> > control values match in all domains for all resources, as well as
> > the cpu masks matching.
> >
> > 2. If an update to a CTRL_MON group sharing a CLOSID with another group
> > causes that group to no longer match any others, a new CLOSID must
> > be allocated.
> >
> > 3. An update to a CTRL_MON group using a non-shared CLOSID which
> > continues to not match any others follows the current resctrl
> > behavior.
> >
> > Before I prepare any patches for review, I'm interested in any
> > comments or suggestions on the use case and solution.
> >
> > Are there simpler strategies for reassigning a running container's
> > tasks to a different CTRL_MON group that we should be considering first?
> >
> > Any concerns about the CLOSID-reusing behavior? The hope is existing
> > users who aren't creating identically-configured CTRL_MON groups would
> > be minimally impacted. Would it help if the proposed behavior were
> > opt-in at mount-time?
> >

Thanks.

-Fenghua

2022-10-07 17:46:56

by Luck, Tony

Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

On Fri, Oct 07, 2022 at 08:44:53AM -0700, Yu, Fenghua wrote:
> Hi, Peter,
>
> > On 10/7/2022 3:39 AM, Peter Newman wrote:

> > > The CLOSID management rules would roughly be:
> > >
> > > 1. If an update would cause a CTRL_MON group's config to match that of
> > > an existing group, the CTRL_MON group's CLOSID should change to that
> > > of the existing group, where the definition of "match" is: all
> > > control values match in all domains for all resources, as well as
> > > the cpu masks matching.

So the micro steps are:

# mkdir newgroup
# New groups are created with maximum resources. So this might
# match the root/default group (if the root schemata had not
# been edited) ... so you could re-use CLOSID=0 for this, or
# perhaps allocate a new CLOSID
# edit newgroup/schemata
# if this update makes this schemata match some other group,
# then update the CLOSID for this group to be same as the other
# group.
> > >
> > > 2. If an update to a CTRL_MON group sharing a CLOSID with another group
> > > causes that group to no longer match any others, a new CLOSID must
> > > be allocated.
# So you have reference counts for CLOSIDs for how many groups
# share it. In the above example the change to the schemata and
# allocation of a new CLOSID would decrement the reference count
# and free the old CLOSID if it goes to zero
> > >
> > > 3. An update to a CTRL_MON group using a non-shared CLOSID which
> > > continues to not match any others follows the current resctrl
> > > behavior.
# An update to a CTRL_MON group that has a CLOSID reference
# count > 1 would try to allocate a new CLOSID if the new
# schemata doesn't match any other group. If all CLOSIDs are
# already in use, the write(2) to the schemata file must fail
# ... maybe -ENOSPC is the right error code?
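
A rough sketch of that bookkeeping, with one count per CLOSID (the
array size and helper names are hypothetical, matching the sketch in
Peter's mail; closid_free() is the existing allocator-side function):

/* Hypothetical: one reference count per supported CLOSID. */
static unsigned int closid_refcnt[MAX_CLOSIDS];

static void closid_get(u32 closid)
{
    closid_refcnt[closid]++;
}

static void closid_put(u32 closid)
{
    if (!--closid_refcnt[closid])
        closid_free(closid);    /* last group using it frees the CLOSID */
}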

Note that if the root/default CTRL_MON had been edited you might not be
able to create a new group (even though you intend to make it match some
existing group and share a CLOSID). Perhaps we could change existing
semantics so that new groups copy the root group schemata instead of
being maximally permissible with all resources?
> > >
> > > Before I prepare any patches for review, I'm interested in any
> > > comments or suggestions on the use case and solution.
> > >
> > > Are there simpler strategies for reassigning a running container's
> > > tasks to a different CTRL_MON group that we should be considering first?

Do tasks in a container share a "process group"? If they do, then a
simpler option would be some syntax to assign a process group to a
resctrl group (perhaps as a negative task-id? or with a "G" prefix??).

Or is there some other simple way to enumerate all the tasks in a
container with some syntax that is convenient for both the user and the
kernel? If there is, then add code to allow something like:
# echo C{containername} > tasks
and have the resctrl code move all tasks en masse.

Yet another option would be syntax to apply the move recursively to all
descendents of the given task id.

# echo R{process-id} > tasks

I don't know how complex it would be for the kernel to implement this. Or
whether it would meet Google's needs.
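
For the R{process-id} form, the kernel side might be a depth-first walk
over the descendants, reusing resctrl's existing per-task move helper.
This is only a sketch: it assumes the caller holds tasklist_lock,
ignores errors, and the races with fork() during the walk are exactly
the part that would need real design work:

/* Hypothetical: move @tsk, its threads and all descendants to @rdtgrp. */
static void rdtgroup_move_task_tree(struct task_struct *tsk,
                                    struct rdtgroup *rdtgrp)
{
    struct task_struct *child, *t;

    for_each_thread(tsk, t)
        __rdtgroup_move_task(t, rdtgrp);
    list_for_each_entry(child, &tsk->children, sibling)
        rdtgroup_move_task_tree(child, rdtgrp);
}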

> > > Any concerns about the CLOSID-reusing behavior? The hope is existing
> > > users who aren't creating identically-configured CTRL_MON groups would
> > > be minimally impacted. Would it help if the proposed behavior were
> > > opt-in at mount-time?

I would suppose that few users are *deliberately* creating groups with
identical schemata files (doesn't seem like there is a use case for
this). So I agree with your "minimal impact" assessment.

I think I'd prefer you explore modes for bulk moving tasks in a
container before going to the shared-CLOSID path.

-Tony

2022-10-07 18:05:21

by Moger, Babu

Subject: RE: [RFD] resctrl: reassigning a running container's CTRL_MON group


Hi Peter,

> -----Original Message-----
> From: Peter Newman <[email protected]>
> Sent: Friday, October 7, 2022 5:40 AM
> To: Reinette Chatre <[email protected]>; Fenghua Yu
> <[email protected]>
> Cc: Stephane Eranian <[email protected]>; [email protected];
> Thomas Gleixner <[email protected]>; James Morse
> <[email protected]>; Moger, Babu <[email protected]>
> Subject: [RFD] resctrl: reassigning a running container's CTRL_MON group
>
> Hi Reinette, Fenghua,
>
> I'd like to talk about the tasks file interface in CTRL_MON and MON groups.
>
> For some background, we are using the memory-bandwidth monitoring and
> allocation features of resctrl to maintain QoS on external memory bandwidth
> for latency-sensitive containers to help enable batch containers to use up
> leftover CPU/memory resources on a machine. We also monitor the external
> memory bandwidth usage of all hosted containers to identify ones which are
> misusing their latency-sensitive CoS assignment and downgrade them to the
> batch CoS.
>
> The trouble is, container manager developers working with the tasks interface
> have complained that it's not usable for them because it takes many (or an
> unbounded number of) passes to move all tasks from a container over, as the
> list is always changing.
>
> Our solution for them is to remove the need for moving tasks between
> CTRL_MON groups. Because we are mainly using MB throttling to implement
> QoS, we only need two classes of service. Therefore we've modified resctrl to
> reuse existing CLOSIDs for CTRL_MON groups with identical configurations,
> allowing us to create a CTRL_MON group for every container. Instead of
> moving the tasks over, we only need to update their CTRL_MON group's
> schemata. Another benefit for us is that we do not need to also move all of the
> tasks over to a new monitoring group in the batch CTRL_MON group, and the
> usage counts remain intact.
>
> The CLOSID management rules would roughly be:
>
> 1. If an update would cause a CTRL_MON group's config to match that of
> an existing group, the CTRL_MON group's CLOSID should change to that
> of the existing group, where the definition of "match" is: all
> control values match in all domains for all resources, as well as
> the cpu masks matching.
>
> 2. If an update to a CTRL_MON group sharing a CLOSID with another group
> causes that group to no longer match any others, a new CLOSID must
> be allocated.
>
> 3. An update to a CTRL_MON group using a non-shared CLOSID which
> continues to not match any others follows the current resctrl
> behavior.
>
> Before I prepare any patches for review, I'm interested in any comments or
> suggestions on the use case and solution.
>
> Are there simpler strategies for reassigning a running container's tasks to a
> different CTRL_MON group that we should be considering first?
>
> Any concerns about the CLOSID-reusing behavior? The hope is existing users
> who aren't creating identically-configured CTRL_MON groups would be
> minimally impacted. Would it help if the proposed behavior were opt-in at
> mount-time?

I am still trying to understand. I would think that when creating a new group, a new CLOS id will be used. Basically, it will remain the default behavior. You probably don't need to create two identical groups during group creation.
The new behavior of changing the CLOS id will happen when the "match check" is triggered. How is the match check triggered? Is it triggered by the user?
Thanks
Babu


2022-10-11 00:00:21

by Reinette Chatre

Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

On 10/7/2022 10:28 AM, Tony Luck wrote:
> On Fri, Oct 07, 2022 at 08:44:53AM -0700, Yu, Fenghua wrote:
>> Hi, Peter,
>>
>>> On 10/7/2022 3:39 AM, Peter Newman wrote:
>
>>>> The CLOSID management rules would roughly be:
>>>>
>>>> 1. If an update would cause a CTRL_MON group's config to match that of
>>>> an existing group, the CTRL_MON group's CLOSID should change to that
>>>> of the existing group, where the definition of "match" is: all
>>>> control values match in all domains for all resources, as well as
>>>> the cpu masks matching.
>
> So the micro steps are:
>
> # mkdir newgroup
> # New groups are created with maximum resources. So this might
> # match the root/default group (if the root schemata had not
> # been edited) ... so you could re-use CLOSID=0 for this, or
> # perhaps allocate a new CLOSID
> # edit newgroup/schemata
> # if this update makes this schemata match some other group,
> # then update the CLOSID for this group to be same as the other
> # group.
>>>>
>>>> 2. If an update to a CTRL_MON group sharing a CLOSID with another group
>>>> causes that group to no longer match any others, a new CLOSID must
>>>> be allocated.
> # So you have reference counts for CLOSIDs for how many groups
> # share it. In the above example the change to the schemata and
> # allocation of a new CLOSID would decrement the reference count
> # and free the old CLOSID if it goes to zero
>>>>
>>>> 3. An update to a CTRL_MON group using a non-shared CLOSID which
>>>> continues to not match any others follows the current resctrl
>>>> behavior.
> # An update to a CTRL_MON group that has a CLOSID reference
> # count > 1 would try to allocate a new CLOSID if the new
> # schemata doesn't match any other group. If all CLOSIDs are
> # already in use, the write(2) to the schemata file must fail
> # ... maybe -ENOSPC is the right error code?
>
> Note that if the root/default CTRL_MON had been edited you might not be
> able to create a new group (even though you intend to make it match some
> existing group and share a CLOSID). Perhaps we could change existing
> semantics so that new groups copy the root group schemata instead of
> being maximally permissible with all resources?
>>>>
>>>> Before I prepare any patches for review, I'm interested in any
>>>> comments or suggestions on the use case and solution.
>>>>
>>>> Are there simpler strategies for reassigning a running container's
>>>> tasks to a different CTRL_MON group that we should be considering first?
>
> Do tasks in a container share a "process group"? If they do, then a
> simpler option would be some syntax to assign a group to a resctrl group
> (perhaps as a negative task-id? or with a "G" prefix??).
>
> Or is there some other simple way to enumerate all the tasks in a
> container with some syntax that is convenient for both the user and the
> kernel? If there is, then add code to allow something like:
> # echo C{containername} > tasks
> and have the resctrl code move all tasks en masse.
>
> Yet another option would be syntax to apply the move recursively to all
> descendents of the given task id.
>
> # echo R{process-id} > tasks
>
> I don't know how complex it would be for the kernel to implement this. Or
> whether it would meet Google's needs.
>

How about moving monitor groups from one control group to another?

Based on the initial description I got the impression that there is
already a monitor group for every container. (Please correct me if I am
wrong). If this is the case then it may be possible to create an interface
that could move an entire monitor group to another control group. This would
keep the benefit of usage counts remaining intact, tasks get a new closid, but
keep their rmid. There would be no need for the user to specify process-ids.

Reinette

2022-10-11 15:43:18

by Stephane Eranian

Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

On Fri, Oct 7, 2022 at 10:57 AM Moger, Babu <[email protected]> wrote:
>
>
> Hi Peter,
>
> > -----Original Message-----
> > From: Peter Newman <[email protected]>
> > Sent: Friday, October 7, 2022 5:40 AM
> > To: Reinette Chatre <[email protected]>; Fenghua Yu
> > <[email protected]>
> > Cc: Stephane Eranian <[email protected]>; [email protected];
> > Thomas Gleixner <[email protected]>; James Morse
> > <[email protected]>; Moger, Babu <[email protected]>
> > Subject: [RFD] resctrl: reassigning a running container's CTRL_MON group
> >
> > Hi Reinette, Fenghua,
> >
> > I'd like to talk about the tasks file interface in CTRL_MON and MON groups.
> >
> > For some background, we are using the memory-bandwidth monitoring and
> > allocation features of resctrl to maintain QoS on external memory bandwidth
> > for latency-sensitive containers to help enable batch containers to use up
> > leftover CPU/memory resources on a machine. We also monitor the external
> > memory bandwidth usage of all hosted containers to identify ones which are
> > misusing their latency-sensitive CoS assignment and downgrade them to the
> > batch CoS.
> >
> > The trouble is, container manager developers working with the tasks interface
> > have complained that it's not usable for them because it takes many (or an
> > unbounded number of) passes to move all tasks from a container over, as the
> > list is always changing.
> >
> > Our solution for them is to remove the need for moving tasks between
> > CTRL_MON groups. Because we are mainly using MB throttling to implement
> > QoS, we only need two classes of service. Therefore we've modified resctrl to
> > reuse existing CLOSIDs for CTRL_MON groups with identical configurations,
> > allowing us to create a CTRL_MON group for every container. Instead of
> > moving the tasks over, we only need to update their CTRL_MON group's
> > schemata. Another benefit for us is that we do not need to also move all of the
> > tasks over to a new monitoring group in the batch CTRL_MON group, and the
> > usage counts remain intact.
> >
> > The CLOSID management rules would roughly be:
> >
> > 1. If an update would cause a CTRL_MON group's config to match that of
> > an existing group, the CTRL_MON group's CLOSID should change to that
> > of the existing group, where the definition of "match" is: all
> > control values match in all domains for all resources, as well as
> > the cpu masks matching.
> >
> > 2. If an update to a CTRL_MON group sharing a CLOSID with another group
> > causes that group to no longer match any others, a new CLOSID must
> > be allocated.
> >
> > 3. An update to a CTRL_MON group using a non-shared CLOSID which
> > continues to not match any others follows the current resctrl
> > behavior.
> >
> > Before I prepare any patches for review, I'm interested in any comments or
> > suggestions on the use case and solution.
> >
> > Are there simpler strategies for reassigning a running container's tasks to a
> > different CTRL_MON group that we should be considering first?
> >
> > Any concerns about the CLOSID-reusing behavior? The hope is existing users
> > who aren't creating identically-configured CTRL_MON groups would be
> > minimally impacted. Would it help if the proposed behavior were opt-in at
> > mount-time?
>
> I am still trying to understand. I would think that when creating a new group, a new CLOS id will be used. Basically, it will remain the default behavior. You probably don't need to create two identical groups during group creation.
> The new behavior of changing the CLOS id will happen when the "match check" is triggered. How is the match check triggered? Is it triggered by the user?

The matching check is triggered when the schemata is changed for any
resctrl group.
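
In other words, the hook point is the tail of the schemata write
handler; conceptually something like this (the helper names are
hypothetical, reusing the rematch_closid() sketch from Peter's mail):

static int rdtgroup_apply_schemata(struct rdtgroup *rdtgrp)
{
    int ret;

    ret = update_domains_all(rdtgrp);   /* push the new control values */
    if (ret)
        return ret;

    /* The "match check": runs on every schemata write. */
    return rematch_closid(rdtgrp);
}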

> Thanks
> Babu
>
>

2022-10-11 15:43:41

by Stephane Eranian

Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi,

On Fri, Oct 7, 2022 at 3:39 AM Peter Newman <[email protected]> wrote:
>
> Hi Reinette, Fenghua,
>
> I'd like to talk about the tasks file interface in CTRL_MON and MON
> groups.
>
> For some background, we are using the memory-bandwidth monitoring and
> allocation features of resctrl to maintain QoS on external memory
> bandwidth for latency-sensitive containers to help enable batch
> containers to use up leftover CPU/memory resources on a machine. We
> also monitor the external memory bandwidth usage of all hosted
> containers to identify ones which are misusing their latency-sensitive
> CoS assignment and downgrade them to the batch CoS.
>
> The trouble is, container manager developers working with the tasks
> interface have complained that it's not usable for them because it takes
> many (or an unbounded number of) passes to move all tasks from a
> container over, as the list is always changing.
>
> Our solution for them is to remove the need for moving tasks between
> CTRL_MON groups. Because we are mainly using MB throttling to implement
> QoS, we only need two classes of service. Therefore we've modified
> resctrl to reuse existing CLOSIDs for CTRL_MON groups with identical
> configurations, allowing us to create a CTRL_MON group for every
> container. Instead of moving the tasks over, we only need to update
> their CTRL_MON group's schemata. Another benefit for us is that we do
> not need to also move all of the tasks over to a new monitoring group in
> the batch CTRL_MON group, and the usage counts remain intact.
>
> The CLOSID management rules would roughly be:
>
> 1. If an update would cause a CTRL_MON group's config to match that of
> an existing group, the CTRL_MON group's CLOSID should change to that
> of the existing group, where the definition of "match" is: all
> control values match in all domains for all resources, as well as
> the cpu masks matching.
>
> 2. If an update to a CTRL_MON group sharing a CLOSID with another group
> causes that group to no longer match any others, a new CLOSID must
> be allocated.
>
> 3. An update to a CTRL_MON group using a non-shared CLOSID which
> continues to not match any others follows the current resctrl
> behavior.
>
Another important aspect of this change is that, unlike the default
model of moving all the threads to the control group corresponding to
the restriction, it allows each container group (cgroup) to have its
own resctrl group and therefore its own RMID, and therefore its own
monitoring capabilities. This is important when we need to track who
is responsible for bandwidth consumption, for instance.

>
> Before I prepare any patches for review, I'm interested in any comments
> or suggestions on the use case and solution.
>
> Are there simpler strategies for reassigning a running container's tasks
> to a different CTRL_MON group that we should be considering first?
>
> Any concerns about the CLOSID-reusing behavior? The hope is existing
> users who aren't creating identically-configured CTRL_MON groups would
> be minimally impacted. Would it help if the proposed behavior were
> opt-in at mount-time?
>
> Thanks!
> -Peter

2022-10-12 11:46:06

by Peter Newman

Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

[Adding Gaurang to CC]

On Tue, Oct 11, 2022 at 1:35 AM Reinette Chatre
<[email protected]> wrote:
>
> On 10/7/2022 10:28 AM, Tony Luck wrote:
> > I don't know how complex it would be for the kernel to implement this. Or
> > whether it would meet Google's needs.
> >
>
> How about moving monitor groups from one control group to another?
>
> Based on the initial description I got the impression that there is
> already a monitor group for every container. (Please correct me if I am
> wrong). If this is the case then it may be possible to create an interface
> that could move an entire monitor group to another control group. This would
> keep the benefit of usage counts remaining intact, tasks get a new closid, but
> keep their rmid. There would be no need for the user to specify process-ids.

Yes, Stephane also pointed out the importance of maintaining RMID assignments
as well and I don't believe I put enough emphasis on it during my
original email.

We need to maintain accurate memory bandwidth usage counts on all
containers, so it's important to be able to maintain an RMID assignment
and its event counts across a CoS downgrade. The solutions Tony
suggested do solve the races in moving the tasks, but the container
would need to temporarily join the default MON group in the new CTRL_MON
group before it can be moved to its replacement MON group.

Being able to re-parent a MON group would allow us to change the CLOSID
independently of the RMID in a container and would address the issue.

The only other point I can think of to differentiate it from the
automatic CLOSID management solution is whether the 1:1 CTRL_MON:CLOSID
approach will become too limiting going forward. For example, if there
are configurations where one resource has far fewer CLOSIDs than others
and we want to start assigning CLOSIDs on-demand, per-resource to avoid
wasting other resources' available CLOSID spaces. If we can foresee this
becoming a concern, then automatic CLOSID management would be
inevitable.

-Peter



2022-10-12 17:18:42

by Fenghua Yu

Subject: RE: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi, Peter,

> > > I don't know how complex it would be for the kernel to implement this.
> > > Or whether it would meet Google's needs.
> > >
> >
> > How about moving monitor groups from one control group to another?
> >
> > Based on the initial description I got the impression that there is
> > already a monitor group for every container. (Please correct me if I
> > am wrong). If this is the case then it may be possible to create an
> > interface that could move an entire monitor group to another control
> > group. This would keep the benefit of usage counts remaining intact,
> > tasks get a new closid, but keep their rmid. There would be no need for the
> user to specify process-ids.
>
> Yes, Stephane also pointed out the importance of maintaining RMID
> assignments as well and I don't believe I put enough emphasis on it during my
> original email.
>
> We need to maintain accurate memory bandwidth usage counts on all
> containers, so it's important to be able to maintain an RMID assignment and its
> event counts across a CoS downgrade. The solutions Tony suggested do solve
> the races in moving the tasks, but the container would need to temporarily join
> the default MON group in the new CTRL_MON group before it can be moved to
> its replacement MON group.
>
> Being able to re-parent a MON group would allow us to change the CLOSID
> independently of the RMID in a container and would address the issue.
>
> The only other point I can think of to differentiate it from the automatic CLOSID
> management solution is whether the 1:1 CTRL_MON:CLOSID approach will
> become too limiting going forward. For example, if there are configurations
> where one resource has far fewer CLOSIDs than others and we want to start
> assigning CLOSIDs on-demand, per-resource to avoid wasting other resources'
> available CLOSID spaces. If we can foresee this becoming a concern, then
> automatic CLOSID management would be inevitable.

In the very first resctrl implementation, we did foresee uneven CLOSIDs per resource
and allocated CLOSIDs per-resource on demand to avoid wasting them. But that
implementation was too complex and more prone to bugs, and it was not blessed by
the community. So we changed to allocating statically, using the minimum CLOSID number,
with the intent to change to per-resource on-demand allocation if it proved really useful.

But so far there is no real usage: the current CLOSID assignment still stands.

In your case, only two CLOSIDs are used, right? And the current CLOSID assignment can still be used, right?
If that's the case, the unnecessary complexity and proneness to bugs may still be a problem for
per-resource on-demand CLOSID assignment.

Thanks.

-Fenghua

2022-10-12 17:26:03

by James Morse

Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi guys,

On 12/10/2022 12:21, Peter Newman wrote:
> On Tue, Oct 11, 2022 at 1:35 AM Reinette Chatre
> <[email protected]> wrote:
>> On 10/7/2022 10:28 AM, Tony Luck wrote:
>>> I don't know how complex it would be for the kernel to implement this. Or
>>> whether it would meet Google's needs.
>>>
>>
>> How about moving monitor groups from one control group to another?
>>
>> Based on the initial description I got the impression that there is
>> already a monitor group for every container. (Please correct me if I am
>> wrong). If this is the case then it may be possible to create an interface
>> that could move an entire monitor group to another control group. This would
>> keep the benefit of usage counts remaining intact, tasks get a new closid, but
>> keep their rmid. There would be no need for the user to specify process-ids.

> Yes, Stephane also pointed out the importance of maintaining RMID assignments
> as well and I don't believe I put enough emphasis on it during my
> original email.
>
> We need to maintain accurate memory bandwidth usage counts on all
> containers, so it's important to be able to maintain an RMID assignment
> and its event counts across a CoS downgrade. The solutions Tony
> suggested do solve the races in moving the tasks, but the container
> would need to temporarily join the default MON group in the new CTRL_MON
> group before it can be moved to its replacement MON group.
>
> Being able to re-parent a MON group would allow us to change the CLOSID
> independently of the RMID in a container and would address the issue.
>
> The only other point I can think of to differentiate it from the
> automatic CLOSID management solution is whether the 1:1 CTRL_MON:CLOSID
> approach will become too limiting going forward. For example, if there
> are configurations where one resource has far fewer CLOSIDs than others
> and we want to start assigning CLOSIDs on-demand, per-resource to avoid
> wasting other resources' available CLOSID spaces. If we can foresee this
> becoming a concern, then automatic CLOSID management would be
> inevitable.

You originally asked:
| Any concerns about the CLOSID-reusing behavior?

I don't think this will work well with MPAM ... I expect it will mess up the bandwidth
counters.

MPAM's equivalent to RMID is PMG. While on x86 CLOSID and RMID are independent numbers,
this isn't true for PARTID (MPAM's version of CLOSID) and PMG. The PMG bits effectively
extended the PARTID with bits that aren't used to look up the configuration.

x86's monitors match only on RMID, and there are 'enough' RMID... MPAM's monitors are more
complicated. I've seen details of a system that only has 1 bit of PMG space.

While MPAM's bandwidth monitors can match just the PMG, there aren't expected to be enough
unique PMG for every control/monitor group to have a unique value. Instead, MPAM's
monitors are expected to be used with both the PARTID and PMG.

('bandwidth monitors' is relevant here, MPAM's 'cache storage utilisation' monitors can't
match on just PMG at all - they have to be told the PARTID too)
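
To summarize the difference in matching semantics, a conceptual sketch
(the struct and function names are invented for illustration):

/* Invented types, for illustration only. */
struct mon_match {
    u32 rmid;       /* x86 */
    u16 partid;     /* MPAM */
    u8  pmg;        /* MPAM */
};

/* x86: a monitor matches on RMID alone. */
static bool rdt_monitor_matches(const struct mon_match *m, u32 rmid)
{
    return m->rmid == rmid;
}

/* MPAM: bandwidth monitors are expected to match PARTID and PMG together. */
static bool mpam_monitor_matches(const struct mon_match *m, u16 partid, u8 pmg)
{
    return m->partid == partid && m->pmg == pmg;
}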


If you're re-using CLOSID like this, I think you'll end up with noisy measurements on MPAM
systems as the caches hold PARTID/PMG values from before the re-use pattern changed, and
the monitors have to match on both.


I have half-finished patches that add a 'resctrl' cgroup controller that can be used to
group tasks and assign them to control or monitor groups. (the creation and configuration
of control and monitor groups stays in resctrl - it effectively makes the tasks file
read-only). I think this might help, as a group of processes can be moved between two
control/monitor groups with one syscall. New processes that are created inherit from the
cgroup setting instead of their parent task.
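
Conceptually the controller hooks the cgroup attach path, so one write
to cgroup.procs relabels every task that migrates. This is only my
rough sketch of the idea, not the actual patches; css_to_rdtgroup() is
a hypothetical mapping from the cgroup to its assigned resctrl group:

static void resctrl_css_attach(struct cgroup_taskset *tset)
{
    struct cgroup_subsys_state *css;
    struct task_struct *task;

    cgroup_taskset_for_each(task, css, tset)
        __rdtgroup_move_task(task, css_to_rdtgroup(css));
}

struct cgroup_subsys resctrl_cgrp_subsys = {
    .attach = resctrl_css_attach,
    /* css_alloc()/css_free() and the rest omitted. */
};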

If you want to take a look, it's here:
https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/commit/?h=mpam/snapshot/v6.0&id=4e5987d8ecbc8647dee0aebfb73c3890843ef5dd

I've not worked the cgroup thread stuff out yet ... it doesn't appear to hook thread
creation, only fork().


Thanks,

James

2022-10-12 17:49:28

by Reinette Chatre

Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Peter,

On 10/12/2022 4:21 AM, Peter Newman wrote:
> [Adding Gaurang to CC]
>
> On Tue, Oct 11, 2022 at 1:35 AM Reinette Chatre
> <[email protected]> wrote:
>>
>> On 10/7/2022 10:28 AM, Tony Luck wrote:
>>> I don't know how complex it would be for the kernel to implement this. Or
>>> whether it would meet Google's needs.
>>>
>>
>> How about moving monitor groups from one control group to another?
>>
>> Based on the initial description I got the impression that there is
>> already a monitor group for every container. (Please correct me if I am
>> wrong). If this is the case then it may be possible to create an interface
>> that could move an entire monitor group to another control group. This would
>> keep the benefit of usage counts remaining intact, tasks get a new closid, but
>> keep their rmid. There would be no need for the user to specify process-ids.
>
> Yes, Stephane also pointed out the importance of maintaining RMID assignments
> as well and I don't believe I put enough emphasis on it during my
> original email.
>
> We need to maintain accurate memory bandwidth usage counts on all
> containers, so it's important to be able to maintain an RMID assignment
> and its event counts across a CoS downgrade. The solutions Tony
> suggested do solve the races in moving the tasks, but the container
> would need to temporarily join the default MON group in the new CTRL_MON
> group before it can be moved to its replacement MON group.
>
> Being able to re-parent a MON group would allow us to change the CLOSID
> independently of the RMID in a container and would address the issue.

What if resctrl adds support to rdtgroup_kf_syscall_ops for
the .rename callback?

It seems like doing so could enable users to do something like:
mv /sys/fs/resctrl/groupA/mon_groups/containerA /sys/fs/resctrl/groupB/mon_groups/

Such a user request would trigger the "containerA" monitor group
to be moved to another control group. All tasks within it could be moved to
the new control group (their CLOSIDs are changed) while their RMIDs
remain intact.
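
A minimal sketch of the wiring (the .rename signature is the one kernfs
already defines; rdtgroup_rename(), is_mon_groups_dir() and the actual
move helper are hypothetical):

static int rdtgroup_rename(struct kernfs_node *kn,
                           struct kernfs_node *new_parent,
                           const char *new_name)
{
    /* Only allow moving a MON group under another group's mon_groups. */
    if (!is_mon_groups_dir(new_parent))
        return -EPERM;

    return rdtgroup_move_mon_group(kn, new_parent, new_name);
}

static struct kernfs_syscall_ops rdtgroup_kf_syscall_ops = {
    .mkdir        = rdtgroup_mkdir,
    .rmdir        = rdtgroup_rmdir,
    .rename       = rdtgroup_rename,    /* new */
    .show_options = rdtgroup_show_options,
};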

I just read James's response and I do not know how this could be made to
work with the Arm monitoring when it arrives. Potentially there
could be an architecture specific "move monitor group" call.

> The only other point I can think of to differentiate it from the
> automatic CLOSID management solution is whether the 1:1 CTRL_MON:CLOSID
> approach will become too limiting going forward. For example, if there
> are configurations where one resource has far fewer CLOSIDs than others
> and we want to start assigning CLOSIDs on-demand, per-resource to avoid
> wasting other resources' available CLOSID spaces. If we can foresee this
> becoming a concern, then automatic CLOSID management would be
> inevitable.

I think Fenghua answered this well.

Reinette

2022-10-14 13:04:01

by James Morse

Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Reinette,

On 12/10/2022 18:23, Reinette Chatre wrote:
> On 10/12/2022 4:21 AM, Peter Newman wrote:
>> On Tue, Oct 11, 2022 at 1:35 AM Reinette Chatre
>> <[email protected]> wrote:
>>>
>>> On 10/7/2022 10:28 AM, Tony Luck wrote:
>>>> I don't know how complex it would be for the kernel to implement this. Or
>>>> whether it would meet Google's needs.
>>>>
>>>
>>> How about moving monitor groups from one control group to another?
>>>
>>> Based on the initial description I got the impression that there is
>>> already a monitor group for every container. (Please correct me if I am
>>> wrong). If this is the case then it may be possible to create an interface
>>> that could move an entire monitor group to another control group. This would
>>> keep the benefit of usage counts remaining intact, tasks get a new closid, but
>>> keep their rmid. There would be no need for the user to specify process-ids.
>>
>> Yes, Stephane also pointed out the importance of maintaining RMID assignments
>> as well and I don't believe I put enough emphasis on it during my
>> original email.
>>
>> We need to maintain accurate memory bandwidth usage counts on all
>> containers, so it's important to be able to maintain an RMID assignment
>> and its event counts across a CoS downgrade. The solutions Tony
>> suggested do solve the races in moving the tasks, but the container
>> would need to temporarily join the default MON group in the new CTRL_MON
>> group before it can be moved to its replacement MON group.
>>
>> Being able to re-parent a MON group would allow us to change the CLOSID
>> independently of the RMID in a container and would address the issue.

> What if resctrl adds support to rdtgroup_kf_syscall_ops for
> the .rename callback?
>
> It seems like doing so could enable users to do something like:
> mv /sys/fs/resctrl/groupA/mon_groups/containerA /sys/fs/resctrl/groupB/mon_groups/
>
> Such a user request would trigger the "containerA" monitor group
> to be moved to another control group. All tasks within it could be moved to
> the new control group (their CLOSIDs are changed) while their RMIDs
> remain intact.
>
> I just read James's response and I do not know how this could be made to
> work with the Arm monitoring when it arrives. Potentially there
> could be an architecture specific "move monitor group" call.

If it's just moving tasks between groups - this should be fine. You'll get some noise, but
this already exists. User-space should understand that what it is monitoring has changed
in this case.

My comments were about having the kernel transparently change the closid in response to a
schemata change. This is where user-space can't know that it is now monitoring something
else. (maybe I should have replied to the top of the thread).



Thanks,

James

2022-10-17 10:24:00

by Peter Newman

Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi James,

On Wed, Oct 12, 2022 at 6:55 PM James Morse <[email protected]> wrote:
> You originally asked:
> | Any concerns about the CLOSID-reusing behavior?
>
> I don't think this will work well with MPAM ... I expect it will mess up the bandwidth
> counters.
>
> MPAM's equivalent to RMID is PMG. While on x86 CLOSID and RMID are independent numbers,
> this isn't true for PARTID (MPAM's version of CLOSID) and PMG. The PMG bits effectively
> extended the PARTID with bits that aren't used to look up the configuration.
>
> x86's monitors match only on RMID, and there are 'enough' RMID... MPAM's monitors are more
> complicated. I've seen details of a system that only has 1 bit of PMG space.
>
> While MPAM's bandwidth monitors can match just the PMG, there aren't expected to be enough
> unique PMG for every control/monitor group to have a unique value. Instead, MPAM's
> monitors are expected to be used with both the PARTID and PMG.
>
> ('bandwidth monitors' is relevant here, MPAM's 'cache storage utilisation' monitors can't
> match on just PMG at all - they have to be told the PARTID too)
>
>
> If you're re-using CLOSID like this, I think you'll end up with noisy measurements on MPAM
> systems as the caches hold PARTID/PMG values from before the re-use pattern changed, and
> the monitors have to match on both.

Yes, that sounds like it would be an issue.

Following your refactoring changes, hopefully the MPAM driver could
offer alternative methods for managing PARTIDs and PMGs depending on the
available hardware resources.

If there are a lot more PARTIDs than PMGs, then it would fit well with a
user who never creates child MON groups. In case the number of MON
groups gets ahead of the number of CTRL_MON groups and you've run out of
PMGs, perhaps you would just try to allocate another PARTID and program
the same partitioning configuration before giving up. Of course, there
wouldn't be much point in reusing PARTIDs in such a configuration
either.

If we used the child MON groups as the primary vehicle for moving a
container's tasks between a small number of CTRL_MON groups like in
Reinette's proposal, then it seems like it would be a better use of
hardware to have many PMGs and few PARTIDs. In that case, the monitors
would only match on PMGs. Provided that there are sufficient monitor
instances, there would never be any need to reprogram a monitor's
PMG.

> I have half-finished patches that add a 'resctrl' cgroup controller that can be used to
> group tasks and assign them to control or monitor groups. (the creation and configuration
> of control and monitor groups stays in resctrl - it effectively makes the tasks file
> read-only). I think this might help, as a group of processes can be moved between two
> control/monitor groups with one syscall. New processes that are created inherit from the
> cgroup setting instead of their parent task.
>
> If you want to take a look, it's here:
> https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/commit/?h=mpam/snapshot/v6.0&id=4e5987d8ecbc8647dee0aebfb73c3890843ef5dd

> I've not worked the cgroup thread stuff out yet ... it doesn't appear to hook thread
> creation, only fork().

This looks very promising for our use case, as it would be very easy to
use for a container manager. I'm glad you're looking into this.

Thanks!
-Peter

2022-10-19 11:10:27

by Peter Newman

Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Reinette,

On Wed, Oct 12, 2022 at 7:23 PM Reinette Chatre
<[email protected]> wrote:
> What if resctrl adds support to rdtgroup_kf_syscall_ops for
> the .rename callback?
>
> It seems like doing so could enable users to do something like:
> mv /sys/fs/resctrl/groupA/mon_groups/containerA /sys/fs/resctrl/groupB/mon_groups/
>
> Such a user request would trigger the "containerA" monitor group
> to be moved to another control group. All tasks within it could be moved to
> the new control group (their CLOSIDs are changed) while their RMIDs
> remain intact.

I think this will be the best approach for us, since we need separate
counters for every job. Unless you were planning to implement this very
soon, I will prototype it for the container manager team to try out and
submit patches for review if it works for them.

> I just read James's response and I do not know how this could be made to
> work with the Arm monitoring when it arrives. Potentially there
> could be an architecture specific "move monitor group" call.

AFAICT all we could do in that situation is hope there are plenty of
CLOSIDs, since we wouldn't be able to create any additional monitoring
groups.

What's still unclear to me is exactly how an application would interpret
the reported CLOSID and RMID counts to decide whether it should create
lots of MON groups vs CTRL_MON groups, given that the RMID count would
mean something semantically different on MPAM. I would not want to see
the container manager asking itself "am I on an ARM system?" when
calculating how many containers' bandwidth usage it can count. (Maybe
James has an answer to this question.)

Thanks!
-Peter

2022-10-19 13:38:50

by James Morse

Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Peter,

On 19/10/2022 10:08, Peter Newman wrote:
> On Wed, Oct 12, 2022 at 7:23 PM Reinette Chatre
> <[email protected]> wrote:
>> What if resctrl adds support to rdtgroup_kf_syscall_ops for
>> the .rename callback?
>>
>> It seems like doing so could enable users to do something like:
>> mv /sys/fs/resctrl/groupA/mon_groups/containerA /sys/fs/resctrl/groupB/mon_groups/
>>
>> Such a user request would trigger the "containerA" monitor group
>> to be moved to another control group. All tasks within it could be moved to
>> the new control group (their CLOSIDs are changed) while their RMIDs
>> remain intact.
>
> I think this will be the best approach for us, since we need separate
> counters for every job. Unless you were planning to implement this very
> soon, I will prototype it for the container manager team to try out and
> submit patches for review if it works for them.
>
>> I just read James's response and I do not know how this could be made to
>> work with the Arm monitoring when it arrives. Potentially there
>> could be an architecture specific "move monitor group" call.

> AFAICT all we could do in that situation is hope there are plenty of
> CLOSIDs, since we wouldn't be able to create any additional monitoring
> groups.
>
> What's still unclear to me is exactly how an application would interpret
> the reported CLOSID and RMID counts to decide whether it should create
> lots of MON groups vs CTRL_MON groups, given that the RMID count would
> mean something semantically different on MPAM.

Yeah - it's top of the list in the 'ABI problems' section of the KNOWN_ISSUES file.


> I would not want to see
> the container manager asking itself "am I on an ARM system?" when
> calculating how many containers' bandwidth usage it can count.

This would be terrible!


> (Maybe James has an answer to this question.)

I don't. It's an unfortunate difference that is visible to user-space.

Currently the MPAM tree proposes to expose '1' as num_rmid on arm64, because the right
answer depends on whether you intend to create monitoring groups or control groups.

My best bet is to expose some new properties, 'num_groups' at the root level (which would
have the same value as num_closid), and inside each control group's 'mon_groups'. For x86
the latter would be the same as num_rmid, but on arm64 it would be the maximum PMG bits.


Thanks,

James

2022-10-19 14:44:21

by James Morse

Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Peter,

On 17/10/2022 11:15, Peter Newman wrote:
> On Wed, Oct 12, 2022 at 6:55 PM James Morse <[email protected]> wrote:
>> You originally asked:
>> | Any concerns about the CLOSID-reusing behavior?
>>
>> I don't think this will work well with MPAM ... I expect it will mess up the bandwidth
>> counters.
>>
>> MPAM's equivalent to RMID is PMG. While on x86 CLOSID and RMID are independent numbers,
>> this isn't true for PARTID (MPAM's version of CLOSID) and PMG. The PMG bits effectively
>> extended the PARTID with bits that aren't used to look up the configuration.
>>
>> x86's monitors match only on RMID, and there are 'enough' RMID... MPAM's monitors are more
>> complicated. I've seen details of a system that only has 1 bit of PMG space.
>>
>> While MPAM's bandwidth monitors can match just the PMG, there aren't expected to be enough
>> unique PMG for every control/monitor group to have a unique value. Instead, MPAM's
>> monitors are expected to be used with both the PARTID and PMG.
>>
>> ('bandwidth monitors' is relevant here, MPAM's 'cache storage utilisation' monitors can't
>> match on just PMG at all - they have to be told the PARTID too)
>>
>>
>> If you're re-using CLOSID like this, I think you'll end up with noisy measurements on MPAM
>> systems as the caches hold PARTID/PMG values from before the re-use pattern changed, and
>> the monitors have to match on both.

> Yes, that sounds like it would be an issue.
>
> Following your refactoring changes, hopefully the MPAM driver could
> offer alternative methods for managing PARTIDs and PMGs depending on the
> available hardware resources.

Mmmm, I don't think anything other than one-partid per control group and one-pmg per
monitor group makes much sense.


> If there are a lot more PARTIDs than PMGs, then it would fit well with a
> user who never creates child MON groups. In case the number of MON
> groups gets ahead of the number of CTRL_MON groups and you've run out of
> PMGs, perhaps you would just try to allocate another PARTID and program
> the same partitioning configuration before giving up.

User-space can choose to do this.
If the kernel tries to be clever and do this behind user-space's back, it needs to
allocate two monitors for this secretly-two-control-groups group, and always sum the counters
before reporting them to user-space.
If monitors are a contended resource, then you may be unable to monitor the
secretly-two-control-groups group once the kernel has done this.

I don't think the kernel should try to be too clever here.

> Of course, there
> wouldn't be much point in reusing PARTIDs in such a configuration
> either.

> If we used the child MON groups as the primary vehicle for moving a
> container's tasks between a small number of CTRL_MON groups like in
> Reinette's proposal, then it seems like it would be a better use of
> hardware to have many PMGs and few PARTIDs.

> In that case, the monitors would only match on PMGs.

This isn't how MPAM is designed to be used. You'll hit nasty corners.
The big one is the Cache Storage Utilisation counters.

See 11.5.2 of the MPAM spec, "MSMON_CFG_CSU_CTL, MPAM Memory System Monitor Configure
Cache Storage Usage Monitor Control Register". Not setting the MATCH_PARTID bit has this
warning:
| If MATCH_PMG is 1 and MATCH_PARTID is 0, it is CONSTRAINED UNPREDICTABLE whether the
| monitor instance:
| • Measures the storage used with matching PMG and with any PARTID.
| • Measures no storage usage, that is, MSMON_CSU.VALUE is zero.
| • Measures the storage used with matching PMG and PARTID, that is, treats
| MATCH_PARTID as = 1

'constrained unpredictable' is arm's term for "portable software can't rely on this".
The folk that designed MPAM don't believe "monitors would only match on PMGs" makes any
sense. A PMG is not an RMID. A case in point is the system with only 1 PMG bit.

I'm afraid this approach would preclude support for the llc_occupancy counter, and would
artificially reduce the number of control groups that can be created as each control group
needs an 'RMID'. On the machine with 1 PMG bit - you get 2 control groups, even though it
has many more PARTID.


> Provided that there are sufficient monitor
> instances, there would never be any need to reprogram a monitor's
> PMG.

It sounds like this moves the problem to "make everything a monitor group because only
monitor groups can be batch moved".

If the tasks file could be moved between control and monitor groups, causing resctrl to
relabel the tasks - would that solve more of the problem? (it eliminates the need to make
everything a monitor group)
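
(Purely illustrative - nothing like this is supported today - but the idea
would be something along the lines of:

  # mv /sys/fs/resctrl/g1/tasks /sys/fs/resctrl/g2/tasks

with resctrl relabelling every task that was in g1 as part of the move.)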

The devil is in the detail, I'm not sure how it serialises with a fork()ing process, I'd
hope to do better than relying on the kernel walking the list of processes a lot quicker
than user-space can.


>> I have half-finished patches that add a 'resctrl' cgroup controller that can be used to
>> group tasks and assign them to control or monitor groups. (the creation and configuration
>> of control and monitor groups stays in resctrl - it effectively makes the tasks file
>> read-only). I think this might help, as a group of processes can be moved between two
>> control/monitor groups with one syscall. New processes that are created inherit from the
>> cgroup setting instead of their parent task.
>>
>> If you want to take a look, it's here:
>> https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/commit/?h=mpam/snapshot/v6.0&id=4e5987d8ecbc8647dee0aebfb73c3890843ef5dd
>
>> I've not worked the cgroup thread stuff out yet ... it doesn't appear to hook thread
>> creation, only fork().

> This looks very promising for our use case, as it would be very easy to
> use for a container manager. I'm glad you're looking into this.

Let me know if it solves this problem - I assume the resctrl topology is a subset of the
cgroup topology.

(apparently android needed cgroup support, but now it's more complicated)


Thanks,

James

2022-10-20 00:45:24

by Reinette Chatre

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Peter,

On 10/19/2022 2:08 AM, Peter Newman wrote:
> Hi Reinette,
>
> On Wed, Oct 12, 2022 at 7:23 PM Reinette Chatre
> <[email protected]> wrote:
>> What if resctrl adds support to rdtgroup_kf_syscall_ops for
>> the .rename callback?
>>
>> It seems like doing so could enable users to do something like:
>> mv /sys/fs/resctrl/groupA/mon_groups/containerA /sys/fs/resctrl/groupB/mon_groups/
>>
>> Such a user request would trigger the "containerA" monitor group
>> to be moved to another control group. All tasks within it could be moved to
>> the new control group (their CLOSIDs are changed) while their RMIDs
>> remain intact.
>
> I think this will be the best approach for us, since we need separate
> counters for every job. Unless you were planning to implement this very
> soon, I will prototype it for the container manager team to try out and
> submit patches for review if it works for them.

I do not have plans for work in this area.

It is still not clear to me how palatable this will be on Arm systems.
This solution also involves changing the CLOSID/PARTID like your original
proposal and James highlighted that it would "mess up the bandwidth counters"
because of the way PARTID.PMG is used for monitoring. Perhaps even a new
PMG would need to be assigned during such a monitor group move. One requirement
for this RFD was to keep usage counts intact and from what I understand
this will not be possible on Arm systems. There could be software mechanisms
to help reduce the noise during the transition. For example, some new limbo
mechanism that avoids re-assigning the old PARTID.PMG, while perhaps still
using the old PARTID.PMG to read usage counts for a while? Or would the
guidance just be that the counters will have some noise after the move?

Reinette

2022-10-20 09:17:47

by Peter Newman

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

On Thu, Oct 20, 2022 at 1:54 AM Reinette Chatre
<[email protected]> wrote:
> It is still not clear to me how palatable this will be on Arm systems.
> This solution also involves changing the CLOSID/PARTID like your original
> proposal and James highlighted that it would "mess up the bandwidth counters"
> because of the way PARTID.PMG is used for monitoring. Perhaps even a new
> PMG would need to be assigned during such a monitor group move. One requirement
> for this RFD was to keep usage counts intact and from what I understand
> this will not be possible on Arm systems. There could be software mechanisms
> to help reduce the noise during the transition. For example, some new limbo
> mechanism that avoids re-assigning the old PARTID.PMG, while perhaps still
> using the old PARTID.PMG to read usage counts for a while? Or would the
> guidance just be that the counters will have some noise after the move?

I'm going to have to follow up on the details of this in James's thread.
It sounded like we probably won't be able to create enough mon_groups
under a single control group for the rename feature to even be useful.
Rather, we expect the PARTID counts to be so much larger than the PMG
counts that creating more mon_groups to reduce the number of control
groups wouldn't make sense.

At least in our use case, we're literally creating "classes of service"
to prioritize memory traffic, so we want a small number of control
groups to represent the small number of priority levels, but enough
RMIDs to count every job's traffic independently. For MPAM to support
this MBM/MBA use case in exactly this fashion, we'd have to develop the
monitors-not-matching-on-PARTID use case better in the MPAM
architecture. But before putting much effort into that, I'd want to know
if there's any payoff beyond being able to use resctrl the same way on
both implementations.

-Peter

2022-10-20 10:57:36

by Peter Newman

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

On Wed, Oct 19, 2022 at 3:58 PM James Morse <[email protected]> wrote:
> This isn't how MPAM is designed to be used. You'll hit nasty corners.
> The big one is the Cache Storage Utilisation counters.
>
> See 11.5.2 of the MPAM spec, "MSMON_CFG_CSU_CTL, MPAM Memory System Monitor Configure
> Cache Storage Usage Monitor Control Register". Not setting the MATCH_PARTID bit has this
> warning:
> | If MATCH_PMG is 1 and MATCH_PARTID is 0, it is CONSTRAINED UNPREDICTABLE whether the
> | monitor instance:
> | • Measures the storage used with matching PMG and with any PARTID.
> | • Measures no storage usage, that is, MSMON_CSU.VALUE is zero.
> | • Measures the storage used with matching PMG and PARTID, that is, treats
> | MATCH_PARTID as = 1
>
> 'constrained unpredictable' is arm's term for "portable software can't rely on this".
> The folk that designed MPAM don't believe "monitors would only match on PMGs" makes any
> sense. A PMG is not an RMID. A case in point is the system with only 1 PMG bit.
>
> I'm afraid this approach would preclude support for the llc_occupancy counter, and would
> artificially reduce the number of control groups that can be created as each control group
> needs an 'RMID'. On the machine with 1 PMG bit - you get 2 control groups, even though it
> has many more PARTID.

The first sentence of the Resource Monitoring chapter is also quite an
obstacle to my challenge to the PARTID-PMG hierarchy:

| Software environments may be labeled as belonging to a Performance
| Monitoring Group (PMG) within a partition.

It seems like the only real issue is that the user is responsible for
figuring out how best to make use of the available resources. But I seem
to recall that was the expectation with resctrl, so I should probably
stop trying to argue for expecting MPAM configurations which resemble
RDT.


> On 17/10/2022 11:15, Peter Newman wrote:
> > Provided that there are sufficient monitor
> > instances, there would never be any need to reprogram a monitor's
> > PMG.
>
> It sounds like this moves the problem to "make everything a monitor group because only
> monitor groups can be batch moved".
>
> If the tasks file could be moved between control and monitor groups, causing resctrl to
> relabel the tasks - would that solve more of the problem? (it eliminates the need to make
> everything a monitor group)

This was about preserving the RMID and memory bandwidth counts across a
CLOSID change. If the user is forced to conserve CTRL_MON groups due to
a limited number of CLOSIDs, keeping the various containers' tasks
separate is also a concern.

But if there's no need to conserve CTRL_MON groups, then there's no real
issue.

> The devil is in the detail, I'm not sure how it serialises with a fork()ing process, I'd
> hope to do better than relying on the kernel walking the list of processes a lot quicker
> than user-space can.

I wasn't planning to do it any more optimally than the rmdir
implementation today when looking for all tasks impacted by a
CLOSID/RMID deletion.

-Peter

2022-10-20 20:21:42

by Reinette Chatre

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group



On 10/20/2022 1:48 AM, Peter Newman wrote:
> On Thu, Oct 20, 2022 at 1:54 AM Reinette Chatre
> <[email protected]> wrote:
>> It is still not clear to me how palatable this will be on Arm systems.
>> This solution also involves changing the CLOSID/PARTID like your original
>> proposal and James highlighted that it would "mess up the bandwidth counters"
>> because of the way PARTID.PMG is used for monitoring. Perhaps even a new
>> PMG would need to be assigned during such a monitor group move. One requirement
>> for this RFD was to keep usage counts intact and from what I understand
>> this will not be possible on Arm systems. There could be software mechanisms
>> to help reduce the noise during the transition. For example, some new limbo
>> mechanism that avoids re-assigning the old PARTID.PMG, while perhaps still
>> using the old PARTID.PMG to read usage counts for a while? Or would the
>> guidance just be that the counters will have some noise after the move?
>
> I'm going to have to follow up on the details of this in James's thread.
> It sounded like we probably won't be able to create enough mon_groups
> under a single control group for the rename feature to even be useful.
> Rather, we expect the PARTID counts to be so much larger than the PMG
> counts that creating more mon_groups to reduce the number of control
> groups wouldn't make sense.
>
> At least in our use case, we're literally creating "classes of service"
> to prioritize memory traffic, so we want a small number of control
> groups to represent the small number of priority levels, but enough
> RMIDs to count every job's traffic independently. For MPAM to support
> this MBM/MBA use case in exactly this fashion, we'd have to develop the
> monitors-not-matching-on-PARTID use case better in the MPAM
> architecture. But before putting much effort into that, I'd want to know
> if there's any payoff beyond being able to use resctrl the same way on
> both implementations.

If the expectation is that PARTID counts are very high then how about
a solution where multiple PARTIDs are associated with the same CTRL_MON group?
A CTRL_MON group presents a resource allocation to user space, CLOSIDs/PARTIDs
are not exposed. So using multiple PARTIDs for a resource group (all with the
same allocation) seems conceptually ok to me. (Please note, I did not do an
audit to see if there are any hidden assumptions or look into the lifting
required to support this.)

So, if a user moves a MON group to a new CTRL_MON group and there are no
PARTID.PMG available in the destination CTRL_MON group to support the move,
then one of the free PARTIDs can be used, automatically assigned the
allocation of the destination CTRL_MON group, and a new monitor group created
using the new PMG range brought along with the new PARTID.

There may also be a way to guide resctrl to do something like this (use
available PARTID) when a user creates a new MON group. This may be a way
to address the earlier concern of how applications can decide to create
lots of MON groups vs CTRL_MON groups.

Reinette

2022-10-21 10:14:39

by Peter Newman

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

On Thu, Oct 20, 2022 at 9:08 PM Reinette Chatre
<[email protected]> wrote:
>
> If the expectation is that PARTID counts are very high then how about
> a solution where multiple PARTIDs are associated with the same CTRL_MON group?
> A CTRL_MON group presents a resource allocation to user space, CLOSIDs/PARTIDs
> are not exposed. So using multiple PARTIDs for a resource group (all with the
> same allocation) seems conceptually ok to me. (Please note, I did not do an
> audit to see if there are any hidden assumptions or look into the lifting
> required to support this.)

I did propose using PARTIDs to back additional mon_groups a few days ago
on the other sub-thread with James. My understanding was that it would
be less trouble if the user opted to do this on their own rather than
the kernel somehow doing this automatically.

https://lore.kernel.org/all/[email protected]/

So perhaps we can just arrive at some way to inform the user of the
difference in resources. We may not even need to be able to precisely
calculate the number of groups we can create, as the logic for us could
be as simple as:

1) If num_closids >= desired job count, just use CTRL_MON groups
2) Otherwise, fall back to the proposed mon_group-move approach if
num_rmids is large enough for the desired job count
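
In shell terms, using the info files resctrl already exposes (assuming L3 is
the resource of interest, and njobs is our hypothetical job count):

  if [ "$(cat /sys/fs/resctrl/info/L3/num_closids)" -ge "$njobs" ]; then
          echo "one CTRL_MON group per job"
  elif [ "$(cat /sys/fs/resctrl/info/L3_MON/num_rmids)" -ge "$njobs" ]; then
          echo "MON group per job, moved between CTRL_MON groups"
  fi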

To address the glitchy behavior of moving a PMG to a new PARTID, I found
that the MPAM spec says explicitly that a PMG is subordinate to a
PARTID, so I would be fine with James finding a way for the MPAM driver
to block the rename operation, because it's unable to mix and match
RMIDs and CLOSIDs the way that RDT can.

-Peter

2022-10-21 13:02:10

by Peter Newman

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

On Thu, Oct 20, 2022 at 12:39 PM Peter Newman <[email protected]> wrote:
>
> On Wed, Oct 19, 2022 at 3:58 PM James Morse <[email protected]> wrote:
> > The devil is in the detail, I'm not sure how it serialises with a fork()ing process, I'd
> > hope to do better than relying on the kernel walking the list of processes a lot quicker
> > than user-space can.
>
> I wasn't planning to do it any more optimally than the rmdir
> implementation today when looking for all tasks impacted by a
> CLOSID/RMID deletion.

This is probably a separate topic, but I noticed this when looking at how rmdir
moves tasks to a new closid/rmid...

In rdt_move_group_tasks(), how do we know that a task switching in on another
CPU will observe the updated closid and rmid values soon enough?

Even on x86, without an smp_mb(), the stores to t->closid and t->rmid could be
reordered with the task_curr(t) and task_cpu(t) reads which follow. The original
description of this scenario seemed to assume that accesses below would happen
in program order:

WRITE_ONCE(t->closid, to->closid);
WRITE_ONCE(t->rmid, to->mon.rmid);

/*
 * If the task is on a CPU, set the CPU in the mask.
 * The detection is inaccurate as tasks might move or
 * schedule before the smp function call takes place.
 * In such a case the function call is pointless, but
 * there is no other side effect.
 */
if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
	cpumask_set_cpu(task_cpu(t), mask);

If the task concurrently switches in on another CPU, the code above may not
observe that it's running, and the CPU running the task may not have observed
the updated rmid and closid yet, so it could continue with the old rmid/closid
and not get interrupted.
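
To make the race concrete, a rough interleaving sketch (r1/r2 are just the
values each side observes; the switch-in side is resctrl_sched_in() reading
current->closid):

  CPU0 (moving the task)                  CPU1 (switching t in)
  ----------------------                  ---------------------
  WRITE_ONCE(t->closid, to->closid);      rq->curr = t;
  r1 = task_curr(t);  /* false */         r2 = READ_ONCE(t->closid); /* old */

This is the classic store-buffering pattern: both CPUs are allowed to order
their load before the other's store, so r1 == false and r2 == old closid is
a permitted outcome - no IPI is sent and t keeps running with stale IDs.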

-Peter

2022-10-21 20:24:14

by Reinette Chatre

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi James,

On 10/19/2022 6:57 AM, James Morse wrote:
> Hi Peter,
>
> On 17/10/2022 11:15, Peter Newman wrote:
>> On Wed, Oct 12, 2022 at 6:55 PM James Morse <[email protected]> wrote:
>>> You originally asked:
>>> | Any concerns about the CLOSID-reusing behavior?
>>>
>>> I don't think this will work well with MPAM ... I expect it will mess up the bandwidth
>>> counters.
>>>
>>> MPAM's equivalent to RMID is PMG. While on x86 CLOSID and RMID are independent numbers,
>>> this isn't true for PARTID (MPAM's version of CLOSID) and PMG. The PMG bits effectively
>>> extended the PARTID with bits that aren't used to look up the configuration.
>>>
>>> x86's monitors match only on RMID, and there are 'enough' RMID... MPAM's monitors are more
>>> complicated. I've seen details of a system that only has 1 bit of PMG space.
>>>
>>> While MPAM's bandwidth monitors can match just the PMG, there aren't expected to be enough
>>> unique PMG for every control/monitor group to have a unique value. Instead, MPAM's
>>> monitors are expected to be used with both the PARTID and PMG.
>>>
>>> ('bandwidth monitors' is relevant here, MPAM's 'cache storage utilisation' monitors can't
>>> match on just PMG at all - they have to be told the PARTID too)
>>>
>>>
>>> If you're re-using CLOSID like this, I think you'll end up with noisy measurements on MPAM
>>> systems as the caches hold PARTID/PMG values from before the re-use pattern changed, and
>>> the monitors have to match on both.
>
>> Yes, that sounds like it would be an issue.
>>
>> Following your refactoring changes, hopefully the MPAM driver could
>> offer alternative methods for managing PARTIDs and PMGs depending on the
>> available hardware resources.
>
> Mmmm, I don't think anything other than one-partid per control group and one-pmg per
> monitor group makes much sense.
>
>
>> If there are a lot more PARTIDs than PMGs, then it would fit well with a
>> user who never creates child MON groups. In case the number of MON
>> groups gets ahead of the number of CTRL_MON groups and you've run out of
>> PMGs, perhaps you would just try to allocate another PARTID and program
>> the same partitioning configuration before giving up.
>
> User-space can choose to do this.
> If the kernel tries to be clever and do this behind user-space's back, it needs to
> allocate two monitors for this secretly-two-control-groups, and always sum the counters
> before reporting them to user-space.

If I understand this scenario correctly, the kernel is already doing this.
As implemented in mon_event_count() the monitor data of a CTRL_MON group is
the sum of the parent CTRL_MON group and all its child MON groups.
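
(Paraphrasing the x86 code from memory, mon_event_count() is roughly:

	__mon_event_count(rdtgrp->mon.rmid, rr);

	if (rdtgrp->type == RDTCTRL_GROUP) {
		list_for_each_entry(entry, &rdtgrp->mon.crdtgrp_list,
				    mon.crdtgrp_list)
			__mon_event_count(entry->mon.rmid, rr);
	}

with rr->val accumulating across the calls.)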

> If monitors are a contended resource, then you may be unable to monitor the
> secretly-two-control-groups group once the kernel has done this.

I am not viewing this as "secretly-two-control-groups" - there would still be
only one parent CTRL_MON group that dictates all the allocations. MON groups already
have a CLOSID (PARTID) property but at this time it is always identical to the parent
CTRL_MON group. The difference introduced is that some of the child MON groups
may have a different CLOSID (PARTID) from the parent.

>
> I don't think the kernel should try to be too clever here.
>

That is a fair concern but it may be worth exploring as it seems to address
a few ABI concerns and user space seems to be eyeing a future "num_closid"
info file as a check of "RDT/PQoS" vs "MPAM".

Reinette



2022-10-21 20:40:28

by Luck, Tony

[permalink] [raw]
Subject: RE: [RFD] resctrl: reassigning a running container's CTRL_MON group

> I am not viewing this as "secretly-two-control-groups" - there would still be
> only one parent CTRL_MON group that dictates all the allocations. MON groups already
> have a CLOSID (PARTID) property but at this time it is always identical to the parent
> CTRL_MON group. The difference introduced is that some of the child MON groups
> may have a different CLOSID (PARTID) from the parent.

What would be the resctrl file system operation to change the CLOSID of a child
MON group?

I followed the "use rename" so the user would:

# mv /sys/fs/resctrl/g1/mon_groups/work1 /sys/fs/resctrl/g2/mon_groups/

to keep the same RMID, but move from "g1" to "g2" to get a different class of service.

-Tony

2022-10-21 22:04:12

by Reinette Chatre

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Tony,

On 10/21/2022 1:22 PM, Luck, Tony wrote:
>> I am not viewing this as "secretly-two-control-groups" - there would still be
>> only one parent CTRL_MON group that dictates all the allocations. MON groups already
>> have a CLOSID (PARTID) property but at this time it is always identical to the parent
>> CTRL_MON group. The difference introduced is that some of the child MON groups
>> may have a different CLOSID (PARTID) from the parent.
>
> What would be the resctrl file system operation to change the CLOSID of a child
> CTRL_MON group?

It could be both mv and mkdir.

>
> I followed the "use rename" so the user would:
>
> # mv /sys/fs/resctrl/g1/mon_groups/work1 /sys/fs/resctrl/g2/mon_groups/
>
> to keep the same RMID, but move from "g1" to "g2" to get a different class of service.

Right. On a (RDT) system where RMIDs are independent from CLOSID then a move
like above would mean that MON group "work1" would keep its RMID and inherit the
CLOSID of CTRL_MON group "g2". On these systems a move like above is smooth
and after the move, CTRL_MON group "g2" and all MON groups within "g2" will
have the same CLOSID. The tasks within "work1" will run with new allocations
associated with CTRL_MON group "g2" while its monitoring counters remain
intact.

What I was responding to was the scenario where a (MPAM) system does
not have many PMGs (similar but different from RMID) and the
PMGs the system does have are dependent on the PARTID (MPAM's CLOSID).
Think about these systems as having counters in the hardware accessed as
CLOSID.RMID (PARTID.PMG), not "just" RMID, and "not many PMGs" may mean
one bit.

That brings two problems:
a) Monitoring counters are not moved "intact" since hardware will have data
for old PARTID.PMG pair while task has new PARTID.PMG.
b) Destination CTRL_MON group may not be able to accommodate new MON group
because of lack of local PMG.

The last few messages in this thread focus on (b).

What Peter and I were wondering was whether resctrl could assign an available
PARTID to a new MON group, with the new PARTID automatically inheriting the
allocation associated with the CTRL_MON group. The CTRL_MON group still dictates
the allocations, but multiple PARTIDs (CLOSIDs) are used to enforce it. As a
reminder, the use case is that the user has two CTRL_MON groups and wants
to have a large number of MON groups within each (one MON group per
container), with the option to move a MON group from one CTRL_MON group
to another.

What we are considering is thus something like this (consider a system with
only one PMG bit, i.e. two PMGs, but many PARTIDs):

# mkdir /sys/fs/resctrl/g1
/* CTRL_MON group "g1" gets CLOSID/PARTID = A and RMID/PMG = 0 */
# mkdir /sys/fs/resctrl/g1/mon_groups/m1
/* MON group "m1" gets CLOSID/PARTID = A and RMID/PMG = 1 */

At this point, due to lack of available PMG, it is not possible to create
a new MON group nor move any MON group to this CTRL_MON group.

The new idea is to support:
# mkdir /sys/fs/resctrl/g1/mon_groups/m2
or
# mv <source MON group> /sys/fs/resctrl/g1/mon_groups/m2
/* MON group "m2" gets PARTID = B (duplicate allocations of PARTID A) and PMG = 0 */

This is expected to be MPAM specific.

Reinette

2022-10-25 16:03:18

by James Morse

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Reinette, Peter,

On 20/10/2022 20:08, Reinette Chatre wrote:
> On 10/20/2022 1:48 AM, Peter Newman wrote:
>> On Thu, Oct 20, 2022 at 1:54 AM Reinette Chatre
>> <[email protected]> wrote:
>>> It is still not clear to me how palatable this will be on Arm systems.
>>> This solution also involves changing the CLOSID/PARTID like your original
>>> proposal and James highlighted that it would "mess up the bandwidth counters"
>>> because of the way PARTID.PMG is used for monitoring. Perhaps even a new
>>> PMG would need to be assigned during such a monitor group move. One requirement
>>> for this RFD was to keep usage counts intact and from what I understand
>>> this will not be possible on Arm systems. There could be software mechanisms
>>> to help reduce the noise during the transition. For example, some new limbo
>>> mechanism that avoids re-assigning the old PARTID.PMG, while perhaps still
>>> using the old PARTID.PMG to read usage counts for a while? Or would the
>>> guidance just be that the counters will have some noise after the move?
>>
>> I'm going to have to follow up on the details of this in James's thread.
>> It sounded like we probably won't be able to create enough mon_groups
>> under a single control group for the rename feature to even be useful.
>> Rather, we expect the PARTID counts to be so much larger than the PMG
>> counts that creating more mon_groups to reduce the number of control
>> groups wouldn't make sense.
>>
>> At least in our use case, we're literally creating "classes of service"
>> to prioritize memory traffic, so we want a small number of control
>> groups to represent the small number of priority levels, but enough
>> RMIDs to count every job's traffic independently. For MPAM to support
>> this MBM/MBA use case in exactly this fashion, we'd have to develop the
>> monitors-not-matching-on-PARTID use case better in the MPAM
>> architecture. But before putting much effort into that, I'd want to know
>> if there's any payoff beyond being able to use resctrl the same way on
>> both implementations.

> If the expectation is that PARTID counts are very high then how about
> a solution where multiple PARTIDs are associated with the same CTRL_MON group?
> A CTRL_MON group presents a resource allocation to user space, CLOSIDs/PARTIDs
> are not exposed. So using multiple PARTIDs for a resource group (all with the
> same allocation) seems conceptually ok to me. (Please note, I did not do an
> audit to see if there are any hidden assumptions or look into the lifting
> required to support this.)

This would work when systems are built to look like RDT, but MPAM has other control types
where this would have interesting behaviours.

'CPOR' is equivalent to CBM as they are both a bitmap of portions. MPAM also has 'CMAX'
where a fraction of the cache is specified. If you create two control groups with
different PARTIDs but the same configuration, their two 50%s of the cache could become
100%. CPOR can be used like this, CMAX can't.


> So, if a user moves a MON group to a new CTRL_MON group and there are no
> PARTID.PMG available in the destination CTRL_MON group to support the move,
> then one of the free PARTIDs can be used, automatically assigned the
> allocation of the destination CTRL_MON group, and a new monitor group created
> using the new PMG range brought along with the new PARTID.

This would be transparent on some hardware, but not on others. It depends what controls
are supported.

Even when the controls behave in the same way, a different PARTID with the same control
values could be regulated differently, resulting in weirdness.


> There may also be a way to guide resctrl to do something like this (use
> available PARTID) when a user creates a new MON group. This may be a way
> to address the earlier concern of how applications can decide to create
> lots of MON groups vs CTRL_MON groups.

I think we should keep this intelligence in user-space.

Exposing a way to indicate how many groups can be created 'at this level' allows
user-space to determine if it's on an RMID-rich machine or a PARTID-rich machine.
If there is a way of moving a group of tasks between control groups, then we'd also need
to expose some indication as to whether the monitors at the old location keep counting
after the move. (which I think is the best way of explaining the difference to user-space)

With these, user-space can change the structure it creates to better fit the resources of
the machine.


Thanks,

James

2022-10-25 16:21:59

by James Morse

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Peter,

On 20/10/2022 11:39, Peter Newman wrote:
> On Wed, Oct 19, 2022 at 3:58 PM James Morse <[email protected]> wrote:
>> This isn't how MPAM is designed to be used. You'll hit nasty corners.
>> The big one is the Cache Storage Utilisation counters.
>>
>> See 11.5.2 of the MPAM spec, "MSMON_CFG_CSU_CTL, MPAM Memory System Monitor Configure
>> Cache Storage Usage Monitor Control Register". Not setting the MATCH_PARTID bit has this
>> warning:
>> | If MATCH_PMG is 1 and MATCH_PARTID is 0, it is CONSTRAINED UNPREDICTABLE whether the
>> | monitor instance:
>> | • Measures the storage used with matching PMG and with any PARTID.
>> | • Measures no storage usage, that is, MSMON_CSU.VALUE is zero.
>> | • Measures the storage used with matching PMG and PARTID, that is, treats
>> | MATCH_PARTID as = 1
>>
>> 'constrained unpredictable' is arm's term for "portable software can't rely on this".
>> The folk that designed MPAM don't believe "monitors would only match on PMGs" makes any
>> sense. A PMG is not an RMID. A case in point is the system with only 1 PMG bit.
>>
>> I'm afraid this approach would preclude support for the llc_occupancy counter, and would
>> artificially reduce the number of control groups that can be created as each control group
>> needs an 'RMID'. On the machine with 1 PMG bit - you get 2 control groups, even though it
>> has many more PARTID.
>
> The first sentence of the Resource Monitoring chapter is also quite an
> obstacle to my challenge to the PARTID-PMG hierarchy:
>
> | Software environments may be labeled as belonging to a Performance
> | Monitoring Group (PMG) within a partition.
>
> It seems like the only real issue is that the user is responsible for
> figuring out how best to make use of the available resources. But I seem
> to recall that was the expectation with resctrl, so I should probably
> stop trying to argue for expecting MPAM configurations which resemble
> RDT.
>
>
>> On 17/10/2022 11:15, Peter Newman wrote:
>>> Provided that there are sufficient monitor
>>> instances, there would never be any need to reprogram a monitor's
>>> PMG.
>>
>> It sounds like this moves the problem to "make everything a monitor group because only
>> monitor groups can be batch moved".
>>
>> If the tasks file could be moved between control and monitor groups, causing resctrl to
>> relabel the tasks - would that solve more of the problem? (it eliminates the need to make
>> everything a monitor group)
>
> This was about preserving the RMID and memory bandwidth counts across a
> CLOSID change. If the user is forced to conserve CTRL_MON groups due to
> a limited number of CLOSIDs, keeping the various containers' tasks
> separate is also a concern.

Ah, of course.


> But if there's no need to conserve CTRL_MON groups, then there's no real
> issue.

Yup. I think part of this is exposing the information user-space needs to make the right
decision.

I don't think we should merge 'task group moving' and 'old monitors keep counting', they
each make sense independently.


>> The devil is in the detail, I'm not sure how it serialises with a fork()ing process, I'd
>> hope to do better than relying on the kernel walking the list of processes a lot quicker
>> than user-space can.
>
> I wasn't planning to do it any more optimally than the rmdir
> implementation today when looking for all tasks impacted by a
> CLOSID/RMID deletion.

Aha - that is the use of for_each_process_thread() which takes the read-lock, instead of
relying on RCU, so it should be safe for processes fork()ing and exit()ing.
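
For anyone following along, the shape of that walk in rdt_move_group_tasks()
is roughly this (paraphrased, IPI handling omitted):

	read_lock(&tasklist_lock);
	for_each_process_thread(p, t) {
		if (!from || is_closid_match(t, from) ||
		    is_rmid_match(t, from)) {
			WRITE_ONCE(t->closid, to->closid);
			WRITE_ONCE(t->rmid, to->mon.rmid);
			/* ... then kick the CPU if the task is running */
		}
	}
	read_unlock(&tasklist_lock);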


Thanks,

James

2022-10-25 16:46:11

by James Morse

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Peter,

On 21/10/2022 11:09, Peter Newman wrote:
> On Thu, Oct 20, 2022 at 9:08 PM Reinette Chatre
> <[email protected]> wrote:
>>
>> If the expectation is that PARTID counts are very high then how about
>> a solution where multiple PARTIDs are associated with the same CTRL_MON group?
>> A CTRL_MON group presents a resource allocation to user space, CLOSIDs/PARTIDs
>> are not exposed. So using multiple PARTIDs for a resource group (all with the
>> same allocation) seems conceptually ok to me. (Please note, I did not do an
>> audit to see if there are any hidden assumptions or look into the lifting
>> required to support this.)

> I did propose using PARTIDs to back additional mon_groups a few days ago
> on the other sub-thread with James. My understanding was that it would
> be less trouble if the user opted to do this on their own rather than
> the kernel somehow doing this automatically.
>
> https://lore.kernel.org/all/[email protected]/

> So perhaps we can just arrive at some way to inform the user of the
> difference in resources. We may not even need to be able to precisely
> calculate the number of groups we can create, as the logic for us could
> be as simple as:
>
> 1) If num_closids >= desired job count, just use CTRL_MON groups

> 2) Otherwise, fall back to the proposed mon_group-move approach if
> num_rmids is large enough for the desired job count

> To address the glitchy behavior of moving a PMG to a new PARTID, I found
> that the MPAM spec says explicitly that a PMG is subordinate to a
> PARTID, so I would be fine with James finding a way for the MPAM driver
> to block the rename operation, because it's unable to mix and match
> RMIDs and CLOSIDs the way that RDT can.

I'd like to support moving groups of tasks in a sensible way on MPAM too.

I don't think we should conflate it with 'old counters keep counting' - that should be
exposed as a separate property that influences how user-space sets this stuff up.


Thanks,

James

2022-10-25 16:55:48

by James Morse

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Peter,

On 21/10/2022 13:42, Peter Newman wrote:
> On Thu, Oct 20, 2022 at 12:39 PM Peter Newman <[email protected]> wrote:
>>
>> On Wed, Oct 19, 2022 at 3:58 PM James Morse <[email protected]> wrote:
>>> The devil is in the detail, I'm not sure how it serialises with a fork()ing process, I'd
>>> hope to do better than relying on the kernel walking the list of processes a lot quicker
>>> than user-space can.
>>
>> I wasn't planning to do it any more optimally than the rmdir
>> implementation today when looking for all tasks impacted by a
>> CLOSID/RMID deletion.
>
> This is probably a separate topic, but I noticed this when looking at how rmdir
> moves tasks to a new closid/rmid...
>
> In rdt_move_group_tasks(), how do we know that a task switching in on another
> CPU will observe the updated closid and rmid values soon enough?
>
> Even on x86, without an smp_mb(), the stores to t->closid and t->rmid could be
> reordered with the task_curr(t) and task_cpu(t) reads which follow. The original
> description of this scenario seemed to assume that accesses below would happen
> in program order:
>
> WRITE_ONCE(t->closid, to->closid);
> WRITE_ONCE(t->rmid, to->mon.rmid);
>
> /*
>  * If the task is on a CPU, set the CPU in the mask.
>  * The detection is inaccurate as tasks might move or
>  * schedule before the smp function call takes place.
>  * In such a case the function call is pointless, but
>  * there is no other side effect.
>  */
> if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
> 	cpumask_set_cpu(task_cpu(t), mask);
>
> If the task concurrently switches in on another CPU, the code above may not
> observe that it's running, and the CPU running the task may not have observed
> the updated rmid and closid yet, so it could continue with the old rmid/closid
> and not get interrupted.

Makes sense to me - do you want to send a patch to fix it?


Thanks,

James

2022-10-26 09:11:55

by Peter Newman

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

On Tue, Oct 25, 2022 at 5:55 PM James Morse <[email protected]> wrote:
> On 21/10/2022 13:42, Peter Newman wrote:
> > Even on x86, without an smp_mb(), the stores to t->closid and t->rmid could be
> > reordered with the task_curr(t) and task_cpu(t) reads which follow. The original
> > description of this scenario seemed to assume that accesses below would happen
> > in program order:
> >
> > WRITE_ONCE(t->closid, to->closid);
> > WRITE_ONCE(t->rmid, to->mon.rmid);
> >
> > /*
> >  * If the task is on a CPU, set the CPU in the mask.
> >  * The detection is inaccurate as tasks might move or
> >  * schedule before the smp function call takes place.
> >  * In such a case the function call is pointless, but
> >  * there is no other side effect.
> >  */
> > if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
> > 	cpumask_set_cpu(task_cpu(t), mask);
> >
> > If the task concurrently switches in on another CPU, the code above may not
> > observe that it's running, and the CPU running the task may not have observed
> > the updated rmid and closid yet, so it could continue with the old rmid/closid
> > and not get interrupted.
>
> Makes sense to me - do you want to send a patch to fix it?

Sure, when I think of a solution. For an smp_mb() to be effective above,
we would need to execute another smp_mb() unconditionally before reading
the closid/rmid fields when switching a task in.

The only quick fix I know will work without badly hurting context switch
time would be to go back to pinging all CPUs following a mass
task-movement operation.

I'll see if I can come up with anything better, though.
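
For reference, the pairing I mean would look roughly like this (a sketch of
the idea only, not a tested patch):

	/* mover side, for each task that is moved: */
	WRITE_ONCE(t->closid, to->closid);
	WRITE_ONCE(t->rmid, to->mon.rmid);

	/*
	 * Order the stores above before the task_curr()/task_cpu()
	 * reads below; pairs with the barrier on the switch-in side.
	 */
	smp_mb();

	if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
		cpumask_set_cpu(task_cpu(t), mask);

	/* switch-in side, and this is the part that hurts: */
	smp_mb();
	closid = READ_ONCE(current->closid);
	rmid = READ_ONCE(current->rmid);

It's that unconditional smp_mb() on every context switch that I'd like to
avoid paying for.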

-Peter

2022-10-26 09:38:33

by Peter Newman

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi James,

On Tue, Oct 25, 2022 at 5:56 PM James Morse <[email protected]> wrote:
> This would work when systems are built to look like RDT, but MPAM has other control types
> where this would have interesting behaviours.
>
> 'CPOR' is equivalent to CBM as they are both a bitmap of portions. MPAM also has 'CMAX'
> where a fraction of the cache is specified. If you create two control groups with
> different PARTIDs but the same configuration, their two 50%s of the cache could become
> 100%. CPOR can be used like this, CMAX can't.

I thought we only allocated caches with CBMs and memory bandwidth with
percentages. I don't see how CMAX could be used when implementing resctrl's
CAT resources. Percentage configurations are only used for MBA in resctrl
today.

> Even when the controls behave in the same way, a different PARTID with the same control
> values could be regulated differently, resulting in weirdness.

Can you provide further examples?

-Peter

2022-10-26 21:35:38

by Reinette Chatre

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Peter,

On 10/26/2022 1:52 AM, Peter Newman wrote:
> On Tue, Oct 25, 2022 at 5:55 PM James Morse <[email protected]> wrote:
>> On 21/10/2022 13:42, Peter Newman wrote:
>>> Even on x86, without an smp_mb(), the stores to t->closid and t->rmid could be
>>> reordered with the task_curr(t) and task_cpu(t) reads which follow. The original
>>> description of this scenario seemed to assume that accesses below would happen
>>> in program order:
>>>
>>> WRITE_ONCE(t->closid, to->closid);
>>> WRITE_ONCE(t->rmid, to->mon.rmid);
>>>
>>> /*
>>>  * If the task is on a CPU, set the CPU in the mask.
>>>  * The detection is inaccurate as tasks might move or
>>>  * schedule before the smp function call takes place.
>>>  * In such a case the function call is pointless, but
>>>  * there is no other side effect.
>>>  */
>>> if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
>>> 	cpumask_set_cpu(task_cpu(t), mask);
>>>
>>> If the task concurrently switches in on another CPU, the code above may not
>>> observe that it's running, and the CPU running the task may not have observed
>>> the updated rmid and closid yet, so it could continue with the old rmid/closid
>>> and not get interrupted.
>>
>> Makes sense to me - do you want to send a patch to fix it?
>
> Sure, when I think of a solution. For an smp_mb() to be effective above,
> we would need to execute another smp_mb() unconditionally before reading
> the closid/rmid fields when switching a task in.
>
> The only quick fix I know will work without badly hurting context switch
> time would be to go back to pinging all CPUs following a mass
> task-movement operation.
>
> I'll see if I can come up with anything better, though.
>

The original concern is "the stores to t->closid and t->rmid could be
reordered with the task_curr(t) and task_cpu(t) reads which follow". I can see
that issue. Have you considered using the compiler barrier, barrier(), instead?
From what I understand it will prevent the compiler from moving the memory accesses.
This is what is currently done in __rdtgroup_move_task() and could be done here also?

Reinette


2022-10-27 08:25:44

by Peter Newman

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Reinette,

On Wed, Oct 26, 2022 at 11:12 PM Reinette Chatre
<[email protected]> wrote:
> The original concern is "the stores to t->closid and t->rmid could be
> reordered with the task_curr(t) and task_cpu(t) reads which follow". I can see
> that issue. Have you considered using the compiler barrier, barrier(), instead?
> From what I understand it will prevent the compiler from moving the memory accesses.
> This is what is currently done in __rdtgroup_move_task() and could be done here also?

A memory system (including x86's) is allowed to reorder a store with a
later load, in addition to the compiler.

Also because the locations in question can be concurrently accessed by another
CPU, a compiler barrier would not be sufficient.

-Peter

2022-10-27 17:43:59

by Reinette Chatre

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Peter,

On 10/27/2022 12:56 AM, Peter Newman wrote:
> On Wed, Oct 26, 2022 at 11:12 PM Reinette Chatre
> <[email protected]> wrote:
>> The original concern is "the stores to t->closid and t->rmid could be
>> reordered with the task_curr(t) and task_cpu(t) reads which follow". I can see
>> that issue. Have you considered using the compiler barrier, barrier(), instead?
>> From what I understand it will prevent the compiler from moving the memory accesses.
>> This is what is currently done in __rdtgroup_move_task() and could be done here also?
>
> A memory system (including x86's) is allowed to reorder a store with a
> later load, in addition to the compiler.
>
> Also because the locations in question can be concurrently accessed by another
> CPU, a compiler barrier would not be sufficient.

This is hard. Regarding the concurrent access from another CPU it seems
that task_rq_lock() is available to prevent races with schedule(). Using this
may be able to prevent task_curr(t) changing during this time and thus the local
reordering may not be a problem. I am not familiar with task_rq_lock() though,
surely there are many details to consider in this area.

Reinette

2022-11-01 16:12:29

by Peter Newman

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Reinette,

On Thu, Oct 27, 2022 at 7:36 PM Reinette Chatre
<[email protected]> wrote:
> On 10/27/2022 12:56 AM, Peter Newman wrote:
> > On Wed, Oct 26, 2022 at 11:12 PM Reinette Chatre
> > <[email protected]> wrote:
> >> The original concern is "the stores to t->closid and t->rmid could be
> >> reordered with the task_curr(t) and task_cpu(t) reads which follow". I can see
> >> that issue. Have you considered using the compiler barrier, barrier(), instead?
> >> From what I understand it will prevent the compiler from moving the memory accesses.
> >> This is what is currently done in __rdtgroup_move_task() and could be done here also?
> >
> > A memory system (including x86's) is allowed to reorder a store with a
> > later load, in addition to the compiler.
> >
> > Also because the locations in question can be concurrently accessed by another
> > CPU, a compiler barrier would not be sufficient.
>
> This is hard. Regarding the concurrent access from another CPU it seems
> that task_rq_lock() is available to prevent races with schedule(). Using this
> may be able to prevent task_curr(t) changing during this time and thus the local
> reordering may not be a problem. I am not familiar with task_rq_lock() though,
> surely there are many details to consider in this area.

Yes it looks like the task's rq_lock would provide the necessary
ordering. It's not feasible to ensure the IPI arrives before the target
task migrates away, but the task would need to obtain the same lock in
order to migrate off of its current CPU, so that alone would ensure the
next migration would observe the updates.

The difficulty is this lock is private to sched/, so I'd have to propose
some API.

It would make sense for the API to return the result of task_curr(t) and
task_cpu(t) to the caller to avoid giving the impression that this
function would be useful for anything other than helping someone do an
smp_call_function targeting a task's CPU.

I'll just have to push a patch and see what people say.

-Peter

2022-11-01 16:23:07

by Peter Newman

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

On Tue, Nov 1, 2022 at 4:23 PM Peter Newman <[email protected]> wrote:
> Yes it looks like the task's rq_lock would provide the necessary
> ordering. It's not feasible to ensure the IPI arrives before the target
> task migrates away, but the task would need to obtain the same lock in
> order to migrate off of its current CPU, so that alone would ensure the
> next migration would observe the updates.
>
> The difficulty is this lock is private to sched/, so I'd have to propose
> some API.

Actually it looks like I can just use task_call_func() to lock down the
task while we do our updates and decide if or where to send IPIs. That
seems easy enough.
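
A rough sketch of what I have in mind (resctrl_update_task() and
resctrl_reload_ids() are made-up names; task_call_func() holds the task's
scheduler state stable while the callback runs):

	static int resctrl_update_task(struct task_struct *t, void *arg)
	{
		struct rdtgroup *to = arg;

		WRITE_ONCE(t->closid, to->closid);
		WRITE_ONCE(t->rmid, to->mon.rmid);

		/*
		 * t can't switch in or out while the scheduler locks
		 * are held; if it moves after we return, the next
		 * switch-in observes the new IDs anyway.
		 */
		return task_curr(t) ? task_cpu(t) : -1;
	}

	int cpu = task_call_func(t, resctrl_update_task, rdtgrp);

	if (cpu >= 0)
		smp_call_function_single(cpu, resctrl_reload_ids, NULL, 1);

(resctrl_reload_ids() being whatever re-runs resctrl_sched_in() on that CPU.)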

-Peter

2022-11-01 17:13:52

by Reinette Chatre

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Peter,

On 11/1/2022 8:53 AM, Peter Newman wrote:
> On Tue, Nov 1, 2022 at 4:23 PM Peter Newman <[email protected]> wrote:
>> Yes it looks like the task's rq_lock would provide the necessary
>> ordering. It's not feasible to ensure the IPI arrives before the target
>> task migrates away, but the task would need to obtain the same lock in
>> order to migrate off of its current CPU, so that alone would ensure the
>> next migration would observe the updates.
>>
>> The difficulty is this lock is private to sched/, so I'd have to propose
>> some API.

I thought that we could do something similar to cgroup_move_task(). Instead
of a new API it seems that the custom is for subsystems to move their scheduler
related code to kernel/sched/. For example, a new
kernel/sched/resctrl.c that implements the task moving code that benefits
from the private sched/ APIs.

But ...

> Actually it looks like I can just use task_call_func() to lock down the
> task while we do our updates and decide if or where to send IPIs. That
> seems easy enough.

Indeed, this does look promising. Thanks for finding that.

If you do pursue something like this I assume that you have some
challenging environments in which to try it out? I am curious about the
user space visible impact of the additional locking on a task move when
the number of tasks being moved is high.

Reinette

2022-11-03 17:11:00

by James Morse

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Reinette,

(I've not got to the last message in this part of the thread yet - I'm out of time this
week, back Monday!)

On 21/10/2022 21:09, Reinette Chatre wrote:
> On 10/19/2022 6:57 AM, James Morse wrote:
>> On 17/10/2022 11:15, Peter Newman wrote:
>>> On Wed, Oct 12, 2022 at 6:55 PM James Morse <[email protected]> wrote:
>>>> You originally asked:
>>>> | Any concerns about the CLOSID-reusing behavior?
>>>>
>>>> I don't think this will work well with MPAM ... I expect it will mess up the bandwidth
>>>> counters.
>>>>
>>>> MPAM's equivalent to RMID is PMG. While on x86 CLOSID and RMID are independent numbers,
>>>> this isn't true for PARTID (MPAM's version of CLOSID) and PMG. The PMG bits effectively
>>>> extended the PARTID with bits that aren't used to look up the configuration.
>>>>
>>>> x86's monitors match only on RMID, and there are 'enough' RMID... MPAM's monitors are more
>>>> complicated. I've seen details of a system that only has 1 bit of PMG space.
>>>>
>>>> While MPAM's bandwidth monitors can match just the PMG, there aren't expected to be enough
>>>> unique PMG for every control/monitor group to have a unique value. Instead, MPAM's
>>>> monitors are expected to be used with both the PARTID and PMG.
>>>>
>>>> ('bandwidth monitors' is relevant here, MPAM's 'cache storage utilisation' monitors can't
>>>> match on just PMG at all - they have to be told the PARTID too)
>>>>
>>>>
>>>> If you're re-using CLOSID like this, I think you'll end up with noisy measurements on MPAM
>>>> systems as the caches hold PARTID/PMG values from before the re-use pattern changed, and
>>>> the monitors have to match on both.
>>
>>> Yes, that sounds like it would be an issue.
>>>
>>> Following your refactoring changes, hopefully the MPAM driver could
>>> offer alternative methods for managing PARTIDs and PMGs depending on the
>>> available hardware resources.
>>
>> Mmmm, I don't think anything other than one-partid per control group and one-pmg per
>> monitor group makes much sense.
>>
>>
>>> If there are a lot more PARTIDs than PMGs, then it would fit well with a
>>> user who never creates child MON groups. In case the number of MON
>>> groups gets ahead of the number of CTRL_MON groups and you've run out of
>>> PMGs, perhaps you would just try to allocate another PARTID and program
>>> the same partitioning configuration before giving up.
>>
>> User-space can choose to do this.
>> If the kernel tries to be clever and do this behind user-space's back, it needs to
>> allocate two monitors for this secretly-two-control-groups, and always sum the counters
>> before reporting them to user-space.

> If I understand this scenario correctly, the kernel is already doing this.
> As implemented in mon_event_count() the monitor data of a CTRL_MON group is
> the sum of the parent CTRL_MON group and all its child MON groups.

That is true. MPAM has an additional headache here as it needs to allocate a monitor in
order to read the counters. If there are enough monitors for each CLOSID*RMID to have one,
then MPAM can export the counter files in the same way RDT does.

While there are systems that have enough monitors, I don't think this is going to be the
norm. To allow systems that don't have a surfeit of monitors to use the counters, I plan
to export the values from resctrl_arch_rmid_read() via perf. (but only for bandwidth counters)

The problem is moving a group of tasks around N groups requires N monitors to be
allocated, and stay allocated until those groups pass through limbo. The perf stuff can't
allocate more monitors once it's started.

Even without perf, the only thing that limits the list of other counters that have to be
read is the number of PARTID*PMG. It doesn't look like a very sensible design.


>> If monitors are a contended resource, then you may be unable to monitor the
>> secretly-two-control-groups group once the kernel has done this.
>
> I am not viewing this as "secretly-two-control-groups" - there would still be
> only one parent CTRL_MON group that dictates all the allocations. MON groups already
> have a CLOSID (PARTID) property but at this time it is always identical to the parent
> CTRL_MON group. The difference introduced is that some of the child MON groups
> may have a different CLOSID (PARTID) from the parent.
>
>>
>> I don't think the kernel should try to be too clever here.

> That is a fair concern but it may be worth exploring as it seems to address
> a few ABI concerns and user space seems to be eyeing a future "num_closid"
> info file as a check of "RDT/PQoS" vs "MPAM".

I think the solution to all this is:
* Add rename support to move a monitor group between two control groups.
** On x86, this is guaranteed to preserve the RMID, so the destination counter continues
unaffected.
** On arm64, the PARTID is also relevant to the monitors, so the old counters will
continue to count.

Whether the old counters keep counting needs exposing to user-space so that it is aware.

To solve Peter's use-case, we also need:
* to expose how many new groups can be created at each level.
This is because MPAM doesn't have a property like num_rmid.


Combined, these should solve the cases Peter describes. User-space can determine if the
platform is control-group-rich or monitor-group-rich, and build the corresponding
structure to make best use of the resources.


Thanks,

James

2022-11-03 17:12:29

by James Morse

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Peter,

On 26/10/2022 10:36, Peter Newman wrote:
> On Tue, Oct 25, 2022 at 5:56 PM James Morse <[email protected]> wrote:
>> This would work when systems are built to look like RDT, but MPAM has other control types
>> where this would have interesting behaviours.
>>
>> 'CPOR' is equivalent to CBM as they are both a bitmap of portions. MPAM also has 'CMAX'
>> where a fraction of the cache is specified. If you create two control groups with
>> different PARTIDs but the same configuration, their two 50%s of the cache could become
>> 100%. CPOR can be used like this, CMAX can't.

> I thought we only allocated caches with CBMs and memory bandwidth with
> percentages.

Those are the existing schema, yes.


> I don't see how CMAX could be used when implementing resctrl's CAT
> resources. Percentage
> configurations are only used for MBA in resctrl today.

The problem is if you say "CLOSID/PARTID are random, it's the configuration that matters",
you've broken all the control types where the regulation is happening based on the PARTID
and the configuration, not the configuration alone.

If you do this, you can't ever have schema that use those configuration schemes.
There is hardware out there that supports these schemes.


>> Even when the controls behave in the same way, a different PARTID with the same control
>> values could be regulated differently, resulting in weirdness.
>
> Can you provide further examples?

CMAX, MBW_MIN and MBW_MAX: You can have 50%, and I can have 50%. Your secret clones which
have different PARTID and a copy of your configuration also get 50%. As far as the
hardware is concerned, we're trying to play with more than 100% of the resource.

I don't know what the memory controller people are building, but naively I think the MBW
MIN/MAX stuff is a more natural fit than a bandwidth bitmap.


You couldn't ever add new configuration schemes that are based on a fraction or percentage
of the resource.



Thanks,

James

2022-11-08 21:44:30

by Reinette Chatre

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi James,

On 11/3/2022 10:06 AM, James Morse wrote:
> Hi Reinette,
>
> (I've not got to the last message in this part of the thread yet - I'm out of time this
> week, back Monday!)
>
> On 21/10/2022 21:09, Reinette Chatre wrote:
>> On 10/19/2022 6:57 AM, James Morse wrote:
>>> On 17/10/2022 11:15, Peter Newman wrote:
>>>> On Wed, Oct 12, 2022 at 6:55 PM James Morse <[email protected]> wrote:

...

>>>> If there are a lot more PARTIDs than PMGs, then it would fit well with a
>>>> user who never creates child MON groups. In case the number of MON
>>>> groups gets ahead of the number of CTRL_MON groups and you've run out of
>>>> PMGs, perhaps you would just try to allocate another PARTID and program
>>>> the same partitioning configuration before giving up.
>>>
>>> User-space can choose to do this.
>>> If the kernel tries to be clever and do this behind user-space's back, it needs to
>>> allocate two monitors for this secretly-two-control-groups, and always sum the counters
>>> before reporting them to user-space.
>
>> If I understand this scenario correctly, the kernel is already doing this.
>> As implemented in mon_event_count() the monitor data of a CTRL_MON group is
>> the sum of the parent CTRL_MON group and all its child MON groups.
>
> That is true. MPAM has an additional headache here as it needs to allocate a monitor in
> order to read the counters. If there are enough monitors for each CLOSID*RMID to have one,
> then MPAM can export the counter files in the same way RDT does.
>
> While there are systems that have enough monitors, I don't think this is going to be the
> norm. To allow systems that don't have a surfeit of monitors to use the counters, I plan
> to export the values from resctrl_arch_rmid_read() via perf. (but only for bandwidth counters)

This sounds related to the way monitoring was done in earlier kernels. This was
long before I became involved with this work. Unfortunately I am not familiar with
all the history involved that ended in it being removed from the kernel. Looks like
this was around v4.6, here is a sample commit that may help point to what was done:

commit 33c3cc7acfd95968d74247f1a4e1b0727a07ed43
Author: Vikas Shivappa <[email protected]>
Date: Thu Mar 10 15:32:09 2016 -0800

perf/x86/mbm: Add Intel Memory B/W Monitoring enumeration and init


Looking at some history there even seems to have been some work surrounding
"rotating" of RMIDs that seem related to what you mention above:

commit bff671dba7981195a644a5dc210d65de8ae2d251
Author: Matt Fleming <[email protected]>
Date: Fri Jan 23 18:45:47 2015 +0000

perf/x86/intel: Perform rotation on Intel CQM RMIDs

There are many use cases where people will want to monitor more tasks
than there exist RMIDs in the hardware, meaning that we have to perform
some kind of multiplexing.
...


>
> The problem is moving a group of tasks around N groups requires N monitors to be
> allocated, and stay allocated until those groups pass through limbo. The perf stuff can't
> allocate more monitors once it's started.
>
> Even without perf, the only thing that limits the list of other counters that have to be
> read is the number of PARTID*PMG. It doesn't look like a very sensible design.
>
>
>>> If monitors are a contended resource, then you may be unable to monitor the
>>> secretly-two-control-groups group once the kernel has done this.
>>
>> I am not viewing this as "secretly-two-control-groups" - there would still be
>> only one parent CTRL_MON group that dictates all the allocations. MON groups already
>> have a CLOSID (PARTID) property but at this time it is always identical to the parent
>> CTRL_MON group. The difference introduced is that some of the child MON groups
>> may have a different CLOSID (PARTID) from the parent.
>>
>>>
>>> I don't think the kernel should try to be too clever here.
>
>> That is a fair concern but it may be worth exploring as it seems to address
>> a few ABI concerns and user space seems to be eyeing using a future "num_closid"
>> info as a check of "RDT/PQoS" vs "MPAM".
>
> I think the solution to all this is:
> * Add rename support to move a monitor group between two control groups.
> ** On x86, this is guaranteed to preserve the RMID, so the destination counter continues
> unaffected.
> ** On arm64, the PARTID is also relevant to the monitors, so the old counters will
> continue to count.

This looks like the solution to me also.

The details of the arm64 support are not clear to me though. The destination
group may not have enough PMG to host the new group so failures need to be
handled. As you mention also, the old counters will continue to count.
I assume that you mean the hardware will still have a record of the occupancy
and that needs some time to dissipate? I assume this would fall under the
limbo handling so in some scenarios (for example the just moved monitor
group used the last PMG) it may take some time for the source control
group to allow a new monitor group? The new counters will also not
reflect the task's history.

Moving an arm64 monitor group may thus have a few surprises for user
space while sounding complex to support. Would adding all this additional
support be worth it if the guidance to user space is to instead create many
control groups in such a control-group-rich environment?

> Whether the old counters keep counting needs exposing to user-space so that it is aware.

Could you please elaborate? Do old counters not always keep counting?

> To solve Peter's use-case, we also need:
> * to expose how many new groups can be created at each level.
> This is because MPAM doesn't have a property like num_rmid.

Unfortunately num_rmid is part of the user space interface. While MPAM
does not have "RMIDs" it seems that num_rmid can still be relevant
based on what it is described to represent in Documentation/x86/resctrl.rst:
"This is the upper bound for how many "CTRL_MON" + "MON" groups can
be created."

> Combined, these should solve the cases Peter describes. User-space can determine if the
> platform is control-group-rich or monitor-group-rich, and build the corresponding
> structure to make best use of the resources.

Sounds good to me.

Reinette


2022-11-08 22:09:46

by Luck, Tony

[permalink] [raw]
Subject: RE: [RFD] resctrl: reassigning a running container's CTRL_MON group

> Looking at some history there even seems to have been some work surrounding
> "rotating" of RMIDs that seem related to what you mention above:
>
> commit bff671dba7981195a644a5dc210d65de8ae2d251
> Author: Matt Fleming <[email protected]>
> Date: Fri Jan 23 18:45:47 2015 +0000
>
> perf/x86/intel: Perform rotation on Intel CQM RMIDs
>
> There are many use cases where people will want to monitor more tasks
> than there exist RMIDs in the hardware, meaning that we have to perform
> some kind of multiplexing.

That would work for monitoring memory bandwidth. But not for LLC occupancy
as there's no way to set an occupancy counter to the value of what the new set of
processes are using. So you'd have to live with nonsense values for a potentially
long time until natural LLC evictions and re-fills sorted things out. Either that or
flush the entire LLC when reassigning an RMID so you can count up from zero
as the cache is re-filled.

-Tony

2022-11-08 22:11:37

by Reinette Chatre

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi James,

On 11/3/2022 10:06 AM, James Morse wrote:
> Hi Peter,
>
> On 26/10/2022 10:36, Peter Newman wrote:
>> On Tue, Oct 25, 2022 at 5:56 PM James Morse <[email protected]> wrote:
>>> This would work when systems are built to look like RDT, but MPAM has other control types
>>> where this would have interesting behaviours.
>>>
>>> 'CPOR' is equivalent to CBM as they are both a bitmap of portions. MPAM also has 'CMAX'
>>> where a fraction of the cache is specified. If you create two control groups with
>>> different PARTIDs but the same configuration, their two 50%s of the cache could become
>>> 100%. CPOR can be used like this, CMAX can't.
>
>> I thought we only allocated caches with CBMs and memory bandwidth with
>> percentages.
>
> Those are the existing schema, yes.
>
>
>> I don't see how CMAX could be used when implementing resctrl's CAT
>> resources. Percentage
>> configurations are only used for MBA in resctrl today.
>
> The problem is if you say "CLOSID/PARTID are random, it's the configuration that matters",
> you've broken all the control types where the regulation is happening based on the PARTID
> and the configuration, not the configuration alone.
>
> If you do this, you can't ever have schema that use those configuration schemes.
> There is hardware out there that supports these schemes.
>
>
>>> Even when the controls behave in the same way, a different PARTID with the same control
>>> values could be regulated differently, resulting in weirdness.
>>
>> Can you provide further examples?
>
> CMAX, MBW_MIN and MBW_MAX: You can have 50%, and I can have 50%. Your secret clones, which
> have a different PARTID and a copy of your configuration, also get 50%. As far as the
> hardware is concerned, we're trying to play with more than 100% of the resource.
>
> I don't know what the memory controller people are building, but naively I think the MBW
> MIN/MAX stuff is a more natural fit than a bandwidth bitmap.
>
>
> You couldn't ever add new configuration schemes that are based on a fraction or percentage
> of the resource.

Thank you very much for catching this early and highlighting this. Yes,
MBA also falls into this category so using different PARTID/CLOSID in the
same control group is not an option.

Reinette



2022-11-08 23:52:32

by Reinette Chatre

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group



On 11/8/2022 1:56 PM, Luck, Tony wrote:
>> Looking at some history there even seems to have been some work surrounding
>> "rotating" of RMIDs that seem related to what you mention above:
>>
>> commit bff671dba7981195a644a5dc210d65de8ae2d251
>> Author: Matt Fleming <[email protected]>
>> Date: Fri Jan 23 18:45:47 2015 +0000
>>
>> perf/x86/intel: Perform rotation on Intel CQM RMIDs
>>
>> There are many use cases where people will want to monitor more tasks
>> than there exist RMIDs in the hardware, meaning that we have to perform
>> some kind of multiplexing.
>
> That would work for monitoring memory bandwidth. But not for LLC occupancy
> as there's no way to set an occupancy counter to the value of what the new set of
> processes are using. So you'd have to live with nonsense values for a potentially
> long time until natural LLC evictions and re-fills sorted things out. Either that or
> flush the entire LLC when reassigning an RMID so you can count up from zero
> as the cache is re-filled.

Tony helped me to find some more history here. Please see the commit message
of the patch below for some information on why the perf support was removed.
This is not all specific to monitoring of cache occupancy.

commit c39a0e2c8850f08249383f2425dbd8dbe4baad69
Author: Vikas Shivappa <[email protected]>
Date: Tue Jul 25 14:14:20 2017 -0700

x86/perf/cqm: Wipe out perf based cqm

Reinette


2022-11-09 10:51:41

by Peter Newman

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Reinette,

On Tue, Nov 8, 2022 at 10:28 PM Reinette Chatre
<[email protected]> wrote:
> On 11/3/2022 10:06 AM, James Morse wrote:
> > That is true. MPAM has an additional headache here as it needs to allocate a monitor in
> > order to read the counters. If there are enough monitors for each CLOSID*RMID to have one,
> > then MPAM can export the counter files in the same way RDT does.
> >
> > While there are systems that have enough monitors, I don't think this is going to be the
> > norm. To allow systems that don't have a surfeit of monitors to use the counters, I plan
> > to export the values from resctrl_arch_rmid_read() via perf. (but only for bandwidth counters)
>
> This sounds related to the way monitoring was done in earlier kernels. This was
> long before I became involved with this work. Unfortunately I am not familiar with
> all the history involved that ended in it being removed from the kernel. Looks like
> this was around v4.6, here is a sample commit that may help point to what was done:

Sort of related, this is a problem we have to work around on AMD
implementations that I will be sharing a patch for soon.

Note the second paragraph at the top of page 13:

https://developer.amd.com/wp-content/resources/56375_1.00.pdf

AMD QoS often provides fewer counters than RMIDs, but the architecture
promises there will be at least as many counters in a QoS domain as
CPUs. Using this we can permanently pin RMIDs to CPUs and read the
counters on every task switch to implement MBM RMIDs in software.

This has the caveats that evictions while one task is running could have
resulted from a previous task on the current CPU, but will be counted
against the new task's software-RMID, and that CMT doesn't work.
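
Roughly, the per-CPU bookkeeping could look like the following sketch (all names here are
hypothetical; the real patch may differ):

    struct swrmid_state {
            u64 prev_bytes; /* hardware MBM count at the last task switch */
    };

    static DEFINE_PER_CPU(struct swrmid_state, swrmid_state);
    static u64 swrmid_bytes[NR_SOFT_RMIDS];     /* per-group totals, hypothetical */

    /* Called with preemption disabled when @soft_rmid's task switches out. */
    void swrmid_sched_out(u32 soft_rmid)
    {
            struct swrmid_state *s = this_cpu_ptr(&swrmid_state);
            /* hw_mbm_read()/this_cpu_hw_rmid() stand in for reading the
             * counter of the RMID permanently pinned to this CPU */
            u64 now = hw_mbm_read(this_cpu_hw_rmid());
            u64 delta = now - s->prev_bytes;

            s->prev_bytes = now;
            /* Charge the delta to the task switching out; as noted above,
             * this can include evictions caused by earlier tasks. */
            swrmid_bytes[soft_rmid] += delta;
    }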

I will propose making this available as a mount option for cloud container
use cases which need to monitor a large number of tasks on B/W counter-poor
systems, and of course don't need CMT.

> [...]
>
> > I think the solution to all this is:
> > * Add rename support to move a monitor group between two control groups.
> > ** On x86, this is guaranteed to preserve the RMID, so the destination counter continues
> > unaffected.
> > ** On arm64, the PARTID is also relevant to the monitors, so the old counters will
> > continue to count.
>
> This looks like the solution to me also.
>
> The details of the arm64 support are not clear to me though. The destination
> group may not have enough PMG to host the new group so failures need to be
> handled. As you mention also, the old counters will continue to count.
> I assume that you mean the hardware will still have a record of the occupancy
> and that needs some time to dissipate? I assume this would fall under the
> limbo handling so in some scenarios (for example the just moved monitor
> group used the last PMG) it may take some time for the source control
> group to allow a new monitor group? The new counters will also not
> reflect the task's history.
>
> Moving an arm64 monitor group may thus have a few surprises for user
> space while sounding complex to support. Would adding all this additional
> support be worth it if the guidance to user space is to instead create many
> control groups in such a control-group-rich environment?
>
> > Whether the old counters keep counting needs exposing to user-space so that it is aware.
>
> Could you please elaborate? Do old counters not always keep counting?

Based on this, is it even worth it to allocate PMGs given that the
systems James has seen so far only have a single PMG bit? All this will
get us is the ability to create a single child mon_group in each control
group. This seems too limiting for the feature to be useful.

-Peter

2022-11-09 18:10:39

by James Morse

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Reinette, Tony,

On 08/11/2022 23:18, Reinette Chatre wrote:
> On 11/8/2022 1:56 PM, Luck, Tony wrote:
>>> Looking at some history there even seems to have been some work surrounding
>>> "rotating" of RMIDs that seem related to what you mention above:
>>>
>>> commit bff671dba7981195a644a5dc210d65de8ae2d251
>>> Author: Matt Fleming <[email protected]>
>>> Date: Fri Jan 23 18:45:47 2015 +0000
>>>
>>> perf/x86/intel: Perform rotation on Intel CQM RMIDs
>>>
>>> There are many use cases where people will want to monitor more tasks
>>> than there exist RMIDs in the hardware, meaning that we have to perform
>>> some kind of multiplexing.
>>
>> That would work for monitoring memory bandwidth. But not for LLC occupancy
>> as there's no way to set an occupancy counter to the value of what the new set of
>> processes are using. So you'd have to live with nonsense values for a potentially
>> long time until natural LLC evictions and re-fills sorted things out. Either that or
>> flush the entire LLC when reassigning an RMID so you can count up from zero
>> as the cache is re-filled.
>
> Tony helped me to find some more history here. Please see the commit message
> of the patch below for some information on why the perf support was removed.
> This is not all specific to monitoring of cache occupancy.

Thanks!

I'll be sure to cite this in the future perf support, and check I've covered all the
issues described here.


James


> commit c39a0e2c8850f08249383f2425dbd8dbe4baad69
> Author: Vikas Shivappa <[email protected]>
> Date: Tue Jul 25 14:14:20 2017 -0700
>
> x86/perf/cqm: Wipe out perf based cqm

2022-11-09 18:20:53

by James Morse

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Reinette,

On 08/11/2022 21:28, Reinette Chatre wrote:
> On 11/3/2022 10:06 AM, James Morse wrote:
>> (I've not got to the last message in this part of the thread yet - I'm out of time this
>> week, back Monday!)
>>
>> On 21/10/2022 21:09, Reinette Chatre wrote:
>>> On 10/19/2022 6:57 AM, James Morse wrote:
>>>> On 17/10/2022 11:15, Peter Newman wrote:
>>>>> On Wed, Oct 12, 2022 at 6:55 PM James Morse <[email protected]> wrote:
>
> ...
>
>>>>> If there are a lot more PARTIDs than PMGs, then it would fit well with a
>>>>> user who never creates child MON groups. In case the number of MON
>>>>> groups gets ahead of the number of CTRL_MON groups and you've run out of
>>>>> PMGs, perhaps you would just try to allocate another PARTID and program
>>>>> the same partitioning configuration before giving up.
>>>>
>>>> User-space can choose to do this.
>>>> If the kernel tries to be clever and do this behind user-space's back, it needs to
>>>> allocate two monitors for this secretly-two-control-groups, and always sum the counters
>>>> before reporting them to user-space.
>>
>>> If I understand this scenario correctly, the kernel is already doing this.
>>> As implemented in mon_event_count() the monitor data of a CTRL_MON group is
>>> the sum of the parent CTRL_MON group and all its child MON groups.
>>
>> That is true. MPAM has an additional headache here as it needs to allocate a monitor in
>> order to read the counters. If there are enough monitors for each CLOSID*RMID to have one,
>> then MPAM can export the counter files in the same way RDT does.
>>
>> While there are systems that have enough monitors, I don't think this is going to be the
>> norm. To allow systems that don't have a surfeit of monitors to use the counters, I plan
>> to export the values from resctrl_arch_rmid_read() via perf. (but only for bandwidth counters)

> This sounds related to the way monitoring was done in earlier kernels. This was
> long before I became involved with this work. Unfortunately I am not familiar with
> all the history involved that ended in it being removed from the kernel.

Yup, I'm aware there is some history to this. It's not appropriate for the llc_occupancy
counter as that reports state, instead of events.


> Looks like
> this was around v4.6, here is a sample commit that may help point to what was done:
>
> commit 33c3cc7acfd95968d74247f1a4e1b0727a07ed43
> Author: Vikas Shivappa <[email protected]>
> Date: Thu Mar 10 15:32:09 2016 -0800
>
> perf/x86/mbm: Add Intel Memory B/W Monitoring enumeration and init
>
>
> Looking at some history there even seems to have been some work surrounding
> "rotating" of RMIDs that seem related to what you mention above:
>
> commit bff671dba7981195a644a5dc210d65de8ae2d251
> Author: Matt Fleming <[email protected]>
> Date: Fri Jan 23 18:45:47 2015 +0000
>
> perf/x86/intel: Perform rotation on Intel CQM RMIDs
>
> There are many use cases where people will want to monitor more tasks
> than there exist RMIDs in the hardware, meaning that we have to perform
> some kind of multiplexing.
> ...
>

Thanks - this one was new. (I can't see how that would work reliably!)

The perf stuff is a way off, but it is an influence on how some of the MPAM monitoring
stuff has been done.

Initial support will only be for systems that have enough hardware monitors for each
control/monitor group to have one. This is the simplest to support in software, but is
costly for the hardware.


>> The problem is moving a group of tasks around N groups requires N monitors to be
>> allocated, and stay allocated until those groups pass through limbo. The perf stuff can't
>> allocate more monitors once it's started.
>>
>> Even without perf, the only thing that limits the list of other counters that have to be
>> read is the number of PARTID*PMG. It doesn't look like a very sensible design.
>>
>>
>>>> If monitors are a contended resource, then you may be unable to monitor the
>>>> secretly-two-control-groups group once the kernel has done this.
>>>
>>> I am not viewing this as "secretly-two-control-groups" - there would still be
>>> only one parent CTRL_MON group that dictates all the allocations. MON groups already
>>> have a CLOSID (PARTID) property but at this time it is always identical to the parent
>>> CTRL_MON group. The difference introduced is that some of the child MON groups
>>> may have a different CLOSID (PARTID) from the parent.
>>>
>>>>
>>>> I don't think the kernel should try to be too clever here.
>>
>>> That is a fair concern but it may be worth exploring as it seems to address
>>> a few ABI concerns and user space seems to be eyeing using a future "num_closid"
>>> info as a check of "RDT/PQoS" vs "MPAM".
>>
>> I think the solution to all this is:
>> * Add rename support to move a monitor group between two control groups.
>> ** On x86, this is guaranteed to preserve the RMID, so the destination counter continues
>> unaffected.
>> ** On arm64, the PARTID is also relevant to the monitors, so the old counters will
>> continue to count.

> This looks like the solution to me also.

Great. I've had a stab at implementing it so we can have a more concrete discussion.


> The details of the arm64 support are not clear to me though. The destination
> group may not have enough PMG to host the new group so failures need to be
> handled.

> As you mention also, the old counters will continue to count.
> I assume that you mean the hardware will still have a record of the occupancy
> and that needs some time to dissipate?

Yes,


> I assume this would fall under the
> limbo handling so in some scenarios (for example the just moved monitor
> group used the last PMG) it may take some time for the source control
> group to allow a new monitor group?

Yup!


> The new counters will also not reflect the task's history.

Indeed. I anticipate user-space is sampling this file periodically, otherwise it can't
calculate a MB/s from the raw byte-count. I don't think losing the history is a problem.
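
For illustration, a minimal user-space sketch of that periodic sampling (the path follows
the usual mon_data layout but the group name is arbitrary):

    #include <stdio.h>
    #include <unistd.h>

    static unsigned long long read_bytes(const char *path)
    {
            unsigned long long v = 0;
            FILE *f = fopen(path, "r");

            if (f) {
                    if (fscanf(f, "%llu", &v) != 1)
                            v = 0;
                    fclose(f);
            }
            return v;
    }

    int main(void)
    {
            const char *p = "/sys/fs/resctrl/latency/mon_data/mon_L3_00/mbm_total_bytes";
            unsigned long long prev = read_bytes(p);

            sleep(1);   /* two samples turn the raw byte-count into a rate */
            printf("%.2f MB/s\n", (read_bytes(p) - prev) / 1e6);
            return 0;
    }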

The state before the change being lost could be a problem, but this is a difference from
the way MPAM works. I think it's best to just expose this property to user-space, as I
don't think it's feasible to work around.

User-space would probably ignore the counter for a period of time after the move, as
depending on where the regulation is happening, it may take a little while for the CLOSID
change to take effect.


> Moving an arm64 monitor group may thus have a few surprises for user
> space while sounding complex to support. Would adding all this additional
> support be worth it if the guidance to user space is to instead create many
> control groups in such a control-group-rich environment?

I'd prefer it didn't exist at all, but if there are reasons to support it on x86, I'd like
the MPAM support to be as similar as possible. I'm willing to accept (advertised!) noise
in the counters, but a whole missing syscall is a harder sell.


>> Whether the old counters keep counting needs exposing to user-space so that it is aware.
>
> Could you please elaborate? Do old counters not always keep counting?

It's not new - but the expectation is the mv/rename support does this atomically without
glitching/resetting the counters. Because of that new expectation, I think it needs
exposing to user-space.

Something should be indicated to user-space so it knows it can move monitor groups around,
otherwise it's another 'try it and see'.


>> To solve Peter's use-case, we also need:
>> * to expose how many new groups can be created at each level.
>> This is because MPAM doesn't have a property like num_rmid.

> Unfortunately num_rmid is part of the user space interface. While MPAM
> does not have "RMIDs" it seems that num_rmid can still be relevant
> based on what it is described to represent in Documentation/x86/resctrl.rst:
> "This is the upper bound for how many "CTRL_MON" + "MON" groups can
> be created."

I agree it can't be removed, and MPAM systems will need to put a value there.
The problem is 'rmid' has a well-known definition, even if the kernel documentation is
nuanced.

This might be contentious, but ideally I'd 'deprecate' num_rmid, and split it into two
properties that don't reference an architecture. (Obviously the files have to stay for at
least the next 10 years!)


>> Combined, these should solve the cases Peter describes. User-space can determine if the
>> platform is control-group-rich or monitor-group-rich, and build the corresponding
>> structure to make best use of the resources.
>
> Sounds good to me.


Thanks,

James

2022-11-09 19:21:49

by Reinette Chatre

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Peter,

On 11/9/2022 1:50 AM, Peter Newman wrote:
> Hi Reinette,
>
> On Tue, Nov 8, 2022 at 10:28 PM Reinette Chatre
> <[email protected]> wrote:
>> On 11/3/2022 10:06 AM, James Morse wrote:
>>> That is true. MPAM has an additional headache here as it needs to allocate a monitor in
>>> order to read the counters. If there are enough monitors for each CLOSID*RMID to have one,
>>> then MPAM can export the counter files in the same way RDT does.
>>>
>>> While there are systems that have enough monitors, I don't think this is going to be the
>>> norm. To allow systems that don't have a surfeit of monitors to use the counters, I plan
>>> to export the values from resctrl_arch_rmid_read() via perf. (but only for bandwidth counters)
>>
>> This sounds related to the way monitoring was done in earlier kernels. This was
>> long before I became involved with this work. Unfortunately I am not familiar with
>> all the history involved that ended in it being removed from the kernel. Looks like
>> this was around v4.6, here is a sample commit that may help point to what was done:
>
> Sort of related, this is a problem we have to work around on AMD
> implementations that I will be sharing a patch for soon.
>
> Note the second paragraph at the top of page 13:
>
> https://developer.amd.com/wp-content/resources/56375_1.00.pdf
>
> AMD QoS often provides fewer counters than RMIDs, but the architecture
> promises there will be at least as many counters in a QoS domain as
> CPUs. Using this we can permanently pin RMIDs to CPUs and read the
> counters on every task switch to implement MBM RMIDs in software.
>
> This has the caveats that evictions while one task is running could have
> resulted from a previous task on the current CPU, but will be counted
> against the new task's software-RMID, and that CMT doesn't work.
>
> I will propose making this available as a mount option for cloud container
> use cases which need to monitor a large number of tasks on B/W counter-poor
> systems, and of course don't need CMT.

Thank you for the notice.

>
>> [...]
>>
>>> I think the solution to all this is:
>>> * Add rename support to move a monitor group between two control groups.
>>> ** On x86, this is guaranteed to preserve the RMID, so the destination counter continues
>>> unaffected.
>>> ** On arm64, the PARTID is also relevant to the monitors, so the old counters will
>>> continue to count.
>>
>> This looks like the solution to me also.
>>
>> The details of the arm64 support are not clear to me though. The destination
>> group may not have enough PMG to host the new group so failures need to be
>> handled. As you mention also, the old counters will continue to count.
>> I assume that you mean the hardware will still have a record of the occupancy
>> and that needs some time to dissipate? I assume this would fall under the
>> limbo handling so in some scenarios (for example the just moved monitor
>> group used the last PMG) it may take some time for the source control
>> group to allow a new monitor group? The new counters will also not
>> reflect the task's history.
>>
>> Moving an arm64 monitor group may thus have a few surprises for user
>> space while sounding complex to support. Would adding all this additional
>> support be worth it if the guidance to user space is to instead create many
>> control groups in such a control-group-rich environment?
>>
>>> Whether the old counters keep counting needs exposing to user-space so that it is aware.
>>
>> Could you please elaborate? Do old counters not always keep counting?
>
> Based on this, is it even worth it to allocate PMGs given that the
> systems James has seen so far only have a single PMG bit? All this will
> get us is the ability to create a single child mon_group in each control
> group. This seems too limiting for the feature to be useful.

I'll mostly defer to James here. From my side I do not see motivation to
not support environments in which only one monitor group can be created.
My concern was the additional complexity involved in supporting
"mv" of monitor groups in such a constrained environment, but I understand
from James (re. https://lore.kernel.org/lkml/[email protected]/)
that it is worth it.

Reinette

2022-11-09 19:27:18

by Reinette Chatre

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi James,

On 11/9/2022 9:59 AM, James Morse wrote:
> Hi Reinette,
>
> On 08/11/2022 21:28, Reinette Chatre wrote:
>> On 11/3/2022 10:06 AM, James Morse wrote:
>>> (I've not got to the last message in this part of the thread yet - I'm out of time this
>>> week, back Monday!)
>>>
>>> On 21/10/2022 21:09, Reinette Chatre wrote:
>>>> On 10/19/2022 6:57 AM, James Morse wrote:
>>>>> On 17/10/2022 11:15, Peter Newman wrote:
>>>>>> On Wed, Oct 12, 2022 at 6:55 PM James Morse <[email protected]> wrote:
>>
>> ...
>>
>>>>>> If there are a lot more PARTIDs than PMGs, then it would fit well with a
>>>>>> user who never creates child MON groups. In case the number of MON
>>>>>> groups gets ahead of the number of CTRL_MON groups and you've run out of
>>>>>> PMGs, perhaps you would just try to allocate another PARTID and program
>>>>>> the same partitioning configuration before giving up.
>>>>>
>>>>> User-space can choose to do this.
>>>>> If the kernel tries to be clever and do this behind user-space's back, it needs to
>>>>> allocate two monitors for this secretly-two-control-groups, and always sum the counters
>>>>> before reporting them to user-space.
>>>
>>>> If I understand this scenario correctly, the kernel is already doing this.
>>>> As implemented in mon_event_count() the monitor data of a CTRL_MON group is
>>>> the sum of the parent CTRL_MON group and all its child MON groups.
>>>
>>> That is true. MPAM has an additional headache here as it needs to allocate a monitor in
>>> order to read the counters. If there are enough monitors for each CLOSID*RMID to have one,
>>> then MPAM can export the counter files in the same way RDT does.
>>>
>>> While there are systems that have enough monitors, I don't think this is going to be the
>>> norm. To allow systems that don't have a surfeit of monitors to use the counters, I plan
>>> to export the values from resctrl_arch_rmid_read() via perf. (but only for bandwidth counters)
>
>> This sounds related to the way monitoring was done in earlier kernels. This was
>> long before I became involved with this work. Unfortunately I am not familiar with
>> all the history involved that ended in it being removed from the kernel.
>
> Yup, I'm aware there is some history to this. It's not appropriate for the llc_occupancy
> counter as that reports state, instead of events.

Perf counts events while a process is running, so memory bandwidth monitoring may
also be impacted by the caveats Peter mentioned for the upcoming AMD changes:

https://lore.kernel.org/lkml/CALPaoCidd+WwGTyE3D74LhoL13ce+EvdTmOnyPrQN62j+zZ1fg@mail.gmail.com/
("This has the caveats that evictions while one task is running could have
resulted from a previous task on the current CPU, but will be counted
against the new task's software-RMID, ...")

...
>> The new counters will also not reflect the task's history.
>
> Indeed. I anticipate user-space is sampling this file periodically, otherwise it can't
> calculate a MB/s from the raw byte-count. I don't think losing the history is a problem.

Indeed. Cache occupancy may experience more corner cases depending on
the workloads. Your point that user space needs to know how/that counters
are impacted is important.

>
> The state before the change being lost could be a problem, but this is a difference from
> the way MPAM works. I think it's best to just expose this property to user-space, as I
> don't think it's feasible to work around.
>
> User-space would probably ignore the counter for a period of time after the move, as
> depending on where the regulation is happening, it may take a little while for the CLOSID
> change to take effect.

Agree.


>> Moving an arm64 monitor group may thus have a few surprises for user
>> space while sounding complex to support. Would adding all this additional
>> support be worth it if the guidance to user space is to instead create many
>> control groups in such a control-group-rich environment?
>
> I'd prefer it didn't exist at all, but if there are reasons to support it on x86, I'd like
> the MPAM support to be as similar as possible. I'm willing to accept (advertised!) noise
> in the counters, but a whole missing syscall is a harder sell.

ok.

>
>
>>> Whether the old counters keep counting needs exposing to user-space so that it is aware.
>>
>> Could you please elaborate? Do old counters not always keep counting?
>
> It's not new - but the expectation is the mv/rename support does this atomically without
> glitching/resetting the counters. Because of that new expectation, I think it needs
> exposing to user-space.
>
> Something should be indicated to user-space so it knows it can move monitor groups around,
> otherwise it's another 'try it and see'.

ok.

>
>>> To solve Peter's use-case, we also need:
>>> * to expose how many new groups can be created at each level.
>>> This is because MPAM doesn't have a property like num_rmid.
>
>> Unfortunately num_rmid is part of the user space interface. While MPAM
>> does not have "RMIDs" it seems that num_rmid can still be relevant
>> based on what it is described to represent in Documentation/x86/resctrl.rst:
>> "This is the upper bound for how many "CTRL_MON" + "MON" groups can
>> be created."
>
> I agree it can't be removed, and MPAM systems will need to put a value there.
> The problem is 'rmid' has a well-known definition, even if the kernel documentation is
> nuanced.
>
> This might be contentious, but ideally I'd 'deprecate' num_rmid, and split it into two
> properties that don't reference an architecture. (Obviously the files have to stay for at
> least the next 10 years!)

I think this may be difficult considering the various user space clients
already in use, but doing so is reasonable.

Reinette


2022-11-11 18:46:36

by James Morse

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Reinette, Peter,

On 09/11/2022 19:11, Reinette Chatre wrote:
> On 11/9/2022 1:50 AM, Peter Newman wrote:
>> On Tue, Nov 8, 2022 at 10:28 PM Reinette Chatre
>> <[email protected]> wrote:
>>> On 11/3/2022 10:06 AM, James Morse wrote:
>>>> That is true. MPAM has an additional headache here as it needs to allocate a monitor in
>>>> order to read the counters. If there are enough monitors for each CLOSID*RMID to have one,
>>>> then MPAM can export the counter files in the same way RDT does.
>>>>
>>>> While there are systems that have enough monitors, I don't think this is going to be the
>>>> norm. To allow systems that don't have a surfeit of monitors to use the counters, I plan
>>>> to export the values from resctrl_arch_rmid_read() via perf. (but only for bandwidth counters)
>>>
>>> This sounds related to the way monitoring was done in earlier kernels. This was
>>> long before I became involved with this work. Unfortunately I am not familiar with
>>> all the history involved that ended in it being removed from the kernel. Looks like
>>> this was around v4.6, here is a sample commit that may help point to what was done:
>>
>> Sort of related, this is a problem we have to work around on AMD
>> implementations that I will be sharing a patch for soon.
>>
>> Note the second paragraph at the top of page 13:
>>
>> https://developer.amd.com/wp-content/resources/56375_1.00.pdf

>> AMD QoS often provides fewer counters than RMIDs, but the architecture
>> promises there will be at least as many counters in a QoS domain as
>> CPUs.

How do you know which RMIDs the hardware is tracking?

This reads like the counters are unreliable unless the task is running, and even then they
might lose values when the task is switched out.


>> Using this we can permanently pin RMIDs to CPUs and read the
>> counters on every task switch to implement MBM RMIDs in software.

>> This has the caveats that evictions while one task is running could have
>> resulted from a previous task on the current CPU, but will be counted
>> against the new task's software-RMID, and that CMT doesn't work.

(Sounds like the best thing to do in a bad situation)


>> I will propose making this available as a mount option for cloud container
>> use cases which need to monitor a large number of tasks on B/W counter-poor
>> systems, and of course don't need CMT.

Why does it need to be a mount option?

If this is the only way of using the counters on this platform, then the skid from the
counters is just a property of the platform. It can be advertised to user-space via some
file in 'info'.

Architecture-specific mount options are a bad idea, platform-specific ones are even worse!


>>> [...]
>>>
>>>> I think the solution to all this is:
>>>> * Add rename support to move a monitor group between two control groups.
>>>> ** On x86, this is guaranteed to preserve the RMID, so the destination counter continues
>>>> unaffected.
>>>> ** On arm64, the PARTID is also relevant to the monitors, so the old counters will
>>>> continue to count.
>>>
>>> This looks like the solution to me also.
>>>
>>> The details of the arm64 support are not clear to me though. The destination
>>> group may not have enough PMG to host the new group so failures need to be
>>> handled. As you mention also, the old counters will continue to count.
>>> I assume that you mean the hardware will still have a record of the occupancy
>>> and that needs some time to dissipate? I assume this would fall under the
>>> limbo handling so in some scenarios (for example the just moved monitor
>>> group used the last PMG) it may take some time for the source control
>>> group to allow a new monitor group? The new counters will also not
>>> reflect the task's history.
>>>
>>> Moving an arm64 monitor group may thus have a few surprises for user
>>> space while sounding complex to support. Would adding all this additional
>>> support be worth it if the guidance to user space is to instead create many
>>> control groups in such a control-group-rich environment?
>>>
>>>> Whether the old counters keep counting needs exposing to user-space so that it is aware.
>>>
>>> Could you please elaborate? Do old counters not always keep counting?
>>
>> Based on this, is it even worth it to allocate PMGs given that the
>> systems James has seen so far only have a single PMG bit? All this will
>> get us is the ability to create a single child mon_group in each control
>> group. This seems too limiting for the feature to be useful.

It lets you exclude tasks, or only monitor a specific task. It's evidently enough for the
markets those parts are manufactured for!


> I'll mostly defer to James here. From my side I do not see motivation to
> not support environments in which only one monitor group can be created.
> My concern was the additional complexity involved in supporting
> "mv" of monitor groups in such a constrained environment, but I understand
> from James (re. https://lore.kernel.org/lkml/[email protected]/)
> that it is worth it.

I'm strongly against having parts of this interface work differently on different
architectures or platforms. If it does, we may as well have completely different
interfaces as user-space has to be architecture/platform aware.

It's perfectly possible for the filesystem bits of resctrl to support renaming monitor
groups between control groups, with only a minimum of 'swap the RMID' that can be skipped
if an architecture doesn't support it.
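
Something like the following sketch, say (the hook name is hypothetical, and the mongroup
field layout is only an approximation of what the real patch would do):

    /* Filesystem-side sketch; resctrl_arch_move_mon_group() is a
     * hypothetical hook that architectures without RMID-swap support
     * could stub out to return 0. */
    static int rdtgroup_move_mon_group(struct rdtgroup *mon,
                                       struct rdtgroup *new_parent)
    {
            int ret;

            /* Swap in monitoring IDs valid under the new parent, e.g.
             * a free PMG under the destination PARTID on MPAM. */
            ret = resctrl_arch_move_mon_group(mon, new_parent);
            if (ret)
                    return ret;

            /* Filesystem part: reparent the group; its tasks pick up
             * the new closid/rmid pair at their next context switch. */
            mon->mon.parent_rdtgrp = new_parent;
            return 0;
    }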

'mv' should be supported on all architectures/platforms, and we should expose enough
information to user-space for it to work out if it's going to build a control/monitor group
structure that relies on that.


Thanks,

James

2022-11-11 19:21:39

by James Morse

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi Reinette,

On 09/11/2022 19:12, Reinette Chatre wrote:
> On 11/9/2022 9:59 AM, James Morse wrote:
>> On 08/11/2022 21:28, Reinette Chatre wrote:
>>> On 11/3/2022 10:06 AM, James Morse wrote:
>>>> (I've not got to the last message in this part of the thread yet - I'm out of time this
>>>> week, back Monday!)
>>>>
>>>> On 21/10/2022 21:09, Reinette Chatre wrote:
>>>>> On 10/19/2022 6:57 AM, James Morse wrote:
>>>>>> On 17/10/2022 11:15, Peter Newman wrote:
>>>>>>> On Wed, Oct 12, 2022 at 6:55 PM James Morse <[email protected]> wrote:
>>>
>>> ...
>>>
>>>>>>> If there are a lot more PARTIDs than PMGs, then it would fit well with a
>>>>>>> user who never creates child MON groups. In case the number of MON
>>>>>>> groups gets ahead of the number of CTRL_MON groups and you've run out of
>>>>>>> PMGs, perhaps you would just try to allocate another PARTID and program
>>>>>>> the same partitioning configuration before giving up.
>>>>>>
>>>>>> User-space can choose to do this.
>>>>>> If the kernel tries to be clever and do this behind user-space's back, it needs to
>>>>>> allocate two monitors for this secretly-two-control-groups, and always sum the counters
>>>>>> before reporting them to user-space.
>>>>
>>>>> If I understand this scenario correctly, the kernel is already doing this.
>>>>> As implemented in mon_event_count() the monitor data of a CTRL_MON group is
>>>>> the sum of the parent CTRL_MON group and all its child MON groups.
>>>>
>>>> That is true. MPAM has an additional headache here as it needs to allocate a monitor in
>>>> order to read the counters. If there are enough monitors for each CLOSID*RMID to have one,
>>>> then MPAM can export the counter files in the same way RDT does.
>>>>
>>>> While there are systems that have enough monitors, I don't think this is going to be the
>>>> norm. To allow systems that don't have a surfeit of monitors to use the counters, I plan
>>>> to export the values from resctrl_arch_rmid_read() via perf. (but only for bandwidth counters)
>>
>>> This sounds related to the way monitoring was done in earlier kernels. This was
>>> long before I became involved with this work. Unfortunately I am not familiar with
>>> all the history involved that ended in it being removed from the kernel.
>>
>> Yup, I'm aware there is some history to this. It's not appropriate for the llc_occupancy
>> counter as that reports state, instead of events.

> Perf counts events while a process is running

It's hooked up as an uncore PMU driver and it rejects attempts to attach it to a task.
Some useful background is it has to be told which of the existing resctrl control/monitor
groups to monitor. On x86 it's just returning the increase in events from the mbm files
in resctrl via resctrl_arch_rmid_read().
Unless you're curious [0], the details can come if/when I post it!
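
As a rough idea of the shape (resctrl_pmu_read_group_bytes() here is a stand-in for the
plumbing down to resctrl_arch_rmid_read()):

    #include <linux/perf_event.h>

    /* Uncore-style read hook: report only the increase since last read. */
    static void resctrl_pmu_read(struct perf_event *event)
    {
            /* stand-in for looking up the bound group and reading it
             * through resctrl_arch_rmid_read() */
            u64 now = resctrl_pmu_read_group_bytes(event);
            u64 prev = local64_read(&event->hw.prev_count);

            local64_set(&event->hw.prev_count, now);
            local64_add(now - prev, &event->count);
    }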


> so memory bandwidth monitoring may
> also be impacted by the caveats Peter mentioned for the upcoming AMD changes:
>
> https://lore.kernel.org/lkml/CALPaoCidd+WwGTyE3D74LhoL13ce+EvdTmOnyPrQN62j+zZ1fg@mail.gmail.com/
> ("This has the caveats that evictions while one task is running could have
> resulted from a previous task on the current CPU, but will be counted
> against the new task's software-RMID, ...")

If the logic to implement that is hidden entirely behind resctrl_arch_rmid_read(), then
there should be no problem. (the values will be noisy, but that is the best that can be
done on that platform)


Thanks,

James

[0] Beware, the changes to x86 to make resctrl_arch_rmid_read() irq safe aren't quite right.
https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/commit/?h=mpam/snapshot/v6.0&id=b8ae575bd17e1d56db0f84dc456b964a23d252d6

2022-11-14 18:38:03

by Reinette Chatre

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi James and Peter,

On 11/11/2022 10:38 AM, James Morse wrote:
> On 09/11/2022 19:11, Reinette Chatre wrote:
>> On 11/9/2022 1:50 AM, Peter Newman wrote:
>>> On Tue, Nov 8, 2022 at 10:28 PM Reinette Chatre
>>> <[email protected]> wrote:
>>>> On 11/3/2022 10:06 AM, James Morse wrote:
>>>>> That is true. MPAM has an additional headache here as it needs to allocate a monitor in
>>>>> order to read the counters. If there are enough monitors for each CLOSID*RMID to have one,
>>>>> then MPAM can export the counter files in the same way RDT does.
>>>>>
>>>>> While there are systems that have enough monitors, I don't think this is going to be the
>>>>> norm. To allow systems that don't have a surfeit of monitors to use the counters, I plan
>>>>> to export the values from resctrl_arch_rmid_read() via perf. (but only for bandwidth counters)
>>>>
>>>> This sounds related to the way monitoring was done in earlier kernels. This was
>>>> long before I became involved with this work. Unfortunately I am not familiar with
>>>> all the history involved that ended in it being removed from the kernel. Looks like
>>>> this was around v4.6, here is a sample commit that may help point to what was done:
>>>
>>> Sort of related, this is a problem we have to work around on AMD
>>> implementations that I will be sharing a patch for soon.
>>>
>>> Note the second paragraph at the top of page 13:
>>>
>>> https://developer.amd.com/wp-content/resources/56375_1.00.pdf

Please note that there is a more recent version, v1.03, of the spec
available:
https://www.amd.com/en/support/tech-docs/amd64-technology-platform-quality-service-extensions

>
>>> AMD QoS often provides fewer counters than RMIDs, but the architecture
>>> promises there will be at least as many counters in a QoS domain as
>>> CPUs.
>
> How do you know which RMIDs the hardware is tracking?
>
> This reads like the counters are unreliable unless the task is running, and even then they
> might lose values when the task is switched out.
>
>
>>> Using this we can permanently pin RMIDs to CPUs and read the
>>> counters on every task switch to implement MBM RMIDs in software.
>
>>> This has the caveats that evictions while one task is running could have
>>> resulted from a previous task on the current CPU, but will be counted
>>> against the new task's software-RMID, and that CMT doesn't work.
>
> (Sounds like the best thing to do in a bad situation)
>
>
>>> I will propose making this available as a mount option for cloud container
>>> use cases which need to monitor a large number of tasks on B/W counter-poor
>>> systems, and of course don't need CMT.
>
> Why does it need to be a mount option?
>
> If this is the only way of using the counters on this platform, then the skid from the
> counters is just a property of the platform. It can be advertised to user-space via some
> file in 'info'.

It is not clear to me from the snippet in the spec if these platforms can easily
be identified. It sounds like the only way these platforms are different is that
they more often return "unavailable" when attempting to read a counter. If this
is the case, then knowing when to change the mechanism of counting does not seem
like a simple check.

>
> Architecture-specific mount options are a bad idea, platform-specific ones are even worse!
>
>
>>>> [...]
>>>>
>>>>> I think the solution to all this is:
>>>>> * Add rename support to move a monitor group between two control groups.
>>>>> ** On x86, this is guaranteed to preserve the RMID, so the destination counter continues
>>>>> unaffected.
>>>>> ** On arm64, the PARTID is also relevant to the monitors, so the old counters will
>>>>> continue to count.
>>>>
>>>> This looks like the solution to me also.
>>>>
>>>> The details of the arm64 support are not clear to me though. The destination
>>>> group may not have enough PMG to host the new group so failures need to be
>>>> handled. As you mention also, the old counters will continue to count.
>>>> I assume that you mean the hardware will still have a record of the occupancy
>>>> and that needs some time to dissipate? I assume this would fall under the
>>>> limbo handling so in some scenarios (for example the just moved monitor
>>>> group used the last PMG) it may take some time for the source control
>>>> group to allow a new monitor group? The new counters will also not
>>>> reflect the task's history.
>>>>
>>>> Moving an arm64 monitor group may thus have a few surprises for user
>>>> space while sounding complex to support. Would adding all this additional
>>>> support be worth it if the guidance to user space is to instead create many
>>>> control groups in such a control-group-rich environment?
>>>>
>>>>> Whether the old counters keep counting needs exposing to user-space so that it is aware.
>>>>
>>>> Could you please elaborate? Do old counters not always keep counting?
>>>
>>> Based on this, is it even worth it to allocate PMGs given that the
>>> systems James has seen so far only have a single PMG bit? All this will
>>> get us is the ability to create a single child mon_group in each control
>>> group. This seems too limiting for the feature to be useful.
>
> It lets you exclude tasks, or only monitor a specific task. It's evidently enough for the
> markets those parts are manufactured for!
>
>
>> I'll mostly defer to James here. From my side I do not see motivation to
>> not support environments in which only one monitor group can be created.
>> My concern was the additional complexity involved in supporting
>> "mv" of monitor groups in such a constrained environment, but I understand
>> from James (re. https://lore.kernel.org/lkml/[email protected]/)
>> that it is worth it.
>
> I'm strongly against having parts of this interface work differently on different
> architectures or platforms. If it does, we may as well have completely different
> interfaces as user-space has to be architecture/platform aware.

The architectures respond differently, though, and software cannot hide that. For example,
resctrl can support "mv" on all of them, but from what I understand there is no way to hide
the different architecture behaviors. Users will notice that sometimes the counters
keep counting and sometimes they don't.

> It's perfectly possible for the filesystem bits of resctrl to support renaming monitor
> groups between control groups, with only a minimum of 'swap the RMID' that can be skipped
> if an architecture doesn't support it.

right

>
> 'mv' should be supported on all architectures/platforms, and we should expose enough
> information to user-space for it to work out if it's going to build a control/monitor group
> structure that relies on that.

Yes, I agree about supporting "mv". My concern is about the additional
complexity of attempting to have it behave ideally (e.g. counters keep counting) when
users are not expected to rely on that complexity, but should instead use the additional
exposed information to build their control/monitor groups differently.

Reinette

2022-11-16 13:30:51

by Peter Newman

[permalink] [raw]
Subject: Re: [RFD] resctrl: reassigning a running container's CTRL_MON group

Hi James,

On Fri, Nov 11, 2022 at 7:38 PM James Morse <[email protected]> wrote:
> On 09/11/2022 19:11, Reinette Chatre wrote:
> > On 11/9/2022 1:50 AM, Peter Newman wrote:
> >> Using this we can permanently pin RMIDs to CPUs and read the
> >> counters on every task switch to implement MBM RMIDs in software.
>
> >> This has the caveats that evictions while one task is running could have
> >> resulted from a previous task on the current CPU, but will be counted
> >> against the new task's software-RMID, and that CMT doesn't work.
>
> (Sounds like the best thing to do in a bad situation)
>
>
> >> I will propose making this available as a mount option for cloud container
> >> use cases which need to monitor a large number of tasks on B/W counter-poor
> >> systems, and of course don't need CMT.
>
> Why does it need to be a mount option?
>
> If this is the only way of using the counters on this platform, then the skid from the
> counters is just a property of the platform. It can be advertised to user-space via some
> file in 'info'.

No, it's not the only way of using the counters. The limitation is much
more problematic for users who monitor all tasks all the time. RMIDs
would be more likely to remain in use on systems that only monitor
select tasks, so those systems should still be able to benefit from skid-free
bandwidth counts and CMT. That is why I think the proposed mode should be opt-in.


> Architecture-specific mount options are a bad idea, platform-specific ones are even worse!

It is already the case today in resctrlfs that the platform's features
will determine which mount options are available to the user. I believe
the same implementation would work on Intel platforms, but it would just
be silly to enable it when these platforms have enough counters to back all
their RMIDs.

Also I believe it's fine for this option to be missing on MPAM until
someone is interested in accepting the tradeoffs to monitor more groups and
is motivated enough to implement it.

-Peter