Hi Reinette,
I've been working with users of the recently-added mongroup rename
operation[1] who have observed the impact of back-to-back operations on
latency-sensitive, thread pool-based services. Because changing a
resctrl group's CLOSID (or RMID) requires all task_structs in the system
to be inspected with the tasklist_lock read-locked, a series of move
operations can block out thread creation for long periods of time, as
creating threads needs to write-lock the tasklist_lock.
To avoid this issue, this series replaces the CLOSID and RMID values
cached in the task_struct with a pointer to the task's rdtgroup, through
which the current CLOSID and RMID can be obtained indirectly during a
context switch. Updating a group's ID then only requires the current
task to be switched back in on all CPUs. On server hosts with very large
thread counts, this is much less disruptive than making thread creation
globally unavailable. However, this is less desirable on CPU-isolated,
realtime workloads, so I am interested in suggestions on how to reach a
compromise for the two use cases.
Also note that the soft-ABMC[2] proposal is based on swapping the RMID
values in mongroups when monitors are assigned to them, which will
result in the same slowdown as was encountered with re-parenting
monitoring groups.
Using pointers to rdtgroups from the task_struct been used internally at
Google for a few years to support an alternate CLOSID management
technique[3], which was replaced by mongroup rename. Usage of this
technique during production revealed an issue where threads in an
exiting process become unreachable via for_each_process_thread() before
destruction, but can still switch in and out. Consequently, an rmdir
operation can fail to remove references to an rdtgroup before it frees
it, causing the pointer to freed memory to be dereferenced after the
structure is freed. This would result in invalid CLOSID/RMID values
being written into MSRs, causing an exception. The workaround for this
is to clear a task's rdtgroup pointer when it exits.
As a benefit, pointers to rdtgroups are architecture-independent,
resulting in __resctrl_sched_in() and more of the task assignment code
becoming portable, simplifying the effort of supporting MPAM. Using a
single pointer allows the task's current monitor and control groups to
be determined atomically.
Similar changes have been used internally on a kernel supporting MPAM.
Upon request, I can provide the required changes to the MPAM-resctrl
interface based on James Morse's latest MPAM snapshot[4] for reference.
Thanks!
-Peter
[1] https://lore.kernel.org/r/[email protected]
[2] https://lore.kernel.org/lkml/CALPaoChhKJiMAueFtgCTc7ffO++S5DJCySmxqf9ZDmhR9RQapw@mail.gmail.com
[3] https://lore.kernel.org/lkml/CALPaoCj-zav4x6H3ffXo_O+RFan8Qb-uLy-DdtkaQTfuxY4a0w@mail.gmail.com
[4] https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/v6.7-rc2
Peter Newman (6):
x86/resctrl: Move __resctrl_sched_in() out-of-line
x86/resctrl: Add hook for releasing task_struct references
x86/resctrl: Disallow mongroup rename on MPAM
x86/resctrl: Use rdtgroup pointer to indicate task membership
x86/resctrl: Abstract PQR_ASSOC from generic code
x86/resctrl: Don't search tasklist in mongroup rename
arch/x86/include/asm/resctrl.h | 93 --------
arch/x86/kernel/cpu/resctrl/core.c | 14 +-
arch/x86/kernel/cpu/resctrl/internal.h | 17 ++
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 279 +++++++++++++---------
arch/x86/kernel/process_32.c | 2 +-
arch/x86/kernel/process_64.c | 2 +-
include/linux/resctrl.h | 38 +++
include/linux/sched.h | 3 +-
kernel/exit.c | 3 +
10 files changed, 239 insertions(+), 216 deletions(-)
base-commit: 4cece764965020c22cff7665b18a012006359095
--
2.44.0.396.g6e790dbe36-goog
Hi Peter,
On 3/25/2024 10:27 AM, Peter Newman wrote:
> Hi Reinette,
>
> I've been working with users of the recently-added mongroup rename
> operation[1] who have observed the impact of back-to-back operations on
> latency-sensitive, thread pool-based services. Because changing a
> resctrl group's CLOSID (or RMID) requires all task_structs in the system
> to be inspected with the tasklist_lock read-locked, a series of move
> operations can block out thread creation for long periods of time, as
> creating threads needs to write-lock the tasklist_lock.
Could you please give some insight into the delays experienced? "long
periods of time" mean different things to different people and this
series seems to get more ominous as is progresses with the cover letter
starting with "long periods of time" and by the time the final patch
appears it has become "disastrous".
>
> To avoid this issue, this series replaces the CLOSID and RMID values
> cached in the task_struct with a pointer to the task's rdtgroup, through
> which the current CLOSID and RMID can be obtained indirectly during a
On a high level this seems fair but using a pointer to the rdtgroup in the
task_struct means that the scheduling code needs to dereference that pointer
without any lock held. The changes do take great care and it
would be valuable to clearly document how these accesses are safe. (please
see patch #4 and #6)
> context switch. Updating a group's ID then only requires the current
> task to be switched back in on all CPUs. On server hosts with very large
> thread counts, this is much less disruptive than making thread creation
> globally unavailable. However, this is less desirable on CPU-isolated,
> realtime workloads, so I am interested in suggestions on how to reach a
> compromise for the two use cases.
As I understand this only impacts moving a monitor group? To me this sounds
like a user space triggered event associated with a particular use case that
may not be relevant to the workloads that you refer to. I think this could be
something that can be documented for users with this type of workloads.
(please see patch #6)
>
> Also note that the soft-ABMC[2] proposal is based on swapping the RMID
> values in mongroups when monitors are assigned to them, which will
> result in the same slowdown as was encountered with re-parenting
> monitoring groups.
>
> Using pointers to rdtgroups from the task_struct been used internally at
"have been used internally"?
> Google for a few years to support an alternate CLOSID management
> technique[3], which was replaced by mongroup rename. Usage of this
> technique during production revealed an issue where threads in an
> exiting process become unreachable via for_each_process_thread() before
> destruction, but can still switch in and out. Consequently, an rmdir
> operation can fail to remove references to an rdtgroup before it frees
> it, causing the pointer to freed memory to be dereferenced after the
> structure is freed. This would result in invalid CLOSID/RMID values
> being written into MSRs, causing an exception. The workaround for this
> is to clear a task's rdtgroup pointer when it exits.
This does not sound like a problem unique to this new implementation but the
"invalid CLOSID/RMID values being written into MSRs" sounds like something
that happens today? Probably not worth pulling out for this reason because
in its current form the exiting process will keep using the original
CLOSID/RMID that have no issue being written to MSRs.
Reinette
Hi Reinette,
On Thu, Apr 4, 2024 at 4:09 PM Reinette Chatre
<[email protected]> wrote:
> On 3/25/2024 10:27 AM, Peter Newman wrote:
> > I've been working with users of the recently-added mongroup rename
> > operation[1] who have observed the impact of back-to-back operations on
> > latency-sensitive, thread pool-based services. Because changing a
> > resctrl group's CLOSID (or RMID) requires all task_structs in the system
> > to be inspected with the tasklist_lock read-locked, a series of move
> > operations can block out thread creation for long periods of time, as
> > creating threads needs to write-lock the tasklist_lock.
>
> Could you please give some insight into the delays experienced? "long
> periods of time" mean different things to different people and this
> series seems to get more ominous as is progresses with the cover letter
> starting with "long periods of time" and by the time the final patch
> appears it has become "disastrous".
There was an incident where 99.999p tail latencies of a service
increased from 100 milliseconds to over 2 seconds when the container
manager switched from our legacy downstream CLOSID-reuse technique[1]
to mongrp_rename().
A more focused study benchmarked creating 128 threads with
pthread_create() on a production host found that moving mongroups
unrelated to any of the benchmark threads increased the completion cpu
time from 30ms to 183ms. Profiling the contention on the tasklist_lock
showed that the average contention time on the tasklist_lock was about
70ms when mongroup move operations were taking place.
It's difficult for me to access real production workloads, but I
estimated a crude figure by measuring the task time of "wc -l
/sys/fs/resctrl" with perf stat on a relatively idle Intel(R) Xeon(R)
Platinum 8273CL CPU @ 2.20GHz. As I increased the thread count, it
converged to a line where every additional 1000 threads added about 1
millisecond.
Incorporating kernfs_rename() into the solution for changing a group's
class of service also contributes a lot of overhead (about 90% of a
mongroup rename seems to be spent here), but the global impact is far
less than that of the tasklist_lock contention.
>
> >
> > To avoid this issue, this series replaces the CLOSID and RMID values
> > cached in the task_struct with a pointer to the task's rdtgroup, through
> > which the current CLOSID and RMID can be obtained indirectly during a
>
> On a high level this seems fair but using a pointer to the rdtgroup in the
> task_struct means that the scheduling code needs to dereference that pointer
> without any lock held. The changes do take great care and it
> would be valuable to clearly document how these accesses are safe. (please
> see patch #4 and #6)
I'll clarify that the existing technique for revoking a CLOSID or RMID
through a blocking IPI solves most of the problem already.
>
> > context switch. Updating a group's ID then only requires the current
> > task to be switched back in on all CPUs. On server hosts with very large
> > thread counts, this is much less disruptive than making thread creation
> > globally unavailable. However, this is less desirable on CPU-isolated,
> > realtime workloads, so I am interested in suggestions on how to reach a
> > compromise for the two use cases.
>
> As I understand this only impacts moving a monitor group? To me this sounds
> like a user space triggered event associated with a particular use case that
> may not be relevant to the workloads that you refer to. I think this could be
> something that can be documented for users with this type of workloads.
> (please see patch #6)
All of the existing rmdir cases seem to have the same problem, but
they must not be used frequently enough for any concerns to be raised.
It seems that it's routine for the workload of hosts to be increased
until memory bandwidth saturation, so applying and unapplying
allocation restrictions happens rather frequently.
>
> >
> > Also note that the soft-ABMC[2] proposal is based on swapping the RMID
> > values in mongroups when monitors are assigned to them, which will
> > result in the same slowdown as was encountered with re-parenting
> > monitoring groups.
> >
> > Using pointers to rdtgroups from the task_struct been used internally at
>
> "have been used internally"?
Thanks, I'll fix that.
>
> > Google for a few years to support an alternate CLOSID management
> > technique[3], which was replaced by mongroup rename. Usage of this
> > technique during production revealed an issue where threads in an
> > exiting process become unreachable via for_each_process_thread() before
> > destruction, but can still switch in and out. Consequently, an rmdir
> > operation can fail to remove references to an rdtgroup before it frees
> > it, causing the pointer to freed memory to be dereferenced after the
> > structure is freed. This would result in invalid CLOSID/RMID values
> > being written into MSRs, causing an exception. The workaround for this
> > is to clear a task's rdtgroup pointer when it exits.
>
> This does not sound like a problem unique to this new implementation but the
> "invalid CLOSID/RMID values being written into MSRs" sounds like something
> that happens today? Probably not worth pulling out for this reason because
> in its current form the exiting process will keep using the original
> CLOSID/RMID that have no issue being written to MSRs.
Today the values are at worst stale and in the range of valid CLOSIDs
and RMIDs. If __resctrl_sched_in() dereferences a freed struct
rdtgroup pointer, the resulting value could be an arbitrary u32, the
vast majority of which are not valid CLOSID/RMID values.
-Peter
[1] https://lore.kernel.org/lkml/CALPaoCj-zav4x6H3ffXo_O+RFan8Qb-uLy-DdtkaQTfuxY4a0w@mail.gmail.com/
Hi Peter,
On 4/5/2024 2:30 PM, Peter Newman wrote:
> Hi Reinette,
>
> On Thu, Apr 4, 2024 at 4:09 PM Reinette Chatre
> <[email protected]> wrote:
>> On 3/25/2024 10:27 AM, Peter Newman wrote:
>>> I've been working with users of the recently-added mongroup rename
>>> operation[1] who have observed the impact of back-to-back operations on
>>> latency-sensitive, thread pool-based services. Because changing a
>>> resctrl group's CLOSID (or RMID) requires all task_structs in the system
>>> to be inspected with the tasklist_lock read-locked, a series of move
>>> operations can block out thread creation for long periods of time, as
>>> creating threads needs to write-lock the tasklist_lock.
>>
>> Could you please give some insight into the delays experienced? "long
>> periods of time" mean different things to different people and this
>> series seems to get more ominous as is progresses with the cover letter
>> starting with "long periods of time" and by the time the final patch
>> appears it has become "disastrous".
>
> There was an incident where 99.999p tail latencies of a service
> increased from 100 milliseconds to over 2 seconds when the container
> manager switched from our legacy downstream CLOSID-reuse technique[1]
> to mongrp_rename().
>
> A more focused study benchmarked creating 128 threads with
> pthread_create() on a production host found that moving mongroups
> unrelated to any of the benchmark threads increased the completion cpu
> time from 30ms to 183ms. Profiling the contention on the tasklist_lock
> showed that the average contention time on the tasklist_lock was about
> 70ms when mongroup move operations were taking place.
>
> It's difficult for me to access real production workloads, but I
> estimated a crude figure by measuring the task time of "wc -l
> /sys/fs/resctrl" with perf stat on a relatively idle Intel(R) Xeon(R)
> Platinum 8273CL CPU @ 2.20GHz. As I increased the thread count, it
> converged to a line where every additional 1000 threads added about 1
> millisecond.
Thank you very much for capturing this. Could you please include this in
next posting? This data motivates this work significantly more than
terms that are not measurable.
> Incorporating kernfs_rename() into the solution for changing a group's
> class of service also contributes a lot of overhead (about 90% of a
> mongroup rename seems to be spent here), but the global impact is far
> less than that of the tasklist_lock contention.
Is the kernfs_rename() overhead in an acceptable range?
>>> context switch. Updating a group's ID then only requires the current
>>> task to be switched back in on all CPUs. On server hosts with very large
>>> thread counts, this is much less disruptive than making thread creation
>>> globally unavailable. However, this is less desirable on CPU-isolated,
>>> realtime workloads, so I am interested in suggestions on how to reach a
>>> compromise for the two use cases.
>>
>> As I understand this only impacts moving a monitor group? To me this sounds
>> like a user space triggered event associated with a particular use case that
>> may not be relevant to the workloads that you refer to. I think this could be
>> something that can be documented for users with this type of workloads.
>> (please see patch #6)
>
> All of the existing rmdir cases seem to have the same problem, but
> they must not be used frequently enough for any concerns to be raised.
>
> It seems that it's routine for the workload of hosts to be increased
> until memory bandwidth saturation, so applying and unapplying
> allocation restrictions happens rather frequently.
Thank you.
Reinette
Hi Peter,
On 3/25/2024 12:27 PM, Peter Newman wrote:
> Hi Reinette,
>
> I've been working with users of the recently-added mongroup rename
> operation[1] who have observed the impact of back-to-back operations on
> latency-sensitive, thread pool-based services. Because changing a
> resctrl group's CLOSID (or RMID) requires all task_structs in the system
> to be inspected with the tasklist_lock read-locked, a series of move
> operations can block out thread creation for long periods of time, as
> creating threads needs to write-lock the tasklist_lock.
>
> To avoid this issue, this series replaces the CLOSID and RMID values
> cached in the task_struct with a pointer to the task's rdtgroup, through
> which the current CLOSID and RMID can be obtained indirectly during a
> context switch. Updating a group's ID then only requires the current
> task to be switched back in on all CPUs. On server hosts with very large
> thread counts, this is much less disruptive than making thread creation
> globally unavailable. However, this is less desirable on CPU-isolated,
> realtime workloads, so I am interested in suggestions on how to reach a
> compromise for the two use cases.
Before going this route, have you thought about your original solution
where CONTROL_MON groups sharing the same CLOSID?
[3] https://lore.kernel.org/lkml/CALPaoCj-zav4x6H3ffXo_O+RFan8Qb-uLy-DdtkaQTfuxY4a0w@mail.gmail.com
May be it is less disruptive than changing the context switch code.. Thoughts?
thanks
Babu
>
> Also note that the soft-ABMC[2] proposal is based on swapping the RMID
> values in mongroups when monitors are assigned to them, which will
> result in the same slowdown as was encountered with re-parenting
> monitoring groups.
>
> Using pointers to rdtgroups from the task_struct been used internally at
> Google for a few years to support an alternate CLOSID management
> technique[3], which was replaced by mongroup rename. Usage of this
> technique during production revealed an issue where threads in an
> exiting process become unreachable via for_each_process_thread() before
> destruction, but can still switch in and out. Consequently, an rmdir
> operation can fail to remove references to an rdtgroup before it frees
> it, causing the pointer to freed memory to be dereferenced after the
> structure is freed. This would result in invalid CLOSID/RMID values
> being written into MSRs, causing an exception. The workaround for this
> is to clear a task's rdtgroup pointer when it exits.
>
> As a benefit, pointers to rdtgroups are architecture-independent,
> resulting in __resctrl_sched_in() and more of the task assignment code
> becoming portable, simplifying the effort of supporting MPAM. Using a
> single pointer allows the task's current monitor and control groups to
> be determined atomically.
>
> Similar changes have been used internally on a kernel supporting MPAM.
> Upon request, I can provide the required changes to the MPAM-resctrl
> interface based on James Morse's latest MPAM snapshot[4] for reference.
>
> Thanks!
> -Peter
>
> [1] https://lore.kernel.org/r/[email protected]
> [2] https://lore.kernel.org/lkml/CALPaoChhKJiMAueFtgCTc7ffO++S5DJCySmxqf9ZDmhR9RQapw@mail.gmail.com
> [3] https://lore.kernel.org/lkml/CALPaoCj-zav4x6H3ffXo_O+RFan8Qb-uLy-DdtkaQTfuxY4a0w@mail.gmail.com
> [4] https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/v6.7-rc2
>
> Peter Newman (6):
> x86/resctrl: Move __resctrl_sched_in() out-of-line
> x86/resctrl: Add hook for releasing task_struct references
> x86/resctrl: Disallow mongroup rename on MPAM
> x86/resctrl: Use rdtgroup pointer to indicate task membership
> x86/resctrl: Abstract PQR_ASSOC from generic code
> x86/resctrl: Don't search tasklist in mongroup rename
>
> arch/x86/include/asm/resctrl.h | 93 --------
> arch/x86/kernel/cpu/resctrl/core.c | 14 +-
> arch/x86/kernel/cpu/resctrl/internal.h | 17 ++
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 +-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 279 +++++++++++++---------
> arch/x86/kernel/process_32.c | 2 +-
> arch/x86/kernel/process_64.c | 2 +-
> include/linux/resctrl.h | 38 +++
> include/linux/sched.h | 3 +-
> kernel/exit.c | 3 +
> 10 files changed, 239 insertions(+), 216 deletions(-)
>
>
> base-commit: 4cece764965020c22cff7665b18a012006359095
Hi Babu,
On Thu, Apr 11, 2024 at 3:34 PM Moger, Babu <[email protected]> wrote:
>
> Hi Peter,
>
> On 3/25/2024 12:27 PM, Peter Newman wrote:
> > To avoid this issue, this series replaces the CLOSID and RMID values
> > cached in the task_struct with a pointer to the task's rdtgroup, through
> > which the current CLOSID and RMID can be obtained indirectly during a
> > context switch. Updating a group's ID then only requires the current
> > task to be switched back in on all CPUs. On server hosts with very large
> > thread counts, this is much less disruptive than making thread creation
> > globally unavailable. However, this is less desirable on CPU-isolated,
> > realtime workloads, so I am interested in suggestions on how to reach a
> > compromise for the two use cases.
>
> Before going this route, have you thought about your original solution
> where CONTROL_MON groups sharing the same CLOSID?
>
> [3] https://lore.kernel.org/lkml/CALPaoCj-zav4x6H3ffXo_O+RFan8Qb-uLy-DdtkaQTfuxY4a0w@mail.gmail.com
>
> May be it is less disruptive than changing the context switch code.. Thoughts?
If had I ever posted that series, it would have contained the very
same changes to the context switch code, because that solution causes
rdtgroup->closid fields to change on almost every schemata write.
-Peter