After removing a monitor group, its RMID may not be freed immediately
unless its llc_occupancy is less than the re-allocation threshold. If
turning up the threshold, the RMID can be reused. In order to know how
much the threshold should be, it's necessary to acquire the llc_occupancy.
The patch series provides a new tracepoint to track the llc_occupancy.
Changes since v1:
- Rename pseudo_lock_event.h instead of creating a new header file.
- Modify names used in the new tracepoint.
- Update changelog.
Changes since v2:
- Fix typo and use the x86/resctrl subject prefix in the cover letter.
- Track both CLOSID and RMID in the tracepoint.
- Remove the details on how perf can be used in patch2's changelog.
Haifeng Xu (2):
x86/resctrl: Rename pseudo_lock_event.h to trace.h
x86/resctrl: Add tracepoint for llc_occupancy tracking
arch/x86/kernel/cpu/resctrl/monitor.c | 8 +++++++
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
.../resctrl/{pseudo_lock_event.h => trace.h} | 23 +++++++++++++++----
3 files changed, 28 insertions(+), 5 deletions(-)
rename arch/x86/kernel/cpu/resctrl/{pseudo_lock_event.h => trace.h} (59%)
--
2.25.1
Now only pseudo-locking part uses tracepoints to do event tracking, but
other parts of resctrl may need new tracepoints. It is unnecessary to
create separate header files and define CREATE_TRACE_POINTS in different
c files which fragments the resctrl tracing.
Therefore, give the resctrl tracepoint header file a generic name to support
its use for tracepoints that are not specific to pseudo-locking.
No functional change.
Signed-off-by: Haifeng Xu <[email protected]>
Suggested-by: Reinette Chatre <[email protected]>
---
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
.../kernel/cpu/resctrl/{pseudo_lock_event.h => trace.h} | 8 ++++----
2 files changed, 5 insertions(+), 5 deletions(-)
rename arch/x86/kernel/cpu/resctrl/{pseudo_lock_event.h => trace.h} (86%)
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 884b88e25141..492c8e28c4ce 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -31,7 +31,7 @@
#include "internal.h"
#define CREATE_TRACE_POINTS
-#include "pseudo_lock_event.h"
+#include "trace.h"
/*
* The bits needed to disable hardware prefetching varies based on the
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock_event.h b/arch/x86/kernel/cpu/resctrl/trace.h
similarity index 86%
rename from arch/x86/kernel/cpu/resctrl/pseudo_lock_event.h
rename to arch/x86/kernel/cpu/resctrl/trace.h
index 428ebbd4270b..495fb90c8572 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock_event.h
+++ b/arch/x86/kernel/cpu/resctrl/trace.h
@@ -2,8 +2,8 @@
#undef TRACE_SYSTEM
#define TRACE_SYSTEM resctrl
-#if !defined(_TRACE_PSEUDO_LOCK_H) || defined(TRACE_HEADER_MULTI_READ)
-#define _TRACE_PSEUDO_LOCK_H
+#if !defined(_TRACE_RESCTRL_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_RESCTRL_H
#include <linux/tracepoint.h>
@@ -35,9 +35,9 @@ TRACE_EVENT(pseudo_lock_l3,
TP_printk("hits=%llu miss=%llu",
__entry->l3_hits, __entry->l3_miss));
-#endif /* _TRACE_PSEUDO_LOCK_H */
+#endif /* _TRACE_RESCTRL_H */
#undef TRACE_INCLUDE_PATH
#define TRACE_INCLUDE_PATH .
-#define TRACE_INCLUDE_FILE pseudo_lock_event
+#define TRACE_INCLUDE_FILE trace
#include <trace/define_trace.h>
--
2.25.1
In our production environment, after removing monitor groups, those unused
RMIDs get stuck in the limbo list forever because their llc_occupancy are
always larger than the threshold. But the unused RMIDs can be successfully
freed by turning up the threshold.
In order to know how much the threshold should be, perf can be used to acquire
the llc_occupancy of RMIDs in each rdt domain.
Instead of using perf tool to track llc_occupancy and filter the log manually,
it is more convenient for users to use tracepoint to do this work. So add a new
tracepoint that shows the llc_occupancy of busy RMIDs when scanning the limbo
list.
Signed-off-by: Haifeng Xu <[email protected]>
Suggested-by: Reinette Chatre <[email protected]>
---
arch/x86/kernel/cpu/resctrl/monitor.c | 8 ++++++++
arch/x86/kernel/cpu/resctrl/trace.h | 15 +++++++++++++++
2 files changed, 23 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index c34a35ec0f03..ada392ca75b2 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -24,6 +24,7 @@
#include <asm/resctrl.h>
#include "internal.h"
+#include "trace.h"
/**
* struct rmid_entry - dirty tracking for all RMID.
@@ -362,6 +363,13 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
limbo_release_entry(entry);
}
cur_idx = idx + 1;
+
+ /* x86's CLOSID and RMID are independent numbers, so the entry's
+ * closid is a invalid CLOSID. But on arm64, the RMID value isn't
+ * a unique number for each CLOSID. It's necessary to track both
+ * CLOSID and RMID because there may be dependencies between each
+ * other on some architectures */
+ trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid, d->id, val);
}
resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);
diff --git a/arch/x86/kernel/cpu/resctrl/trace.h b/arch/x86/kernel/cpu/resctrl/trace.h
index 495fb90c8572..35149a75c951 100644
--- a/arch/x86/kernel/cpu/resctrl/trace.h
+++ b/arch/x86/kernel/cpu/resctrl/trace.h
@@ -35,6 +35,21 @@ TRACE_EVENT(pseudo_lock_l3,
TP_printk("hits=%llu miss=%llu",
__entry->l3_hits, __entry->l3_miss));
+TRACE_EVENT(mon_llc_occupancy_limbo,
+ TP_PROTO(u32 ctrl_hw_id, u32 mon_hw_id, int id, u64 occupancy),
+ TP_ARGS(ctrl_hw_id, mon_hw_id, id, occupancy),
+ TP_STRUCT__entry(__field(u32, ctrl_hw_id)
+ __field(u32, mon_hw_id)
+ __field(int, id)
+ __field(u64, occupancy)),
+ TP_fast_assign(__entry->ctrl_hw_id = ctrl_hw_id;
+ __entry->mon_hw_id = mon_hw_id;
+ __entry->id = id;
+ __entry->occupancy = occupancy;),
+ TP_printk("ctrl_hw_id=%u mon_hw_id=%u domain=%d llc_occupancy=%llu",
+ __entry->ctrl_hw_id, __entry->mon_hw_id, __entry->id, __entry->occupancy)
+ );
+
#endif /* _TRACE_RESCTRL_H */
#undef TRACE_INCLUDE_PATH
--
2.25.1
Hello!
On 29/02/2024 07:11, Haifeng Xu wrote:
> In our production environment, after removing monitor groups, those unused
> RMIDs get stuck in the limbo list forever because their llc_occupancy are
> always larger than the threshold. But the unused RMIDs can be successfully
> freed by turning up the threshold.
>
> In order to know how much the threshold should be, perf can be used to acquire
> the llc_occupancy of RMIDs in each rdt domain.
>
> Instead of using perf tool to track llc_occupancy and filter the log manually,
> it is more convenient for users to use tracepoint to do this work. So add a new
> tracepoint that shows the llc_occupancy of busy RMIDs when scanning the limbo
> list.
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index c34a35ec0f03..ada392ca75b2 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -362,6 +363,13 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
[expanded the diff hunk for better context]
> entry = __rmid_entry(idx);
> if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
> QOS_L3_OCCUP_EVENT_ID, &val,
> arch_mon_ctx)) {
> rmid_dirty = true;
> } else {
> rmid_dirty = (val >= resctrl_rmid_realloc_threshold);
> }
>
> if (force_free || !rmid_dirty) {
> clear_bit(idx, d->rmid_busy_llc);
> if (!--entry->busy)
> limbo_release_entry(entry);
> }
> cur_idx = idx + 1;
Ideally this would be the last line in the loop, its how the iterator advances to the next
bit in the bitmap.
> + /* x86's CLOSID and RMID are independent numbers, so the entry's
> + * closid is a invalid CLOSID. But on arm64, the RMID value isn't
> + * a unique number for each CLOSID. It's necessary to track both
> + * CLOSID and RMID because there may be dependencies between each
> + * other on some architectures */
> + trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid, d->id, val);
I agree outputting both these values is the right thing to do.
resctrl_arch_rmid_read() could return an error, in which case val here is either
uninitialised, or the value of another RMID.
Could you put the tracepoint in the 'else' section of the if/else after
resctrl_arch_rmid_read()? This way it will only output a value to user-space if it is correct.
> diff --git a/arch/x86/kernel/cpu/resctrl/trace.h b/arch/x86/kernel/cpu/resctrl/trace.h
> index 495fb90c8572..35149a75c951 100644
> --- a/arch/x86/kernel/cpu/resctrl/trace.h
> +++ b/arch/x86/kernel/cpu/resctrl/trace.h
> @@ -35,6 +35,21 @@ TRACE_EVENT(pseudo_lock_l3,
> TP_printk("hits=%llu miss=%llu",
> __entry->l3_hits, __entry->l3_miss));
>
> +TRACE_EVENT(mon_llc_occupancy_limbo,
> + TP_PROTO(u32 ctrl_hw_id, u32 mon_hw_id, int id, u64 occupancy),
> + TP_ARGS(ctrl_hw_id, mon_hw_id, id, occupancy),
> + TP_STRUCT__entry(__field(u32, ctrl_hw_id)
> + __field(u32, mon_hw_id)
> + __field(int, id)
Nit: Could we call this 'domain_id'? We've got two other ids already, so we should be
clear what each one is!
> + __field(u64, occupancy)),
Nit: 'occupancy_bytes'? Just to avoid user-space thinking it needs to convert from the
hardware unit in Intel's SDM ... and just in case we ever have to add another parameter
that is sort of occupancy too.
> + TP_fast_assign(__entry->ctrl_hw_id = ctrl_hw_id;
> + __entry->mon_hw_id = mon_hw_id;
> + __entry->id = id;
> + __entry->occupancy = occupancy;),
> + TP_printk("ctrl_hw_id=%u mon_hw_id=%u domain=%d llc_occupancy=%llu",
> + __entry->ctrl_hw_id, __entry->mon_hw_id, __entry->id, __entry->occupancy)
> + );
> +
Tracepoints always expose some implicit details of the kernel internals which can make
supporting them a headache. In this case - I've had some discussion with folk about future
work to defer the limbo work as late as possible - until a new control or monitor group is
allocated. This would be to avoid interrupting user-space tasks to update the limbo list
when the result isn't needed until alloc time.
In this case the tracepoint wouldn't be hit unless user-space made a mkdir() syscall to
force an update - I think this can just be a documentation problem.
I also don't think we should commit to this outputting values for all dirty RMID - which
it does today. If group creation were to fail because there weren't any free RMID you'd
get all the values, I think this is the only case where user-space would care.
Could we document the properties of the the trace-point that can be relied on in
Documentation/arch/x86/resctrl.rst ?
Something like:
| This tracepoint gives you the precise occupancy values for a subset of RMID that are not
| immediately available for allocation. This can't be relied on to produce output every
| second, it may be necessary to attempt to create an empty monitor group to force an
| update. Output may only be produced if creation of a control or monitor group fails.
I think we'll always walk a list of dirty RMID before failing to allocate an RMID, so this
should be future proof.
With the val and documentation bits:
Reviewed-by: James Morse <[email protected]>
Thanks,
James
On 2024/3/2 01:47, James Morse wrote:
> Hello!
>
> On 29/02/2024 07:11, Haifeng Xu wrote:
>> In our production environment, after removing monitor groups, those unused
>> RMIDs get stuck in the limbo list forever because their llc_occupancy are
>> always larger than the threshold. But the unused RMIDs can be successfully
>> freed by turning up the threshold.
>>
>> In order to know how much the threshold should be, perf can be used to acquire
>> the llc_occupancy of RMIDs in each rdt domain.
>>
>> Instead of using perf tool to track llc_occupancy and filter the log manually,
>> it is more convenient for users to use tracepoint to do this work. So add a new
>> tracepoint that shows the llc_occupancy of busy RMIDs when scanning the limbo
>> list.
>
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index c34a35ec0f03..ada392ca75b2 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -362,6 +363,13 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
>
> [expanded the diff hunk for better context]
>> entry = __rmid_entry(idx);
>> if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
>> QOS_L3_OCCUP_EVENT_ID, &val,
>> arch_mon_ctx)) {
>> rmid_dirty = true;
>> } else {
>> rmid_dirty = (val >= resctrl_rmid_realloc_threshold);
>> }
>>
>> if (force_free || !rmid_dirty) {
>> clear_bit(idx, d->rmid_busy_llc);
>> if (!--entry->busy)
>> limbo_release_entry(entry);
>> }
>
>> cur_idx = idx + 1;
>
> Ideally this would be the last line in the loop, its how the iterator advances to the next
> bit in the bitmap.
>
>
>> + /* x86's CLOSID and RMID are independent numbers, so the entry's
>> + * closid is a invalid CLOSID. But on arm64, the RMID value isn't
>> + * a unique number for each CLOSID. It's necessary to track both
>> + * CLOSID and RMID because there may be dependencies between each
>> + * other on some architectures */
>> + trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid, d->id, val);
>
> I agree outputting both these values is the right thing to do.
>
> resctrl_arch_rmid_read() could return an error, in which case val here is either
> uninitialised, or the value of another RMID.
> Could you put the tracepoint in the 'else' section of the if/else after
> resctrl_arch_rmid_read()? This way it will only output a value to user-space if it is correct.
>
>
>> diff --git a/arch/x86/kernel/cpu/resctrl/trace.h b/arch/x86/kernel/cpu/resctrl/trace.h
>> index 495fb90c8572..35149a75c951 100644
>> --- a/arch/x86/kernel/cpu/resctrl/trace.h
>> +++ b/arch/x86/kernel/cpu/resctrl/trace.h
>> @@ -35,6 +35,21 @@ TRACE_EVENT(pseudo_lock_l3,
>> TP_printk("hits=%llu miss=%llu",
>> __entry->l3_hits, __entry->l3_miss));
>>
>> +TRACE_EVENT(mon_llc_occupancy_limbo,
>> + TP_PROTO(u32 ctrl_hw_id, u32 mon_hw_id, int id, u64 occupancy),
>> + TP_ARGS(ctrl_hw_id, mon_hw_id, id, occupancy),
>> + TP_STRUCT__entry(__field(u32, ctrl_hw_id)
>> + __field(u32, mon_hw_id)
>
>> + __field(int, id)
>
> Nit: Could we call this 'domain_id'? We've got two other ids already, so we should be
> clear what each one is!
>
>
>> + __field(u64, occupancy)),
>
> Nit: 'occupancy_bytes'? Just to avoid user-space thinking it needs to convert from the
> hardware unit in Intel's SDM ... and just in case we ever have to add another parameter
> that is sort of occupancy too.
>
>
>> + TP_fast_assign(__entry->ctrl_hw_id = ctrl_hw_id;
>> + __entry->mon_hw_id = mon_hw_id;
>> + __entry->id = id;
>> + __entry->occupancy = occupancy;),
>> + TP_printk("ctrl_hw_id=%u mon_hw_id=%u domain=%d llc_occupancy=%llu",
>> + __entry->ctrl_hw_id, __entry->mon_hw_id, __entry->id, __entry->occupancy)
>> + );
>> +
>
> Tracepoints always expose some implicit details of the kernel internals which can make
> supporting them a headache. In this case - I've had some discussion with folk about future
> work to defer the limbo work as late as possible - until a new control or monitor group is
> allocated. This would be to avoid interrupting user-space tasks to update the limbo list
> when the result isn't needed until alloc time.
>
> In this case the tracepoint wouldn't be hit unless user-space made a mkdir() syscall to
> force an update - I think this can just be a documentation problem.
>
> I also don't think we should commit to this outputting values for all dirty RMID - which
> it does today. If group creation were to fail because there weren't any free RMID you'd
> get all the values, I think this is the only case where user-space would care.
Yes.
>
> Could we document the properties of the the trace-point that can be relied on in
> Documentation/arch/x86/resctrl.rst ?
>
> Something like:
> | This tracepoint gives you the precise occupancy values for a subset of RMID that are not
> | immediately available for allocation. This can't be relied on to produce output every
> | second, it may be necessary to attempt to create an empty monitor group to force an
> | update. Output may only be produced if creation of a control or monitor group fails.
>
> I think we'll always walk a list of dirty RMID before failing to allocate an RMID, so this
> should be future proof.
>
>
> With the val and documentation bits:
> Reviewed-by: James Morse <[email protected]>
>
>
> Thanks,
>
> James
Thanks for your suggestions.