In the Open Telemetry profiling SIG [1], we are trying to find a way to
grab a tracing association quickly on a per-sample basis. The team at
Elastic has a bespoke way to do this [2]; however, I'd like to see a
more general way to achieve this. The folks I've been talking with seem
open to the idea of just having a TLS value for this that we could
capture upon each sample. We could then just state that Open Telemetry
SDKs should have a TLS value for span correlation. However, we need a
way to sample the TLS or other value(s) when a sampling event is
generated. This is supported today on Windows via
EventActivityIdControl() [3]. Since Open Telemetry works on both
Windows and Linux, ideally we can do something as efficient for
Linux-based workloads.
This series explores how best to collect supporting data from a user
process when a profile sample is collected. Having a value stored in
TLS makes a lot of sense for this; however, there are other options to
explore. Whatever is chosen, kernel samples taken in process context
should be able to get this supporting data. In these patches, on x86-64
the fsbase and gsbase registers are used for this.
An option suggested by Mathieu Desnoyers is to utilize rseq for
processes to register a value location that can be included when
profiling, if desired. This would allow a tighter contract between user
processes and a profiler, and would allow better labeling/categorizing
of the correlation values.
An ideal flow would look like this:
User Task                  Profile
do_work();                 sample() -> IP + No activity
...
set_activity(123);
...
do_work();                 sample() -> IP + activity (123)
...
set_activity(124);
...
do_work();                 sample() -> IP + activity (124)
Ideally, the set_activity() method would not be a syscall. It needs to
be very cheap, as it should not bottleneck work. Ideally it is just a
memcpy of 16-20 bytes, as it is on Windows via EventActivityIdControl()
using EVENT_ACTIVITY_CTRL_SET_ID.
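As an illustration only, a minimal userspace sketch of such a
set_activity(), assuming the SDK reserves a TLS slot for the
correlation value (all names here are hypothetical and not part of any
proposed ABI):

	#include <stdint.h>
	#include <string.h>

	/* Per-thread correlation value, e.g. a 16-byte trace/span ID. */
	static __thread uint8_t activity_id[16];

	static inline void set_activity(const uint8_t id[16])
	{
		/* A small copy, no syscall. */
		memcpy(activity_id, id, sizeof(activity_id));
	}

A profiler then only needs a way to locate activity_id in the thread's
TLS at sample time, which is what this series is about.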
For those not aware, Open Telemetry allows collecting data from
multiple machines and showing where time was spent. The tracing context
is already available for logs, but not for profiling samples. The idea
is to show where slowdowns occur and to have profile samples that
explain why they slowed down. This must be possible without having to
track context switches to do the correlation, because profiling rates
are typically 20 Hz - 1 kHz, while context switching rates are much
higher. We do not want to have to consume high context switch rates
just to know a correlation for a 20 Hz signal. Often these 20 Hz
signals are always enabled in some environments.
Regardless of whether TLS, rseq, or another source is used, I believe
we will need a way for perf_events to include it within a sample. The
changes in this series show how it could be done with TLS. There is
some factoring work under perf to make it easier to add more dump types
using the existing ABI. This is mostly to make the patches clearer; the
refactoring parts could certainly be dropped and we could have
duplicated/specialized paths.
1. https://opentelemetry.io/blog/2024/profiling/
2. https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation
3. https://learn.microsoft.com/en-us/windows/win32/api/evntprov/nf-evntprov-eventactivityidcontrol
Beau Belgrave (4):
perf/core: Introduce perf_prepare_dump_data()
perf: Introduce PERF_SAMPLE_TLS_USER sample type
perf/core: Factor perf_output_sample_udump()
perf/x86/core: Add tls dump support
arch/Kconfig | 7 ++
arch/x86/Kconfig | 1 +
arch/x86/events/core.c | 14 +++
arch/x86/include/asm/perf_event.h | 5 +
include/linux/perf_event.h | 7 ++
include/uapi/linux/perf_event.h | 5 +-
kernel/events/core.c | 166 +++++++++++++++++++++++-------
kernel/events/internal.h | 16 +++
8 files changed, 180 insertions(+), 41 deletions(-)
base-commit: fec50db7033ea478773b159e0e2efb135270e3b7
--
2.34.1
When samples are generated, there is no way via the perf_event ABI to
fetch per-thread data. This data is very useful in tracing scenarios
that involve correlation IDs, such as OpenTelemetry. It is also useful
for tracking per-thread performance details directly within a
cooperating user process.
The newly established OpenTelemetry profiling group requires a way to
get tracing correlations on both Linux and Windows. On Windows this
correlation is available on a per-thread basis directly via ETW. On
Linux we need a fast mechanism to store these details, and TLS seems
like the best option; see the links below for more details.
Add a new sample type (PERF_SAMPLE_TLS_USER) that fetches up to a
user-requested number of bytes of TLS data per sample. Use the existing
PERF_SAMPLE_STACK_USER ABI for outputting data to consumers. Store the
requested data size in the previously reserved u16 (__reserved_2)
within perf_event_attr.
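For example, a consumer could request a 16-byte per-sample TLS dump as
sketched below; this is only illustrative of the proposed ABI, since
PERF_SAMPLE_TLS_USER and sample_tls_user exist only with this series
applied:

	#include <linux/perf_event.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <sys/types.h>
	#include <unistd.h>

	static int open_tls_sampler(pid_t pid)
	{
		struct perf_event_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_SOFTWARE;
		attr.config = PERF_COUNT_SW_CPU_CLOCK;
		attr.freq = 1;
		attr.sample_freq = 20;	/* 20 Hz, as in the cover letter */
		attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
				   PERF_SAMPLE_TLS_USER;
		attr.sample_tls_user = 16;	/* must be u64 aligned */

		return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
	}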
Add tls_addr and tls_user_size to perf_sample_data and calculate them
during sample preparation. This lets the output side know whether
truncation is going to occur without having to fetch the TLS value from
the user process a second time.
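On the consumer side, the payload can then be parsed the same way perf
tooling parses PERF_SAMPLE_STACK_USER: a u64 static size, the dumped
bytes, then a u64 dynamic size saying how much was actually copied,
with a zero static size meaning the header alone. A hedged sketch:

	#include <linux/types.h>

	/* Returns a pointer just past the TLS dump within the record. */
	static const void *read_tls_dump(const void *p, const void **data,
					 __u64 *dyn_size)
	{
		__u64 size = *(const __u64 *)p;
		const char *c = (const char *)p + sizeof(__u64);

		if (!size) {		/* nothing was dumped */
			*data = NULL;
			*dyn_size = 0;
			return c;
		}

		*data = c;
		*dyn_size = *(const __u64 *)(c + size);

		return c + size + sizeof(__u64);
	}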
Add CONFIG_HAVE_PERF_USER_TLS_DUMP so that architectures can specify
whether they have a TLS-specific register (or other logic) that can be
used for dumping. This does not yet enable any architecture to do a TLS
dump; it simply makes it possible by allowing an arch-defined method
named arch_perf_user_tls_pointer().
Add a perf_tls struct that arch_perf_user_tls_pointer() utilizes to
return the TLS details of address and size (the size accommodates
32-bit on 64-bit compat cases).
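For reference, a hedged sketch of roughly what the x86 hook added later
in this series could look like; the exact mechanics are illustrative
and assume fsbase/gsbase hold the TLS base as described in the cover
letter:

	#include <linux/sched.h>
	#include <asm/fsgsbase.h>

	void arch_perf_user_tls_pointer(struct perf_tls *tls)
	{
		/* 64-bit tasks use fsbase for TLS, 32-bit compat gsbase. */
		if (!test_thread_flag(TIF_ADDR32)) {
			tls->base = x86_fsbase_read_task(current);
			tls->size = sizeof(u64);
		} else {
			tls->base = x86_gsbase_read_task(current);
			tls->size = sizeof(u32);
		}
	}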
Link: https://opentelemetry.io/blog/2024/profiling/
Link: https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation
Signed-off-by: Beau Belgrave <[email protected]>
---
arch/Kconfig | 7 +++
include/linux/perf_event.h | 7 +++
include/uapi/linux/perf_event.h | 5 +-
kernel/events/core.c | 105 +++++++++++++++++++++++++++++++-
kernel/events/internal.h | 16 +++++
5 files changed, 137 insertions(+), 3 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 9f066785bb71..6afaf5f46e2f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -430,6 +430,13 @@ config HAVE_PERF_USER_STACK_DUMP
access to the user stack pointer which is not unified across
architectures.
+config HAVE_PERF_USER_TLS_DUMP
+ bool
+ help
+ Support user tls dumps for perf event samples. This needs
+ access to the user tls pointer which is not unified across
+ architectures.
+
config HAVE_ARCH_JUMP_LABEL
bool
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d2a15c0c6f8a..7fac81929eed 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1202,8 +1202,15 @@ struct perf_sample_data {
u64 data_page_size;
u64 code_page_size;
u64 aux_size;
+ u64 tls_addr;
+ u64 tls_user_size;
} ____cacheline_aligned;
+struct perf_tls {
+ unsigned long base; /* Base address for TLS */
+ unsigned long size; /* Size of base address */
+};
+
/* default value for data source */
#define PERF_MEM_NA (PERF_MEM_S(OP, NA) |\
PERF_MEM_S(LVL, NA) |\
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 3a64499b0f5d..b62669cfe581 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -162,8 +162,9 @@ enum perf_event_sample_format {
PERF_SAMPLE_DATA_PAGE_SIZE = 1U << 22,
PERF_SAMPLE_CODE_PAGE_SIZE = 1U << 23,
PERF_SAMPLE_WEIGHT_STRUCT = 1U << 24,
+ PERF_SAMPLE_TLS_USER = 1U << 25,
- PERF_SAMPLE_MAX = 1U << 25, /* non-ABI */
+ PERF_SAMPLE_MAX = 1U << 26, /* non-ABI */
};
#define PERF_SAMPLE_WEIGHT_TYPE (PERF_SAMPLE_WEIGHT | PERF_SAMPLE_WEIGHT_STRUCT)
@@ -509,7 +510,7 @@ struct perf_event_attr {
*/
__u32 aux_watermark;
__u16 sample_max_stack;
- __u16 __reserved_2;
+ __u16 sample_tls_user; /* Size of TLS data to dump on samples */
__u32 aux_sample_size;
__u32 __reserved_3;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 07de5cc2aa25..f848bf4be9bd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6926,6 +6926,45 @@ static u64 perf_ustack_task_size(struct pt_regs *regs)
return TASK_SIZE - addr;
}
+/*
+ * Get remaining task size from user tls pointer.
+ *
+ * Outputs the address to use for the dump to avoid doing
+ * this twice (prepare and output).
+ */
+static u64
+perf_utls_task_size(struct pt_regs *regs, u64 dump_size, u64 *tls_addr)
+{
+ struct perf_tls tls;
+ unsigned long addr;
+
+ *tls_addr = 0;
+
+ /* No regs, no tls pointer, no dump. */
+ if (!regs)
+ return 0;
+
+ perf_user_tls_pointer(&tls);
+
+ if (WARN_ONCE(tls.size > sizeof(addr), "perf: Bad TLS size.\n"))
+ return 0;
+
+ addr = 0;
+ arch_perf_out_copy_user(&addr, (void *)tls.base, tls.size);
+
+ if (addr < dump_size)
+ return 0;
+
+ addr -= dump_size;
+
+ if (!addr || addr >= TASK_SIZE)
+ return 0;
+
+ *tls_addr = addr;
+
+ return TASK_SIZE - addr;
+}
+
static u16
perf_sample_dump_size(u16 dump_size, u16 header_size, u64 task_size)
{
@@ -6997,6 +7036,43 @@ perf_output_sample_ustack(struct perf_output_handle *handle, u64 dump_size,
}
}
+static void
+perf_output_sample_utls(struct perf_output_handle *handle, u64 addr,
+ u64 dump_size, struct pt_regs *regs)
+{
+ /* Case of a kernel thread, nothing to dump */
+ if (!regs) {
+ u64 size = 0;
+ perf_output_put(handle, size);
+ } else {
+ unsigned int rem;
+ u64 dyn_size;
+
+ /*
+ * We dump:
+ * static size
+ * - the size requested by user or the best one we can fit
+ * in to the sample max size
+ * data
+ * - user tls dump data
+ * dynamic size
+ * - the actual dumped size
+ */
+
+ /* Static size. */
+ perf_output_put(handle, dump_size);
+
+ /* Data. */
+ rem = __output_copy_user(handle, (void *)addr, dump_size);
+ dyn_size = dump_size - rem;
+
+ perf_output_skip(handle, rem);
+
+ /* Dynamic size. */
+ perf_output_put(handle, dyn_size);
+ }
+}
+
static unsigned long perf_prepare_sample_aux(struct perf_event *event,
struct perf_sample_data *data,
size_t size)
@@ -7474,6 +7550,13 @@ void perf_output_sample(struct perf_output_handle *handle,
if (sample_type & PERF_SAMPLE_CODE_PAGE_SIZE)
perf_output_put(handle, data->code_page_size);
+ if (sample_type & PERF_SAMPLE_TLS_USER) {
+ perf_output_sample_utls(handle,
+ data->tls_addr,
+ data->tls_user_size,
+ data->regs_user.regs);
+ }
+
if (sample_type & PERF_SAMPLE_AUX) {
perf_output_put(handle, data->aux_size);
@@ -7759,6 +7842,19 @@ void perf_prepare_sample(struct perf_sample_data *data,
data->sample_flags |= PERF_SAMPLE_STACK_USER;
}
+ if (filtered_sample_type & PERF_SAMPLE_TLS_USER) {
+ u16 tls_size = event->attr.sample_tls_user;
+ u64 task_size = perf_utls_task_size(data->regs_user.regs,
+ tls_size,
+ &data->tls_addr);
+
+ tls_size = perf_prepare_dump_data(data, event, regs,
+ tls_size, task_size);
+
+ data->tls_user_size = tls_size;
+ data->sample_flags |= PERF_SAMPLE_TLS_USER;
+ }
+
if (filtered_sample_type & PERF_SAMPLE_WEIGHT_TYPE) {
data->weight.full = 0;
data->sample_flags |= PERF_SAMPLE_WEIGHT_TYPE;
@@ -12159,7 +12255,7 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
attr->size = size;
- if (attr->__reserved_1 || attr->__reserved_2 || attr->__reserved_3)
+ if (attr->__reserved_1 || attr->__reserved_3)
return -EINVAL;
if (attr->sample_type & ~(PERF_SAMPLE_MAX-1))
@@ -12225,6 +12321,13 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
return -EINVAL;
}
+ if (attr->sample_type & PERF_SAMPLE_TLS_USER) {
+ if (!arch_perf_have_user_tls_dump())
+ return -ENOSYS;
+ else if (!IS_ALIGNED(attr->sample_tls_user, sizeof(u64)))
+ return -EINVAL;
+ }
+
if (!attr->sample_max_stack)
attr->sample_max_stack = sysctl_perf_event_max_stack;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 5150d5f84c03..b42747b1eb04 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -243,4 +243,20 @@ static inline bool arch_perf_have_user_stack_dump(void)
#define perf_user_stack_pointer(regs) 0
#endif /* CONFIG_HAVE_PERF_USER_STACK_DUMP */
+#ifdef CONFIG_HAVE_PERF_USER_TLS_DUMP
+static inline bool arch_perf_have_user_tls_dump(void)
+{
+ return true;
+}
+
+#define perf_user_tls_pointer(tls) arch_perf_user_tls_pointer(tls)
+#else
+static inline bool arch_perf_have_user_tls_dump(void)
+{
+ return false;
+}
+
+#define perf_user_tls_pointer(tls) memset(tls, 0, sizeof(*tls))
+#endif /* CONFIG_HAVE_PERF_USER_TLS_DUMP */
+
#endif /* _KERNEL_EVENTS_INTERNAL_H */
--
2.34.1
On Thu, Apr 11, 2024 at 5:17 PM Beau Belgrave <[email protected]> wrote:
>
> In the Open Telemetry profiling SIG [1], we are trying to find a way to
> grab a tracing association quickly on a per-sample basis. The team at
> Elastic has a bespoke way to do this [2], however, I'd like to see a
> more general way to achieve this. The folks I've been talking with seem
> open to the idea of just having a TLS value for this we could capture
Presumably TLS == Thread Local Storage.
> upon each sample. We could then just state, Open Telemetry SDKs should
> have a TLS value for span correlation. However, we need a way to sample
> the TLS or other value(s) when a sampling event is generated. This is
> supported today on Windows via EventActivityIdControl() [3]. Since
> Open Telemetry works on both Windows and Linux, ideally we can do
> something as efficient for Linux based workloads.
>
> This series is to explore how it would be best possible to collect
> supporting data from a user process when a profile sample is collected.
> Having a value stored in TLS makes a lot of sense for this however
> there are other ways to explore. Whatever is chosen, kernel samples
> taken in process context should be able to get this supporting data.
> In these patches on X64 the fsbase and gsbase are used for this.
>
> An option to explore suggested by Mathieu Desnoyers is to utilize rseq
> for processes to register a value location that can be included when
> profiling if desired. This would allow a tighter contract between user
> processes and a profiler. It would allow better labeling/categorizing
> the correlation values.
It is hard to understand this idea. Are you saying stash a cookie in
TLS for samples to capture to indicate an activity? Restartable
sequences are about preemption on a CPU not of a thread, so at least
my intuition is that they feel different. You could stash information
like this today by changing the thread name, which generates comm
events. I've wondered about having similar information in some form of
stack slot reserved for profiling, for example, to stash a pointer to
the name of a function being interpreted.
is bad performance wise and for security. A stack slot would be able
to deal with nesting.
> An idea flow would look like this:
> User Task Profile
> do_work(); sample() -> IP + No activity
> ...
> set_activity(123);
> ...
> do_work(); sample() -> IP + activity (123)
> ...
> set_activity(124);
> ...
> do_work(); sample() -> IP + activity (124)
>
> Ideally, the set_activity() method would not be a syscall. It needs to
> be very cheap as this should not bottleneck work. Ideally this is just
> a memcpy of 16-20 bytes as it is on Windows via EventActivityIdControl()
> using EVENT_ACTIVITY_CTRL_SET_ID.
>
> For those not aware, Open Telemetry allows collecting data from multiple
> machines and show where time was spent. The tracing context is already
> available for logs, but not for profiling samples. The idea is to show
> where slowdowns occur and have profile samples to explain why they
> slowed down. This must be possible without having to track context
> switches to do this correlation. This is because the profiling rates
> are typically 20hz - 1Khz, while the context switching rates are much
> higher. We do not want to have to consume high context switch rates
> just to know a correlation for a 20hz signal. Often these 20hz signals
> are always enabled in some environments.
>
> Regardless if TLS, rseq, or other source is used I believe we will need
> a way for perf_events to include it within a sample. The changes in this
> series show how it could be done with TLS. There is some factoring work
> under perf to make it easier to add more dump types using the existing
> ABI. This is mostly to make the patches clearer, certainly the refactor
> parts could get dropped and we could have duplicated/specialized paths.
fs and gs may be used for more than just the C runtime's TLS. For
example, they may be used by emulators or managed runtimes. I'm not
clear why this specific case couldn't be handled through BPF.
Thanks,
Ian
> 1. https://opentelemetry.io/blog/2024/profiling/
> 2. https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation
> 3. https://learn.microsoft.com/en-us/windows/win32/api/evntprov/nf-evntprov-eventactivityidcontrol
On Fri, Apr 12, 2024 at 12:17:28AM +0000, Beau Belgrave wrote:
> An idea flow would look like this:
> User Task Profile
> do_work(); sample() -> IP + No activity
> ...
> set_activity(123);
> ...
> do_work(); sample() -> IP + activity (123)
> ...
> set_activity(124);
> ...
> do_work(); sample() -> IP + activity (124)
This, start with this, because until I saw this, I was utterly confused
as to what the heck you were on about.
I started by thinking we already have TID in samples so you can already
associate back to user processes and got increasingly confused the
further I went.
What you seem to want to do however is have some task-state included so
you can see what the thread is doing.
Anyway, since we typically run stuff from NMI context, accessing user
data is 'interesting'. As such I would really like to make this work
depend on the call-graph rework that pushes all the user access bits
into return-to-user.
On Thu, Apr 11, 2024 at 09:52:22PM -0700, Ian Rogers wrote:
> On Thu, Apr 11, 2024 at 5:17 PM Beau Belgrave <[email protected]> wrote:
> >
> > In the Open Telemetry profiling SIG [1], we are trying to find a way to
> > grab a tracing association quickly on a per-sample basis. The team at
> > Elastic has a bespoke way to do this [2], however, I'd like to see a
> > more general way to achieve this. The folks I've been talking with seem
> > open to the idea of just having a TLS value for this we could capture
>
> Presumably TLS == Thread Local Storage.
>
Yes, the initial idea is to use thread local storage (TLS). It seems to
be the fastest option to save a per-thread value that changes at a fast
rate.
> > upon each sample. We could then just state, Open Telemetry SDKs should
> > have a TLS value for span correlation. However, we need a way to sample
> > the TLS or other value(s) when a sampling event is generated. This is
> > supported today on Windows via EventActivityIdControl() [3]. Since
> > Open Telemetry works on both Windows and Linux, ideally we can do
> > something as efficient for Linux based workloads.
> >
> > This series is to explore how it would be best possible to collect
> > supporting data from a user process when a profile sample is collected.
> > Having a value stored in TLS makes a lot of sense for this however
> > there are other ways to explore. Whatever is chosen, kernel samples
> > taken in process context should be able to get this supporting data.
> > In these patches on X64 the fsbase and gsbase are used for this.
> >
> > An option to explore suggested by Mathieu Desnoyers is to utilize rseq
> > for processes to register a value location that can be included when
> > profiling if desired. This would allow a tighter contract between user
> > processes and a profiler. It would allow better labeling/categorizing
> > the correlation values.
>
> It is hard to understand this idea. Are you saying stash a cookie in
> TLS for samples to capture to indicate an activity? Restartable
> sequences are about preemption on a CPU not of a thread, so at least
> my intuition is that they feel different. You could stash information
> like this today by changing the thread name which generates comm
> events. I've wondered about having similar information in some form of
> reserved for profiling stack slot, for example, to stash a pointer to
> the name of a function being interpreted. Snapshotting all of a stack
> is bad performance wise and for security. A stack slot would be able
> to deal with nesting.
>
You are getting the idea. A slot or tag for a thread would be great! I'm
not a fan of overriding the thread comm name (as that already has a
use). TLS would be fine, if we could also pass an offset + size + type.
Maybe a stack slot that just points to parts of TLS? That way you could
have a set of slots that don't require much memory and selectively copy
them out of TLS (or wherever those slots point to in user memory).
When I was talking to Mathieu about this, it seemed that rseq already
had a place to potentially put these slots. I'm unsure though how the
per-thread aspects would work.
Mathieu, can you post your ideas here about that?
> > An idea flow would look like this:
> > User Task Profile
> > do_work(); sample() -> IP + No activity
> > ...
> > set_activity(123);
> > ...
> > do_work(); sample() -> IP + activity (123)
> > ...
> > set_activity(124);
> > ...
> > do_work(); sample() -> IP + activity (124)
> >
> > Ideally, the set_activity() method would not be a syscall. It needs to
> > be very cheap as this should not bottleneck work. Ideally this is just
> > a memcpy of 16-20 bytes as it is on Windows via EventActivityIdControl()
> > using EVENT_ACTIVITY_CTRL_SET_ID.
> >
> > For those not aware, Open Telemetry allows collecting data from multiple
> > machines and show where time was spent. The tracing context is already
> > available for logs, but not for profiling samples. The idea is to show
> > where slowdowns occur and have profile samples to explain why they
> > slowed down. This must be possible without having to track context
> > switches to do this correlation. This is because the profiling rates
> > are typically 20hz - 1Khz, while the context switching rates are much
> > higher. We do not want to have to consume high context switch rates
> > just to know a correlation for a 20hz signal. Often these 20hz signals
> > are always enabled in some environments.
> >
> > Regardless if TLS, rseq, or other source is used I believe we will need
> > a way for perf_events to include it within a sample. The changes in this
> > series show how it could be done with TLS. There is some factoring work
> > under perf to make it easier to add more dump types using the existing
> > ABI. This is mostly to make the patches clearer, certainly the refactor
> > parts could get dropped and we could have duplicated/specialized paths.
>
> fs and gs may be used for more than just the C runtime's TLS. For
> example, they may be used by emulators or managed runtimes. I'm not
> clear why this specific case couldn't be handled through BPF.
>
Agree about the fs/gs possibly being used for other things. If we had a
stack slot we could avoid the confusion and have tighter couplings.
You can do this in eBPF (see [2]). However, it's very clunky and
depends on specific SDKs per language/runtime. We ourselves don't run
our profilers with anything other than CAP_PERFMON, and we also have
environments without BPF enabled for various reasons. It'd be great if
we could get this data directly from perf. At the very least, I'd love
a standardized way to attribute thread values that is accessible to
other performance systems (like eBPF and perf).
Thanks,
-Beau
On Fri, Apr 12, 2024 at 09:12:45AM +0200, Peter Zijlstra wrote:
>
> On Fri, Apr 12, 2024 at 12:17:28AM +0000, Beau Belgrave wrote:
>
> > An idea flow would look like this:
> > User Task Profile
> > do_work(); sample() -> IP + No activity
> > ...
> > set_activity(123);
> > ...
> > do_work(); sample() -> IP + activity (123)
> > ...
> > set_activity(124);
> > ...
> > do_work(); sample() -> IP + activity (124)
>
> This, start with this, because until I saw this, I was utterly confused
> as to what the heck you were on about.
>
Will do.
> I started by thinking we already have TID in samples so you can already
> associate back to user processes and got increasingly confused the
> further I went.
>
> What you seem to want to do however is have some task-state included so
> you can see what the thread is doing.
>
Yeah, there is typically an external context (not on the machine) that
wants to be tied to each sample. The context could be a simple integer,
UUID, or something else entirely. For OTel, this is a 16-byte array [1].
> Anyway, since we typically run stuff from NMI context, accessing user
> data is 'interesting'. As such I would really like to make this work
> depend on the call-graph rework that pushes all the user access bits
> into return-to-user.
Cool, I assume that's the SFRAME work? Are there pointers to work I
could look at and think about what a rebase looks like? Or do you have
someone in mind I should work with for this?
Thanks,
-Beau
1. https://www.w3.org/TR/trace-context/#version-format
On 2024-04-12 12:28, Beau Belgrave wrote:
> On Thu, Apr 11, 2024 at 09:52:22PM -0700, Ian Rogers wrote:
>> On Thu, Apr 11, 2024 at 5:17 PM Beau Belgrave <[email protected]> wrote:
>>>
>>> In the Open Telemetry profiling SIG [1], we are trying to find a way to
>>> grab a tracing association quickly on a per-sample basis. The team at
>>> Elastic has a bespoke way to do this [2], however, I'd like to see a
>>> more general way to achieve this. The folks I've been talking with seem
>>> open to the idea of just having a TLS value for this we could capture
>>
>> Presumably TLS == Thread Local Storage.
>>
>
> Yes, the initial idea is to use thread local storage (TLS). It seems to
> be the fastest option to save a per-thread value that changes at a fast
> rate.
>
>>> upon each sample. We could then just state, Open Telemetry SDKs should
>>> have a TLS value for span correlation. However, we need a way to sample
>>> the TLS or other value(s) when a sampling event is generated. This is
>>> supported today on Windows via EventActivityIdControl() [3]. Since
>>> Open Telemetry works on both Windows and Linux, ideally we can do
>>> something as efficient for Linux based workloads.
>>>
>>> This series is to explore how it would be best possible to collect
>>> supporting data from a user process when a profile sample is collected.
>>> Having a value stored in TLS makes a lot of sense for this however
>>> there are other ways to explore. Whatever is chosen, kernel samples
>>> taken in process context should be able to get this supporting data.
>>> In these patches on X64 the fsbase and gsbase are used for this.
>>>
>>> An option to explore suggested by Mathieu Desnoyers is to utilize rseq
>>> for processes to register a value location that can be included when
>>> profiling if desired. This would allow a tighter contract between user
>>> processes and a profiler. It would allow better labeling/categorizing
>>> the correlation values.
>>
>> It is hard to understand this idea. Are you saying stash a cookie in
>> TLS for samples to capture to indicate an activity? Restartable
>> sequences are about preemption on a CPU not of a thread, so at least
>> my intuition is that they feel different. You could stash information
>> like this today by changing the thread name which generates comm
>> events. I've wondered about having similar information in some form of
>> reserved for profiling stack slot, for example, to stash a pointer to
>> the name of a function being interpreted. Snapshotting all of a stack
>> is bad performance wise and for security. A stack slot would be able
>> to deal with nesting.
>>
>
> You are getting the idea. A slot or tag for a thread would be great! I'm
> not a fan of overriding the thread comm name (as that already has a
> use). TLS would be fine, if we could also pass an offset + size + type.
>
> Maybe a stack slot that just points to parts of TLS? That way you could
> have a set of slots that don't require much memory and selectively copy
> them out of TLS (or where ever those slots point to in user memory).
>
> When I was talking to Mathieu about this, it seems that rseq already had
> a place to potentially put these slots. I'm unsure though how the per
> thread aspects would work.
>
> Mathieu, can you post your ideas here about that?
Sure. I'll try to summarize my thoughts here. By all means, let me
know if I'm missing important pieces of the puzzle.
First of all, here is my understanding of what information we want to
share between userspace and kernel.
A 128-bit activity ID identifies "uniquely" (as far as a 128-bit random
UUID allows) a portion of the dependency chain involved in doing some
work (e.g. answer a HTTP request) across one or many participating hosts.
Activity IDs have a parent/child relationship: a parent activity ID can
create child activity IDs.
For instance, if one host has the service "dispatch", another host
has a "web server", and a third host has a SQL database, we should
be able to follow the chain of activities needed to answer a web
query by following those activity IDs, linking them together
through parent/child relationships. This usually requires the
communication protocols to convey those activity IDs across hosts.
The reason why this information must be provided from userspace is
because it's userspace that knows where to find those activity IDs
within its application-layer communication protocols.
With tracing, taking a full trace of the activity ID span begin/end
events from all hosts allows reconstructing the activity ID
parent/child relationships, so we typically only need to extract
activity ID span begin/end information, with parent/child info, to a
tracer.
Using activity IDs from a kernel profiler is trickier, because
we do not have access to the complete span begin/end trace to
reconstruct the activity ID parent/child relationship. This is
where I suspect we'd want to introduce a notion of "activity ID
stack", so a profiler could reconstruct the currently active
stack of activity IDs for the current thread by walking that
stack.
This profiling could be triggered either from an interrupt (sampling
use-case), which would then walk the user-space data on
return-to-userspace as noted by Peter, or from a system call return to
userspace. The latter option would make it possible to correlate system
calls with their associated activity ID stacks.
The basic scenario is simple enough: a thread pushes a new
current activity ID (starts a span), possibly nests other
spans, and ends them. It all happens neatly within a single
thread.
More advanced scenarios require more thoughts:
- non-blocking communication, where a thread can hop between
different requests. Technically, it should be able to swap
its current activity ID stack as it swaps handled requests.
- green threads (userspace scheduler): the userspace scheduler
should be able to swap the activity ID stack of the current
thread when swapping between user level threads.
- task "posting" (e.g. work queues types of work dispatch):
the activity ID stacks should probably be handed over with
the work item, and set as current activity ID stack by the
worker thread.
- exceptions should be able to restore the activity ID stack
from a previously saved state.
- Interpreters, JITs. Not sure about the constraints there, we
may need support from the runtimes.
Those activity IDs are frequently updated, so going through a
system call each time would be a non-starter. This is where
thinking in terms of sharing a per-thread data structure
(populated by user-space, read by the kernel) becomes relevant.
A few words about how the rseq(2) system call could help: the
main building block of rseq is a per-thread "struct rseq" ABI,
which is registered by libc on thread creation, and is guaranteed
to be accessible at an offset from the thread pointer in
userspace, and can be accessed using the task struct "rseq" pointer
from the kernel (in contexts that can handle page faults).
Since Linux v6.3 the rseq structure has been extensible, and we're
working with GNU libc to add support for this extension scheme [1].
So even though the primary use-case for rseq was to enable per-cpu
data structures in user-space, it can be used for other purposes
where shared per-thread data is needed between kernel and userspace.
We could envision adding a new field to struct rseq which would contain
the top level pointer to the current "activity ID stack". The layout
of this stack would have to be defined as a kernel ABI. We'd want
to support push/pop of activity IDs from that stack, as well as
moving all of or portions of the activity ID stack somewhere else,
as well as saving/recovering the stack from a saved state to accommodate
the "advanced" scenarios described above (and probably other scenarios
I'm missing).
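To make that concrete, here is a purely illustrative sketch of what
such a layout might look like; none of these structures or fields exist
in the current struct rseq, and all names are hypothetical:

	/* A 128-bit activity ID, e.g. a W3C trace-context ID. */
	struct activity_id {
		__u8 id[16];
	};

	/*
	 * Userspace-allocated activity ID stack, walked by the kernel
	 * when a sample is taken. Parent entries come first; the
	 * current activity ID is the last valid entry.
	 */
	struct activity_id_stack {
		__u32 depth;		/* number of valid entries */
		__u32 capacity;		/* entries allocated by userspace */
		struct activity_id ids[];
	};

	/*
	 * Hypothetical field appended to the extensible struct rseq:
	 * a user pointer the thread can swap in O(1) when it hops
	 * between requests, green threads, or posted work items.
	 */
	__u64 activity_id_stack_ptr;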
rseq(2) could also play a role in letting the kernel expose a seed to
be used for generation of random activity IDs through yet another new
struct rseq field if this happens to be relevant. It could either be
a seed, or just a generation counter to be used to check whether the
seed needs to be regenerated after sleep/hibernate/fork/clone [2].
I'm hoping all this makes some sense, or at least highlights holes
in my understanding. Feedback is welcome!
Thanks,
Mathieu
[1] https://sourceware.org/pipermail/libc-alpha/2024-March/155587.html
[2] https://sourceware.org/pipermail/libc-alpha/2024-March/155577.html
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Fri, Apr 12, 2024 at 09:37:24AM -0700, Beau Belgrave wrote:
> > Anyway, since we typically run stuff from NMI context, accessing user
> > data is 'interesting'. As such I would really like to make this work
> > depend on the call-graph rework that pushes all the user access bits
> > into return-to-user.
>
> Cool, I assume that's the SFRAME work? Are there pointers to work I
> could look at and think about what a rebase looks like? Or do you have
> someone in mind I should work with for this?
I've been offline for a little while and still need to catch up with
things myself.
Josh was working on that when I dropped off IIRC, I'm not entirely sure
where things are at currently (and there is no way I can ever hope to
process the backlog).
Anybody know where we are with that?
On Sat, 13 Apr 2024 12:53:38 +0200
Peter Zijlstra <[email protected]> wrote:
> On Fri, Apr 12, 2024 at 09:37:24AM -0700, Beau Belgrave wrote:
>
> > > Anyway, since we typically run stuff from NMI context, accessing user
> > > data is 'interesting'. As such I would really like to make this work
> > > depend on the call-graph rework that pushes all the user access bits
> > > into return-to-user.
> >
> > Cool, I assume that's the SFRAME work? Are there pointers to work I
> > could look at and think about what a rebase looks like? Or do you have
> > someone in mind I should work with for this?
>
> I've been offline for a little while and still need to catch up with
> things myself.
>
> Josh was working on that when I dropped off IIRC, I'm not entirely sure
> where things are at currently (and there is no way I can ever hope to
> process the backlog).
>
> Anybody know where we are with that?
It's still very much on my RADAR, but with layoffs and such, my
priorities have unfortunately changed. I'm hoping to start helping out
in the near future though (in a month or two).
Josh was working on it, but I think he got pulled off onto other
priorities too :-p
-- Steve
On Sat, Apr 13, 2024 at 08:48:57AM -0400, Steven Rostedt wrote:
> On Sat, 13 Apr 2024 12:53:38 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, Apr 12, 2024 at 09:37:24AM -0700, Beau Belgrave wrote:
> >
> > > > Anyway, since we typically run stuff from NMI context, accessing user
> > > > data is 'interesting'. As such I would really like to make this work
> > > > depend on the call-graph rework that pushes all the user access bits
> > > > into return-to-user.
> > >
> > > Cool, I assume that's the SFRAME work? Are there pointers to work I
> > > could look at and think about what a rebase looks like? Or do you have
> > > someone in mind I should work with for this?
> >
> > I've been offline for a little while and still need to catch up with
> > things myself.
> >
> > Josh was working on that when I dropped off IIRC, I'm not entirely sure
> > where things are at currently (and there is no way I can ever hope to
> > process the backlog).
> >
> > Anybody know where we are with that?
>
> It's still very much on my RADAR, but with layoffs and such, my
> priorities have unfortunately changed. I'm hoping to start helping out
> in the near future though (in a month or two).
>
> Josh was working on it, but I think he got pulled off onto other
> priorities too :-p
Yeah, this is still a priority for me and I hope to get back to it over
the next few weeks (crosses fingers).
--
Josh